
I know what you’re thinking. Here comes another 500 words of blather about a concept that has been so worked over it’s become nearly meaningless. Don’t worry; I’m not going to pretend that I have anything useful to say about big data. Instead I’m going to summarize the ideas of someone who I think does.
On May 11, Jonathan Levin and Liran Einav of Stanford University published a National Bureau of Economic Research (NBER) working paper called “The Data Revolution and Economic Analysis.” The paper is about how economists can use big data, but it begins with an excellent discussion of the nature of big data in general. Specifically, Levin and Einav outline four characteristics of big data that differentiate it from data available in the past:
- Data is available in real time.
- Data is available at larger scale.
- Data is available on novel types of variables.
- Data comes with less structure.
Levin and Einav acknowledge that the biggest application of big data is, of course, predictive modeling, à la Netflix, Amazon, Google and Apple. They note that the last couple of decades have seen “a remarkable amount of work on the statistical and machine learning techniques that underlie these applications, such as Lasso and Ridge regressions and classification models.” As a recent graduate, I can tell you that “machine learning” is one of the hottest subjects in school, with the kind of cachet I imagine “aerospace engineering” had in the 1960s.
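The paper doesn’t walk through these techniques, but if you’ve never seen them, here is a minimal sketch of Lasso and Ridge regression using scikit-learn. The synthetic data and penalty values are my own illustration, not anything from Levin and Einav; the point is just that both methods shrink coefficients, and Lasso can push the irrelevant ones all the way to zero.

```python
# Toy illustration of Ridge vs. Lasso regression (my own example, not from the paper).
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                   # 200 observations, 10 candidate predictors
true_coefs = np.array([3.0, -2.0] + [0.0] * 8)   # only the first two predictors actually matter
y = X @ true_coefs + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)   # shrinks every coefficient toward zero
lasso = Lasso(alpha=0.1).fit(X, y)   # drives the irrelevant coefficients to exactly zero

print("Ridge coefficients:", np.round(ridge.coef_, 2))
print("Lasso coefficients:", np.round(lasso.coef_, 2))
```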
Levin and Einav also discuss the danger of overfitting, one of the problems most emphasized by Nate Silver in his book The Signal and the Noise. (Disclaimer: I didn’t read the book; I just read Silver’s Reddit Ask Me Anything. You should too.) Roughly, overfitting means identifying a trend that isn’t really there, and it becomes much easier to do as the number of variables grows relative to the number of observations. In the past, most data sets had a large number of observations and a small number of variables: for example, a 10-question survey with 500 respondents. Now, because computers can log almost every keystroke and capture detailed consumer histories, it’s possible to encounter data sets with more variables than observations, as when you have a detailed purchase history and an exhaustive demographic profile for only a handful of customers.
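To make the more-variables-than-observations problem concrete, here is a toy example of my own (not from the paper or from Silver): with 50 variables and only 30 observations, ordinary least squares can fit pure noise perfectly in-sample and still predict nothing out of sample.

```python
# Overfitting when the number of variables exceeds the number of observations.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_train = rng.normal(size=(30, 50))   # 30 customers, 50 recorded variables
y_train = rng.normal(size=30)         # the outcome is pure noise -- there is no real trend
X_test = rng.normal(size=(1000, 50))
y_test = rng.normal(size=1000)

ols = LinearRegression().fit(X_train, y_train)
print("Train R^2:", ols.score(X_train, y_train))  # essentially 1.0: a "trend" that isn't there
print("Test  R^2:", ols.score(X_test, y_test))    # near zero or negative: the fit doesn't generalize
```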
Levin and Einav go on to discuss several potential applications for academic economics, of which I think the most interesting is the ability to measure inflation and unemployment with retail and payroll data instead of relying on survey data, as the Bureau of Labor Statistics currently does to construct the Consumer Price Index and the unemployment rate. They also mention a couple of potential applications for financial services, including fraud prevention and credit information. I know I’m keeping my eye on ZestFinance.
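The paper doesn’t spell out how such an index would actually be built, but the basic idea is simple enough to sketch. Below is a toy Laspeyres-style price index computed from made-up scanner data; the product table and the pandas layout are my own assumptions for illustration, not the paper’s method or the BLS methodology.

```python
# Toy Laspeyres-style price index from retail transaction data (made-up numbers).
import pandas as pd

prices = pd.DataFrame({
    "product": ["milk", "bread", "gasoline"],
    "base_price": [3.50, 2.00, 3.60],       # prices in the base period
    "current_price": [3.70, 2.10, 3.40],    # prices observed in current retail data
    "base_quantity": [100, 80, 500],        # quantities sold in the base period
})

# Cost of the base-period basket at current prices, relative to its cost at base prices.
base_cost = (prices["base_price"] * prices["base_quantity"]).sum()
current_cost = (prices["current_price"] * prices["base_quantity"]).sum()
index = 100 * current_cost / base_cost
print(f"Price index (base = 100): {index:.1f}")
```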
The primary conclusion of the paper isn’t really novel: big data is a big opportunity, but there are big challenges in figuring out how to analyze it and apply it effectively. If you have time, it’s worth reading the paper for yourself. You can download it for free from Levin’s website. (A little secret about economics: unlike many other academic disciplines, most economics journals do not prohibit authors from making their papers available elsewhere, so if you ever find yourself blocked by a subscription requirement, look for the same paper on the author’s personal website.)