Wired magazine has a new article by Chris Anderson that’s worth a read.
And a laugh or two. Yes, there’s some interesting stuff in here, but, typical of Wired, it’s hyped.
Speaking at the O’Reilly Emerging Technology Conference this past March, Peter Norvig, Google’s research director, offered an update to George Box’s maxim: “All models are wrong, and increasingly you can succeed without them.”
This is highly contingent upon your definition of “succeed”.
This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.
Actually, they don’t. Numbers never speak for themselves.
Any social scientist will tell you that large datasets are fantastic: when the dataset is big enough, you really can find some interesting correlations across whole populations.
But there are big problems with large datasets as well. For one thing, they’re never as big as you imagine them to be. There’s always data outside the dataset that might actually be relevant to what you’re trying to figure out.
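To make that concrete, here’s a toy sketch (mine, not anything from the article): a variable that never made it into your dataset can manufacture an impressive-looking correlation between two things that have nothing to do with each other. The variables and numbers are invented purely for illustration.

```python
# Toy illustration (not from the Wired article): a hidden confounder that
# lives "outside the dataset" makes two otherwise unrelated variables
# look strongly related.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

z = rng.normal(size=n)                   # the confounder -- never collected
x = z + rng.normal(scale=0.5, size=n)    # driven by z, plus noise
y = z + rng.normal(scale=0.5, size=n)    # also driven by z, plus noise

# Our "large dataset" contains only x and y. The correlation looks impressive...
print(f"corr(x, y) = {np.corrcoef(x, y)[0, 1]:.2f}")                  # about 0.8

# ...but once the missing variable is accounted for, it evaporates.
print(f"corr(x - z, y - z) = {np.corrcoef(x - z, y - z)[0, 1]:.2f}")  # about 0.0
```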
The scientific method is built around testable hypotheses. These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years.
Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise.
With you so far, there, Mr. Anderson.
But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete…
Here’s my bone of contention with this article. Chris makes some interesting points in the rest of the piece, and a couple of great observations, but this conclusion doesn’t precisely follow from them… and he does a terrible job of explaining exactly how “this approach to science” has actually been superseded, and by what.
There is now a better way. Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
Argh. No, correlation is not “enough”. Well, perhaps it is in one narrow sense; more on that below…
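Here’s the narrow sense, and the problem with it, in one toy sketch (again mine, not Anderson’s; the sizes are arbitrary): point a pattern-finder at pure noise and it will still hand you “patterns”. Scan enough variables and some pair will correlate strongly by sheer chance.

```python
# Toy sketch: let a "statistical algorithm" loose on pure noise and it
# still finds something that looks like a pattern.
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_vars = 100, 1_000

data = rng.normal(size=(n_samples, n_vars))   # nothing but noise
corr = np.corrcoef(data, rowvar=False)        # every pairwise correlation

np.fill_diagonal(corr, 0.0)                   # ignore self-correlations
i, j = np.unravel_index(np.abs(corr).argmax(), corr.shape)
print(f"strongest 'pattern': variable {i} vs variable {j}, r = {corr[i, j]:.2f}")
# Typically |r| > 0.4 -- strong enough to tell a story about, and meaningless
# by construction.
```

Correlation is “enough” to point at something; it isn’t enough to tell you whether that something means anything.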
Chris Anderson goes on to give an example that directly illustrates what I’m talking about here:
Venter can tell you almost nothing about the species he found. He doesn’t know what they look like, how they live, or much of anything else about their morphology. He doesn’t even have their entire genome. All he has is a statistical blip — a unique sequence that, being unlike any other sequence in the database, must represent a new species.
This sequence may correlate with other sequences that resemble those of species we do know more about. In that case, Venter can make some guesses about the animals — that they convert sunlight into energy in a particular way, or that they descended from a common ancestor. But besides that, he has no better model of this species than Google has of your MySpace page. It’s just data.
… and this represents… what?
If you know something exists because you mined a dataset, that does indeed tell you something interesting: something exists, and you didn’t know about it before.
But as pointed out in this very paragraph… it doesn’t tell you anything about that something, other than that it exists.
Certainly, this can change how you build a model. Instead of “hypothesize, model, test”, you now have “mine, discover, [something]”… but you have to have the “[something]” step, which he doesn’t talk about at all in this article. Otherwise, all you’ve done is discover the existence of a previously unknown thing. The significance of that thing is still unknown.
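For what it’s worth, here’s a hypothetical sketch of what that “mine, discover, [something]” pipeline looks like when you write it down. The reference database, the similarity measure, and the threshold are all invented for the example; the point is that the mining step can tell you something new exists, and then it simply stops.

```python
# Hypothetical sketch of "mine, discover, [something]". Everything here --
# the sequences, the similarity measure, the 0.5 threshold -- is made up
# for illustration.
from difflib import SequenceMatcher

known_sequences = {                 # stand-in for a reference database
    "species_A": "ACGTACGTACGTAAGGCT",
    "species_B": "TTGACCGTAGGCTAACGT",
}
mined_sequences = [
    "ACGTACGTACGTAAGGCA",           # near-match to species_A
    "GGGCCCTTTAAAGGGCCC",           # unlike anything in the database
]

def best_match(query):
    """Return the closest known sequence and its similarity score."""
    scored = ((name, SequenceMatcher(None, query, ref).ratio())
              for name, ref in known_sequences.items())
    return max(scored, key=lambda item: item[1])

for seq in mined_sequences:
    name, score = best_match(seq)
    if score < 0.5:                 # arbitrary novelty threshold
        # "Discover": a statistical blip unlike anything we know about.
        print(f"{seq}: no close match -- something new exists")
        # ...and this is where the pipeline stops. What it is, how it lives,
        # why it matters -- the "[something]" step -- is not in the data.
    else:
        print(f"{seq}: resembles {name} (similarity {score:.2f}), so we can guess by analogy")
```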
[Edited to Add]: Ed Felten’s commentary on the article.