I Find Your Lack of Faith Disturbing

Wired magazine has a new article, Chris Anderson’s “The End of Theory,” that’s worth a read.

And a laugh or two. Yes, there’s some interesting stuff in here, but typical of Wired, it’s hyped.

> Speaking at the O’Reilly Emerging Technology Conference this past March, Peter Norvig, Google’s research director, offered an update to George Box’s maxim: “All models are wrong, and increasingly you can succeed without them.”

This is highly contingent upon your definition of “succeed”.

> This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.

Actually, they don’t. Numbers never speak for themselves.

Any social scientist will tell you that large datasets are fantastic. You can really find some interesting correlations in large populations when you have large enough datasets.

But there are big problems with large datasets, as well. For one thing, they’re never as big as you imagine them to be. There’s always data that exists outside the dataset that might actually be relevant to what you’re trying to figure out.
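
Here’s a quick toy illustration (mine, not from the article; all the data below is literally random noise) of why the numbers can’t just speak for themselves:

```python
# 200 columns of pure noise, and an outcome that none of them influence.
# With this many columns, some will still correlate "significantly".
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols = 100_000, 200

outcome = rng.standard_normal(n_rows)                 # made-up outcome
predictors = rng.standard_normal((n_rows, n_cols))    # 200 unrelated variables

# Pearson correlation of each column with the outcome.
r = (predictors - predictors.mean(0)).T @ (outcome - outcome.mean())
r /= n_rows * predictors.std(0) * outcome.std()

# Rough 5% significance cutoff for a single test: |r| > 1.96 / sqrt(n).
cutoff = 1.96 / np.sqrt(n_rows)
print("columns that look 'significant':", int((np.abs(r) > cutoff).sum()))
# Expect roughly 10 of the 200 to clear the bar -- every one of them noise.
```

The data set can’t tell you which of those hits matter; only a model of the underlying mechanism can.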

> The scientific method is built around testable hypotheses. These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years.

> Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise.

With you so far, there, Mr. Anderson.

> But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete…

Here’s my bone of contention with this article. While Chris has some interesting points in the rest of the article, and a couple of great observations, this conclusion doesn’t precisely follow… and he does a terrible job of explaining exactly how “this approach to science” has actually been superseded, and by what.

> There is now a better way. Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

Argh. No, correlation is not “enough”. Well, perhaps it is in one sense; more on that below…

Chris Anderson goes on to give an example which illustrates directly what I’m talking about here:

> Venter can tell you almost nothing about the species he found. He doesn’t know what they look like, how they live, or much of anything else about their morphology. He doesn’t even have their entire genome. All he has is a statistical blip — a unique sequence that, being unlike any other sequence in the database, must represent a new species.

> This sequence may correlate with other sequences that resemble those of species we do know more about. In that case, Venter can make some guesses about the animals — that they convert sunlight into energy in a particular way, or that they descended from a common ancestor. But besides that, he has no better model of this species than Google has of your MySpace page. It’s just data.

… and this represents… what?

If you know something exists because you mine a data set, that does indeed tell you something interesting. Something exists, and you didn’t know about it before.

But as pointed out in this very paragraph… it doesn’t tell you anything about it, other than that it exists.

Certainly, this can change how you build a model. Instead of “hypothesize, model, test”, you now have “mine, discover, [something]”… but you have to have the “[something]” step, which he doesn’t talk about at all in this article. Otherwise, all you’ve done is discover the existence of a previously unknown thing. The significance of the thing is still unknown.
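
For what it’s worth, here is roughly what the “mine, discover” part looks like in miniature. This is a toy sketch only; the sequences, the k-mer comparison, and the cutoff are all invented for illustration, and none of it is Venter’s actual pipeline:

```python
# Toy "mine, discover" step: flag a sequence as novel because it doesn't
# resemble anything in a (tiny, invented) database of known species.
def kmers(seq, k=4):
    """All length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similarity(a, b, k=4):
    """Jaccard similarity of the two sequences' k-mer sets."""
    ka, kb = kmers(a, k), kmers(b, k)
    return len(ka & kb) / len(ka | kb)

known_db = {
    "species_A": "ATGGCGTACGTTAGCATCGATCGTACGATCG",
    "species_B": "ATGCCGTTAGGCATCGGATCCTAGGCTAACG",
}

query = "TTACGGCATTAGCCGGATATTACCGGTTAAC"   # a sequence mined out of the soup

best = max(known_db, key=lambda name: similarity(query, known_db[name]))
score = similarity(query, known_db[best])

if score < 0.3:   # arbitrary cutoff
    print("novel: unlike anything in the database")
else:
    print(f"resembles {best} (similarity {score:.2f}), so we can guess a little")
# Either way, this step only tells you the thing exists, not what it means.
```

Figuring out what the thing actually is and why it matters is the “[something]” step, and that still takes a model.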

[Edited to Add]: Ed Felten’s commentary on the article.


Posted June 24, 2008 by padraic2112 in science

6 responses to “I Find Your Lack of Faith Disturbing”


  1. “All models are wrong, some models are useful”. Let’s ignore the second part and focus on the first!

    Again, I am coming at this from a different background, but doesn’t the data mining, in trying to recognize a pattern or an outlier, test against… wait for it… a model?

    I have my own personal bias, since I develop hardware models. In that system, if my data doesn’t match my model, either the system or the model is broken. So I can’t conceive of how the data analysis itself exists outside the concept of a model. Maybe it’s a definition-of-terms issue? Applied mathematicians… do they blanch at the word model?

  2. > “All models are wrong, some models are useful”. Let’s
    > ignore the second part and focus on the first!

    The first part isn’t precisely correct, anyway.

    It’s not that all models are “wrong”, it’s that they aren’t *complete*. Newtonian mechanics works perfectly fine, thank you very much, unless you’re talking about extreme gravity, extreme distance, or extreme speed (a quick set of numbers at the end of this comment makes the point). Nobody with half a grain of sense would say that Newtonian mechanics is “wrong” unless they were being sensational and trying to draw attention to something.

    > Doesn’t the data mining, in trying to recognize a pattern
    > or an outlier, test against… wait for it… a model?

    > So I can’t conceive of how the data analysis itself exists
    > outside the concept of a model. Maybe it’s a definition-of-terms
    > issue? Applied mathematicians… do they blanch at
    > the word model?

    “You keep using that word. I do not think it means what you think it means.”

    I think that’s part of what’s going on here. Data mining guys are very gung-ho about their tool, because it’s the new toy in the toybox. There are some legitimate reasons to find data mining cool: it can reveal relationships in the data, or incompleteness in the data, that we didn’t previously know existed. But that only gives us a new avenue of exploration.

    I always liked describing science as exploring a gigantic cave system. There are the big giant caverns that we know pretty well, and thousands of unexplored tunnels that branch off of those caverns and lead off into the darkness. Scientist Bob pops on a head lamp and goes off down one of the tunnels. Sometimes he comes back and says he found something interesting, and then more explorers gather up lights and cameras and workbenches and cart it all after him, and set up in the new cavern and take pictures and describe how it looks different from the caverns we already knew, etc.

    Every once in a blue moon, someone comes running in *from the outside* and starts blurting how they’ve found a WHOLE NEW system of caves that’s TOTALLY different from ANYTHING we’ve seen before. Some portion of the science community rushes out of the existing cave and goes off to explore the new cave system… but after working around in it for a while, they find a tunnel that leads to a cavern that leads… right back into the cave system where they started off.

    The analogy isn’t complete, of course (because it’s a model!), but you get the idea. Saying that “science can learn something from Google” is hugely misleading… who do you think *works* at Google? Plumbers? Does the author really think that social scientists who have been combing huge data sets since before the US started the Census don’t already know about data mining? That evolutionary biologists don’t know that their models are incomplete, and that data mining huge data sets can reveal new avenues of exploration?

    That’s not really news.
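
    To put a couple of made-up numbers on the Newtonian point above (my own toy sketch; the speeds are arbitrary picks, nothing from the article):

    ```python
    # Relativistic correction factor gamma: Newtonian momentum p = m*v is off
    # by exactly this factor (p = gamma*m*v). Speeds chosen arbitrarily.
    import math

    C = 299_792_458.0  # speed of light, m/s

    def gamma(v):
        return 1.0 / math.sqrt(1.0 - (v / C) ** 2)

    for label, v in [("airliner", 250.0), ("10% of c", 0.1 * C), ("90% of c", 0.9 * C)]:
        print(f"{label:>9}: gamma = {gamma(v):.9f}")
    # airliner: gamma = 1.000000000 -> Newton agrees to nine decimal places
    # 90% of c: gamma = 2.294157339 -> the model didn't become wrong, we just
    #           left the domain it was built for
    ```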

  3. The “all models are wrong” part is a quote meant to convey the inherent imprecision. That some would attack it as wrong is countered by the utility still gained from an imprecise model. I have to trade precision off against performance and design effort; that is the crux of the quote. So we are 100% in agreement there. Sorry for the lack of context.

    Thanks for the info. If I meet a data miner who wants to talk smack about modeling, I now know that I should just nod and smile. And I’ll leave it to someone else to discuss your fascination with caves… I don’t really wanna know. But happy spelunking.

  4. Forgive me for not having read the article (yet). Just a little too busy right now.

    But does it answer why the “correlation does not imply causation” mantra is rendered irrelevant by petabytes of data? Spurious correlations aren’t just about random fluctuations in the data.

  5. @ scott

    > Does it answer why the “correlation does not imply
    > causation” mantra is rendered irrelevant by petabytes
    > of data?

    The direct quote from the article is, “Who cares why humans do what they do? The fact is, they do it…”

    Which I think shows that people are confusing “really big data sets” with “sufficiently large data sets”.

    Knowing with some degree of statistical certainty that a book will sell well in the U.S. market (which you can certainly compute, given access to petabytes of the right sort of data) still tells you nothing about why that book is currently popular. That means you know nothing about what other sorts of societies it might be attractive to, or whether it will still be popular in five years, when whatever is causing its popularity now has changed.
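
    A minimal sketch of that failure mode, with everything here (the numbers and the “vampire novels” feature) invented for illustration:

    ```python
    # Toy model: a hidden fad drives both a visible feature and sales, so the
    # feature "predicts" sales beautifully -- until the fad goes away.
    import numpy as np

    rng = np.random.default_rng(1)

    def market(fad_strength, n=10_000):
        fad = rng.random(n) < fad_strength           # the hidden cause
        has_vampires = fad | (rng.random(n) < 0.1)   # feature dragged along by the fad
        sells_well = fad | (rng.random(n) < 0.2)     # sales driven by the fad too
        return has_vampires, sells_well

    # "Mine" last year's data: vampires correlate strongly with sales.
    x, y = market(fad_strength=0.6)
    print("P(sells | vampires), during the fad:", round(float(y[x].mean()), 2))

    # Five years later the fad is over; the mined rule quietly stops working.
    x, y = market(fad_strength=0.0)
    print("P(sells | vampires), after the fad: ", round(float(y[x].mean()), 2))
    ```

    The correlation was real, and petabytes would have measured it very precisely; it just was never the thing actually doing the work.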

  6. WTF? I’ll admit that I’m a huge fan of descriptive social science work – that is, work that describes what people do without worrying much about WHY they do it – but even with infinite data you can’t get at causality without a model to simplify the patterns. I doubt many of the sociologists and economists I know would take “Who cares why humans do what they do? The fact is, they do it…” as a sufficient explanation of the world. I can see how, in marketing, knowing that people who share characteristic X do thing Y might be enough to give you an idea of how to design/market your next big thing, but it won’t get you very far in realms like public policy. Without models you can only describe what has happened. With models you can make some predictions (however imprecise) about what might happen in the future.
