Who needs theories when one has lots of data?

July 8th, 2008 by jose

This article poses an interesting question. Sometimes one has enough data to make accurate predictions without any understanding of what causes the phenomenon (that is, without a model). Nowadays it is getting easier and easier to assemble huge datasets, and they are often sufficient for exactly that.

For example… Google uses massive logs of misspellings to offer ‘on the fly’ spelling corrections. It built its French/English translation engine the same way, by feeding it massive corpora of bilingual texts, such as Canadian government documents, which are routinely released in both English and French versions. But there is no theory of language doing smart stuff in the background.
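To make the idea concrete, here is a toy spelling corrector in the spirit of Peter Norvig’s well-known demo: nothing but word counts from a corpus and edit distance, with no grammar or phonology anywhere. The corpus file name is just a placeholder; Google’s real system learns from query logs at a vastly larger scale.

    # Minimal statistical spelling corrector: word frequencies plus edit distance.
    # No theory of language involved, only counts from a (placeholder) corpus file.
    from collections import Counter
    import re

    WORDS = Counter(re.findall(r"[a-z]+", open("big_corpus.txt").read().lower()))

    def edits1(word):
        """All strings one edit (delete, transpose, replace, insert) away from `word`."""
        letters = "abcdefghijklmnopqrstuvwxyz"
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def correct(word):
        """Pick the most frequent known word among the closest candidates."""
        candidates = ({word} & WORDS.keys()) or (edits1(word) & WORDS.keys()) or {word}
        return max(candidates, key=lambda w: WORDS[w])

    print(correct("speling"))  # -> "spelling", if the corpus has seen it often enough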

So are theories redundant, or obsolete, in a world where one can make accurate predictions without them?

Wired’s own Chris Anderson explores the idea:

Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.

Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show.

The point here is that statistics can find patterns in basically any area, so maybe we don’t need a specific science to take care of each of those problems.

There are issues with this line of thinking. Of course, correlation doesn’t imply causation, so relying on correlation alone leaves us blind to cause-and-effect relationships:

Google’s founding philosophy is that we don’t know why this page is better than that one: If the statistics of incoming links say it is, that’s good enough. No semantic or causal analysis is required.
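To see what ‘the statistics of incoming links’ looks like in practice, here is a toy ranking sketch in the style of PageRank’s power iteration. The link graph is invented and real ranking systems are far more elaborate; the point is simply that no page content is ever inspected.

    # Toy PageRank-style ranking by incoming links via power iteration.
    # The link graph is made up; no page content is ever looked at.
    links = {
        "a": ["b", "c"],   # page "a" links to "b" and "c"
        "b": ["c"],
        "c": ["a"],
        "d": ["c"],
    }
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    damping = 0.85

    for _ in range(50):  # iterate until the scores settle
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += share
        rank = new_rank

    print(sorted(rank.items(), key=lambda kv: -kv[1]))  # "c" wins: most incoming links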

Comments by Deepak:

We all know that more data means new approaches to science, especially since this has happened so quickly.

We’ve always worked with partial understanding, or in the case of medicine, less than partial understanding, but that’s precisely why medicine is beginning to fail. Not knowing mechanisms, etc., is what results in a VIOXX. Not knowing why is what creates the next disaster.

Trying to solve the exact same problems as Google, there is a camp that does think that knowing ‘why’ is important: the semantic web proponents. Under this paradigm, the web becomes a huge ontology, and machines operate on propositions (RDF triples) to deduce new knowledge. In this case, you do know how the machine reached a certain conclusion. They face the same huge datasets (imagine operating on ‘the entire web’ at some point; not now, since only a small fraction of sites use RDF at all), but instead of using the raw content that is prepared for human consumption, they will use machine-ready content.
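To make the contrast concrete, here is a minimal sketch of working with RDF triples using the Python rdflib package; the tiny ontology and the facts in it are invented for illustration.

    # A minimal sketch of the "machine-ready content" idea using RDF triples.
    # Requires the rdflib package (pip install rdflib); the facts are invented.
    from rdflib import Graph, Namespace, RDF, RDFS

    EX = Namespace("http://example.org/")
    g = Graph()

    # Assert a few triples: a subclass relation and one individual.
    g.add((EX.Enzyme, RDFS.subClassOf, EX.Protein))
    g.add((EX.trypsin, RDF.type, EX.Enzyme))
    g.add((EX.trypsin, EX.cleaves, EX.peptide_bond))

    # A SPARQL query that walks the subclass hierarchy: trypsin turns out to be
    # a Protein even though that fact was never stated directly.
    results = g.query("""
        PREFIX ex: <http://example.org/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?x WHERE { ?x a/rdfs:subClassOf* ex:Protein . }
    """)
    for row in results:
        print(row.x)  # -> http://example.org/trypsin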

If, after plowing through petabytes of data, a semantic search engine reaches an interesting conclusion, at least it can show us the logical path it followed. The promise for pharmaceutical companies is that they could find new drugs and interactions just by letting the algorithms traverse a corpus of, say, proteins. But, again, in this case there is no ‘human’ postulating a theory either.
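Here is a toy illustration of that traceable ‘logical path’: a breadth-first search over a handful of invented triples that keeps the chain of facts it used to link a drug to a disease.

    # Toy traceable inference: breadth-first search over invented triples,
    # keeping the chain of facts that connects a drug to a disease.
    from collections import deque

    triples = [
        ("drug_X", "inhibits", "protein_A"),
        ("protein_A", "activates", "protein_B"),
        ("protein_B", "implicated_in", "disease_Y"),
    ]

    def explain(start, goal):
        """Return the chain of triples connecting `start` to `goal`, if any."""
        queue = deque([(start, [])])
        seen = {start}
        while queue:
            node, path = queue.popleft()
            if node == goal:
                return path
            for s, p, o in triples:
                if s == node and o not in seen:
                    seen.add(o)
                    queue.append((o, path + [(s, p, o)]))
        return None

    for step in explain("drug_X", "disease_Y"):
        print(" -> ".join(step))
    # drug_X -> inhibits -> protein_A
    # protein_A -> activates -> protein_B
    # protein_B -> implicated_in -> disease_Y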

Probably, what all this means is that we scientists will need to adapt our methods to collaborate with these smart machines. Some things, like deep search, are better left to them; others, like tagging images, are really hard for machines but trivial for humans.



4 Responses to “Who needs theories when one has lots of data?”

  1. Zdenek Says:

    The US financial meltdown shows the dangers of relying on data without a theory. The data showed that real estate values had never before declined in all major regions of the US at the same time. Wall Street fed enormous amounts of data into empirically driven models to show that CDOs (now known as toxic waste) were low risk. They lacked a theory that markets go down when a bubble bursts … the price of that mistake: one trillion USD.

  2. drtaxsacto Says:

    Hayek wrote in The Counter-Revolution of Science that the danger of having lots of data is that people begin to believe the numbers. The audacious arguments about the Petabyte economy have two potential flaws. First, there is an assumption that the algorithms involved in the net will produce intelligent understanding of the dynamics in the system. That is a highly optimistic view of meta-systems and a very low view of the ability of people to infer meaning from even minimal data. Second, there is an assumption that the data collected is actually accurate. As an economist who occasionally struggles with data on what is going on in sectors of the economy – that is a dubious premise.

  3. London Photographer Says:

    But when do theories actually start driving behaviour, and therefore trends in the data?! For example, is Google’s theory that the importance of a site should be measured by the number of backlinks based on data, or is it in fact driving mass backlinking behaviour that produces data backing up the theory?

  4. John Hunter Says:

    This is an interesting area to examine. I agree that the new tools allow for new strategies. However, I do not believe this will replace the scientific method. Those that use these new tools well will be able to find interesting correlations which can then lead to new insight – by exploring what is going on that they could not have even noticed before.
