There is a lot of discussion about the value of big data for companies. For example, Amazon matches your purchases and page views against those of other shoppers and tries to find people with similar interests. Then, Amazon suggests purchases of products those people liked under the assumption that you will like them as well.
Can big data be used to answer questions of interest to the research community in psychology? Seth Stephens-Davidowitz explored this very question in detail in his fascinating book, Everybody Lies.
What I like about Stephens-Davidowitz's book is how clear it is about both what we can learn from big data and what it is hard to use successfully for.
Big data is particularly good at addressing questions that people might otherwise be reluctant to answer on surveys. Often, the way people interact with computers reveals aspects of their interests that they would not express in an interview or even on an anonymous survey.
For example, Stephens-Davidowitz explores data related to sexual orientation. He points out that there are big regional differences in how many men report that they are gay. For example, far more men in Rhode Island identify as gay on surveys than men in Mississippi. It could be that gay men move to states that are more tolerant, but it could also be that gay men in less tolerant states are less likely to respond truthfully to surveys.
Stephens-Davidowitz used Facebook data on where men who self-identified as gay were born and where they moved. There was some tendency toward movement from less tolerant to more tolerant places. But that movement alone would not explain the large regional differences seen in surveys.
He then used data from Google, which tracks the kinds of searches people make and provides information about the locations those searches originated from. In particular, he looked at the proportion of searches for pornography specifically seeking gay-male pornography. Roughly 5 percent of all pornography searches by men were for gay-male pornography. This was true in basically every state in the U.S., regardless of how tolerant the state is. This suggests that roughly 5 percent of the male population is attracted to men and that this is true in every state.
Big data can also be used to address questions that might be hard or impossible to answer in other ways. My favorite example in the book comes from an exploration of dreams. Freud suggested that dreams may reveal unconscious sexual desires symbolically. A banana or cucumber in a dream, then, might be a stand-in for a penis.
It is hard to disprove a theory like this because the desires Freud discussed were supposed to be unconscious. That means that even if people talk about their dreams, by definition they can’t know what the dream means.
Stephens-Davidowitz took data from an app that collected descriptions of dreams from users and identified all of the foods mentioned in those descriptions. He then looked at factors that predict how often a particular food appears in dreams. How often a food is consumed turned out to be a great predictor of its appearance in dreams, as was how tasty the food is.
So, there are phallus-shaped foods in dreams, like cucumbers and bananas, but how often they appear seems to depend on how often they are eaten more than on anything else. For example, cucumbers are the seventh most popular vegetable in dreams, and they are also the seventh most popular vegetable overall. This suggests there is no reason to believe a banana in a dream is anything more than a banana.
Finally, Stephens-Davidowitz does a nice job of exploring some of the factors that can make analysis of big data unreliable. Suppose you have some complex trait, like intelligence, and you want to know if there are genetic predictors of intelligence. You might try to correlate scores on IQ tests with the genes of the people taking those tests. Now that scientists have data on gene sequences for so many people, this analysis has been done several times on several different datasets.
Every time this analysis has been done, particular genes pop out as being good predictors of IQ scores within that data set. The problem is that different genes have popped out in different analyses. This happens because, even when you have a lot of data, a large number of potential predictors (like genes) gives you many opportunities to notice a correlation that is just the result of random variation in that data set. As a result, if you hear a report that a particular gene has been found that predicts some trait like intelligence, you should treat it skeptically until it has been validated on several different sets of data.
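To make that point concrete, here is a small simulation sketch. Everything in it is an illustrative assumption rather than anything from the book: the "genes" are random 0/1 values, the "IQ scores" are random numbers unrelated to those genes, and the sample sizes and function names are made up for the example. Even so, each simulated data set produces a "best" gene with a noticeable correlation, and that gene generally fails to hold up in the other data set.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / (var_x * var_y) ** 0.5


def simulate_dataset(rng, n_people, n_genes):
    """Hypothetical data: random genes and random scores, with no true link."""
    genes = [[rng.randint(0, 1) for _ in range(n_people)] for _ in range(n_genes)]
    scores = [rng.gauss(100, 15) for _ in range(n_people)]
    return genes, scores


def correlations(genes, scores):
    """Correlation of every gene with the scores."""
    return [pearson(gene, scores) for gene in genes]


def best_gene(corrs):
    """Index of the gene with the largest absolute correlation."""
    return max(range(len(corrs)), key=lambda i: abs(corrs[i]))


rng = random.Random(0)  # fixed seed so the run is repeatable
genes_a, scores_a = simulate_dataset(rng, n_people=200, n_genes=1000)
genes_b, scores_b = simulate_dataset(rng, n_people=200, n_genes=1000)

corr_a = correlations(genes_a, scores_a)
corr_b = correlations(genes_b, scores_b)
top_a, top_b = best_gene(corr_a), best_gene(corr_b)

print(f"Data set A: best gene is #{top_a}, r = {corr_a[top_a]:.2f}")
print(f"Same gene in data set B: r = {corr_b[top_a]:.2f}")
print(f"Data set B: best gene is #{top_b}, r = {corr_b[top_b]:.2f}")
```

Because no gene in this simulation has any real effect, the apparent winner in each data set is purely a product of searching through many predictors, which is why validation on independent data matters.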
Big data will not replace the traditional ways we do psychology. Ultimately, big data provides us with opportunities to see how different aspects of the environment are related, but it cannot tell us what factors cause particular behaviors. To do that, psychology needs to continue doing the kind of experimentation that has been central to the field for the last century. But big data does have great potential to be an important tool for understanding people’s behavior.
Stephens-Davidowitz, S. (2017). Everybody lies. New York: Dey St. Publishers.