A New Kind of Clairvoyance
Big Data can produce big insights about our future.
Posted January 31, 2017
A father walked into a Target store near Minneapolis a few years ago, clutching a handful of coupons that Target had sent to his teenage daughter promoting baby clothes, maternity wear, and cribs. “Are you trying to encourage [my daughter] to get pregnant?” the man complained to the manager.
According to a report by Charles Duhigg of the New York Times, the Target manager apologized for the embarrassing error on the spot, and even called the man to apologize a second time. There was just one problem: Target wasn’t in error after all. The high-school girl in question, unbeknownst to her parents, actually was pregnant.
Target’s marketing group had intuited that the girl was expecting because her purchase patterns had changed recently in ways that predicted — based on Target’s Big Data analytics — that she was entering her second trimester. Target’s data analysis had discovered, for example, that women who abruptly switch from buying scented to unscented lotions usually are about four months pregnant (pregnant women often dislike strong smells). So Target began mailing coupons to such women, promoting all of the things they would need when they gave birth.
Even, as in the Minnesota case, to women who were still legally children.
Other than serving as a cautionary tale about using new technology without thinking through the implications, the Target story illustrates two important concepts about human behavior.
First, one behavior (switching lotions) can reliably predict another, later action (giving birth). Another example, described in a 2013 article in the Nature journal Scientific Reports, showed that the volume of internet searches for the term “debt” provided a statistically significant prediction of near-term downturns in stock prices.
The chart below compares the volume of Wikipedia page views for the term “debt” with the Dow Jones Industrial Average. Interest in “debt” on Wikipedia does indeed have some predictive value in forecasting market downturns. Here, lookup behavior on the internet predicted selling behavior in the stock market. (Perhaps people worried about debt look up the term before they sell stock to pay it off.)
This example illustrates the second important lesson flowing from Target’s scented-lotion experience: Very high “N” (large numbers of samples), through the power of inferential statistics, can reveal subtle but consistent relationships between one human behavior and another. The “debt” analysis just presented derives from more than 200,000 Wikipedia page views.
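To make the role of a very high “N” concrete, here is a minimal Python sketch of the standard t-test for a correlation coefficient. It is an illustration only, not the analysis behind the “debt” chart: the correlation of 0.02 and the three sample sizes are made-up numbers, and the 1.96 cutoff is the usual large-sample approximation for significance at the 5 percent level.

```python
import math


def correlation_t_statistic(r: float, n: int) -> float:
    """t statistic for testing whether a Pearson correlation r,
    observed across n samples, differs from zero."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical example: a correlation so weak it explains 0.04% of the variance.
weak_r = 0.02

for n in (100, 10_000, 200_000):
    t = correlation_t_statistic(weak_r, n)
    # |t| above roughly 1.96 counts as significant at the 5 percent level
    # (a large-sample approximation, slightly generous for n = 100).
    print(f"n={n:>7}  t={t:6.2f}  significant={abs(t) > 1.96}")
```

The same faint relationship that is invisible in 100 samples becomes hard to attribute to chance in 200,000, which is why subtle but consistent behavioral links only surface at Big Data scale.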
One way to think about predictions from web-derived Big Data is that the internet, along with private data networks similar to Target’s, has thoroughly instrumented the human species, providing metrics and insights into behavior on an unprecedented scale. For instance, in addition to the massive stores of private data accumulated by Target, Walmart, Amazon, Google, and others, nearly 3.5 billion people now use the Web, leaving a variety of records of their use for Big Data analytics.
An intriguing recent example of the power that Big Data has placed in the hands of behavioral scientists is in the realm of political science. Much was made recently of the polling errors that produced so much surprise at President Trump’s election win. But to those with their noses buried deeper in Big Data, the election was no surprise at all.
Look at the relationship between the volume of Google searches (and for 2016, Wikipedia page views) for presidential candidates before elections in 2004, 2008, 2012, and 2016, and the eventual winners of each election.
In all four elections, the winner in internet search interest before the election (people Googling a candidate or checking them out on Wikipedia) also was the winner of the election. Presumably voters’ level of curiosity about a candidate is linked to their likelihood of voting for that candidate.
It’s important to observe, at this point, that Big Data correlations are far from perfect. In his book Spurious Correlations, Tyler Vigen, a Harvard Law School graduate and management consultant, illustrates a deep truth about statistics: Correlation does not prove causation.
For example, Vigen shows that there is a nearly perfect correlation between per capita margarine consumption and the divorce rate in Maine. Yet few would argue that margarine consumption causes divorce, or vice versa.
With an extremely high “N” of data sources (literally billions of different databases accessible on the Web alone), random correlations like this are not merely likely to happen; they are certain to happen.
Other “spurious” correlations that Vigen has uncovered include:
- Per capita cheese consumption and the number of people who die by getting tangled up in their bed sheets (a surprising 600+ per year).
- People who drowned after falling out of a fishing boat and marriages in Kentucky.
- Number of letters in the winning word at the Scripps National Spelling Bee and the number of deaths from venomous spider bites.
One of Vigen’s spurious correlations that caught my interest was the strong link between sales of Japanese cars in America and suicides by automobile in the U.S.
On its face, this 93.5 percent correlation between car sales and suicides seems to be the kind of artifact you’d expect when you “dip” an individual time-series pattern (e.g., yearly car sales) into an ocean of data containing everything from suicides to cheese consumption to the annual marriage rate in Kentucky: something in that ocean of data is bound, by random chance, to match that pattern.
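The “dipping” effect is easy to reproduce. The sketch below is a toy simulation, not Vigen’s method: it assumes each yearly series behaves like a short random walk of 10 points (a hypothetical stand-in for a decade of annual data), generates one target series, and then reports the strongest correlation found among a growing pool of completely unrelated series.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def random_walk(length, rng):
    """A series that drifts up and down by random steps, like many yearly statistics."""
    level, series = 0.0, []
    for _ in range(length):
        level += rng.gauss(0, 1)
        series.append(level)
    return series


def best_chance_match(n_candidates, length=10, seed=0):
    """Strongest |correlation| between one target series and
    n_candidates unrelated random series."""
    rng = random.Random(seed)
    target = random_walk(length, rng)
    return max(
        abs(pearson(target, random_walk(length, rng)))
        for _ in range(n_candidates)
    )


for n_candidates in (1, 100, 10_000):
    print(n_candidates, round(best_chance_match(n_candidates), 3))
```

Because every series here is pure noise, any strong match the search turns up is spurious by construction, and the best match can only get stronger as the pool of candidates grows.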
But the history of science is rich with examples of random discoveries that at first glance made no sense. Strong evidence for the big bang first appeared as unexplained “noise” in a telecommunication receiver. Key evidence for Einstein’s theory of general relativity was ultimately found in a weird anomaly in the slow shift of the near-point (perihelion) of the orbit of Mercury around the sun. Penicillin was discovered when Fleming observed an unexpected dead spot in a petri dish of bacteria.
Just as the law of large numbers dictates that “Big Data” analytics will uncover a plethora of random correlations, the same law also dictates that, occasionally, random observation will uncover unexpected results — like a dead spot in a petri dish — that merit a closer look.
Having worked at an American auto company during the period of Japanese ascendance in car sales, I realized that the car sales/car suicide correlation might not be so random after all. For one thing, increased sales of Japanese cars occurred as sales of American-branded cars decreased, potentially triggering depression in a demoralized American workforce.
To explore this possibility, I plotted sales of American-branded cars (blue line below) over the same period as Vigen’s analysis. The comparison hints at a plausible link between sales volume of Japanese cars and U.S. suicides.
When sales of American-branded cars rose relative to sales of Japanese cars from 2000 to 2001, suicides by car in America dipped somewhat about a year later. When American car sales started to decline in 2001, American suicides by car rose a year later, in 2002. A year after American-branded car sales started a steep decline in 2005, car-related suicides took a steep jump.
One possible reason that suicides by car in America rose after a downturn in American car sales is that such downturns put people out of work in the auto industry and the thousands of businesses that depend on the industry. A recent article in the American Journal of Preventive Medicine found that economic recessions likely do increase suicides. Drs. Webb and Kapur, writing in The Lancet Psychiatry, showed that more than 40,000 suicides per year were associated with global unemployment in 2006 and 2007 and that the 2008 recession was responsible for an additional 4,000-plus suicides in that year.
In the chart below, the brown line at the bottom represents total U.S. employment in the automotive sector. U.S. jobs did indeed evaporate as sales of Japanese cars increased.
Finally, CDC data indicate that during the 10-year decline in American-branded automobile sales, the suicide rate in America (green line below) steadily rose.
Despite the possibility of a real connection between Japanese car sales and suicides by car in the U.S., the steep decline in car suicides in 2009, when there were also big drops in both auto industry employment and Japanese car sales, suggests that the relationship between car sales, unemployment, and suicide-by-car is not simple.
It’s also worth pointing out that the number of suicides by car each year (around 100) may be too small to draw firm conclusions about links to unemployment, car sales, or anything else.
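A rough back-of-the-envelope calculation shows how much a count of about 100 can move by chance alone. The sketch below assumes the yearly count behaves like a Poisson variable (independent, rare events), which is a modeling assumption and not something the suicide statistics themselves establish.

```python
import math


def poisson_std(expected_count: float) -> float:
    """Standard deviation of a Poisson count with the given expected value."""
    return math.sqrt(expected_count)


yearly_count = 100  # the rough figure quoted above

one_year_sd = poisson_std(yearly_count)
# The difference between two independent years has sqrt(2) times the spread.
year_to_year_sd = math.sqrt(2) * one_year_sd

print(f"chance spread within one year: about +/- {one_year_sd:.0f}")
print(f"chance spread of a year-to-year change: about +/- {year_to_year_sd:.0f}")
```

Under that assumption, a swing of 10 to 15 suicides from one year to the next is about what pure chance would produce, so only much larger moves say anything about car sales or unemployment.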
Moreover, the difficulty of determining whether a given car crash really was a suicide further clouds the picture. Given that the U.S. suicide rate rose in 2009 while reported suicides by car dropped precipitously, the reliability of suicide by car statistics is suspect. Studies by Phillips and colleagues showing a spike in traffic fatalities a few days after well-publicized suicides strongly suggest that suicides by car, especially “copy-cat” suicides that quickly follow mass media reports of suicide, are significantly underreported.
Despite all of these caveats, the car sales/suicide story is worth paying attention to, because it teaches us not to dismiss unexpected Big Data correlations out of hand.
When you stop to think about it, unexpected findings — like the discovery of penicillin — have tremendous potential to be game changers, precisely because they don’t fit our current understanding of the world. So when we stumble upon the unexpected, we have an opportunity to radically change our understanding of nature ... and ourselves.
In that spirit, here’s something unexpected about the future economic outlook for America. In the chart below, the blue line shows U.S. Gross Domestic Product (GDP, an index of economic output) over the past 12 years, while the jagged red line represents the volume of Google searches for “Happy Belated Birthday.” I have purposely lagged the GDP data 6 months behind “Birthday” searches to show that there is a very high correlation (0.96) between GDP and people Googling “Happy Belated Birthday” 6 months earlier. (The correlation is almost as high for “Happy Belated” and “Funny Happy Birthday.”)
In other words, for this data set at least, the volume of birthday greeting-related searches (probably people looking for online birthday greetings) is a strong 6-month lead predictor of U.S. economic output.
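For readers who want to try this kind of lead-lag comparison themselves, here is a minimal Python sketch of a lagged correlation. The function names and the tiny data set are hypothetical: the numbers are invented so that the “GDP” series simply echoes the “search” series two steps later, and they are not the Google or GDP figures behind the chart.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def lagged_correlation(leading, lagging, lag):
    """Correlate leading[t] with lagging[t + lag].

    Both series must be sampled at the same interval; `lag` counts steps
    of that interval (e.g., lag=6 for monthly data and a 6-month lead).
    """
    if len(leading) != len(lagging):
        raise ValueError("series must have the same length")
    if lag < 0 or lag > len(leading) - 2:
        raise ValueError("lag must leave at least two overlapping points")
    return pearson(leading[: len(leading) - lag], lagging[lag:])


# Made-up data: "gdp" copies the shape of "searches" two steps later.
searches = [5, 7, 6, 9, 8, 11, 10, 13, 12, 15]
gdp = [100, 100] + [2 * s + 100 for s in searches[:-2]]

for lag in range(4):
    print(lag, round(lagged_correlation(searches, gdp, lag), 3))
```

In this toy data the correlation peaks at the lag that was built in. With real series, scanning many lags and many search terms for the best fit invites exactly the kind of chance match that Vigen’s examples warn about.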
Is this correlation spurious, like the connection between fishing boat drownings and marriages in Kentucky, or is it substantive? Intuition says that the correlation is spurious.
But I can think of ways the link might be meaningful. For instance, when people are consumed with worry about being laid off in the next six months, are they less likely to take time to send out birthday greetings? Could Google searchers, in aggregate, know more about where the economy is headed than economists? And could this awareness show up in changes in Google search behavior well ahead of economic statistics?
It’s worth pondering ... especially given that (see the far right part of the chart) searches for “Happy Belated Birthday” have recently taken a very steep plunge.
David P. Phillips, “Suicide, Motor Vehicle Fatalities, and the Mass Media: Evidence Toward a Theory of Suggestion,” American Journal of Sociology, Vol. 84, No. 5 (Mar. 1979), pp. 1150-1174
Katherine Hempstead, Ph.D., director, Robert Wood Johnson Foundation and the Center for State Health Policy at Rutgers University, Princeton, N.J.; Christine Moutier, M.D., chief medical officer, American Foundation for Suicide Prevention; Feb. 27, 2015, American Journal of Preventive Medicine
Roger T. Webb and Navneet Kapur, “Suicide, unemployment, and the effect of economic recession,” The Lancet Psychiatry, Vol. 2, No. 3 (Mar. 2015), pp. 196-197