Small Data

Let's reverse our strategy for data collection.

Posted Jul 22, 2018

The Big Data bandwagon continues to pick up momentum: take advantage of all the data sources available to us via mobile devices, aerial and remote sensing, cameras, microphones, wireless sensor networks, and the like. The data are there, just waiting to be harvested so that we can spot trends and find correlations. The enormous volume of data forces us to use various forms of computer-based search and analysis, including Machine Learning. The Big Data approach is exciting because it lets us take massive amounts of information into account. It is also unsettling, because it asks us to face our insignificance and admit that the algorithms and smart machines know so much more than we ever can.

I have previously described some reasons to be uneasy about Big Data: the way Big Data analytics follow existing trends but miss the subtle yet important changes in a situation that render those trends obsolete. That essay also raised the issue of missing data. People sometimes notice that something did NOT happen, and the absence of an event helps us make sense of a situation. Big Data typically covers events that did happen and ignores events that did not occur, even though these non-occurrences can be significant.

This essay, however, is not about limitations in Big Data.

Instead, I want to suggest that we move in the opposite direction: trying to collect as little data as possible, ideally just a single data point — but a data point that swings a decision. Rather than drowning in data overload, we can sometimes find the one observation that puts ambiguous cues into focus.

Here are some examples.

1. (This example comes from Trevor Hadley, a former U.S. government analyst.) In 2015 the CIA was trying to decide whether Russia and China were going to hold joint naval exercises in the Mediterranean Sea. There were no official statements. The trends were unclear, the evidence was inconclusive. Then an outside analyst, a superforecaster, wondered what it would take to resupply a Chinese flotilla and began hunting through online purchase orders from ship chandlers in Cyprus. He found new orders, huge orders, for rice and noodles where none had existed previously. Just to be safe, he also investigated the local coast guard's Notices to Mariners and uncovered corroborating evidence. But it was the rice and noodles that did the trick. Case closed.

2. (This example also comes from Trevor Hadley.) In 2011, were the French intending to intervene in the civil war in Libya? The French denied that they were even considering such an intervention, but the intelligence community had learned not to take such denials too seriously. There were reasons to expect the French to intervene. Attempts to make a forecast failed; a prediction market wasn't helpful. Then an intelligence analyst spotted an obscure statement in a French civil service directive: a memorandum proposing modifications to life insurance regulations for members of the French military listed the countries where the French military was currently active — including Libya! The memo was pulled from the website within a few days and replaced with a version that omitted Libya, but it was too late. (Several months later the presence of French forces fighting in Libya was confirmed.) Case closed.

3. The U.S. government wanted to forecast how the U.K. would vote on Brexit. (So did many, many other countries.) The analysts pored over the polls, searching for some information that would tip the balance, but the signs just were not sufficiently clear. Then one observer noted that European Union standards would require British households to use a different method for making tea. The electric kettles currently used for boiling the water were simply too energy-inefficient, unnecessarily raising the carbon footprint. The E.U. required a more efficient device for boiling the water, but that device would take five times as long! What effect was that going to have on inviting a neighbor over for a quick cuppa? Case closed.

4. In 1990 the U.S. intelligence community was trying to forecast whether Saddam Hussein actually intended to invade Kuwait. Some felt that he was getting ready to attack. Others doubted that he would be so foolhardy; they saw his movement of 30,000 troops to the Iraq-Kuwait border as a bullying tactic intended to intimidate Kuwait into making concessions. The usual types of evidence didn't result in any conclusive judgment. The Egyptians believed that there would be a peaceful resolution of the complaints Saddam Hussein leveled against Kuwait. So did the U.S. ambassador to Iraq. And so did the Kuwaitis — even after Iraq had placed all those troops on its border, Kuwait didn't mobilize its 18,000-soldier army and allowed many soldiers to go on leave. What was Saddam Hussein going to do? One U.S. intelligence analyst, working in the Department of Energy, noted that the Iraqi military had commandeered over 10,000 civilian trucks. Removing all of these trucks from circulation was bound to have crippling effects on the Iraqi economy, disrupting all kinds of commercial activities. And the commandeering had been kept secret — it had not been publicly announced. It could not intimidate the Kuwaitis because they had no idea that it had been done. Why would Saddam Hussein do such a thing unless he had suddenly decided he needed the trucks for a military action? Case closed.

5. The Toyota runaway acceleration problem. The problem caused Toyotas to accelerate uncontrollably, despite the driver's frantic efforts to press on the brake and slow the car down. The case received national attention. Some thought the problem stemmed from thick floor mats that trapped the accelerator pedal, but the primary malfunction seemed to be a glitch in the software. Toyotas contain more than a hundred million lines of code, so some software bugs seem inevitable. Hundreds of cases of runaway acceleration were called in. Toyota was forced to pay billions of dollars in fines and settlements. However, the human factors community had a different diagnosis: the drivers were mistakenly pressing the accelerator pedal, thinking it was the brake pedal. When the car sped up rather than slowing down, the drivers perceived that the brakes had failed and that the acceleration was unintended and uncontrollable. The drivers naturally pressed the pedal harder and harder, believing it was the brake, only to see the acceleration get worse. There was no easy way to prove this explanation, and the data prompted lots of back-and-forth debate. But two killer arguments emerged. One is that by examining the black boxes in the automobiles, investigators found that the brake pedal had not been depressed in the cases of runaway acceleration. The second killer argument comes from a Malcolm Gladwell podcast in season 1 of his Revisionist History series. Gladwell arranged for the magazine Car & Driver to put a Toyota Camry through its paces on a test track. The trained drivers mashed the accelerator pedal all the way down to the floor and then, with the accelerator pedal still mashed to the floor, hit the brakes. The car stopped. Trial after trial, the car stopped. No problem, no screeching, no smoke. The brakes easily overpowered the accelerator. No need to review the statistics. No need to review the hundred million lines of code. Case closed.

These examples suggest that less is more: the quality of information matters more than the quantity.

The term “Small Data” is used in several different ways these days. There is even a marketing research book by Martin Lindstrom, Small Data: The tiny clues that uncover huge trends, and a Wikipedia entry. Here are a few attributes that I have identified regarding Small Data.

First, most of the references contrast Small Data with Big Data by asserting that Small Data is about a personal connection to a limited amount of information, whereas Big Data is about the need for smart machines to sort out the ever-expanding volume of available signals.

Second, Big Data is primarily about correlations whereas Small Data is about causal relationships.

Third, the personal connection fostered by Small Data depends on engaging a person’s expertise and experience.

Fourth, the Small Data approach is intended to foster insights (see Klein, 2013) and to transform mindsets. Bonde makes this point explicitly: Small Data is intended to help us gain insights that we can put into practice.

Fifth, just about everyone agrees that Big Data and Small Data are not mutually exclusive or in competition. We can use both approaches.

Sixth, opinions diverge about how to search for meaningful items of Small Data. Some suggest that we should start with Big Data and then reduce the output, creating logs and other artifacts. I am not enthusiastic about that strategy. Instead, I think the power of Small Data comes when we use our mental models to notice or find the critical pieces of information. The five examples in this essay all illustrate the skillful discovery of critical data, rather than the condensing of output from a Big Data exercise.

Seventh, there are times when we can support decision makers by selecting a few representative cases from a much larger population and then giving details about those cases. For example, if a politician is pondering how an increase in the price of gasoline will affect low-income people, it might be useful to describe three specific individuals: say, an elderly man on a fixed income who uses public transportation, a single mother shuttling between two or three jobs, and a retiree volunteering with a church group to drive congregants to various social, medical, and welfare-related events.

Eighth, it takes expertise to notice the critical data points once we come across them. It takes reasonably sophisticated mental models to appreciate how the data point can be put into action — to see what it affords us. 

One risk of the Small Data approach is that it can be misused to cherry-pick examples and anecdotes that convey a misleading impression. Therefore, the Small Data approach should be used in the context of existing evidence; it does not eliminate the analysts' obligation to survey the relevant variables. I wrote "Case closed" at the end of each of the five examples, but in actuality the investigators appropriately sought additional data to confirm or disconfirm their speculations. The Small Data approach can, however, curtail the tendency to accumulate more and more data merely to satisfy a compulsive need for completeness. The Small Data approach values the meaningfulness of data over its accumulation.

The examples in this essay suggest that we should re-shape our efforts to gather information. Instead of vacuuming up every available tidbit, we might do well to direct our information gathering towards sensemaking and discovery. We might search for truly diagnostic cues, for anomalies, and for missing data — expected events that didn't happen. We can be on the lookout for “differences that make a difference.”
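What makes a cue “truly diagnostic”? One way to make the idea concrete (my framing here, not anything the examples above depend on) is the Bayesian likelihood ratio: a cue is diagnostic to the extent that it is far more probable when a hypothesis is true than when it is false. The minimal sketch below, with entirely invented numbers, shows how one highly diagnostic cue (think of the rice-and-noodle orders in the first example) can swing a judgment more than a pile of weak cues:

```python
# Illustrative sketch only: treating a "diagnostic cue" as one with a high
# Bayesian likelihood ratio. All numbers below are invented for illustration.

def update(prior: float, likelihood_ratio: float) -> float:
    """Return the posterior probability after observing one cue."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)

prior = 0.30  # analyst's initial belief that the hypothesis is true

# Ten weak cues, each only slightly more likely if the hypothesis is true
# (likelihood ratio 1.2), treated as independent for simplicity.
p = prior
for _ in range(10):
    p = update(p, 1.2)
print(f"After ten weak cues:      {p:.2f}")                 # ~0.73

# One highly diagnostic cue, assumed to be 50 times more likely if the
# hypothesis is true than if it is false.
print(f"After one diagnostic cue: {update(prior, 50.0):.2f}")  # ~0.96
```

The particular numbers do not matter; the shape of the arithmetic does. Diagnosticity, not volume, is what moves the needle.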

References

Lindstrom, M. (2016). Small data: The tiny clues that uncover huge trends. New York: St. Martin's Press.

Klein, G. (2013). Seeing what others don't: The remarkable ways we gain insights. New York: PublicAffairs.