What Do Tests of Statistical Significance Reveal?
Data points in a study vary. Statistical tests check whether they vary more than chance alone would produce.
Posted Apr 10, 2014
The claim that statistics lie is nonsense. After all, statistical tests are merely a tool. Just as a gun can neither secure dinner in the woods nor injure a person in a bank until someone pulls the trigger, statistical test results cannot lie until someone uses and interprets them improperly. What follows is an explanation of when statistical tests are appropriate for interpreting data and what conclusions can be drawn from them.
Consider Oklahoma City bomber Timothy McVeigh’s execution on June 11, 2001. Were statistics required to claim that the heart-stopping potassium chloride injected into his leg caused his death four minutes later? No. The reason is that statistical tests require more than one observation. It makes no sense to say that being alive is statistically significantly different from being dead.
Statistical tests require multiple data points because the tests assess variance in the observations from a study. The reason for using statistics is to see whether the variance in those observations is outside the range you’d expect to see from chance or random variation in the data alone (1).
Say, for instance, you wanted to study road rage. You invite a participant to drive around a track with heart-rate and blood-pressure gauges attached to him in order to assess physiological changes tied to anger. As he drives, you record his baseline heart rate and blood pressure. You see that these go up and down quite naturally. Now you introduce your manipulation aimed at inducing anger. On cue, a confederate posing as another subject speeds by and cuts off the participant while changing lanes. Looking at the recordings of your participant’s heart rate and blood pressure, you see that they suddenly went up a bit. But that also happened from time to time during the baseline phase simply due to chance variation. You then run statistical tests on your observations. You use the standard in psychology for statistical testing, which allows a 5 percent chance of getting a false positive result. Thus, if you do get a positive result, you can say that a rise in heart rate and blood pressure as large as the one that came right after the manipulation would show up less than 5 percent of the time if only random fluctuations in the data were at work.
That is pretty dang nice, don’t you agree? Now you can say that there was a statistically significant rise in the data after the manipulation as compared to during the baseline phase. How else would you be able to interpret the changes you observed?
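For readers who like to see the arithmetic, here is a minimal sketch of one way such a comparison could be run. It uses a permutation test, which is my choice for illustration and not necessarily the test a road-rage study would use. The heart-rate readings are invented, the function name is hypothetical, and the sketch assumes the readings can be treated as independent of one another, which repeated measurements on one driver may not be.

```python
import random

def permutation_p_value(baseline, after, n_shuffles=10_000, seed=0):
    """One-sided permutation test (hypothetical helper for illustration).

    Asks: if the labels "baseline" and "after" were assigned at random,
    how often would the "after" readings show a rise at least as large
    as the one actually observed?
    """
    rng = random.Random(seed)
    observed = sum(after) / len(after) - sum(baseline) / len(baseline)
    pooled = list(baseline) + list(after)
    n_after = len(after)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        fake_after = pooled[:n_after]
        fake_base = pooled[n_after:]
        diff = sum(fake_after) / len(fake_after) - sum(fake_base) / len(fake_base)
        if diff >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_shuffles + 1)

# Invented heart-rate readings in beats per minute, for illustration only.
baseline = [72, 75, 71, 74, 73, 76, 72, 74, 73, 75]
after = [78, 80, 77, 81, 79, 82]

rise, p_value = permutation_p_value(baseline, after)
print(f"mean rise: {rise:.1f} bpm, p = {p_value:.4f}")
print("significant at the 5 percent level" if p_value < 0.05 else "not significant")
```

The p-value here is simply the share of random relabelings that produce a rise as big as the real one. A small share means the rise would be unusual if chance variation were the only thing going on.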
Note, however, that when the manipulation has no real effect, about one in 20 such tests will still come up positive, because chance variation coincidentally produced a rise right after the manipulation. This means that when you run the study again, you are not likely to see that same rise. But this one-in-20 false-positive rate is not some big problem. After all, scientific knowledge is probabilistic, not absolute the way mathematics can be. Investigators know that there is a 1-in-20 chance of a false positive result.
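The one-in-20 figure can be checked with a small simulation. The sketch below rests on assumptions of my own: every simulated study has no real effect at all, each observation is an independent draw from a standard normal distribution, and a simple z-test with a known standard deviation of 1 is used. The function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

def null_study_p_value(rng, n=30):
    """Simulate one study with NO real effect and return its two-sided p-value.

    Assumes independent standard-normal observations (known standard
    deviation of 1), so a simple z-test applies.
    """
    sample_mean = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
    z = sample_mean * math.sqrt(n)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def false_positive_rate(n_studies=10_000, alpha=0.05, seed=1):
    """Share of no-effect studies that still come out 'significant'."""
    rng = random.Random(seed)
    hits = sum(null_study_p_value(rng) < alpha for _ in range(n_studies))
    return hits / n_studies

rate = false_positive_rate()
print(f"false-positive rate with no real effect: {rate:.3f}")
```

Under these assumptions the printed rate should land close to 0.05, which is the 1-in-20 chance investigators knowingly accept.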
What is a very big problem, and indeed is likely to yield a higher false-positive rate, is the multiple undisclosed statistical testing that psychology investigators are permitted to do (2). One example of multiple testing is peeking prematurely at the dataset at several time points before all the data have been collected. Some investigators do this to make sure that the manipulation is working properly before getting all the way to the end of the study. However, running these multiple tests would be like challenging a friend to a tennis match where you both determined in advance that you must play 10 games to get a good sense of which player is better. You start to play but soon wonder if you are strong enough to beat your tough opponent across all 10 games. After barely winning the first game, you say, “Let’s call the winner the one who wins 2 games out of 3.” Your opponent wins the next game and is about to win the third one, and so you say, “No, the winner will be determined by who wins 3 out of 5 games.” This continues until all 10 games are played, and indeed your opponent ends up winning after all 10. You can see from this example how giving yourself multiple time points to call yourself a winner increases the likelihood of making this false positive claim.
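The cost of peeking can also be simulated. This is a rough sketch under assumptions of my own, not a reconstruction of the analysis in (2): each study has no real effect, plans for 100 independent standard-normal observations, and is tested with a z-test after every 10th observation. A study counts as a "peeking" false positive if any of those looks crosses the 5 percent threshold. All names are hypothetical.

```python
import math
import random
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def p_value(values):
    """Two-sided z-test p-value, assuming a known standard deviation of 1."""
    n = len(values)
    z = (sum(values) / n) * math.sqrt(n)
    return 2.0 * (1.0 - STANDARD_NORMAL.cdf(abs(z)))

def run_null_study(rng, n_total=100, peek_every=10, alpha=0.05):
    """Return (significant at ANY peek, significant at the planned end)."""
    data = []
    any_peek = False
    for i in range(1, n_total + 1):
        data.append(rng.gauss(0.0, 1.0))  # no real effect: true mean is 0
        if i % peek_every == 0 and p_value(data) < alpha:
            any_peek = True
    return any_peek, p_value(data) < alpha

def compare_rates(n_studies=5_000, seed=2):
    """False-positive rates with peeking versus one planned test at the end."""
    rng = random.Random(seed)
    peek_hits = 0
    end_hits = 0
    for _ in range(n_studies):
        peeked, planned = run_null_study(rng)
        peek_hits += peeked
        end_hits += planned
    return peek_hits / n_studies, end_hits / n_studies

peeking_rate, planned_rate = compare_rates()
print(f"false positives when stopping at any significant peek: {peeking_rate:.3f}")
print(f"false positives with one planned test at the end:      {planned_rate:.3f}")
```

In this setup the single planned test should stay near the agreed 5 percent, while the investigator who is free to declare victory at any peek ends up with a noticeably higher rate, just like the tennis player who keeps redefining the match.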
The upshot of this discussion is that statistics are an invaluable tool for investigators to make sense of the results from a study. But if investigators monkey around by running more statistical tests than planned for a given study and then pick which test results to report, well indeed those findings cannot be trusted. However, it makes no sense to take statistical tests away from a competent investigator because incompetent ones have misapplied them. That would be like taking a gun from a hunter who needs it to keep his village from starving because someone used a gun to injure a bank teller. He would look at you like you were crazy, just as I do when friends tell me that I shouldn’t be using statistics.
1. Winch, R. F., & Campbell, D. T. (1969). Proof? No. Evidence? Yes. The significance of tests of significance. The American Sociologist, 4, 140–143.
2. Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366.
Many thanks to my colleague at Notre Dame, Scott Maxwell, the most caring and competent statistics teacher I could ever hope to meet. I have learned so much from this wonderful person and great thinker over the past 20 years!
This post stems from Anita Kelly’s Science of Honesty project, which was made possible through the support of a grant from the John Templeton Foundation. The opinions expressed in this post are those of the author and do not necessarily reflect the views of the John Templeton Foundation.