Small Samples, Big Hopes
Mind your N’s and p’s.
Posted Jun 19, 2017
Conventius: Every experience is a sample.
Scrupulus: Of what?
When we ask ‘What does science teach us?’ or ‘What do the data tell us?’ we do well to give the notion of the convenience sample another look. In Fisher’s system of statistics, all samples are convenience samples in the sense that we do not know from which population they were drawn and hence we do not know to which population we can generalize the statistical results. This population ignorance should dampen our inductive enthusiasm. Of course, any sample is a random set of observations with regard to some population, but this is a tautology and hence useless. Some populations attributed post hoc to the drawn sample are weird and uninformative. If, for example, we do a study using the online service of Mechanical Turk, we can say that our sample is drawn from the population of Turk workers. Even if we had grounds for claiming that the sample was random, what kind of a category is “Turk workers”?
Convenience sampling is a guilty secret in science. Many of us do it, and we pass on the uncertainties it brings to the inductive leap from data to theory. We have no statistical tools to do otherwise. The effort of experimental design goes into making sure that respondents are randomly assigned to experimental and control conditions. Random sampling from a well-defined population is often beyond reach.
Another type of effort goes into selecting sample sizes. To make the selection wisely is to answer the question of statistical power. Power is the complement of the probability of a type II error (failing to detect a true effect), and power analysis comes out of the Neyman-Pearson paradigm (NPP) of statistics, not Fisher’s. The NPP asks that you know, or credibly pretend to know, the size of an effect assuming that it in fact exists. Of course, we can perform multiple power analyses and display the results as a curve, but we still need to select one particular N for our study, which reveals our bet as to which hypothesis is true. The NPP presumes that we do what we know to be most difficult: sample from known or well-imagined populations. Otherwise, there is no reason to think that over many exact replications we would find type I and type II errors with the specified probabilities.
Even if we are unable to perform a theory-based power analysis and if we can’t sample randomly, sample size matters. The larger the sample, the greater is the precision with which we are measuring whatever we are measuring. How do we determine sample size? How large is large enough in the absence of a criterion provided by power analysis? Many researchers use one hard and one soft heuristic. The hard heuristic is that they want to avoid deficit spending. Research funds and facilities impose limitations that are difficult (though not impossible – hence a ‘heuristic’) to overcome. For this reason, studies involving brain imaging have much smaller samples than survey or behavioral economics studies on MTurk. The soft heuristic is to select a sample of the size typically seen in the literature of the particular field of study. In most fields, most of the published papers report significant results, and in many fields, most studies are underpowered (Ioannidis, 2005).
How do we know whether an individual study is underpowered if there is no a priori estimation of effect size? We don't. Performing a power analysis after the fact, assuming that the observed effect size is the true one, will only justify obtained significance as having been likely to occur. We could perform a p-curve analysis across many studies (if these are available) to see if there are more significant results than we would expect given the average effect size. If there are, then it is likely that (many) more studies were conducted than published. In other words, use of the literature-N heuristic runs the risk of making nonsignificance likely, and significance – when it occurs – suspect.
In a recent set of studies in an area with traditionally small samples, the N were 6, 8, and 10. Effect sizes were not reported, only that p < .05. What would the minimum observed standardized effect size (e.g., a difference between an observed and a theoretical mean divided by the standard deviation) have to be to make it so? The answers are 1.1, .9, and .8, respectively. These effect sizes are quite large. For some areas in psychology, these effects are implausibly large (unless a great amount of aggregation has occurred first). These effects can occur, of course, as outliers in a distribution of effect sizes, since the spread of this distribution is great inasmuch as N is small. Another way to think about this is to compare two implications of different direction. If we know that the true effect is large, a small sample is more likely to show a large than a small effect. If, however, we know (from our study) that a small sample yielded a large effect, the underlying true effect can be large or small. In other words, inferring a large true effect from a large observed one is a fallible reverse inference (Krueger, 2017). This inference may be true, but don't bet the farm on it.
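As a rough check on these thresholds, the short Python sketch below converts the critical t value of a two-sided one-sample test at the .05 level into a minimum observed standardized effect size. The critical values are copied from a standard t table, and the function name and the choice between two divisor conventions are illustrative assumptions, not details taken from the studies in question. Dividing by the square root of N - 1 (a standard deviation computed with N in the denominator) gives about 1.1, .9, and .8; dividing by the square root of N gives slightly smaller values.

```python
import math

# Two-sided critical t values at alpha = .05, copied from a standard t table
# and keyed by degrees of freedom (N - 1).
T_CRIT_05 = {5: 2.571, 7: 2.365, 9: 2.262}


def min_significant_d(n, sd_divides_by_n=True):
    """Smallest observed standardized effect that reaches p < .05 (two-sided).

    sd_divides_by_n=True assumes the standard deviation was computed with N
    in the denominator, so d = t / sqrt(N - 1). With False, the usual sample
    standard deviation (N - 1 in the denominator) is assumed, so d = t / sqrt(N).
    """
    t_crit = T_CRIT_05[n - 1]
    divisor = math.sqrt(n - 1) if sd_divides_by_n else math.sqrt(n)
    return t_crit / divisor


for n in (6, 8, 10):
    print(n, round(min_significant_d(n), 2), round(min_significant_d(n, False), 2))
```

Under either convention, the observed effect has to be well above what is usually called a medium effect (.5) before a sample this small can reach significance.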
The audience – hoping to experience a reduction in uncertainty thanks to the data – is now confronted with a greater riddle: Did the researcher get lucky when seeking significance? Are the significant results true discoveries or false ones? And, are there unreported studies along the path to the reported findings? The researcher may not know the answer to the first two questions and it would be indelicate to ask the third (although this norm seems to be weakening).
The researcher may resort to the literature-N heuristic and note that in this particular field of study, effects tend to be large and that the low N reflect this fact. Running larger studies would be inefficient. How might it be possible that all (or just most) effects are large if they exist at all? I don’t think it’s possible. There are many more ways to create small effects than large ones. In a multivariate, multicausal world, some causes or effects will cancel one another out. Very large effects can only be obtained if all causes relevant to an effect push in the same direction. The all-true-effects-are-large argument assumes a bimodal distribution of effect sizes, with one mode at zero and another mode at ‘point large’ (e.g., 1.0). Null effects must be allowed lest testing be moot. A unimodal distribution around, e.g., 1.0, makes no sense because it implies that all effects we can think of are proven true once we think of them. In contrast, a unimodal distribution around 0.0 does make sense because it allows positive and negative effects and because it assumes, reasonably, that effects become rarer as they become larger (see Pleskac & Hertwig, 2014, for theory and research on the inverse relationship between probability and value).
To be able to say that in a particular field the studied effects tend to be large, a researcher must show that there is enough knowledge, from theory or experience, to locate those effects that are large if they exist. This kind of knowledge is not impossible, but it is hard to come by. And there is a paradoxical implication. If a theory is precise enough to predict where a large effect might lie, would that same theory not also tell us something about the probability of this effect indeed existing?
This question, if answered in the affirmative, creates a dilemma. On the one horn, the researcher already knows or strongly suspects that a particular treatment will create a large effect. If so, a significance test does not add much. On the other horn, the researcher merely hopes that the effect is large, but grants – with Pleskac & Hertwig – that even if the direction of the effect is right, small sizes are more likely than large ones. Then, finding p < .05 and a post-hoc power estimate of .80, as in the real-world examples above, raises the specters of luck or type I errors, both of which eat away at our confidence.
The end of p
If we follow the puzzle of small samples to its tragic conclusion, we come to the single-event sample. If this one observation lay, say, 3 standard deviations away from the average of the theoretical prediction, would we do a significance test?
We would not. Psychology is full of demonstrations where the case is made with a single compelling example. Many visual effects in particular can be established by creating an image that compels a desired perception. I am confident, for example, that you see both peanuts and the letter N in the photo above - because I do. Replication (intersubjectively shared experience) is assumed because it has worked so well in the past that we may expect low variation over individuals. This knowledge was gained from induction. Because it was so consistent, significance testing is now unnecessary.
The N = 1 (or a bit more) strategy works until it doesn't. It is easily overused. Many philosophers still believe that they can make a point with a cleverly designed thought experiment, which they create themselves to pump the desired intuition from their own minds, to then conclude that all minds yield the same result. Unfortunately, philosophers are famous for continued debate. Likewise, the idea that very-small-sample significance tests are unproblematic is easily regarded as self-evident by its practitioners and projected to other researchers.
What if the end was near but hasn't been reached?
The case of N = 1 is special because not only do we not wish to perform a significance test, but we cannot. In the one-sample t-test of the example above (and elsewhere), the sampled data serve to estimate not only the raw effect size, as the difference between the observed and the hypothetical mean, but also the variance, which then enters the estimation of the standardized effect size, the probable sampling error, and the test statistic and its probability. If N = 1, there is no variance and all the calculations that depend on it break down. If we assume - as we did above - that there is a single observation 3 standard deviations beyond the hypothetical mean, we must claim knowledge of that standard deviation on a priori grounds. That is, not only did we lose data when coming down to N = 1, we also have to spend theoretical capital to buy one more assumption (not only the hypothetical mean but also the hypothetical variance). If we cannot do this, the single observation is meaningless; it floats in empty space.
There is a categorical difference between N = 1 and N = 2. If N = 2, we can estimate the variance and compute all indices that require it. If we assume that the first observation is the same as the hypothetical mean and the second observation lies 10 points away from it, the observed mean difference is 5 and the standard deviation is 7.07. Pushing on against the advice of our statistics teachers, we find that t(1) = 1.00, p = .50. We have not rejected the null hypothesis. Indeed, we couldn't. No matter how large we make the second observation, t and p will remain the same. The increase in the variance nullifies the increase in the mean difference.
The solution is – if you can make it so – to keep the variance small while raising the mean difference. In our example, we would need, for instance, one observation 10 points and another 9 points away from the hypothesized mean in order to see significance, t(1) = 19.00, p = .033. To predict this outcome with confidence, we would not only have to be able to predict a large effect but also a small variance. This seems like a tall order. The problem gradually eases as N increases. Is it negligible by the time we reach N = 6? The answer to this question requires some math or computer simulations.
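The two-observation cases can be recomputed with a few lines of Python. This is a minimal sketch: the function name is hypothetical, observations are expressed as deviations from the hypothesized mean (so mu0 defaults to 0), and the p value is two-sided. It relies on the fact that the t distribution with one degree of freedom is the standard Cauchy distribution, which gives p a closed form.

```python
import math
import statistics


def t_test_two_obs(x1, x2, mu0=0.0):
    """One-sample t-test for exactly two distinct observations against mu0.

    Returns (t, two-sided p). Assumes x1 != x2; otherwise the estimated
    standard error is zero and t is undefined.
    """
    data = [x1, x2]
    mean_diff = statistics.mean(data) - mu0
    std_err = statistics.stdev(data) / math.sqrt(2)  # sample SD over sqrt(N)
    t = mean_diff / std_err
    # With df = 1, P(|T| > t) = 1 - (2 / pi) * atan(|t|).
    p = 1.0 - (2.0 / math.pi) * math.atan(abs(t))
    return t, p


# One observation on the hypothesized mean, the other increasingly far away.
for second in (10, 100, 1000):
    print(second, t_test_two_obs(0, second))

# Two observations close together and both far from the hypothesized mean.
print(t_test_two_obs(9, 10))
```

Comparing the first three printed lines shows how t and p respond when only the second observation grows; the last line shows what happens when the variance is kept small instead.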
The pre-hoc power estimate for a one-sample t-test with N = 6 to declare a medium effect (.5 standard units) significant is .23. For N = 8, it is .29; for N = 10, it is .35.
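These power figures can be approximately reproduced by treating the test statistic as normally distributed. The sketch below does so with Python's standard library only. The function name is hypothetical, and the normal approximation is an assumption about how such figures might be obtained, not a statement of the method actually used; an exact calculation based on the noncentral t distribution gives somewhat lower power at these sample sizes.

```python
import math
from statistics import NormalDist


def approx_power(d, n, alpha=0.05):
    """Two-sided power of a one-sample test under a normal approximation.

    d is the assumed true standardized effect size and n the sample size.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = .05
    shift = d * math.sqrt(n)            # expected value of the test statistic
    upper = 1 - z.cdf(z_crit - shift)   # rejections in the predicted direction
    lower = z.cdf(-z_crit - shift)      # rejections in the opposite direction
    return upper + lower


for n in (6, 8, 10):
    print(n, round(approx_power(0.5, n), 2))
```

Even under this somewhat generous approximation, a medium effect would be detected in only about a quarter to a third of such studies.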
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2, e124. doi: 10.1371/journal.pmed.0020124
Krueger, J. I. (2017). Reverse inference. In S. O. Lilienfeld & I. D. Waldman (Eds.), Psychological science under scrutiny: Recent challenges and proposed solutions (pp. 110-124). New York, NY: Wiley.
Pleskac, T. J., & Hertwig, R. (2014). Ecologically rational choice and the structure of the environment. Journal of Experimental Psychology: General, 143, 2000–2019.