We are suspicious of the results of studies using small samples of participants, but does anyone really understand what the problem is with drawing conclusions from small sample studies? And does anyone stop to think what low sample studies (or, "low n studies") can offer over large sample studies in terms of reliability of findings?
In my research into intellectual skills training, we adopt a traditional behavioral approach associated with the Experimental Analysis of Behavior (EAB) and Applied Behavior Analysis. This is a tradition concerned with very high levels of control over behavior, and avoiding theory whenever possible.
Why low samples can beat large samples in treatment development
Behavior analysis has been supremely successful in developing treatments for every manner of behavioral and intellectual difficulty, using low n studies. My own studies also employ highly complex training techniques to change behavior in highly complex ways, but with small samples of participants. We do not normally employ Randomized Controlled Trials (RCTs) because we generally do not need them to do good science (the politics of the RCT is another matter!).
We train what we call Relational Skills, which is shorthand for arbitrarily applicable relational responding (AARR), a skill set crucial to most forms of intellectual activity. As a result of training these skills to a very high level we see very large increases in IQ scores. Simply because it is unexpected that IQ scores can be improved so much with an educational / intellectual skills intervention, our critics focus on our low sample sizes. But this betrays a dangerous and naive scientific assumption that all psychological effects are tiny, pseudo-random and only reliable if shown across huge numbers of people. Not every form of psychological science aims to establish weak, theoretically deduced effects across large samples. Many of us aim to create massive, reliable, and well controlled effects for each and every individual in our study, and so using a large sample (n) flies in the face of common sense.
Consider, for example, a study designed to teach children to tie their shoelaces. Suppose this skill is broken down into 10 component skills by the experimenters, who teach each component skill in turn to criterion. Only when one component has been mastered is the child moved on to the next. On average, let's say the shoelace tying takes 15 sessions across 10 weeks to master. There are 8 kids in the study. None of them could tie their laces before the study began, even when given ten minutes to do so. Let’s even assume that there is a control group of 8 further children who are not exposed to the intervention.
Now, following the skills intervention, all ten component skills are in place and each kid in the experimental group can tie their laces in 30 seconds without error 10 times in a row. None of the control group children can do so. The authors write up the report and submit the paper on their effective shoelace tying intervention to a mainstream journal that likes group designs and large Randomized Controlled Trials when assessing the effectiveness of interventions (such as medical treatments). The paper is rejected because the sample is too small to make it reliable. Now what does this mean? Do the reviewers assume that the effect is random? Have they not read the procedure and understood that the skill was composed of ten component skills, each of which had to be mastered in sequence, and that the skill had to be displayed 10 times in a row without error for the intervention to be considered successful?
Psychology is about behavioral control, not predicting weak and near-random effects
Many psychologists do not understand low sample research because they think of psychology as a science of weak and fleeting effects, and they assume that total behavioral control is not possible. In other words, they assume a randomness to behavior. But in the case of this hypothetical study, randomness had been systematically removed from the lace-tying behavior of the children by tight experimental procedures. There is no variance to worry about! The difference between the two groups of children is no longer due to chance; it is clearly due to good experimental control. So an inferential test is not even necessary in this case if we use common sense and laboratory and clinical judgment.
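For readers who still want a number, here is a minimal sketch of what a conventional test would say about the hypothetical shoelace study. It assumes a one-sided Fisher exact test on the 2 x 2 table (8 of 8 trained children succeed, 0 of 8 controls succeed). The function name is my own and the calculation is only an illustration, not part of any published analysis.

```python
from math import comb

def fisher_one_sided(successes_a: int, n_a: int, successes_b: int, n_b: int) -> float:
    """One-sided Fisher exact p value: the probability, with all margins fixed,
    that group A contains at least `successes_a` of the successes by chance."""
    total = n_a + n_b
    total_successes = successes_a + successes_b
    ways_to_form_group_a = comb(total, n_a)
    p_value = 0.0
    # Sum hypergeometric probabilities for tables at least as extreme as observed.
    for k in range(successes_a, min(n_a, total_successes) + 1):
        ways = comb(total_successes, k) * comb(total - total_successes, n_a - k)
        p_value += ways / ways_to_form_group_a
    return p_value

# Hypothetical study from the text: 8/8 trained children succeed, 0/8 controls do.
p_shoelace = fisher_one_sided(8, 8, 0, 8)
print(f"one-sided p value: {p_shoelace:.6f}")
```

Under these assumptions only one of the 12,870 possible ways of splitting 16 children into two groups of 8 puts every successful child in the trained group, so even the test the reviewers prefer would not treat this result as a fluke.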
Small sample studies in behavioral research report fascinating and important effects that are often overlooked because of theoretical differences between psychologists. In this hypothetical case, the journal has dismissed the findings as uninteresting because it is applying a medical-type model to the analysis, in which it is assumed that the effect of a pill is either positive or not, and that assessing its effect is like tossing a coin a sufficient number of times. But studies on behavioral control are not an odds game. They involve complete control over the behavior of all, or a very large majority, of the participants. The instances in which the behavior was not successfully changed are to be explained experimentally, rather than ignored as simply “random variation” in a massive pool of not-yet-understood data. When the controlling factors of behavior are known, the RCT loses its significance.
Not all approaches to Psychology use group designs and hypothesis testing
Psychologists need to be more mindful of the different philosophies of science within our field and how these can alter our view of what counts as an important finding. For example, behavior analysts frequently use elegant “small n” experimental designs such as multiple baseline designs. The behaviors we are trying to produce using these designs are often so complex that studies with an n of 1 to 3 subjects are still sometimes published. Consider this level of behavioral control, typical of a study in the area of Relational Frame Theory. We train participants to press perhaps 5 of 10 colored keys on a keyboard in a specific sequence, given a rule made up entirely of nonsense words. The nonsense words have no function (i.e., meaning) at the start of the experiment. By the end of function training the participant can press the correct five keys, in the correct order, on 32 trials in a row (i.e., 32 × 5 correct consecutive key presses). This is reported in a paper as a demonstration of the effectiveness of the training for establishing the behavior pattern of interest. But consider the appropriateness of asking that researcher to prove by inference that the 32 × 5 key press sequence did not occur by chance. It flies in the face of clinical and laboratory judgment and puts inference before expertise. In this case, we can see on the learning curves the clear and steady emergence of the sequence of behavior. We can even manipulate it by altering the reinforcement contingencies so that the functional relationship between the training and the behavior of interest is highly apparent. Inference adds nothing in this case, and a second participant does not make the first instance of behavioral control any more or less impressive if behavioral control was demonstrated within the first participant across various stages of the training.
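To see why the chance explanation is not worth taking seriously here, the arithmetic can be sketched in a few lines of Python. The sketch rests on assumptions that are mine rather than features of any particular experiment: the participant guesses uniformly at random, no key is pressed twice within a trial, and trials are independent.

```python
from math import log10, perm

# Assumptions (illustrative only): uniform random guessing, no key repeated
# within a trial, and independent trials.
N_KEYS = 10      # colored keys available
SEQ_LEN = 5      # keys pressed per trial, in a specific order
N_TRIALS = 32    # consecutive correct trials required

# Number of distinct ordered 5-key sequences drawn from 10 keys: 10*9*8*7*6.
n_sequences = perm(N_KEYS, SEQ_LEN)

# Probability of guessing the one correct sequence on a single trial.
p_single_trial = 1 / n_sequences

# The probability of 32 correct trials in a row is too small to read
# comfortably as a decimal, so report its base-10 logarithm instead.
log10_p_run = N_TRIALS * log10(p_single_trial)

print(f"ordered sequences per trial: {n_sequences}")
print(f"chance of one correct trial: {p_single_trial:.2e}")
print(f"chance of {N_TRIALS} correct trials in a row: about 10^{log10_p_run:.0f}")
```

Under these assumptions there are 30,240 possible sequences per trial, and the probability of producing the correct one 32 times running is on the order of 1 in 10^143. A significance test cannot tell us anything the learning curve has not already shown.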
Reasons for obtaining less or more behavioral control with another individual may be due to variables that do not apply to all individuals, and inference at the large group level will only mask this. The aim is to get control over each individual’s behavior, not necessarily using the same method in each case, only the same principles. This, in essence, is applied behavior analysis (ABA). So additional participants are never viewed as providing further evidence for an effect because the group is getting larger, but because the number of replications using an n=1 is getting larger. This is crucially different to standard group design logic, which I sometimes think of as lazy and often leading to terrible behavioral control but impressive statistical behavior on the part of the observer.
Before you criticise the sample size, understand the procedure
The level of control required by behavioral studies needs to be borne in mind when one considers them. One recent and provocative paper of mine (Cassidy, Roche and Hayes, 2011) reported large IQ gains for children exposed to our relational skills training programme. That study required children to make extended and complex response sequences. To complete the intervention they had to press one of two keys on a keyboard given each of 16 different relational probe questions in a row, without error, dozens of times over in some cases. By the time they were finished, the probability of this skill improvement arising by chance was effectively zero. We also saw large IQ rises in every case, only in the upwards direction, and all at least one standard deviation in size. And then the “power” criticisms started; the n was too low. While it may well be good practice to have large RCTs for every test of an intervention, it flies in the face of common sense to suppose that IQ scores, which are remarkably stable, could rise by more than one standard deviation by chance in every case. The control for that study was the century of research spent stabilizing IQ scores so that they cannot rise by a standard deviation by chance.
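As a rough illustration of "effectively zero", the following sketch computes the probability of passing errorless blocks of 16 two-choice probes by guessing alone. It assumes that each guess is correct with probability 0.5 and that probes are independent. The block counts in the loop are hypothetical stand-ins for "dozens of times", not figures from the paper.

```python
# Assumptions (illustrative only): two response keys per probe, a guess is
# correct with probability 0.5, and probes are independent of one another.
PROBES_PER_BLOCK = 16
P_GUESS = 0.5

def chance_of_errorless_blocks(n_blocks: int) -> float:
    """Probability of passing `n_blocks` errorless 16-probe blocks by guessing."""
    return P_GUESS ** (PROBES_PER_BLOCK * n_blocks)

p_one_block = chance_of_errorless_blocks(1)
print(f"1 block: {p_one_block:.2e}")

# Hypothetical block counts; the source says only "dozens of times over".
for n_blocks in (2, 12, 24):
    print(f"{n_blocks} blocks: {chance_of_errorless_blocks(n_blocks):.2e}")
```

A single errorless block already has a guessing probability of 1 in 65,536, and the figure shrinks geometrically with every further block.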
So it is a little naïve to assume that all effective treatments come from large n studies based on a Popperian hypothesis testing paradigm. Most of the best treatments for autism devised by the ABA community come from low n studies, precisely because their emphasis was on 100 percent precision and control and not mere probability of effectiveness.
Behavior analysts avoid theory because we are inductive in our approach to science, and as a result we eschew hypothesis testing (a method that can only arise from hypothetico-deductive reasoning). Instead, we insist on behavioral control, and we do not ignore the absence of an effect in even one participant.
Large group studies mask weak and inconsistent effects
Large n studies are designed to ignore high failure rates across individuals. For instance, a headache pill may be found to be statistically effective in a large RCT when in fact it works for only 60 percent of the users. Behavior analysts would prefer high levels of control over the behavior of every single individual, with every variation accounted for (variation is our subject matter), rather than healthy p values in our reports. So we make up in precision what we lack in p values. We also use three criteria, not just one, for an effective study outcome. We seek not just precision in our data, but also depth and scope in our explanations. This further reduces the emphasis on p and n.
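The headache pill point can be made concrete with a small sketch. Every number in it is an assumption chosen for illustration: 1,000 participants per arm, relief in 60 percent of the pill group, and a 40 percent placebo response. The test used is a pooled two-proportion z-test under the normal approximation.

```python
from math import erfc, sqrt

# Hypothetical trial: every number below is an assumption for illustration.
N_PER_ARM = 1000        # participants in each arm
RELIEF_TREATED = 600    # 60 percent of the pill group report relief
RELIEF_PLACEBO = 400    # assumed 40 percent placebo response

p_treated = RELIEF_TREATED / N_PER_ARM
p_placebo = RELIEF_PLACEBO / N_PER_ARM

# Pooled two-proportion z-test (normal approximation).
p_pooled = (RELIEF_TREATED + RELIEF_PLACEBO) / (2 * N_PER_ARM)
std_error = sqrt(p_pooled * (1 - p_pooled) * (2 / N_PER_ARM))
z = (p_treated - p_placebo) / std_error
p_value = erfc(z / sqrt(2))  # two-sided p value for a standard normal z

# People in the pill group for whom the pill did nothing.
non_responders = N_PER_ARM - RELIEF_TREATED

print(f"z = {z:.2f}, two-sided p = {p_value:.1e}")
print(f"treated participants with no relief: {non_responders} of {N_PER_ARM}")
```

With these made-up figures the p value is vanishingly small, yet 400 of the 1,000 treated participants got no benefit at all. The group statistic is healthy while the failures go unexplained.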
And let’s not forget the role of the coherence of knowledge systems. What does this count for? Our conclusions that relational skills training can improve general intellectual functioning are the result of many studies all triangulating on the same conclusion. They are not based on a single study. This is a far more reliable way to draw a conclusion about the meaning of data than relying on a single large n experiment. Any psychologist who thinks otherwise must believe in the elusive Experimentum Crucis and adopt some kind of naïve realist or even logical positivist approach to science, both of which are approaches that have exhausted their utility. Our philosophy of science is a form of pragmatism known as Functional Contextualism.
For an excellent treatise on good experimental design and low n studies see Murray Sidman’s classic text Tactics of Scientific Research.