Where Operant Conditioning Went Wrong
Why did Skinner's innovations stall?
Posted Jul 15, 2016
Operant conditioning is B. F. Skinner’s name for instrumental learning: learning by consequences. Not a new idea, of course. Humanity has always known how to teach children and animals by means of reward and punishment. What gave Skinner’s label the edge was his invention of a brilliant method of studying this kind of learning in individual organisms. The Skinner box and the cumulative recorder were an unbeatable duo.
Operant conditioning advanced rapidly at first. The discovery of schedules of reinforcement revealed unsuspected regularities. Each new reinforcement schedule yielded a new pattern of cumulative record: the fixed-interval “scallop”, steady responding on variable-interval, and break-and-run on fixed-ratio schedules. The patterns were reliable and could be recovered after the organism was switched to a new procedure. The data allowed full exploitation of the within-organism experimental method: comparing the behavior of a single animal reversibly exposed to two different procedures, rather than comparing two groups of animals. Group results apply to groups; they may or may not apply to the individuals that make up a group. In 2016, 52% of Britons approved of Brexit; but each individual was either 100% for or 100% against. All too often researchers assumed that group data showing a smooth learning curve meant that individual subjects also learn gradually. They do not.
The natural next step would have been to unravel the processes behind the order revealed by cumulative records. What is going on in this interaction between the schedule procedure and the individual organism that gives rise to these striking regularities? In other words, what is the organism learning and how is it learning? What is the process?
The field did not take this step. In this note I will try to explain why.
Three things have prevented operant conditioning from developing as a science: a limitation of the method, over-valuing order and distrust of theory.
The method. The cumulative record was a fantastic breakthrough in one respect: it allowed the behavior of a single animal to be studied in real time. Until Skinner, the data of animal psychology consisted largely of group averages – how many animals in group X or Y turned left vs. right in a maze, for example. Not only were individual animals lost in the group, so were the actual times – how long did the rat in the maze take to decide, how fast did it run? What did it explore before deciding?
But the Skinner-box setup is also limited – to one or a few pre-defined responses and to changes in their rate of occurrence. Operant conditioning in fact involves selection from a repertoire of activities: the trial bit of trial-and-error. The Skinner-box method encourages the study of just one or two already-learned responses. Of the repertoire, that set of possible responses emitted (in Skinner’s words) “for other reasons” – of all those possible modes of behavior lurking below threshold but available to be selected – of those covert responses, so essential to instrumental learning, there is no mention.
Too much order? The second problem is an unexamined respect for orderly data: smooth curves that might measure simple, atheoretical properties of behavior. Fred Skinner frequently quoted Pavlov: “control your conditions and you will see order.” But order in what? Is just any order worth getting? Or are some orderly results perhaps more informative than others?
The easiest way to get order, to reduce variation, is to take an average. Skinnerian experiments involve single animals, so the method discourages averaging across animals. But why not average all those pecks or lever presses? Skinner himself seemed to provide a rationale. In one of his few theoretical excursions, he proposed that responses have a strength equivalent to probability of response. He never really justified the idea, but it is so plausible that little justification seems to be required.
The next step was crucial: how to measure response probability? Rate of response is an obvious candidate. But cumulative records show that response rate varies from moment to moment on most schedules of reinforcement. On fixed-interval, for example, subjects quit responding right after each reinforcement and then slowly accelerate to a maximum as the time for the next reinforcement approaches. A fixed-interval schedule (FI) arranges that the first response after a fixed time, call it I, is reinforced. Post-reinforcement time is a reliable cue to when the next reward will be available. Organisms adapt accordingly, waiting a fixed fraction of time I before beginning to respond.
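The FI contingency and the typical adaptation to it can be sketched in a few lines. This is a hypothetical simulation, not any standard package: the subject pauses for a fixed fraction of the interval after each reward, then responds steadily, and the schedule reinforces the first response after time I has elapsed. The names and parameter values are my own assumptions.

```python
import random

FI = 60.0     # the fixed interval I, in seconds
WAIT = 0.5    # assumed: subject waits this fraction of I after each reward
RATE = 1.0    # assumed: responses per second once responding resumes

def run_session(duration=3600.0):
    """Return the number of rewards in one simulated FI session.

    FI contingency: the first response at least FI seconds after the
    last reinforcement is reinforced. The simulated subject shows the
    post-reinforcement pause, then responds at a steady rate.
    """
    t = 0.0
    last_food = 0.0
    rewards = 0
    while t < duration:
        t += random.expovariate(RATE)     # time of the next candidate response
        if t - last_food < WAIT * FI:
            continue                      # post-reinforcement pause: no responding
        if t - last_food >= FI:           # interval has elapsed: food is armed
            rewards += 1
            last_food = t
    return rewards
```

Because the subject begins responding well before the interval elapses, almost every armed reward is collected promptly, so a one-hour session yields close to the programmed maximum of 60 rewards.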
But on another schedule, variable-interval (VI), the time is variable. If it is completely random from moment to moment and the organism responds at a steady rate, post-reinforcement time gives no information about the likelihood that the next response will be rewarded. Organisms adapt to the lack of information by responding at an unvarying rate on variable-interval schedules. This property of VI made it an obvious tool. The steady response rate it produces seemed to provide a simple way to measure Skinner’s response strength. Hence, the most widely used datum in operant psychology is the response rate sustained by a VI schedule. Rate is usually measured by the number of responses that occur over a time period of minutes or hours.
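The "no information" property comes from memorylessness. In a random-interval version of VI, the schedule arms food with a fixed probability at each moment, so the expected wait to the next armed reward is the same no matter how much time has passed since the last one. A hypothetical sketch (the names and the 30-s mean are my assumptions):

```python
import random

P_ARM = 1 / 30   # assumed per-second arming probability -> mean interval ~30 s

def seconds_until_armed():
    """Seconds until the random-interval schedule arms the next reward.

    Arming is a constant-probability (memoryless) process, so
    post-reinforcement time predicts nothing about when the next
    response will pay off.
    """
    t = 0
    while random.random() > P_ARM:
        t += 1
    return t

# Geometric waits with mean (1 - p) / p, i.e. about 29 s here,
# whether measured from reinforcement or from any later moment.
waits = [seconds_until_armed() for _ in range(100_000)]
```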
Another way to reduce variability is negative feedback. A thermostatically controlled HVAC system heats when inside temperature falls below a preset level, and cools when it rises above. In this way it reduces the variation in house temperature that would otherwise occur as outside temperature varies. Any kind of negative feedback will reduce variation in the controlled variable. Unfortunately, the more effective the feedback, the less the variation in the dependent variable and the less we can learn about the feedback mechanism itself. A perfect negative feedback process is invisible.
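The thermostat logic is worth making explicit, because it shows exactly why tight feedback hides its own mechanism. A minimal sketch, with hypothetical parameter names and values of my own choosing:

```python
def thermostat_step(inside, outside, setpoint=20.0, gain=0.5, leak=0.1):
    """One time step of a thermostatic negative-feedback loop.

    The controller pushes inside temperature toward the setpoint
    (heating when too cold, cooling when too warm); `leak` models
    heat exchange with the outside air.
    """
    error = setpoint - inside
    control = gain * error                 # negative feedback on the error
    drift = leak * (outside - inside)      # disturbance from outside
    return inside + control + drift

# The higher the gain, the less the outside temperature shows up in
# the record of inside temperature -- and the less that record reveals
# about the controller. Perfect control would be invisible.
temp = 15.0
for _ in range(50):
    temp = thermostat_step(temp, outside=5.0)
```

After a few dozen steps the inside temperature settles near the setpoint (at 17.5 with these leaky parameters) and stays flat, however the outside temperature behaves.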
Operant conditioning, by definition, involves feedback since reward received depends on responses made. The more the organism responds, the more reward it gets – subject to the constraints of whatever reinforcement schedule is in effect. This is positive feedback. But the most-studied operant choice procedure – concurrent variable-interval schedule – also involves negative feedback. When the choice is between two variable-interval schedules, the more time is spent on one choice the higher the payoff probability for switching to the other. So no matter the difference in payoff rates for the choices, the organism will never just fixate on one. The result is a very regular relation between choice preference and relative payoff -- the matching law. (For the full technical story, check out Adaptive Behavior and Learning, 2016)
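The negative-feedback property of concurrent VI can be stated exactly. If a neglected random-interval key arms with per-second probability p, then the chance that a switch to it pays off after t seconds away is 1 - (1 - p)^t, which grows toward 1 the longer the key is ignored. A short sketch (function name and the 15-s mean are my assumptions):

```python
def p_switch_pays(p, t):
    """Chance a neglected random-interval key is armed after t seconds away."""
    return 1 - (1 - p) ** t

# Payoff probability for switching rises the longer one key is neglected,
# which is why the organism never fixates on a single choice:
probs = [p_switch_pays(1 / 15, t) for t in (1, 5, 15, 60)]
```

This rising switch payoff is the feedback that stabilizes preference and produces the orderly matching relation.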
As technology advanced, these two things converged: the desire for order, enabled by averaging and negative feedback, and Skinner’s idea that response probability is an appropriate – the appropriate – dependent variable. Variable-interval schedules, either singly or in two-choice situations, became a kind of measuring device. Response rate on VI is steady – no waits, pauses or sudden spikes. It seemed to offer a simple and direct way to measure response probability. From response rate as response probability to the theoretical idea of rate as somehow equivalent to response strength was but a short step. The matching law thus came to be regarded as a general principle. Researchers began to see it as underlying not just animal choice but the choice behavior of human beings in real-life situations.
Theory. Response strength is a theoretical construct. It goes well beyond response rate or indeed any other directly measurable quantity. Unfortunately, most people think they know what they mean by “strength”. The Skinnerian tradition made it difficult to see that more is needed.
A landmark 1961 study by George Reynolds illustrates the problem (although George never saw it in this way). Here is a simplified version: Imagine two experimental conditions and two identical pigeons. Each condition runs for several daily sessions. In Condition A, pigeon A pecks a red key for food reward delivered on a VI 30-s schedule. In Condition B, pigeon B pecks a green key for food reward delivered on a VI 15-s schedule. Because both food rates are relatively high, after lengthy exposure to the procedure, the pigeons will be pecking at a high rate in both cases: response rates – hence ‘strengths’ – will be roughly the same. Now change the procedure for both pigeons. Instead of a single schedule, two schedules alternate, for a minute or so each, across a one-hour experimental session. The added, second schedule is the same for both pigeons: VI 15 s, signaled by a yellow key (alternating two signaled schedules in this way is called a multiple schedule). Thus, pigeon A is on a mult VI 30 VI 15 (red and yellow stimuli) and pigeon B on a mult VI 15 VI 15 (green and yellow stimuli). In summary, the two experimental conditions are (stimulus colors in parentheses):
Experiment A: VI 30 (Red); mult VI 30 (Red) VI 15 (Yellow)
Experiment B: VI 15 (Green); mult VI 15 (Green) VI 15 (Yellow)
Now look at the second condition for each pigeon. Unsurprisingly, B’s response rate in green will not change. All that has changed for him is the key color – from green all the time to green and yellow alternating, both with the same payoff. But A’s response rate in red, the VI 30 stimulus, will be much depressed, and response rate in yellow for A will be considerably higher than B’s yellow response rate, even though the VI 15-s schedule is the same in both. The effect on responding in the yellow stimulus by pigeon A, an increase in response rate when a given schedule is alternated with a leaner one, is called positive behavioral contrast and the rate decrease in the leaner schedule for pigeon A is negative contrast.
Responding by A and B in the presence of the red and green stimuli in the first condition is much the same, and so, presumably, should be the strength of the two responses. But the very different effect on the two animals of adding the alternative yellow stimulus, paid off on the richer schedule, in the second condition shows that it is not.
The consensus that response rate is an adequate measure of the ‘strength’ of an operant response is wrong. The steady rate maintained by VI schedules is misleading. It looks like a simple measure of strength. Because of Skinner’s emphasis on order, because the averaged-response and feedback-rich concurrent variable-interval schedule seemed to provide it, and because it was easy to equate response probability with response rate, the idea took root. Yet even in the 1950s, it was well known that response rate can itself be manipulated – by so-called differential-reinforcement-of-low-rate (DRL) schedules, for example.
Conclusion. Two factors -- Skinner's single-organism method and the desire for order -- conspired to give response rate a primary role in operant conditioning. Rate was assumed to be a measure of response strength. But a third factor, disdain for theory, meant that this linkage was never much scrutinized. It is of course false: response rate does not equal response strength. Indeed, the strength concept is itself ill-defined. Hence, the field’s emphasis on response rate as the dependent variable is probably a mistake. If the strength idea is to survive the demise of rate as its best measure, something more is needed: a theory about the factors that control an operant response. But because Skinner had successfully proclaimed that theories of learning are not necessary, an adequate theory was not forthcoming for many years (see The New Behaviorism, 2014, for more on the history of Skinnerian theory).