
INTRODUCTION: WITHIN-SUBJECTS DESIGN

A within-subjects design is a design in which each subject serves in more than one condition of the experiment. By having the same subjects in more than one condition, we improve our chances of detecting differences between conditions. From a statistical viewpoint, we refer to this as
increasing the power of the experiment. Increased power means a greater chance of detecting a
genuine effect of the independent variable. In a within-subjects design, subjects serve in more than
one condition of the experiment and are measured on the dependent variable after each treatment;
thus, the design is also known as a repeated-measures design.

We can set up a variety of within-subjects designs. The basic principles remain the same: Each
subject takes part in more than one condition of the experiment. We make comparisons of the
behavior of the same subjects under different conditions. If our independent variable is having an
effect, we are often more likely to find it if we use a within-subjects design. In a between-subjects
design, the effects of our independent variable can be masked by the differences between the
groups on all sorts of extraneous variables. A comparison within each subject is more precise. If we
see different behaviors under different treatment conditions, these differences are more likely to be
linked to our experimental manipulation. Remember that the whole point of an experiment is to set
up a situation in which the independent variable is the only thing that changes systematically across
the conditions of the experiment. In a between-subjects design, we change the independent variable
across conditions. However, we also use different subjects in the different conditions. We can
usually assume that randomization controls for extraneous variables that might affect the dependent
variable. But we have even better control with a within-subjects design because we use the same
subjects over and over.

Researchers can make their data more precise by comparing responses of the same subjects under
different conditions, which eliminates the error from differences between subjects. The responses of
the same subjects are likely to be more consistent from one measurement to another. Therefore, if
responses change across conditions, the changes are more likely to be caused by the independent
variable. Because it controls for individual differences among subjects, the greater power of a within-subjects design also allows researchers to use fewer subjects.
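
The logic can be illustrated with a short simulation. The sketch below is not part of the original example; it assumes NumPy and SciPy are available, and all of the numbers in it are invented for illustration.

```python
# A minimal sketch illustrating why a within-subjects comparison has more
# power: the same simulated scores are analyzed once with a paired test
# (within-subjects) and once with an independent-groups test
# (between-subjects). All values are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 15
baseline_ability = rng.normal(50, 10, n)   # large individual differences
treatment_effect = 3                        # small, genuine effect

condition_a = baseline_ability + rng.normal(0, 2, n)
condition_b = baseline_ability + treatment_effect + rng.normal(0, 2, n)

paired = stats.ttest_rel(condition_a, condition_b)        # within-subjects
independent = stats.ttest_ind(condition_a, condition_b)   # between-subjects

print(f"paired (within-subjects):       p = {paired.pvalue:.4f}")
print(f"independent (between-subjects): p = {independent.pvalue:.4f}")
# The paired test removes the variability due to individual differences,
# so it will usually detect the small effect; the independent-groups test
# often will not, because the effect is buried in subject-to-subject noise.
```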

WITHIN-SUBJECTS FACTORIAL DESIGNS

So far we have talked about within-subjects designs that tested a single independent variable.
However, these designs can also be set up as factorial designs. Suppose a researcher was
interested in measuring how long it takes to identify different facial expressions. She might decide to
show subjects slides of people displaying four different expressions—perhaps anger, fear,
happiness, and sadness—and measure how quickly people can recognize each one. She could use
a within-subjects design, showing each subject all four kinds of faces and timing how long it takes
each subject to identify each expression. Past research suggests that for communicating emotions
on their faces, women are generally better “senders” (Fridlund et al., 1987), so the researcher also
wants to show subjects both male and female faces. In this case, she can use the sex of the person
in the slide as an additional within-subjects factor, creating a within-subjects factorial design, a
factorial design in which subjects receive all conditions in the experiment. This experiment would be
a 4 x 2 within-subjects factorial design; each subject would take part in all eight conditions (4 x 2 =
8). Subjects would see and identify eight different faces: a man and a woman displaying each of the four expressions. (As you will discover later in the module, the experimenter would also need to use a special control procedure, called counterbalancing: different subjects would see the eight faces in different orders to prevent confounding.) It is easy to see that a within-subjects factorial can require many fewer subjects than a between-subjects factorial design that is testing the same hypothesis.
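
For illustration only, the eight conditions of this hypothetical 4 x 2 design can be listed by crossing the two factors; the sketch below assumes nothing beyond the factor labels given above.

```python
# A minimal sketch: enumerating the 4 x 2 = 8 within-subjects conditions
# from the facial-expression example. Each subject would receive all eight.
from itertools import product

expressions = ["anger", "fear", "happiness", "sadness"]  # factor 1 (4 levels)
face_sex = ["male", "female"]                            # factor 2 (2 levels)

conditions = list(product(expressions, face_sex))
for i, (expression, sex) in enumerate(conditions, start=1):
    print(f"condition {i}: {sex} face showing {expression}")

print(f"total conditions per subject: {len(conditions)}")  # 8
```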

MIXED DESIGNS

We can also use a factorial design that combines one factor that is manipulated within subjects
(such as the four types of expressions) with a between-subjects factor (often a subject variable, such
as gender or age of the subjects) that cannot be manipulated by an experimenter. A design that
combines within- and between-subjects variables in a single experiment is called a mixed design.

For example, we all have trouble recalling a name from time to time, but older adults are particularly prone to name retrieval failures; in fact, they report failures to recall people's names as one of their most irritating and embarrassing memory problems (Maylor, 1990). In their
homophone experiments, Burke and her colleagues used both younger (mean age = 19.05 years)
and older (mean age = 72.23 years) adult subjects. They predicted that older subjects would benefit
more from homophone priming than younger subjects would. Testing this hypothesis required a 2
(homophone or unrelated word) x 2 (younger or older subject) mixed factorial design. The type of
word used in the definition was a within-subjects factor; the age group of the subject was a between-
subjects factor. Statistical analysis of the mixed design (Burke et al., 2004) showed that homophone
priming increased the naming speed for both groups of subjects, but it significantly improved name
recall only for the older group.

Australian researchers Jones and Menzies (2000) used a mixed factorial design in an interesting experiment to explore differences between spider phobics and nonphobics. The between-subjects factor was spider phobia status, a subject variable. The phobic group was composed of subjects reporting the strongest levels of fear of harmless spiders. A group of control subjects reporting the lowest levels of fear was selected from the same student population. The within-subjects factor consisted of measuring subjects' estimates of the likelihood of being bitten in three successive treatment conditions: spider photo, real spider, and post-spider.

In the spider photo condition, subjects were taken to the testing room containing a large, but empty,
glass cylinder with the top uncovered. Once inside the testing room, they were shown a photograph
of the cylinder when it contained two large, but harmless, Huntsman spiders (see figure below).
Huntsman spiders, found in Australia, are huge, gray-brown spiders with flat bodies, measuring as
much as 15 centimeters (almost 6 inches) across the legs. Subjects were asked to imagine being in
the room with the spiders in the uncovered cylinder. Along with other measurements, subjects
reported their feelings about how likely it was that they would be bitten by a spider in that
circumstance.

In the real spider condition, subjects were exposed to the cylinder containing real spiders (the spiders were actually dead and had been pasted to the inside of the cylinder, but subjects thought they were alive). The dependent measures were taken again. Finally, in the third (post-spider) condition, subjects were once again exposed to the empty cylinder and asked to give their ratings. The average ratings
of the two phobia status groups are shown in the figure below.

As expected, in the presence of real spiders, the phobics gave much higher estimates of the
chances of being bitten, but their estimates in the two conditions in which no real spider was present
were also higher than estimates of controls. This finding is interesting because it contradicts the
common wisdom about people with phobias—namely, that phobics can accurately evaluate the
danger of phobic stimuli when the stimuli are not immediately present.

Mixed designs are very common in all areas of psychology. The statistical procedures for analyzing
mixed designs are more complex than those for within-subjects or between-subjects factorial
designs; however, with computer statistical analysis programs so widely available, student
experimenters frequently select mixed designs.

ADVANTAGES OF WITHIN-SUBJECTS DESIGNS

In within-subjects experiments, we use the same subjects in different treatment conditions. This is a
big help when we cannot get many subjects. If we have four treatment conditions and want 15
subjects in each condition, we would only need 15 subjects if we used a within-subjects design.
Each of the 15 subjects would run through all four conditions. If we ran the same experiment
between subjects, we would need 60 subjects, 15 in each condition. A within-subjects design can
also save us time when we are actually running the experiment. If subjects must be trained, it is
more efficient to train each subject for several conditions instead of for just one.

We usually have the best chance of detecting the effect of our independent variable if we compare
the behavior of the same subjects under different conditions. The within-subjects design controls for
extraneous subject variables, the ways in which subjects differ from one another. That way, if we see differences in behavior under different conditions, we know they are unlikely to occur simply because the subjects in one condition are different individuals from those in another. From a statistical standpoint, we have a better chance of
detecting the effect of our experimental manipulation if we use a within-subjects design. By
using this type of design, we have increased the power of the experiment. The reasons parallel the
reasons we discussed in connection with the matched-groups procedures. In a sense, the within-subjects design is the most complete form of matching we can have. The influence of subject variables
across different treatment conditions is controlled because the same subjects take part in all the
treatment conditions. Each subject serves as his or her own control for extraneous subject variables
in the experiment.

In a within-subjects, or repeated-measures, design, the subject is measured after each treatment condition. Therefore, in a within-subjects experiment, we can also get an ongoing record of
subjects’ behavior over time. This gives us a more complete picture of the way the independent
variable works in the experiment. We have both practical and methodological gains with this
approach. If a within-subjects design has so many advantages, why not always use it?

DISADVANTAGES OF WITHIN-SUBJECTS DESIGNS


Practical Limitations

There are several reasons why within-subjects designs do not always work. Sometimes such
designs are just not practical. Within-subjects designs generally require each subject to spend
more time in the experiment. For instance, the various conditions of an experiment might require
the subjects to read and evaluate several stories. A researcher might need to schedule several
hours of testing if this experiment is run with a within-subjects design; each subject must spend
several hours reading and scoring several stories. The same experiment might be run with a group
of subjects in only an hour using a between-subjects design; each individual subject would spend
just one hour reading and evaluating one story.

If a procedure involves testing each subject individually, a great deal of time can be taken up by
resetting equipment for each condition. That could lead to extra hours of testing per subject. In a
perception experiment that requires calibrating several sensitive electronic instruments for each
condition, the researcher and subjects are in for some tedious testing sessions. As an alternative,
several subjects in a row could be tested in each condition, requiring fewer changes to the
equipment.

Keep in mind that experiments can easily become tedious for subjects. Subjects who are expected
to perform many tasks might get restless during the experiment and begin to make hasty
judgments to hurry the process along—leading to inaccurate data. For the most part, these
limitations are really just inconveniences. We can seek out subjects who are willing to invest a lot of
time in a study. We can spend an additional 10, 20, or even 100 hours testing subjects if it is
essential to the value of the experiment. Sometimes it is, and experimenters may spend hours or
even days testing each subject. But more serious problems, linked to the independent variable, limit
the within-subjects approach.

Interference Between Conditions

Often each subject can be in only one condition of an experiment. Taking part in more than one
condition would be either impossible or useless or would change the effect of later treatments.
Imagine that we are doing a study on car-buying preferences. We hypothesize that the type of car
individuals first learned to drive will influence their own purchasing choices later. For simplicity, let us
say that people who learn to drive in small cars (compact or smaller) will be more likely to buy small
cars than will people who learn to drive in full-sized cars. With the cooperation of a local driving
school, we randomly assign half of the subjects to each treatment condition (small or full-sized car).

We suspect that car-buying preferences are influenced by a wide range of other factors, too—
including financial status, parents’ choice of car, advertising, and perhaps even unidentified genetic
differences in personality that influence our choices. The numerous makes and models on the
market attest to the diversity of tastes. People differ greatly in what they want in a car. Because so
many subject variables are involved, perhaps a within-subjects design would be a better choice.

Actually, in this experiment our choice is simple: We cannot do a within-subjects experiment. Once
people learn to drive in a small car, they can never again learn to drive for the first time in a full-sized car. Even if we
put subjects through the same sets of lessons again, they would not respond to them in the same
fashion because they are not novices anymore. The first training condition would interfere with all
later attempts. If one treatment condition precludes another, as it does in this kind of experiment, a
between-subjects design is required.

Sometimes it is possible to run all subjects through all treatments, but it does not make sense to do
it. What if we want subjects to learn a list of words? In one condition, we tell the subjects to learn the
words by forming mental pictures (images) of them. In the other condition, we ask subjects to repeat
the words over and over. We want to use the same list in both conditions so that the difficulty of the
list will not be a confounding variable. But once the subjects have practiced the list in one condition,
they will recall it more easily in the next condition. The interference between different conditions of
an experiment is usually the biggest drawback in using within-subjects designs. If the treatments
clash so badly that we cannot give them to the same subjects, we will need to use a between-
subjects design.

Whenever we consider a within-subjects design, we also need to consider the possibility that effects
on the dependent variable might be influenced by the order in which we give the treatments.
Subjects’ responses might differ from one treatment to another just because of the position, or order,
of the series of treatments. In a within-subjects experiment, an order effect is a potential confound.
For instance, if we were asking people to watch a series of television commercials and rate how
much they liked each one, the order in which we presented the commercials might affect ratings.
The first commercial they saw might always get a higher rating than it deserves simply because it is
novel. By the third or fourth one, ratings of any commercial might be lower than they should be
because subjects have tuned out. Advertisers know about this kind of order effect and keep their
fingers crossed that their commercials will be placed first in any long commercial break. In a within-subjects design, we use special counterbalancing procedures to offset interference and to control for
potential order effects between conditions.

CONTROLLING WITHIN-SUBJECTS DESIGNS


Controlling for Order Effects

Counterbalancing

Suppose we want to do some market research on a new brand of cola. We know that people differ
greatly in their preferences for foods and beverages, so we decide to use a within-subjects design.
We would like to get people to compare their present brand of cola with our new brand. We will get
ratings on how good our cola tastes compared with the old brands. That information will tell us
whether we can expect the new product to compete with well-known brands. Our hypothesis is that
our new cola will get better ratings than the old brands. We recruit cola drinkers and bring them to
our testing center. For 2 hours before the taste test, we keep them in a lounge in which they can
relax and read magazines—but cannot eat or drink. Then we give each subject a glass of the new
cola. After subjects have had time to drink it, we ask them to indicate on a rating scale how much
they liked the taste. We want to compare subjects’ ratings of the new cola with their ratings of their
regular brand, so we introduce a second condition. We ask all subjects to drink a glass of their
favorite cola. After they finish, we get them to rate their favorite drink on the rating scale. Now we
have ratings of the two types of colas, new and old. We can compare the average ratings and see
how our product competes.

Would it surprise you to learn that the average rating of our new brand was much higher than the
average rating of the old brands? What is wrong with this experiment? The problem, of course, is
that any cola might taste good after you have had nothing to drink for several hours.

In this experiment, we varied the brand of cola that people were asked to drink. There were two
conditions, “new brand” and “old brand.” Unfortunately, in addition to varying the brand of cola, we
varied an important extraneous variable—order. We created confounding by always giving subjects the new cola first. Subjects might have rated the new brand higher because it really is delicious, but
the ratings were probably distorted because subjects had not had anything to drink for a full 2 hours
before they tasted the new product. In addition, the subjects might have given their old brand lower
ratings because they had just had something else to drink and were no longer thirsty. In this
experiment, we see that the order in which we presented the treatment conditions could have
changed the subjects’ responses. We have confounding caused by order effects.

Two other kinds of changes can occur when subjects are run in more than one condition. First,
fatigue effects can cause performance to decline as the experiment goes on: Subjects get tired. As
they solve more and more word problems, for instance, they could begin to make mistakes. They
might also become bored or irritated by the experiment and merely go through the motions until it is
over. Second, different factors may lead to improvement as the experiment proceeds—that is, to
practice effects. As subjects become familiar with the experiment, they could relax and do a little
better. They get better at using the apparatus, develop strategies for solving problems, or even catch
on to the real purpose of the study.

All these changes, both positive and negative, are called progressive error: As the experiment
progresses, results are distorted. The changes in subjects’ responses are not caused by the
independent variable; they are order effects produced when we run subjects through more than one
treatment condition. Progressive error includes any changes in the subjects’ responses that are
caused by testing in multiple treatment conditions. It includes order effects, such as the effects of
practice.

We control for any extraneous variable by making sure it affects all treatment conditions in the same
way. We can do that by eliminating the variable completely, by holding it constant, or by balancing it
out across treatment conditions. In a within-subjects experiment, we cannot eliminate order effects.
Nor can we hold them constant, giving all subjects the treatments in the same order because we are
trying to avoid just this kind of systematic effect. But we can balance them out—distribute them
across the conditions—so that they affect all conditions equally.

Think about the cola experiment. We did a poor job of setting it up because we let the order of the
colas stay the same for all subjects. Everyone tasted the new brand first. How could we redo the
experiment so that progressive error would affect the results for both kinds of colas in the same
way? We want to be sure that subjects’ ratings reflect accurate taste judgments, not merely a
difference between the first and second glass of cola. Suppose we modify our procedures a little. We
run the first condition the same as before: Subjects do not eat or drink for 2 hours; then they drink a
glass of the new brand of cola and give their ratings. But instead of having them drink the old brand
immediately, we have them return to the lounge for another 2 hours. At the end of that time, we give
them the old brand of cola and get the second set of ratings. Does this help? We avoid the problem
of having subjects drink the new brand after 2 hours in the lounge and the old brand when they are
not as thirsty. However, our data may still be contaminated by the order of the conditions. When the
subjects drink the new brand, they have been in the lounge a total of 2 hours. When they drink the
old brand, they have spent 4 hours in the lounge. By this time, they may be tired of hanging around.
They may be getting hungry. They have also had practice filling out the rating scale, as well as time
to think about what they said before.

Let’s look at how progressive error can accumulate during the course of an experiment in which
subjects solve four successive word problems. Progressive error can be illustrated by the graph in
the figure above. You can see that progressive error is low in the early part of the experiment and
increases gradually as the experiment continues. Here, the first treatment produces only 1 unit of
error; the second treatment produces 2 units. Because error increases as the experiment
progresses, the third treatment produces 3 units, and the fourth treatment produces 4 units. When
we sum the progressive error from all four treatments, we find that during the course of the
experiment we have accumulated 10 units of progressive error (1 + 2 + 3 + 4 = 10).

Fortunately, researchers have worked out several procedures for controlling for order effects. These
procedures are called counterbalancing, and they all have the same function: to distribute
progressive error across the different treatment conditions of the experiment. By using these
procedures, we can ensure that the order effects that alter results on one condition will be offset, or
counterbalanced, by the order effects operating on other conditions.

Subject-by-Subject Counterbalancing

We can control for progressive error in the cola experiment through one of two general approaches.
First, we can control it by using subject-by-subject counterbalancing, a technique for controlling
progressive error for each individual subject by presenting all treatment conditions more than once.
The idea is to redistribute the effects of progressive error so that they will equal about the same
amount in each condition that a subject completes. Two common techniques used to create subject-
by-subject counterbalancing are reverse counterbalancing and block randomization.

Reverse Counterbalancing

Let’s see what happens to progressive error in the cola experiment if we present each treatment
more than once. We give each subject two glasses of each cola instead of one, but we use reverse
counterbalancing, a technique for controlling progressive error for each individual subject by
presenting all treatment conditions twice, first in one order, then in the reverse order. We can call the
new brand “condition A” and the old brand “condition B.” We can equalize progressive error for these
two conditions by presenting them in the order ABBA. Subjects now drink four glasses of cola
instead of two, and we can use the figure above again to describe how the ABBA procedure works
to balance out the effects of progressive error.

If we run the conditions in the ABBA order, we can add up the units of progressive error for each
condition. Recall that for the first treatment (or trial), progressive error is equal to 1 unit; for the
second trial, 2 units; for the third, 3 units; and for the fourth, 4 units. Because condition A is given in
trials one and four, progressive error for condition A works out to be 5 units (1 + 4 = 5). For condition
B, given in trials two and three, progressive error also equals 5 units (2 + 3 = 5). Of course, these
numerical quantities are hypothetical, but you can see the logic behind the counterbalancing
procedure. Now both conditions contain some trials in which progressive error is relatively high and
others in which it is relatively low, but the total units of error are the same for both conditions.

Using reverse counterbalancing, each subject gets the ABBA sequence. This ensures that
progressive error affects conditions A and B about equally for each subject. If there are more than
two treatment conditions, we can counterbalance for each subject by continuing the pattern. With
three conditions, the sequence for each subject would be ABCCBA; with four, the sequence would
be ABCDDCBA; and so on.
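
A small sketch can make the bookkeeping concrete. Assuming, as in the figure described above, that progressive error grows by one hypothetical unit per trial, the code below builds a reverse-counterbalanced sequence for any set of conditions and totals the error units falling on each condition.

```python
# A minimal sketch of reverse (ABBA-style) counterbalancing, assuming
# progressive error grows linearly by 1 hypothetical unit per trial.
def reverse_counterbalanced(conditions):
    """Return each condition twice: once forward, once in reverse (e.g., ABBA)."""
    return list(conditions) + list(reversed(conditions))

def error_per_condition(sequence):
    """Sum linear progressive error (trial 1 = 1 unit, trial 2 = 2 units, ...)."""
    totals = {}
    for trial, condition in enumerate(sequence, start=1):
        totals[condition] = totals.get(condition, 0) + trial
    return totals

seq = reverse_counterbalanced(["A", "B"])
print(seq)                        # ['A', 'B', 'B', 'A']
print(error_per_condition(seq))   # {'A': 5, 'B': 5} -- equal, as in the text

seq3 = reverse_counterbalanced(["A", "B", "C"])
print(seq3)                       # ['A', 'B', 'C', 'C', 'B', 'A']
print(error_per_condition(seq3))  # {'A': 7, 'B': 7, 'C': 7}
```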

Progressive error, however, is not necessarily this easy to control completely. The figure above
illustrates error that is linear; that is, described by one straight line. But suppose true error has a
more complex distribution across trials. Perhaps subjects get a little better with practice, and they
also fatigue to some extent. On a long series of trials, they might catch their second wind and do a
bit better as the experiment draws to a close. The effects of progressive error may be curvilinear
(like an inverted U) or nonmonotonic (changing direction), as in the figure below. Suppose the
impact of progressive error looks more like that represented in the figure below. If we use the ABBA
procedure, progressive error for condition A will equal 3 units (1 + 2), and progressive error for
condition B will equal 5 units (2 + 3). This is no better than simply testing everyone on A first, then B.

Block Randomization

When progressive error is nonlinear, researchers often prefer to use block randomization. Each set
of treatments (e.g., ABCD) is considered as a single block, and treatments within each block are
given in random order. For block randomization to be successful in controlling nonlinear progressive
error, it is usually necessary to present each treatment several times, resulting in a sequence
containing a number of randomized blocks. For example, if you decided to give each treatment
(ABCD) 5 times, block randomization could produce the following sequence of five treatment blocks:

BCDA • DBAC • ACDB • CABD • BADC

The experiment would consist of 20 trials (five repeats of each condition). Clearly, block
randomization in which subjects are presented with many blocks is not ideal for all types of
experiments. However, it is commonly used in cognition, perception, and psychophysics
experiments in which treatment conditions are relatively short.
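
The sketch below shows one way such a sequence of randomized blocks could be generated; the condition labels and the number of blocks follow the ABCD example, and the particular random orders it produces will naturally differ from the ones printed above.

```python
# A minimal sketch of block randomization: each block contains every
# condition once, and the order within each block is randomized.
import random

def block_randomize(conditions, n_blocks, seed=None):
    rng = random.Random(seed)
    blocks = []
    for _ in range(n_blocks):
        block = list(conditions)
        rng.shuffle(block)      # randomize order within this block
        blocks.append(block)
    return blocks

blocks = block_randomize(["A", "B", "C", "D"], n_blocks=5, seed=1)
for block in blocks:
    print("".join(block))
# Five blocks of four trials each = 20 trials, with every condition
# presented five times, as in the example above.
```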

In reality, we rarely know precisely what progressive error will look like. Therefore, it is especially
important to be cautious when planning a design involving repeated measures for each subject. The
available control procedures might not be adequate to distribute the effects of progressive error
equally across all conditions. We rely on prior research to guide our decisions. Occasionally,
progressive error itself must become a variable for study. When we clarify its impact beforehand, we
can set up the most effective controls.

Across-Subjects Counterbalancing

One drawback of counterbalancing within each subject is that we have to present each condition to
each subject more than once. As the number of conditions increases, the length of the sequence
of treatments also increases.

Depending on the experiment, the procedures can become time-consuming, expensive, or just plain
boring for the subjects as well as for the experimenter. As an alternative, we can often use the
second general approach, across-subjects counterbalancing. These procedures serve the same
basic purpose as subject-by-subject counterbalancing: They are used to distribute the effects of
progressive error so that if we average across subjects, the effects will be the same for all conditions
of the experiment. We are not always concerned about the individual subject’s responses, but we
still want to be sure that progressive error affects each of the various treatment conditions equally.
These across-subjects techniques are complete counterbalancing and partial counterbalancing.

Complete Counterbalancing

If we always present treatment conditions in the same order, progressive error will affect some
conditions more than others. Complete counterbalancing controls for this effect by using all
possible sequences of the conditions and using every sequence the same number of times. If we
had only two treatments (AB), as we did in the cola experiment, we would give half the subjects A
first and then B. We would give the other subjects B first and then A. You can see that this is very
similar to what we did to control order effects within each subject by giving each subject both
sequences, AB and BA. But when we counterbalance across subjects, we need to give each subject
only one sequence. Different subjects are assigned to the sequences at random, and we give each
sequence to an equal number of subjects. Some subjects go through condition A without any
practice; others go through B without any practice. The effects of progressive error should turn out to
be about the same for each condition if we pool the data from all subjects.

It is easy to counterbalance completely when there are only two conditions, but suppose there are
more. Let us say that we are testing memory for faces displaying different emotions. In our
experiment, we have three sets of target photographs. One set (A) shows people who are smiling.
The second set (B) shows people who are frowning (sad faces). The third set (C) is used as a control; it shows
people whose faces appear to be neutral. Later, we will ask subjects to go through another much
larger set and pick out the people they have seen before. In this experiment, we do not want to show
all subjects the happy target faces first because people tend to have better recall for the first (and
last) things they see; things in the middle are less well recalled. We can control for this kind of error
by using complete counterbalancing. We use all possible sequences of the ABC treatment
conditions; we also use each sequence the same number of times.

Table A shows complete counterbalancing for an experiment with three treatment conditions. There
are six possible order sequences for the three conditions. For our face recognition experiment, we
would need order sequences like happy, neutral, sad (ACB) and sad, happy, neutral (BAC). To
counterbalance completely, we must use all the sequences and use each one the same number of
times. Thus, for six sequences, we would need at least six subjects. Ideally, we should have more
than one subject for each sequence, so we need a number of subjects that is a multiple of 6.
Remember that we have to use all the sequences an equal number of times. We can use 6, or 12, or
18 subjects but not 9, 11, or 17 because these are not multiples of 6.

Can we tell in advance how many sequences and how many subjects we will need? How can we be
sure we did not miss a sequence? You can find the number of possible sequences by computing N factorial, represented by N!. We get N! by multiplying N by every integer smaller than N, down to 1. In an experiment with four treatment conditions, 4! = 4 × 3 × 2 × 1 = 24, so there are 24 possible orders in which to present the conditions. Earlier we saw that there are six possible sequences when we have three conditions; you can verify that by computing 3! = 3 × 2 × 1 = 6.
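
The counting can be checked directly. The short sketch below computes N! and enumerates the actual order sequences for three and four conditions.

```python
# A minimal sketch verifying the N! rule for the number of order sequences.
from itertools import permutations
from math import factorial

three = list(permutations("ABC"))
four = list(permutations("ABCD"))

print(len(three), factorial(3))   # 6 6
print(len(four), factorial(4))    # 24 24

# The six possible sequences for three conditions (e.g., the ABC face sets):
for seq in three:
    print("".join(seq))
```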

The number of possible order sequences clearly expands very quickly as the number of treatment
conditions increases. To counterbalance an experiment completely, we need at least one subject for
each possible sequence. That means we need at least 4 times as many subjects for a four-condition
experiment as we do for a three-condition experiment (24 versus 6 possible sequences). We also
need to present each sequence an equal number of times. If we want more than one subject per
sequence, we will need multiples of 24 for a four-condition experiment—a minimum of 48 subjects.

Partial Counterbalancing

You can see that it is economical to keep the number of treatments to a minimum. If we double the
number of conditions from three to six, we increase the minimum number of subjects needed to
counterbalance completely from 6 to 720. Of course, it makes sense to omit any condition that is not
necessary for a good test of the hypothesis. Still, sometimes six or even more conditions are
essential. In those cases, we may use partial counterbalancing procedures. The basic idea is the
same. We use these procedures when we cannot do complete counterbalancing but still want to
have some control over progressive error across subjects. Partial counterbalancing controls
progressive error by using some subset of the available order sequences; these sequences are
chosen through special procedures.

The simplest partial counterbalancing procedure is randomized partial counterbalancing. When there are many possible order sequences, we randomly select as many sequences as we have
subjects for the experiment. Suppose we have 120 possible sequences (five treatment conditions)
and only 30 subjects. We would randomly select 30 sequences, and each subject would get one of
those sequences. You can see that this procedure may not control for order effects quite as
effectively as complete counterbalancing, but it is better than simply using the same order for all
subjects. If possible, use complete counterbalancing because it is safer. If you must use partial
counterbalancing, be realistic about it. If there are 720 possible order sequences in the experiment,
running three subjects just does not make sense. You will not be able to get good control over order
effects. As a general rule, use at least as many randomly selected sequences as there are
experimental conditions.
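
As a sketch of randomized partial counterbalancing, the code below generates all possible sequences for five conditions and randomly draws one sequence for each of 30 subjects; the seed and the drawn sequences are, of course, arbitrary.

```python
# A minimal sketch of randomized partial counterbalancing: with five
# conditions there are 5! = 120 possible sequences, and we randomly
# select one sequence for each of 30 subjects.
import random
from itertools import permutations

conditions = ["A", "B", "C", "D", "E"]
all_sequences = list(permutations(conditions))
print(len(all_sequences))                        # 120

rng = random.Random(7)
n_subjects = 30
selected = rng.sample(all_sequences, n_subjects)  # 30 distinct sequences

for subject, seq in enumerate(selected, start=1):
    print(f"subject {subject:2d}: {''.join(seq)}")
```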

Another procedure commonly used to select a subset of order sequences is called Latin square
counterbalancing. A matrix, or square, of sequences is constructed that satisfies the following
condition: Each treatment appears only once in any order position in the sequences. Table B shows
a basic Latin square for an experiment with four treatment conditions (a 4 x 4 matrix). Each row
represents a different order sequence. Notice that each of the four treatment conditions appears in
the first, second, third, and fourth position only once. This method controls adequately for
progressive error caused by order effects because each treatment condition occurs equally often in
each position.

Once you have selected your sequences, you would assign subjects at random to receive the
different orders. Remember that each sequence is used equally often. With four sequences, you
would need to run at least four subjects. If you have more subjects available, it is always better to
run more than one subject in each order condition. For the four-condition experiment, any multiple of
4 subjects could be used: 8, 12, 16, and so on.

Using a Latin square to determine treatment sequences will provide protection against order effects,
but it cannot control for other kinds of systematic interference between two treatment conditions.
Notice in Table B that some parts of the sequences tend to repeat themselves. For instance,
condition A comes right before B in two out of four of the sequences. If exposure to condition A
affects how subjects will respond to condition B, the Latin square will not provide enough control.
The experiment can still be confounded. This kind of systematic interference is called a carryover
effect.

[Table B: a basic 4 x 4 Latin square. Each condition (A, B, C, D) appears exactly once in each order position.]

Carryover Effects

Carryover effects are the effects of a treatment that persist, or carry over, after the treatment itself has been removed, contaminating subjects' responses in later conditions. Imagine how carryover effects could sabotage experiments. Earlier in the module, we saw how some combinations of treatments are impossible to
administer to the same subjects because one treatment precludes another. For instance, we cannot
both do and not do surgery on the same animal. Similarly, we do not want to allow one experimental
condition to interfere with a subsequent condition. For example, when inducing different emotions
within subjects, researchers have to take precautions to ensure that one emotion has completely
passed before they begin each new treatment. We do not want to give subjects treatments that will
give them clues to what they should do in later conditions. We do not want one experimental
treatment to make a later treatment easier (or harder). We do not want the effects of early
conditions to contaminate later conditions.

Notice that carryover effects differ from order effects in an important way. Order effects emerge as a
result of the position of a treatment in a sequence (first, second, third, etc.). It does not matter what
the specific treatment is; if it occurs first in a sequence, subjects will handle it differently than if it
occurs last. Carryover, however, is a function of the treatment itself. Gasoline will produce changes
in the ability to detect subsequent odors no matter whether the subject smells it first, second, or
tenth. Feeling sadness will carry over to the next emotion regardless of whether sadness is the first,
second, or fourth condition in the experiment.

We can control for carryover effects to some extent by using some of the same counterbalancing
procedures that control for order effects. Subject-by-subject counterbalancing and complete
counterbalancing will usually control carryover effects adequately by balancing them out over the
entire experiment. Control is less certain with randomized counterbalancing, and it might not be
controlled at all if Latin square counterbalancing is used. Using mathematical techniques, however, it
is possible to construct special Latin squares, called balanced Latin squares, that can control for
both order and carryover effects. In balanced Latin squares, each treatment condition (1) appears
only once in each position in the order sequences and (2) precedes and follows every other
condition an equal number of times. Table C represents a balanced Latin square for four treatment
conditions (ABCD).

[Table C: a balanced 4 x 4 Latin square. Each condition appears once in each order position, and each condition precedes and follows every other condition an equal number of times.]

Compare the balanced Latin square with the Latin square depicted in Table B. Can you see the
important differences that allow the balanced square to control for carryover as well as order effects?
In both tables, each treatment appears only once in each position in the sequence as a control for
order effects, but the sequences in Table C also control for carryover effects. Here, every treatment
precedes and follows every other treatment an equal number of times. Parts of the sequences, such
as A preceding B, do not repeat themselves, so potential carryover effects are distributed equally
over the entire experiment.
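
A balanced Latin square for an even number of conditions can be generated with a standard construction (first row 1, 2, n, 3, n-1, ..., with the remaining rows formed by cyclic shifts). The sketch below builds one for four conditions and verifies the two properties described above; the square it prints may differ from Table C only in how the conditions happen to be labeled.

```python
# A minimal sketch of a balanced Latin square (Williams-style construction)
# for an even number of conditions.
def balanced_latin_square(n):
    # First row: 0, 1, n-1, 2, n-2, 3, ...
    base = [0, 1]
    low, high = 1, n - 1
    while len(base) < n:
        base.append(high)
        high -= 1
        if len(base) < n:
            low += 1
            base.append(low)
    # Remaining rows are cyclic shifts of the first row.
    return [[(x + r) % n for x in base] for r in range(n)]

labels = "ABCD"
square = balanced_latin_square(len(labels))
for row in square:
    print(" ".join(labels[i] for i in row))

# Property 1: each condition appears once in every order position.
for position in zip(*square):
    assert sorted(position) == list(range(len(labels)))

# Property 2: each condition immediately precedes every other condition
# exactly once across the four sequences.
pairs = [(row[i], row[i + 1]) for row in square for i in range(len(row) - 1)]
assert len(pairs) == len(set(pairs)) == len(labels) * (len(labels) - 1)
print("balanced: both order and carryover constraints satisfied")
```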

Choosing Among Counterbalancing Procedures

Every experiment with a within-subjects factor will need some form of counterbalancing. In a
within-subjects design with one independent variable or a within-subjects factorial design, you must
counterbalance all conditions. If the design is a within-subjects factorial, remember to multiply the
levels of each factor together to get the total number of conditions. For instance, our earlier factorial
example of comparing slides of men and women displaying four different expressions (a 4 x 2 within-
subjects design) has eight conditions that need to be counterbalanced. In a mixed design, only
within-subjects factors need to be counterbalanced.

Deciding whether to use subject-by-subject or across-subjects counterbalancing can be a problem in itself. We need to counterbalance for each subject when we expect large differences in the pattern of
progressive error from subject to subject. In a weightlifting experiment, we might expect subjects to
fatigue at different rates. In that sort of experiment, it makes sense to counterbalance conditions for
each person in the study because that would give the most control over order and carryover effects.
In some experiments, we do not expect large differences in the way progressive error influences
each subject. When we know the effects will be about the same for everyone, we do not need to
worry about progressive error within each subject—we can counterbalance across subjects instead.

There are practical things to consider, too. You might not have the time to run through all the
conditions more than once for each subject. Or you might not have enough subjects to make
complete counterbalancing feasible. The same considerations that come into play as you select a
within- or between-subjects design can also limit your choice of controls in the within-subjects
experiment. For guidance, you should look at the procedures that have been used in similar
experiments.

If prior researchers have had success with across-subjects counterbalancing, it is probably all right
to use it. Try to avoid randomized and Latin square counterbalancing if you expect carryover effects.
When in doubt, counterbalance subject by subject if you can. The worst that can happen is that you
might overcontrol the experiment. It is always a good idea to use the procedures that give the most
control, simply because you might not know what the extraneous variables are or whether
progressive error really will be the same for all subjects.

Although we have talked about counterbalancing mainly in the context of within-subjects designs,
counterbalancing procedures can be useful in between-subjects experiments, too. For example, in a
between-subjects experiment on list learning, a researcher might compare two different study
conditions on people’s ability to memorize a list of 10 words. The researcher would want both groups
to memorize the same 10 words to make sure that the lists were equally difficult in both study
conditions. In addition, it might be desirable to use several different random orders of the items on
each list and randomly assign some number of subjects in each group to each different list. If just
one order were used, the possibility exists that the list contained some logical sequence that was
easier to learn under one set of conditions than another. Whenever subjects are presented with an
experimental stimulus that is really a group of items (words, pictures, stories, and the like), the order
of the items should be counterbalanced to avoid confounding. You may find other opportunities to
apply the counterbalancing procedures you have learned as you design your own experiments.

Order as a Design Factor

If you are concerned that a partial counterbalancing technique might not be controlling adequately
for progressive error or carryover effects, there is a way to test whether treatments are producing
similar effects in all order sequences. Researchers do this fairly routinely if they use a Latin square and have four or fewer order sequences. You can simply include treatment order as an additional
factor in the design. Suppose you wanted to conduct the happy (H) versus sad (S) film clip
experiment. You might be concerned that one type of film produces more carryover than another. It
would be a good idea to include order as a factor in your design. With only two treatment conditions,
it should be fairly easy to use both possible order sequences: Half the subjects receive the sequence
Happy-Sad, and the others receive Sad-Happy. Your experiment would be statistically analyzed as a
2 x 2 (Order x Film) mixed-factorial design (treatment order is always a between-subjects factor). If the order factor produced no significant effects, you can feel more confident that your
counterbalancing procedure worked. However, if the order of treatments produced significant effects,
you will know your experiment is confounded by order. Effects on the dependent variable could have
been produced by the order of treatments, rather than by the treatments themselves.
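
As a hedged sketch only: the module does not prescribe any particular software, but one way to analyze a 2 x 2 (Order x Film) mixed design is with the pingouin package in Python (an assumption, not something the text specifies), given long-format data with one row per subject per film condition. The ratings below are invented purely to show the layout.

```python
# A minimal sketch (assumes pandas and pingouin are installed) of analyzing
# a 2 x 2 (Order x Film) mixed design. 'order' is the between-subjects
# factor and 'film' is the within-subjects factor; all ratings are made up.
import pandas as pd
import pingouin as pg

data = pd.DataFrame({
    "subject": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "order":   ["HS"] * 6 + ["SH"] * 6,          # Happy-Sad vs. Sad-Happy
    "film":    ["happy", "sad"] * 6,
    "rating":  [7, 3, 6, 4, 5, 2, 6, 3, 4, 5, 5, 4],
})

aov = pg.mixed_anova(data=data, dv="rating", within="film",
                     subject="subject", between="order")
print(aov)
# A nonsignificant effect of 'order' (and no Order x Film interaction)
# suggests the counterbalancing worked; a significant order effect would
# indicate that the experiment is confounded by treatment order.
```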

Sometimes, you will find that the effect of one condition is much greater than that of others. The smell of gasoline, for example, may alter a subject's ability to identify subsequent scents more than the smell of lilacs does. When
one condition has more impact than others, we say that the carryover effects are asymmetrical—or,
more simply, lopsided. When one condition carries over more than others, control is extremely
difficult, if not impossible. In such situations, an experimenter should reconsider the design of the
experiment and switch to a between-subjects design if possible.

HOW CAN YOU CHOOSE A DESIGN?

How do you decide whether to use a within-subjects or a between-subjects design? First, as always,
think about the hypothesis of the experiment.

• How many treatment conditions do you need to test the hypothesis?
• Would it be possible to have each subject in more than one of these conditions? If so, you might be able to use a within-subjects design.
• Do your treatment conditions interfere with one another? If they do, you might want to use a between-subjects design.

Consider the practical advantages of each approach. Is it simpler to run the experiment one way or
the other? Which will be more time-consuming? If you can get only a few subjects, the within-
subjects design might be better. Remember that there is a trade-off: The longer the experiment
takes, the harder it might be to find willing subjects (and the more likely it is that they will become
fatigued). You can control subject variables best in a within-subjects design. If there are likely to be
large individual differences in the way subjects respond to the experiment, the within-subjects
approach is usually better.

Remember to review the research literature. If other experimenters have used within-subjects
designs for similar research problems, it is probably because that approach works best. If all other
things seem equal, use the within-subjects design. It is better from a statistical standpoint, because
you maximize your chances of detecting the effect of the independent variable.

All the experimental designs we have discussed thus far have required manipulating or selecting
independent variables and testing a number of subjects. These are called large N designs (N stands
for the number of subjects needed in the experiment). The large N approach is by far the most
common technique used in research design, but it is not the only approach used by contemporary
researchers.

SMALL N DESIGNS

Some researchers prefer to use small N designs, which test only one or a very few subjects. These
researchers argue that large N designs lack precision because they pool, or combine, the data from
many different subjects to reach conclusions about the effects of independent variables. The
conclusions of large N experiments can sometimes be misleading because they obscure the results
of individual subjects, who can vary widely in their responses to treatment conditions. Small N
researchers argue that aggregate effects are artificial because they often do not represent what
really occurs with any individual subject—instead, large N experiments can reveal only general
trends, which might produce dubious conclusions. For example, it is possible to miss the effect of an
independent variable in a large N experiment. Responses of different subjects to the same
independent variable can vary greatly; sometimes they even cancel each other out, giving the
appearance that no effects at all were produced. The following example will demonstrate how that
might happen. A related disadvantage of large N designs is that the results of data aggregated over
groups of subjects might not really be a good reflection of the reactions of individual subjects.
Studies of learning can provide a good example.

Small N designs take a very different approach to studying effects of independent variables. Here
the behavior of one or a few subjects is studied much more intensely. Typically, the researcher
measures the subject’s behavior many times. A subject can be studied in one intensive session or
over a period of weeks, months, or even years. When group or large N designs are used, subjects
are generally measured at only one point in time—right after the experimental manipulation—and
that measurement might or might not be a good representation of the subject’s response to the
treatment.

Sometimes small N designs are used for practical reasons. In clinical psychopathology research, for
example, a psychologist might want to conduct an experiment to test a new therapy treatment for
depressed individuals. The researcher might not be able to find enough depressed individuals to
form experimental and control groups for a large N study; however, a small N experiment could be
conducted in which the progress of one or a few patients could be studied intensively.

Small N designs are used in the laboratory and in clinical and other field settings; they can be used
to study both human and animal behavior. Animal researchers very often use small N designs for
practical reasons. Research animals, from mice to chimpanzees, are costly to acquire and maintain,
and their training often involves months or years. Sometimes animals must be sacrificed so that their
brains or other tissues may be studied. In these cases, researchers try to use as few animals as
possible; here small N designs make a great deal of sense.

Another area of psychology in which small N designs are common is psychophysics (the study of
how we sense and perceive physical stimuli). Basic psychophysical processes operate similarly for
most individuals, so researchers can obtain a good picture of how these processes operate by
testing a very small number of subjects. Small N research of this kind has a venerable history in
psychology. In fact, a major component of early experimentation in the United States in the late 19th
century was the study of psychophysical processes, which relied on the single-subject approach
almost exclusively. If additional subjects were tested at all, they were used to replicate effects
produced on a single subject. It wasn’t until Sir Ronald Fisher invented the statistical technique
called analysis of variance in the 1930s (Fisher, 1935) that large N or group designs requiring
inferential statistics began to gain in popularity.

Small N designs are used in many areas of psychology, but they are used most extensively in
experimentation using principles of operant conditioning. The well-known behaviorist B. F. Skinner
studied changes in the rate of behavior based on the introduction of positive or negative
consequences (reinforcement) using the small N design. His techniques have become known as
“the experimental analysis of behavior.” Skinner strongly believed that there was more to gain by
careful, continuous measurement of the behavior of a single subject than by using statistical tests to
compare data obtained from different groups of subjects. Skinner’s approach uses just one or two
subjects and requires special measurement procedures.

ABA DESIGNS

ABA refers to the order of the conditions of the experiment: A (the baseline condition) comes first,
followed by B (the experimental condition). Finally, we return to the baseline condition (A) to verify
that the change in behavior is linked to the independent variable. ABA designs may only be used if
the treatment conditions are reversible; for that reason, they are also called reversal designs. Many
small N experiments use the ABA design. The ABA design is sometimes used for large N
experiments, too.
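
As a purely illustrative sketch (the session counts are invented), data from an ABA design can be summarized by comparing the level of behavior in each phase; a drop during B followed by a return toward baseline when A is reinstated is the pattern that ties the change to the treatment.

```python
# A minimal sketch of summarizing data from an ABA (reversal) design.
# Each list holds a hypothetical count of the target behavior per session.
phases = {
    "A (baseline)":   [9, 8, 10, 9, 11],
    "B (treatment)":  [6, 4, 3, 2, 2],
    "A (withdrawal)": [7, 8, 9, 9, 8],
}

for phase, counts in phases.items():
    mean = sum(counts) / len(counts)
    print(f"{phase:15s} mean = {mean:.1f}  sessions = {counts}")

# A drop during treatment followed by a return toward the original baseline
# level is the pattern that links the behavior change to the independent
# variable in an ABA design.
```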

Variations of the ABA Format


It is also possible to use several different experimental conditions in a small N experiment by
extending the ABA format. We can proceed using different treatment conditions as follows:
ABACADA, and so on, where B, C, and D represent three different treatments. What is important is
that we collect baseline data before any experimental intervention and that we return to the baseline
condition after each experimental treatment. The small N design is frequently used to test the effects
of positive or negative reinforcement on individuals with behavioral problems. The small N approach
can be applied to many behavior modification problems. Sometimes, clinical experience leads researchers to return to baseline after some, but not all, of the treatments, using variations of the ABA format.

Until now, we have focused on the importance of returning to the baseline conditions to verify the
impact of our independent variable. However, in many clinical and behavior modification studies,
researchers choose not to return to the baseline condition, even temporarily. Kazdin explains why: If
behavior did revert to baseline levels when treatment was suspended temporarily, such a change
would be clinically undesirable. Essentially, returning the client to baseline levels of performance
amounts to making behavior worse …. In most circumstances, the idea of making a client worse just
when treatment may be having an effect is ethically unacceptable. (2003, p. 283)

This is clearly true in experiments done to modify self-injurious behaviors. Suppose you were
working with a disturbed boy who hit his head against the wall, kicked himself, and punched himself
with his fists. If you concluded that the child performed these behaviors as a way of getting attention
from caregivers, then one possible treatment would be to withhold paying any attention to the child
until the self-injurious behaviors stopped. Whenever the boy begins to harm himself, you stop talking
to him and turn away. Suppose your treatment worked, and the self-destructive behaviors
decreased.

How do we know that withdrawing attention actually caused the change in the boy’s behavior?
Perhaps the change was just a coincidence. Would we want to find out? Would we want to return to
an original set of conditions that might make the boy hurt himself again? No. Even though the
experimental procedures require a return to the baseline conditions, psychologists sacrifice some
scientific precision for ethical reasons. When we make an intervention that we hope will be
therapeutic, our primary goal is helping the patient. If we succeed in changing a patient’s behavior to
something more adaptive using only an AB design, we have accomplished that goal.

MULTIPLE BASELINE DESIGN

At times, a researcher might want to assess the effects of a treatment on two or more different
behaviors in the same person. Or a researcher might be interested in testing the effects of an
intervention on a behavior that occurs in multiple settings or situations. Alternatively, a researcher
might want to evaluate a particular kind of treatment on more than one individual. In all three
instances, the researcher has the option of using a multiple baseline design, in which a series of
baselines and treatments are compared, but once a treatment is established, it is not withdrawn.
Sometimes, a multiple baseline design can be used to solve the ethical problem posed by
withdrawing effective treatment. Let’s go back to the example of the disturbed boy with self-injurious
behaviors. We could increase the certainty that our treatment, rather than something else, produced
the boy’s behavior change by using a multiple baseline design. If the self-injurious behavior occurs in
more than one setting, a multiple baseline design is possible. Suppose the boy engaged in self-
injurious behaviors at home, at school, and during therapy sessions. Baseline behavior in all three
settings could be recorded concurrently, and then the treatment could be applied in the first setting
only. Once behavior change was established in the first setting, the treatment could be applied in the
second setting. If the self-injurious behaviors declined in the first two settings but not in the still-
untreated third, a good case could be made that the treatment was indeed efficacious.
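
To make the staggered logic concrete, here is a minimal sketch in Python of how the recorded counts might be summarized; all settings, session numbers, and counts are invented for illustration. Treatment begins at a different session in each setting, and behavior is compared before and after that point within each setting.

# Minimal sketch, hypothetical data: daily counts of self-injurious behavior
# in three settings, with treatment introduced at a different session in each.
data = {
    "home":    [9, 8, 10, 9, 3, 2, 1, 1, 0, 1],
    "school":  [7, 8, 7, 8, 8, 7, 2, 1, 1, 0],
    "therapy": [6, 5, 6, 6, 5, 6, 6, 5, 1, 0],
}
treatment_start = {"home": 4, "school": 6, "therapy": 8}  # staggered introduction

for setting, counts in data.items():
    start = treatment_start[setting]
    baseline, treatment = counts[:start], counts[start:]
    print(f"{setting:8s} baseline mean = {sum(baseline) / len(baseline):.1f}, "
          f"treatment mean = {sum(treatment) / len(treatment):.1f}")

If the drop in each setting coincides with the session at which treatment was introduced there, the staggered pattern itself argues against some outside event being responsible for the change.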

A multiple baseline design can be a very effective way of demonstrating the efficacy of an
intervention technique without reversal to the baseline condition. Instead of testing the same
intervention across different settings in the same individual, researchers can also test the
intervention on two or more individuals using a multiple baseline design. If the intervention produces
an effect in several individuals, our confidence that the effect was actually produced by the
treatment increases, and the study has greater internal validity. In addition, each subject added to
the experiment serves as a replication of the effect observed in a single subject.

STATISTICS AND VARIABILITY IN SMALL N DESIGNS

Have you noticed something unusual about the small N experiments you have learned about so far?
We have not mentioned statistics at all. That is because most small N experiments in the past have
not used them. Often it is possible to determine whether the independent variable had an effect
simply by looking at a graph of the data. For example, in the hypothetical multiple baseline study to
reduce cartoon viewing, statistics would not be necessary to judge whether the boy’s cartoon
viewing had dropped. It is easy to see just by looking at Figure A that a meaningful drop occurred.
The data presented in Figure B also illustrate a clearly observable effect (although Danforth did
present statistics to support his conclusions).

Figure A. Hypothetical multiple baseline data for cartoon viewing.

Figure B. Data from Danforth's study.

The use of statistics in small N designs remains controversial among small N researchers (Morgan &
Morgan, 2001), but it appears to be increasing as new statistical approaches become available
(Ator, 1999). Statistics can be especially helpful when the pattern of data is too ambiguous to
interpret just by looking: for instance, when there are frequent ups and downs (i.e., variability) in the
behavior of interest. Unlike the straightforward, unambiguous examples generally used to teach
experimental designs, real data often contain a great deal of variability in subjects' responses, both
within a single subject and between different subjects.

One role of statistics is to allow us to make inferences about a population from our sample data, and
it may not be reasonable to make such generalizations about whole populations from data on one or
a very few subjects. Many traditional statistical tests used for large group designs may not be
appropriate for small N studies unless 50 or more measurements are taken during each baseline
and treatment phase. In recent years, however, a number of new statistical approaches that can be
used with these designs have become available, although they are beyond the scope of this
module. Multiple baseline designs present an additional problem: besides the variability within a
single subject, an intervention can produce highly variable results in different individuals.
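
To give a concrete flavor of what such an analysis can look like, the sketch below shows one simple possibility, a randomization test on hypothetical AB (baseline versus treatment) data for a single subject. It is only an illustration, not one of the specialized methods cited above, and it ignores the autocorrelation in repeated measurements that those newer approaches are designed to handle.

# Minimal sketch, invented data: a simple randomization test for an AB design.
import random

baseline = [12, 14, 13, 15, 12, 14]   # behavior counts during baseline (A)
treatment = [8, 7, 9, 6, 7, 5]        # behavior counts during treatment (B)

observed = sum(baseline) / len(baseline) - sum(treatment) / len(treatment)

pooled = baseline + treatment
extreme = 0
n_shuffles = 10_000
for _ in range(n_shuffles):
    random.shuffle(pooled)
    a, b = pooled[:len(baseline)], pooled[len(baseline):]
    if sum(a) / len(a) - sum(b) / len(b) >= observed:  # a drop at least as large, by chance
        extreme += 1

print(f"observed drop = {observed:.2f}, approximate p = {extreme / n_shuffles:.4f}")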

CHANGING CRITERION DESIGNS

In some types of small N research using operant conditioning techniques, the behavior being
modified cannot be changed all at once. Instead, in a changing criterion design, the behavior is
modified in increments, and the criterion for success is intentionally changed as the behavior
improves. Beginning a program of weight training is a good example of a situation in which a
personal trainer could use a changing criterion design. In the initial stages of lifting free weights, the
trainer would verbally reinforce the client for lifting fairly light weights. As the client gets stronger,
however, the criterion weight that the client is expected to lift for the same praise from the trainer
gets heavier!

The same scheme is useful in behavior therapy settings. For example, after baseline performance
has been established, a therapist might contract with a parent and child to begin a program to
increase the time the child spends doing homework. A changing criterion design might be used.
Here, the amount of homework time that is required to earn prizes and rewards changes during the
therapy. For the first couple of weeks, rewards might be earned for one-half hour of homework a
day. Once the half-hour criterion is well established, it might be increased, so that 45 minutes of
homework a day are now required to earn the rewards. The criterion can be shifted upward
incrementally until it reaches a suitable amount of time. This design is particularly useful when the
eventual, desired behavior must be shaped: simple behaviors are reinforced first, then closer and
closer approximations of the final behavior, until the complete behavior is performed.
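
A minimal sketch of that shifting-criterion logic, using invented numbers for the homework example, might look like the following; the rule for raising the criterion (five consecutive successful days) is an assumption made only for illustration.

# Minimal sketch, hypothetical numbers: the reward criterion steps upward once
# the child has met the current criterion on several consecutive days.
criteria = [30, 45, 60]      # homework minutes required to earn the reward, in order
days_to_advance = 5          # consecutive successful days before raising the criterion

minutes_per_day = [32, 35, 30, 31, 34, 40, 47, 50, 46, 48, 45, 55, 62, 60, 65]

level = 0
streak = 0
for day, minutes in enumerate(minutes_per_day, start=1):
    criterion = criteria[level]
    earned = minutes >= criterion
    streak = streak + 1 if earned else 0
    print(f"Day {day:2d}: {minutes} min, criterion {criterion} min, reward earned: {earned}")
    if streak >= days_to_advance and level < len(criteria) - 1:
        level += 1           # the criterion is well established, so shift it upward
        streak = 0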

DISCRETE TRIALS DESIGNS


The designs we have looked at so far have all used baselines to show how behavior normally
occurs without the experimental manipulation. However, another type of small N design frequently
used in psychophysical research, the discrete trials design, does not rely on baselines. Instead, it
relies on presenting many applications of each treatment condition, averaging the subject's
performance on the dependent variable within each condition, and comparing performance across conditions.
Repeated presentation over many trials can provide a reliable picture of the effects of the
independent variable. And because individuals’ sensory systems are similar, the results from a small
number of subjects are likely to be generalizable to others.
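
As a minimal sketch of that trial-averaging logic, the following Python fragment simulates reaction times for one subject under two hypothetical conditions and averages them within each condition; the conditions, trial counts, and numbers are invented for illustration.

# Minimal sketch, simulated data: average reaction time per condition for one subject.
import random
from statistics import mean

random.seed(1)
trials = {
    "dim light":    [random.gauss(450, 40) for _ in range(200)],   # 200 trials per condition
    "bright light": [random.gauss(380, 40) for _ in range(200)],
}

for condition, reaction_times in trials.items():
    print(f"{condition}: mean RT = {mean(reaction_times):.0f} ms over {len(reaction_times)} trials")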

WHEN TO USE LARGE N AND SMALL N DESIGNS

The small N design is appropriate when you are studying a particular subject, such as a disturbed
child. It is also useful when very few subjects are available. You can actually carry out an entire
experiment with just one subject, although this is not always ideal. Without replication, a small N
study might have little external validity. When we do experiments, we usually want to be able to
generalize from our results—we want to be able to make statements about people or pigeons that
were not actually subjects in an experiment. Many researchers prefer to do large N studies because
they believe that they can generalize from their results more successfully. All other things being
equal, an experiment with more subjects has greater generalizability.

In a large N study, we may form separate groups of subjects for each treatment condition. The
subjects run through their assigned conditions, and we then measure them on the dependent
variable. We pool data from each group and evaluate them statistically to see if the groups behaved
differently. In a small N experiment, we watch one or a few subjects over an extended period. We
record baseline data. We introduce the experimental intervention and monitor the changes in the
dependent variable throughout the experimental condition. Typically, we take several
measurements. We can see whether the effect of the experimental intervention is instant or whether
it builds over time. Unless it would be therapeutically unwise to remove a treatment that seems to
be working, we continue to measure after the intervention is removed. We can verify that the
independent variable causes changes in behavior because we can see what happens when that
variable is removed. In short, we can often get a more complete and accurate picture of the effects
of independent variables from a small N study than from a large N study that tests the same
hypothesis.

Then why not use a small N design for every experiment? We would certainly save a lot of time
recruiting subjects. But is it safe to generalize from a small N study? Small N researchers say yes,
as long as we can evaluate how “typical” our small sample is. For example, we could compare the
behavior of our pigeon during the baseline condition with the records of other pigeons in the
research literature. If our pigeon seems to behave about the same as other pigeons do, we would
probably assume it is a typical subject. However, even if the subject behaves typically in our
baseline conditions, we still cannot be sure this particular subject is not unusually sensitive to the
independent variable.

Generalizing from the results of one or a few subjects is particularly risky when the subjects are
people. No two people would be expected to react in exactly the same way. Pooling the responses
of many individual subjects makes it more likely that the effects are generalizable to people outside
the experiment. Generalizing also depends on the type of process that is being measured. Some
psychophysical processes are quite similar in most people. We all react in much the same way to a
loud, unexpected sound, such as a gunshot, by displaying a startle reaction. Other psychological
processes—such as whether we will laugh in relief or get angry at an experimenter who fired off a
gun close to our head—show large individual differences. If we are measuring a process that is
relatively invariant, the results from a small N experiment would have greater generalizability than if
we were measuring a behavior for which we expected large differences among people.

In a small N study, we also cannot be sure that the results are not caused by some unseen accident
(a history threat). For instance, a well-meaning cleaning person who gives fertilizer to our subject
just at the time we begin talking to it could contaminate our plant study. For these reasons, it is
important to replicate the findings of a small N experiment before generalizations are made. An
experiment with multiple applications of a treatment and multiple returns to baseline is more
convincing than a single application.

It is impossible to say whether small or large N studies always have greater generality. All things
rarely are equal. A large N study with a badly biased sample might tell us little about behavior in the
population, whereas the findings of a well-controlled experiment with a single subject might be
successfully replicated again and again on different subjects. By gathering baseline data, applying
the experimental manipulation, and then returning to the baseline condition, we can get a very clear
idea of the impact of the independent variable.
