Krosnick and Presser (2010) - Question and Questionnaire Design
The heart of a survey is its questionnaire. Drawing a sample, hiring and training
interviewers and supervisors, programming computers, and other preparatory work
are all in service of the conversation that takes place between researchers and
respondents. Survey results depend crucially on the questionnaire that scripts this
conversation (irrespective of how the conversation is mediated, e.g., by an
interviewer or a computer). To minimize response errors, questionnaires should be
crafted in accordance with best practices.
Recommendations about best practices stem from experience and common lore,
on the one hand, and methodological research, on the other. In this chapter, we first
offer recommendations about optimal questionnaire design based on conventional
wisdom (focusing mainly on the words used in questions), and then make further
recommendations based on a review of the methodological research (focusing mainly
on the structural features of questions).
We begin our examination of the methodological literature by considering
open versus closed questions, a difference especially relevant to three types of
measurement: (1) asking for choices among nominal categories (e.g., ‘‘What is the
most important problem facing the country?’’), (2) ascertaining numeric quantities
(e.g., ‘‘How many hours did you watch television last week?’’), and (3) testing factual
knowledge (e.g., ‘‘Who is Joseph Biden?’’).
Next, we discuss the design of rating scales. We review the literature on the
optimal number of scale points, consider whether some or all scale points should be
labeled with words and/or numbers, and examine the problem of acquiescence
response bias and methods for avoiding it. We then turn to the impact of response
option order, outlining how it varies depending on whether categories are nominal or
ordinal and whether they are presented visually or orally.
After that, we assess whether to offer ‘‘don’t know’’ or no-opinion among a
question’s explicit response options. Next we discuss social desirability response bias
1. Use simple, familiar words (avoid technical terms, jargon, and slang);
2. Use simple syntax;
3. Avoid words with ambiguous meanings, i.e., aim for wording that all respondents
will interpret in the same way;
4. Strive for wording that is specific and concrete (as opposed to general and
abstract);
5. Make response options exhaustive and mutually exclusive;
6. Avoid leading or loaded questions that push respondents toward an answer;
7. Ask about one thing at a time (avoid double-barreled questions); and
8. Avoid questions with single or double negations.
1. Early questions should be easy and pleasant to answer, and should build rapport
between the respondent and the researcher.
2. Questions at the very beginning of a questionnaire should explicitly address
the topic of the survey, as it was described to the respondent prior to the
interview.
3. Questions on the same topic should be grouped together.
4. Questions on the same topic should proceed from general to specific.
5. Questions on sensitive topics that might make respondents uncomfortable should
be placed at the end of the questionnaire.
6. Filter questions should be included, to avoid asking respondents questions that do
not apply to them.
1. Two reservations sometimes expressed about measuring quantities with open questions are that some
respondents will say they don’t know or refuse to answer and others will round their answers. In order to
minimize missing data, respondents who do not give an amount to the open question can be asked follow-up
closed questions, such as ‘‘Was it more or less than X?’’ (see, for example, Juster & Smith, 1997). Minimizing
rounded answers is more difficult, but the problem may apply as much to closed questions as to open.
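The follow-up strategy described in this note can be sketched as a short bracketing routine. This is a minimal illustration, not the procedure used by Juster and Smith (1997): the function name `bracket_amount`, the threshold values, and the choice to ask about the middle threshold first are all assumptions made for the example.

```python
from typing import Callable, Optional, Sequence, Tuple


def bracket_amount(
    thresholds: Sequence[float],
    more_than: Callable[[float], Optional[bool]],
) -> Tuple[Optional[float], Optional[float]]:
    """Narrow an unreported amount to (lower, upper) bounds.

    `more_than(x)` stands in for asking "Was it more than x?" and returns
    True, False, or None when the respondent still cannot or will not say.
    A bound of None means that side was never established.
    """
    candidates = sorted(thresholds)
    lower: Optional[float] = None
    upper: Optional[float] = None
    lo, hi = 0, len(candidates) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        answer = more_than(candidates[mid])
        if answer is None:
            break  # stop probing; keep whatever bounds we already have
        if answer:
            lower = candidates[mid]
            lo = mid + 1
        else:
            upper = candidates[mid]
            hi = mid - 1
    return lower, upper


# Hypothetical respondent whose true (unreported) amount is 12,000.
bounds = bracket_amount([1000, 5000, 20000, 50000], lambda x: 12000 > x)
```

Even when no exact amount is given, the two follow-up answers in this example place the respondent between 5,000 and 20,000, which is usually more useful than a missing value.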
268 Jon A. Krosnick and Stanley Presser
When designing a rating scale, a researcher must specify the number of points on
the scale. Likert (1932) scaling most often uses 5 points; Osgood, Suci, and
Tannenbaum’s (1957) semantic differential uses 7 points; and Thurstone’s (1928)
equal-appearing interval method uses 11 points. The American National Election
Study surveys have measured citizens’ political attitudes over the last 60 years
using 2-, 3-, 4-, 5-, 7-, and 101-point scales (Miller, 1982). Robinson, Shaver, and
Wrightsman’s (1999) catalog of rating scales for a range of social psychological
constructs and political attitudes describes 37 using 2-point scales, 7 using 3-point
scales, 10 using 4-point scales, 27 using 5-point scales, 6 using 6-point scales, 21 using
7-point scales, two using 9-point scales, and one using a 10-point scale. Rating scales
used to measure public approval of the U.S. president’s job performance vary from
2 to 5 points (Morin, 1993; Sussman, 1978). Thus, there appears to be no standard
for the number of points on rating scales, and common practice varies widely.
In fact, however, the literature suggests that some scale lengths are preferable to
maximize reliability and validity. In reviewing this literature, we begin with a
discussion of theoretical issues and then describe the findings of relevant empirical studies.
2. Paradoxically, the openness of open questions can sometimes lead to narrower interpretations than
comparable closed questions. Schuman and Presser (1981), for instance, found that an open version of the
most important problem facing the nation question yielded many fewer ‘‘crime and violence’’ responses
than a closed version that offered that option, perhaps because respondents thought of crime as a local
(as opposed to national) problem on the open version but not on the closed. The specificity resulting from
the inclusion of response options can be an advantage of closed questions. For a general discussion of the
relative merits of open versus closed items, see Schuman (2008, chapter 2).
relatively precise and stable understanding of the meaning of each point on the scale.
Fourth, most or all respondents must agree in their interpretations of the meanings
of each scale point. And a researcher must know what those interpretations are.
If some of these conditions are not met, data quality is likely to suffer. For example,
if respondents fall in a particular region of an underlying evaluative dimension (e.g.,
‘‘like somewhat’’) but no response options are offered in this region (e.g., a scale
composed only of ‘‘dislike’’ and ‘‘like’’), respondents will be unable to rate themselves
accurately. If respondents interpret the points on a scale one way today and differently
next month, then they may respond differently at the two times, even if their
underlying attitude has not changed. If two or more points on a scale appear to have
the same meaning (e.g., ‘‘some of the time’’ and ‘‘occasionally’’) respondents may be
puzzled about which one to select, leaving them open to making an arbitrary choice.
If two people differ in their interpretations of the points on a scale, they may give
different responses even though they may have identical underlying attitudes. And if
respondents interpret scale point meanings differently than researchers do, the
researchers may assign numbers to the scale points for statistical analysis that
misrepresent the messages respondents attempted to send via their ratings.
9.3.1.1. Translation ease
The length of scales can impact the process by which
people map their attitudes onto the response alternatives. The ease of this mapping
or translation process varies, partly depending upon the judgment being reported.
For instance, if an individual has an extremely positive or negative attitude toward
an object, a dichotomous scale (e.g., ‘‘like,’’ ‘‘dislike’’) easily permits reporting that
attitude. But for someone with a neutral attitude, a dichotomous scale without a
midpoint would be suboptimal, because it does not offer the point most obviously
needed to permit accurate mapping.
A trichotomous scale (e.g., ‘‘like,’’ ‘‘neutral,’’ ‘‘dislike’’) may be problematic for
another person who has a moderately positive or negative attitude, equally far from
the midpoint and the extreme end of the underlying continuum. Adding a moderate
point on the negative side (e.g., ‘‘dislike somewhat’’) and one on the positive side of
the scale (e.g., ‘‘like somewhat’’) would solve this problem. Thus, individuals who
want to report neutral, moderate, or extreme attitudes would all have opportunities
for accurate mapping.
The value of adding even more points to a rating scale may depend upon how
refined people’s mental representations of the construct are. Although a 5-point
scale might be adequate, people may routinely make more fine-grained distinctions.
For example, most people may be able to differentiate feeling slightly favorable,
moderately favorable, and extremely favorable toward objects, in which case a
7-point scale would be more desirable than a 5-point scale.
If people do make fine distinctions, potential information gain increases as the
number of scale points increases, because of greater differentiation in the judgments
made (for a review, see Alwin, 1992). This will be true, however, only if individuals
do in fact make use of the full scale, which may not occur with long scales.
The ease of mapping a judgment onto a response scale is likely to be determined in
part by how close the judgment is to the conceptual divisions between adjacent points
on the scale. For example, when people with an extremely negative attitude are
asked, ‘‘Is your opinion of the President very negative, slightly negative, neutral,
slightly positive, or very positive?’’ they can easily answer ‘‘very negative,’’ because
their attitude is far from the conceptual division between ‘‘very negative’’ and
‘‘slightly negative.’’ However, individuals who are moderately negative have a true
attitude close to the conceptual division between ‘‘very negative’’ and ‘‘slightly
negative,’’ so they may face a greater challenge in using this 5-point rating scale. The
‘‘nearness’’ of someone’s true judgment to the nearest conceptual division between
adjacent scale points is associated with unreliability of responses — those nearer to a
division are more likely to pick one option on one occasion and another option on
a different occasion (Kuncel, 1973, 1977).
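The link between nearness to a conceptual division and unreliability can be illustrated with a small simulation. This is only a sketch under stated assumptions: the attitude continuum running from -1 to 1, the four cut points in `CUTS`, and the normally distributed occasion-to-occasion noise are all hypothetical choices, not values taken from Kuncel (1973, 1977).

```python
import random
from bisect import bisect_right

# Hypothetical conceptual divisions between the five scale points
# (very negative, slightly negative, neutral, slightly positive, very positive).
CUTS = [-0.6, -0.2, 0.2, 0.6]


def rate(judgment: float) -> int:
    """Map a momentary judgment onto a scale point from 1 to 5."""
    return bisect_right(CUTS, judgment) + 1


def switch_rate(true_attitude: float, noise_sd: float = 0.1,
                trials: int = 2000, seed: int = 0) -> float:
    """Share of simulated respondents who pick different points on two occasions."""
    rng = random.Random(seed)
    switches = 0
    for _ in range(trials):
        first = rate(true_attitude + rng.gauss(0, noise_sd))
        second = rate(true_attitude + rng.gauss(0, noise_sd))
        switches += first != second
    return switches / trials


far_from_division = switch_rate(-0.95)  # extremely negative attitude
near_division = switch_rate(-0.6)       # moderately negative, on a division
```

Under these assumptions, the respondent sitting on the division between "very negative" and "slightly negative" changes answers far more often than the respondent with an extreme attitude, even though neither true attitude has moved.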
9.3.1.2. Clarity of scale point meanings
In order for ratings to be reliable, people
must have a clear understanding of the meanings of the points on the scale. If the
meaning of scale points is ambiguous, then both reliability and validity of
measurement may be compromised.
A priori, it seems that dichotomous response option pairs are very clear in
meaning; that is, there is likely to be considerable consensus on the meaning of
options such as ‘‘favor’’ and ‘‘oppose’’ or ‘‘agree’’ and ‘‘disagree.’’ Clarity may be
compromised when a dichotomous scale becomes longer, because each point added is
one more point to be interpreted. And the more such interpretations a person must
make, the more chance there is for inconsistency over time or across individuals.
That is, it is presumably easier for someone to identify the conceptual divisions
between ‘‘favoring,’’ ‘‘opposing,’’ and being ‘‘neutral’’ on a trichotomous item than
on a seven-point scale, where six conceptual divisions must be specified.
For rating scales up to seven points long, it may be easy to specify intended
meanings of points with words, such as ‘‘like a great deal,’’ ‘‘like a moderate
amount,’’ ‘‘like a little,’’ ‘‘neither like nor dislike,’’ ‘‘dislike a little,’’ ‘‘dislike a
moderate amount,’’ and ‘‘dislike a great deal.’’ But once the number of scale points
increases above seven, point meanings may become considerably less clear.
For example, on 101-point attitude scales (sometimes called feeling thermometers),
what exactly do 76, 77, and 78 mean? Even for 11- or 13-point scales, people may be
hard-pressed to define the meaning of the scale points.
9.3.1.3. Uniformity of scale point meaning
The number of scale points used is
inherently confounded with the extent of verbal labeling possible, and this
confounding may affect uniformity of interpretations of scale point meanings
across people. Every dichotomous and trichotomous scale must, of necessity, include
verbal labels on all scale points, thus enhancing their clarity. But when scales have
four or more points, it is possible to label only the end points with words. In such
cases, comparisons with dichotomous or trichotomous scales reflect the impact of
both number of scale points and verbal labeling. It is possible to provide an effective
verbal label for each point on a scale containing more than 7 points, but doing so
becomes more difficult as the number of scale points increases beyond that length.
The respondent’s task may be made more difficult when presented with numerical
rather than verbal labels. To make sense of a numerically labeled rating scale,
respondents must first generate a verbal definition for each point and then match
these definitions against their mental representation of the attitude of interest. Verbal
labels might therefore be advantageous, because they may clarify the meanings of the
scale points while at the same time reducing respondent burden by removing a step
from the cognitive processes entailed in answering the question.
9.3.1.4. Satisficing
Finally, the optimal number of rating scale points may depend
on individuals’ cognitive skills and motivation to provide accurate reports. Offering a
midpoint on a scale may constitute a cue encouraging satisficing to people low
in ability and/or motivation, especially if its meaning is clearly either ‘‘neutral/no
preference’’ or ‘‘status quo — keep things as they are now.’’ If pressed to explain
these answers, satisficing respondents might have little difficulty defending such
replies. Consequently, offering a midpoint may encourage satisficing by providing a
clear cue offering an avenue for doing so.
However, there is a potential cost to eliminating midpoints. Some people may
truly belong at the scale midpoint and may wish to select such an option to
communicate their genuine neutrality or endorsement of the status quo. If many
people have neutral attitudes to report, eliminating the midpoint will force them to
pick a point either on the positive side or on the negative side of the scale, resulting in
inaccurate measurement.
The number of points on a rating scale can also impact satisficing via a different
route: task difficulty. Two-point scales simply require a decision of direction
(e.g., pro vs. con), whereas longer scales require decisions of both direction and
extremity. Very long scales require people to choose between many options, so these
scales may be especially difficult in terms of scale point interpretation and mapping.
Yet providing too few scale points may contribute to task difficulty by making it
impossible to express moderate positions. Consequently, task difficulty (and
satisficing as well) may be at a minimum for moderately long rating scales, resulting
in more accurate responses.
Many investigations have produced evidence useful for inferring the optimal number
of points on rating scales. Some of this work has systematically varied the number of
scale points offered while holding constant all other aspects of questions. Other work
has attempted to discern people’s natural discrimination tendencies in using rating
scales. Several of the studies we review did not explicitly set out to compare reliability
or validity of measurement across scale lengths but instead reported data that permit
us to make such comparisons post hoc.
9.3.2.1. Reliability
Lissitz and Green (1975) explored the relation of number of
scale points to reliability using simulations. These investigators generated sets of true
attitudes and random errors for groups of hypothetical respondents and then added
these components to generate responses to attitude questions on different-length
scales in two hypothetical ‘‘waves’’ of data. Cross-sectional and test–retest reliability
increased from 2- to 3- to 5-point scales but were equivalent thereafter for 7-, 9-, and
14-point scales. Similar results were obtained in simulations by Jenkins and Taber
(1977), Martin (1978), and Srinivasan and Basu (1989).
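A simulation of this general kind can be sketched in a few lines. The sketch below is not a reproduction of Lissitz and Green's design: the standard normal true attitudes, the error standard deviation of 0.5, the equal-width cut points between -2.5 and 2.5, and the function names are all assumptions chosen for illustration.

```python
import random
from bisect import bisect_right


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def cut_points(k, low=-2.5, high=2.5):
    """k - 1 equally spaced divisions defining a k-point scale (assumed layout)."""
    width = (high - low) / k
    return [low + width * i for i in range(1, k)]


def retest_reliability(k, n=5000, error_sd=0.5, seed=1):
    """Correlation between two simulated waves of answers on a k-point scale."""
    rng = random.Random(seed)
    cuts = cut_points(k)
    wave1, wave2 = [], []
    for _ in range(n):
        true_attitude = rng.gauss(0, 1)
        # Each wave adds fresh random error before mapping onto the scale.
        wave1.append(bisect_right(cuts, true_attitude + rng.gauss(0, error_sd)))
        wave2.append(bisect_right(cuts, true_attitude + rng.gauss(0, error_sd)))
    return pearson(wave1, wave2)


if __name__ == "__main__":
    for k in (2, 3, 5, 7, 9, 14):
        print(k, round(retest_reliability(k), 3))
```

Running the loop for several scale lengths shows how the simulated test-retest correlation changes with the number of points; the exact figures depend on the assumed distributions and cut points.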
Some studies have found the number of scale points to be unrelated to
cross-sectional reliability. Bendig (1954) found that ratings using 2-, 3-, 5-, 7-, or
9-point scales were equivalently reliable. Similar results have been reported for scales
ranging from 2 to 7 points (Komorita & Graham, 1965; Masters, 1974) and for
longer scales ranging from 2 to 19 points (Birkett, 1986; Matell & Jacoby, 1971;
Jacoby & Matell, 1971). Other studies have yielded differences that are consistent
with the notion that scales of intermediate lengths are optimal (Birkett, 1986; Givon
& Shapira, 1984; Masters, 1974). For example, Givon and Shapira (1984) found
pronounced improvements in item reliability when moving from 2-point scales
toward 7-point scales. Reliability continued to increase up to lengths of 11 points,
but the increases beyond 7 points were quite minimal for single items.
Another way to assess optimal scale length is to collect data on a scale with many
points and recode it into a scale with fewer points. If longer scales contain more
random measurement error, then recoding should improve reliability. But if longer
scales contain valid information that is lost in the recoding process, then recoding
should reduce data quality. Consistent with this latter hypothesis, Komorita (1963)
found that cross-sectional reliability for 6-point scales was 0.83, but only 0.71 when
the items were recoded to be dichotomous. Thus, it appears that more reliable
information was contained in the full 6-point ratings than the dichotomies. Similar
findings were reported by Matell and Jacoby (1971), indicating that collapsing scales
longer than 3 points discarded reliable information, because long scales provided
more information than short scales and were no less reliable.
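The recoding step these studies rely on can be sketched as follows. The text does not describe the exact recoding rule Komorita (1963) used, so the proportional mapping in `collapse` (a hypothetical helper) is an assumption; for a 6-point scale it splits the ratings at the middle.

```python
def collapse(rating: int, old_points: int, new_points: int) -> int:
    """Map a rating on a 1..old_points scale onto a 1..new_points scale."""
    if not 1 <= rating <= old_points:
        raise ValueError("rating outside the original scale")
    return (rating - 1) * new_points // old_points + 1


six_point = [1, 2, 3, 4, 5, 6]
dichotomy = [collapse(r, 6, 2) for r in six_point]
# Ratings 1-3 all become 1 and ratings 4-6 all become 2, so the recoded
# item keeps direction but discards every distinction of extremity.
```

Comparing the reliability of the original and recoded ratings then indicates whether the discarded distinctions were mostly noise or mostly valid information.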
Although there is some variation in the patterns yielded by these studies, they
generally support the notion that reliability is lower for scales with only two or three
points compared to those with more points, but suggest that the gain in reliability
levels off after about 7 points.
9.3.2.2. Validity
Studies estimating correlations between true attitude scores and
observed ratings on scales of different lengths using simulated data have found that
validity increases as scales lengthen from 2 points; however, as scales grow longer, the
gains in validity become correspondingly smaller (Green & Rao, 1970; Lehmann &
Hulbert, 1972; Lissitz & Green, 1975; Martin, 1973, 1978; Ramsay, 1973).
Other techniques to assess the validity of scales of different lengths have included:
correlating responses obtained from two different ratings of the same construct (e.g.,
Matell & Jacoby, 1971; Smith, 1994; Smith & Peterson, 1985; Watson, 1988; Warr,
Barter, & Brownridge, 1983), correlating attitude measures obtained using scales of
different lengths with other attitudes (e.g., Schuman & Presser, 1981, pp. 175–176),
and using the ratings obtained using different scale lengths to predict other attitudes
(Rosenstone, Hansen, & Kinder, 1986; Smith & Peterson, 1985). These studies have
typically found that concurrent validity improves with increasing scale length.
Several studies suggest that longer scales are less susceptible to question order effects
(Wedell & Parducci, 1988; Wedell, Parducci, & Lane, 1990; Wedell, Parducci, &
Geiselman, 1987). However, one study indicates that especially long scales might be
more susceptible to context effects than those of moderate length (Schwarz & Wyer,
1985). Stember and Hyman (1949/1950) found that answers to dichotomous questions
were influenced by interviewer opinion, but this influence disappeared among individuals
who were also offered a middle alternative, yielding a trichotomous question.
As with the research on reliability, these studies generally support the notion that
validity is higher for scales with a moderate number of points than for scales with
fewer, with the suggestion that validity is compromised by especially long scales.
3. Almost all the studies reviewed above involved experimental designs varying the number of rating scale
points, holding constant all other aspects of the questions. Some additional studies have explored the
impact of number of scale points using a different approach: meta-analysis. These studies have taken large
sets of questions asked in pre-existing surveys, estimated their reliability and/or validity, and
meta-analyzed the results to see whether data quality varies with scale point number (e.g., Alwin, 1992, 1997;
Alwin & Krosnick, 1991; Andrews, 1984, 1990; Scherpenzeel, 1995). However, these meta-analyses
sometimes mixed together measures of subjective judgments with measurements of objective constructs
such as numeric behavior frequencies (e.g., number of days) and routinely involved strong confounds
between number of scale points and other item characteristics, only some of which were measured and
controlled for statistically. Consequently, it is not surprising that these studies yielded inconsistent findings.
For example, Andrews (1984) found that validity and reliability were worst for 3-point scales, better for
2- and 4-point scales, and even better as scale length increased from 5 to 19 points. In contrast, Alwin and
Krosnick (1991) found that 3-point scales had the lowest reliability, found no difference in the reliabilities
of 2-, 4-, 5-, and 7-point scales, and found 9-point scales to have maximum reliability (though these latter
scales actually offered 101 response alternatives). And Scherpenzeel (1995) found the highest reliability for
4/5-point scales, lower reliability for 10 points, and even lower for 100 points. We therefore view these
studies as less informative than experiments that manipulate rating scale length.
branching format took less time in a telephone survey than the equivalent one-item
7-point scale.
Once the length of a rating scale has been specified, a researcher must decide how to
label the points. Various studies suggest that reliability is higher when all points are
labeled with words than when only some are (e.g., Krosnick & Berent, 1993).
Respondents also express greater satisfaction when more scale points are verbally
labeled (e.g., Dickinson & Zellinger, 1980). Researchers can maximize reliability and
validity by selecting labels that divide up the continuum into approximately equal
units (e.g., Klockars & Yamagishi, 1988; for a summary, see Krosnick & Fabrigar,
forthcoming).4
Many closed attitude measures are modeled after Likert’s technique, offering
statements to respondents and asking them to indicate whether they agree or disagree
with each or to indicate their level of agreement or disagreement. Other attitude
measures offer assertions and ask people to report the extent to which the assertions
are true or false, and some attitude measures ask people ‘‘yes/no’’ questions (e.g.,
‘‘Do you favor limiting imports of foreign steel?’’).
These sorts of item formats are very appealing from a practical standpoint,
because such items are easy to write. If one wants to identify people who have
positive attitudes toward bananas, for example, one simply needs to write a
statement expressing an attitude (e.g., ‘‘I like bananas’’) and ask people whether they
agree or disagree with it or whether it is true or false. Also, these formats can be used
to measure a wide range of different constructs efficiently. Instead of having to
change the response options from one question to the next as one moves from
measuring liking to perceived goodness, the same set of response options can be used.
Nonetheless, these question formats may be problematic. People may sometimes
say ‘‘agree,’’ ‘‘true,’’ or ‘‘yes’’ regardless of the question being asked of them. For
example, a respondent might agree with the statement that ‘‘individuals are mainly to
blame for crime’’ and also agree with the statement that ‘‘social conditions are mainly
to blame for crime.’’ This behavior, labeled ‘‘acquiescence,’’ can be defined as
endorsement of an assertion made in a question, regardless of the assertion’s content.
The behavior could result from a desire to be polite rather than confrontational in
interpersonal interactions (Leech, 1983), from a desire of individuals of lower social
status to defer to individuals of higher social status (Lenski & Leggett, 1960), or from
an inclination to satisfice rather than optimize when answering questionnaires
(Krosnick, 1991).
4. This suggests that analog devices such as thermometers or ladders may not be good measuring devices.
The evidence documenting acquiescence by a range of methods is now voluminous
(for a review, see Krosnick & Fabrigar, forthcoming). Consider first agree/disagree
questions. When people are given the choices ‘‘agree’’ and ‘‘disagree,’’ are not told
the statements to which they apply, and are asked to guess what answers an
experimenter is imagining, ‘‘agree’’ is chosen much more often than ‘‘disagree’’
(e.g., Berg & Rapaport, 1954). When people are asked to agree or disagree with pairs
of statements stating mutually exclusive views (e.g., ‘‘I enjoy socializing’’ vs. ‘‘I don’t
enjoy socializing’’), the between-pair correlations are negative but generally very
weakly so (Krosnick and Fabrigar report an average correlation of only –0.22 across
41 studies). Although random measurement error could cause the correlations to
depart substantially from –1.0, acquiescence could do so as well.
Consistent with this possibility, averaging across 10 studies, 52% of people agreed
with an assertion, whereas only 42% of people disagreed with the opposite assertion
(Krosnick & Fabrigar, forthcoming). Another set of eight studies compared answers
to agree/disagree questions with answers to forced choice questions where the order
of the views expressed by the response alternatives was the same as in the
agree/disagree questions. On average, 14% more people agreed with an assertion than
expressed the same view in the corresponding forced choice question. In seven other
studies, an average of 22% of the respondents agreed with both a statement and its
reversal, whereas only 10% disagreed with both. Thus, taken together, these methods
suggest an acquiescence effect averaging about 10%.
Other evidence indicates that the tendency to acquiesce is a general inclination
of some individuals across questions. The cross-sectional reliability of the tendency
to agree with assertions averaged 0.65 across 29 studies. And the over-time
consistency of the tendency to acquiesce was about 0.75 over one month, 0.67 over
four months, and 0.35 over four years (e.g., Couch & Keniston, 1960; Hoffman,
1960; Newcomb, 1943).
Similar results (regarding correlations between opposite assertions, endorsement
rates of items, their reversals, and forced choice versions, and so on) have been
produced in studies of true/false questions and of yes/no questions, suggesting that
acquiescence is present in responses to these items as well (see Krosnick & Fabrigar,
forthcoming). And there is other such evidence regarding these response alternatives.
For example, people are much more likely to answer yes/no factual questions
correctly when the correct answer is ‘‘yes’’ than when it is ‘‘no’’ (e.g., Larkins &
Shaver, 1967; Rothenberg, 1969), presumably because people are biased toward
saying ‘‘yes.’’
Acquiescence is most common among respondents who have lower social status
(e.g., Gove & Geerken, 1977; Lenski & Leggett, 1960), less formal education (e.g.,
Ayidiya & McClendon, 1990; Narayan & Krosnick, 1996), lower intelligence (e.g.,
Forehand, 1962; Hanley, 1959; Krosnick, Narayan, & Smith, 1996), lower cognitive
energy (Jackson, 1959), less enjoyment from thinking (Messick & Frederiksen, 1958),
and less concern to convey a socially desirable image of themselves (e.g., Goldsmith,
1987; Shaffer, 1963). Also, acquiescence is most common when a question is difficult
(Gage, Leavitt, & Stone, 1957; Hanley, 1962; Trott & Jackson, 1967), when
respondents have become fatigued by answering many prior questions (e.g., Clancy &
Wachsler, 1971), and when interviews are conducted by telephone as opposed to
face-to-face (e.g., Calsyn, Roades, & Calsyn, 1992; Holbrook, Green, & Krosnick, 2003).
Although some of these results are consistent with the notion that acquiescence results
from politeness or deferral to people of higher social status, all of the results are
consistent with the satisficing explanation.
If this interpretation is correct, acquiescence might be reduced by assuring
(through pretesting) that questions are easy for people to comprehend and answer
and by taking steps to maximize respondent motivation to answer carefully and
thoughtfully. However, no evidence is yet available on whether acquiescence can be
reduced in these ways. Therefore, a better approach to eliminating acquiescence is
to avoid using agree/disagree, true/false, and yes/no questions altogether. This is
especially sensible because answers to these sorts of questions are less valid and less
reliable than answers to the ‘‘same’’ questions expressed in a format that offers
competing points of view and asks people to choose among them (e.g., Eurich, 1931;
Isard, 1956; Watson & Crawford, 1930).
One alternative approach to controlling for acquiescence is derived from the
presumption that certain people have acquiescent personalities and are likely to do
all of the acquiescing. According to this view, a researcher needs to identify those
people and statistically adjust their answers to correct for this tendency (e.g., Couch
& Keniston, 1960). To this end, many batteries of items have been developed to
measure a person’s tendency to acquiesce, and people who offer lots of ‘‘agree,’’
‘‘true,’’ or ‘‘yes’’ answers across a large set of items can then be spotlighted as likely
acquiescers. However, the evidence on moderating factors (e.g., position in the
questionnaire and mode of administration) that we reviewed above suggests that
acquiescence is not simply the result of having an acquiescent personality; rather,
it is influenced by circumstantial factors. Because this ‘‘correction’’ approach does
not take that into account, the corrections performed are not likely to fully adjust for
acquiescence.
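The identification step of this ‘‘correction’’ approach can be sketched as a simple count of agreeing answers. The function names, the 0.8 cutoff, and the sample data are hypothetical; the sketch only shows how likely acquiescers would be spotlighted and inherits the limitation just described, since it ignores circumstantial factors.

```python
AGREEING = {"agree", "true", "yes"}


def acquiescence_index(answers):
    """Share of a respondent's answers that endorse the assertion."""
    if not answers:
        raise ValueError("no answers to score")
    return sum(a.strip().lower() in AGREEING for a in answers) / len(answers)


def likely_acquiescers(respondents, threshold=0.8):
    """Names of respondents whose index meets an assumed cutoff."""
    return sorted(
        name
        for name, answers in respondents.items()
        if acquiescence_index(answers) >= threshold
    )


# Hypothetical answers to a battery of ten unrelated assertions.
sample = {
    "r1": ["agree"] * 9 + ["disagree"],
    "r2": ["agree", "disagree"] * 5,
}
flagged = likely_acquiescers(sample)
```

Here only the first respondent is flagged, having agreed with nine of the ten assertions.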
It might seem that acquiescence can be controlled by measuring a construct with
a large set of agree/disagree or true/false items, half of them making assertions
opposite to the other half (called ‘‘item reversals;’’ see Paulhus, 1991). This approach
is designed to place acquiescers in the middle of the dimension, but it will do so only
if the assertions made in the reversals are as extreme as the original statements.
Furthermore, it is difficult to write large sets of item reversals without using the word
‘‘not’’ or other such negations, and evaluating assertions that include negations is
cognitively burdensome and error-laden for respondents, thus adding measurement
error and increasing respondent fatigue (e.g., Eifermann, 1961; Wason, 1961). Even if
one is able to construct appropriately reversed items, acquiescers presumably end up
at a point on the measurement dimension where most probably do not belong on
substantive grounds. That is, if these individuals were induced not to acquiesce and
to instead answer the items thoughtfully, their final scores would presumably be more
valid than placing them at or near the midpoint of the dimension.
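The scoring logic behind item reversals can be illustrated with a minimal Python sketch. The six-item battery, the 1-5 agree/disagree coding, and the helper name `score_balanced_battery` are hypothetical choices made for illustration and are not drawn from any of the studies cited here. The sketch shows that a respondent who agrees with every assertion lands at the scale midpoint once the reversed items are recoded, whereas a respondent who answers by content does not.

```python
from statistics import mean

# Assumed coding: 1 = strongly disagree ... 5 = strongly agree.
SCALE_MIN, SCALE_MAX = 1, 5


def score_balanced_battery(answers, reversed_items):
    """Average the answers after recoding reversed items (hypothetical helper).

    answers: list of integer ratings, one per item.
    reversed_items: set of item indexes whose assertion runs in the opposite direction.
    """
    recoded = [
        (SCALE_MIN + SCALE_MAX - a) if i in reversed_items else a
        for i, a in enumerate(answers)
    ]
    return mean(recoded)


# Six items, half of them reversals (indexes 3, 4, 5).
reversals = {3, 4, 5}

# An acquiescer "agrees" (4) with every assertion, whatever its direction.
acquiescer_score = score_balanced_battery([4, 4, 4, 4, 4, 4], reversals)

# A respondent with a positive attitude agrees with the original items
# and disagrees with the reversals.
thoughtful_score = score_balanced_battery([4, 5, 4, 2, 1, 2], reversals)

print(acquiescer_score)   # the midpoint of the 1-5 scale
print(thoughtful_score)   # well above the midpoint
```

Note that the midpoint placement holds only because the sketch treats the reversals as exactly as extreme as the original statements, which is the condition the text identifies as necessary.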
Most important, answering an agree/disagree, true/false, or yes/no question
always requires respondents to first answer a comparable rating question with
construct-specific response options. For example, people asked to agree or disagree
with the assertion ‘‘I like bananas,’’ must first decide how positive or negative their
attitudes are toward bananas (perhaps concluding ‘‘I love bananas’’) and then
translate that conclusion into the appropriate selection in order to answer the
question. Researchers who use such questions presume that arraying people along
the agree/disagree dimension corresponds monotonically to arraying them along the
underlying substantive dimension of interest. That is, the more people agree with the
assertion ‘‘I like bananas,’’ the more positive is their true attitude toward bananas.
Yet consider respondents asked for their agreement with the statement ‘‘I am
usually pretty calm.’’ They may ‘‘disagree’’ because they believe they are always very
calm or because they are never calm, which violates the monotonic equivalence of the
response dimension and the underlying construct of interest. As this example makes
clear, it would be simpler to ask people directly about the underlying dimension. Every
agree/disagree, true/false, or yes/no question implicitly requires the respondent to rate
an object along a continuous dimension, so asking about that dimension directly is
bound to be less burdensome. Not surprisingly, then, the reliability and validity of
rating scale questions that array the full attitude dimension explicitly (e.g., from
‘‘extremely bad’’ to ‘‘extremely good,’’ or from ‘‘dislike a great deal’’ to ‘‘like a great
deal’’) are higher than those of agree/disagree, true/false, and yes/no questions that
focus on only a single point of view (e.g., Ebel, 1982; Mirowsky & Ross, 1991; Ruch &
DeGraff, 1926; Saris & Krosnick, 2000; Wesman, 1946). Consequently, it seems best
to avoid agree/disagree, true/false, and yes/no formats altogether and instead ask
questions using rating scales that explicitly display the evaluative dimension.
Many studies have shown that the order in which response alternatives are presented
can affect their selection. Some studies show primacy effects (options more likely to
be selected when they are presented early); others show recency effects (options more
likely to be selected when presented last), and still other studies show no order effects
at all. Satisficing theory helps explain these results.
We consider first how response order affects categorical questions and then turn
to its effect in rating scales. Response order effects in categorical questions (e.g.,
‘‘Which do you like more, peas or carrots?’’) appear to be attributable to ‘‘weak
satisficing.’’ When confronted with categorical questions, optimal answering would
entail carefully assessing the appropriateness of each of the offered response
alternatives before selecting one. In contrast, a weak satisficer would simply choose
the first response alternative that appears to constitute a reasonable answer. Exactly
which alternative is most likely to be chosen depends in part upon whether the
choices are presented visually or orally.
When categorical alternatives are presented visually, either on a show-card in a
face-to-face interview or in a self-administered questionnaire, weak satisficing is
likely to bias respondents toward selecting choices displayed early in a list.
Respondents are apt to consider each alternative individually beginning at the top
of the list, and their thoughts are likely to be biased in a confirmatory direction
(Koriat, Lichtenstein, & Fischhoff, 1980; Klayman & Ha, 1984; Yzerbyt & Leyens,
1991). Given that researchers typically include choices that are plausible,
confirmation-biased thinking will often generate at least a reason or two in favor of most of
the alternatives in a question.
After considering one or two alternatives, the potential for fatigue (and therefore
reduced processing of later alternatives) is significant. Fatigue may also result
from proactive interference, whereby thoughts about the initial alternatives interfere
with thinking about later, competing alternatives (Miller & Campbell, 1959).
Weak satisficers cope by thinking only superficially about later alternatives; the
confirmatory bias thereby advantages the earlier items. Alternatively, weak satisficers
can simply terminate their evaluation altogether once they come upon an alternative
that seems to be a reasonable answer. Because many answers are likely to seem
reasonable, such respondents are again apt to end up choosing alternatives near the
beginning of a list. Thus, weak satisficing seems liable to produce primacy effects
under conditions of visual presentation.
When response alternatives are presented orally, as in face-to-face or telephone
interviews, the effects of weak satisficing are more difficult to anticipate. This is so
because order effects reflect not only evaluations of each option, but also the limits of
memory. When categorical alternatives are read aloud, presentation of the second
alternative terminates processing of the first one, usually relatively quickly.
Therefore, respondents are able to devote the most processing time to the final
items; these items remain in short-term memory after interviewers pause to let
respondents answer.
It is conceivable that some people listen to a short list of categorical alternatives
without evaluating any of them. Once the list is completed, these individuals may
recall the first alternative, think about it, and then progress forward through the list
from there. Given that fatigue should instigate weak satisficing relatively quickly,
a primacy effect would be expected. However, because this approach requires more
effort than first considering the final items in the list, weak satisficers are unlikely
to use it very often. Therefore, considering only the allocation of processing, we
would anticipate both primacy and recency effects, though the latter should be more
common than the former.
These effects of deeper processing are likely to be reinforced by the effects of
memory. Categorical alternatives presented early in a list are most likely to enter
long-term memory (e.g., Atkinson & Shiffrin, 1968), and those presented at the
end of a list are most likely to be in short-term memory immediately after the list is
heard (e.g., Atkinson & Shiffrin, 1968). Furthermore, options presented late are
disproportionately likely to be recalled (Baddeley & Hitch, 1977). So options
presented at the beginning and end of a list are more likely to be recalled after the
question is read, particularly if the list is long. Therefore, both early and late
categorical options should be more available for selection, especially among weak
satisficers. Short-term memory usually dominates long-term memory immediately
after acquiring a list of information (Baddeley & Hitch, 1977), so memory factors
should promote recency effects more than primacy effects. Thus, in response to orally
presented questions, mostly recency effects would be expected, though some primacy
effects might occur as well.
Schwarz and Hippler (1991) and Schwarz, Hippler, and Noelle-Neumann (1992)
note two additional factors that may govern response order effects: the plausibility of
the response alternatives presented and perceptual contrast effects. If deep processing
is accorded to an alternative that seems highly implausible, even people with a
confirmatory bias in reasoning may fail to generate any reasons to select it. Thus, deeper
processing of some alternatives may make them especially unlikely to be selected.
Although studies of response order effects in categorical questions seem to offer a
confusing pattern of results when considered as a group, a clearer pattern appears when
the studies are separated into those involving visual and oral presentation. In visual
presentation, primacy effects have been found (Ayidiya & McClendon, 1990; Becker,
1954; Bishop, Hippler, Schwarz, & Strack, 1988; Campbell & Mohr, 1950; Israel &
Taylor, 1990; Krosnick & Alwin, 1987; Schwarz et al., 1992). In studies involving oral
presentation, nearly all response order effects have been recency effects (McClendon,
1986; Berg & Rapaport, 1954; Bishop, 1987; Bishop et al., 1988; Cronbach, 1950;
Krosnick, 1992; Krosnick & Schuman, 1988; Mathews, 1927; McClendon, 1991;
Rubin, 1940; Schuman & Presser, 1981; Schwarz et al., 1992; Visser, Krosnick,
Marquette, & Curtin, 2000).5
If the response order effects demonstrated in these studies are due to weak
satisficing, then these effects should be stronger under conditions where satisficing
is most likely. And indeed, these effects were stronger among respondents with
relatively limited cognitive skills (Krosnick, 1990; Krosnick & Alwin, 1987; Krosnick
et al., 1996; McClendon, 1986, 1991; Narayan & Krosnick, 1996). Mathews (1927)
also found stronger primacy effects as questions became more and more difficult and
as people became more fatigued. And although McClendon (1986) found no relation
between the number of words in a question and the magnitude of response order
effects, Payne (1949/1950) found more response order effects in questions involving
more words and words that were more difficult to comprehend. Also, Schwarz et al.
(1992) showed that a strong recency effect was eliminated when prior questions on
the same topic were asked, which presumably made knowledge of the topic more
accessible and thereby made optimizing easier.
Much of the logic articulated above regarding categorical questions seems
applicable to rating scales, though it operates in a different way.
Many people’s attitudes are probably not perceived as precise points on an
underlying evaluative dimension but rather are seen as ranges or ‘‘latitudes of
acceptance’’ (Sherif & Hovland, 1961; Sherif, Sherif, & Nebergall, 1965). If satisficing
respondents consider the options on a rating scale sequentially, they may select the
first one that falls in their latitude of acceptance, yielding a primacy effect under both
visual and oral presentation.
Nearly all of the studies of response order effects in rating scales involved visual
presentation, and when order effects appeared, they were almost uniformly primacy
effects (Carp, 1974; Chan, 1991; Holmes, 1974; Johnson, 1981; Payne, 1971; Quinn &
Belson, 1969). Furthermore, the two studies of rating scales that used oral
presentation found primacy effects as well (Kalton, Collins, & Brook, 1978; Mingay
& Greenwell, 1989). Consistent with the satisficing notion, Mingay and Greenwell
(1989) found that their primacy effect was stronger for people with more limited
cognitive skills. However, these investigators found no relation of the magnitude of
the primacy effect to the speed at which interviewers read questions, despite the fact
that a fast pace presumably increased task difficulty. Also, response order effects
were no stronger when questions were placed later in a questionnaire (Carp, 1974).
Thus, the moderators of rating scale response order effects may be different from the
moderators of such effects in categorical questions, though more research is clearly
needed to fully address this question.
How should researchers handle response order effects when designing survey
questions? One seemingly effective way to do so is to counterbalance the order in
which choices are presented. Counterbalancing is relatively simple to accomplish
with dichotomous questions; a random half of the respondents can be given one
order, and the other half can be given the reverse order. When the number of
response choices increases, the counterbalancing task can become more complex.
However, when it comes to rating scales, it makes no sense to completely randomize
the order in which scale points are presented, because that would eliminate the
sensible progressive ordering from positive to negative, negative to positive, most to
least, least to most, etc. Therefore, for scales, only two orders ought to be used,
regardless of how many points are on the scale.
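The two-order counterbalancing described above can be sketched in Python as follows. The scale labels, the function names, and the fixed random seed are assumptions made for illustration only. The sketch assigns a random half of respondents to each order and maps the position a respondent chose back to a single canonical coding.

```python
import random

# Hypothetical five-point scale, listed in its canonical (positive-to-negative) order.
SCALE = [
    "like a great deal",
    "like somewhat",
    "neither like nor dislike",
    "dislike somewhat",
    "dislike a great deal",
]


def assign_orders(respondent_ids, seed=0):
    """Randomly give half of the respondents the forward order and half the reverse."""
    ids = list(respondent_ids)
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    rng.shuffle(ids)
    half = len(ids) // 2
    return {
        rid: ("forward" if pos < half else "reverse")
        for pos, rid in enumerate(ids)
    }


def presented_options(order):
    """Return the scale as shown to a respondent in the given order."""
    return list(SCALE) if order == "forward" else list(reversed(SCALE))


def canonical_code(order, position_chosen):
    """Map the position chosen on screen (0-based) back to the canonical scale index."""
    if order == "forward":
        return position_chosen
    return len(SCALE) - 1 - position_chosen


orders = assign_orders(range(10))

# A respondent in the reverse condition who picks the first option shown
# has chosen "dislike a great deal", which is canonical index 4.
print(presented_options("reverse")[0], canonical_code("reverse", 0))
```

Keeping the assigned order alongside each answer, as the `orders` mapping does here, lets the analyst later compare the two halves and estimate any order effect.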
Unfortunately, counterbalancing order creates a new problem: variance in responses
due to systematic measurement error. Once response alternative orders have been
varied, respondent answers may differ from one another partly because different people
received different orders. One might view this new variance as random error variance,
the effect of which would be to attenuate observed relations among variables and leave
marginal distributions of variables unaltered. However, given the theoretical
explanations for response order effects, this error seems unlikely to be random.
Thus, in addition to counterbalancing presentation order, it seems potentially
valuable to take steps to reduce the likelihood of the effects occurring in the first place.
The most effective method for doing so presumably depends on the cognitive
mechanism producing the effect. If primacy effects are due to satisficing, then steps that
reduce satisficing should reduce the effects. For example, with regard to motivation,
questionnaires can be kept short, and accountability can be induced by occasionally
asking respondents to justify their answers. With regard to task difficulty, the wording
of questions and answer choices can be made as simple as possible.
What happens when people are asked a question about which they have no relevant
knowledge? Ideally, they will say that they do not know the answer. But respondents
may wish not to appear uninformed and may therefore give an answer to satisfy the
question such as ‘‘Do you favor or oppose U.S. government aid to Nicaragua?’’
a respondent’s first step would be to search long-term memory for any information
relevant to the objects mentioned: U.S. foreign aid and Nicaragua. If no information
about either is recalled, the individual can quickly respond by saying ‘‘don’t know.’’
But if some information is located about either object, the person must then retrieve
that information and decide whether it can be used to formulate a reasonable
opinion. If not, the individual can then answer ‘‘don’t know,’’ but the required search
time makes this a relatively slow response. Glucksberg and McCloskey (1981)
reported a series of studies demonstrating that ‘‘don’t know’’ responses do indeed
occur either quickly or slowly, the difference resulting from whether or not any
relevant information can be retrieved in memory.
According to the proponents of DK filters, the most common reason for DKs is
that the respondent lacks the necessary information and/or experience with which
to form an attitude. This would presumably yield quick, first-stage DK responses.
In contrast, second-stage DK responses could occur for other reasons, such as
ambivalence: some respondents may know a great deal about an object and/or have
strong feelings toward it, but their thoughts and/or feelings may be contradictory,
making it difficult to select a single response.
DK responses might also result at the point at which respondents attempt to
translate their judgment into the choices offered by a question. Thus, people may
know approximately where they fall on an attitude scale (e.g., around 6 or 7 on a 1–7
scale), but because of ambiguity in the meaning of the scale points or of their internal
attitudinal cues, they may be unsure of exactly which point to choose, and therefore
offer a DK response. Similarly, individuals who have some information about an
object, have a neutral overall orientation toward it, and are asked a question without
a neutral response option might say DK because the answer they would like to give
has not been conferred legitimacy. Or people may be concerned that they do not
know enough about the object to defend an opinion, so their opinion may be
withheld rather than reported.
Finally, it seems possible that some DK responses occur before respondents have
even begun to attempt to retrieve relevant information. Thus, respondents may say
‘‘don’t know’’ because they do not understand the question (see, e.g., Fonda, 1951).
There is evidence that DK responses occur for all these reasons, but when
people are asked directly why they say ‘‘don’t know,’’ they rarely mention lacking
information or an opinion. Instead they most often cite other reasons such as
ambivalence (Coombs & Coombs, 1976; Faulkenberry & Mason, 1978; Klopfer &
Madden, 1980; Schaeffer & Bradburn, 1989).
Satisficing theory also helps account for the fact that DK filters do not
consistently improve data quality (Krosnick, 1991). According to this perspective,
people have many latent attitudes that they are not immediately aware of holding.
Because the bases of those opinions reside in memory, people can retrieve those
bases and integrate them to yield an overall attitude, but doing so requires
significant cognitive effort (optimizing). When people are disposed not to do this work
and instead prefer to shortcut the effort of generating answers, they may attempt
to satisfice by looking for cues pointing to an acceptable answer that requires
little effort to select. A DK option constitutes just such a cue and may
therefore encourage satisficing, whereas omitting the DK option is more apt to
encourage respondents to do the work necessary to retrieve relevant information
from memory.
This perspective suggests that DK options should be especially likely to attract
respondents under the conditions thought to foster satisficing: low ability to
optimize, low motivation to do so, or high task difficulty. Consistent with this
reasoning, DK filters attract individuals with more limited cognitive skills, as well as
those with relatively little knowledge and exposure to information about the attitude
object (for a review, see Krosnick, 1999). In addition, DK responses are especially
common among people for whom an object is low in personal importance, of little
interest, and arouses little affective involvement. This may be because of lowered
motivation to optimize under these conditions. Furthermore, people are especially
likely to say DK when they feel they lack the ability to formulate informed opinions
(i.e., subjective competence), and when they feel there is little value in formulating
such opinions (i.e., demand for opinionation). These associations may arise at
the time of attitude measurement: low motivation inhibits a person from drawing
on knowledge available in memory to formulate and carefully report a substantive
opinion of an object.
DK responses are also more likely when questions appear later in a questionnaire,
at which point motivation to optimize is presumably waning (Culpepper, Smith, &
Krosnick, 1992; Krosnick et al., 2002; Dickinson & Kirzner, 1985; Ferber, 1966;
Ying, 1989). Also, DK responses become increasingly common as questions become
more difficult to understand (Converse, 1976; Klare, 1950).
Hippler and Schwarz (1989) proposed still another reason why DK filters may
discourage reporting of real attitudes: Strongly worded DK filters (e.g., ‘‘or haven’t
you thought enough about this issue to have an opinion?’’) might suggest that a great
deal of knowledge is required to answer a question and thereby intimidate people
who feel they might not be able to adequately justify their opinions. Consistent
with this reasoning, Hippler and Schwarz found that respondents inferred from
the presence and strength of a DK filter that follow-up questioning would be more
extensive, would require more knowledge, and would be more difficult. People
motivated to avoid extensive questioning or concerned that they could not defend
their opinions might be attracted toward a DK response.
A final reason why people might prefer the DK option to offering meaningful
opinions is the desire not to present a socially undesirable or unflattering image
of themselves. Consistent with this claim, many studies found that people who
offered DK responses frequently would have provided socially undesirable responses
(Cronbach, 1950, p. 15; Fonda, 1951; Johanson, Gips, & Rich, 1993; Kahn &
Hadley, 1949; Rosenberg, Izard, & Hollander, 1955).
Taken together, these studies suggest that DKs often result not from genuine lack
of opinions but rather from ambivalence, question ambiguity, satisficing,
intimidation, and self-protection. In each of these cases, there is something meaningful to be
learned from pressing respondents to report their opinions, but DK response options
discourage people from doing so. As a result, data quality does not improve when
such options are explicitly included in questions.
In order to distinguish ‘‘real’’ opinions from ‘‘non-attitudes,’’ follow-up questions
that measure attitude strength may be used. Many empirical investigations have
confirmed that attitudes vary in strength, and the task respondents presumably face
when confronting a ‘‘don’t know’’ response option is to decide whether their attitude
is sufficiently weak to be best described by that option. But because the appropriate
cut point along the strength dimension is both hard to specify and unlikely to be
specified uniformly across respondents, it seems preferable to encourage people to
report their attitude and then describe where it falls along the strength continuum
(see Krosnick, Boninger, Chuang, Berent, & Carnot, 1993 and Wegener, Downing,
Krosnick, & Petty, 1995 for a discussion of the nature and measurement of the
various dimensions of strength).
Belli, Traugott, Young, and McGonagle (1999) reported that offering these
categories reduced voting reports, though their comparisons simultaneously varied
other features as well.
Finally, consistent with our advice in the preceding section on don’t knows, it is
better not to provide explicit DK options for sensitive items, as they are more apt to
provide a cover for socially undesirable responses.
6. The strategies we review generally apply to questions about objective phenomena (typically behavior).
For a review of problems associated with the special case of recalling attitudes, see Smith (1984) and
Markus (1986).
7. As a reminder to the interviewer of the importance of a slower pace, pause notations may be included in
the text of the question, e.g.: ‘‘In a moment, I’m going to ask you whether you voted on Tuesday,
November 5th (PAUSE) which was ____ days ago. (PAUSE) Before you answer, think of a number of
different things that will likely come to mind if you actually did vote this past election day; (PAUSE) things
like whether you walked, drove or were driven. (PAUSE) After thinking about it, you may realize that you
did not vote in this particular election. (PAUSE) Now that you have thought about it, which of these
statements best describes you: I did not vote in the November 5th election; (PAUSE) I thought about
voting but didn’t; (PAUSE) I usually vote but didn’t this time; (PAUSE) I am sure I voted in the
November 5th election.’’
memory (e.g., not saying the first thing that comes to mind); formally asking
respondents to commit to doing a good job in line with the instructions; and having
the interviewer provide positive feedback to respondents when they appear to be
satisfying the instructions. Cannell et al. (1981) showed that these methods, each of
which needs to be built into the questionnaire, improved reporting (see also Kessler,
Wittchen, Abelson, & Zhao, 2000).
Irrespective of how much time or effort the respondent invests, however, some
information will be difficult to recall. When records are available, the simplest
approach to improving accuracy is to ask respondents to consult them. Alternatively,
respondents may be asked to enter the information in a diary at the time of encoding
or shortly thereafter. This requires a panel design in which respondents are contacted
at one point and the diaries collected at a later point (with respondents
often contacted at an intermediate point to remind them to carry out the task).8
For discussions of the diary method, see Verbrugge (1980) and Sudman and Ferber
(1979).
Accuracy may also be increased by reducing the burden of the task respondents
are asked to perform. This can be done by simplifying the task itself or by assisting
the respondent in carrying it out. One common way of simplifying the task is to
shorten the reference period. Respondents will have an easier time recalling how
often they have seen a physician in the last month than in the last year, and it is easier
to recall time spent watching television yesterday than last week.
Most reference periods, however, will be subject to telescoping — the tendency to
remember events as having happened more recently (forward telescoping) or less
recently (backward telescoping) than they actually did. Neter and Waksberg (1964)
developed the method of bounded recall to reduce this problem. This involves a panel
design, in which the second interview asks respondents to report about the period
since the first interview (with everything reported in the second interview compared
to the reports from the initial interview to eliminate errors). Sudman, Finn, and
Lannom (1984) proposed that at least some of the advantages of bounding could be
obtained in a single interview, by asking first about an earlier period and then about
the more recent period of interest. This was confirmed by an experiment they did, as
well as by a similar one by Loftus, Klinger, Smith, and Fiedler (1990).
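The comparison step in bounded recall can be sketched in Python as follows. The event descriptions and dates are invented, the function name `bounded_reports` is hypothetical, and matching events on their description alone is a simplifying assumption; an actual study would need a more careful matching rule. The sketch drops any event from the second interview that was already reported in the first.

```python
def bounded_reports(first_interview, second_interview):
    """Keep only second-interview events that were not already reported earlier.

    Each interview is a list of (event_description, reported_date) tuples.
    Matching on the description alone is a simplifying assumption of this sketch.
    """
    already_reported = {description for description, _ in first_interview}
    return [
        (description, date)
        for description, date in second_interview
        if description not in already_reported
    ]


# Invented example data.
first = [("bought refrigerator", "2023-03-10"), ("roof repair", "2023-05-02")]

# In the second interview the roof repair is reported again with a later
# date, which is the pattern forward telescoping would produce.
second = [("roof repair", "2023-07-15"), ("new water heater", "2023-08-01")]

print(bounded_reports(first, second))
```

Only the water heater survives the comparison, because the roof repair was already on record from the bounding interview.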
Another way of simplifying the task involves decomposition: dividing a single
question into its constituent parts. Cannell, Oksenberg, Kalton, Bischoping, and
Fowler (1989), for example, suggested that the item:
During the past 12 months since July 1st, 1987, how many times have
you seen or talked with a doctor or a medical assistant about your
health?
can be decomposed into four items (each with the same 12 month reference period):
overnight hospital stays; other times a doctor was seen; times a doctor was not seen
but a nurse or other medical assistant was seen; and times a doctor, nurse or other
medical assistant was consulted by telephone.9
8. The diary approach, by sensitizing respondents to the relevant information, may also be used to
gather information that respondents would otherwise not encode (e.g., children’s immunizations). But a
potential drawback of the method is that it may influence behavior, not just measure it.
In self-administered modes, checklists can sometimes be used to decompose an
item. Experimental evidence suggests that checklists should be structured in ‘‘did-did
not’’ format as opposed to ‘‘check-all-that-apply,’’ partly because respondents take
longer to answer forced choice items, and partly because forced choice results are
easier to interpret (Smyth, Dillman, Christian, & Stern, 2006).
When it is not feasible to simplify the task, several methods may be used to assist
the respondent in carrying it out. All involve attempts to facilitate recall by linking
the question to memories related to the focal one. Thus, Loftus and Marburger
(1983) reported that the use of landmark events (e.g., ‘‘since the eruption of
Mt. St. Helens …’’) appeared to produce better reporting than the more conventional
approach (e.g., ‘‘in the last six months …’’). Similarly, Means, Swan, Jobe, and
Esposito (1991) and Belli, Smith, Andreski, and Agrawal (2007) found that calendars
containing key events in the respondent’s life improved reporting about other events
in the respondent’s past.10
Another way to aid recall is to include question cues similar to those that were
present at the time of encoding. Instead of asking whether a respondent was
‘‘assaulted,’’ for instance, the inquiry can mention things the respondent might have
experienced as assault — whether someone used force against the respondent:
This kind of cuing may not only improve recall; it also more clearly conveys the
task (by defining ‘‘assault’’). But the cues must cover the domain well, as events
characterized by uncued features are apt to be underreported relative to those with
cued features.11
9. Belli, Schwarz, Singer, and Talarico (2000), however, suggest that decomposition works less well for
measuring nondistinctive, frequent events.
10. As the administration of the calendars in both studies involved conversational or flexible interviewing —
a departure from conventional standardized interviewing — further research is needed to determine how
much of the improved reporting was due to the calendar, per se, and how much to interviewing style.
11. Place cues may also aid recall. Thus, in the context of crime, one might ask whether victimizations
occurred at home, work, school, while shopping, and so on. Likewise, cues to the consequences of events
may be helpful. In the case of crime, for example, one might ask respondents to think about times they
were fearful or angry (Biderman et al., 1986). On the use of emotions cues, more generally, see Kihlstrom,
Mulvaney, Tobias, and Tobis (2000).
As we noted in the earlier section on open versus closed questions, when asking
about amounts, open questions are typically preferable to closed questions, because
category ranges using absolute amounts can be interpreted in unwanted ways
(Schwarz et al., 1985), and categories using vague quantifiers (e.g., ‘‘a few,’’ ‘‘some,’’
and ‘‘many’’) can be interpreted differently across respondents (Schaeffer, 1991).
When quantities can be expressed in more than one form, accuracy may be
improved by letting respondents select the reporting unit they are most familiar with.
In asking about job compensation, for instance, respondents might be offered a
choice of reporting in hourly, weekly, annual, or other terms, as opposed to the
researcher choosing a unit for everyone. More generally, given the risk of error, it is
usually best to avoid having respondents perform computations that researchers can
perform from respondent-provided components.
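As an illustration of leaving the computation to the researcher, the following Python sketch converts compensation reported in the respondent's chosen unit into an annual figure. The conversion factors (40 hours per week, 52 weeks per year), the unit names, and the function name `annual_compensation` are assumptions made for this sketch; a real study would more likely ask respondents for their own hours and weeks worked.

```python
# Assumed conversion factors; a real study would collect hours and weeks
# worked from the respondent rather than fix them in advance.
HOURS_PER_WEEK = 40
WEEKS_PER_YEAR = 52

PERIODS_PER_YEAR = {
    "hourly": HOURS_PER_WEEK * WEEKS_PER_YEAR,
    "weekly": WEEKS_PER_YEAR,
    "annual": 1,
}


def annual_compensation(amount, unit):
    """Convert an amount reported in the respondent's chosen unit to an annual figure."""
    if unit not in PERIODS_PER_YEAR:
        raise ValueError(f"unsupported reporting unit: {unit!r}")
    return amount * PERIODS_PER_YEAR[unit]


# Three invented respondents, each reporting in the unit most familiar to them.
reports = [(20.0, "hourly"), (800.0, "weekly"), (41600.0, "annual")]
annualized = [annual_compensation(amount, unit) for amount, unit in reports]
print(annualized)
```

Under these assumed factors the three reports describe the same annual pay, so the respondents never have to do the arithmetic themselves.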
Survey results may be affected not only by the wording of a question, but by the
context in which the question is asked. Thus, decisions about the ordering of items in
a questionnaire — fashioning a questionnaire from a set of questions — should be
guided by the same aim that guides wording decisions — minimizing error.
Question order has two major facets: serial (location in a sequence of items) and
semantic (location in a sequence of meanings). Both may affect measurement by
influencing the cognitive processes triggered by questions.
Serial order can operate in at least three ways: by affecting motivation, promoting
learning, and producing fatigue.
Items at the very beginning of a questionnaire may be especially likely to influence
willingness to respond to the survey, because they can shape respondents’
understanding of what the survey is about and what responding to it entails. Thus, a
questionnaire’s initial items should usually bear a strong connection to the topic and
purpose that were described in the survey introduction, engage respondent interest, and
impose minimal respondent burden. This often translates into a series of closed attitude
questions, though factual items can be appropriate as long as the answers are neither
difficult to recall nor sensitive in nature. It is partly for this reason that background and
demographic characteristics most often come at the end of questionnaires.
Conventional wisdom holds that responses to early items may be more prone to
error because rapport has not been fully established or the respondent role has not
been completely learned. We know of no experiments demonstrating either of these
effects, although Andrews (1984) reported nonexperimental evidence suggesting
that questions performed less well at the very beginning of a questionnaire. These
considerations support the recommendation that difficult or sensitive items should
not be placed early in a questionnaire.
12. In a similar vein, Peytchev, Couper, McCabe, and Crawford (2006) found that visible skip instructions
in the scrolling version of a web survey led more respondents to choose a response that avoided subsequent
questions for an item on alcohol use (though not for one on tobacco use) compared to a page version with
invisible skips. For findings on related issues, see Gfroerer, Lessler, and Parsley (1997).
13. Paper and pencil administration constitutes an exception to this rule as the skip patterns entailed by the
recommendation are apt to produce significant error in that mode.
14. Although context can affect judgments about whether or not items are related, this effect is likely to be
restricted to judgments about items on the same or similar topics.
performance in batteries of personality items. Although order did not influence item
means, it did alter item-to-total correlations: the later an item appeared in a
unidimensional battery, the more strongly answers to the item correlated with the
total score. Put differently, the more questions from the battery an item followed, the
more apt it was to be interpreted in the intended manner and/or the more readily
respondents retrieved information relevant to the answer. However, Smith (1983)
reported inconsistent results on the effects of grouping items, and others (Metzner &
Mann, 1953; Baehr, 1953; Martin, 1980) have found no effect.15
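For readers who want to examine item-to-total correlations in their own batteries, the following is a rough sketch. It uses the "corrected" variant, in which each item is correlated with the sum of the other items; that choice, and the function names, are assumptions made here for illustration and may differ from the computations in the studies cited.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / sqrt(var_x * var_y)

def item_total_correlations(responses):
    """responses: one list of item scores per respondent, items in a fixed order.

    Returns, for each item, its correlation with the total of the
    remaining items (so the item does not correlate with itself).
    """
    n_items = len(responses[0])
    correlations = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        rest = [sum(row) - row[j] for row in responses]
        correlations.append(pearson(item, rest))
    return correlations
```

If item positions are rotated across respondents, computing these correlations separately by position would show whether they rise for items administered later in the battery.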
A different kind of effect of grouping on retrieval was reported by Cowan,
Murphy, and Wiener (1978), who found that respondents reported significantly
more criminal victimization when the victimization questions followed a series of
attitudinal questions about crime. Answering earlier questions about crime may have
made it easier for respondents to recall victimization episodes.
Although grouping related questions may improve measurement, it can lead to
poorer assessment under some circumstances. For instance, several experiments have
shown that respondents’ evaluations of their overall life satisfaction were affected by
whether the item followed evaluations of specific life domains, but the effect’s nature
depended on the number of previous related items. When the general item was
preceded by a single item about marital satisfaction, some respondents assumed —
having just been asked about their marriage — that the general item was inquiring
about other aspects of their life, so they excluded marital feelings. By contrast, when
the general item was preceded by items about several other domains — including
marriage — then respondents were apt to assume the general item was asking
them to summarize across the domains, and thus they were likely to draw on feelings
about their marriage in answering it (Schwarz, Strack, & Mai, 1991; Tourangeau,
Rasinski, & Bradburn, 1991).16
The results from these experiments suggest a qualification of the conventional
advice to order related questions in a “funnel,” from more general to more specific.
Although “general” items are more susceptible to influence from “specific” ones than
vice versa (because more general items are more open to diverse interpretation), these
context experiments suggest that such influence can improve measurement by
exerting control over context (and thereby reducing the diversity of interpretations).
Changing the weights respondents give to the factors relevant to answering a
question is another way in which context operates — by influencing the extent to
which a factor is salient or available to the respondent at the time the question is
posed. In one of the largest context effects ever observed, many fewer Americans said
that the United States should admit communist reporters from other countries
when that item was asked first than when it followed an item that asked whether the
Soviet Union should admit American reporters (Schuman & Presser, 1981). In this
case, a consistency dynamic was evoked when the item came second (making a
comparison explicit), but not when it came first (leaving the comparison implicit
at best).17
15. Couper, Traugott, and Lamias (2001) and Tourangeau, Couper, and Conrad (2004) found that
correlations between items in a web survey were slightly stronger when the items appeared together on a
single screen than when they appeared one item per screen.
16. Similar findings for general and specific ratings of communities have been reported by Willits and
Saltiel (1995).
In other cases, context can influence the meaning of response options by changing
the nature of the standard used to answer a question. For instance, ratings of Bill
Clinton might differ depending on whether they immediately follow evaluations of
Richard Nixon or of Abraham Lincoln (cf. Carpenter & Blackwood, 1979).
When question ordering affects the meaning of response options or the weighting
of factors relevant to answering an item, one context does not necessarily yield better
measurement than another. Instead, the effects reflect the fact that choices (in “real
world” settings no less than in surveys) are often inextricably bound up with the
contexts within which the choices are made (Slovic, 1995). Thus, decisions about how
to order items should be informed by survey aims. When possible, question context
should be modeled on the context to which inference will be made. In an election
survey, for instance, it makes sense to ask about statewide races after nationwide
races, since that is the order in which the choices appear on the ballot. But in the
many cases that have no single real-world analog, consideration should be given to
randomizing question order.18
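Randomization of this kind is straightforward in computerized administration. The sketch below is one possible approach, not a procedure described in the chapter: the function name `randomized_order` and the seeding scheme are hypothetical. Each respondent receives an independent shuffle that can be reproduced later from the respondent identifier.

```python
import random

def randomized_order(question_ids, respondent_id, seed=0):
    """Return a reproducible, respondent-specific ordering of question_ids.

    Seeding the generator with the study seed plus the respondent id means
    the order a respondent saw can be regenerated at analysis time.
    """
    rng = random.Random(f"{seed}-{respondent_id}")
    order = list(question_ids)  # copy, so the master list is left untouched
    rng.shuffle(order)
    return order
```

Storing the realized order (or the seed) with each case lets analysts test for order effects rather than merely averaging over them.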
Although context effects can be unpredictable, they tend to occur almost
exclusively among items on the same or closely related topics (Tourangeau, Singer, &
Presser, 2003). Likewise, the effects are almost always confined to contiguous
items (Smith, 1988; but for an exception to this rule, see Schuman, Kalton, &
Ludwig, 1983).19 Schwarz and Bless (1992) and Tourangeau, Rips, and Rasinski
(2000) provide good theoretical discussions of survey context. An important tool
for identifying potential order effects in a questionnaire is pretesting, to which we
turn next.
17. Lorenz, Saltiel, and Hoyt (1995) found similar results for two pairs of items, one member of which
asked about the respondents’ behavior toward their spouse and the other of which asked about their
spouse’s behavior toward them.
18. When the survey goal includes comparison to results from another survey, replicating that survey’s
questionnaire context is desirable.
19. With paper and pencil self-administration and some computerized self-administration, respondents
have an opportunity to review later questions before answering earlier ones. Thus, in these modes, later
items can affect responses to earlier ones (Schwarz & Hippler, 1995), although such effects are probably
not common.
Probably the least structured evaluation method is expert review, in which one or
more experts critique the questionnaire. The experts are typically survey methodologists,
but they can be supplemented with specialists in the subject matter(s) of the
questionnaire. Reviews are done individually or as part of a group discussion.
As many of the judgments made by experts stem from rules, attempts have been
made to draw on these rules to fashion an evaluation task that nonexperts can do.
Probably the best known of these schemes is the Questionnaire Appraisal System
(QAS), a checklist of 26 potential problems (Willis & Lessler, 1999; see also Lessler &
Forsyth, 1996). In an experimental comparison, Rothgeb, Willis, and Forsyth (2001)
found that the QAS identified nearly every one of 83 items as producing a problem
whereas experts identified only about half the items as problematic — suggesting the
possibility of numerous QAS false positives. In a smaller-scale analysis of 8 income
items, by contrast, van der Zouwen and Smit (2004) reported substantial agreement
between QAS and expert review.
Evaluations may also be computerized. The Question Understanding Aid
(QUAID) — computer software based partly on computational linguistics — is
designed to identify questions that suffer from five kinds of problems: unfamiliar
20. Prior to pretesting, researchers will often benefit from self-administering their questionnaires (role
playing the respondent), which provides an opportunity for them to discover the difficulties they have
answering their own questions.
Methods not involving data collection can only make predictions about whether
items cause problems. By contrast, methods employing data collection can provide
evidence of whether items, in fact, cause problems. The most common form of pretest
data collection — conventional pretesting — involves administering a questionnaire
to a small sample of the relevant population under conditions close to, or identical
to, those of the main survey. Interviewers are informed of the pretest’s objectives, but
respondents are not. The data from conventional pretests consist partly of the
distribution of respondent answers to the questions, but mainly of the interviewers’
assessments of how the questions worked, which are typically reported at a group
debriefing discussion (though sometimes on a standardized form instead of, or in
addition to, the group discussion).
Conventional pretest interviews may be used as the foundation for several other
testing methods. Behavior coding, response latency, vignettes, and respondent
debriefings may all be grafted onto conventional pretest interviews.
Behavior coding measures departures from the prototypical sequence in which the
interviewer asks the question exactly as it appears in the questionnaire and then the
respondent provides an answer that meets the question’s aim. Coding may be carried
out by monitors as interviews are conducted or (more reliably) from recordings of the
interviews. The most basic code (e.g., Fowler & Cannell, 1996) identifies departures
the interviewer makes from the question wording as well as departures the
respondent makes from a satisfactory answer, for instance, requesting clarification
or expressing uncertainty.23 Hess, Singer, and Bushery (1999) found that problematic
respondent behavior as measured by behavior codes was inversely related to an
item’s reliability. Dykema, Lepkowski, and Blixt (1997) found that several
respondent behavior codes were associated with less-accurate answers (though, for
one item, substantive changes in the interviewer’s reading of the question were
associated with more accurate answers).
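Once interviews have been coded, the codes are typically summarized item by item. The following sketch tallies the share of administrations of each item that received at least one problem code. The code labels, the function names, and the 0.15 cutoff are hypothetical placeholders chosen for illustration; they are not taken from the studies cited above.

```python
from collections import defaultdict

# Hypothetical code labels; a real coding scheme would define its own.
PROBLEM_CODES = {"wording_change", "clarification_request", "uncertainty"}

def problem_rates(administrations):
    """administrations: iterable of (item_id, codes) pairs, one pair per time
    an item was asked, where codes is the set of behavior codes assigned.

    Returns {item_id: share of administrations with any problem code}.
    """
    asked = defaultdict(int)
    problematic = defaultdict(int)
    for item_id, codes in administrations:
        asked[item_id] += 1
        if PROBLEM_CODES & set(codes):
            problematic[item_id] += 1
    return {item: problematic[item] / asked[item] for item in asked}

def flag_items(rates, threshold=0.15):
    """Items whose problem rate meets or exceeds an illustrative cutoff."""
    return sorted(item for item, rate in rates.items() if rate >= threshold)
```

Flagged items are candidates for closer inspection, not proof of a defect; the cutoff is a judgment call that each project would need to set for itself.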
Response latency measures the time it takes respondents to answer a question.
It may be assessed either during an interview, by having the interviewer press a key
on finishing an item and again when the respondent begins to answer, or after the
interview is completed, by listening to recordings (which, as with
behavior coding, is less error-prone). Unfortunately, the interpretation of longer
times is not always straightforward, as delays in responding could mean that a
question is difficult to process (usually a bad sign) or that the question encourages
thoughtful responding (typically a good sign). The one study we know of that
addresses this issue with validation data (Draisma & Dijkstra, 2004) found that
longer response latencies were associated with more incorrect answers, though
another study that addressed the issue more indirectly (Bassili & Scott, 1996)
reported mixed results.
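The keypress procedure described above yields two timestamps per administration, from which latencies follow by subtraction. This is a minimal sketch under assumed conventions: the record layout, the function name `median_latency_by_item`, and the use of the median (chosen here because latency distributions tend to have long right tails) are illustrative rather than drawn from the studies cited.

```python
from statistics import median

def median_latency_by_item(records):
    """records: iterable of (item_id, question_end, answer_start) tuples,
    with both times in seconds on the same clock.

    Returns {item_id: median latency in seconds}.
    """
    latencies = {}
    for item_id, question_end, answer_start in records:
        latencies.setdefault(item_id, []).append(answer_start - question_end)
    return {item: median(values) for item, values in latencies.items()}
```

As the text cautions, a long median latency marks an item for review but does not by itself say whether the delay reflects difficulty or careful thought.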
Vignettes describe hypothetical situations that respondents are asked to judge.
They have been adapted to pretesting to gauge how concepts conveyed in questions
are understood. In their test of the meaning of the Current Population Survey’s work
item, for example, Campanelli, Rothgeb, and Martin (1989) administered vignettes
like the following:
23. More elaborate behavior codes (e.g., van der Zouwen & Smit, 2004) consider interaction sequences,
e.g., the interviewer reads the question with a minor change, followed by the respondent’s request for
clarification, which leads the interviewer to repeat the question verbatim, followed by the respondent
answering satisfactorily.
The multiplicity of testing methods raises questions about their uniqueness — the
extent to which different methods produce different diagnoses. Studies that compare
two or more methods applied to a common questionnaire often show a mixed
picture: significant overlap in the problems identified but considerable disagreement
as well. The interpretation of these results, however, is complicated by the fact
that most of the studies rely on a single trial of each method. Thus, differences
between methods could be due to unreliability, the tendency of the same method to
yield different results across trials.
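One simple way to quantify overlap, whether between two methods or between two trials of the same method, is the share of flagged items that both flagged. The sketch below uses the Jaccard index for this purpose; that measure and the function name are choices made here for illustration, and the comparison studies discussed in this section may have used other statistics.

```python
def jaccard_overlap(flagged_a, flagged_b):
    """Share of items flagged by either source that were flagged by both.

    Returns 1.0 when neither source flags anything (perfect agreement).
    """
    a, b = set(flagged_a), set(flagged_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Comparing the between-method value against the between-trial value for a single method gives a rough sense of how much apparent disagreement between methods could be attributed to unreliability.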
As might be expected, given its relatively objective nature, behavior coding has
been found to be highly reliable (Presser & Blair, 1994). Conventional pretests, expert
reviews, and cognitive interviews, by contrast, have been shown to be less reliable
(Presser & Blair, 1994; DeMaio & Landreth, 2004). The computer methods (QUAID
and SQP) may be the most reliable, though we know of no research demonstrating
the point. Likewise, the structure of the remaining methods (QAS, response latency,
vignettes, and respondent debriefings) suggests their reliability would be between that
of conventional pretests, expert reviews and cognitive interviews, on the one hand,
and computerized methods, on the other. But, again, we know of no good estimates
of these reliabilities.
Inferences from studies that compare testing methods are also affected by the
relatively small number of items used in the studies and by the fact that the items are
not selected randomly from a well-defined population. Nonetheless, we can
generalize to some extent about differences between the methods. The only methods
that tend to diagnose interviewer (as opposed to respondent) problems are behavior
coding (which explicitly includes a code for interviewer departures from verbatim
question delivery) and conventional pretests (which rely on interviewer reports).
Among respondent problems, the methods seem to yield many more comprehension
difficulties (about the task respondents think the question poses) than performance
difficulties (about how respondents do the task), and — somewhat surprisingly —
this appears most true for cognitive interviews (Presser & Blair, 1994). Conventional
testing, behavior coding, QAS, and response latency are also less apt than the other
approaches to provide information about how to repair the problems they identify.
Although there is no doubt that all of the methods uncover problems with
questions, we know only a little about the degree to which these problems are
significant, i.e., affect the survey results. And the few studies that address this issue
(by reference to reliability or validity benchmarks) are generally restricted to a single
method, thereby providing no information on the extent to which the methods differ
in diagnosing problems that produce important consequences. This is an important
area for future research.
Given the present state of knowledge, we believe that questionnaires will
often benefit from a multimethod approach to testing. Moreover, when significant
changes are made to a questionnaire to repair problems identified by pretesting,
it is usually advisable to mount another test to determine whether the revisions
have succeeded in their aim and not caused new problems. When time and
money permit, this multimethod, multi-iteration approach to pretesting can be
usefully enhanced by split sample experiments that compare the performance of
different versions of a question or questionnaire (Forsyth, Rothgeb, & Willis, 2004;
Schaeffer & Dykema, 2004).
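When a split sample experiment compares the share of respondents giving a particular answer under two question versions, one common analysis is a two-proportion z-test. The sketch below is offered as an illustration under simple random assignment; the function name is hypothetical, and the test ignores complex sample design features that a production analysis would need to address.

```python
from math import erfc, sqrt

def two_proportion_z(yes_a, n_a, yes_b, n_b):
    """Pooled two-proportion z-test for versions A and B of a question.

    Returns (z, two_sided_p_value).
    """
    p_a, p_b = yes_a / n_a, yes_b / n_b
    pooled = (yes_a + yes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("test is undefined when all answers are identical")
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value
```

A significant difference shows that the versions behave differently; deciding which version measures better still requires a reliability or validity benchmark of the kind discussed above.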
9.11. Conclusion
Researchers who compose questionnaires should find useful guidance in the
specific recommendations for the wording and organization of survey questionnaires
that we have offered in this chapter. They should also benefit from two more general
recommendations. First, questionnaire designers should review questions from
earlier surveys before writing their own. This is partly a matter of efficiency — there
is little sense in reinventing the wheel — and partly a matter of expertise: the design
of questions and questionnaires is an art as well as a science and some previous
questions are likely to have been crafted by skillful artisans or those with many
resources to develop and test items.
Moreover, even when questions from prior surveys depart from best practice, they
may be useful to borrow. This is because replicating questions opens up significant
analytical possibilities: comparisons with the results from other times and from
other populations. As such comparisons require constant wording, it will be
appropriate to ask questions that depart from best practice in these cases.
Will such comparisons be affected by the response errors that arise from the
departure from best practice? Not if the response errors are constant across the
surveys. Unfortunately, most of the literature on question wording and context
focuses on univariate effects, so we know less about the extent to which response
effects vary between groups (i.e., the effect on bivariate or multivariate relationships).
Although there is evidence that some response effects (e.g., acquiescence) may affect
comparisons between certain groups (e.g., those that differ in educational
attainment), there is evidence in other cases for the assumption of “form-resistant
correlations” (Schuman & Presser, 1981).
Relevant evidence can be generated by repeating the earlier survey’s item on only
a random subsample of the new survey, and administering an improved version to
the remaining sample. This will not yield definitive evidence (because it relies on the
untested assumption that the effect of wording is — or would have been — the same
in the different surveys), but it can provide valuable information about the measures.
Second, just as different versions of the “same” item administered to split samples
can be instructive, multiple indicators of a single construct (administered to the entire
sample) can likewise be valuable. Although the emphasis in the question literature is
generally on single items, there is usually no one best way to measure a construct, and
research will benefit from the inclusion of multiple measures. This is true both in the
narrow psychometric sense that error can be reduced by combining measures and
in the broader sense of discovery-making when it turns out that the measures do
not in fact tap the same construct.
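The psychometric point can be illustrated with the Spearman-Brown formula from classical test theory, which predicts the reliability of a composite of k measures. The chapter does not name this formula; it is introduced here as an illustration, and it rests on the strong assumption that the measures are parallel (equally reliable indicators of the same construct).

```python
def spearman_brown(reliability, k):
    """Predicted reliability of a composite of k parallel measures,
    each with the given single-measure reliability (between 0 and 1)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return k * reliability / (1 + (k - 1) * reliability)
```

Under these assumptions, combining two indicators that each have reliability 0.5 gives a predicted composite reliability of two thirds. The formula offers no guidance when the indicators turn out to tap different constructs, which is the broader sense of discovery noted above.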
References
Allen, B. P. (1975). Social distance and admiration reactions of ‘unprejudiced’ whites. Journal
of Personality, 43, 709–726.
Alwin, D. F. (1992). Information transmission in the survey interview: Number of response
categories and the reliability of attitude measurement. Sociological Methodology, 22, 83–118.
Alwin, D. F. (1997). Feeling thermometers versus 7-point scales: Which are better? Sociological
Methods & Research, 25, 318–340.
Alwin, D. F., & Krosnick, J. A. (1991). The reliability of survey attitude measurement.
The influence of question and respondent attributes. Sociological Methods & Research, 20,
139–181.
Anderson, B. A., Silver, B. D., & Abramson, P. R. (1988). The effects of the race of the
interviewer on race-related attitudes of black respondents in SRC/CS national election
studies. Public Opinion Quarterly, 52, 289–324.
Bishop, G. F., Oldendick, R. W., Tuchfarber, A. J., & Bennett, S. E. (1979). Effects of opinion
filtering and opinion floating: Evidence from a secondary analysis. Political Methodology, 6,
293–309.
Bogart, L. (1972). Silent politics: Polls and the awareness of public opinion. New York: Wiley.
Burchell, B., & Marsh, C. (1992). The effect of questionnaire length on survey response.
Quality and Quantity, 26, 233–244.
Cacioppo, J. T., Petty, R. E., Feinstein, J. A., & Jarvis, W. B. G. (1996). Dispositional
differences in cognitive motivation: The life and times of individuals varying in need for
cognition. Psychological Bulletin, 119, 197–253.
Calsyn, R. J., Roades, L. A., & Calsyn, D. S. (1992). Acquiescence in needs assessment studies
of the elderly. The Gerontologist, 32, 246–252.
Campanelli, P. C., Rothgeb, J. M., & Martin, E. A. (1989). The role of respondent
comprehension and interviewer knowledge in CPS labor force classification. In: Proceedings
of the Section on Survey Research Methods (pp. 425–429). American Statistical Association.
Campbell, A. (1981). The sense of well-being in America: Recent patterns and trends. New York:
McGraw-Hill.
Campbell, D. T., & Mohr, P. J. (1950). The effect of ordinal position upon responses to items
in a checklist. Journal of Applied Psychology, 34, 62–67.
Cannell, C. F., Miller, P. V., & Oksenberg, L. (1981). Research on interviewing techniques.
Sociological Methodology, 11, 389–437.
Cannell, C. F., Oksenberg, L., Kalton, G., Bischoping, K., & Fowler, F. J. (1989). New
techniques for pretesting survey questions. Research report. Survey Research Center,
University of Michigan, Ann Arbor, MI.
Carp, F. M. (1974). Position effects in single trial free recall. Journal of Gerontology, 29,
581–587.
Carpenter, E. H., & Blackwood, L. G. (1979). The effect of question position on responses to
attitudinal questions. Rural Sociology, 44, 56–72.
Champney, H., & Marshall, H. (1939). Optimal refinement of the rating scale. Journal of
Applied Psychology, 23, 323–331.
Chan, J. C. (1991). Response-order effects in Likert-type scales. Educational and Psychological
Measurement, 51, 531–540.
Cialdini, R. B. (1993). Influence: Science and practice (3rd ed.). New York: Harper Collins.
Clancy, K. J., & Wachsler, R. A. (1971). Positional effects in shared-cost surveys. Public
Opinion Quarterly, 35, 258–265.
Converse, J. M. (1976). Predicting no opinion in the polls. Public Opinion Quarterly, 40,
515–530.
Converse, J. M., & Presser, S. (1986). Survey questions: Handcrafting the standardized
questionnaire. Beverly Hills, CA: Sage.
Converse, P. E. (1964). The nature of belief systems in mass publics. In: D. E. Apter (Ed.),
Ideology and discontent (pp. 206–261). New York: Free Press.
Coombs, C. H., & Coombs, L. C. (1976). ‘Don’t know’: Item ambiguity or respondent
uncertainty? Public Opinion Quarterly, 40, 497–514.
Cotter, P. R., Cohen, J., & Coulter, P. B. (1982). Race-of-interviewer effects in telephone
interviews. Public Opinion Quarterly, 46, 278–284.
Couch, A., & Keniston, K. (1960). Yeasayers and naysayers: Agreeing response set as a
personality variable. Journal of Abnormal and Social Psychology, 60, 151–174.
Couper, M. P., Traugott, M. W., & Lamias, M. J. (2001). Web survey design and
administration. Public Opinion Quarterly, 65, 230–253.
Cowan, C. D., Murphy, L. R., & Wiener, J. (1978). Effects of supplemental questions on
victimization estimates from the National Crime Survey. In: Proceedings of the Section on
Survey Research Methods, American Statistical Association.
Cronbach, L. J. (1950). Further evidence on response sets and test design. Educational and
Psychological Measurement, 10, 3–31.
Culpepper, I. J., Smith, W. R., & Krosnick, J. A. (1992). The impact of question order on
satisficing in surveys. Paper presented at the Midwestern Psychological Association annual
meeting, Chicago, IL.
DeMaio, T. J., & Landreth, A. (2004). Do different cognitive interview techniques produce
different results? In: S. Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler, E. Martin,
J. Martin & E. Singer (Eds), Methods for testing and evaluating survey questionnaires.
Hoboken, NJ: Wiley.
Dickinson, J. R., & Kirzner, E. (1985). Questionnaire item omission as a function of within
group question position. Journal of Business Research, 13, 71–75.
Dickinson, T. L., & Zellinger, P. M. (1980). A comparison of the behaviorally anchored rating
and mixed standard scale formats. Journal of Applied Psychology, 65, 147–154.
Draisma, S., & Dijkstra, W. (2004). Response latency and (para)linguistic expressions as
indicators of response error. In: S. Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler,
E. Martin, J. Martin & E. Singer (Eds), Methods for testing and evaluating survey
questionnaires (pp. 131–149). Hoboken, NJ: Wiley.
Droitcour, J., Caspar, R. A., Hubbard, M. L., Parsley, T. L., Visscher, W., & Ezzati, T. M.
(1991). The item count technique as a method of indirect questioning: A review of its
development and a case study application. In: P. P. Biemer, R. M. Groves, L. E. Lyberg,
N. A. Mathiowetz & S. Sudman (Eds), Measurement errors in surveys (pp. 185–210).
New York: Wiley.
Duan, N., Alegria, M., Canino, G., McGuire, T., & Takeuchi, D. (2007). Survey conditioning
in self-reported mental health service use: Randomized comparison of alternative
instrument formats. Health Services Research, 42, 890–907.
Dykema, J., Lepkowski, J. M., & Blixt, S. (1997). The effect of interviewer and respondent
behavior on data quality: Analysis of interaction coding in a validation study. In: L. Lyberg,
P. Biemer, M. Collins, E. D. De Leeuw, C. Dippo, N. Schwarz & D. Trewin (Eds), Survey
measurement and process quality (pp. 287–310). New York: Wiley.
Ebel, R. L. (1982). Proposed solutions to two problems of test construction. Journal of
Educational Measurement, 19, 267–278.
Edgell, S. E., Himmelfarb, S., & Duchan, K. L. (1982). Validity of forced response in a
randomized response model. Sociological Methods and Research, 11, 89–110.
Eifermann, R. (1961). Negation: A linguistic variable. Acta Psychologica, 18, 258–273.
Eurich, A. C. (1931). Four types of examinations compared and evaluated. Journal of
Educational Psychology, 26, 268–278.
Evans, R., Hansen, W., & Mittelmark, M. B. (1977). Increasing the validity of self-reports of
smoking behavior in children. Journal of Applied Psychology, 62, 521–523.
Faulkenberry, G. D., & Mason, R. (1978). Characteristics of nonopinion and no opinion
response groups. Public Opinion Quarterly, 42, 533–543.
Ferber, R. (1966). Item nonresponse in a consumer survey. Public Opinion Quarterly, 30, 399–415.
Finkel, S. E., Guterbock, T. M., & Borg, M. J. (1991). Race-of-interviewer effects in a
preelection poll: Virginia 1989. Public Opinion Quarterly, 55, 313–330.
Fonda, C. P. (1951). The nature and meaning of the Rorschach white space response. Journal
of Abnormal and Social Psychology, 46, 367–377.
Holbrook, A. L., & Krosnick, J. A. (2005). Do survey respondents intentionally lie and claim
that they voted when they did not? New evidence using the list and randomized response
techniques. Paper presented at the American Political Science Association Annual Meeting,
Washington, DC.
Holbrook, A. L., & Krosnick, J. A. (in press). Social desirability bias in voter turnout reports:
Tests using the item count technique. Public Opinion Quarterly.
Holmes, C. (1974). A statistical evaluation of rating scales. Journal of the Marketing Research
Society, 16, 86–108.
Isard, E. S. (1956). The relationship between item ambiguity and discriminating power in a
forced-choice scale. Journal of Applied Psychology, 40, 266–268.
Israel, G. D., & Taylor, C. L. (1990). Can response order bias evaluations? Evaluation and
Program Planning, 13, 365–371.
Jackson, D. N. (1959). Cognitive energy level, acquiescence, and authoritarianism. Journal of
Social Psychology, 49, 65–69.
Jacoby, J., & Matell, M. S. (1971). Three-point Likert scales are good enough. Journal of
Marketing Research, 7, 495–500.
Jenkins, G. D., & Taber, T. D. (1977). A Monte Carlo study of factors affecting three indices
of composite scale reliability. Journal of Applied Psychology, 62, 392–398.
Jensen, P. S., Watanabe, H. K., & Richters, J. E. (1999). Who’s up first? Testing for order
effects in structured interviews using a counterbalanced experimental design. Journal of
Abnormal Child Psychology, 27, 439–445.
Johanson, G. A., Gips, C. J., & Rich, C. E. (1993). If you can’t say something nice – A
variation on the social desirability response set. Evaluation Review, 17, 116–122.
Johnson, J. D. (1981). Effects of the order of presentation of evaluative dimensions for bipolar
scales in four societies. Journal of Social Psychology, 113, 21–27.
Johnson, W. R., Sieveking, N. A., & Clanton, E. S. (1974). Effects of alternative positioning of
open-ended questions in multiple-choice questionnaires. Journal of Applied Psychology, 59,
776–778.
Juster, F. T., & Smith, J. P. (1997). Improving the quality of economic data: Lessons from the
HRS and AHEAD. Journal of the American Statistical Association, 92, 1268–1278.
Kahn, D. F., & Hadley, J. M. (1949). Factors related to life insurance selling. Journal of
Applied Psychology, 33, 132–140.
Kalton, G., Collins, M., & Brook, L. (1978). Experiments in wording opinion questions.
Applied Statistics, 27, 149–161.
Kalton, G., Roberts, J., & Holt, D. (1980). The effects of offering a middle response option
with opinion questions. Statistician, 29, 65–78.
Katosh, J. P., & Traugott, M. W. (1981). The consequences of validated and self-reported
voting measures. Public Opinion Quarterly, 45, 519–535.
Kessler, R. C., Wittchen, H. U., Abelson, J. M., & Zhao, S. (2000). Methodological issues in
assessing psychiatric disorder with self-reports. In: A. A. Stone, J. S. Turkkan,
C. A. Bachrach, J. B. Jobe, H. S. Kurtzman & V. S. Cain (Eds), The science of self-
report: Implications for research and practice (pp. 229–255). Mahwah, NJ: Lawrence
Erlbaum Associates.
Kihlstrom, J. F., Mulvaney, S., Tobias, B. A., & Tobis, I. P. (2000). The emotional
unconscious. In: E. Eich, J. F. Kihlstrom, G. H. Bower, J. P. Forgas & P. M. Niedenthal
(Eds), Cognition and emotion (pp. 30–86). New York: Oxford University Press.
Klare, G. R. (1950). Understandability and indefinite answers to public opinion questions.
International Journal of Opinion and Attitude Research, 4, 91–96.
Klayman, J., & Ha, Y. (1984). Confirmation, disconfirmation, and information in hypothesis-
testing. Unpublished manuscript, Graduate School of Business. Chicago, IL: Center for
Decision Research.
Klockars, A. J., & Yamagishi, M. (1988). The influence of labels and positions in rating scales.
Journal of Educational Measurement, 25, 85–96.
Klopfer, F. J., & Madden, T. M. (1980). The middlemost choice on attitude items: Ambivalence,
neutrality, or uncertainty. Personality and Social Psychology Bulletin, 6, 97–101.
Knowles, E. E., & Byers, B. (1996). Reliability shifts in measurement reactivity: Driven by
content engagement or self-engagement? Journal of Personality and Social Psychology, 70,
1080–1090.
Knowles, E. S. (1988). Item context effects on personality scales: Measuring changes the
measure. Journal of Personality and Social Psychology, 55, 312–320.
Komorita, S. S. (1963). Attitude context, intensity, and the neutral point on a Likert scale.
Journal of Social Psychology, 61, 327–334.
Komorita, S. S., & Graham, W. K. (1965). Number of scale points and the reliability of scales.
Educational and Psychological Measurement, 25, 987–995.
Koriat, A., Lichtenstein, S., & Fischhoff, B. (1980). Reasons for confidence. Journal of
Experimental Psychology: Human Learning and Memory, 6, 107–118.
Kraut, A. I., Wolfson, A. D., & Rothenberg, A. (1975). Some effects of position on opinion
survey items. Journal of Applied Psychology, 60, 774–776.
Kreuter, F., McCulloch, S., & Presser, S. (2009). Filter questions in interleafed versus grouped
format: Effects on respondents and interviewers. Unpublished manuscript.
Kreuter, F., Presser, S., & Tourangeau, R. (2008). Social desirability bias in CATI, IVR, and web
surveys: The effects of mode and question sensitivity. Public Opinion Quarterly, 72, 847–865.
Krosnick, J. A. (1990). Americans’ perceptions of presidential candidates: A test of the
projection hypothesis. Journal of Social Issues, 46, 159–182.
Krosnick, J. A. (1991). Response strategies for coping with the cognitive demands of attitude
measures in surveys. Applied Cognitive Psychology, 5, 213–236.
Krosnick, J. A. (1992). The impact of cognitive sophistication and attitude importance on
response order effects and question order effects. In: N. Schwarz & S. Sudman (Eds), Order
effects in social and psychological research (pp. 203–218). New York: Springer.
Krosnick, J. A. (1999). Survey research. Annual Review of Psychology, 50, 537–567.
Krosnick, J. A., & Alwin, D. F. (1987). An evaluation of a cognitive theory of response-order
effects in survey measurement. Public Opinion Quarterly, 51, 201–219.
Krosnick, J. A., & Berent, M. K. (1993). Comparisons of party identification and policy
preferences: The impact of survey question format. American Journal of Political Science,
37, 941–964.
Krosnick, J. A., Boninger, D. S., Chuang, Y. C., Berent, M. K., & Carnot, C. G. (1993).
Attitude strength: One construct or many related constructs? Journal of Personality and
Social Psychology, 65, 1132–1151.
Krosnick, J. A., & Fabrigar, L. R. (forthcoming). The handbook of questionnaire design.
New York: Oxford University Press.
Krosnick, J. A., Holbrook, A. L., Berent, M. K., Carson, R. T., Hanemann, W. M., Kopp, R. J.,
Mitchell, R. C., Presser, S., Ruud, P. A., Smith, V. K., Moody, W. R., Green, M. C., &
Conaway, M. (2002). The impact of ‘no opinion’ response options on data quality: Non-
attitude reduction or invitation to satisfice? Public Opinion Quarterly, 66, 371–403.
Krosnick, J. A., Narayan, S., & Smith, W. R. (1996). Satisficing in surveys: Initial evidence.
New Directions for Program Evaluation, 70, 29–44.
Krosnick, J. A., & Schuman, H. (1988). Attitude intensity, importance, and certainty and
susceptibility to response effects. Journal of Personality and Social Psychology, 54, 940–952.
Krysan, M. (1998). Privacy and the expression of white racial attitudes. Public Opinion
Quarterly, 62, 506–544.
Kuncel, R. B. (1973). Response process and relative location of subject and item. Educational
and Psychological Measurement, 33, 545–563.
Kuncel, R. B. (1977). The subject-item interaction in itemmetric research. Educational and
Psychological Measurement, 37, 665–678.
Larkins, A. G., & Shaver, J. P. (1967). Matched-pair scoring technique used on a first-grade
yes-no type economics achievement test. Utah Academy of Science, Art, and Letters:
Proceedings, 44, 229–242.
Laurent, A. (1972). Effects of question length on reporting behavior in the survey interview.
Journal of the American Statistical Association, 67, 298–305.
Lee, L., Brittingham, A., Tourangeau, R., Willis, G., Ching, P., Jobe, J., & Black, S. (1999).
Are reporting errors due to encoding limitations or retrieval failure? Applied Cognitive
Psychology, 13, 43–63.
Leech, G. N. (1983). Principles of pragmatics. London: Longman.
Lehmann, D. R., & Hulbert, J. (1972). Are three-point scales always good enough? Journal of
Marketing Research, 9, 444–446.
Lenski, G. E., & Leggett, J. C. (1960). Caste, class, and deference in the research interview.
American Journal of Sociology, 65, 463–467.
Lensvelt-Mulders, G. J. L. M., Hox, J. J., van der Heijden, P. G. M., & Maas, C. (2005). Meta-
analysis of randomized response research, thirty-five years of validation. Sociological
Methods & Research, 33, 319–348.
Lessler, J. T., & Forsyth, B. H. (1996). A coding system for appraising questionnaires. In:
N. Schwarz & S. Sudman (Eds), Answering questions (pp. 259–292). San Francisco, CA:
Jossey-Bass.
Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 140,
1–55.
Lindzey, G. G., & Guest, L. (1951). To repeat – Check lists can be dangerous. Public Opinion
Quarterly, 15, 355–358.
Lissitz, R. W., & Green, S. B. (1975). Effect of the number of scale points on reliability:
A Monte Carlo approach. Journal of Applied Psychology, 60, 10–13.
Locander, W., Sudman, S., & Bradburn, N. (1976). An investigation of interview method,
threat and response distortion. Journal of the American Statistical Association, 71,
269–275.
Loftus, E. F., Klinger, M. R., Smith, K. D., & Fiedler, J. A. (1990). A tale of two questions:
Benefits of asking more than one question. Public Opinion Quarterly, 54, 330–345.
Loftus, E. F., & Marburger, W. (1983). Since the eruption of Mt. St. Helens, has anyone
beaten you up? Memory & Cognition, 11, 114–120.
Lorenz, F., Saltiel, J., & Hoyt, D. (1995). Question order and fair play: Evidence of even-
handedness in rural surveys. Rural Sociology, 60, 641–653.
Lucas, C. P., Fisher, P., Piacentini, J., Zhang, H., Jensen, P. S., Shaffer, D., Dulcan, M.,
Schwab-Stone, M., Regier, D., & Canino, G. (1999). Features of interview questions
associated with attenuation of symptom reports. Journal of Abnormal Child Psychology, 27,
429–437.
Markus, G. B. (1986). Stability and change in political attitudes: Observed, recalled, and
‘explained’. Political Behavior, 8, 21–44.
Martin, E. (1980). The effects of item contiguity and probing on measures of anomia. Social
Psychology Quarterly, 43, 116–120.
Martin, E. (2004). Vignettes and respondent debriefing for questionnaire design and
evaluation. In: S. Presser, J. M. Rothgeb, M. P. Couper, J. L. Lessler, E. Martin,
J. Martin & E. Singer (Eds), Methods for testing and evaluating survey questionnaires
(pp. 149–172). New York: Wiley.
Martin, W. S. (1973). The effects of scaling on the correlation coefficient: A test of validity.
Journal of Marketing Research, 10, 316–318.
Martin, W. S. (1978). Effects of scaling on the correlation coefficient: Additional
considerations. Journal of Marketing Research, 15, 304–308.
Masters, J. R. (1974). The relationship between number of response categories and reliability
of Likert-type questionnaires. Journal of Educational Measurement, 11, 49–53.
Matell, M. S., & Jacoby, J. (1971). Is there an optimal number of alternatives for Likert scale
items? Study I: Reliability and validity. Educational and Psychological Measurement, 31,
657–674.
Matell, M. S., & Jacoby, J. (1972). Is there an optimal number of alternatives for Likert-
scale items? Effects of testing time and scale properties. Journal of Applied Psychology, 56,
506–509.
Mathews, C. O. (1927). The effect of position of printed response words upon children’s
answers to questions in two-response types of tests. Journal of Educational Psychology, 18,
445–457.
McClendon, M. J. (1986). Response-order effects for dichotomous questions. Social Science
Quarterly, 67, 205–211.
McClendon, M. J. (1991). Acquiescence and recency response-order effects in interview
surveys. Sociological Methods and Research, 20, 60–103.
McClendon, M. J., & Alwin, D. F. (1993). No-opinion filters and attitude measurement
reliability. Sociological Methods and Research, 21, 438–464.
McKelvie, S. J. (1978). Graphic rating scales – how many categories? British Journal of
Psychology, 69, 185–202.
Means, B., Swan, G. E., Jobe, J. B., & Esposito, J. L. (1991). An alternative approach
to obtaining personal history data. In: P. P. Biemer, R. M. Groves, L. E. Lyberg,
N. A. Mathiowetz & S. Sudman (Eds), Measurement errors in surveys (pp. 127–144). New
York: Wiley.
Messick, S., & Frederiksen, N. (1958). Ability, acquiescence, and ‘authoritarianism’.
Psychological Reports, 4, 687–697.
Metzner, H., & Mann, F. (1953). Effects of grouping related questions in questionnaires.
Public Opinion Quarterly, 17, 136–141.
Miller, N., & Campbell, D. T. (1959). Recency and primacy in persuasion as a function
of the timing of speeches and measurement. Journal of Abnormal and Social Psychology,
59, 1–9.
Miller, W. E. (1982). American National Election Study, 1980: Pre and post election surveys.
Ann Arbor, MI: Inter-University Consortium for Political and Social Research.
Mingay, D. J., & Greenwell, M. T. (1989). Memory bias and response-order effects. Journal
of Official Statistics, 5, 253–263.
Mirowsky, J., & Ross, C. E. (1991). Eliminating defense and agreement bias from measures
of the sense of control: A 2 × 2 index. Social Psychology Quarterly, 54, 127–145.
Mondak, J. J. (2001). Developing valid knowledge scales. American Journal of Political
Science, 45, 224–238.
Morin, R. (1993). Ask and you might deceive: The wording of presidential approval questions
might be producing skewed results. The Washington Post National Weekly Edition,
December 6–12, p. 37.
Murray, D. M., & Perry, C. L. (1987). The measurement of substance use among adolescents:
When is the bogus pipeline method needed? Addictive Behaviors, 12, 225–233.
Narayan, S., & Krosnick, J. A. (1996). Education moderates some response effects in attitude
measurement. Public Opinion Quarterly, 60, 58–88.
Neter, J., & Waksberg, J. (1964). A study of response errors in expenditure data from
household interviews. Journal of the American Statistical Association, 59, 18–55.
Newcomb, T. E. (1943). Personality and social change. New York: Dryden Press.
Norman, D. A. (1973). Memory, knowledge, and the answering of questions. In: R. L. Solso
(Ed.), Contemporary issues in cognitive psychology, The Loyola symposium. Washington,
DC: Winston.
O’Muircheartaigh, C., Krosnick, J. A., & Helic, A. (1999). Middle alternatives, acquiescence,
and the quality of questionnaire data. Paper presented at the American Association for
Public Opinion Research annual meeting, St. Petersburg, FL.
Osgood, C. E., Suci, G. J., & Tannenbaum, P. H. (1957). The measurement of meaning.
Urbana, IL: University of Illinois Press.
Ostrom, T. M., & Gannon, K. M. (1996). Exemplar generation: Assessing how
respondents give meaning to rating scales. In: N. Schwarz & S. Sudman (Eds), Answering
questions: Methodology for determining cognitive and communicative processes in survey
research (pp. 293–441). San Francisco, CA: Jossey-Bass.
Parry, H. J., & Crossley, H. M. (1950). Validity of responses to survey questions. Public
Opinion Quarterly, 14, 61–80.
Paulhus, D. L. (1984). Two-component models of socially desirable responding. Journal of
Personality and Social Psychology, 46, 598–609.
Paulhus, D. L. (1986). Self-deception and impression management in test responses. In:
A. Angleitner & J. Wiggins (Eds), Personality assessment via questionnaires: Current issues in
theory and measurement (pp. 143–165). New York: Springer-Verlag.
Paulhus, D. L. (1991). Measurement and control of response bias. In: J. P. Robinson,
P. R. Shaver & L. S. Wrightsman (Eds), Measures of personality and social psychological
attitudes. Measures of social psychological attitudes series (Vol. 1). San Diego, CA: Academic
Press.
Pavlos, A. J. (1972). Racial attitude and stereotype change with bogus pipeline
paradigm. Proceedings of the 80th Annual Convention of the American Psychological
Association, 7, 292.
Pavlos, A. J. (1973). Acute self-esteem effects on racial attitudes measured by rating scale and
bogus pipeline. Proceedings of the 81st Annual Convention of the American Psychological
Association, 8, 165–166.
Payne, J. D. (1971). The effects of reversing the order of verbal rating scales in a postal survey.
Journal of the Marketing Research Society, 14, 30–44.
Payne, S. L. (1949/1950). Case study in question complexity. Public Opinion Quarterly, 13,
653–658.
Payne, S. L. (1950). Thoughts about meaningless questions. Public Opinion Quarterly, 14, 687–696.
Peytchev, A., Couper, M. P., McCabe, S. E., & Crawford, S. D. (2006). Web survey design:
Paging versus scrolling. Public Opinion Quarterly, 70, 596–607.
Poe, G. S., Seeman, I., McLaughlin, J., Mehl, E., & Dietz, M. (1988). Don’t know boxes in
factual questions in a mail questionnaire. Public Opinion Quarterly, 52, 212–222.
Presser, S., & Blair, J. (1994). Survey pretesting: Do different methods produce different
results? In: P. V. Marsden (Ed.), Sociological methodology (pp. 73–104). Cambridge, MA:
Blackwell.
Presser, S., Rothgeb, J. M., Couper, M. P., Lessler, J. T., Martin, E., Martin, J., & Singer, E.
(Eds). (2004). Methods for testing and evaluating survey questionnaires. New York: Wiley.
Presser, S., Traugott, M. W., & Traugott, S. (1990). Vote “over” reporting in surveys: The
records or the respondents? Presented at the International Conference on Measurement
Errors, Tucson, AZ.
Quinn, S. B., & Belson, W. A. (1969). The effects of reversing the order of presentation of verbal
rating scales in survey interviews. London: Survey Research Center.
Ramsay, J. O. (1973). The effect of number categories in rating scales on precision of
estimation of scale values. Psychometrika, 38, 513–532.
Robinson, J. P., Shaver, P. R., & Wrightsman, L. S. (1999). Measures of political attitudes. San
Diego, CA: Academic Press.
Roese, N. J., & Jamieson, D. W. (1993). Twenty years of bogus pipeline research: A critical
view and meta-analysis. Psychological Bulletin, 114, 363–375.
Rosenberg, N., Izard, C. E., & Hollander, E. P. (1955). Middle category response: Reliability
and relationship to personality and intelligence variables. Educational and Psychological
Measurement, 15, 281–290.
Rosenstone, S. J., Hansen, J. M., & Kinder, D. R. (1986). Measuring change in personal
economic well-being. Public Opinion Quarterly, 50, 176–192.
Rothenberg, B. B. (1969). Conservation of number among four- and five-year-old children:
Some methodological considerations. Child Development, 40, 383–406.
Rothgeb, J., Willis, G., & Forsyth, B. H. (2001). Questionnaire pretesting methods: Do
different techniques and different organizations produce similar results? Paper presented at
the annual meeting of the American Statistical Association.
Rubin, H. K. (1940). A constant error in the seashore test of pitch discrimination. Unpublished
master’s thesis, University of Wisconsin, Madison, WI.
Ruch, G. M., & DeGraff, M. H. (1926). Corrections for chance and ‘guess’ vs. ‘do not guess’
instructions in multiple-response tests. Journal of Educational Psychology, 17, 368–375.
Rundquist, E. A., & Sletto, R. F. (1936). Personality in the Depression. Minneapolis, MN:
University of Minnesota Press.
Saris, W. E., & Gallhofer, I. N. (2007). Design evaluation and analysis of questionnaires for
survey research. New York: Wiley.
Saris, W. E., & Krosnick, J. A. (2000). The damaging effect of acquiescence response bias on
answers to agree/disagree questions. Paper presented at the American Association for Public
Opinion Research annual meeting. Portland, OR.
Schaeffer, N. C. (1991). Hardly ever or constantly? Group comparisons and vague quantifiers.
Public Opinion Quarterly, 55, 395–423.
Schaeffer, N. C., & Bradburn, N. M. (1989). Respondent behavior in magnitude estimation.
Journal of the American Statistical Association, 84, 402–413.
Schaeffer, N. C., & Dykema, J. (2004). A multiple-method approach to improving the clarity of
closely related concepts: Distinguishing legal and physical custody of children. In: S. Presser,
J. M. Rothgeb, M. P. Couper, J. T. Lessler, E. Martin, J. Martin & E. Singer (Eds), Methods
for testing and evaluating survey questionnaires (pp. 475–502). New York: Wiley.
Scherpenzeel, A. (1995). Meta-analysis of a European comparative study. In: W. E. Saris &
A. Munnich (Eds), The multitrait-multimethod approach to evaluate measurement instruments
(pp. 225–242). Budapest, Hungary: Eotvos University Press.
Schlenker, B. R., & Weigold, M. F. (1989). Goals and the self-identification process:
Constructing desired identities. In: L. A. Pervin (Ed.), Goal concepts in personality and social
psychology (pp. 243–290). Hillsdale, NJ: Lawrence Erlbaum Associates.
Schuman, H. (1966). The random probe: A technique for evaluating the validity of closed
questions. American Sociological Review, 31, 218–222.
Schuman, H. (1972). Two sources of anti-war sentiment in America. American Journal of
Sociology, 78, 513–536.
Schuman, H. (2008). Method and meaning in polls and surveys. Cambridge, MA: Harvard
University Press.
Schuman, H., & Converse, J. M. (1971). The effect of black and white interviewers on black
responses. Public Opinion Quarterly, 35, 44–68.
Schuman, H., Kalton, G., & Ludwig, J. (1983). Context and contiguity in survey
questionnaires. Public Opinion Quarterly, 47, 112–115.
Schuman, H., & Presser, S. (1981). Questions and answers in attitude surveys: Experiments on
question form, wording and context. New York: Academic Press.
Schuman, H., & Scott, J. (1987). Problems in the use of survey questions to measure public
opinion. Science, 236, 957–959.
Schwarz, N., & Bless, H. (1992). Constructing reality and its alternatives: An inclusion/
exclusion model of assimilation and contrast effects in social judgment. In: L. L. Martin &
A. Tesser (Eds), The construction of social judgment (pp. 217–245). Hillsdale, NJ: Lawrence
Erlbaum Associates.
Schwarz, N., & Hippler, H. J. (1991). Response alternatives: The impact of their choice and
presentation order. In: P. Biemer, R. M. Groves, L. E. Lyberg, N. A. Mathiowetz &
S. Sudman (Eds), Measurement errors in surveys (pp. 41–56). New York: Wiley.
Schwarz, N., & Hippler, H. J. (1995). Subsequent questions may influence answers to
preceding questions in mail surveys. Public Opinion Quarterly, 59, 93–97.
Schwarz, N., Hippler, H. J., Deutsch, B., & Strack, F. (1985). Response scales: Effects of
category range on reported behavior and subsequent judgments. Public Opinion Quarterly,
49, 388–395.
Schwarz, N., Hippler, H. J., & Noelle-Neumann, E. (1992). A cognitive model of response-
order effects in survey measurement. In: N. Schwarz & S. Sudman (Eds), Context effects in
social and psychological research (pp. 187–201). New York: Springer-Verlag.
Schwarz, N., & Strack, F. (1985). Cognitive and affective processes in judgments of subjective
well-being: A preliminary model. In: H. Brandstatter & E. Kirchler (Eds), Economic
psychology (pp. 439–447). Linz, Austria: Trauner.
Schwarz, N., Strack, F., & Mai, H. (1991). Assimilation and contrast effects in part-whole
question sequences: A conversational logic analysis. Public Opinion Quarterly, 55, 3–23.
Schwarz, N., & Wyer, R. S. (1985). Effects of rank-ordering stimuli on magnitude ratings of
these and other stimuli. Journal of Experimental Social Psychology, 21, 30–46.
Shaffer, J. W. (1963). A new acquiescence scale for the MMPI. Journal of Clinical Psychology,
19, 412–415.
Sherif, C. W., Sherif, M., & Nebergall, R. E. (1965). Attitude and social change. Philadelphia,
PA: Saunders.
Sherif, M., & Hovland, C. I. (1961). Social judgment: Assimilation and contrast effects in
communication and attitude change. New Haven, CT: Yale University Press.
Sigall, H., & Page, R. (1971). Current stereotypes: A little fading, a little faking. Journal of
Personality and Social Psychology, 18, 247–255.
Simon, H. A. (1957). Models of man. New York: Wiley.
Visser, P. S., Krosnick, J. A., Marquette, J. F., & Curtin, M. F. (2000). Improving election
forecasting: Allocation of undecided respondents, identification of likely voters, and
response order effects. In: P. L. Lavrakas & M. Traugott (Eds), Election polls, the news
media, and democracy. New York: Chatham House.
Warner, S. L. (1965). Randomized response: A survey technique for eliminating evasive answer
bias. Journal of the American Statistical Association, 60, 63–69.
Warr, P., Barter, J., & Brownridge, G. (1983). On the interdependence of positive and negative
affect. Journal of Personality and Social Psychology, 44, 644–651.
Warwick, D. P., & Lininger, C. A. (1975). The sample survey: Theory and practice. New York:
McGraw-Hill.
Wason, P. C. (1961). Response to affirmative and negative binary statements. British Journal
of Psychology, 52, 133–142.
Watson, D. (1988). The vicissitudes of mood measurement: Effects of varying descriptors,
time frames, and response formats on measures of positive and negative affect. Journal of
Personality and Social Psychology, 55, 128–141.
Watson, D. R., & Crawford, C. C. (1930). Four types of tests. The High School Teacher, 6,
282–283.
Wedell, D. H., & Parducci, A. (1988). The category effect in social judgment: Experimental
ratings of happiness. Journal of Personality and Social Psychology, 55, 341–356.
Wedell, D. H., Parducci, A., & Geiselman, R. E. (1987). A formal analysis of ratings of
physical attractiveness: Successive contrast and simultaneous assimilation. Journal of
Experimental Social Psychology, 23, 230–249.
Wedell, D. H., Parducci, A., & Lane, M. (1990). Reducing the dependence of clinical judgment
on the immediate context: Effects of number of categories and types of anchors. Journal of
Personality and Social Psychology, 58, 319–329.
Wegener, D. T., Downing, J., Krosnick, J. A., & Petty, R. E. (1995). Measures and
manipulations of strength-related properties of attitudes: Current practice and future
directions. In: R. E. Petty & J. A. Krosnick (Eds), Attitude strength: Antecedents and
consequences (pp. 455–487). Hillsdale, NJ: Lawrence Erlbaum Associates.
Wesman, A. G. (1946). The usefulness of correctly spelled words in a spelling test. Journal of
Educational Psychology, 37, 242–246.
Willis, G. B. (2004). Cognitive interviewing revisited: A useful technique, in theory? In:
S. Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler, E. Martin, J. Martin & E. Singer
(Eds), Methods for testing and evaluating survey questionnaires (pp. 23–43). Hoboken, NJ:
Wiley.
Willis, G. B. (2005). Cognitive interviewing: A tool for improving questionnaire design.
Thousand Oaks, CA: Sage Publications.
Willis, G. B., & Lessler, J. (1999). The BRFSS-QAS: A guide for systematically evaluating
survey question wording. Rockville, MD: Research Triangle Institute.
Willits, F. K., & Saltiel, J. (1995). Question order effects on subjective measures of quality of
life: A two-state analysis. Rural Sociology, 57, 654–665.
Wiseman, F. (1972). Methodological bias in public opinion surveys. Public Opinion Quarterly,
36, 105–108.
Ying, Y. (1989). Nonresponse on the Center for Epidemiological Studies – Depression scale in
Chinese Americans. International Journal of Social Psychiatry, 35, 156–163.
Yzerbyt, V. Y., & Leyens, J. (1991). Requesting information to form an impression: The
influence of valence and confirmatory status. Journal of Experimental Social Psychology, 27,
337–356.