The Oxford Handbook of

POLLING
AND SURVEY
METHODS
Edited by
LONNA RAE ATKESON
and
R. MICHAEL ALVAREZ

Oxford University Press is a department of the University of Oxford. It furthers
the University’s objective of excellence in research, scholarship, and education
by publishing worldwide. Oxford is a registered trade mark of Oxford University
Press in the UK and certain other countries.

Published in the United States of America by Oxford University Press
198 Madison Avenue, New York, NY 10016, United States of America.

© Oxford University Press 2018

All rights reserved. No part of this publication may be reproduced, stored in
a retrieval system, or transmitted, in any form or by any means, without the
prior permission in writing of Oxford University Press, or as expressly permitted
by law, by license, or under terms agreed with the appropriate reproduction
rights organization. Inquiries concerning reproduction outside the scope of the
above should be sent to the Rights Department, Oxford University Press, at the
address above.

You must not circulate this work in any other form
and you must impose this same condition on any acquirer.

Library of Congress Cataloging-in-Publication Data
Names: Atkeson, Lonna Rae, 1965– editor. | Alvarez, R. Michael, 1964– editor.
Title: The Oxford handbook of polling and survey methods /
edited by Lonna Rae Atkeson and R. Michael Alvarez.
Description: New York : Oxford University Press, [2018]
Identifiers: LCCN 2018008316 | ISBN 9780190213299 (Hard Cover) |
ISBN 9780190213305 (updf) | ISBN 9780190903824 (epub)
Subjects: LCSH: Public opinion polls. | Social surveys.
Classification: LCC HM1236 .O945 2018 | DDC 303.3/8—dc23
LC record available at https://lccn.loc.gov/2018008316

1 3 5 7 9 8 6 4 2
Printed by Sheridan Books, Inc., United States of America
Contents

Contributors ix

Introduction to Polling and Survey Methods  1
Lonna Rae Atkeson and R. Michael Alvarez

PART I  SURVEY DESIGN
1. Total Survey Error  13
Herbert F. Weisberg
2. Longitudinal Surveys: Issues and Opportunities  28
D. Sunshine Hillygus and Steven A. Snell
3. Mixing Survey Modes and Its Implications  53
Lonna Rae Atkeson and Alex N. Adams
4. Taking the Study of Political Behavior Online  76
Stephen Ansolabehere and Brian F. Schaffner
5. Sampling for Studying Context: Traditional Surveys and New
Directions  97
James G. Gimpel
6. Questionnaire Science  113
Daniel L. Oberski

PART II  DATA COLLECTION
7. Exit Polling Today and What the Future May Hold  141
Anthony M. Salvanto
8. Sampling Hard-​to-​Locate Populations: Lessons from Sampling
Internally Displaced Persons (IDPs)  155
Prakash Adhikari and Lisa A. Bryant
9. Reaching Beyond Low-Hanging Fruit: Surveying Low-Incidence Populations  181
Justin A. Berry, Youssef Chouhoud, and Jane Junn
10. Improving the Quality of Survey Data Using CAPI Systems in
Developing Countries  207
Mitchell A. Seligson and Daniel E. Moreno Morales
11. Survey Research in the Arab World  220
Lindsay J. Benstead
12. The Language-​Opinion Connection  249
Efrén O. Pérez

PART III  ANALYSIS AND PRESENTATION
13. Issues in Polling Methodologies: Inference and Uncertainty  275
Jeff Gill and Jonathan Homola
14. Causal Inference with Complex Survey Designs: Generating
Population Estimates Using Survey Weights  299
Ines Levin and Betsy Sinclair
15. Aggregating Survey Data to Estimate Subnational Public Opinion  316
Paul Brace
16. Latent Constructs in Public Opinion  338
Christopher Warshaw
17. Measuring Group Consciousness: Actions Speak Louder
Than Words  363
Kim Proctor
18. Cross-​National Surveys and the Comparative Study of Electoral
Systems: When Country/​Elections Become Cases  388
Jeffrey A. Karp and Jack Vowles
19. Graphical Visualization of Polling Results  410
Susanna Makela, Yajuan Si, and Andrew Gelman
20. Graphical Displays for Public Opinion Research  439
Saundra K. Schneider and William G. Jacoby

PART IV  NEW FRONTIERS
21. Survey Experiments: Managing the Methodological Costs
and Benefits  483
Yanna Krupnikov and Blake Findley
22. Using Qualitative Methods in a Quantitative Survey Research
Agenda  505
Kinsey Gimbel and Jocelyn Newsome
23. Integration of Contextual Data: Opportunities and Challenges  533
Armando Razo
24. Measuring Public Opinion with Social Media Data  555
Marko Klašnja, Pablo Barberá, Nicholas Beauchamp,
Jonathan Nagler, and Joshua A. Tucker
25. Expert Surveys as a Measurement Tool: Challenges and
New Frontiers  583
Cherie D. Maestas
26. The Rise of Poll Aggregation and Election Forecasting  609
Natalie Jackson

Index 633
Contributors

Alex N. Adams is a PhD student in the Department of Political Science at the University
of New Mexico. His research interests focus on political psychology and survey
methodology.
Prakash Adhikari is an Associate Professor of Political Science at Central Michigan
University. His research and teaching interests lie at the intersection of comparative pol-
itics and international relations, with specific focus on civil war, forced migration, and
transitional justice.
R. Michael Alvarez is a Professor in the Division of Humanities and Social Sciences at
the California Institute of Technology. His primary research interests are public opinion
and voting behavior, election technology and administration, electoral politics, and sta-
tistical and computer modeling.
Stephen Ansolabehere is the Frank G. Thompson Professor of Government at Harvard
University where he studies elections, democracy, and the mass media. He is a Principal
Investigator of the Cooperative Congressional Election Study, and his principal areas
are electoral politics, representation, and public opinion.
Lonna Rae Atkeson is a Professor and Regents Lecturer in the Department of Political
Science at the University of New Mexico where she directs the Institute for Social
Research and the Center for the Study of Voting, Elections and Democracy. Her primary
interests are the areas of survey methodology, election science and administration, and
political behavior.
Pablo Barberá is an Assistant Professor of Computational Social Science in the
Methodology Department at the London School of Economics. His primary areas of re-
search include social media and politics, computational social science, and comparative
electoral behavior and political representation.
Nicholas Beauchamp is an Assistant Professor of Political Science at Northeastern
University. He specializes in U.S. politics (political behavior, campaigns, opinion, polit-
ical psychology, and social media) and political methodology (quantitative text analysis,
machine learning, Bayesian methods, agent-​based models, and networks).
Lindsay J. Benstead is an Associate Professor of Political Science in the Mark O. Hatfield
School of Government and Interim Director of the Middle East Studies Center (MESC)
at Portland State University, Contributing Scholar in the Women’s Rights in the Middle
East Program at Rice University, and Affiliated Scholar in the Program on Governance
and Local Development (GLD) at the University of Gothenburg and Yale University. Her
research interests include survey methodology and the Middle East-North Africa region.
Justin A. Berry is an Assistant Professor in the Department of Political Science at
Kalamazoo College. His research and teaching interests include American politics, po-
litical attitudes & behavior, race & ethnic politics, public opinion, immigration policy,
education policy, social movements, and methodology & research design.
Paul Brace is the Clarence L. Carter Professor of Political Science at Rice University. His
areas of interest include state and intergovernmental politics, judicial decision making,
and the presidency.
Lisa A. Bryant is an Assistant Professor at California State University, Fresno. Her
teaching and research interests include political behavior and voter behavior, campaigns
and elections, election administration, public opinion, the media, political psychology,
state politics, gender politics, and political methodology, focusing on experimental and
survey research methods.
Youssef Chouhoud is a PhD student at the University of Southern California in Political
Science & International Relations. His research interests include comparative democ-
ratization, political tolerance, Middle East politics, and Muslim minorities in the West.
Blake Findley is a PhD student in the Department of Political Science at Stony Brook
University. He does research in political psychology, political communication, and po-
litical methodology.
Andrew Gelman is the Higgins Professor of Statistics, Professor of Political Science, and
Director of the Applied Statistics Center at Columbia University. His research spans a
wide range of topics in statistics and social sciences, survey methodology, experimental
design, statistical inference, computation, and graphics.
Jeff Gill is a Distinguished Professor, Department of Government, Professor, Department
of Mathematics and Statistics, and member of the Center for Behavioral Neuroscience at
American University. His research applies Bayesian modeling and data analysis (decision
theory, testing, model selection, and elicited priors) to questions in general social science
quantitative methodology, political behavior and institutions, and medical/​health data.
Kinsey Gimbel is Director of the Customer Experience Division at Fors Marsh Group.
Her primary areas of experience are qualitative research, survey design and administra-
tion, data analysis and reporting, and program evaluation.
James G. Gimpel is a Professor of Government at the University of Maryland. His
interests lie in the areas of political behavior, political socialization, and the political ge-
ography of American politics.
D. Sunshine Hillygus is a Professor of Political Science and Director of the Initiative on
Survey Methodology at Duke University. Her research and teaching specialties include
public opinion, political behavior, survey research, campaigns and elections, and infor-
mation technology and society.
Jonathan Homola is an Assistant Professor at Rice University. He is a political method-
ologist and a comparativist. His substantive research interests include party competi-
tion, representation, political behavior, gender and politics, and immigration.
Natalie Jackson is a Survey Methodologist at JUST Capital with experience running
survey research programs in academic, media, and nonprofit settings. She was in charge
of the election forecasting models and poll aggregation at The Huffington Post during
the 2014 and 2016 election cycles. She has a PhD in political science and researches how
people form attitudes and respond to surveys, as well as how the survey process can
affect reported attitudes.
William G. Jacoby is a Professor in the Department of Political Science at Michigan
State University. His main professional interests are mass political behavior (public
opinion, political attitudes, and voting behavior) and quantitative methodology (meas-
urement theory, scaling methods, statistical graphics, and modern regression).
Jane Junn is a Professor of Political Science at the University of Southern California. She
is the author of five books on political participation and public opinion in the United
States. Her research focuses on political behavior, public opinion, racial and ethnic poli-
tics, the politics of immigration, gender and politics, and political identity.
Jeffrey A. Karp is a Professor of Political Science at Brunel University in London. He
specializes in public opinion, elections, and comparative political behavior.
Marko Klašnja is an Assistant Professor of Political Science at Georgetown University,
with the joint appointment in the Government Department and the Edmund A. Walsh
School of Foreign Service. He specializes in comparative politics, political behavior, and
political economy of democratic accountability.
Yanna Krupnikov is an Associate Professor in the Department of Political Science at
Stony Brook University. Her research and teaching focus on political psychology,
political communication, political persuasion, political behavior, and empirical
methodology.
Ines Levin is an Assistant Professor in the Department of Political Science at the
University of California, Irvine. Her research focuses on quantitative research methods
with substantive applications in the areas of elections, public opinion, and political
behavior.
Cherie D. Maestas is the Marshall A. Rauch Distinguished Professor of Political Science
in the Department of Political Science and Public Administration at the University
of North Carolina at Charlotte where she also directs the Public Policy Program. She
studies political communication, political psychology, risk attitudes, and legislative
responsiveness.
Susanna Makela is a PhD student in the Statistics Department at Columbia University.
Her areas of interest include the application of statistical and quantitative methods to
global health issues.
Daniel E. Moreno Morales is Executive Director and founding member of Ciudadanía,
Comunidad de Estudios Sociales y Acción Pública, a local research NGO in Bolivia. He
holds a PhD in Political Science from Vanderbilt University. He is an expert in public
opinion and has worked on areas such as ethnic and national identity, citizenship, dem-
ocratic values, and quality of democracy.
Jonathan Nagler is a Professor of Politics, Affiliated faculty in the Center for Data
Science, and a Co-​Director of the Social Media and Political Participation Laboratory
at New York University. His areas of interest and research include quantitative method-
ology, voting behavior, social-​media, turnout, and the impact of the economy and infor-
mation on elections.
Jocelyn Newsome is a Senior Study Director at Westat who manages a range of data
collection efforts. She specializes in the use of qualitative methods for questionnaire de-
velopment, including cognitive testing, behavior coding, and focus groups.
Daniel L. Oberski is an Associate Professor of Data Science Methodology in the
Methodology & Statistics Department at Utrecht University. His research focuses on the
problem of measurement in the social sciences.
Efrén O. Pérez is an Associate Professor of Political Science at Vanderbilt University,
and a Co-​Director of its Research on Individuals, Politics, & Society (RIPS) experi-
mental lab. His research encompasses political psychology and public opinion, with an
emphasis on racial and ethnic politics.
Kim Proctor is a Technical Director, Division of Business and Data Analysis (DBDA)
at Centers for Medicare & Medicaid Services (CMS) where she oversees the statistical
analysis of Medicaid data and operational information to design analytic studies and
inform Medicaid policy. She has a PhD in Political Science from the University of New
Mexico.
Armando Razo is an Associate Professor in the Department of Political Science at
Indiana University and a Founding Scientific Leadership Team member of the Indiana
University Network Science Institute. His research lies within political economy of de-
velopment, with a focus on the interaction of informal institutions, political-​economic
networks, and public policies across political regimes.
Anthony M. Salvanto is an Elections & Surveys Director at CBS News. His specialties
include U.S. Politics & Elections, Voting, Polling, and Public Opinion.
Brian F. Schaffner is the Newhouse Professor of Civic Studies at Tufts University. His
research focuses on public opinion, campaigns and elections, political parties, and leg-
islative politics.
Saundra K. Schneider is a Professor in the Department of Political Science at Michigan
State University and the Director of the Inter-university Consortium for Political and
Social Research Program in Quantitative Methods of Social Research at the University
of Michigan. Her main research interests are public policy and methodology, with a
focus on state-​level program spending, health care policymaking, and public attitudes
toward governmental disaster relief.
Mitchell A. Seligson is the Centennial Professor of Political Science and Professor of
Sociology at Vanderbilt University and serves as a member of the General Assembly
of the Inter-​American Institute of Human Rights. He is the founder and Senior
Advisor of the Latin American Public Opinion Project (LAPOP), which conducts the
AmericasBarometer surveys that currently cover 27 countries in the Americas.
Yajuan Si is a Research Assistant Professor in the Survey Methodology Program, located
within the Survey Research Center at the Institute for Social Research on the University
of Michigan-​Ann Arbor campus. Her research lies in cutting-​edge methodology devel-
opment in streams of Bayesian statistics, complex survey inference, missing data impu-
tation, causal inference, and data confidentiality protection.
Betsy Sinclair is an Associate Professor of Political Science at Washington University in
St. Louis. Her research interests are American politics and political methodology with
an emphasis on individual political behavior.
Steven A. Snell is a Principal Research Scientist and Survey Methodologist at Qualtrics
and a fellow at the Qualtrics Methodology Lab. He holds a PhD in Politics from
Princeton University and researches best practices in online sampling, longitudinal
survey methods, and data quality in survey research.
Joshua A. Tucker is a Professor of Politics and affiliated Professor of Russian and Slavic
Studies and Data Science at New  York University, the Director of the NYU Jordan
Center for the Advanced Study of Russia, and a Co-​Director of the NYU Social Media
and Political Participation (SMaPP) laboratory. His research interests are mass political
behavior, the intersection of social media and politics, and post-​communist politics.
Jack Vowles is a Professor of Comparative Politics at Victoria University of Wellington.
His research is primarily in comparative political behavior and New Zealand politics.
Christopher Warshaw is an Assistant Professor of Political Science at George
Washington University. His areas of research are American politics, representation,
public opinion, state and local politics, environmental politics and policy, and statistical
methodology.
Herbert F. Weisberg is an Emeritus Professor of Political Science at The Ohio State
University (PhD, Michigan 1968). He joined OSU in 1974 from the University of
Michigan where he was a (tenured) Associate Professor. An American politics scholar,
he is known for his research and teaching on American voting behavior and Congress,
as well as his work on survey research and political methodology.
The Oxford Handbook of
POLLING AND SURVEY METHODS
Introduction to Polling and Survey Methods

Lonna Rae Atkeson and R. Michael Alvarez

Introduction

In recent years political polling has been in a state of visible crisis. Recent “polling
misses” have been well-​publicized:  the Brexit election, the peace agreement refer-
endum in Colombia, and the U.S. presidential election. In the first example, the Brexit
vote in the United Kingdom was a close call that missed its mark, while in Colombia
polls regarding a referendum on a peace deal that took more than seven years to pro-
duce suggested that 66% of eligible voters supported it. However, when the votes were
counted on election day the referendum failed by a very close margin, with 50.2% of
voters rejecting it.
In the United States another important miss was the failure of polls conducted in the
competitive battleground states to predict a Donald Trump presidential win at nearly
any point in the election. A  recent report from the American Association of Public
Opinion Research (AAPOR) argued that while the national polls in 2016 were quite ac-
curate, the state-​by-​state polling in important battleground states suffered from meth-
odological issues that appear to account for much of their inaccuracy (AAPOR 2017).
Moreover, poll aggregators such as fivethirtyeight.com and the Huffington Post pro-
vided odds that Hillary Clinton would win by very safe margins. For example, the final
election odds from fivethirtyeight.com gave Clinton a 71% chance of winning the elec-
tion, the lowest percentage of any poll aggregator, and the Huffington Post gave Clinton
a 98% chance of winning the election, the highest of any poll aggregator.
These polling misses are highly consequential. Not only have they provided pundits,
media, and the public with misleading information, but by being so seemingly unreli-
able they may even make people skeptical and distrustful of polling in general. Because
of these highly visible “misses,” political polling has an image problem, as a recent
U.S. poll finding shows that only 37% of the public trusts public opinion polls a great deal
or a good amount.1
Election polling is a distinctive industry and academic enterprise because it is one of the
few areas of social science in which predictions can be validated against outcomes,
therefore providing the opportunity to assess issues related to survey error. Although
comparing predictions to outcomes provides a sense of when polls are off track, there
are many places in the survey design in which errors can be introduced, and thus being
attentive to innovation and best practices in all aspects of design is critical for a reliable
and valid survey.
Problems with polling usually stem from a variety of factors, including issues with the
sampling frame and nonresponse bias. Because of these issues, and because of the many
complex designs, which often involve multiple modes, panels, or oversamples, there
may be unequal probabilities of respondent selection, variation in response rates across
subgroups, or departures from distributions on key demographic or other variables
within the data, such as party identification, which may result in a variety of postsurvey
adjustment weighting strategies. Indeed, pollsters today do a great deal of postsurvey
adjustment weighting to create data sets that are representative of the population under
study. While there is certainly a science to weighting data, methodological differences
in how data are statistically weighted can lead to different results and different predicted
winners.
For example, in an experiment during the 2016 election the same raw data set was
given to four different pollsters for postsurvey adjustments; the result was four different
election predictions, from Trump up one point to Clinton up four points.2 Another dif-
ficult problem for pollsters in an election environment is identifying likely voters. Yet
other problems may have to do with nonresponse bias, which may lead some types of
voters to refuse to participate in the poll. Shy respondents may cause problems for a
survey if, for example, they are associated with a particular candidate or particular issue
position.
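The mechanics behind this kind of postsurvey adjustment can be made concrete with a small raking (iterative proportional fitting) sketch. This is a minimal illustration rather than any particular pollster's procedure; the sample composition, the population targets, and the variable names are all hypothetical.

```python
# Minimal raking sketch: adjust survey weights so that weighted sample margins
# match known population targets. Illustrative only; the respondent data and
# the population margins below are hypothetical.
import pandas as pd

def rake(df, targets, weight_col="weight", max_iter=50, tol=1e-6):
    """Iteratively scale weights until the weighted margins match each target."""
    df = df.copy()
    for _ in range(max_iter):
        max_shift = 0.0
        for var, target in targets.items():
            margins = df.groupby(var)[weight_col].sum() / df[weight_col].sum()
            factors = {cat: target[cat] / margins[cat] for cat in target}
            df[weight_col] = df[weight_col] * df[var].map(factors)
            max_shift = max(max_shift, max(abs(f - 1.0) for f in factors.values()))
        if max_shift < tol:  # margins already match; stop early
            break
    return df

# Hypothetical raw sample of 100 respondents that overrepresents college graduates.
sample = pd.DataFrame({
    "educ": ["college"] * 60 + ["no_college"] * 40,
    "party": ["D"] * 35 + ["R"] * 25 + ["D"] * 15 + ["R"] * 25,
    "weight": [1.0] * 100,
})

# Hypothetical population margins the weighted sample should reproduce.
targets = {
    "educ": {"college": 0.35, "no_college": 0.65},
    "party": {"D": 0.45, "R": 0.55},
}

weighted = rake(sample, targets)
print(weighted.groupby("educ")["weight"].sum() / weighted["weight"].sum())
print(weighted.groupby("party")["weight"].sum() / weighted["weight"].sum())
```

Because pollsters choose different raking variables and different targets (party identification being an especially contested choice), the same raw interviews can yield noticeably different weighted estimates, which is exactly the divergence the 2016 experiment described above illustrates.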
In developed countries, changes in survey research over the last fifteen years have
been tumultuous. The growth of the Internet, the decline in household use of landlines,
and the dramatic increase in cell phone use has made it both easier and more difficult to
conduct surveys. While the “gold standard” for survey research has traditionally been
probability based sampling, today many polls and surveys use nonprobability designs,
such as opt-​in Internet panels for online surveys. Furthermore, surveys that begin with a
random sample often have such low response rates (less than 10% is now very common)
that the quality and accuracy of inferences drawn from the resulting sample may be
problematic.
For general population studies, the increase in Internet surveys has also meant
that researchers are relying today more on respondent-​driven surveys than on the
interviewer-​driven designs that dominated the field in previous decades. The preva-
lence of Internet surveys has also led to a greater number of panel designs and to consid-
eration of unique issues that arise with panel data. Survey researchers are also relying on
many more modes and combining them more often.
In the developing world, in-​person surveys are still the norm, but technology is
allowing the development of innovative new methodologies, such as the use of com-
puter assisted personal interview (CAPI) systems or Global Positioning System (GPS)
devices, both of which may improve survey quality and reduce total survey error. But
other issues abound in surveys conducted in many developing areas, in particular
survey coverage and the representativeness of many survey samples.
In addition, there are many new opportunities in the field and many new data sets.
Table 0.1 presents a list of all the academically collected and freely accessible data sets
discussed in this Handbook. The number of readily accessible data sets is impressive and
affords researchers the chance to answer new and old questions in different contexts. But
using these data sets also presents some challenges, in particular understanding how
complex survey designs affect how researchers use them. In addition to the wide range
of survey data readily available today, there are also innovations in using surveys to in-
terview experts, social media as public opinion data, poll aggregation, the integration of
qualitative methods with survey designs, and the expanded use of survey experiments.
Technological advances in computing and statistics have also provided new and better
methods to assess opinion in subnational contexts and have created opportunities for
better methods to estimate and use latent constructs. In addition, the art of displaying
data has advanced significantly, allowing researchers to use graphics to inform their
decision-​making process during the survey and modeling process, as well as after the
fact in how the data are communicated to consumers.
These changes present new opportunities and challenges and make this Oxford
University Press Handbook on Polling and Survey Methods timely. Polls, of course, tend
to focus on a single question, and simple analysis of a substantive single question usually
relies on simple two-​variable crosstabs with demographic variables, whereas surveys
focus on the answers to many questions in which a research design is often embedded.
The goals of the Handbook are to outline current best practices and highlight the
changing nature of the field in the way social scientists conduct surveys and analyze and
present survey data. The Handbook considers four broad areas of discovery: survey de-
sign, data collection, analysis and presentation, and new frontiers. Following is a discus-
sion of the main contributions and points of interest of each chapter.

Survey Design

The first section of the Handbook focuses on general survey methodology considera­
tions. Because survey methodology is the study of the sources of error in surveys,
with the intention of limiting as many of those sources of error as possible to pro-
duce an accurate measure or true value of the social or political world, it begins with
an essay by Herbert F. Weisberg that explains the total survey error and total survey
quality approach. Survey error is the difference between what the actual survey pro-
cess produces and what should be obtained from it. Total survey error considers both
4    Lonna Rae Atkeson and R. Michael Alvarez

Table 0.1 Publicly Available National Surveys

Data Set  URL
American National Election Studies  http://www.electionstudies.org
Comparative Study of Electoral Systems  http://www.cses.org/
Pew Research Center  http://www.people-press.org/datasets/
The British Election Study  http://www.britishelectionstudy.com/
The Dutch Parliamentary Election Studies  http://www.dpes.nl/en/
The French National Election Study  http://www.cevipof.fr/fr/eef2017/fnes/
German Federal Election Studies  http://www.gesis.org/en/elections-home/germanfederal-elections/
The Swedish National Election Studies  http://valforskning.pol.gu.se/english
The American Panel Survey  http://taps.wustl.edu
Candidate Emergence Study  http://ces.iga.ucdavis.edu
UCD Congressional Election Study  http://electionstudy.ucdavis.edu/
The Varieties of Democracy Project  https://v-dem.net/en/
US Census  http://www.census.gov/ces/rdcresearch/
Cooperative Congressional Election Study  http://projects.iq.harvard.edu/cces
Latin American Public Opinion Project  http://www.vanderbilt.edu/lapop/
National Opinion Research Center  http://www3.norc.org/GSS+Website/
Arab Barometer  http://www.arabbarometer.org/
World Values Survey  http://www.worldvaluessurvey.org/wvs.jsp
Afrobarometer  http://www.afrobarometer.org/
Pew Global Research  http://www.pewglobal.org/about/
Asian Barometer  http://www.asianbarometer.org
Gallup World Poll  http://www.gallup.com/services/170945/world-poll.aspx
Comparative National Elections Project  http://u.osu.edu/cnep/
European Social Survey  http://www.europeansocialsurvey.org/
European Election Studies  http://eeshomepage.net/
Eurobarometer  http://ec.europa.eu/public_opinion/index_en.htm

Observational error, or what is usually considered measurement error, focuses on survey
questions and their relationship to the underlying attribute one is interested in
measuring. Measurement error in
this context is the difference between the true value and the measured value. Errors of
nonobservation focus on problems in estimating the mean and distribution of a variable
from a sample instead of the full population. Although the goal in a survey is always to
minimize both observational and nonobservational errors, there are constraints within
the survey environment, including costs, timing, and ethics. The total survey quality
approach extends the total survey error approach to consider additional criteria, in-
cluding providing usable and quality data to the researcher.
The next several chapters consider various survey design issues related to the method
of data collection. Survey researchers often have to ask: What is the best method to
collect the data I need for my research project? Data collection methods come in two
basic forms, interviewer-​administered surveys or self-​administered surveys, but data
collection efforts must also consider the nature of the survey and whether it is cross-​
sectional or longitudinal. Panel surveys interview the same respondent over time to
track attitudes and behavior, thus measuring individual-​level changes in attitudes
and behavior, which cross-​sectional surveys cannot easily assess. Hillygus and Snell
consider the unique challenges and opportunities related to using longitudinal or
panel designs, including the tension between continuity across panel waves and in-
novation, panel attrition, and potential measurement error related to panel condi-
tioning of respondents and seam bias. Both the Atkeson and Adams chapter and the
Ansolabehere and Schaffner chapter address issues related to survey mode. The former
chapter focuses on the advantages and disadvantages associated with using mixed
mode surveys, which have become increasingly popular. Mixed mode surveys are
those that involve mixtures of different contact and response modes. They pay par-
ticular attention to how the presence or absence of an interviewer influences survey
response, especially social desirability, and item nonresponse. Thus, they compare
mail/​Internet surveys to in-​person/​telephone surveys across a variety of dimensions
and consider best practices. Ansolabehere and Schaffner focus their attention on the
quality of surveys that use opt-​in online nonprobability survey panels, the Cooperative
Congressional Election Study (CCES), and compare that to traditional probability
samples.
Gimpel’s chapter considers the geographic distribution of respondents and how
context, characterized as a respondent’s location, influences attitudes and behavior.
Traditional sampling designs, for example, focus on strategies that allow researchers
to make inferences about the population, which often limit the geographical space in
which respondents are found. This tends to create small sample sizes that have limited
utility in helping to understand a primary interest of social scientists, how spatial context
influences opinion. Because sometimes social scientists are interested in representing
places and people, they need to consider a different sampling design; Gimpel’s chapter
identifies when and how one can sample for context.
Oberski considers another important aspect of survey design, question wording.
While many survey methodology textbooks discuss the “art” of writing questions,
Oberski takes a more systematic approach, arguing that by using experiments we can
better differentiate good or reliable survey questions from the bad and unreliable. To this
end, Saris et al. (2012) over many years built up a large question data set that estimated
the reliability and common method variance or quality of those questions, coded
characteristics of those questions that related to their quality, and predicted question
quality based on a meta-​analysis. They then created a free Web-​based application that
allows researchers to input questions and obtain an estimate of their quality. The bulk
of Oberski’s chapter focuses on explaining the Survey Quality Predictor (SQP) tool and
how researchers can use it to make question design a solid science and less of an art.

Data Collection

The Handbook’s next section begins with a discussion of postelection exit polling. Exit
polls offer the first look at who is voting, how they are voting, and why they are voting
that way; they also offer valuable insights into political behavior, especially vote choice.
These types of surveys have been part of our election landscape since 1967, and as new
modes of voting have developed, especially early and mail voting, exit polls have had to
be modified to ensure they accurately reflect voters. Salvanto’s chapter provides an over-
view of the history and value of exit polls and much needed information on how exit poll
operations are managed.
Many researchers are interested in studying the attitudes and behavior of hard-​to-​
reach populations. These individuals can be hard to reach for many different reasons.
For example, some groups of people may be hard to identify (e.g., protestors), or they
may be hard to locate, such as the LGBT community, which is a very small group whose
members live everywhere, so that finding them in the population can be difficult and
expensive. It might be hard to persuade some populations to participate, for example,
politicians or their staff or people engaging in socially undesirable or illegal activities.
The chapters by Adhikari and Bryant and by Berry, Chouhoud, and Junn both focus
on these difficult-to-locate populations. Adhikari and Bryant consider hard-to-reach
populations in international or developing contexts, while Berry et al. focus on low-​in-
cidence populations in the United States. Adhikari and Bryant build their story around a
research design in Nepal that examined citizens who either fled their homes or decided to
stay during the Maoist insurgency between 1996 and 2006. To examine important theo-
retical questions related to internally displaced people (IDP), the study first had to iden-
tify displacement patterns so that a sample of both those who decided to stay and those
who fled could be drawn. The study also had to solve problems related to difficult terrain,
lack of infrastructure, low-​education populations, and other factors to develop a strong
survey design. Berry, Chouhoud, and Junn, on the other hand, focus their chapter on
the United States and on low-​incidence populations, who make up a relatively small pro-
portion of the public that could be characterized as new immigrants, racial or ethnic
minorities, religious minorities, or small populations that are relatively dispersed, such as
gays or lesbians. They outline a strategy that uses a tailored or targeted approach to cap-
ture these hard-​to-​reach populations. They consider various attributes of these groups,
such as whether the group is geographically concentrated or dispersed or the degree
of uniformity among its members, and how these attributes help to make good design
decisions related to sampling, making contact and gaining cooperation, and analysis.
Both chapters provide best practices, useful advice, and important considerations on suc-
cessfully interviewing hard-​to-​reach populations.
Seligson and Moreno’s chapter and Benstead’s chapter focus on issues related to
the developing world. Seligson and Moreno’s chapter looks at the introduction of the
CAPI systems as a quality control measure in face-​to-​face surveys in Latin America.
They argue that CAPI systems improve the quality of the data collected in-person by
eliminating many sources of error and allowing the researcher much more control of
the field process. Benstead examines data collection issues in the Arab world, which is
an often difficult and sometimes inhospitable environment for survey researchers and
for social scientists more generally. Over the past several decades a variety of public
opinion surveys from the Middle Eastern and North African regions have been made
available to researchers (e.g., the Arab Barometer, Afrobarometer), opening up new
opportunities for research in these understudied nations. Many of these nations are
more accessible to researchers than they were previously, and Benstead also considers
unique challenges researchers face when working in this region, as well as best practices
for survey researchers.
The chapter by Pérez on the connection between language and opinion rounds out
the section on data collection. Given that there are so many public opinion surveys,
often asking the same questions in different languages across different cultures, Pérez
asks what the connection between language and opinion is and how we can isolate
its effects. In particular, Pérez highlights how cognitive psychology can assist us in
building theoretical models that help explain how and when language will influence
opinion.

Analysis and Presentation

The next set of essays begins with a chapter by Gill and Homola, who discuss a variety
of issues related to statistical inference and hypothesis testing using survey data. They
highlight several methodological concerns regarding transparency of data, uncertainty
in the process, the margin of error, and significance testing. Levin and Sinclair examine
how including or excluding survey weights affects various matching algorithms. They
find that weights are important to make accurate causal inferences from complex survey
data. Their chapter demonstrates the need to account for characteristics of the sample to
make population-​based inferences.
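As a toy illustration of the general point about weights and population inference (not of the matching estimators that Levin and Sinclair evaluate), the following sketch contrasts an unweighted and a weighted estimate from a hypothetical sample that overrepresents one subgroup; every number in it is invented for the example.

```python
# Hypothetical example: ignoring survey weights shifts a population estimate
# when the sample over- or underrepresents a subgroup.
import numpy as np

# 80 urban respondents (60% support) and 20 rural respondents (30% support),
# drawn from a population that is actually 50% urban and 50% rural.
support = np.array([1] * 48 + [0] * 32 + [1] * 6 + [0] * 14)
urban = np.array([1] * 80 + [0] * 20)

# Weights that restore the 50/50 urban-rural population split.
weights = np.where(urban == 1, 0.5 / 0.8, 0.5 / 0.2)

print(f"unweighted estimate: {support.mean():.2f}")                        # 0.54
print(f"weighted estimate:   {np.average(support, weights=weights):.2f}")  # 0.45
```

The unweighted figure leans toward the overrepresented urban respondents, while the weighted figure reflects the population mix, which is the sense in which sample characteristics must be accounted for before drawing population-based conclusions.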
The next chapter, by Brace, is interested in the study of subnational public opinion.
Accurate and reliable measurement of subnational public opinion is especially valu-
able when researchers are interested in understanding how context, or the political
and social environment, influences opinion, and how opinion influences government
outcomes. One of the many problems with looking at these types of questions is that
there is very little systematic comparative analysis across states, congressional districts,
legislative districts, counties, or cities. Surveys at the subnational level are fairly unique
and are conducted by different polling organizations at different times, using different
methodologies and question wording. Brace discusses the history of this field and the
development of various tools and methods to disaggregate national opinion polls to the
subnational level to produce reliable estimates of subnational opinion.
Usually researchers are interested in abstract concepts such as political knowledge,
ideology, and polarization. But these are often measured with single variables that
possess a large quantity of measurement error. Chris Warshaw discusses the value of
latent constructs, the various ways latent constructs have been identified, and new
methodologies that are available for testing latent constructs. Proctor’s chapter follows
with a focus on the application of item response theory to the study of group conscious-
ness. She demonstrates how latent constructs help to clarify the role group conscious-
ness plays in understanding political behavior, using a study of the LGBT community,
and how some of the measurement assumptions underlying group consciousness are
incorrect.
Next, in their chapter Karp and Vowles examine the challenges and opportunities in-
herent in comparative cross-​national survey research. Comparative cross-​sectional re-
search creates opportunities for examining the role differing institutions and cultures
play in political behavior. They use the CSES as a vehicle to evaluate cross-​cultural
equivalence in questionnaire design, survey mode, response rates, and case selection.
Presentation using data visualization is valuable for public opinion researchers and
consumers. Good visualization of survey and poll results can help researchers uncover
patterns that might be difficult to detect in topline and cross-​tabulations and can also
help researchers more effectively present their results to survey and poll consumers.
Therefore, two chapters are devoted to graphing opinion data. The first, by Makela, Si,
and Gelman, argues that graphs are valuable at all stages of the analysis, including the
exploration of raw data, weighting, building bivariate and multivariate models, and
understanding and communicating those results to others. The second chapter, by
Schneider and Jacoby, provides specific guidelines on when a graph and what type of
graph would be most useful for displaying and communicating survey data and analyt-
ical results from survey models. Both chapters provide many useful examples and excel-
lent ideas for ways to explore and report data.

New Frontiers

The last section of the Handbook explores new frontiers in survey methodology.
It begins with an essay by Krupnikov and Findley that outlines the growth in survey
experiments and their usefulness. They argue that survey experiments provide a balance
between internal and external validity that provides needed leverage on opinion for-
mation. However, this is not without some costs, especially related to the participants
chosen, and researchers need to carefully consider their goals when identifying the best
test for their theory.
Gimbel and Newsome turn their attention to the consideration of how qualita-
tive data can both improve survey methodology and help to better understand and
interpret survey results. They focus on three qualitative tools—​focus groups, in-​depth
interviewing, and cognitive interviewing—​and provide best practices for when and how
to use these tools. Qualitative research is an important part of many public opinion re-
search projects; Gimbel and Newsome provide a great deal of guidance about how to
best conduct this type of opinion research.
Razo considers the important role of context in social research. He argues that the
problem with context in social research is that it is often too vague, and that scholars
need greater guidance on collecting and analyzing contextual data. Razo’s chapter
provides insight into how scholars can better collect and use contextual data in their
analyses of individual-​level opinion and behavior. Next Klašnja et  al. discuss using
Twitter as a source of public opinion data. They identify three main concerns with using
Tweets as opinion, including how to measure it, assessing its representativeness, and
how to aggregate it. They consider potential solutions to these problems and outline
how social media data might be used to study public opinion and social behavior.
Many research questions involve the use of experts to identify processes, institutions,
and local environments or other information that only a knowledgeable informant
might have. The chapter by Maestas focuses on the use of expert surveys in providing
these bits of valuable information for researchers. It considers survey and questionnaire
design issues and aggregation procedures, with a focus on enhancing the validity and
reliability of experts’ estimates. Finally, the last chapter, by Jackson, focuses on polling
aggregation and election forecasting, which is interesting to both academics and applied
researchers. Her essay discusses the history of election forecasting and the use of poll
aggregation, the technical and statistical demands of poll aggregation and election
forecasting, and the controversies surrounding it.

Looking Back, and Looking Ahead

This Handbook has brought together a unique mixture of academics and practitioners
from various backgrounds, academic disciplines, and experiences. In one sense, this is
reflective of the interdisciplinary nature of polling and survey methodology: polls and
surveys are widely used in academia, government, and the private sector. Designing,
implementing, and analyzing high-​quality, accurate, and cost-​effective polls and surveys
require a combination of skills and methodological perspectives. Despite the well-​
publicized issues that have cropped up in recent political polling, looking back at the
significant body of research that has been conducted by the authors in this Handbook,
a great deal is known today about how to collect high-​quality polling and survey data.
Over the course of the last several decades, the survey and polling industries have
experienced rapid change. We care about quality surveys and good survey data because
as social scientists we are only as good as the data we produce. Therefore, it is critical to
consider best practices and guidelines and to help researchers assess a variety of factors so
that they can make good choices when they collect and analyze data. Equally important
is transmitting those results to others in a clear and accessible way. This Handbook goes
a long way toward providing a great deal of current information on the state of the field.
There is a bright future for further development of polling and survey methodology.
Unlike the situation a few decades ago, today there are many opportunities for innovative
research on how to improve polling and survey methodology. Ranging from new tools
to test survey design (e.g., Oberski in this Handbook, or tools found in Montgomery and
Cutler [2013]), to innovations in how interviews are conducted (Seligson and Moreno in
this Handbook), to the use of social media data to study individual opinion and behav­
ior (Klašnja et  al. in this Handbook), technology is changing the nature of survey and
polling methodology. We hope that the chapters in this Handbook help researchers and
practitioners understand these trends and participate in the development of new and better
approaches for measuring, modeling, and visualizing public opinion and social behavior.

Acknowledgments
Books, and in particular edited volumes like this one, require a great deal of help and assis-
tance. Of course we thank all of the authors of the chapters in this Handbook, especially for
their patience as we worked to produce this complicated volume. At Caltech, we thank Sabrina
De Jaegher for administrative assistance and for helping us stay organized and on track.
Brittany Ortiz from the University of New Mexico was instrumental in helping us get this proj­
ect started.
And special thanks go to the team at Oxford University Press (current and past), who helped
us to launch, organize, edit, and most important, finish this Handbook. David McBride pro-
vided important guidance, and we also thank Claire Sibley, William Richards, Tithi Jana,
Anitha Alagusundaram, Emily MacKenzie and Kathleen Weaver. Finally, Alexandra Dauler
helped us formulate the basic idea for this Handbook and got us started with this project.

Notes
1. http://www.huffingtonpost.com/entry/most-americans-dont-trust-public-opinion-polls_us_58de94ece4b0ba359594a708.
2. https://www.nytimes.com/interactive/2016/09/20/upshot/the-error-the-polling-world-rarely-talks-about.html.

References
American Association of Public Opinion Research, Ad Hoc Committee on 2016 Election
Polling. 2017. “An Evaluation of 2016 Election Polls in the U.S.” https://www.aapor.org/Education-Resources/Reports/An-Evaluation-of-2016-Election-Polls-in-the-U-S.aspx.
Montgomery, Jacob M., and Josh Cutler. 2013. “Computerized Adaptive Testing for Public
Opinion Surveys.” Political Analysis 21 (2): 172–192.
Saris, W. E., D. L. Oberski, M. Revilla, D. Z. Rojas, L. Lilleoja, I. Gallhofer, and T. Gruner. 2012.
Final Report about the Project JRA3 as Part of ESS Infrastructure (SQP 2002–2011). Technical
report. Barcelona: RECSM, Universitat Pompeu Fabra.
PART I

SURVEY DESIGN
Chapter 1

Total Survey Error

Herbert F. Weisberg

Introduction

The total survey error (TSE) approach has become a paradigm for planning and
evaluating surveys. The survey field began atheoretically in the early 1900s, when social
scientists simply began asking people questions. Gradually several separate theoretical
elements fell into place, starting with statistical sampling theory and then the social psy-
chology of attitudes (Converse 1987). By the mid-​twentieth century the literature was
recognizing the existence of different types of error in surveys, particularly Hansen,
Hurwitz, and Madow (1953) and Kish (1965). Robert Groves’s (1989) Survey Errors and
Survey Costs systemized the consideration of errors in surveys in the comprehensive
TSE framework.
Groves’s book unified the field by categorizing the types of survey errors and pitting
them against the costs involved in conducting surveys. Each of the several types of
survey error can be minimized, but that takes financial resources, which are neces-
sarily finite. The TSE approach provides a systematic way of considering the trade-​offs
involved in choosing where to expend resources to minimize survey error. Different
researchers may weigh these trade-​offs differently, deciding to spend their resources to
minimize different potential survey errors. The TSE approach was developed when tel-
ephone interviewing was in its prime. It is still useful now that Internet surveys have be-
come prevalent, though Internet surveys raise somewhat different problems regarding
certain potential error sources. Furthermore, the different trade-​offs between survey
errors and costs can vary between interviewer-​driven studies (as in face-​to-​face and tel-
ephone interviewing) and respondent-​driven studies (as in mail and Internet surveys).
Costs are not the only challenge that researchers face in conducting surveys. Time and
ethics also can impose constraints (Weisberg 2005). For example, the time constraints
raised when the news media need to gauge the immediate public reaction to a presiden-
tial speech are very different from when academic researchers have the luxury of being
able to take a month or two to survey public opinion. As to ethics, the concerns that
arise when interviewing on sensitive topics, such as people’s drug use, are very different
from those that exist when seeking to measure attitudes on public policies, such as gov-
ernment welfare programs. Depending on the survey organization, it is now common
for survey researchers to need prior approval from an institutional review board be-
fore going into the field, including approval of the research design and survey questions
(Singer 2008). Thus, there can be trade-​offs between minimizing survey error and the
cost, time, and ethics involved in a survey.
In addition to survey constraints, Weisberg (2005) further emphasized the impor-
tance of another consideration:  survey effects. These involve choices that must be
made for which there are no error-​free decisions. For example, there may be question
order effects in a survey, but there is no perfect order of questions. It may be impor-
tant for survey researchers to try to estimate the magnitude of some of these survey
effects, though they cannot be eliminated regardless of how many resources are spent
on them.
While the TSE approach has become important in academic survey research,
the total survey quality (TSQ) approach has become important in government-​
sponsored research. The quality movement developed in the management field
(Drucker 1973; Deming 1986), which recognized that customers choose the pro-
ducer that provides the best quality for the money. That led to management models
such as total quality management and continuous quality improvement. When
applied to the survey field (Biemer and Lyberg 2003; Lyberg 2012), the quality per-
spective leads to emphasis on such matters as the survey’s accuracy, credibility,
relevance, accessibility, and interpretability. For example, many survey clients ex-
pect high-​quality deliverables, including a data set with a complete codebook and
a detailed description of the survey procedures, including sampling and weighting.
Thus, survey organizations must develop procedures to maximize the quality
of their product, but within the context of the trade-​offs between survey errors
and costs.

The Total Survey Error Approach

The TSE approach focuses on a variety of possible errors in surveys. The early work on
surveys dealt with one type of error: the sampling error that occurs when one interviews
a sample of the population of interest rather than the entire population. As later work
identified other sources of errors, it became clear that sampling error was just the “tip
of the iceberg,” with several other potential sources of error also being necessary to
consider.
In preparing a survey effort, the researcher should consider the various potential
sources of error and decide how to handle each one. Typically, the researcher elects to
try to limit the amount of some types of error, such as by choosing how large a sample
to take. The researcher may opt to measure the magnitude of other types of error, such
as by giving random half samples different versions of a key question to see how much
question wording affects the results. Inevitably, the researcher ends up ignoring some
other types of error, partly because it is impossible to deal with every possible source of
error under a fixed monetary budget with time constraints.
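For the sampling error component in particular, the cost-precision trade-off is usually framed with the textbook margin-of-error formula for a proportion under simple random sampling. The sketch below shows that standard calculation; the 95% confidence level and the target figures are illustrative choices, and real designs with clustering, weighting, or nonresponse would need larger samples than these formulas suggest.

```python
# Margin of error for a proportion under simple random sampling, and the sample
# size needed to reach a target margin. Standard formulas; the 95% critical value
# and the example targets are illustrative.
import math

Z95 = 1.96  # critical value for a 95% confidence interval

def margin_of_error(n, p=0.5, z=Z95):
    """Half-width of the confidence interval for a proportion p with n respondents."""
    return z * math.sqrt(p * (1 - p) / n)

def required_n(target_moe, p=0.5, z=Z95):
    """Smallest simple random sample that achieves the target margin of error."""
    return math.ceil(z ** 2 * p * (1 - p) / target_moe ** 2)

print(f"n = 1,000 gives roughly +/-{margin_of_error(1000):.3f}")  # about +/-0.031
print(f"a +/-0.03 margin needs about n = {required_n(0.03)}")     # about 1,068
```

Because halving the margin of error requires roughly quadrupling the sample, this calculation makes the trade-off between sampling error and cost explicit, leaving the rest of the budget for the nonsampling error sources discussed below.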
Of course, survey research is not the only social science research technique that
faces potential errors. Campbell and Stanley’s (1963) distinction between “internal va-
lidity” and “external validity” in experimental research demonstrated how systemat-
ically considering the different types of error in a research approach could advance
a field.
Notice that the TSE approach deals with “potential” errors. It is not saying that these
are all serious errors in every survey project or that mistakes have been made. Instead,
it is alerting the researcher to where errors might be occurring, such as the possibility
that people who refuse to participate in a survey would have answered the questions
differently than those who responded. In some cases there will be no reason to think
that refusals would bias a study, but in other instances those who will not cooperate
might be expected to differ systematically from those who participate. If the research
topic is one likely to lead to this type of error, it might be worth trying to get as much in-
formation as possible about people who fell into the sample but were not interviewed,
so they can be compared with the actual respondents. But if nonresponse is unlikely
to bias the results, then it would be better to focus the survey budget on minimizing
other possible errors. Thus, the TSE approach makes researchers think about the likely
sources of error in their surveys before deciding what trade-​offs to make.
In considering different sources of survey error, it is important to distinguish be-
tween random and systematic error. Random errors are the mistakes that occur by
chance without any particular pattern; they increase the variance of the variable but
should cancel out in large samples. Systematic errors are more serious, since they bias
the results, such as when questions are worded to give only one side of a policy question.
Furthermore, survey errors can either be uncorrelated or correlated. Uncorrelated
errors are the isolated errors, such as when a respondent says “strongly agree” and the
interviewer accidentally happens to press the key corresponding to “agree.” Correlated
errors are more serious because they increase the variance of estimates, making it
more difficult to obtain statistical significance. Cluster sampling, coders coding many
interviews, and a large number of interviews per interviewer all lead to correlated errors.
These procedures are commonly used to cut costs, but it is necessary to recognize that
they increase the variance of estimates.
Figure 1.1 depicts the various types of error covered in descriptions of TSE. Sampling
error is shown as the tip of the iceberg, with the other possible errors potentially being
as large or larger than the sampling error. Each of these types of error is described in the
following sections. Groves et al. (2009) provide an update of Groves (1989) that includes
later research on each type of error. Weisberg (2005) and McNabb (2014) further discuss
the different sources and categories of nonsampling error.
[Figure 1.1. The Different Types of Survey Error. Source: Weisberg (2005, 19). The figure arranges the error types discussed in this chapter (sampling error, coverage error, nonresponse error at the unit and item levels, measurement error due to respondents and to interviewers, postsurvey error, mode effects, and equivalence error) under three headings: respondent selection issues, response accuracy issues, and survey administration issues.]

Response Accuracy Issues

Measurement Error Due to Respondents


Measurement error is an important response accuracy problem, particularly when the
respondent does not answer the question accurately. If respondents are not motivated
enough to provide accurate answers, the interviewer can try to increase their motiva-
tion, such as stressing the importance of accurate answers. Unclear question wording
can lead to answers that are inaccurate, making it important to pretest questions. The
most serious problem is biased question wording, which sometimes occurs when in-
terest groups write survey questions and try to word them so as to exaggerate how much
the public agrees with their positions. Some survey questions ask respondents to report
more detail than they can be expected to know, such as when asking sick people ex-
actly how many times they went to a doctor in the last year. Indeed, answering temporal
questions can be very difficult for respondents (Tourangeau, Rips, and Rasinski 2000,
100–​135; Weisberg 2005, 97–​100).
As these examples suggest, measurement error due to respondents is often attribut-
able to the questionnaire construction. This type of measurement error can be lessened
by using survey questions that are well tested and by doing pretests on the question-
naire. One pretest procedure is “think-​aloud protocols,” in which respondents are
asked to report what goes through their minds as they think about how to answer the
questions (DeMaio, Rothgeb, and Hess 1998). More generally, the cognitive aspects of
survey methodology (CASM) movement (Jabine et al. 1984) emphasizes the value of
“cognitive interviewing,” in which the cognitive processes used by respondents in answering questions are studied (Miller et al. 2014).
There are two important theoretical developments that help researchers in thinking
through how to minimize measurement error due to respondents. One is Tourangeau,
Rips, and Rasinski’s (2000, 7–​16) delineation of four stages of the response process. The
first stage is for the respondent to comprehend the question. Then the respondent must
retrieve relevant information from his or her memory. The third step is to judge the
appropriate answer. The fourth step is to select and report the answer, such as when a
respondent decides to censor his or her responses by not admitting to socially unac-
ceptable behavior. Measurement error can arise at each of these steps, so the researcher
should try to develop questions for which each stage is as simple for respondents as
possible.
The other important theoretical development is the notion of two response modes: a
high road, in which people carefully think through their answers, versus a low road,
in which people give a response just to move on to the next question. The low road is
termed “satisficing” (Krosnick and Alwin 1987) and is evidenced, for example, when a
respondent “straight-​lines” by simply saying “agree” to a long series of agree/​disagree
questions without really thinking through each question separately, or for that matter,
just saying “don’t know” on all of them.
It is important both to measure the amount of satisficing and to minimize it.
A respondent who gets through a long questionnaire in a very short time might be
satisficing. Giving very short answers to open-​ended questions is another sign of
satisficing. Computerized surveys can be programmed to keep track of how long
it takes to answer questions, to see if satisficing is occurring on particular question
sequences. Keeping questionnaires short is one means of trying to minimize
satisficing. Or, if it is necessary to ask many agree/​disagree questions together, at least
some can be reversed, so that the agree response on some questions means the same
as the disagree response on other questions, so that a person who agrees to every
question would not be scored as being at one extreme of the set of questions. There can
be mode differences on satisficing. For example, Atkeson, Adams, and Alvarez (2014)
find greater nondifferentiation on answers to questions about the perceived ide-
ology of several politicians in self-​administered questionnaires than on interviewer-​
administered questionnaires.
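A hedged sketch of the kinds of checks just described (speeding and straight-lining) is given below. It assumes a hypothetical respondent-level data set containing per-item response times and a battery of agree/disagree items; the column names and cutoffs are illustrative rather than established standards.

```python
import pandas as pd
import numpy as np

def flag_possible_satisficing(df, time_cols, grid_cols,
                              speed_quantile=0.05, sd_threshold=0.25):
    """Flag respondents whose paradata are consistent with satisficing.

    df        : one row per respondent (hypothetical example data)
    time_cols : per-item response times, in seconds
    grid_cols : a battery of agree/disagree items coded 1-5
    The quantile and standard deviation cutoffs are illustrative only.
    """
    total_time = df[time_cols].sum(axis=1)
    # "Speeders": total completion time in the fastest tail of the sample.
    speeder = total_time <= total_time.quantile(speed_quantile)
    # "Straight-liners": little or no variation across the grid items.
    straight_liner = df[grid_cols].std(axis=1) <= sd_threshold
    return pd.DataFrame({
        "total_time": total_time,
        "speeder": speeder,
        "straight_liner": straight_liner,
        "possible_satisficer": speeder | straight_liner,
    })

# Hypothetical usage:
# flags = flag_possible_satisficing(survey,
#                                   [f"t_q{i}" for i in range(1, 21)],
#                                   [f"grid_{j}" for j in range(1, 11)])
# flags["possible_satisficer"].mean()  # share of respondents flagged
```

Flags of this kind do not prove that a respondent satisficed; they simply identify cases worth examining or down-weighting in sensitivity analyses.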
Internet surveys facilitate the use of survey experiments to measure survey effects
(Mutz 2011). Random half samples can be given different wording of key questions, and
the order of response options can be varied randomly. While it is possible to do such
randomization in telephone surveys, the larger sample size that can be achieved at a
reduced cost in Internet surveys makes it feasible to include more such experiments
in a survey. The saving of interviewer salaries permits spending more of the research
budget on these experiments, though there are added costs in programming the survey,
testing the programming, and then handling the experiments appropriately at the data
analysis stage.
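To illustrate the mechanics described above, the sketch below randomly assigns respondents to one of two question wordings and compares the resulting proportions. The data, seed, and names are hypothetical, and the comparison is deliberately bare-bones.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def assign_wording(n_respondents):
    """Randomly assign each respondent to wording A or B (random half samples)."""
    return rng.choice(["A", "B"], size=n_respondents)

def compare_wordings(responses, wording):
    """Compare the proportion agreeing under each wording, with rough standard errors."""
    results = {}
    for w in ("A", "B"):
        y = responses[wording == w]
        p = y.mean()
        results[w] = (p, np.sqrt(p * (1 - p) / len(y)))
    diff = results["A"][0] - results["B"][0]
    se_diff = np.sqrt(results["A"][1] ** 2 + results["B"][1] ** 2)
    return results, diff, se_diff

# Illustration with simulated answers (1 = agree, 0 = not agree):
# wording = assign_wording(2000)
# responses = rng.binomial(1, np.where(wording == "A", 0.55, 0.48))
# compare_wordings(responses, wording)
```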
Measurement Error Due to Interviewers


Interviewers should be facilitating the interview and helping obtain accurate answers,
but they can also introduce error. That error can be random, such as when an interviewer
accidentally records a “yes” answer as a “no,” or it can be systematic, such as when an in-
terviewer always mispronounces a particular word in a question. Giving interviewers
extensive training on good interviewing techniques as well as on the current interview
schedule can minimize interviewer error (Fowler and Mangione 1990, ch. 7). Systematic
interviewer error cumulates the more interviews are taken by each interviewer, so it is
better to have more interviewers take fewer interviews each rather than having a small
number of interviewers each take very large numbers of interviews. The intraclass correlation, which measures the share of response variance associated with interviewers (Kish 1965), combines with the average number of interviews taken per interviewer to inflate the standard errors of estimates, making it more difficult to achieve statistical significance.
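A standard way to express this inflation is the interviewer design effect. Under the usual textbook approximation (a general survey-statistics result, not specific to any study discussed here),

\[
\mathrm{deff}_{\mathrm{int}} \approx 1 + \rho_{\mathrm{int}}\,(\bar{m} - 1),
\]

where \(\rho_{\mathrm{int}}\) is the intraclass correlation of responses within interviewers and \(\bar{m}\) is the average number of interviews per interviewer. Standard errors are inflated by roughly the square root of this factor; with \(\rho_{\mathrm{int}} = 0.02\) and \(\bar{m} = 50\), for example, \(\mathrm{deff}_{\mathrm{int}} \approx 1.98\), so standard errors are about 1.4 times what they would be if each interview were independent.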
There are two schools of thought as to what interviewing style best minimizes
measurement error. The traditional approach has been “standardized interviewing,”
in which interviewers are instructed to ask the identical question the same exact way
to all respondents, not interjecting their own views and not supplying any extra in-
formation to the respondents (Fowler and Mangione 1990, ch. 4). By contrast, in the
“conversational interviewing” (or “flexible interviewing”) approach, interviewers are
instructed to help respondents understand the questions (Conrad and Schober 2000).
On a question that asks people how many pieces of furniture they have bought in the
last three months, for example, the interviewer might be allowed to help the respondent
understand whether a lamp qualifies as a piece of furniture. Allowing interviewers to
clarify the meaning of questions could introduce error into the process, but it could also
help respondents answer what the questions are really trying to ask.
Interviewer error is one form of error that vanishes as a consideration in mail
questionnaires and Internet surveys. On the cost side, these types of surveys save the
large expenses associated with hiring, training, supervising, and paying interviewers.
At the same time, it is important to recognize that response accuracy may decline on
open-​ended questions without an interviewer who can encourage the respondent to
give longer and more complete answers and to think more when replying to questions.

Item-​Level Nonresponse
Response accuracy can also be impaired when there is nonresponse on individual
survey questions. Such missing data occur when people refuse to answer particular
questions, skip questions accidentally, or do not have an opinion (“don’t know”). While
it is usually impossible to eliminate all missing data, motivating the respondent to an-
swer all questions can decrease the problem (Cannell, Oksenberg, and Converse 1977).
Many survey research specialists contend that the problem of missing data is lessened
when the data collection is conducted by skilled interviewers who develop good rapport
with respondents. Some Internet surveys do not allow respondents to proceed unless
they answer all questions, but that solution increases the risk of breakoffs, because some
frustrated respondents may give up answering the survey.
Results may be biased because of missing data if the people who do not answer differ
systematically from those who do answer, such as if higher income people are more
likely to refuse to answer the income question. There is no problem if the missing data
are truly so at random; however, bias arises if the occurrence of missing data is correlated
with the variables of interest. For example, if the higher income people who do not re-
port their income tend to vote more conservatively than those who do report their in-
come, then the correlation between income and vote may be understated.
One strategy for dealing with missing data is to insert values for the missing values
using an imputation strategy. The imputation strategy that is becoming most prevalent
is performing regression of a variable with missing values on other variables in the data,
with a random error term added to the predicted value. A multiple imputation approach
involves performing five to ten imputations of this type, so that the variance of estimates
across imputations can be assessed (Rubin 1987; Little and Rubin 2002).
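The regression-plus-random-error logic described above can be sketched in a few lines. This is a deliberately simplified illustration for a single incomplete variable with complete predictors; a full multiple-imputation routine would also draw the regression coefficients from their sampling distribution and combine variances using Rubin's rules.

```python
import numpy as np

def impute_once(y, X, rng):
    """Regression imputation with a random error term for one incomplete variable.

    y : outcome vector with np.nan marking missing values
    X : matrix of fully observed predictor variables (one row per respondent)
    """
    X1 = np.column_stack([np.ones(len(y)), X])      # add an intercept
    obs = ~np.isnan(y)
    beta, *_ = np.linalg.lstsq(X1[obs], y[obs], rcond=None)
    resid_sd = np.std(y[obs] - X1[obs] @ beta)
    y_imp = y.copy()
    # Predicted value plus random noise, so imputed values reflect uncertainty.
    y_imp[~obs] = X1[~obs] @ beta + rng.normal(0, resid_sd, size=(~obs).sum())
    return y_imp

def multiple_imputation_mean(y, X, m=5, seed=0):
    """Estimate the mean of y across m imputed data sets and report the
    between-imputation variance of that estimate."""
    rng = np.random.default_rng(seed)
    estimates = [impute_once(y, X, rng).mean() for _ in range(m)]
    return np.mean(estimates), np.var(estimates, ddof=1)
```

In practice, researchers typically rely on dedicated imputation software rather than hand-rolled code, but the sketch shows why the random error term matters: without it, every imputation would be identical and the between-imputation variance would be zero.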

Respondent Selection Issues

Unit-​Level Nonresponse
Turning to respondent selection, error can occur when respondents who fall within
the sample cannot be contacted or refuse to cooperate with the interview (Groves and
Couper 1998). This nonresponse at the unit level can bias the sample if the people who
do not participate differ systematically from those who do participate. This has become
a serious potential problem over the years, as the refusal rate in surveys has increased.
Some conservative political commentators have encouraged people to not participate
in polls, which would lead to an underestimate of the Republican vote if conservatives
followed their advice. Sometimes surveys seek side information about people who
refuse to participate, so the unit nonresponse error can be estimated. It is becoming
common to have interviewers try to tailor the interview request to the respondent as a
means of minimizing refusals, such as by stressing how the survey can be of value to the
respondent (Dillman, Smyth, and Christian 2014).
When clients require high response rates, they sometimes offer to pay respondents
to participate. Monetary incentives of $1 to $5 can produce small increases of 2% to 12%
in response rates, with diminishing returns with higher incentives (Cantor, O’Hare,
and O’Connor 2008; Singer and Ye 2013). Very large incentives can, however, increase
the response rate considerably. For example, the 2012 American National Election
Studies survey initially offered selected respondents $25 for the hour-​long pre-​election
interview, which was increased by $25 increments as the election approached, until it
hit $100, with similar incentives for participation in the post-election survey. Those incentives helped these ninety-minute surveys achieve a 38% response rate for the pre-election wave and a 94% reinterview rate for the post-election wave (American National Election Studies 2013, 7, 29), figures several times higher than typical response rates.

Coverage Error
There are other respondent selection issues. When sampling from a list, there is some-
times a discrepancy between the list and the population of interest (McNabb 2014, ch.
5). Such coverage error occurs when a sample for telephone interviewing is taken from
a telephone book, thereby excluding people with unlisted numbers. The Republican
Party polling in the 2012 presidential election overstated the likely Romney vote be-
cause cell phone numbers were not always sampled, leading to coverage error because
many young people did not have landline phones. There was a substantial coverage bias
in Internet surveys when Internet access was more limited than today; that problem is
less severe now that more people have access to the Internet, though there is still no general frame from which to draw a probability sample of Internet users. Address-based sampling, which uses a sampling frame of residential addresses (Link et al. 2008), is becoming a common approach because it has better coverage than telephone-based or Internet-based frames.
Multiple sampling frames are sometimes used to ensure that the full population of in-
terest is covered, though it is then important to avoid “multiplicity” by giving less weight
to interviews of any people who had multiple chances of falling into the sample. Dual
frame telephones are becoming common to sample both landline and cellular phones,
and there is work on the optimal allocation of interviews to the multiple frames when
they have different response rates (Lohr and Brick 2014). Another frame error problem
is “ineligibles,” which occur when people are interviewed who do not fall within the
sampling frame. If the sample is intended to be geographically limited, for example, it is
worth checking to make sure the respondent lives in the designated area before starting
the interview.

Sampling Error
Sampling error arises when interviewing just a sample of the population. When proba-
bility sampling is used, the “margin of error” equivalent to a 95% confidence interval can
be calculated. For example, if a sample of sixteen hundred cases is taken from a popu-
lation of millions through simple random sampling, then 95% of the time an estimated
proportion would be within 2.5% of the true population proportion. Taking a larger
sample can reduce sampling error, though that can be costly for surveys using human
interviewers, since halving the sampling error requires quadrupling the number of
interviews.
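In formula terms, for a simple random sample and a proportion near one-half, the 95% margin of error is approximately

\[
\mathrm{MOE}_{95} \approx 1.96\sqrt{\frac{p(1-p)}{n}} = 1.96\sqrt{\frac{0.5 \times 0.5}{1600}} \approx 0.025,
\]

the 2.5 percentage points cited above. Because the sample size enters under the square root, cutting this margin roughly in half requires about 6,400 interviews.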
The number of interviews can often be increased considerably in Internet surveys with little increased cost. However, it is important to bear in mind that taking very large
numbers of interviews is not a guarantee of accurate results. A famous example is when
the now-​defunct Literary Digest magazine received more than two million responses
to its 1936 presidential election survey based on postcards that people mailed in, which
led it to forecast that Kansas governor Alf Landon would certainly defeat the reelection
attempt of Franklin Delano Roosevelt.
Sampling issues can be very technical. Because simple random sampling requires a
population listing, which is often not feasible, samples of the general population instead
usually involve stratifying and clustering. A  proportional stratified sample takes the
right proportion of cases from subcategories, such as people living in each region of the
country, thus increasing the accuracy of the sample (Lohr 2010, ch. 3). A cluster sample
reduces the costs by sampling within known clusters, such as city blocks, though sam-
pling errors increase, since cases within the same cluster are not entirely independent of
one another (Lohr 2010, ch. 4).
Internet surveys face important issues regarding respondent selection. It is diffi-
cult to conduct probability sampling on the Internet, because researchers rarely have
a listing of email addresses of the population of interest. The main instance of prob-
ability sampling in Internet surveys is when surveying an organization or company
that maintains an accurate list of email addresses of its members or employees, though
it would be important to estimate the proportion of those people who do not check
their email.
One way to do probability sampling on the Internet without a population listing is
to first take a probability sample through either telephone, mail, or in-​person contacts
and then ask people who fall in that sample to take the actual interview online. While
some surveys use that approach for respondent recruitment, online polls using opt-​
in samples are more common. Unfortunately, opt-​in polls raise the risk of selection
bias: that people who volunteer to participate are different from the population of in-
terest on the key variables in a study. Strictly speaking, “sampling errors” cannot be val-
idly computed for such nonprobability samples, though survey reports often provide
the sampling error for a random sample of the obtained size. This has led to considerable
controversy in the survey research community. Proponents of online polls argue that
the nonprobability sampling issue is no more serious than the problem that telephone
surveys face nowadays, when refusal rates are so high that the attained sample may also
not be representative of the target population. Those who are skeptical of nonprobability
sampling counter that a measure of uncertainty other than “sampling error” should be
developed for such samples.
Researchers are using a variety of weighting approaches to try to deal with this
problem, including poststratification adjustment, sample matching (Vavreck and Rivers
2008), and propensity score weights (Lee and Valliant 2009; Tourangeau, Conrad, and
Couper 2013). Some research suggests that Internet surveys using such weights now
provide samples that are as representative as ones obtained by probability sampling
(Ansolabehere and Schaffner 2014; cf. Yeager et al. 2011). At least Internet surveys
provide a savings in interviewer salaries, which can allow more of the research budget to
be spent on developing and implementing the best possible weighting scheme.
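As one concrete example of these adjustments, the sketch below implements a bare-bones raking (iterative proportional fitting) routine that adjusts weights to two known demographic margins. The margins, categories, and convergence tolerance are hypothetical; production weighting schemes are considerably more elaborate.

```python
import numpy as np
import pandas as pd

def rake_weights(df, margins, max_iter=100, tol=1e-6):
    """Rake survey weights to known population margins.

    df      : respondent-level data containing the raking variables as columns
    margins : dict mapping column name -> {category: population proportion}
    Returns one weight per respondent, normalized to a mean of 1.
    """
    w = np.ones(len(df))
    for _ in range(max_iter):
        max_change = 0.0
        for var, targets in margins.items():
            # Current weighted share of each category of this variable.
            current = {c: w[(df[var] == c).to_numpy()].sum() / w.sum() for c in targets}
            for category, target_share in targets.items():
                if current[category] == 0:
                    continue  # no respondents in this cell; it cannot be adjusted
                factor = target_share / current[category]
                w[(df[var] == category).to_numpy()] *= factor
                max_change = max(max_change, abs(factor - 1))
        if max_change < tol:
            break
    return w / w.mean()

# Hypothetical usage: adjust an opt-in sample to census-style margins.
# margins = {"age_group": {"18-34": 0.30, "35-64": 0.50, "65+": 0.20},
#            "education": {"no_degree": 0.60, "ba_or_more": 0.40}}
# sample["weight"] = rake_weights(sample, margins)
```

Raking matches the sample to the chosen margins, but it cannot correct for selection on variables that are not included in the adjustment, which is why the choice of weighting variables is itself a substantive decision.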

Survey Administration Issues

Survey Mode Effects


Errors can also occur related to how the survey is administered. The decision about the
survey mode is the most fundamental. Whether to use an interviewer has effects, partic-
ularly for asking questions on “sensitive topics.” Also, “social desirability” effects occur
when respondents are unwilling to admit some of their attitudes or behaviors to an in-
terviewer. Conversely, respondents may be more likely to make ego-​driven claims in
interviewer-​administered survey modes than in self-​administered survey modes, such
as in responses about internal political efficacy (Atkeson, Adams, and Alvarez 2014).
There can also be effects related to how the respondent obtains the question, whether by
reading or by hearing it.
Techniques have been developed to deal with some of these potential problems.
For example, questions can be phrased so as to minimize potential embarrassment in
indicating that one engages in socially undesirable behavior. Internet surveys have an
advantage in dealing with sensitive topics, in that respondents do not have to worry
whether their responses would elicit disapproval from an interviewer. Similarly,
in personal interviewing, the interviewer can let the respondent answer sensitive
questions directly on the computer. If the researcher feels that it is important to have
respondents hear questions rather than read them in computer-​administered surveys,
it is possible to program a survey for the computer to read the questions aloud to the
respondents.
There are, of course, fairly obvious cost and time differences among in-​person, tel-
ephone, mail, and Internet surveys, particularly for studies that are taken over large
geographical areas such as whole nations. In-​person interviews are very expensive (as
high as $1,000 per interview in total costs) because of the logistics involved in having
interviewers across a nation, and these surveys generally take a long period of time.
Telephone surveys are considerably less expensive, but interviewer costs still add up.
Mail surveys are relatively inexpensive, but they also take a long period of time, partic-
ularly if there are repeated attempts to get responses from people who do not answer at
first. Internet surveys are also relatively inexpensive, and they tend to take a compara-
tively short period of time.
Mixed-​mode surveys use multiple modes, such as both telephone and Web, often in
an attempt to reach people in one mode who would be nonresponsive in another mode.
Olson, Smyth, and Wood (2012) found that people who are offered their preferred mode
for taking a survey do participate at higher rates. However, the response rate for a mail
survey did not increase when people were also offered the option of instead taking it on
the Internet. While it may sound good to offer people a choice of mode in answering a
survey, that can add considerably to the costs, in that it can require providing the infra-
structure for multiple modes, such as both processing completed mail questionnaires
and programming an Internet version. Indeed, Olson, Smyth, and Wood became fairly
pessimistic about conducting such a mixed-​mode survey. They note that “programming
a Web survey when it will be offered in conjunction with a mail survey may not be cost
effective” (2012, 631), so funds might be better spent on providing incentives or addi-
tional mailings for a mail survey. The extra costs for computer-​assisted telephone and
Web multiple mode surveys may be less, since it may be possible to use the same com-
puter program code for both. However, there still would be dual logistical operations for
handling interviewers and keeping track of Internet responses. Also, there could be con-
cern about how comparable the responses given to an interviewer and those provided
on the Internet with less human interaction are.

Postsurvey Error
Error also can enter during the processing and analysis of survey data. In particular, the
coding of open-​ended survey responses into a small number of numeric categories is a
common source of error, because people’s responses can rarely be categorized neatly. As
a means of minimizing coding error, complicated coding schemes should be pretested to
gauge their reliability, and coders should be trained on the coding rules before starting
to process actual interviews. Survey organizations often have multiple coders code the
same material, or at least a sample of the responses, which allows the computation of an
intercoder-​reliability measure that shows how replicable the coding is.
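A minimal version of such a check is sketched below: it computes raw percent agreement and Cohen's kappa for two coders who coded the same set of responses. The example codes are made up.

```python
import numpy as np

def intercoder_reliability(codes_a, codes_b):
    """Percent agreement and Cohen's kappa for two coders coding the same responses."""
    a = np.asarray(codes_a)
    b = np.asarray(codes_b)
    agreement = np.mean(a == b)
    # Chance agreement: probability both coders pick the same category at random,
    # given each coder's own marginal distribution of codes.
    categories = np.union1d(a, b)
    p_chance = sum(np.mean(a == c) * np.mean(b == c) for c in categories)
    kappa = (agreement - p_chance) / (1 - p_chance)
    return agreement, kappa

# Example with made-up codes for ten open-ended answers:
# coder1 = [1, 2, 2, 3, 1, 1, 2, 3, 3, 1]
# coder2 = [1, 2, 3, 3, 1, 2, 2, 3, 3, 1]
# intercoder_reliability(coder1, coder2)  # -> (0.8, kappa ≈ 0.70)
```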
Errors associated with research staff entering the data into the computer are
eliminated when respondents type in their answers themselves, and that also eliminates
data entry costs. However, respondents who lack typing skills may be more likely to
make entry errors than trained staff would be.

Comparability Error
There can also be “house effects” related to the organization that conducts the survey.
Sometimes house effects are due to bias, such as when some survey organizations al-
ways obtain more pro-​Republican responses than others, but these effects can also
be random, such as when an organization sometimes gets more pro-​Republican and
other times more pro-​Democratic responses. More generally, “comparability effects”
(Weisberg 2005, ch. 13) and “comparison error” (Smith 2011) point to the problem of
lack of equivalence of surveys that purport to measure the same concepts in different
countries as well as the lack of equivalence of the meaning of the same survey question
as real-​world conditions change over time. For example, the terms “liberal” and “con-
servative” do not mean the exact same thing today that they did seventy-​five years ago.
These potential problems suggest the need for caution when evaluating a survey and
even more so when comparing results from different surveys. These effects are more
likely to affect the means of variables than relationships between variables, so care is es-
pecially important when comparing means between surveys taken at different points in
time by different survey organizations.

Total Survey Quality

The TSQ approach extends the TSE approach by emphasizing the need for a usable set
of findings (Biemer and Lyberg 2003). The TSQ approach accepts that TSE’s focus on
accuracy is appropriate, but it adds further criteria. For one thing, the results should be
credible, which they are not if the response rate in the survey is too low. For example,
while some research shows that survey error does not necessarily increase when survey
response rates are low (Curtin, Presser, and Singer 2000; Keeter et al. 2000), surveys
with a 1% response rate might not be considered credible in public policy debates. In
addition, the results should be relevant, which requires choosing survey questions that
truly measure the concepts of interest. The survey should be conducted in a timely
manner, as determined by when the researcher needs the data. The data should be ac-
cessible, so the researcher has full access to them. The data should be provided to the
researcher in an interpretable manner, including a codebook and full documentation
about how the sampling was conducted. Government-​sponsored surveys are often re-
quired to satisfy specific quality criteria such as these.
Furthermore, quality should be achieved at three different levels (Lyberg 2012).
First, the survey results given to the researcher should satisfy the type of quality
standards described above (“product quality”). Second, quality control can be used
to be sure that the process by which the survey is conducted is of high quality, as in
high standards for hiring and supervising interviewers (“process quality”). Third,
the management of the survey organization should be of high quality, such as having
strong leadership, good customer relations, and high staff satisfaction (“organization
quality”).
It is important to recognize that achieving high quality also has costs and takes time.
For example, full documentation of a survey takes staff resources away from conducting
the next survey. Fortunately, many of the steps needed to create a high-​quality survey
organization can benefit multiple surveys, so the costs involved in achieving and
maintaining quality can often be amortized across many studies. Furthermore, devel-
oping a reputation for high quality standards benefits a survey organization in terms
of helping it attract more survey business. Still, devoting resources to quality involves
trade-​offs, just as minimizing survey error does, so TSQ should be considered along
with TSE.
Conclusion

The survey field has moved from its early focus on sampling error to a realization of the
importance of considering the broader range of errors that can affect surveys. The TSE
approach provides a comprehensive framework for thinking about those errors and bal-
ancing them against the constraints of costs, time, and ethics. The TSQ perspective fur-
ther emphasizes the need for satisfying high quality standards.
Survey modes have also evolved considerably over the years. In the 1930s it became
possible to conduct an in-​person survey across a large country. However, in-​person
interviewing and the use of human interviewers were both very expensive, leading to
development of new survey modes that made those costs unnecessary. In the 1980s
telephones became so universal that national telephone surveys became prevalent. As
access to the Internet has increased, Internet surveys have now become widespread.
Each of these changes has had implications for survey error, survey costs, and survey
quality. Given this history, one should expect survey administration to continue to be af-
fected by new technological advances in the future. As that happens, it will be important
to take into account possible changes in the trade-​offs between survey costs and survey
errors.

References
American National Election Studies. 2013. User’s Guide and Codebook for the ANES 2012 Time
Series Study. Ann Arbor, MI: University of Michigan and Stanford, CA: Stanford University.
Ansolabehere, S., and B. F. Schaffner. 2014. “Does Survey Mode Still Matter?” Political Analysis
22 (3): 285–​303.
Atkeson, L. R., A. N. Adams, and R. M. Alvarez. 2014. “Nonresponse and Mode Effects in Self-​
and Interviewer-​Administered Surveys.” Political Analysis 22 (3): 304–​320.
Biemer, P. P., and L. E. Lyberg. 2003. Introduction to Survey Quality. New York: John Wiley
& Sons.
Campbell, D. T., and J. Stanley. 1963. Experimental and Quasi-​Experimental Designs for
Research. Chicago: Rand-​McNally.
Cannell, C. F., L. Oksenberg, and J. M. Converse. 1977. Experiments in Interviewing Techniques.
Hyattsville, MD: National Center for Health Services Research.
Cantor, D., B. O’Hare, and K. O’Connor. 2008. “The Use of Monetary Incentives to Reduce
Non-​Response in Random Dial Telephone Surveys.” In Advances in Telephone Survey
Methodology, edited by J. M. Lepkowski, C. Tucker, J. M. Brick, E. D. de Leeuw, L. Japec, P. J.
Lavrakas, M. W. Link, and R. L. Sangster, 471–​498. New York: John Wiley and Sons.
Conrad, F., and M. Schober. 2000. “Clarifying Question Meaning in a Household Telephone
Survey.” Public Opinion Quarterly 64: 1–​28.
Converse, J. M. 1987. Survey Research in the United States: Roots and Emergence, 1890–​1960.
Berkeley: University of California Press.
Curtin, R., S. Presser, and E. Singer. 2000. “The Effects of Response Rate Changes on the Index
of Consumer Sentiment.” Public Opinion Quarterly 64: 413–​428.
DeMaio, T. J., J. Rothgeb, and J. Hess. 1998. “Improving Survey Quality through Pretesting.”
Proceedings of the Survey Research Method Section, American Statistical Association, 3: 50–​58.
Deming, W. E. 1986. Out of the Crisis. Cambridge, MA: MIT Press.
Dillman, D. A., J. D. Smyth, and L. M. Christian. 2014. Internet, Phone, Mail, and Mixed-​Mode
Surveys: The Tailored Design Method. 4th ed. New York: Wiley.
Drucker, P. 1973. Management. New York: Harper & Row.
Fowler, F. J., Jr., and T. W. Mangione. 1990. Standardized Survey Interviewing:  Minimizing
Interviewer-​Related Error. Newbury Park, CA: Sage.
Groves, R. M. 1989. Survey Errors and Survey Costs. New York: Wiley.
Groves, R. M., and M. P. Couper. 1998. Nonresponse in Household Interview Surveys.
New York: Wiley.
Groves, R. M., F. J. Fowler, M. P. Couper, J. M. Lepkowski, E. Singer, and R. Tourangeau. 2009.
Survey Methodology. 2nd ed. New York: Wiley.
Hansen, M. H., W. N. Hurwitz, and W. G. Madow. 1953. Sample Survey Methods and Theory.
New York: Wiley.
Jabine, T. B., M. L. Straf, J. M. Tanur, and R. Tourangeau, eds. 1984. Cognitive Aspects of Survey
Methodology. Washington, DC: National Academy Press.
Keeter, S., C. Miller, A. Kohut, R. M. Groves, and S. Presser. 2000. “Consequences of Reducing
Non-​Response in a National Telephone Survey.” Public Opinion Quarterly 64: 125–​148.
Kish, L. 1965. Survey Sampling. New York: Wiley.
Krosnick, J. A., and D. F. Alwin. 1987. “An Evaluation of a Cognitive Theory of Response-​Order
Effects in Survey Measurement.” Public Opinion Quarterly 51 (2): 201–​219.
Lee, S., and R. Valliant. 2009. “Estimation for Volunteer Panel Web Surveys Using Propensity
Score Adjustment and Calibration Adjustment.” Sociological Methods and Research
37: 319–​343.
Link, M. W., M. P. Battaglia, M. R. Frankel, L. Osborn, and A. H. Mokdad. 2008. “A Comparison
of Address-​Based Sampling (ABS) Versus Random-​Digit Dialing (RDD) for General
Population Surveys.” Public Opinion Quarterly 72 (1): 6–​27.
Little, R., and D. Rubin. 2002. Statistical Analysis with Missing Data. 2nd ed. New York: Wiley.
Lohr, S. L. 2010. Sampling: Design and Analysis. 2nd ed. Boston: Cengage Learning.
Lohr, S. L., and J. M. Brick. 2014. “Allocation for Dual Frame Telephone Surveys with
Nonresponse.” Journal of Survey Statistics and Methodology 2 (4): 388–​409.
Lyberg, L. E. 2012. “Survey Quality.” Survey Methodology 38 (2): 107–​130.
McNabb, D. E. 2014. Nonsampling Error in Social Surveys. Los Angeles: Sage.
Miller, K., S. Willson, V. Chepp, and J. L. Padilla, eds. 2014. Cognitive Interview Methodology.
New York: Wiley.
Mutz, D. C. 2011. Population-Based Survey Experiments. Princeton, NJ: Princeton University Press.
Olson, K., J. D. Smyth, and H. M. Wood. 2012. “Does Giving People Their Preferred Survey
Mode Actually Increase Survey Participation Rates?” Public Opinion Quarterly 76
(4): 611–​635.
Rubin, D. B. 1987. Multiple Imputation for Nonresponse in Surveys. New York: Wiley.
Singer, E. 2008. “Ethical Issues in Surveys.” In International Handbook of Survey Methodology,
edited by E. D. de Leeuw, J. J. Hox, and D. A. Dillman, 78–96. New York: Lawrence Erlbaum
Associates.
Singer, E. and C. Ye. 2013. “The Use and Effects of Incentives in Surveys.” Annals of the American
Academy of Political and Social Science 645 (January): 112–​141.
Smith, T. W. 2011. “Refining the Total Survey Error Perspective.” International Journal of Public
Opinion Research 28 (4): 464–​484.
Tourangeau, R., F. G. Conrad, and M. P. Couper. 2013. The Science of Web Surveys.
New York: Oxford University Press.
Tourangeau, R., L. J. Rips, and K. Rasinski. 2000. The Psychology of Survey Response.
Cambridge, UK: Cambridge University Press.
Vavreck, L., and D. Rivers. 2008. “The 2006 Cooperative Congressional Election Study.”
Journal of Elections, Public Opinion & Parties 18 (4): 355–​366.
Weisberg, H. F. 2005. The Total Survey Error Approach. Chicago: University of Chicago Press.
Yeager, D. S., J. A. Krosnick, L. C. Chang, H. S. Javitz, M. S. Levendusky, A. Simpser, and R.
Wang. 2011. “Comparing the Accuracy of RDD Telephone Surveys and Internet Surveys
Conducted with Probability and Non-​Probability Samples.” Public Opinion Quarterly 75
(4): 709–​747.
Chapter 2

Longitudinal Surveys


Issues and Opportunities

D. Sunshine Hillygus and Steven A. Snell

Introduction

Longitudinal or panel surveys, in which the same respondents are interviewed re-
peatedly at different points in time, are increasingly common across the academic,
private, and public sectors. The major infrastructure surveys in political science, soci-
ology, and economics—​the American National Election Study (ANES), the General
Social Survey (GSS), and the Panel Study on Income Dynamics (PSID)—​now all con-
tain panel components. The unique benefits of panel surveys are widely recognized: by
interviewing the same subjects over time, panel surveys offer greater causal leverage
than a cross-​sectional survey and enable the analysis of individual-​level changes in
attitudes, behavior, or knowledge.
Advances in survey research technology, especially the proliferation of Internet-​
based surveying, have lowered the barriers to entry for longitudinal research.
The emergence of online panels like GfK Knowledge Networks, YouGov, and the
RAND American Life Panel makes it easier than ever for researchers to conduct
repeated interviews (Hillygus, Jackson, and Young 2014; Baker et al. 2010; Yeager
et al. 2011). Furthermore, in the past several years researchers have pooled their
efforts and budgets in various collaborative panel studies, such as the Cooperative
Congressional Election Study (CCES) and The American Panel Study (TAPS). The
2008 Associated Press-​Yahoo!News Election Panel and the CBS/​New York Times/​
YouGov 2014 Election Panel are two such projects that have involved collaborations
between public opinion scholars and media organizations.
Despite their analytic strengths and increasing availability for research, panel surveys
are not without their drawbacks. They share all the problems of other surveys—​quality
threats from sampling and nonsampling errors—​while also facing several unique
challenges in their design, implementation, and analysis. In this chapter we consider
three such challenges:  (1) a tension between continuity and innovation in the ques-
tionnaire design; (2) panel attrition, whereby some individuals who complete the first
wave of the survey fail to participate in subsequent waves; and (3) types of measure-
ment error—​panel conditioning and seam bias—​specific to panel surveys. We provide
an overview of these various issues and their implications for data quality and also out-
line current approaches to diagnose and correct for these issues in the survey design and
analysis.
First we define the longitudinal survey and distinguish it from related designs. We
then discuss the advantages and disadvantages of longitudinal surveys, drawing
attention to their unique challenges. Finally, we review best practices for avoiding the
most common pitfalls and highlight avenues of future research that can improve the de-
sign and analysis of longitudinal polling.

Background

Although longitudinal surveys have a seemingly straightforward definition—​they are


survey projects in which respondents are interviewed at two or more points in time—​
it is useful to distinguish them from related designs, especially because of overlaps in
usage of the relevant terminology.
The longitudinal survey belongs to the larger class of longitudinal methods because
it is designed to elicit data from the same respondents at multiple time points (Menard
2002); nevertheless, there are alternative approaches for measuring individual-​level
change over time that do not qualify as panel surveys. Nonsurvey examples of longi-
tudinal research abound in the social sciences, including a wide range of time series
data, such as those using country-​level factors (e.g., Beck, Katz, and Tucker 1998; Blais
and Dobrzynska 1998), state-​or county-​level measures (e.g., Bishop 2013), or repeated
observations of individuals from voter registration files or other nonsurvey sources
(e.g., Davenport et al. 2010).
While not all longitudinal research is survey research, it is also the case that not all
surveys designed to measure temporal change can be considered longitudinal surveys.
In a repeated cross-sectional design, each subject is interviewed only once, but the same questions are asked of independently drawn samples at different points in time (Visser et al. 2014; Menard 2002). An example of this design is the typical
tracking poll during a political campaign, designed to measure the ebbs and flows of
candidate support. If sampling procedures and question wording are sufficiently sim-
ilar, repeated cross-​sectional surveys are an effective tool for detecting societal shifts
in opinion. Repeated cross-​sectional surveys can even be superior to panel surveys for
some research questions. For example, the former might do a better job of capturing
new entrants to a population, potentially providing a more accurate reflection of the
population’s attitudes or behaviors in cases in which new entrants are especially different
(Tourangeau 2003). Nevertheless, causal inference is generally weaker in a repeated
cross-​section than in a panel survey because the researcher can only compare groups of
respondents rather than individuals (Visser et al. 2014; Tourangeau 2003; Bartels 1999).
Another method for measuring change is a retrospective survey design, in which
respondents are asked during a single interview to recall attitudes or behaviors at several
previous time periods (Menard 2002). This measurement strategy is distinct from the
longitudinal survey because it relies on respondents’ retrospection rather than repeated
interviews. While this approach allows researchers to measure within-​subject change
over time, an obvious deficiency is that it relies on memory recall, which introduces
potential bias given the difficulty that some survey respondents have remembering
even the most basic facts or behaviors (Bradburn, Rips, and Shevell 1987; Groves 2004;
Tourangeau, Rips, and Rasinski 2000).
A final point of distinction exists between panel surveys and so-​called online
survey panels, like GfK Knowledge Networks and YouGov. Because of the difficulty
of constructing a general population sample frame of email addresses, online survey
panels have emerged as the primary way in which Internet polling is conducted (Groves
et al. 2009). An online survey panel is simply a group of prescreened respondents who
have expressed a willingness to participate in surveys, usually in exchange for money or
other compensation (Baker et al. 2010).1 The surveys in which these panelists take part
might be cross-​sectional or longitudinal. Despite the common use of the term “panel” to
refer to this particular survey mode and sample source, this chapter focuses on the lon-
gitudinal survey design—​in which the same respondents are interviewed multiple times
for a given study. Such survey designs can be conducted online, by telephone, by mail, or
in person.
That said, the Internet age has certainly expanded opportunities for longitudinal
survey designs. The emergence of online survey panels facilitates the growing interest
in longitudinal survey research by reducing the costs of subject recruitment and pro-
viding a pool of willing subjects who can be easier to locate for follow-​up interviews.
The willingness of online panelists to engage in additional surveys helps reduce a key
cost of longitudinal research. On the other hand, the repeated interviewing of the same
subjects might exacerbate the shortcomings of nonprobability online panels in par-
ticular. Researchers are increasingly concerned, for example, about the conditioning
effects of repeated interviewing in both panel survey designs and online survey panels
more generally (see Hillygus, Jackson, and Young 2014; Adams, Atkeson, and Karp 2012;
Callegaro et al. 2014).
In addition to distinguishing what is and is not a panel survey, it is also worth
highlighting the wide variability in the possible design features of panel surveys. Panel
surveys can include dozens of waves or just an initial interview and a single follow-​up.
The ANES, for instance, typically includes one pre-​election interview and one post-​
election interview—​a two-​wave panel. Panel surveys can also vary in the duration of the
study and the length of time between survey interviews. The four-​wave Youth-​Parent
Socialization Panel study spanned more than three decades, from 1965 to 1997, but most
election panels span only a matter of months. Panel surveys also vary in their sampling
strategy. A fixed panel design asks all respondents to participate at the same time, while a
rotating panel divides the sample into different cohorts, with initial interviews staggered
across survey waves. As discussed in the next section, the latter design offers useful lev-
erage for assessing panel attrition and conditioning effects. Finally, designs differ in
how they define their inferential population: some define it only at the first wave, while others update it at each wave. In other words, an individual who died between waves 1 and
2 would be counted as ineligible in the former and as a nonrespondent in the latter.
Eligibility for follow-​up interviews can also vary—​with some panels attempting
to follow up with all respondents who complete the initial interview, while others se-
lect a narrower subset of respondents for subsequent interviews.2 As with all research
methodologies, the goals of the study—​balanced against time and cost considerations—​
should guide these specific design decisions. For a more detailed overview of these and
other design issues in panel surveys, see Menard (2007), Duncan and Kalton (1987),
Kalton and Citro (1993), and Kasprzyk et al. (1989).

Advantages of Longitudinal Surveys

The growth of longitudinal surveys in the last several years reflects the significant
benefits of repeated interviews with the same subjects. First, longitudinal surveys are
critical for understanding the dynamics of public opinion. While cross-​sectional
surveys are well-​suited to track societal trends in opinion over time, they cannot iden-
tify within-​subject change (Tourangeau 2003; Visser et al. 2014). As such, it is difficult to
determine if changes in public opinion, such as Americans’ dramatic shift in attitudes
about same-​sex marriage, are a function of sampling and cohort replacement or a reflec-
tion of real changes in individual attitudes (e.g., Baunach 2011; Brewer 2008). Without
conducting repeated surveys with the same subjects, we cannot evaluate who changed
their minds or why.
This ability to evaluate within-​subject change is what makes panel surveys a critical
tool in the study of campaigns and elections. The seminal Columbia research on voting
behavior was based on panel studies, such as the seven-​wave sample of twenty-​four
hundred voters in Erie County, Ohio, during the 1940 election (Lazarsfeld, Berelson,
and Gaudet 1948; Berelson, Lazarsfeld, and McPhee 1954). A  longitudinal design
enabled researchers to observe which voters changed their candidate preferences
during the campaign. Although voting research came to rely increasingly on national
cross-​sectional surveys for much of the twentieth century, the last decade or so has
seen a renewed interest in panel surveys as a tool for examining the decision calculus of
voters at various points in the campaign (e.g., Henderson, Hillygus, and Tompson 2010;
Iyengar, Sood, and Lelkes 2012). The strength of the panel design is that by interviewing
the same respondents multiple times in the course of the campaign, the researchers
have a much stronger sense of the evolution of individual-​level voter decision making.
Consider, for instance, that cross-​sectional polls typically find that roughly 5% of the
electorate is undecided between the candidates at any given point in the campaign;
longitudinal surveys show that it is not always the same 5% of the electorate in every
snapshot, offering a very different portrait of the campaign (Henderson and Hillygus
2016).
A second, and related, advantage of the longitudinal design is that measuring
within-​subject change offers greater leverage in estimating causal effects. This
design is especially convincing if the pre-​and post-​intervention surveys closely
precede and follow, or bracket, an intervention.3 Such an intervention might be
naturally occurring or a manipulation of the researcher. For example, Hillygus and
Jackman (2003) compare interviews before and after presidential conventions and
debates to estimate the effect of these major campaign events on candidate pref-
erence. With experimental interventions, panel surveys provide the pre-​treatment
baseline by which the post-​treatment effects are later evaluated. Examples of such
analyses include surveys gauging political knowledge and attitudes before and
after respondents are randomly assigned to receive a free newspaper subscription
(Gerber, Karlan, and Bergan 2009) and a panel survey designed to detect the effects
of a large-​scale campaign against vote buying on voter turnout and vote choice
(Vicente 2014).
Even without an intervention, the within-​subjects nature of the panel design
provides the temporal ordering of measures that is necessary (though not sufficient)
to establish causality (Bartels 2006). For example, researchers have used panel data to
explore the complex relationship between party identification and policy preferences
(Carsey and Layman 2006) and between media messages and issue attitudes (Lenz
2009). While this approach has a somewhat weaker claim to causality, the temporal
ordering of the measurement makes it far superior to traditional observational
studies.
A third advantage of the longitudinal survey design is the opportunity it provides
researchers to assess the reliability of the concepts being measured, a critical com-
ponent of measurement error. Reliability refers to the degree to which consecutive
measurements of a given concept yield the same result, provided that the meaning of
the concept has not changed across time. Some phenomena can easily be measured
reliably—​gender and level of education, for example—​while most concepts of interest
to social scientists are subject to measurement error. In classical test theory, test-​retest
stability is a standard approach for evaluating reliability and necessitates a longitudinal
design (Carmines and Zeller 1979; Bartels 2006). For example, Achen (1975) reassesses
the seminal analysis of early ANES panels (Converse 1964) and finds that much of the
instability across time in voter preferences is attributable to the poor reliability of survey
measures. Longitudinal surveys also enable measurement error adjustments. For ex-
ample, in panels with an item measured three or more times, the researcher can employ
the difference in responses from one set of waves to assess the reliability of the item and
to then correct appropriately for measurement bias when comparing responses to the
same question across another set of waves (e.g., Bartels 1999). This calibration exercise
allows researchers to control for and better distinguish measurement noise from real at-
titude change.
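To illustrate the test-retest logic, the sketch below computes wave-to-wave correlations for a repeated item and, for a three-wave panel, a classic ratio estimate of reliability that assumes constant reliability and lag-1 (simplex) attitude change. The variable names and simulation are hypothetical, and this formula is only one of several possible corrections.

```python
import numpy as np

def test_retest(wave1, wave2):
    """Test-retest correlation for the same item asked of the same respondents twice."""
    return np.corrcoef(wave1, wave2)[0, 1]

def three_wave_reliability(x1, x2, x3):
    """Ratio estimate of item reliability from a three-wave panel.

    Assumes constant reliability across waves and a first-order (simplex)
    change process, under which r13 = r12 * r23 / reliability.
    """
    r12 = np.corrcoef(x1, x2)[0, 1]
    r23 = np.corrcoef(x2, x3)[0, 1]
    r13 = np.corrcoef(x1, x3)[0, 1]
    return r12 * r23 / r13

# Simulated check: the true attitude is stable, the observed responses are noisy,
# so the pairwise correlations understate stability but the ratio recovers reliability.
# rng = np.random.default_rng(1)
# true = rng.normal(size=1000)
# x1, x2, x3 = (true + rng.normal(scale=0.7, size=1000) for _ in range(3))
# test_retest(x1, x2), three_wave_reliability(x1, x2, x3)
```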
Challenges in Longitudinal Surveys

As the previous discussion makes clear, panel studies offer a number of compelling
advantages for studying social, political, and economic phenomena. They do, however,
come with some downsides. First, longitudinal data have a complex structure that can com-
plicate analysis. By virtue of having multiple interviews with the same respondents, the
data have a hierarchical structure that should be accounted for in the statistical modeling
(Gelman and Hill 2007). There is a wide variety of modeling approaches for handling panel
data: change point models, duration models, transition models, fixed effect models, hierar-
chical models, and so forth. Unfortunately, the substantive conclusions can differ depending
on the modeling approach used, and it is not always clear which approach is best suited to
the research question. Broadly, analysts can model either the level of (y) or the change in y
(Δy) as a function of either the level of or change in the levels of the predictor variables, where
the number of possible combinations depends on the number of survey waves used in the
analysis. Given that the particular research question and data structure will determine the
most appropriate modeling strategy, we refer readers to dedicated texts such as Singer and
Willett (2003), Finkel (1995), and Hsiao (2003). Another complexity in analyzing longitu-
dinal surveys is that it is not always obvious which weight to use given that multiple weights
are often provided.4 Again, the decision depends on the research question and the particular
combination of waves used, but generally analysts will want to use the longitudinal weight
associated with the wave in which their dependent variable is measured.
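As a concrete example of one of these modeling choices, the sketch below estimates a two-wave first-difference (change-on-change) regression, which removes stable respondent characteristics by differencing. It is intended only to illustrate the "change in y on change in x" option under hypothetical variable names, not to recommend that specification for any particular question.

```python
import numpy as np

def first_difference_ols(y1, y2, x1, x2):
    """Regress the change in y between two waves on the change in x.

    Differencing removes any time-invariant respondent characteristics that
    affect the level of y, one common motivation for panel designs.
    """
    dy = y2 - y1
    dx = x2 - x1
    X = np.column_stack([np.ones(len(dy)), dx])
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    sigma2 = resid @ resid / (len(dy) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se  # [intercept, effect of a change in x] and standard errors

# Hypothetical usage: y = candidate evaluation, x = recalled ad exposure,
# each measured in waves 1 and 2 of a panel.
# beta, se = first_difference_ols(eval_w1, eval_w2, ads_w1, ads_w2)
```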
Panel surveys also face a number of threats to data quality that can jeopardize the
ability to make inferences about the outcomes of interest. To be sure, all survey re-
search faces a litany of challenges that can threaten the validity and reliability of survey
estimates. A rich literature across academic and professional fields has made strides in
identifying potential sources of bias in survey research (e.g., Groves 2004; Groves et al.
2009; Groves and Couper 2012; Weisberg 2005). The “Total Survey Error” paradigm
classifies survey error as pertaining to survey sampling, coverage, nonresponse, meas-
urement, and postsurvey analysis and recommends best practices in survey design,
implementation, and evaluation to mitigate these errors (e.g., Biemer 2011; Groves and
Lyberg 2011; Weisberg 2005). In addition to these usual sources of error, however, panel
surveys face additional threats to quality associated with measuring the same individuals
at different points in time. We outline three such challenges here: (1) a tension between
continuity and innovation in the questionnaire design; (2) panel attrition; and (3) panel
conditioning and seam effects (panel-​specific measurement error).

Balancing Continuity and Innovation in Panel Surveys


Given that the ability to track within-​subjects change is one of the longitudinal survey
design’s chief benefits, it perhaps goes without saying that the basic “way to measure
change is not to change the measure” (Smith 2005). Yet longitudinal studies often face
a tension between the need for comparability over time and the pressure to change the
question wording or other design features of the study. Especially in panels that span an
extended time period, there may be compelling reasons to modify, update, or retire a
question (Tourangeau 2003). For example, after nearly one hundred years of use, the U.S.
Census Bureau in 2013 dropped the word “Negro” from its racial response categories.
Even within a shorter panel, there can be reasons to change question wording. Within
a political panel survey of an election campaign, for instance, it is common for vote
choice response options to change from the choice between a generic Democrat and
Republican during the nomination stage to the choice between two specific candidates
after the party nominees are known. Research has consistently shown that public
opinion estimates are sensitive to even small differences in question wording and re-
sponse options (e.g., Green and Schickler 1993; Abramson and Ostrom 1994). Moreover,
responses can also be affected by changes in other survey design features such as mode,
incentives, fielding period, question order, and the like (Jackson 2011).
The point is simply that questionnaire or survey design changes should not be made
lightly and require experimentation and calibration to lessen the inherent loss of con-
tinuity and comparability. Two kinds of experimentation are useful. The first is an
“offline” experiment, wherein additional subjects participate in separate pilot studies,
which randomize respondents to potential versions of the changes under consideration
(Tourangeau 2003). Given the expense of longitudinal research, this process of inde-
pendent piloting is valuable because researchers can more fully understand response
properties and refine the revised survey item before interfering with the continuity
of the panel. The second type of experiment is a split-​ballot design within the panel
survey (Tourangeau 2003). This similarly allows researchers to make between-​item
comparisons for the larger sample, but provides the additional benefit of sustaining the
time series by presenting the original item to some subset of respondents. While experi-
mentation should guide necessary adaptations of existing items, transparency regarding
what has changed and why is the other key to successful innovation (Jackson 2011).
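To make the split-ballot idea concrete, the sketch below shows one way such a within-panel wording experiment might be analyzed, assuming the survey platform can randomly assign ballot versions. The data, column names, and the use of a chi-square test are illustrative assumptions rather than any particular study's design.

```python
# A minimal, hypothetical sketch: randomly assign a subset of panelists to a
# revised question wording, keep the original item for the rest, and compare
# the two response distributions before deciding whether the revision can
# replace the original item in later waves.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(2018)
ballot_exp = pd.DataFrame({"resp_id": range(1000)})
ballot_exp["ballot"] = rng.choice(["original", "revised"], size=len(ballot_exp))

# In practice the responses come back from fielding; random values stand in here.
ballot_exp["response"] = rng.choice(["approve", "disapprove", "not sure"],
                                    size=len(ballot_exp))

table = pd.crosstab(ballot_exp["ballot"], ballot_exp["response"])
chi2, p, dof, _ = chi2_contingency(table)
print(table)
print(f"p-value for a wording effect: {p:.3f}")
```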

Panel Attrition
Perhaps the most well-​recognized challenge to longitudinal studies is panel attrition,
wherein some respondents in the sample fail to complete subsequent waves. Attrition
affects longitudinal studies of all types, modes, and sponsors. For instance, the multiple-​
decade PSID, first fielded in 1968, lost nearly 50% of the initial sample members by 1989.
The ANES 2008–​2009 Panel Study lost 36% of respondents in less than a year of monthly
interviews. At best, attrition reduces effective sample size, thereby decreasing analysts’
abilities to discover longitudinal trends in behavior. At worst, attrition results in an
available sample that is not representative of the target population, thereby introducing
biases into estimates of the outcomes of interest. Recent expansions in the number and
use of panel surveys, coupled with worsening response rates, make the issue of panel
attrition particularly salient. It is well-​documented that response rates for all surveys,
including government surveys, have been in decline in recent decades (Hillygus et al.
2006). The implications may be particularly severe for panel studies since they depend
on respondents participating at multiple points in time (Schoeni et al. 2013). Even high-​
quality government surveys have found that nonresponse and attrition have grown
worse in recent years. For example, before 1992 the Survey of Income and Program
Participation (SIPP) typically lost about 20% of the original sample by the final wave.
That loss rate increased to 35.5% in 1996 and 36.6% in 2004 (Westat 2009, 22).
Reinterviewing the same panelists can be a labor-​intensive process: researchers must
locate, recontact, and persuade the panelist to participate in later waves. If any of these
three steps breaks down, the case is lost (Watson and Wooden 2009). The need to track
panelists to new locations can substantially increase both survey costs and the difficulty
of gaining cooperation, leading some long-​duration panels to alter their sampling de-
sign. For instance, the Early Childhood Longitudinal Study of the National Center for
Education Statistics sampled only 50% of students who moved schools between waves
to help reduce the cost of follow-​up interviews. In sum, panel attrition is a problem for
all panel surveys, the problem has worsened over time, and there are now more data
analysts who have to contend with the problem.
The threats of panel attrition are widely recognized by public opinion researchers
(e.g., Ahern and Le Brocque 2005; Traugott and Rosenstone 1994; Zabel 1998), but there
is little consensus about how to handle it. Analyses of panel attrition tend to be reported
and published separately from those of substantive research (e.g., Zabel 1998; Fitzgerald,
Gottschalk, and Moffitt 1998; Bartels 1999; Clinton 2001; Kruse et al. 2009). Yet panel
attrition is not just a technical issue of interest only to methodologists; it can have direct
implications for the substantive knowledge claims that can be made from panel surveys.
For example, Bartels (1999) showed that differential attrition of respondents in the 1992–​
1996 ANES panel resulted in an overestimation of political interest in the population.
Frankel and Hillygus (2013) show that attrition in the 2008 ANES panel study biased
estimates of the relationship between gender and campaign interest.
Too often, researchers simply ignore panel attrition, conducting the analysis on the
subset of respondents who completed all panel waves (e.g., Wawro 2002). In a re-
view of the literature, Ahern and Le Brocque (2005) find that fewer than one-​quarter of
studies employing panel data discuss attrition or offer any analyses to detect or correct
for panel attrition. In doing so, scholars make an assumption that panel attrition occurs
randomly. In the language of the missing data literature (Little and Rubin 2002), any
complete-​case descriptive analysis assumes the missing data—​subsequent survey waves,
in this case—​are missing completely at random (MCAR). That is, no observed or unob-
served data can systematically predict or account for this missingness. Unfortunately,
this assumption is almost always unfounded. Countless analyses have found that panel
attrition is related to a variety of respondent characteristics (e.g., Behr 2005).
Broadly speaking, the literature on the correlates of panel attrition emphasizes that
repeated participation in a panel survey depends on both the ability and motivation
to cooperate. As such, characteristics like income, education, gender, race, and being
foreign born correlate with attrition (Gray et al. 1996; Fitzgerald, Gottschalk, and Moffitt
1998; Loosveldt, Pickery, and Billiet 2002; Behr 2005; Lynn et  al. 2005; Watson and
Wooden 2009). Individuals who are more socially engaged and residentially stable—​
homeowners and those with children (especially young children) at home—​are more
likely to remain in a panel study, while younger respondents and those who live alone
are more likely to drop out (Lipps 2007; Uhrig 2008; Watson and Wooden 2009; Groves
and Couper 2012). Research also shows that civic engagement and interest in the survey
topic are correlated with attrition; those who care more about the topic are less likely to
attrit (Traugott and Morchio 1990; Traugott and Rosenstone 1994; Loosveldt and Carton
1997; Lepkowski and Couper 2001; Loosveldt, Pickery, and Billiet 2002; Voogt 2005;
Smith and Son 2010). Measures of political engagement and political interest, in par-
ticular, can be predictive of attrition in surveys on all topics, but are especially predic-
tive of attrition in surveys with political content (Brehm 1993; Traugott and Rosenstone
1994; Bartels 1999; Burden 2000; Voogt and Saris 2003; Olson and Witt 2011). For ex-
ample, Olson and Witt (2011) find that political interest has been consistently predic-
tive of retention in the ANES time series from 1964 to 2004. More recent research has
also emphasized that the respondents’ survey experience in the first wave will influence
their likelihood of participating in future waves (e.g., Frankel and Hillygus 2013). Given
the wide range of attrition correlates, Chen et al. (2015) recommend a step-​by-​step pro-
cess of identifying the predictors of attrition based on wave 1 responses and sampling
frame data.5
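As a concrete illustration of this kind of wave 1 diagnostic, the sketch below fits a simple dropout model with scikit-learn. The file name, predictor names, and the binary return indicator are hypothetical placeholders for whatever wave 1 responses and frame variables a given panel actually contains.

```python
# A minimal, hypothetical sketch of a wave-1 attrition diagnostic: model the
# probability of returning for wave 2 as a function of wave-1 responses and
# sampling-frame variables. All column and file names are illustrative.
import pandas as pd
from sklearn.linear_model import LogisticRegression

wave1 = pd.read_csv("wave1_with_return_flag.csv")      # hypothetical file
predictors = ["age", "educ", "pol_interest", "own_home"]

X = wave1[predictors]
y = wave1["returned_wave2"]                            # 1 = completed wave 2

attrition_model = LogisticRegression(max_iter=1000).fit(X, y)

# Coefficients flag which wave-1 characteristics predict dropout; the
# predicted retention probabilities feed the weighting adjustment sketched below.
wave1["p_return"] = attrition_model.predict_proba(X)[:, 1]
print(dict(zip(predictors, attrition_model.coef_[0])))
```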
When attrition bias is expected, there are a variety of approaches for correcting
estimates to improve inference. The use of post-​stratification weights is the most
common correction method used, and attrition-​adjusted survey weights are routinely
provided by survey firms. Weighting is not without controversy, however. As Deng
et al. (2013) highlight, there is wide variability in the way weights are constructed and
in the variables used to account for panel attrition. While researchers typically weight
to demographic benchmarks like the Current Population Survey (CPS) or American
Community Survey (ACS), Vandecasteele and Debels (2006) argue that weights based
on demographic variables alone are likely inadequate to correct for attrition. Weights
can also result in increased standard errors and introduce instabilities in the estimates
(Gelman 2007).6
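One common way such attrition-adjusted weights are constructed is inverse probability weighting: retained panelists are up-weighted by the inverse of their estimated retention probability, and the adjustment is folded into the wave 1 base weight. The sketch below continues the hypothetical dropout model above; `base_weight` and the outcome variable are assumed columns, and the normalization shown is only one of several conventions in use.

```python
# Continuing the hypothetical sketch above: inverse-probability-of-retention
# weights for the panelists who actually completed wave 2.
retained = wave1[wave1["returned_wave2"] == 1].copy()

# Up-weight the kinds of respondents who were most likely to drop out,
# then fold the adjustment into the wave-1 base weight (assumed column).
retained["panel_weight"] = retained["base_weight"] / retained["p_return"]
retained["panel_weight"] *= len(retained) / retained["panel_weight"].sum()

# Unweighted vs. attrition-adjusted estimate of a wave-2 outcome (assumed column).
unweighted = retained["campaign_interest_w2"].mean()
weighted = ((retained["campaign_interest_w2"] * retained["panel_weight"]).sum()
            / retained["panel_weight"].sum())
print(unweighted, weighted)
```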
An alternative approach is imputation, in which the attrited cases are replaced with
plausible values. While there are many different imputation methods, the preferred
approach is multiple imputation, in which multiple values are estimated to replace the
missing data (Pasek et al. 2009; Honaker and King 2010). As with weighting, standard
approaches to multiple imputation assume that missing cases are missing at random
(MAR)—​dependent on observed data, but not unobserved data.
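The sketch below illustrates one way a MAR-based multiple imputation might be run in practice, using scikit-learn's IterativeImputer with posterior sampling to create several completed datasets. The file and variable names are assumptions, and a full analysis would also combine variances across imputations via Rubin's rules.

```python
# A minimal multiple-imputation sketch under MAR. The (hypothetical) file is
# assumed to hold one numeric row per original panelist, with NaN on the
# wave-2 variables for those who attrited.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

panel_wide = pd.read_csv("panel_wide_numeric.csv")   # hypothetical file

m = 5                                                # number of imputations
point_estimates = []
for i in range(m):
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    completed = pd.DataFrame(imputer.fit_transform(panel_wide),
                             columns=panel_wide.columns)
    point_estimates.append(completed["campaign_interest_w2"].mean())

# Rubin's rules: the MI point estimate averages across imputations (a full
# analysis also combines within- and between-imputation variances).
print(np.mean(point_estimates))
```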
Another approach for dealing with panel attrition is through specialized statistical
models. In cases in which MCAR or MAR assumptions are implausible, selection
models (Hausman and Wise 1979; Brehm 1993; Kenward 1998; Scharfstein, Rotnitzky,
and Robins 1999) or pattern mixture models (Little 1993; Kenward, Molenberghs,
and Thijs 2003) can be used to model attrition that is not missing at random
(NMAR)—​dependent on the values of unobserved data. These approaches, however,
also require strong and untestable assumptions about the attrition process, because
there is insufficient information in the original panel data to understand why some cases
are missing (e.g., Schluchter 1992; Brown 1990; Diggle and Kenward 1994; Little and
Wang 1996; Daniels and Hogan 2008). Recent research shows that refreshment samples
can be used as leverage for modeling the attrition process (Bhattacharya 2008; Deng
et al. 2013; Hirano et al. 1998, 2001; Si, Reiter, and Hillygus 2014). A refreshment sample
is a new sample, independently drawn and given the same questionnaire at the same
time as the original panelists. Newly introduced cohorts in a rotating panel offer similar
leverage. The comparison of these new data to the original panel allows researchers to
properly correct estimates from the panel data.
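As a simple illustration of how a refreshment sample is used diagnostically, the sketch below compares retained panelists with the fresh sample on an item asked identically of both groups. The data frames, item name, and the use of a Welch t-test are assumptions for the sake of the example; a full correction would model the attrition process jointly, as in the cited work.

```python
# A minimal diagnostic sketch: compare wave-2 panelists with a refreshment
# sample interviewed at the same time with the same questionnaire. Large
# gaps on shared items point to nonrandom attrition (and/or conditioning).
import pandas as pd
from scipy import stats

panel_w2 = pd.read_csv("original_panel_wave2.csv")   # hypothetical files
refresh = pd.read_csv("refreshment_sample.csv")

item = "pol_interest"                                # asked identically in both
t_stat, p_val = stats.ttest_ind(panel_w2[item].dropna(),
                                refresh[item].dropna(),
                                equal_var=False)     # Welch's t-test

print(panel_w2[item].mean(), refresh[item].mean(), p_val)
```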
Because substantive results can be sensitive to the particular corrective approach em-
ployed (Zabel 1998; Kristman, Manno, and Côté 2005; Ayala, Navarro, and Sastre 2006;
Basic and Rendtel 2007), the best approach for handling panel attrition is to prevent
it in the first place. At the end of the chapter, we review recommendations for design
decisions that can help to mitigate attrition and other panel effects.

Panel-​Specific Measurement Error
It is perhaps ironic that, although one of the advantages of panel surveys is that they
enable assessment of the reliability of survey measures, panel designs can also introduce
additional measurement error—​panel conditioning and seam effects—​that can threaten the
validity of survey estimates. We consider each of these issues in turn.

Panel Conditioning
Panel conditioning, also known as time-​in-​sample bias, refers to the phenomenon in
which participation in earlier waves of the panel affects responses in subsequent waves.
For example, respondents might pay more attention to a political contest because they
are participating in a panel about voting and know they will be asked their opinions
about the candidates. Warren Miller, a pioneer of the ANES, used to joke that the study’s
panel design was an expensive voter mobilization effort because participation in the
pre-​election survey motivated respondents to show up at the ballot box. Conditioning
effects can jeopardize the validity of survey estimates, biasing estimates of the magni-
tude and/​or correlates of change (Kasprzyk et al. 1989; Sturgis, Allum, and Brunton-​
Smith 2009; Warren and Halpern-​Manners 2012).
Researchers have long been concerned about panel conditioning effects.7 In one of
the earliest political panel surveys, researchers identified the potential for panel con-
ditioning, noting that “the big problem yet unsolved is whether repeated interviews
are likely, in themselves, to influence a respondent’s opinions” (Lazarsfeld 1940, 128).
Clausen (1968) found that those who participated in a pre-​election survey in 1964
were more likely to report voting in the post-​election survey—​he attributed seven per-
centage points to the stimulating effect of participating in the pre-​election interview.
Traugott and Katosh (1979) replicated the study and found an even larger mobiliza-
tion effect. Many others have reached similar conclusions (Kraut and McConahay
1973; Yalch 1976; Greenwald et  al. 1987; Anderson, Silver, and Abramson 1988;
Granberg and Holmberg 1992; Simmons, Bickart, and Lynch Jr 1993; Bartels 1999;
Voogt and Van Kempen 2002). Although political interest and political knowledge
are commonly found to be susceptible to panel conditioning effects, the issue is not
restricted to political surveys. For example, Battaglia, Zell, and Ching (1996) found
that asking mothers about the immunization status of their children led to higher
vaccination rates after the interview. Unfortunately, it is not always clear when panel
conditioning will be an issue. While there is considerable documentation that panel
conditioning can exist, it is not always present. Some research finds limited or no
panel conditioning bias (Bartels 1999; Smith, Gerber, and Orlich 2003; Kruse et al.
2009). More generally, there is a lack of clarity in the research about the conditions
under which panel conditioning is expected to change attitudes, behaviors, or
knowledge. In addition, panel conditioning effects might depend on the charac-
teristics of respondents, the topic of the survey, or a variety of other survey design
factors. Moreover, Mann (2005) has disputed the methodological basis of much of
the previous research identifying panel conditioning effects. The common approach
to diagnosing conditioning effects is to simply compare panelist responses in follow-​
up waves with cross-​sectional measures of the same items. Even when using refresh-
ment samples or rotating samples, it can be difficult to distinguish panel conditioning
effects from attrition bias (Warren and Halpern-​Manners 2012).8 For instance,
inflated turnout levels in the ANES post-​election survey may be due to panel condi-
tioning, attrition among those not interested in politics, or other sources of survey
error, such as bias in initial nonresponse (Burden 2000).
The specific mechanisms by which panel conditioning effects occur also vary.
Changes in behavior might occur if survey participation increases respondent motiva-
tion or interest in the topic—​as is the case for political knowledge in an election panel
(Bartels 1999; Kruse et al. 2009). Alternatively, survey respondents could change their
responses as they become more familiar with the interview process and survey experi-
ence. The first type of panel conditioning has been referred to as “conditioning change
in true status,” and the second is called “conditioned reporting.” Conditioned reporting
is a strategic response to the interview, such as learning to give answers that reduce
the number of follow-​up questions. This second type of panel conditioning is closely
linked with the issue of “professional” respondents in online survey panels. These are
respondents who have a lot of experience with taking surveys, so they might understand
how to answer in such a way as to reduce burden and maximize their paid incentives.
Indeed, there may well be concerns that panel survey research that relies on samples de-
rived from online respondent panels will have panelists who are already conditioned at
the time of the first wave because they have already participated in previous surveys on
related topics. It is quite common, for instance, to find that YouGov and GfK panelists
are more politically knowledgeable than the general population.9 In principle, it should
be possible to distinguish conditioned reporting from conditioned responses through
studies designed to specifically test these different mechanisms. Unfortunately, such re-
search is rare.10
There is also little guidance about what to do if panel conditioning bias is found in a
longitudinal study. Some researchers contend that “once they occur the resulting data
are irredeemably biased” (Warren and Halpern-​Manners 2012). This means that it is all
the more important for researchers to prevent panel conditioning in the design of their
surveys as we discuss in more detail at the end of the chapter. For example, research
has suggested that panel conditioning effects are more common when the baseline and
follow-​up surveys are separated by a month or less (e.g., Bailar 1989; De Amici et al.
2000; Fitzsimons, Nunes, and Williams 2007; Levav and Fitzsimons 2006).

Seam Effects
Another source of measurement error unique to longitudinal surveys has been termed
“seam bias”; it refers to the tendency of estimates of change that are measured across the
“seam” of two successive survey waves to far exceed estimates of change that are meas-
ured within a single wave (Conrad, Rips, and Fricker 2009). That is, when respondents
are asked to recall behaviors or conditions at multiple reference times in a single
interview—​for example, employment status in the current month and in the previous
month—​they report few changes between the referenced time periods; in contrast,
estimates of change are much higher if they are measured in two separate waves of data
collection. As a result, estimates of month-​to-​month changes in employment status are
far higher when looking across survey waves than when reported within a single inter-
view (Lynn and Sala 2006).
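The sketch below shows how this pattern is typically quantified, assuming a long-format file with one row per respondent-month in which months 1 and 2 were reported in wave 1 and months 3 and 4 in wave 2, so that the month 2 to month 3 transition crosses the seam. The file and column names are illustrative.

```python
# A minimal sketch of measuring a seam effect in monthly employment reports.
# Assumed long format: one row per respondent-month with columns
# id, month (1-4), employed (0/1); months 1-2 reported in wave 1, 3-4 in wave 2.
import pandas as pd

emp = pd.read_csv("employment_histories.csv")        # hypothetical file
wide = emp.pivot(index="id", columns="month", values="employed")

within_wave1 = (wide[1] != wide[2]).mean()   # change reported inside wave 1
across_seam = (wide[2] != wide[3]).mean()    # change measured across the seam
within_wave2 = (wide[3] != wide[4]).mean()   # change reported inside wave 2

# The hallmark of seam bias: the across-seam transition rate far exceeds
# the within-wave rates for what should be comparable month-to-month change.
print(within_wave1, across_seam, within_wave2)
```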
Seam effects have been most often studied in economics, but they have been found
across a wide range of measures, recall periods, and design features (Lemaitre 1992).
Seam effects were first documented in estimates of government program participation
in the Census Bureau’s SIPP panel survey (Czajka 1983), but have also been found in
the CPS (Cantor and Levin 1991; Polivka and Rothgeb 1993), the PSID (Hill 1987), the
Canadian Survey of Labour and Income Dynamics (Brown, Hale, and Michaud 1998),
and the European Community Household Panel Survey (Jackle and Lynn 2004).
Research examining the source of seam bias suggests that it stems both from
respondents underestimating change within the reference period of a single interview
and overestimating change across waves. Collins (1975), for example, speculates that
between two-​thirds and three-​quarters of the observed change in various employment
statistics (as measured in a monthly labor force survey) were an artifact of this type
of measurement error. Lynn and Sala (2006) label the amount of change they observe
from one survey wave to the next in various employment characteristics as “implausibly
high.” At the same time, researchers have documented underestimates of change within
a single wave, a phenomenon labeled “constant wave responding” (Martini 1989; Rips,
Conrad, and Fricker 2003). Using record validation, Marquis and Moore (1989) confirm
that both factors produce the seam effect.
Seam bias has largely been attributed to respondent memory issues and task dif-
ficulty. For example, there is larger seam bias found with wider time intervals
between waves and the to-​be-​recalled change (Kalton and Miller 1991). There are
also larger seam effects when the recall items are more cognitively difficult (Lynn
and Sala 2006). Some have suggested that seam bias can be further exacerbated by
panel conditioning because individuals learn that it is less burdensome to give the
same response for each referenced time than to report change (Rips, Conrad, and
Fricker 2003).
A related phenomenon identified in political surveys is a sharp discrepancy in
the stability of vote choice or time of vote decision when measured via recall in
a post-​election survey compared to estimation based on measures of candidate
support from multiple waves of panel data (Plumb 1986; Chaffee and Rimal 1996;
Fournier et al. 2004). Researchers have found discrepancies at both the aggregate
and individual levels (Plumb 1986; Chaffee and Rimal 1996). For example, in an
analysis of vote intention stability in the four-​wave ANES 1980 panel study, Plumb
(1986) finds that just 40% of respondents had the same time of decision with both
methods. Critically, some find that the recall measure produces higher levels of
stability (Plumb 1986), while others find it produces lower levels of stability (Katz
1971; Kogen and Gottfried 2012). Several explanations have been offered. First, it
may be difficult for respondents to remember when the decision was made, espe-
cially if asked several months after the fact. Second, there might be issues of so-
cial desirability, whereby respondents might prefer to indicate that they delayed
their decisions in order to appear neutral or independent. Alternatively, some—​
especially partisans—​might claim they knew all along, not wanting to admit that
they were ever undecided.
In terms of mitigating seam bias, the preponderance of research has focused on
efforts to improve respondent recall (Callegaro 2008). For example, Rips, Conrad, and
Fricker (2003) demonstrate that researchers can reduce seam effects by altering question
order. They reason that seam bias is a predictable pattern of satisficing given the usual
grouping of questions by topic instead of time period (Rips, Conrad, and Fricker 2003;
Conrad, Rips, and Fricker 2009). Furthermore, respondents did best when time was or-
dered backwards, or in reverse chronological order—​asking first about the most recent
week and then about earlier and earlier weeks (Rips, Conrad, and Fricker 2003).
The other innovation that targets seam effects at the design stage is dependent
interviewing (DI), which addresses the issue of seam bias straight on by automati-
cally populating a panelist’s previous response and asking if the response still holds
(Conrad, Rips, and Fricker 2009; Moore et al. 2009; Lynn et al. 2005). The previous re-
sponse serves as a reminder or anchor by which the respondent can compare the pre-
sent, perhaps causing reflection on any change and when it may have occurred (Moore
et al. 2009). Dependent interviewing is increasingly common, having been employed
in the Census Bureau’s SIPP and CPS projects (Conrad, Rips, and Fricker 2009), and is
thought to improve interview times and general data quality; nevertheless, Lynn et al.
(2005) caution that the method may underestimate change across waves if it induces
acquiescence bias among respondents who want to tell the interviewer that the previous
response is still accurate.
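To make the DI logic concrete, the sketch below shows the basic proactive dependent interviewing flow in Python. The function, prompts, and platform hook (`ask`) are hypothetical; in practice this logic lives inside the CAI or web survey software rather than in the analyst's code.

```python
# A hypothetical sketch of proactive dependent interviewing: prefill the
# wave-1 answer, ask whether it still holds, and only re-administer the full
# item when the respondent reports a change. `ask` stands in for whatever
# prompt/response call the survey platform actually provides.
def employment_item_di(previous_answer: str, ask) -> str:
    still_true = ask(
        f"Last time, you reported your employment status as "
        f"'{previous_answer}'. Is that still the case? (yes/no)"
    )
    if still_true.strip().lower().startswith("y"):
        return previous_answer          # carry the anchored answer forward
    # Respondents who report a change get the full item again.
    return ask("What is your current employment status?")

# Example usage with a scripted stand-in for the platform's ask() function.
scripted = iter(["no", "unemployed, looking for work"])
print(employment_item_di("employed full time", lambda prompt: next(scripted)))
```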
As with panel conditioning, the best solution for seam effects is to prevent them.
Though there are some post-​survey methods for dealing with seam bias, many of them
effectively throw away data. For an overview of such methods, see Lynn et al. (2005).

Recommendations for Researchers

The trend favoring longitudinal surveys will almost certainly continue given the
method’s ability to track within-subject change. Nevertheless, as with all survey
methods, longitudinal surveys face several challenges to their validity and reliability.
Responsible researchers must acknowledge the potential impact of these challenges
on substantive knowledge claims. In addition to threats from declining response rates,
concerns about the representativeness of survey respondents, and difficulties measuring
various attitudes and behaviors—​issues that arise in all survey designs—​longitudinal
surveys can face the unique challenges of comparability issues, panel attrition, panel
conditioning, and seam effects. Researchers should grapple with potential biases from
attrition and measurement error as a matter of course. Analyses should routinely in-
clude assessments of the quality of panel composition and resulting data, using whatever
information about attrition can be gleaned by comparing later waves to earlier waves on
observable factors like respondent demographics, survey satisfaction, or other meas-
ures related to respondent experience. Despite some potential limitations of weighting
as a correction for attrition, we recommend that—​at minimum—​analysts calculate
estimates using the longitudinal survey weights. Better still, researchers should leverage
refreshment samples or rotating panels, if available, to better understand the impact of
attrition bias and panel conditioning on the survey estimates.
It is the producers of new longitudinal surveys, however, who bear the greatest re-
sponsibility for preventing panel effects. Those designing panel surveys can take several
measures to reduce panel survey error and improve the reliability and validity of the
resulting data. Given findings about the relationship between the survey experience and
attrition, the researcher should first ensure that the questionnaire, especially the ques-
tionnaire for the first survey wave, adheres to best practices in questionnaire design.11
Furthermore, the researcher should enact protocols to make certain that interviewers
are well trained, as poor interviewer performance decreases panelists’ propensity to
respond in later waves. Even in Internet polling, in which there is no traditional in-
terviewer, the survey design must take into account potential technological issues
and general user-​friendliness, as difficulties with the online interface similarly cause
panelists to attrit (Frankel and Hillygus 2013).
This also points to the need to explicitly measure respondents’ survey experience,
such as including a survey satisfaction item at the end of the first wave questionnaire.
Where respondents report satisfaction with the interviewer, the researcher can re-
duce nonresponse in later waves by assigning the same interviewer to all follow-​up
interviews. When a respondent is found to be at risk of attriting, design adaptations
can be made to increase the likelihood of response—​for example, increasing the incen-
tive payments for those with a high propensity to attrit (Laurie and Lynn 2009; Schoeni
et al. 2013). The researcher executing a panel survey design must also take great care
to keep track of panelists. Lepkowski and Couper (2001) identify the researcher’s in-
ability to locate and contact panelists as a major source of panel attrition. When
respondents cannot be identified at the time of a later survey, cases are lost, resulting in
a reduction of effective sample size and potentially biasing estimates for the remaining
cases. The researcher can prevent lost cases by undertaking several activities to track
respondents, such as instigating communication with the panelist between waves that
are spaced far apart, collecting additional contact information (e.g., a mailing address
and phone number, even if the primary means of communication is email), and using
public records and administrative data sources for tracing respondents. For example,
the PSID regularly updates panelist addresses using the United States Postal Service
national change of address service and offers respondents a $10 payment to simply re-
turn a prepaid postcard verifying their full contact information (Schoeni et al. 2013).
This sort of mailing belongs to the broad class of “keeping in touch exercises” (KITEs)
(Laurie 2007). Another activity to improve tracking of panelists is the use of a dedi-
cated website for respondents with information about the study, past results, and a
change-​of-​address form.
The researcher can also address measurement error through careful survey design.
A  researcher concerned about panel conditioning might interview respondents less
frequently, since panel conditioning can be exacerbated by frequent and closely spaced
interviews. On the other hand, infrequent waves that are spaced far apart might rely
more heavily on recall regarding the period between waves, which can induce seam
effects. The researcher is left to balance these different considerations, with the optimal
design depending on the research question and variables of interest. For instance, panel
conditioning has been shown to have relatively limited effects on attitudinal questions,
but strong effects on political knowledge. If the researcher wants to engage questions
about the relationship between political knowledge and various outcomes, the best de-
sign would minimize conditioning effects by asking political knowledge questions infre-
quently and perhaps by developing new political knowledge items. On the other hand, if
the primary goal is to observe change in some attitude or behavior, the researcher might
do best to field many waves close together—​thereby minimizing seam effects at the pos-
sible risk of inducing some conditioning.
As we hope this chapter makes clear, there are many opportunities for future research
that could inform the design, conduct, and analysis of panel surveys. Researchers could
build into the panel design observational or experimental features to distinguish and
measure the various sources of longitudinal survey error. For example, a new panel of
respondents for a longitudinal survey might gain traction on the distinction between
panel attrition and conditioning by drawing on a very rich sampling frame, such as a
voter registration database enhanced with commercial data. This kind of list would
provide relatively straightforward criteria for measuring nonrandom attrition, by
comparing the pre-​study covariates of returning panelists and those who drop out and
would also provide some leverage on conditioning, by allowing the researcher to com-
pare the respondents’ predicted and actual responses and behaviors.
Experimental designs might manipulate the panel survey experience for some
respondents in order to gain a clearer understanding of how to minimize survey
error. For instance, building on the earlier discussion of panel conditioning versus
seam effects, the researcher could randomize respondents to complete several or few
surveys that are spaced near or far apart. Similarly, the researcher can evaluate other
design tradeoffs by randomizing design differences across panelists. For example, pre-
vious research suggests that the researcher can stem panel attrition by increasing com-
munication with panelists, directing them to a study website, and sharing details of
study findings with them. These measures are meant to increase panelists’ interest in
and commitment to the panel survey (Schoeni et al. 2013), but the researcher should
consider whether these efforts—​especially the provision of study results—​contribute to
panel conditioning. An experimental design could randomize the use of these partic-
ular retention efforts to estimate their effect on attrition and panel conditioning.
In addition, given the extent to which longitudinal survey research is being conducted
with online panels, more research should consider how the online setting reduces or
exacerbates the various types of error unique to the longitudinal survey design. Building on
Adams, Atkeson, and Karp (2012) and Hillygus, Jackson, and Young (2014), such research
might compare the panel conditioning effects of new recruits who enter either online
panels or other types of panel surveys. Research on survey error in online surveys would be
greatly enhanced if collaborations with the proprietors of online panels provided not just
the number of surveys completed and panelists’ time in the panel (Clinton 2001; Adams,
Atkeson, and Karp 2012), but also information about the kinds of surveys to which the pan-
elist has been invited and the kinds of surveys that the panelist has actually completed.
It is our hope that future research on panel survey error will not only provide a more
comprehensive list of best practices to prevent and to measure survey error, but also will
mitigate these biases when they are found in existing longitudinal survey data.

Acknowledgments
This work was supported by NSF Grant SES-​10-​61241. Any opinions, findings, conclusions, or
recommendations expressed in this material are those of the authors and do not necessarily re-
flect the views of the National Science Foundation.

Notes
1. Although most online survey panels are nonprobability panels, in which panelists have
opted-​in to the panel, there are limited examples of online probability survey panels, such as
the RAND American Life Panel, AmeriSpeaks, and GfK Knowledge Networks.
2. A related difference is in the definition of attrition. Some designs allow individuals who fail
to respond to one wave to return to subsequent waves (temporary attrition), while other
designs would consider those individuals permanent attriters. Internet panel studies that
rely on an online panel of respondents are especially likely to use the former design, as it is
nearly costless to invite former attriters into subsequent waves.
3. To be sure, the exact nature of the relationship between the intervention and the data
collection can affect the strength of the causal claims. Generally speaking, data collected
closer to the intervention give greater confidence that any observed changes are the result
of the intervention rather than confounding factors.
4. The weights provided often account for both unequal probabilities of selection in the sam-
pling design as well as unit nonresponse. As such, new weights are typically provided for
each wave to account for sample attrition.
5. To be sure, some researchers have found minimal attrition bias (Bartels 1999; Clinton
2001; Kruse et al. 2009). Most critical, of course, is that such an evaluation be conducted,
since the extent of attrition bias can vary across different outcomes.
6. In using any alternative approach to panel attrition correction, it remains important to
account for the sampling design in making inferences. If the survey firm does not pro-
vide all variables related to the sampling design (e.g., geographic clusters), researchers
can use the sampling design weights or wave 1 survey weights to make the necessary
adjustments.
7. Of course, even nonpanel studies must also confront the possibility that simply the act of
measuring social phenomena can sometimes change the object under investigation—​the
classic Hawthorne effect (e.g., Landsberger 1958).
8. Das, Toepoel, and van Soest (2011) offer one such approach that relies on a nonparametric
test for estimating separate attrition and conditioning effects.
9. It likely does not help that researchers tend to ask the exact same political knowledge
questions across different studies.
10. Notable exceptions include Warren and Halpern-​Manners (2012); Sturgis, Allum, and
Brunton-​Smith (2009); and Das, Toepoel, and van Soest (2011).
11. Interested readers may want to consult the resources available at http://​dism.ssri.duke.
edu/​question_​design.php.

References
Abramson, P. R., and C. W. Ostrom. 1994. “Question Wording and Partisanship: Change and
Continuity in Party Loyalties During the 1992 Election Campaign.” Public Opinion Quarterly
58 (1): 21.
Achen, C. H. 1975. “Mass Political Attitudes and the Survey Response.” American Political
Science Review 69 (4): 1218–​1231.
Adams, A. N., L. R. Atkeson, and J. A. Karp. 2012. “Panel Conditioning in Online Survey
Panels: Problems of Increased Sophistication and Decreased Engagement.” Prepared for de-
livery at the American Political Science Association Annual Meeting. New Orleans.
Ahern, K., and R. Le Brocque. 2005. “Methodological Issues in the Effects of Attrition: Simple
Solutions for Social Scientists.” Field Methods 17 (February): 53–​69.
Anderson, B. A., B. D. Silver, and P. R. Abramson. 1988. “The Effects of the Race of the
Interviewer on Race-​related Attitudes of Black Respondents in SRC/​CPS National Election
Studies.” Public Opinion Quarterly 52 (3): 289–​324.
Ayala, L., C. Navarro, and M. Sastre. 2006. Cross-​country Income Mobility Comparisons under Panel
Attrition: The Relevance of Weighting Schemes. Technical report, Instituto de Estudios Fiscales.
Bailar, B. A. 1989. “Information Needs, Surveys, and Measurement Errors.” In Panel Surveys,
edited by D. Kasprzyk, G. Duncan, G. Kalton, and M. P. Singh, 1–24. New York: Wiley.
Baker, R., S. J. Blumberg, J. M. Brick, M. P. Couper, M. Courtright, J. M. Dennis, . . . D. Zahs.
2010. “Research Synthesis: AAPOR Report on Online Panels.” Public Opinion Quarterly 74
(October): 711–​781.
Bartels, L. M. 1999. “Panel Effects in the American National Election Studies.” Political Analysis
8 (January): 1–​20.
Bartels, L. M. 2006. “Three Virtues of Panel Data for the Analysis of Campaign Effects.”
In Capturing Campaign Effects, edited by H. E. Brady and R. Johnston, 134–​163. Ann
Arbor: University of Michigan Press.
Basic, E., and U. Rendtel. 2007. “Assessing the Bias due to Non-​coverage of Residential Movers
in the German Microcensus Panel: An Evaluation Using Data from the Socio-​Economic
Panel.” AStA: Advances in Statistical Analysis 91 (3): 311–​334.
Battaglia, M. P., E. R. Zell, and P. L. Y. H. Ching. 1996. “Can Participating in a Panel Sample
Introduce Bias into Trend Estimates?” In Proceedings of the Survey Research Methods Section,
1010–​1013. Alexandria, VA: American Statistical Association. Retrieved from http://​www.
amstat.org/​sections/​SRMS/​Proceedings/​y1996f.html.
Baunach, D. M. 2011. “Decomposing Trends in Attitudes Toward Gay Marriage, 1988–​2006.”
Social Science Quarterly 92 (June): 346–​363.
Beck, N., J. N. Katz, and R. Tucker. 1998. “Taking Time Seriously: Time-​Series-​Cross-​Section
Analysis with a Binary Dependent Variable.” American Journal of Political Science 42
(4): 1260–​1288.
Behr, A. 2005. “Extent and Determinants of Panel Attrition in the European Community
Household Panel.” European Sociological Review 21 (July): 489–​512.
Berelson, B. R., P. F. Lazarsfeld, and W. N. McPhee. 1954. Voting: A Study of Opinion Formation
in a Presidential Campaign. Chicago: University of Chicago Press.
Bhattacharya, D. 2008. “Inference in Panel Data Models under Attrition Caused by
Unobservables.” Journal of Econometrics 144 (2): 430–​446.
Biemer, P. P. 2011. “Total Survey Error:  Design, Implementation, and Evaluation.” Public
Opinion Quarterly 74 (February): 817–​848.
Bishop, B. H. 2013. “Drought and Environmental Opinion A Study of Attitudes toward Water
Policy.” Public Opinion Quarterly 77 (3): 798–​810.
Blais, A., and A. Dobrzynska. 1998. “Turnout in Electoral Democracies.” European Journal of
Political Research 33: 239–​261.
Bradburn, N. M., L. J. Rips, and S. K. Shevell. 1987. “Answering Autobiographical Questions: The
Impact of Memory and Inference on Surveys.” Science 236 (April): 157–​161.
Brehm, J. 1993. The Phantom Respondents: Opinion Surveys and Political Representation. Ann
Arbor: University of Michigan Press.
Brewer, P. R. 2008. “The Shifting Foundations of Public Opinion about Gay Rights.” Journal of
Politics 65 (July): 1208–​1220.
Brown, A., A. Hale, and S. Michaud. 1998. “Use of Computer Assisted Interviewing in
Longitudinal Surveys.” In Computer Assisted Survey Information Collection, edited by M. P.
Couper, R. P. Baker, J. Bethlehem, C. Z. F. Clark, J. Martin, W. L. Nicholls, II, J. M. O’Reilly,
185–​200. New York: John Wiley & Sons.
Brown, C. H. 1990. “Protecting against Nonrandomly Missing Data in Longitudinal Studies.”
Biometrics 46 (1): 143–​155.
Burden, B. C. 2000. “Voter Turnout and the National Election Studies.” Political Analysis 8
(4): 389–​398.
Callegaro, M. 2008. “Seam Effects in Longitudinal Surveys.” Journal of Official Statistics 24
(3): 387–​409.
Callegaro, M., R. Baker, J. Bethlehem, A. S. Goritz, J. A. Krosnick, and P. J. Lavrakas. 2014.
“Online Panel Research.” In Online Panel Research: A Data Quality Perspective, edited by
Callegaro, M., R. Baker, J. Bethlehem, A. S. Goritz, J. A. Krosnick, and P. J. Lavrakas, 1–​22.
New York: John Wiley & Sons.
Cantor, D., and K. Levin. 1991. Summary of Activities to Evaluate the Dependent Interviewing
Procedure of the Current Population Survey. Report submitted to the Bureau of Labor
Statistics by Westat, Inc. (Contract No. J-​9-​J-​8-​0083).
Carmines, E. G., and R. A. Zeller. 1979. Reliability and Validity Assessment. Thousand Oaks,
CA: Sage.
Carsey, T. M., and G. C. Layman. 2006. “Changing Sides or Changing Minds? Party
Identification and Policy Preferences in the American Electorate.” American Journal of
Political Science 50 (April): 464–​477.
Chaffee, S. H., and R. N. Rimal. 1996. “Time of Vote Decision and Openness to Persuasion.”
In Political Persuasion and Attitude Change, edited by D. Mutz, P. Sniderman, and R. Brody,
267–​291. Ann Arbor: University of Michigan Press.
Chen, Q., A. Gelman, M. Tracy, F. H. Norris, and S. Galea. 2015. “Incorporating the Sampling
Design in Weighting Adjustments for Panel Attrition.” Statistics in Medicine.
Clausen, A. R. 1968. “Response Validity:  Vote Report.” Public Opinion Quarterly 32
(4): 588–​606.
Clinton, J. D. 2001. “Panel Bias from Attrition and Conditioning:  A Case Study of the
Knowledge Networks Panel.” Unpublished manuscript, Stanford University. Retrieved from
http://​www.knowledgenetworks.com/​insights/​docs/​Panel%20Effects.pdf.
Collins, C. 1975. “Comparison of Month-​to-​month Changes in Industry and Occupation
Codes with Respondent’s Report of Change: CPS Job Mobility Study.” US Census Bureau,
Response Research Staff Report (75-​5).
Conrad, F. G., L. J. Rips, and S. S. Fricker. 2009. “Seam Effects in Quantitative Responses.”
Journal of Official Statistics 25 (3): 339–361.
Converse, P. E. 1964. “The Nature of Belief Systems in Mass Publics.” In Ideology and Discontent,
edited by David E. Apter, 206–​261. Ann Arbor: University of Michigan Press.
Czajka, J. 1983. “Subannual Income Estimation.” In Technical, Conceptual and Administrative
Lessons of the Income Survey Development Program (ISDP), 87–​97. New York: Social Science
Research Council.
Daniels, M. J., and J. W. Hogan. 2008. Missing Data in Longitudinal Studies:  Strategies for
Bayesian Modeling and Sensitivity Analysis. New York: CRC Press.
Das, M., V. Toepoel, and A. van Soest. 2011. “Nonparametric Tests of Panel Conditioning and
Attrition Bias in Panel Surveys.” Sociological Methods & Research 40 (January): 32–​56.
Davenport, T. C., A. S. Gerber, D. P. Green, C. W. Larimer, C. B. Mann, and C. Panagopoulos.
2010. “The Enduring Effects of Social Pressure: Tracking Campaign Experiments Over a
Series of Elections.” Political Behavior 32 (May): 423–​430.
De Amici, D, C. Klersy, F. Ramajoli, L. Brustia, and P. Politi. 2000. “Impact of the Hawthorne
Effect in a Longitudinal Clinical Study.” Controlled Clinical Trials 21 (April): 103–​114.
Deng, Y., D. S. Hillygus, J. P. Reiter, Y. Si, and S. Zheng. 2013. “Handling Attrition in Longitudinal
Studies: The Case for Refreshment Samples.” Statistical Science 28 (May): 238–​256.
Diggle, P., and M. G. Kenward. 1994. “Informative Drop-​out in Longitudinal Data Analysis.”
Applied Statistics 43 (1): 49–​93.
Duncan, G. J., and G. Kalton. 1987. “Issues of Design and Analysis of Surveys across Time.”
International Statistical Review/​Revue Internationale de Statistique 55 (1): 97–​117.
Finkel, S. E. 1995. Causal Analysis with Panel Data. Thousand Oaks, CA: Sage Publications.
Fitzgerald, J., P. Gottschalk, and R. Moffitt. 1998. An Analysis of Sample Attrition in Panel
Data: The Michigan Panel Study of Income Dynamics. Technical report.
Fitzsimons, G. J., J. C. Nunes, and P. Williams. 2007. “License to Sin: The Liberating Role of
Reporting Expectations.” Journal of Consumer Research 34 (1): 22–​31.
Fournier, P., R. Nadeau, A. Blais, E. Gidengil, and N. Nevitte. 2004. “Time-​of-​voting Decision
and Susceptibility to Campaign Effects.” Electoral Studies 23 (4): 661–​681.
Frankel, L. L., and D. S. Hillygus. 2013. “Looking Beyond Demographics: Panel Attrition in the
ANES and GSS.” Political Analysis 22 (October): 336–​353.
Gelman, A. 2007. “Struggles with Survey Weighting and Regression Modeling.” Statistical Science
22 (2): 153–​164.
Gelman, A., and J. Hill. 2007. Data Analysis Using Regression and Multilevel/​Hierarchical
Models. Cambridge: Cambridge University Press.
Gerber, A. S., D. Karlan, and D. Bergan. 2009. “Does the Media Matter? A Field Experiment
Measuring the Effect of Newspapers on Voting Behavior and Political Opinions.” American
Economic Journal: Applied Economics 1 (March): 35–​52.
Granberg, D., and S. Holmberg. 1992. “The Hawthorne Effect in Election Studies: The Impact
of Survey Participation on Voting.” British Journal of Political Science 22 (02): 240–​247.
Gray, R., P. Campanelli, K. Deepchand, and P. Prescott-​Clarke. 1996. “Exploring Survey Non-​
response: The Effect of Attrition on a Follow-​up of the 1984–​85 Health and Life Style Survey.”
The Statistician 45 (2): 163–​183.
Green, D. P., and E. Schickler. 1993. “Multiple-​Measure Assessment of Party Identification.”
Public Opinion Quarterly 57 (4): 503.
Greenwald, A. G., C. G. Carnot, R. Beach, and B. Young. 1987. “Increasing Voting Behavior by
Asking People If They Expect to Vote.” Journal of Applied Psychology 72 (2): 315.
Groves, R. M. 2004. Survey Errors and Survey Costs. New York: John Wiley & Sons.
Groves, R. M., and L. Lyberg. 2011. “Total Survey Error:  Past, Present, and Future.” Public
Opinion Quarterly 74 (February): 849–​879.
Groves, R. M., and M. P. Couper. 2012. Nonresponse in Household Interview Surveys.
New York: John Wiley & Sons.
Groves, R. M., F. J. Fowler Jr., M. P. Couper, J. M. Lepkowski, E. Singer, and R. Tourangeau.
2009. Survey Methodology. 2nd ed. New York: Wiley.
Hausman, J. A., and D. A. Wise. 1979. “Attrition Bias in Experimental and Panel Data: The Gary
Income Maintenance Experiment.” Econometrica 47 (2): 455–​473.
Henderson, M., and D. S. Hillygus. 2016. “Changing the Clock: The Role of Campaigns in the
Timing of Vote Decision.” Public Opinion Quarterly 80 (3): 761–770.
Henderson, M., D. S. Hillygus, and T. Tompson. 2010. “ ‘Sour Grapes’ or Rational Voting? Voter
Decision Making Among Thwarted Primary Voters in 2008.” Public Opinion Quarterly 74
(March): 499–​529.
Hill, D. 1987. “Response Errors around the Seam: Analysis of Change in a Panel with
Overlapping Reference Periods.” In Proceedings of the Section on Survey Research Methods,
American Statistical Association, 210–215. Retrieved from http://www.amstat.org/sections/
srms/​Proceedings/​papers/​1987_​032.pdf.
Hillygus, D. S., and S. Jackman. 2003. “Voter Decision Making in Election 2000: Campaign
Effects, Partisan Activation, and the Clinton Legacy.” American Journal of Political Science 47
(4): 583–​596.
Hillygus, D. S., N. Jackson, and M. Young. 2014. “Professional Respondents in Nonprobability
Online Panels.” In Online Panel Research:  A Data Quality Perspective, edited by M.
Callegro, R. Baker, J. Bethlehem, A. S. Goritz, J. A. Krosnick, and P. J. Lavrakas, 219–​237.
New York: Wiley.
Hillygus, D. S., N. H. Nie, K. Prewitt, and H. Pals. 2006. The Hard Count: The Political and Social
Challenges of Census Mobilization. New York: Russell Sage Foundation.
Hirano, K., G. W. Imbens, G. Ridder, and D. B. Rubin. 1998. “Combining Panel Data Sets with
Attrition and Refreshment Samples.” Working Paper 230, National Bureau of Economic
Research.
Hirano, K., G. W. Imbens, G. Ridder, and D. B. Rubin. 2001. “Combining Panel Data Sets with
Attrition and Refreshment Samples.” Econometrica 69 (6): 1645–​1659.
Honaker, J., and G. King. 2010. “What to Do about Missing Values in Time-​Series Cross-​
Section Data.” American Journal of Political Science 54 (April): 561–​581.
Hsiao, C. 2003. Analysis of Panel Data. 2nd ed. Cambridge: Cambridge University Press.
Iyengar, S., G. Sood, and Y. Lelkes. 2012. “Affect, Not Ideology: A Social Identity Perspective on
Polarization.” Public Opinion Quarterly 76 (September): 405–​431.
Jackle, A., and P. Lynn. 2004. “Dependent Interviewing and Seam Effects in Work History
Data.” ISER Working Paper 2004-​24, Institute for Social and Economic Research, University
of Essex, Colchester.
Jackson, N. 2011. Questionnaire Design Issues in Longitudinal and Repeated Cross-​Sectional
Surveys. Report of the Duke Initiative on Survey Methodology Workshop on Questionnaire
Design Issues in Longitudinal and Repeated Cross-​Sectional Surveys, February 18.
Kalton, G., and C. F. Citro. 1993. The Future of the Survey of Income and Program Participation.
Washington, D.C.: National Academy Press.
Kalton, G., and M. E. Miller. 1991. “The Seam Effect with Social Security Income in the Survey
of Income and Program Participation.” Journal of Official Statistics 7 (2): 235–​245.
Kasprzyk, D., G. Duncan, G. Kalton, and M. P. Singh. 1989. Panel Surveys. New York: Wiley.
Katz, E. 1971. “Platforms & Windows: Broadcasting’s Role in Election Campaigns.” Journalism &
Mass Communication Quarterly 48 (2): 304–​314.
Kenward, M. G. 1998. “Selection Models for Repeated Measurements with Non-​random
Dropout: An Illustration of Sensitivity.” Statistics in Medicine 17 (23): 2723–​2732.
Kenward, M. G., G. Molenberghs, and H. Thijs. 2003. “Pattern-​mixture Models with Proper
Time Dependence.” Biometrika 90 (1): 53–71.
Kogen, L., and J. A. Gottfried. 2012. “I Knew It All Along! Evaluating Time-​of-​decision
Measures in the 2008 US Presidential Campaign.” Political Behavior 34 (4): 719–​736.
Kraut, R. E., and J. B. McConahay. 1973. “How Being Interviewed Affects Voting:  An
Experiment.” Public Opinion Quarterly 37 (3): 398–​406.
Kristman, V. L., M. Manno, and P. Côté. 2005. “Methods to Account for Attrition in Lon-​
gitudinal Data: Do They Work? A Simulation Study.” European Journal of Epidemiology 20
(8): 657–​662.
Kruse, Y., M. Callegaro, J. M. Dennis, C. DiSogra, S. Subias, M. Lawrence, and T. Thompson.
2009. “Panel Conditioning and Attrition in the AP-​Yahoo! News Election Panel Study.”
Presented at the Annual Meeting of the American Association for Public Opinion Research.
Hollywood, FL. Retrieved from http://www.knowledgenetworks.com/ganp/docs/jsm2009/
Panel%20Conditioning%20and%20Attrition_​JSM_​2009_​submitted.pdf.
Landsberger, H. A. 1958. Hawthorne Revisited: Management and the Worker, Its Critics, and
Developments in Human Relations in Industry. Ithaca: Cornell University Press.
Laurie, H. 2007. “Minimizing Panel Attrition.” In Handbook of Longitudinal
Research:  Design, Measurement, and Analysis, edited by Scott Menard, 167–​ 184.
Burlington, MA: Elsevier.
Laurie, H., and P. Lynn. 2009. “The Use of Respondent Incentives on Longitudinal Surveys.” In
Methodology of Longitudinal Surveys, edited by Peter Lynn, 205–​234. Chichester, UK: John
Wiley & Sons.
Lazarsfeld, P. F. 1940. “ ‘Panel’ Studies.” Public Opinion Quarterly 4 (1): 122–​128.
Lazarsfeld, P. F., B. Berelson, and H. Gaudet. 1948. The People’s Choice: How the Voter Makes Up
His Mind in a Presidential Campaign. New York: Columbia University Press.
Lemaitre, G. 1992. Dealing with the Seam Problem for the Survey of Labour and Income
Dynamics. Ottawa: Statistics Canada.
Lenz, G. S. 2009. “Learning and Opinion Change, Not Priming: Reconsidering the Priming
Hypothesis.” American Journal of Political Science 53 (4): 821–​837.
Lepkowski, J. M., and M. P. Couper. 2001. “Nonresponse in the Second Wave of Longitudinal
Household Surveys.” In Survey Nonresponse, edited by R. M. Groves, D. A. Dillman, J. L.
Eltinge, and R. J. Little, 259–​272. New York: Wiley and Sons.
Levav, J., and G. J. Fitzsimons. 2006. “When Questions Change Behavior: The Role of Ease of
Representation.” Psychological Science 17 (March): 207–​213.
Lipps, O. 2007. “Attrition in the Swiss Household Panel.” Methoden–​Daten–​Analysen 1
(1): 45–​68.
Little, R. J. A. 1993. “Pattern-​mixture Models for Multivariate Incomplete Data.” Journal of the
American Statistical Association 88 (421): 125–​134.
Little, R. J. A., and D. B. Rubin. 2002. Statistical Analysis with Missing Data. New York: Wiley.
Little, R. J. A., and Y. Wang. 1996. “Pattern-mixture Models for Multivariate Incomplete Data
with Covariates.” Biometrics 58 (1): 98–​111.
Loosveldt, G., and A. Carton. 1997. “Evaluation of Nonresponse in the Belgian Election Panel
Study ‘91–​‘95.” In Proceedings of the Survey Research Methods Section, American Statistical
Association, 1017–1022. Retrieved from http://www.amstat.org/sections/srms/Proceedings/
papers/​1997_​175.pdf.
Loosveldt, G., J. Pickery, and J. Billiet. 2002. “Item Nonresponse as a Predictor of Unit
Nonresponse in a Panel Survey.” Journal of Official Statistics 18 (4): 545–​558.
Lynn, P., and E. Sala. 2006. “Measuring Change in Employment Characteristics:  The
Effects of Dependent Interviewing.” International Journal of Public Opinion Research 18
(4): 500–​509.
Lynn, P., N. Buck, J. Burton, A. Jackle, and H. Laurie. 2005. “A Review of Methodological
Research Pertinent to Longitudinal Survey Design and Data Collection.” ISER Working
Paper 2005-​29, Institute for Social and Economic Research, University of Essex, Colchester.
Mann, C. B. 2005. “Unintentional Voter Mobilization:  Does Participation in Preelection
Surveys Increase Voter Turnout?” ANNALS of the American Academy of Political and Social
Science 601 (1): 155–​168.
Marquis, K. H, and J. C. Moore. 1989. “Some Response Errors in SIPP—​With Thoughts About
Their Effects and Remedies.” In Proceedings of the Section on Survey Research Methods,
American Statistical Association, 381–386. Retrieved from http://www.amstat.org/sections/
srms/​Proceedings/​papers/​1989_​067.pdf.
Martini, A. 1989. “Seam Effect, Recall Bias, and the Estimation of Labor Force Transition Rates
from SIPP.” In Proceedings of the Survey Research Methods Section, American Statistical
Association, 387–392. Retrieved from http://www.amstat.org/sections/srms/proceedings/
papers/​1989_​068.pdf.
Menard, S. 2002. Longitudinal Research. Vol. 76. 2nd ed. Thousand Oaks: Sage Publications.
Menard, S., ed. 2007. Handbook of Longitudinal Research: Design, Measurement, and Analysis.
Burlington, MA: Elsevier.
Moore, J., N. Bates, J. Pascale, and A. Okon. 2009. “Tackling Seam Bias Through Questionnaire
Design.” In Methodology of Longitudinal Surveys, edited by Peter Lynn, 72–​ 92.
New York: John Wiley & Sons.
Olson, K., and L. Witt. 2011. “Are We Keeping the People Who Used to Stay? Changes
in Correlates of Panel Survey Attrition Over Time.” Social Science Research 40
(4): 1037–​1050.
Pasek, J., A. Tahk, Y. Lelkes, J. A. Krosnick, B. K. Payne, O. Akhtar, and T. Tompson. 2009.
“Determinants of Turnout and Candidate Choice in the 2008 US Presidential Election
Illuminating the Impact of Racial Prejudice and Other Considerations.” Public Opinion
Quarterly 73 (5): 943–​994.
Plumb, E. 1986. “Validation of Voter Recall:  Time of Electoral Decision Making.” Political
Behavior 8 (4): 302–​312.
Polivka, A. E., and J. M. Rothgeb. 1993. “Redesigning the CPS Questionnaire.” Monthly Labor
Review September: 10–​28.
Rips, L. J., F. G. Conrad, and S. S. Fricker. 2003. “Straightening the Seam Effect in Panel
Surveys.” Public Opinion Quarterly 67 (4): 522–​554.
Scharfstein, D. O., A. Rotnitzky, and J. M. Robins. 1999. “Adjusting for Nonignorable Drop-​out
Using Semiparametric Nonresponse Models.” Journal of the American Statistical Association
94 (448): 1096–​1120.
Schluchter, M. D. 1992. “Methods for the Analysis of Informatively Censored Longitudinal
Data.” Statistics in Medicine 11 (14–​15): 1861–​1870.
Schoeni, R. F., F. Stafford, K. A. McGonagle, and P. Andreski. 2013. “Response Rates in
National Panel Surveys.” Annals of the American Academy of Political and Social Science 645
(January): 60–​87.
Si, Y., J. P. Reiter, and D. S. Hillygus. 2014. “Semi-​parametric Selection Models for Potentially
Non-​ignorable Attrition in Panel Studies with Refreshment Samples.” Political Analysis
(June): 1–​21.
Simmons, C. J., B. A. Bickart, and J. G. Lynch Jr. 1993. “Capturing and Creating Public Opinion
in Survey Research.” Journal of Consumer Research 20 (2): 316–​329.
Singer, J. D., and J. B. Willett. 2003. Applied Longitudinal Data Analysis: Modeling Change and
Event Occurrence. New York: Oxford University Press.
Smith, J. K., A. S. Gerber, and A. Orlich. 2003. “Self-​Prophecy Effects and Voter Turnout: An
Experimental Replication.” Political Psychology 24 (3): 593–​604.
Smith, T. W. 2005. “The Laws of Studying Societal Change.” General Social Survey Social
Change Report, No. 50.
Smith, T. W., and J. Son. 2010. “An Analysis of Panel Attrition and Panel Change on the
2006-​2008 General Social Survey Panel.” General Social Survey Methodological Report,
No. 118.
Sturgis, P., N. Allum, and I. Brunton-​Smith. 2009. “Attitudes Over Time: The Psychology of
Panel Conditioning.” In Methodology of Longitudinal Surveys, edited by P. Lynn, 113–​126.
Chichester, UK: John Wiley & Sons.
Tourangeau, R. 2003. Recurring Surveys:  Issues and Opportunities. Report to the National
Science Foundation on a workshop held on March 28–​29. Arlington, VA: National Science
Foundation.
Tourangeau, R., L. J. Rips, and K. Rasinski. 2000. The Psychology of Survey Response.
New York: Cambridge University Press.
Traugott, M. W., and J. P. Katosh. 1979. “Response Validity in Surveys of Voting Behavior.”
Public Opinion Quarterly 43 (3): 359.
Traugott, S., and G. Morchio. 1990. “Assessment of Bias Due to Attrition and Sample Selection
in the NES 1989 Pilot Study.” ANES Technical Report, Center for Political Studies, University
of Michigan, Ann Arbor.
Traugott, S., and S. J. Rosenstone. 1994. Panel Attrition Among the 1990–​1992 Panel Respondents.
Technical Report, Center for Political Studies.
Uhrig, S. C. N. 2008. “The Nature and Causes of Attrition in the British Household Panel Study.”
ISER Working Paper 2008-​05, Institute for Social and Economic Research, University of
Essex, Colchester.
Vandecasteele, L., and A. Debels. 2006. “Attrition in Panel Data:  The Effectiveness of
Weighting.” European Sociological Review 23 (December): 81–​97.
Vicente, P. C. 2014. “Is Vote Buying Effective? Evidence from a Field Experiment in West
Africa.” Economic Journal 124 (574): F356–​F387.
Visser, P. S., J. A. Krosnick, P. J. Lavrakas, and N. Kim. 2014. “Survey Research.” In Handbook of
Research Methods in Social Psychology, 2nd ed., edited by H. T. Reis and C. M. Judd, 223–​252.
Cambridge: Cambridge University Press.
Voogt, R. J. J. 2005. “An Alternative Approach to Correcting Response and Nonresponse Bias
in Election Research.” Acta Politica 40 (1): 94–​116.
Voogt, R. J. J., and W. E. Saris. 2003. “To Participate or Not to Participate: The Link between
Survey Participation, Electoral Participation, and Political Interest.” Political Analysis 11
(2): 164–​179.
Voogt, R. J. J., and H. Van Kempen. 2002. “Nonresponse Bias and Stimulus Effects in the Dutch
National Election Study.” Quality and Quantity 36 (4): 325–​345.
Warren, J. R., and A. Halpern-​Manners. 2012. “Panel Conditioning in Longitudinal Social
Science Surveys.” Sociological Methods & Research 41: 491–​534.
Watson, N., and M. Wooden. 2009. “Identifying Factors Affecting Longitudinal Survey
Response.” In Methodology of Longitudinal Surveys, edited by Peter Lynn, 157–183.
Chichester, UK: John Wiley & Sons.
Wawro, G. 2002. “Estimating Dynamic Panel Data Models in Political Science.” Political
Analysis 10 (1): 25–​48.
Weisberg, H. F. 2005. The Total Survey Error Approach: A Guide to the New Science of Survey
Research. Chicago: University of Chicago Press.
Westat. 2009. “SIPP Sample Design and Interview Procedures.” In Survey of Income and
Program Participation Users’ Guide, 1–25. Rockville, MD. Retrieved from http://www.census.
gov/content/dam/Census/programs-surveys/sipp/methodology/SIPP_USERS_Guide_
Third_Edition_2001.pdf.
Yalch, R. F. 1976. “Pre-​election Interview Effects on Voter Turnout.” Public Opinion Quarterly
40 (3): 331–​336.
Yeager, D. S., J. A. Krosnick, L. Chang, H. S. Javitz, M. S. Levendusky, A. Simpser, and R. Wang.
2011. “Comparing the Accuracy of RDD Telephone Surveys and Internet Surveys Conducted
with Probability and Non-probability Samples.” Public Opinion Quarterly 75 (4): 709–747.
Zabel, J. E. 1998. “An Analysis of Attrition in the Panel Study of Income Dynamics and the
Survey of Income and Program Participation with an Application to a Model of Labor
Market Behavior.” Journal of Human Resources 33 (2): 479–506.
Chapter 3

Mixing Survey Modes and Its Implications

Lonna Rae Atkeson and Alex N. Adams

Mixed Mode Surveys

The use of scientific sampling in survey research dates back to the 1930s, when it was
primarily conducted through the mail or personal visits to households (Elinson 1992).
However, contact with sample members and the administration of the survey instru-
ment can come in multiple formats or modes; the number of modes available and their
complexity have increased over the past eighty years. In this context, mode generally
refers to a strategy or method of respondent contact and data collection. Respondents
can be contacted and respond in person, by mail, over the phone, over the Internet, on a
personal computer or mobile device, or via texts, providing a number of different mode
options.
Mixed mode surveys are defined as surveys that involve mixtures of different contact
and interviewing methods with respondents. For example, a mixed mode survey might
contact sample members by phone or mail and then have them respond to a question-
naire over the Internet. Alternatively, a mixed mode survey might allow for multiple
forms of response. For example, sample frame members may be able to complete the
interview over the phone, by mail, or on the Web. In yet another variant, a mixed mode design
may encourage an Internet response with the first contact, but those who fail to respond
to the initial contact may later receive a mail survey, phone call, or face-​to-​face (FTF)
visit. Finally, even within a particular mode format the data may be collected differently
in some portions of the instrument. All of these variations are considered mixed mode
surveys.
Whether the survey is administered by another party, an interviewer, or the re-
spondent is a key structural component of the survey environment that has impor-
tant empirical implications for data quality and comparability (Fuchs, Couper, and
Hansen 2000; Atkeson, Adams, and Alvarez 2014). We define a survey as interviewer
administered when the interviewer is a live person who can independently interact with
the respondent. Thus, while there are many modes, and they have proliferated as tech-
nology has expanded, the presence or absence of an interviewer and the level of his or
her involvement in the survey process provides a structural feature that is a critical the-
oretical consideration in understanding survey response. We say “level of involvement”
because the presence of an administrator does not necessarily imply that the question-
naire is administered by an interviewer. For example, the growth of computer assisted
personal interviews (CAPIs), especially with regard to the administration of sensitive
questions, has created intrasurvey variation, with some questions having back-​and-​
forth interactions between the interviewer and the respondent, creating a dynamic
interview between them, and other questions having no interactions with the inter-
viewer, creating a self-​administered environment. Alternatively, an administrator may
provide students at a school with a paper questionnaire or voters at a polling location
with an exit questionnaire and he or she may remain present while the questionnaire is
answered, but that person’s level of involvement is minimal, creating an environment
more akin to self-​administered questionnaires (SAQs) than interviewer-​administered
questionnaires (IAQs).
Research suggests that interviewer-​ driven designs, regardless of mode, and
respondent-​driven designs, regardless of mode, provide largely the same within-​
mode response patterns (Atkeson and Tafoya 2008). However, researchers find some
differences between in-person and telephonic interviewing, especially relating to
expressing “don’t know” (DK). More respondents select DK in telephone surveys than
in FTF surveys (Aneshensel et al. 1982; Aquilino 1992; Groves and Kahn 1979; Jordan,
Marcus, and Reeder 1980; de Leeuw 1992). There also appears to be slightly less item
nonresponse in online surveys than in mail surveys (Kwak and Radler 2002).
Nevertheless, when it comes to observational errors related to respondent-​instrument
interactions, major differences generally depend on whether the respondent is assisted
by an interviewer when answering the survey instrument.
Over the last fifteen years we have seen increasing use of mixed mode surveys that em-
ploy multiple modes to reduce survey error, especially coverage and response error. The
purpose of these designs is to achieve better quality data, especially in terms of sample
representativeness (Dillman, Smyth, and Christian 2009). The expansion of survey
modes and the use of mixed mode surveys is in part due to the prohibitive costs associ-
ated with FTF interviewing, the introduction of new technology (the Web, interactive
voice response [IVR], the personal computer, fax machines, cell phones, etc.), and the
interaction of technology with population demographics. However, combining modes
may create data comparability problems because different visual and oral cues help to
structure the survey response. There is evidence that survey mode influences the quality
of the data collected (de Leeuw and Van Der Zouwen 1988; Fowler, Roman, and Di 1998;
Dillman et al. 1996). Theoretically, different modes lead to different types of survey error
because the survey context differs depending on interviewer-​respondent interactions
and survey presentation, which vary by survey contexts. The result is that each mode
produces a different response pattern, which may be due to social desirability, question
order, interviewer presence or absence, primacy or recency effects, or the visual layout
of questions (Fowler, Roman, and Di 1998; Schuman and Presser 1981; Schuman 1992;
Sudman, Bradburn, and Schwarz 1996; Christian and Dillman 2004; Smyth et al. 2006;
Tourangeau, Couper, and Conrad 2004).

Survey Research Modes


Past and Present

During the early years of systematic commercial and academic survey research, the
1930s through roughly 1970, there were largely two survey modes: FTF surveys and
mail surveys (Lyberg and Kasprzyk 1991). In the 1970s telephone became the domi-
nant survey methodology due to increased coverage of the telephone, the high cost of
FTF surveys, the speed at which phone surveys could be processed, and the relative
comparable quality of the data received (Groves and Kahn 1979). Since the 1990s the
Internet, using both probability based and nonprobability based samples, has risen
as a popular and formidable challenge to the telephone survey. In addition to the
cost savings associated with online surveys, their comparability to phone surveys in
terms of speed and data processing has made the Internet a popular methodology
(Couper 2000; Atkeson and Tafoya 2008). In the early 2000s reliable address based
sampling (ABS) became possible in the United States because the U.S. Postal Service
mailing list was made commercially available; it is the largest database with near
universal coverage of residential homes (Iannacchione 2011). Given the coverage
problems related to phone and Internet studies, this led to a resurgence of mail-​based
survey research in the last decade, including those surveys with a mixture of contact
(mail) and response (Internet) methods. In general we can say that over time there
has been a methodological shift from a survey environment that was dominated by
a personal interaction between the respondent and the interviewer (e.g., FTF and
phone) to one that is respondent driven (e.g., Internet and mail) (Dillman, Smyth,
and Christian 2009).
Regardless of mode, however, survey researchers have been vexed by two major
problems:  coverage issues and declining response rates. Telephone surveys, which
were seen as relatively cheap and provided near universal coverage in the 1970s, began
to have problems in the 1990s with the rise of mobile phones and the subsequent de-
cline in landlines. Telephone surveys largely relied on a methodology that used random
digit dialing (RDD) along with deep knowledge of landline phone exchanges including
area codes and prefixes to select probability samples. Since 2003 the National Health
Interview Survey (NHIS) has determined whether or not a family within a household
maintained a landline telephone (Blumberg and Luke 2016). The results, summarized in
Figure 3.1, show that over time household landline services have steadily declined, while
households with wireless service have steadily increased. The latest report available, for
January–July 2015, suggests that nearly half (47%) of all households were wireless- or
cell-phone-only households.

Figure 3.1  Percent wireless only households.
In addition, Blumberg and Luke (2016) report that of those households who have a
landline, 35% do not rely on it for all or even most of their calls, but instead receive all or
most of their calls on their wireless phones. The fact is that in many homes, even when
the landline is present it is more of a museum piece that collects sales calls than a valu-
able household communication device. These data clearly show that relying on landlines
to represent the general population leads to huge coverage error. The increase in wireless
homes and reliance on personal cell phones over household (HH) community phones
suggests that a substantial majority of all households are difficult or impossible to reach
using a traditional RDD or landline sampling frame.
Moreover, the problem of coverage is exacerbated because mobile-​ phone-​ only
households are not equally represented throughout the population. Younger adults,
nonwhites, renters, and poorer adults are much more likely to live in mobile-​phone-​
only homes (Blumberg and Luke 2016). Indeed, two-​thirds (67%) of adults aged twenty-​
five to thirty-​four, two-​thirds of all renters, and three in five (60%) Hispanics live in
mobile-phone-only households, compared to only two in five (40%) adults aged forty-
five to sixty-​four, 37% of HH in which a member of the HH owns the home, and 43% of
non-Hispanic whites (Blumberg and Luke 2016, 2–3). Figures 3.2 and 3.3 show estimates
of wireless-​only adults over time by ethnicity and age, respectively. With more than half
of the population relying, or mostly relying, on mobile phones, and huge differences in
key demographic variables, coverage issues for surveys that use landline based methods
are a serious threat to one of the main goals of survey research: accurate population
inference.
Of course the solution to this problem is to add cell phones to the mix of landlines in
the sample, and many phone surveys now include cell phone numbers. However, the
solution is not simple, and there are potential problems. The primary methodological
problem is that there is no sample frame that lists active cell phones or their regional lo-
cation. Random digit dialing sampling worked in part because area codes and prefixes
provided detailed information about respondent location, allowing for stronger survey
sampling designs that used hierarchical or clustered sampling criteria. Cell phones and
now landlines with portable numbers provide none of these advantages. In addition,
mobile phones are much more costly to reach because federal law requires that mobile
phone numbers be hand dialed by a person. These additional costs also reduce the
efficacy of this method.

Figure 3.2  Percent of adults without a landline by ethnicity (Hispanic, White, Black, Asian).

Figure 3.3  Percent of adults without a landline by age (18–24, 25–29, 30–34, 35–44, 45–64, 65+).
The Internet also has coverage problems. First, not everyone has Internet access, lim-
iting coverage. Pew estimates that approximately 87% of adults in the United States have
Internet access, a substantial increase since 1995, when Pew first started asking about
adult Internet use and penetration was only 14% (Perrin and Duggan 2015). Second,
those households with Internet access are systematically different from those that do not
have access, though the differences we saw between wireless only and wireless and land-
line households were larger. According to Pew, 96% of adults aged eighteen to twenty-​
nine have Internet access, but only 58% of adults ages sixty-​five and over do. The Internet
is heavily used by the educated: 95% of those with a college education, but only by 66% of
those who did not graduate from high school and by 76% of those who did. Ethnicity is
also a factor, with nearly universal coverage among English-​speaking Asians (97%), but
85% coverage for whites, 81% for Hispanics, and 78% for blacks.
Another problem with Internet coverage is that even if everyone had Internet access,
there is no sample frame or email list of Internet users. Consequently, because there is no
sample frame there is no way to select a random sample, simple, stratified, or clustered,
for a national, state, or local cross-​sectional study, which is required for probability
based sampling methods. Generally speaking, to use the Internet in a probability based
sampling design for a large cross-​section of voters, for example, sample respondents
must be contacted by some other method first, by phone or mail, and then provided with
the opportunity to complete the survey on the Web. For example, we have been involved
in local and statewide surveys of voters since 2004, and we contact sample respondents
via the mail and provide them with a URL with which to complete the survey online
(Atkeson and Tafoya 2008; Atkeson et al. 2010; Atkeson, Adams, and Alvarez 2014).
Alternatively, many Internet survey firms use nonprobability sampling methods that
rely on members of an opt-​in panel to approximate a population. For example, the best
nonprobability survey houses might rely on matching techniques that select a virtual
sample using census data and then match panel members to the virtual sample to create
a representative sample (Ansolabehere and Schaffner 2014). Others might use quota sam-
pling or weighting (Loosveldt and Sonck 2008). Finally, one Internet survey vendor,
GfK, recruits panel members through probability based methodologies including RDD
and ABS. Importantly, GfK collected data for the American National Election Studies
(ANES) in both 2008 and 2012, creating two publicly available and widely used Internet
surveys that can be combined with and compared to the traditional FTF election design.1
Another problem is that regardless of mode used, over the last fifty years we have seen
a steady decline in response rates for all types of surveys: government, private, and ac-
ademic (de Leeuw and De Heer 2002). Declining response rates raise concerns about
nonresponse error. Nonresponse error results when certain groups or populations self-​
select out of the study, potentially creating a biased survey. Nonresponse error is a valid
concern and can create significant problems for producing reliable sample statistics, like
the mean, that can lead to problems in survey inference (Peterson and Kerin 1981). For
example, Burden (2000) argues that in the ANES declining response rates are respon-
sible for poorer presidential turnout estimates.2
Mixed mode surveys represent a potential solution, especially for the problem of de-
clining response rates. First, they provide a means, using dual or multiframe designs,
for reaching different subgroups of sample members (Day et al. 1995; Groves and Kahn
1979; Shettle and Mooney 1999), and allow the researcher to tailor the survey contact
and response to respondent characteristics, since different respondents are likely attracted
to different modes based on familiarity and accessibility (de Leeuw 2005; Dillman 2000).
Second, mixed mode surveys may reduce nonresponse error if groups of respondents
who may not have either the motivation or the ability to respond do so when provided
with multiple or the right response options for them. For example, advance letters to
sample frame members that describe the study can create legitimacy and trust between
the respondent and the survey researcher that will increase response rates with follow-​
up phone calls (de Leeuw et al. 2004). In addition, information about the digital divide
suggests that Internet users tend to be younger, whiter, and more male, and thus a design
that relies heavily on the Internet may underrepresent important subgroups in the pop-
ulation of interest (Zickuhr and Smith 2012). Likewise, mail surveys may attract older
respondents (Atkeson and Tafoya 2008; Atkeson and Adams 2010). In this way, offering
multiple contact and response modes and being smart about how those are presented
can compensate for nonresponse problems that plague the use of any particular mode,
creating a highly representative survey that has the very desirable qualities of both relia-
bility and validity.
These factors have made mixed mode surveys increasingly popular over the last two
decades. According to Dillman et al. (2009, 11), one government administrator noted
that the change in the survey environment means, “We are trying to give respondents
what they want, but still do valid surveys. That means giving people a choice.” For ex-
ample, the American Community Survey first contacts potential respondents by mail;
those who do not respond receive a telephone survey, and if that fails it attempts a FTF
interview with a subsample of remaining nonrespondents (Alexander and Wetrogen
2000). Other government agencies, including the Bureau of Labor Statistics, with the
Current Employment Statistics Survey, and the Centers for Disease Control and Prevention, with the
National Survey of Family Growth, utilize mixed mode surveys.

Testing Mixed Mode Claims


Over the past decade we have been involved in administering post-​federal-​election
mixed mode surveys to large cross-​sections of voters. (For details on the survey meth-
odology and results see Atkeson and Tafoya 2008; Alvarez, Atkeson, and Hall 2007;
Atkeson et al. 2010, 2013, 2015; Atkeson, Adams, and Alvarez 2014). These voters are
randomly sampled from a list of registered voters provided by Bernalillo County, New
Mexico, the state of New Mexico, the state of Colorado, or the Democratic Party in the
case of the New Mexico primary in 2004. In each case the sample frame, voter regis-
tration files, contains descriptive information about sample members, including their
address, age, gender, and party registration, that allows us to make comparisons be-
tween the sample frame and sample respondents. Voters in New Mexico represent a
diverse cross-​section of the American public in terms of age, education, ethnicity, and
urbanization and as such provide a good testing ground for survey design questions.
There are no demographic or other contextual factors that make New Mexico partic-
ularly unique that would lead us to believe that our findings are not generalizable to
other cross-​sections of U.S. voters.3 Between 2006 and 2016 all sampled members were
contacted via a postcard and asked to respond to our survey in one of the following
ways: (a) with an attached mail survey, (b) by going to a specified URL and responding
online, or (c) by requesting a mail survey online or on the phone.4 In 2008 we also did
a post-​election, statewide telephone survey of voters. These election studies provide us
with some amount of response mode variation to examine how providing mode choice
might or might not skew respondent representativeness and whether providing a single
or many response options provides better response rates.

Effects on Representativeness
We begin by considering how providing the respondent with mail and Internet response
options affected the representativeness of our sample. Our reason for providing choice
was to address potential coverage issues for respondents who did not have access to the
Internet. In 2006, when we started examining general election voters, it was estimated
that only 71% of adults had access to the Internet and only about one-​third of residents
ages sixty-​five and over (Perrin and Duggan 2015). Given that age is a strong correlate of
voting participation (Rosenstone and Hansen 1993; Leighley and Nagler 2013), and that
sample information indicated the average voter in New Mexico and Colorado was age
52 and 53 respectively, we did not want to lose older voters because they could not access
our survey (Atkeson et al. 2010). Therefore, we offered voters a choice of participating
online or requesting a mail survey, and about one in five (20%) of respondents chose the
mail option, suggesting that it may have substantially increased our response rate. Over
the long term the costs of producing an opt-​in mail survey, changes in penetration, and
analyses of our sample characteristics made 2012 the last post-​election survey in which
we offered respondents this option.
Table 3.1 shows how providing the option of responding with a mail survey affected
survey representativeness for our New Mexico election studies between 2006 and 2012.
The expectation was that allowing more options for survey response would improve
both our response rates and sample representativeness. Table 3.1 provides the means for
percent female, age, percent Democratic, percent Republican, and percent decline to
state (DTS) party registration for the sample frame, the Internet respondents, and the
combined Internet and mail respondents (Internet + Mail). In addition, columns (5) and
(6) in the table display the differences between the sample means and the two survey
mode groups. In general, the results show that including the mail option does not im-
prove the representativeness of the survey respondents compared to the sample frame.
In fact, the Internet + Mail displays greater absolute error than Internet only in just over
half the estimates in Table 3.1. As expected based on the digital divide, we find that the
average age is higher in the Internet + Mail than the sample and the Internet only mode
in all four surveys. The other four demographic estimates (percent female, Democrat,
and DTS) do not exhibit consistent trends across years. On average, the differences in
error across the Internet and Internet + Mail modes are moderate, with the absolute
Table 3.1  Comparison of Survey Respondents by Mode to the Sample Frame by Gender, Age, and Party Registration by Year

Columns: (1) Sample; (2) Internet; (3) Internet + Mail; (4) Internet − (Internet + Mail); (5) Sample − Internet; (6) Sample − (Internet + Mail); (7) Error Difference, abs((5) − (6))

% Female
2006 54.0 52.7 53.9 −1.2 1.3 0.1 1.2
2008 54.2 54.7 55.6 −0.7 −0.5 −1.4 0.7
2010 54.0 52.7 52.1 0.6 1.3 1.9 0.6
2012 55.1 52.9 53.4 −0.5 2.2 1.7 0.5
Age
2006 51.6 51.4 54.5 −3.1 0.2 −2.9 *** 3.1
2008 48.0 53.3 55.7 −2.4 −5.3 *** −7.7 *** 2.4
2010 54.6 55.8 57.9 −2.0 −1.2 −3.3 *** 2.0
2012 50.7 56.2 58.6 −2.4 −5.5 *** −7.9 *** 2.4
% Democrat
2006 49.3 50.6 50.5 0.1 −1.3 −1.2 0.1
2008 50.1 54.9 55.1 −0.2 −4.8 * −5.0 * 0.2
2010 50.4 44.5 48.1 −3.6 5.9 ** 2.3 3.6
2012 48.2 50.1 52.0 −1.9 −1.9 −3.8 1.9
% Republican
2006 38.2 34.8 36.3 −1.5 3.4 1.9 1.5
2008 31.6 32.3 33.0 −0.7 −0.7 −1.4 0.7
2010 37.5 41.8 39.5 2.3 −4.3 * −2.0 2.3
2012 34.0 34.8 33.2 1.6 −0.8 0.8 1.6
% DTS
2006 12.5 14.7 13.2 1.5 −2.2 −0.7 1.5
2008 18.2 12.8 11.9 0.9 5.4 *** 6.3 *** 0.9
2010 12.1 13.7 12.3 1.4 −1.6 −0.2 1.4
2012 17.8 15.1 14.9 0.2 2.7 2.9* 0.2

Note: n for each of the four surveys: 2006 = 357 Internet, 90 mail; 2008 = 468 Internet, 115 mail;
2010 = 569 Internet, 233 mail; 2012 = 503 Internet, 109 mail.

mean difference between the Internet error and the Internet + Mail error being only
1.4.5 This research provides evidence that providing a mail survey option does not neces-
sarily lead to better survey representativeness. In fact, it can decrease it. Given that there
is little evidence over time that this method enhanced the representative nature of our
study, we stopped providing the mail option in 2014.
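A reader who wants to run this kind of frame-versus-respondent comparison on other data can do so in a few lines. The Python sketch below (using pandas) computes the quantities in columns (5) through (7) of Table 3.1; the data frame, variable names, and sample-frame means are hypothetical placeholders rather than our actual files.

    import pandas as pd

    # Hypothetical respondent file: one row per respondent, with the response
    # mode and demographics carried over from the voter registration sample frame.
    resp = pd.DataFrame({
        "mode":   ["internet", "internet", "mail", "internet", "mail"],
        "female": [1, 0, 1, 0, 1],
        "age":    [44, 61, 70, 37, 58],
    })

    # Hypothetical means computed from the full sample frame (voter file).
    frame_means = {"female": 0.551, "age": 50.7}

    internet_means = resp.loc[resp["mode"] == "internet"].mean(numeric_only=True)
    combined_means = resp.mean(numeric_only=True)   # Internet + Mail respondents

    for var, frame_mean in frame_means.items():
        err_internet = frame_mean - internet_means[var]    # column (5)
        err_combined = frame_mean - combined_means[var]    # column (6)
        err_gap = abs(err_internet - err_combined)         # column (7)
        print(var, round(err_internet, 2), round(err_combined, 2), round(err_gap, 2))

Whichever respondent pool shows the smaller absolute error tracks the sample frame more closely on that characteristic.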
Do Multiple Options Increase Response Rates?


In addition, in 2010 we conducted an experiment to test whether providing only one
option, Internet or mail, or allowing the respondent to choose their preferred mode,
affected response rates. We randomly assigned our sample of 8,500 individuals into
three treatments:  Internet (800), mail (500), and both (7,200). Table 3.2 provides
the response rates for each of the three treatments across different demographic
groups. We found that the mail-​only option displays the highest response rate, 19.6%,
compared to 8.5% for the Internet only and 8.8% for the Internet with mail option. The
response rates between the Internet only and both treatments are statistically indis-
tinguishable overall and across all subgroups. Interestingly, the response rate for the
mail-​only treatment is more than twice that for the other two treatments across all
demographic and party groups. Response rates also increase nearly monotonically with age.
These results suggest that the extra step of moving to the computer and finding the
URL or calling us and requesting a mail survey decreases the motivation of the re-
spondent to complete the survey, even though the mail survey requires voters to place
their survey in a preaddressed and stamped envelope and return it in the mail. Mail

Table 3.2  2010 Survey Response Rates for Three Mode Treatments (Internet-only, Mail-only, Both Internet and Mail) across Demographic Groups

 Internet Mail Both
Overall 8.5% 19.6% 8.8%
Gender
  Female 8.2% 19.8% 8.6%
  Male 8.9% 19.4% 9.1%
Age Categories
  18–​30 3.3% 6.4% 2.6%
  31–​45 4.1% 14.8% 5.4%
  46–​50 9.6% 22.9% 7.8%
  51–​65 12.2% 27.6% 12.1%
 66+ 11.8% 24.5% 14.9%
Geography
  Outside Abq 8.2% 18.1% 8.2%
 Abq 9.1% 22.2% 10.3%
Party Preference
 Democrat 6.3% 20.2% 8.3%
 DTS 6.0% 11.2% 6.2%
 Republican 13.9% 23.3% 11.4%
n 800 500 7,200
surveys may better activate social exchange and increase motivation to complete the
survey than a postcard that asks someone to find a URL. Although the mixed mode
option we discuss here, contact by mail and response over the Internet, reduces data
processing time and management costs to the researcher, it apparently raises the costs
and reduces the benefits compared to a mail survey for respondents.
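If one wanted to put a formal test on the response-rate gap in Table 3.2, a two-sample proportions test is the natural choice. The sketch below uses statsmodels and reconstructs approximate counts of completed surveys from the reported rates and treatment sizes, so the exact counts are illustrative rather than taken from our records.

    from statsmodels.stats.proportion import proportions_ztest

    # Approximate completes implied by Table 3.2 (19.6% of 500; 8.5% of 800).
    completes = [98, 68]       # mail-only, Internet-only
    assigned = [500, 800]      # randomly assigned treatment sizes

    z_stat, p_value = proportions_ztest(count=completes, nobs=assigned)
    print(f"z = {z_stat:.2f}, p = {p_value:.4f}")

With counts of this size the mail-only advantage is far larger than sampling error alone would produce.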
If we consider the costs of the two types of survey—​all mail response versus mixed
mode mail contact–​Internet response—​for the same number of respondents, sur-
prisingly, we find that the mail survey costs were slightly less by about $500 or just
under 10% of the costs of the Internet only survey. Table 3.3 breaks down the estimated
cost for each type of survey based on a desired N of 500 and under the assumptions
of Table 3.2: a 19.6% response rate for the mail survey only option and an 8.5% re-
sponse rate for the mail contact with the Internet only option reply. Based on the
assumed response rates, the mail sample size will need to be 2,551 and 5,882 for the
mixed mode survey. Of course these differences assume the principal investigator’s
time is constant across modes, which is not valid given that the human resources nec-
essary to manage the mail survey are much greater. It also assumes the cost of the
software for survey response is paid for by the university. Even if the institution does
not provide an online survey option, both SurveyMonkey and Google Surveys offer
free survey software. However, both free survey formats limit the kinds of branching
and random assignments available for the researcher. Therefore, depending on project
demands, additional software could be required to complete the Internet survey, thus
raising costs.
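The sample-size arithmetic underlying Table 3.3 is worth making explicit, since it drives the cost comparison. The short sketch below reproduces it in Python; the per-invitation cost at the end is a hypothetical placeholder, not a figure from our budgets.

    def invitations_needed(target_completes, response_rate):
        # Invitations required to expect the target number of completed
        # surveys, rounded to the nearest whole invitation.
        return round(target_completes / response_rate)

    print(invitations_needed(500, 0.196))   # mail-only design: about 2,551
    print(invitations_needed(500, 0.085))   # mail contact, Internet response: about 5,882

    # Total cost then scales with the number of invitations.
    hypothetical_cost_per_invitation = 0.90
    print(invitations_needed(500, 0.085) * hypothetical_cost_per_invitation)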
Given the relatively small variation in costs, the researcher should consider
whether the added management time of an all mail survey is worthwhile. It may be
better to have a more tailored mail survey, for example, that has a smaller sample size
and slightly higher costs, than to provide an Internet only option with a larger sample
but cheaper processing costs due to no labor costs related to data entry and inputting
mail dispositions, especially if nonresponse is believed to be related to the survey
content.
Further research consideration of the trade-​offs in costs and benefits of different
designs for researchers and respondents is necessary. Mail surveys or the Internet could
be a better methodology depending on the capacity of the researcher and expectations
regarding motivation of respondents across modes.

Combining Modes for Analysis

The fundamental question when using mixed mode survey methods is whether they
can be combined. Do responses to surveys across different modes create consistent and
reliable measures? Specifically, are the observational errors associated with different
modes such that they must be controlled for when using the data to answer substantive
questions of interest?
Table 3.3  Estimated Costs of Mail and Mixed Mode (Mail Contact—Internet Response) Survey for a Sample Size of 500

Mail Survey: Sample = 2,551; Response Rate = 19.6%
First Mailing (First Class) .47 (2,551) 1,198.97


Second Mailing (postcard, nonprofit) .17 (2,551) 433.67
Third Mailing (postcard, nonprofit) .17 (2,551) 433.67
Address Print on Postcard .05 (2,551*2) 255.10
Postcard Print .17 (638 sheets) 108.46
Envelopes #10 Window .181 (2,551) 462.24
Envelopes #9 for BRM .101 (2,551) 258.16
BRM Return .485 (500) 242.50
Survey Print 300.00
Stuffing Envelopes .07 (2,551) 408.16
Printing Envelopes: return address on .035 (2,551*2) 178.57
two envelopes
Data Entry (11 an hour, 50 hours) 11.00 (50) 555.00
Folding Survey 400.00
Total 5,179.50

Mixed Mode Survey: Sample = 5,882; Response Rate = 8.5%

First Mailing (first-​class postcard) .36 (5,882) 2,117.52


Second Mailing (nonprofit postcard) .17 (5,882) 999.94
Third Mailing .17 (5,882) 999.94
Address Print on Postcard .05 (5,882*3) 882.30
Postcard Print .17 (1,475 sheets*3) 752.25
URL 1/​month 6.00
Total 5,757.95

One primary concern is that different modes might lead to different response patterns,
leading to serious questions about data quality and comparability. This is true even when the
question wording is identical. This is also true for both cross-​sectional designs that collect
respondent data using different modes and panel designs in which respondent data are col-
lected over time using different survey modes. In the first case the question is: Do respondents
who answer the same questions across different survey modes result in the same distribution
of responses? In the second case the question is: Can questions be compared across the same
respondents over time when the data were collected using different survey modes?
Some have suggested that mode of response may influence survey response, which
may influence the reliability and validity of the results (de Leeuw and Van Der Zouwen
1988; Fowler, Roman, and Di 1998; Dillman et al. 1996). The problem is that contex-
tual cues present in a survey differ depending on their presentation and the presence
or absence of an interviewer. In this way, whether the survey is administered by the
interviewer or by the respondent may influence respondent answers, potentially
creating mode biases that can lead to problems of inference if not handled correctly
(Peterson and Kerin 1981; Campbell 1950; Mensch and Kandel 1988). If we imagine
that the survey process is similar to a conversation (Schwarz 1996), then the context
provided by the survey either through the interviewer or through the presentation of
question and answer scales may affect question interpretation and response. If such is
the case, then it may be problematic to combine identical questions across modes into
the same variable to obtain an aggregate representation of the cross-​section or panel
attitudes or behaviors. Indeed, when mode changes over time it could make changes
seen in panel data unreliable and therefore make inferences from the data impossible.
One example where this is a problem is in the 2000 ANES (Bowers and Ensley 2003),
in which respondents were interviewed in person, over the phone, or by a combina-
tion of both methods.
Problems associated with survey mode are likely due to the interactions among
the survey mode (either self-​administered or interviewer administered), the instru-
ment, and the respondent. An interviewer encourages social desirability effects on
certain types of questions; he or she can also affect response choice by encouraging
either primacy or recency effects in response answers and thus influence item re-
sponse. The lack of an interviewer means that the visual layout of questions, such
as spacing, might influence responses in a unique way (Fowler, Roman, and Di
1998; Schuman and Presser 1981; Schuman 1992; Sudman, Bradburn, and Schwarz
1996; Christian and Dillman 2004; Smyth et  al. 2006; Tourangeau, Couper, and
Conrad 2004).
One consistent finding in the literature is that IAQs lead to less item nonresponse
than SAQs within the survey (Tourangeau, Rips, and Rasinski 2000; Brøgger et al. 2002;
Van Campen et al. 1998; though see de Leeuw 1992). The lack of an interviewer per-
haps reduces engagement with the instrument, resulting in more skipped responses.
Respondents may be more likely to miss questions in SAQs because they do not follow
instructions, they do not understand the question, or they simply are not willing to an-
swer it and no one is there to encourage them to do so.
There is also some evidence that open-​ended responses are impacted by mode, with
differences across FTF, phone, and Internet/​mail. Open-​ended responses are valu-
able to researchers because they provide hints about how respondents understand the
question and allow the respondents to answer in their own words. Research shows that
FTF surveys provide more open-​ended responses than phone surveys, perhaps because
of the faster pace and lack of encouraging body language in phone surveys (Groves and
Kahn 1979; Kormendi and Noordhoek 1989).
Effects of Survey Presentation


Differences due to oral or visual presentation may also matter. Several studies show that
the layout of questions and answers, including the spacing on surveys, can influence
response patterns, and that even spacing produces the least biased results (Tourangeau
et al. 2004). In general, studies have shown that spacing, particularly the midpoint, as
a visual cue influences response patterns. Therefore, we always attempt on our SAQs
to place the DK option further away from the response set to differentiate it from the
scale and ensure the proper midpoint (Tourangeau et al. 2004; Christian, Parsons, and
Dillman 2009).
In the absence of an interviewer, the visual layout of survey questions can be very
important to response patterns, though it is not always consequential.
In 2008 we fielded separate phone and mixed mode Internet and mail surveys to
a sample of voters in the state of New Mexico. In the telephone survey voters were
not prompted with the DK answer, but in the Internet survey it was a visible option
for the respondent. This difference in presentation had no effect for most questions.
For example, there were no differences in DK responses across a series of questions
about the ideology of eight candidates, vote confidence, internal efficacy, the number
of days respondents pay attention to the news, and how many days a week they dis-
cuss politics. In fact, despite the differences in DK presentation due to the presence
or absence of an interviewer, with the exception of one series of questions about the
frequency of various types of voter fraud, there were no differences in DK responses
across modes.
On the voter fraud questions that displayed significant differences in the number of
DK responses across the IAQ and SAQ formats, we asked, “I’m going to read a list of
possible illegal election activities that may or may not take place in your community and
I want you to tell me if you think each event occurs: all or most of the time, some of the
time, not much of the time, or never.” For each of the activities we found a significant
difference (p < .001, two-​tailed test) between means across modes (percentage point
difference in parentheses; a positive number indicates that the online option produced
more DK responses), including the following: a voter casts more than one ballot (21%);
tampering with ballots to change votes (26%); someone pretends to be another person
and casts a vote for them (21%); and a non-U.S. citizen votes (23%). We also asked, “If
election fraud happens at all, do you think it is more likely to take place with absentee or
mail voting or in-​person voting in a polling place?” and found a significant difference
between means (p < .001, two-​tailed test), with a mean difference of 18% between the
SAQ and IAQ in DK responses. Of course part of the explanation lies in the different
presentation of the DK response, but this was the same for all of the questions on the
survey, and we only saw differences in DK response across this one set of questions,
so the reason is not simply the fact that DK was left out of the verbal presentation of
the questions. We suspect that these are very difficult questions to answer and there-
fore are likely questions for which respondent uncertainty was very high, increasing the
likelihood of a DK response. Indeed, even in the telephone survey the DK percentages
for these questions were much higher than for other questions. Given these factors, the
SAQ that presented a DK option may have better measured that uncertainty than the
phone survey by allowing people to feel they could easily choose DK. This suggests that
questions that may have unusually high DK responses relative to other survey items in
the interviewer setting may actually be problematic questions, producing biased results
due to a high degree of uncertainty regarding the correct response and its interaction
with an interviewer. Perhaps social desirability issues led voters to be more likely to
hazard a guess in the interviewer scenario than in the self-​administered scenario.
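As an illustration of how a mode gap in DK responding might be checked, the sketch below runs a chi-square test on a two-by-two table of DK versus substantive answers by mode. The counts are hypothetical, and the analyses described above compared means with two-tailed tests rather than this exact procedure.

    import numpy as np
    from scipy.stats import chi2_contingency

    # Hypothetical counts for one voter fraud item:
    # rows = mode, columns = (DK, substantive answer).
    counts = np.array([
        [260, 740],   # self-administered (Internet/mail)
        [50, 950],    # interviewer-administered (phone)
    ])

    chi2, p, dof, expected = chi2_contingency(counts)
    print(f"chi-square = {chi2:.1f}, df = {dof}, p = {p:.3g}")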

Survey Response: Social Desirability and Satisficing


Some of the most consistent and strongest findings in the literature involve socially de-
sirable responses. Social desirability refers to the need for respondents to present them-
selves in the most favorable way and may be especially pervasive when an interviewer is
present (London and Williams 1990; Aquilino 1994). Research shows that SAQs result
in fewer socially desirable responses than IAQs across a variety of issues (Chang and
Krosnick 2009, 2010; Fowler, Roman, and Di 1998; Schuman and Presser 1981; Schuman
1992; Sudman, Bradburn, and Schwarz 1996; Christian and Dillman 2004; Smyth et al.
2006; Tourangeau, Couper, and Conrad 2004).
Social desirability response theory suggests that one cue for survey response is the
perceived expectations of those around the respondent during the interview, espe-
cially the interviewer in a telephone or FTF survey. In these cases, the pressure of the
interviewing situation leads respondents to answer questions in socially desirable
ways. For example, this potential problem is seen consistently in ANES studies in
which large numbers of respondents indicate that they voted, when in fact they did not
(Traugott 1989; Belli, Traugott, and Beckmann 2001; Atkeson, Adams, and Alvarez 2014;
but see Berent, Krosnick, and Lupia 2016 for an alternative perspective). The fact that
respondents have spent many hours with an interviewer in their own homes on
more than one occasion talking almost exclusively about politics leads respondents to
give the socially desirable response (Presser 1990). Similarly, research on overreporting
for the winner suggests the same problem (Wright 1990, 1993; Atkeson 1999).
Theoretically the presence of an interviewer raises the concern for the respondent
that his or her answers may be met with disapproval, leading respondents to provide
more socially favorable and biased responses. Social desirability appears in the form
of overreporting of good behaviors and underreporting of bad ones. While voter
turnout has been the most closely researched social desirability effect in political
science (Holbrook and Krosnick 2010; Blair and Imai 2012), it is likely that social de-
sirability invades other political attitudes as well. Sensitive questions that focus on the
respondent’s capability or ability often induce socially desirable responses that make
the respondents seem healthier, more obedient, and more efficacious (Blair and Imai
2012; Gingerich 2010; Tourangeau and Yan 2007; Holbrook, Green, and Krosnick 2003;
Kreuter, Presser and Tourangeau 2008).
Self-administered questionnaires alternatively afford the individual greater privacy
and anonymity, reducing or eliminating the socially desirable response. We compared
social desirability effects across our post-​election 2008 IAQ (phone) and SAQ (Internet/​
mail) surveys (Atkeson, Adams, and Alvarez 2014). Using matching techniques to
isolate any sampling effects, we found strong evidence for social desirability in ego-​
driven questions, including personal voter confidence, state voter confidence, county
voter confidence, vote experience, trust in government, and internal efficacy, but not
in common behavior questions such as how much voters watched or read the news or
discussed politics, the amount of time they waited in line to vote, their vote choices for
president and the U.S. Senate, whether they regularly carry photo identification, if they
convinced others how to vote, and if they gave money to parties and candidates. The fact
that social desirability influences responses differently across modes creates problems
for comparisons across survey modes within either a cross-​section or a panel.
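The matching step mentioned above can be approximated generically: match each phone respondent to the most similar Internet/mail respondent on observed covariates, then compare outcomes on the matched pairs. The sketch below, using scikit-learn and simulated data, is only a stand-in for the procedure in Atkeson, Adams, and Alvarez (2014), whose implementation details differ.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(1)

    # Simulated standardized covariates (e.g., age, gender, party) and a binary
    # outcome (e.g., "very confident" in one's vote counting) for each mode.
    X_phone, y_phone = rng.normal(size=(300, 3)), rng.integers(0, 2, 300)
    X_web, y_web = rng.normal(size=(600, 3)), rng.integers(0, 2, 600)

    # For each phone (IAQ) respondent, find the closest web (SAQ) respondent.
    nn = NearestNeighbors(n_neighbors=1).fit(X_web)
    _, matched_idx = nn.kneighbors(X_phone)

    # A sizable gap on ego-driven items, after matching on covariates, is
    # consistent with social desirability pressure in the interviewer mode.
    print(y_phone.mean() - y_web[matched_idx.ravel()].mean())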
Satisficing may be another important problem that depends on mode response
and theoretically may have a social desirability component. Satisficing occurs when
respondents answer questions with little motivation and with minimal cognitive
effort (Krosnick 1991, 1999; Chang and Krosnick 2009, 2010). It leads respondents to
choose satisfactory responses as opposed to optimized responses, in which respondents
carefully consider the question, retrieve relevant information from memory, make
judgments about preferences, and then choose the best survey option (Cannell et al.
1981; Schwarz and Strack 1985; Tourangeau and Rasinski 1988).
One way to measure satisficing is by analyzing the degree of nondifferentiation within
a battery of survey questions; other methods include examining the quantity of open-​
ended responses or response times. Mixed mode surveys that combine both oral and
self-​administered surveys may produce different rates of satisficing due to the different
visual patterns and/​or the different cognitive effort involved in the survey. In particular,
interviewer-​driven surveys may motivate respondents to be attentive to the survey environ-
ment, and social desirability effects may reduce respondent incentives to respond quickly
and with little effort to questions that probe different objects on the same scale (e.g., ide-
ology or thermometer scores). For the respondent-​driven interview the visual cues, for
example an answer grid, may encourage identical responses across different items due to
reduced motivation. Some research shows that SAQs are more likely to display greater levels
of nondifferentiation (Fricker et al. 2005; Atkeson, Adams, and Alvarez 2014), suggesting
they have increased levels of satisficing (but see Chang and Krosnick 2009, 2010).
To examine these claims we used publicly available ANES data to compare rates of
nondifferentiation or satisficing between FTF and Internet respondents. We used the 2008
ANES traditional FTF design and the Evaluating Government and Society Study (EGSS)
surveys that used the GfK panel to complete an Internet survey. We utilized three ideo-
logical proximity scores to identify satisficers, including self-​ideology as well as the ide-
ology of the Democratic and Republican parties. The variable is dichotomous and takes
on a value of one when a respondent perceives his or her own ideology, the ideology of the
Democratic party, and the ideology of the Republican party as identical (e.g., self = very
liberal, Democrats = very liberal, and Republicans = very liberal); any respondent scoring
Table 3.4  Frequencies of Whether or Not Respondent Places Self, Democrats, and Republicans as the Same on the Liberal-Conservative Scale

                      2008 ANES            EGSS
                      Percentage  Count    Percentage  Count
Differentiation       96.9        1,539    84.5        4,098
Nondifferentiation    3.1         49       15.5        753
Total                 100.0       1,588    100.0       4,852

Note: Data available from the American National Election Studies, using V080102 (Post-​
Election Weight) for the 2008 NES; c1_​weigh (EGSS1), c2_​weigh (EGSS2), c3_​weigh
(EGSS3), and c4_​weigh (EGSS4) for the merged EGSS.

one or more variables differently was coded as zero (e.g., self = moderate, Democrats = very
liberal, and Republicans = very conservative). Table 3.4 shows our results. We found that
satisficing was five times more likely in the SAQ than in the IAQ, which is very troubling
and suggests that simply combining modes may be problematic.
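The coding rule behind Table 3.4 is easy to reproduce. The sketch below flags straight-lining on the three seven-point placements; the variable names are hypothetical, but the rule is the one described above: a respondent is a nondifferentiator only when all three placements are identical.

    import pandas as pd

    # Hypothetical 7-point placements (1 = very liberal, 7 = very conservative).
    df = pd.DataFrame({
        "ideo_self": [4, 1, 6, 4],
        "ideo_dem":  [2, 1, 3, 4],
        "ideo_rep":  [6, 1, 6, 4],
    })

    # 1 = nondifferentiation (self, Democrats, and Republicans placed identically).
    df["nondiff"] = ((df["ideo_self"] == df["ideo_dem"]) &
                     (df["ideo_self"] == df["ideo_rep"])).astype(int)

    print(df["nondiff"].mean())   # share flagged; compare this rate across modes

In practice these shares would be computed with the post-stratification weights noted beneath Table 3.4 before comparing the FTF and Internet samples.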
Increases in satisficing can be seriously problematic because of the additional error
introduced into the variable. In this case, the grid type measurement encouraged a sub-
stantial number of respondents to simply straight line their responses, leading to inaccurate
measures of the parties or their personal ideology. The high degree of straight lining suggests
that the level of engagement may be low for some respondents in Internet surveys. Research on
internal manipulation checks suggests that between one-third and one-half of respondents in
online panels shirk and fail questions that ask for specific response patterns (Oppenheimer,
Meyvis, and Davidenko 2009; Berinsky, Margolis, and Sances 2013). Examination of grids
may provide opportunities for alternative ways to identify survey shirkers.
Social desirability and satisficing are two different sources of measurement error in survey
questions. Social desirability is a problem mainly for IAQs, and satisficing is largely a problem
in SAQs. If many questions are sensitive or have ego-​based references, it may be important
to reduce interviewer-​respondent interaction through an online interview or CAPI. On the
other hand, if large numbers of questions are similar in type, such as a series of Likert type
scales, it may be necessary to use an interviewer to help maintain respondent motivation and
engagement. Thus the subject of questions may help identify whether the survey should be
single or mixed mode and whether (or when) an interviewer should be present or not.

Discussion

In summary, research shows that mode matters. It can affect who responds, how engaged
with or motivated by the survey instrument they are, and their responses. Mode may
be especially worrisome for ego-driven items in an IAQ, where it can yield inflated reports
of satisfaction, confidence, health, and moral behavior. Given these cross-mode
concerns, paying attention to these mode effects is important to researchers’ analysis
and conclusions, both when they design their own studies and when they use secondary
data that rely on multiple modes.
Some of the main reasons scholars use multiple modes are to reduce survey costs and
increase response rates. Reduction in costs occurs because the researcher often begins
with the cheapest collection mode and then moves on to more expensive modes because
of nonresponse (Holmberg, Lorenc, and Werner 2008). For example, the U.S. Census
employed a mixed mode design in 2010 that first tried to obtain responses by mail and
eventually moved to FTF follow-​ups with nonrespondents. When researchers have se-
rious concerns about nonresponse, offering a mixed mode survey that uses increasingly
expensive contact and/​or response methods to obtain survey completion might out-
weigh the problems associated with any mode effects. However, identifying mode effects
in surveys is difficult because mode is often confounded with the respondent activity
in the survey. For example, a respondent who responds to the first Internet treatment
may be systematically different from those who respond to the more costly subsequent
attempts to complete the survey. Therefore, differences between respondents across
modes may be due not simply to mode but to a combination of mode and respondent
motivation, making simple dummy variables for mode problematic as controls in
multivariate models. Nevertheless, if the researcher is
concerned about response rates and response bias, a mixed mode option may make a lot
of sense.
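To make the point about mode dummies concrete, the sketch below fits a simple model on simulated pooled data with a mode indicator and a mode-by-motivation interaction. Every variable and coefficient here is invented for illustration, and even the interaction only partially addresses the selection problem just described.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(2)
    n = 1000

    # Simulated pooled data from two response modes; "interest" is a stand-in
    # for a respondent-motivation proxy measured in the survey.
    df = pd.DataFrame({
        "web_mode": rng.integers(0, 2, n),
        "interest": rng.integers(1, 5, n),
        "age": rng.integers(18, 90, n),
    })
    df["confidence"] = (2.5 + 0.3 * df["interest"] - 0.2 * df["web_mode"]
                        + rng.normal(scale=1.0, size=n))

    # A bare mode dummy treats mode as if it were randomly assigned; letting the
    # mode gap vary with the motivation proxy is one modest improvement.
    model = smf.ols("confidence ~ web_mode * interest + age", data=df).fit()
    print(model.params)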
Mixed mode surveys may also be a good choice when a researcher is concerned about
survey time or costs. We contact people by mail and try to motivate them to respond on-
line because the additional costs associated with processing mail responses to our small
research team is very high and there appear to be few problems with response bias even
with low response rates. However, this may not be the case for all research questions.
Therefore, a consideration of costs and errors is critical in determining the right survey
contact and response modes. Survey research has always been a delicate balance be-
tween costs and errors, and mixed mode designs offer a new consideration related to
these potential trade-​offs.
Over the past fifteen years mixed mode surveys have increased in popularity. As
mode options continue to expand and integrate with one another (e.g., FTF with CAPI),
researchers will need to continue to consider and examine the effect of mode on data
quality. Understanding how modes differ, their characteristics, and how these factors
influence survey response and nonresponse will be critical for reducing observational
and nonobservational errors. These factors are important so that survey researchers can
make informed decisions on the mode or modes best suited for their study. Continued
research on mode and its effects needs to be done so that knowledge can guide mixed
mode research designs and analysis.
Notes
1. See the ANES 2012 Time Series data page at http://​www.electionstudies.org/​studypages/​
anes_​mergedfile_​2012/​anes_​mergedfile_​2012.htm.
2. See Martinez (2003), McDonald (2003) and Burden (2003) for further discussion of
this issue.
3. For example, we saw no differences in 2006 between the response rates and request for a
mail survey between Colorado and New Mexico, suggesting that at least with regard to
survey design issues, these results are transferable.
4. On the postcard in 2014 we only asked respondents to complete the survey online. We then
sent mail surveys to a subsample of nonrespondents.
5. The absolute value of (sample—​Internet) subtracted from (sample—​Internet + Mail).

References
Alexander, C. H. Jr., and S. Wetrogan. 2000. Integrating the American Community Survey and
the Intercensal Demographic Estimates Program. Proceedings of the American Statistical
Association at https://​www.census.gov/​content/​dam/​Census/​library/​working-​papers/​
2000/​acs/​2000_​Alexander_​01.pdf (accessed January 3, 2017).
Alvarez, R. M., L. R. Atkeson, T. E. Hall. 2007. “The New Mexico Election Administration
Report:  The 2006 New Mexico Election,” Unpublished manuscript, University of New
Mexico. http://​www.saveourvotes.org/​reports/​2007/​8-​02nm-​elections-​caltech-​mit.pdf.
Aneshensel, C., R. Frerichs, V. Clark, and P. Yokopenic. 1982. “Measuring Depression in
the Community:  A Comparison of Telephone and Personal Interviews.” Public Opinion
Quarterly 46: 110−121.
Ansolabehere, S., and B. Schaffner. 2014. “Re-​Examining the Validity of Different Survey
Modes for Measuring Public Opinion in the U.S.: Findings from a 2010 Multi-Mode
Comparison.” Political Analysis 22 (3): 285–303.
Aquilino, W. 1992. “Telephone Versus Face-​to-​Face Interviewing for Household Drug Use
Surveys.” International Journal of the Addictions 27: 71−91.
Aquilino, W. 1994. “Interview Mode Effects in Surveys of Drug and Alcohol Use:  A Field
Experiment.” Public Opinion Quarterly 58 (2): 210–​240.
Atkeson, L. R. 1999. “ ‘Sure, I Voted for the Winner!’ Over Report of the Primary Vote for
the Party Nominee in the American National Election Studies.” Political Behavior 21
(3): 197−215.
Atkeson, L. R., and A. N. Adams. 2010. “Mixed Mode (Internet and Mail) Probability Samples
and Survey Representativeness:  The Case of New Mexico 2008.” Paper presented at the
Western Political Science Association, April 1−April 4, San Francisco, CA.
Atkeson, L. R., A. N. Adams, and R. M. Alvarez. 2014. “Nonresponse and Mode Effects in Self
and Interviewer Administered Surveys.” Political Analysis 22 (3): 304−320.
Atkeson, L. R., A. N. Adams, C. Stewart, and J. Hellewege. 2015. “The 2014 Bernalillo County
Election Administration Report.” Unpublished manuscript, University of New Mexico.
https://polisci.unm.edu/common/documents/2014-bernalillo-county-nm-election-administration-report.pdf.
Atkeson, L. R., L. A. Bryant, and A. N. Adams. 2013. “The 2012 Bernalillo County Election
Administration Report.” Unpublished manuscript, University of New Mexico. http://​www.
unm.edu/​~atkeson/​newmexico.html.
Atkeson, L. R., L. A. Bryant, A. N. Adams, L. Zilberman, and K. L. Saunders. 2010. “Considering
Mixed Mode Surveys for Questions in Political Behavior: Using the Internet and Mail to Get
Quality Data at Reasonable Costs.” Political Behavior 33: 161−178.
Atkeson, L. R., and L. Tafoya. 2008. “Surveying Political Activists:  An Examination of the
Effectiveness of a Mixed-​mode (Internet and Mail) Survey Design.” Journal of Elections,
Public Opinion and Parties 18 (4): 367−386.
Barent, M. K., J. A. Krosnick, and A. Lupia. 2016. “Measuring Voter Registration and Turnout
in Surveys: Do Official Government Records Yield More Accurate Assessments?” Public
Opinion Quarterly 80 (3): 597–​621.
Belli, R. F., M. Traugott, and M. N. Beckmann. 2001. “What Leads to Voting Overreports and
Admitted Nonvoters in the American National Election Studies.” Journal of Official Statistics
17 (4): 479−498.
Berinsky, A., M. Margolis, and M. Sances. 2013. “Separating the Shirkers from the Workers?
Making Sure Respondents Pay Attention on Self-​Administered Surveys.” American Journal
of Political Science 58 (3): 739–​753.
Blair, G., and K. Imai. 2012. “Statistical Analysis of List Experiments.” Political Analysis
20: 47−77.
Blumberg, S. J., and J. V. Luke. 2016. “Wireless Substitution: Early Release of Estimates from the
National Health Interview Survey.” January−July 2015. http://​www.cdc.gov/​nchs/​data/​nhis/​
earlyrelease/​wireless201512.pdf.
Bowers, J., and M. J. Ensley. 2003. “Issues in Analyzing Data from the Dual-​Mode 2000
American National Election Study.” NES Technical Report Series, Document nes010751.
http://​www.electionstudies.org/​resources/​papers/​technical_​reports.htm.
Burden, B. C. 2000. “Voter Turnout and the National Election Studies.” Political Analysis 8
(4): 389−398.
Burden, B. C. 2003. “Internal and External Effects on the Accuracy of NES Turnout: Reply.”
Political Analysis 11 (2): 193−195.
Brøgger, J., P. Bakke, G. Eide, and A. Guldvik. 2002. “Comparison of Telephone and Post Survey
Modes on Respiratory Symptoms and Risk Factors.” American Journal of Epidemiology
155: 572−576.
Campbell, D. T. 1950. “The Indirect Assessment of Social Attitudes.” Psychological Bulletin 47
(January): 15−38.
Cannell, C. F., P. V. Miller, and L. Oksenberg. 1981. “Research on Interviewing Techniques.”
In Sociological Methodology, edited by S. Leinhardt, 389−437. San Francisco,
CA: Jossey-​Bass.
Chang, L., and J. Krosnick. 2009. “National Surveys via RDD Telephone Interviewing Versus
the Internet: Comparing Sample Representativeness and Response Quality.” Public Opinion
Quarterly 73 (4): 641−678.
Chang, L., and J. Krosnick. 2010. “Comparing Oral Interviewing with Self-​Administered
Computerized Questionnaires: An Experiment.” Public Opinion Quarterly 74 (1): 154−167.
Christian, L. M., and D. A. Dillman. 2004. “The Influence of Graphical and Symbolic Language
Manipulations on Responses to Self-Administered Questions.” Public Opinion Quarterly 68
(1): 57−80.
Christian, L. M., N. L. Parsons, and D. A. Dillman. 2009. “Measurement in Web Surveys: Understanding the Consequences of Visual Design and Layout.” Sociological Methods and Research 37: 393−425.
Couper, M. P. 2000. “Web Surveys:  A Review of Issues and Approaches.” Public Opinion
Quarterly 64: 464–​494.
Day, N. A., D. R. Dunt, and S. Day. 1995. “Maximizing Response to Surveys in Health Program
Evaluation At Minimum Cost Using Multiple Methods.” Evaluation Review 19 (4): 436−450.
de Leeuw, E. 1992. Data Quality in Mail, Telephone and Face-to-Face Surveys. Amsterdam: TT-Publikaties.
de Leeuw, E., and W. de Heer. 2002. “Trends in Household Survey Nonresponse: A Longitudinal
and International Comparison.” In Survey Nonresponse, edited by R. M. Groves, D. A.
Dillman, J. L. Eltinge, and R. J. A. Little, 41–​55. New York: John Wiley & Sons Inc.
de Leeuw, E., J. Hox, E. Korendijk, G. Lensvelt-​Mulders, and M. Callegaro. 2004. “The Influence
of Advance Letters on Response in Telephone Surveys: A Meta-​analysis.” Paper presented at
the 15th International Workshop on Household Survey Nonresponse, Maastricht.
de Leeuw, E., and J. Van der Zouwen. 1988. “Data Quality in Telephone and Face to Face
Surveys: A Comparative Meta-analysis.” In Telephone Survey Methodology, edited by
R. Groves, P. P. Biemer, L. Lyberg, J. T. Massey, W. L. Nicholls, and J. Waksberg, 283–300.
New York: John Wiley & Sons.
de Leeuw, E. 2005. “To Mix or Not to Mix:  Data Collection Modes in Surveys.” Journal of
Official Statistics 21: 233−255.
Dillman, D. A. 2000. Mail and Internet Surveys:  The Tailored Design Method. 2nd ed.
New York: Wiley.
Dillman, D. A., A. R. Sangster, J. Tarnai, and T. Rockwood. 1996. “Understanding Differences
in People’s Answers to Telephone and Mail Surveys.” New Directions for Evaluation 70: 45−62.
Dillman, D. A., J. Smyth, and L. M. Christian. 2009. Internet, Mail, and Mixed-​Mode
Surveys: The Tailored Design Method. New York: Wiley.
Elinson, J. 1992. “Methodology Issues.” In A Meeting Place:  The History of the American
Association for Public Opinion Research, edited by P. B. Sheatsley and W. J. Mitofsky,
AAPOR. Available at:  http://​www.aapor.org/​AAPOR_​Main/​media/​MainSiteFiles/​A_​
Meeting_​Place_​-​_​The_​History_​of_​AAPOR_​(1992)_​-​_​Methodology_​Issues.pdf, accessed
January 3, 2017.
Fowler, F. J., Jr., A. M. Roman, and Z. X. Di. 1998. “Mode Effects in a Survey of Medicare Prostate
Surgery Patients.” Public Opinion Quarterly 62 (1): 29−46.
Fricker, S., M. Galesic, R. Tourangeau, and T. Yan. 2005. “An Experimental Comparison of
Web and Telephone Surveys.” Public Opinion Quarterly 69 (Fall): 370−392.
Fuchs, M., M. Couper, and S. Hansen. 2000. “Technology Effects: Do CAPI Interviews Take
Longer?” Journal of Official Statistics 16: 273−286.
Gingerich, D. W. 2010. “Understanding Off-​the-​Books Politics: Conducting Inference on the
Determinants of Sensitive Behavior with Randomized Response Surveys.” Political Analysis
18: 349−380.
Groves, R. M., and R. L. Kahn. 1979. Surveys by Telephone:  A National Comparison with
Personal Interviews. New York: Academic Press.
Holbrook, A. L., M. C. Green, and J. A. Krosnick. 2003. “Telephone Versus Face-​to-​Face
Interviewing of National Probability Samples with Long Questionnaires.” Public Opinion
Quarterly 67 (Spring): 79−125.
Holbrook, A. L., and J. A. Krosnick. 2010. “Social Desirability Bias in Voter Turnout
Reports: Tests Using the Item Count Technique.” Public Opinion Quarterly 74 (1): 37−67.
Holmberg, A., B. Lorenc, and P. Werner. 2008. “Optimal Contact Strategy in a Mail and Web
Mixed Mode Survey.” Paper presented at the General Online Research Conference (GOR
08), Hamburg, March. Available at:  http://​ec.europa.eu/​eurostat/​documents/​1001617/​
4398401/​S8P4-​OPTIMAL-​CONTACT-​STRATEGY-​HOLMBERGLORENCWERNER.pdf,
accessed January 3, 2017.
Iannacchione, V. 2011. “The Changing Role of Address-​Based Sampling in Survey Research.”
Public Opinion Quarterly 75 (3): 556−575.
Jordan, L., A. Marcus, and L. Reeder. 1980. “Response Styles in Telephone Household
Interviewing: A Field Experiment.” Public Opinion Quarterly 44: 201−222.
Körmendi, E., and J. Noordhoek. 1989. “Data quality and telephone interviews.” Copenhagen,
Denmark: Danmarks Statistik.
Kreuter F., S. Presser, and R. Tourangeau. 2008. “Social Desirability Bias in CATI, IVR, and
Web Surveys:  The Effects of Mode and Question Sensitivity.” Public Opinion Quarterly
72: 847−865.
Krosnick, J. A. 1991. “Response Strategies for Coping with the Cognitive Demands of Attitude
Measures in Surveys.” Applied Cognitive Psychology 5: 213−236.
Krosnick, J. A. 1999. “Maximizing Questionnaire Quality.” In Measures of Political
Attitudes, pp. 37–​ 58, edited by J. P. Robinson, P. R. Shaver, and L. S. Wrightsman,
New York: Academic Press.
Kwak, N., and B. Radler. 2002. “A Comparison Between Mail and Web Surveys:  Response
Pattern, Respondent Profile, and Data Quality.” Journal of Official Statistics 18 (2): 257−273.
Leighley, J. E., and J. Nagler. 2013. Who Votes Now? Demographics, Issues, Inequality and
Turnout in the United States. Princeton, NJ: Princeton University Press.
Loosveldt, G., and N. Sonck. 2008. “An Evaluation of the Weighting Procedures for an Online
Access Panel Survey.” Survey Research Methods 2: 93−105.
London, K., and L. Williams. 1990. “A Comparison of Abortion Underreporting in an In-​
Person Interview and Self-​Administered Questionnaire.” Paper presented at the annual
meeting of the Population Association of America, Toronto.
Lyberg, L. E., and D. Kasprzyk. 1991. “Data Collection Methods and Measurement Error: An
Overview.” In Measurement Errors in Surveys, edited by P. P. Biemer, R. M. Groves, L. E.
Lyberg, N. A. Mathiowetz, and S. Sudman, 237–​258. New York: Wiley.
Martinez, M. D. 2003. “Comment on ‘Voter Turnout and the National Election Studies.’ ”
Political Analysis 11: 187–​92.
McDonald, M. P. 2003. “On the Over-​Report Bias of the National Election Study Turnout Rate.”
Political Analysis 11: 180–​186.
Mensch, B. S., and D. B. Kandel. 1988. “Underreporting of Substance Use in a National
Longitudinal Youth Cohort.” Public Opinion Quarterly 52 (Spring): 100−124.
Oppenheimer, D. M., T. Meyvis, and N. Davidenko. 2009. “Instructional Manipulation
Checks: Detecting Satisficing to Increase Statistical Power.” Journal of Experimental Social
Psychology 45 (4): 867–​872.
Perrin, A., and M. Duggan. 2015. “Americans’ Internet Access: 2000-​2015: As Internet Use Nears
Saturation for Some Groups, a Look at Patterns of Adoption.” http://​www.pewinternet.org/​
data-​trend/​internet-​use/​internet-​use-​over-​time/​.
Peterson, R. A., and R. A. Kerin. 1981. “The Quality of Self-​Report Data: Review and Synthesis.”
In Review of Marketing, edited by B. M. Enis and K. J. Roering, 5–​20. Chicago: American
Marketing Association.
Presser, S. 1990. “Can Changes in Context Reduce Vote Overreporting in Surveys?” Public
Opinion Quarterly 54 (4): 586–​593.
Rosenstone, S., and J. M. Hansen. 1993. Mobilization, Participation, and Democracy in America.
New York: Macmillan.
Schuman, H. 1992. “Context Effects:  State of the Past/​State of the Art.” In Context Effects
in Social and Psychological Research, edited by N. Schwarz and S. Sudman, 5−20.
New York: Springer-​Verlag.
Schuman, H., and S. Presser. 1981. Questions and Answers in Attitude Surveys: Experiments on
Question Form, Wording and Context. New York: Academic Press.
Schwarz, N. 1996. Cognition and Communication: Judgmental Biases, Research Methods, and the
Logic of Conversation. Mahwah, NJ: Lawrence Erlbaum.
Schwarz, N., and F. Strack. 1985. “Cognitive and Affective Processes in Judgments of Subjective
Well-Being: A Preliminary Model.” In Economic Psychology, edited by H. Brandstatter and E.
Kirchler, 439−447. Linz, Austria: R. Trauner.
Shettle, C., and G. Mooney. 1999. “Monetary Incentives in Government Surveys.” Journal of
Official Statistics 15: 231−250.
Smyth, J. D., D. Dillman, L. M. Christian, and M. J. Stern. 2006. “Effects of Using Visual Design
Principles to Group Response Options in Web Surveys.” International Journal of Internet
Science 1: 6−16.
Sudman, S., N. M. Bradburn, and N. Schwarz. 1996. Thinking About Answers. San Francisco,
CA: Jossey-Bass.
Tourangeau, R., M. Couper, and F. Conrad. 2004. “Spacing, Position and Order: Interpretive
Heuristics for Visual Features of Survey Questions.” Public Opinion Quarterly 68
(3): 368−393.
Tourangeau, R., and K. A. Rasinski. 1988. “Cognitive Processes Underlying Context Effects in
Attitude Measurement.” Psychological Bulletin 103: 299−314.
Tourangeau, R., L. J. Rips, and K. Rasinski. 2000. The Psychology of Survey Response.
Cambridge, UK: Cambridge University Press.
Tourangeau, R., and T. Yan. 2007. “Sensitive Questions in Surveys.” Psychological Bulletin
133: 859−883.
Traugott, S. 1989. “Validating Self-​ Reported Vote:  1964–​ 1988. ANES Technical Report
Series, no. nes010152.” Unpublished manuscript, University of Michigan. http://​www.
electionstudies.org/​Library/​papers/​documents/​nes010152.pdf.
Van Campen, C., H. Sixma, J. Kerssens, and L. Peters. 1998. “Comparisons of the Costs and
Quality of Patient Data Collection by Mail Versus Telephone Versus In-​Person Interviews.”
European Journal of Public Health 8: 66−70.
Wright, G. C. 1990. “Misreports of Vote Choice in the 1988 ANES Senate Election Study.”
Legislative Studies Quarterly 15: 543−563.
Wright, G. C. 1993. “Errors in Measuring Vote Choice in the National Election Studies, 1952-​
88.” American Journal of Political Science 37 (1): 291−316.
Zickuhr, K., and A. Smith. 2012. Digital Differences. Washington, DC:  Pew Internet and
American Life Project. http://​pewinternet.org/​Reports/​2012/​Digital-​differences.aspx.
Chapter 4

Taking the Study of Political Behavior Online

Stephen Ansolabehere and Brian F. Schaffner

Survey research in the United States has crossed a threshold. Over the past two
decades there has been an explosion in the number of academic studies making use
of Internet surveys, which are frequently conducted using opt-​in samples rather than
samples of randomly selected individuals. News media polls have followed suit, and
today nonprobability Internet polls are nearly as common as random digit dialing
phone polls. Internet polling is here to stay, at least until the next revolution in survey
research.
The change has been driven by a variety of factors. First, phone surveys have be-
come more difficult to conduct. Since 2005 there has been a precipitous decline in the
use of landline phones in the United States, especially among young adults, and there
are legal barriers to many techniques used by market researchers for random digit di-
aling of phone numbers. In addition, social norms about answering phone surveys
have changed, causing response rates to most phone polls to drop into the single
digits. Second, cost calculations have changed. Survey research firms dedicated to
using the Internet and nonprobability based sample selection methods as a mode of
data collection, such as Knowledge Networks and YouGov, have emerged and have
produced relatively low cost alternatives to phone and other modes of survey contact.
Third, researchers have realized the survey design opportunities available with Internet
polls. Online surveys offer the opportunity to show visuals and videos, to conduct
experiments within surveys easily, and to implement new forms of questions. They are
also generally easy to field quickly, so researchers can receive their
data in a timely manner. Fourth, people respond to Internet surveys in several ad-
vantageous ways. There is evidence of less social desirability bias when no interviewer is
involved, and people read faster than they speak, meaning that people can answer many
more questions in an online poll than in one conducted over the phone in the same
amount of time.
While some firms like GfK (formerly Knowledge Networks) deliver surveys online
to panels that are recruited with probability sampling methods, most online firms use
some form of opt-​in recruitment strategy. While techniques often vary widely across
online polling firms, the highest quality firms tend to spend substantial resources
recruiting individuals to join their panels through online advertising, referrals, and
other approaches. Once people join the panel, they are asked to take surveys from
time to time, often in exchange for points that can be redeemed for some reward (like
a gift card). Some firms, such as YouGov, have millions of individuals throughout the
United States who are active members of their panel. When a researcher contracts with
YouGov to conduct a survey, the firm attempts to collect responses from a sample of
their panelists who would be fairly representative of the population that the researcher
is interested in.
While these strategies are often referred to as nonprobability samples, that termi-
nology can be misleadingly simplistic. First, some online polling firms, like YouGov,
sample individuals from their panel using an approach that is based on a randomly
selected target to which volunteer members of the panel are then matched based on
their demographics (see Rivers 2007). Thus, this technique does have grounding in
probability sampling. Second, as many scholars have noted, the line between probability
and nonprobability recruitment has blurred considerably in the era of exceedingly small
response rates. For example, Andrew Gelman and David Rothschild (2014) note, “No
survey is truly a probability sample. Lists for sampling people are not perfect, and even
more important, non-​response rates are huge. . . . Rather than thinking in a binary way
of probability vs. non-​probability sampling, perhaps it’s better to think of a continuum.”
The point that Rothschild and Gelman are making is that when response rates are
less than 10% and others in the population are not included in the sampling frame
at all, it becomes much more difficult to treat anything as a pure probability sample.
Accordingly, all survey researchers now engage in a substantial amount of modeling
(e.g., weighting) to ensure that the sample they ultimately end up with is representative
of the population they are attempting to draw inferences about. However, it is typically
the case that online opt-in surveys require more modeling than well-designed surveys
using probability sampling. We consider this point in greater detail below. Still, it
is important to keep in mind that surveys do span a continuum in terms of the degree to
which they rely on modeling versus random selection. In this chapter we therefore
use the terms online and/or opt-in as shorthand for surveys that rely more on modeling
and less on random sampling, and the terms face-to-face, telephone, and/or probability
sample as shorthand for surveys that rely more on random selection.
The transition to opt-​in, online polls has been controversial in the community of
survey researchers (e.g., Voosen 2014). The most obvious opposition comes from in-
cumbent survey organizations: those invested in phone and face-​to-​face polls. However,
as we discuss below, there has also been strong resistance in the scholarly and meth-
odological communities. The shift away from pure random sampling was driven partly
by the increasingly higher nonresponse rates to existing survey methods as well as the
distinct approach that online surveys required. The new technologies also had to prove
their mettle. Could researchers be confident that the new survey methodologies yielded
valid estimates of opinions and behaviors? What would be the basis for drawing sta-
tistical inferences from samples that were not randomly selected? As the move to on-
line polling occurred—​and in the mid-​2000s it seemed inevitable because of the
opportunities the technology presented and the increasing challenges faced by tradi-
tional modes—​what would be gained and lost in the transition to online polling?
This chapter examines the trade-​offs that the survey research and public opinion
field has faced in the transition to online opt-​in polling. The heart of the matter is not
which mode is right or wrong, good or bad. Rather, the transition that survey research
is undergoing forces us to understand how to best make decisions about how research is
conducted.
In this respect, the discussion here points to three significant conclusions, which
we return to at the end of the chapter. First, transitions take time. The early attempts at
Internet polls were error prone, but they improved markedly over time and tend to vary
significantly across survey firms (e.g., Kennedy et al. 2016). The field’s understanding of
survey method is not, then, static, but evolves with societal, technological, and industry
changes. Second, a healthy survey research field will allow for a variety of approaches.
The new challenge is not to pick one best approach, but rather how to synthesize in-
formation from different approaches. By combining data collected using different
approaches we may be able to improve our methods by guarding against the weaknesses
in any single approach. Third, there is a need for ongoing testing. We should constantly
re-​evaluate survey methods, whether they be recently developed or long established.
After all, we have learned that the effectiveness of survey methods can wax and wane
with changes in technology and society, even if the approach itself remains static.
In the next section we discuss the relationship between quality and cost when
conducting survey research. We then turn to focusing on how opt-​in Internet surveys
stack up both in terms of their overall accuracy and also with regard to the manner in
which they are administered to individuals.

Survey Quality and Cost

What has been gained or lost in the transition to online polling? The transition over the
past fifteen years from random digit dialing phone polls to opt-​in panels that rely on the
Internet for response has often been framed as a choice between higher quality proba-
bility samples and lower cost (but lower quality) opt-​in Internet samples (e.g., Pasek and
Krosnick 2010; Chang and Krosnick 2009). That choice was the focus of important liter-
ature on mode effects, which we discuss in the following two sections.
The potential trade-​off between quality and cost is crucial in research design gener-
ally, not just the method through which samples are drawn and surveys conducted. In
the scholarship on survey method, researchers have often focused on the total survey
error (TSE) approach, which recognizes that various components of a survey combine
to affect the total error rate of that survey (e.g., Groves and Lyberg 2010). The resources
of researchers—​time and money—​are limited. With additional resources, it is usually
possible to improve on our data collection methods. But given the constraints faced by
most researchers, we must decide how to best to allocate our resources. Thus, in this
section we consider how to balance the TSE of different approaches with the resources
needed to carry out those approaches.
Survey research has transitioned through many different modes, from in-​person or
face-​to-​face surveys, to mail surveys, to phone surveys, to Internet surveys, and now,
potentially to surveys administered through social media, mobile devices, or services,
such as Mechanical Turk. Each transition in survey mode is almost always framed as
a choice between high-​cost, high-​quality methods and low-​cost, low-​quality methods.
In the 1970s, for example, the debate was whether to switch from in-​person and mail
surveys to random digit dialing phone surveys. At that time, the phone surveys were
viewed as suspect, and in-​person, face-​to-​face surveys were taken as sufficiently supe-
rior in quality that they must be maintained as the standard methodology for survey
research (e.g., Klecka and Tuchfarber 1978; Weeks et al. 1983). But the cost constraints of
in-​person, face-​to-​face surveys meant that research organizations could conduct many
fewer surveys than they could with phone surveys. In the late 1980s there was an explo-
sion of the use of phone surveys for market and political research because researchers
could more quickly field their surveys and could take many more readings of public
opinion. In the area of election surveys, for example, the 1988 and 1992 elections saw
a rapid increase in the number of election polls conducted by media organizations to
gauge the horse race between the Republican and Democratic candidates. The horse-​
race coverage became a standard part of the story of the election.1 By the early 1990s,
random digit dialing phone surveys had become the new standard.
The control of quality in survey research has traditionally come through the use of
random sampling. A 2010 report by the American Association for Public Opinion
Research (AAPOR) on survey sampling methods stated strongly that random sam-
pling is the industry standard (Baker et  al. 2010). That report emphasized concerns
about quality, rather than cost, and promoted a specific technical approach to valid
survey research.
Why do random sample surveys produce high-​quality studies? The Polling 101
version of random sample surveys goes as follows. A surveyor randomly
selects a certain number of individuals from a population: a random sample. By that
we mean that all people have a probability of being selected into the sample, and that
probability is known and is independent of any characteristic of the individual. That
holds true if a device such as a coin toss or a random number generator creates the prob-
ability of selection. Further, it is assumed that those selected to participate all respond to
the survey and answer questions truthfully and fully. Crudely speaking, that is what is
meant by a random sample survey.
The value of this idealized version is that it states a set of assumptions that imply
an elegant statistical model of the survey that allows for estimation of and inference
about characteristics of a population. More generally, the key assumption underlying
the theory of estimation and inference using surveys is that cases are selected into the
sample by a process that is independent of any important feature of the sample, also
known as the ignorability assumption (Gelman et al. 2004). Randomness in the sample
selection process ensures ignorability of the selection (or missingness) of the data,
assuming that every individual who is sampled by the surveyor takes the survey.
From the assumption of random sampling, statisticians have developed a theory of
estimation and inference. Under the assumption of random sampling (along with com-
plete and truthful response), one can apply the central limit theorem to define the distri-
bution of possible outcomes from a survey and use that distribution to make inferences,
such as the degree of confidence in an estimate. So, for example, the typical news story
about a poll usually states that a certain proportion of the population has a given char-
acteristic (e.g., approves of the president) and that there is a margin of error of plus or
minus 3 percentage points for that estimate. What is meant by that statement is that
there is a 95% probability that the true proportion of the population that has that char-
acteristic is within 3 percentage points of the estimate yielded by the survey. Thus, if a
poll with a 3 point margin of error finds that 45% approve of the president, then the true
value is very likely to be somewhere between 42% and 48% approval.
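A minimal sketch of the margin-of-error arithmetic behind statements like this one, assuming simple random sampling and a normal approximation; the sample size here is illustrative rather than taken from the chapter:

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate 95% margin of error for a proportion under simple random sampling."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Illustrative values: 45% approval measured in a sample of 1,000 respondents.
p_hat, n = 0.45, 1000
moe = margin_of_error(p_hat, n)
print(f"estimate: {p_hat:.0%} plus or minus {moe:.1%}")          # about 3 percentage points
print(f"95% interval: {p_hat - moe:.1%} to {p_hat + moe:.1%}")   # roughly 42% to 48%
```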
The random sample survey with complete and truthful response is the proverbial
“gold standard” of survey research. Like all proverbs, it has a kernel of truth surrounded
by a healthy coating of myth. Perhaps the most troubling problem for conventional
random sample surveys has been declining response rates. In other words, a researcher
can select a random sample, but the researcher cannot force those sampled to respond.
If some types of people are more likely to refuse to participate than other types, then the
sample will ultimately be biased. For example, younger adults are often harder to con-
tact and less likely to be willing to respond to surveys, which means that the samples
obtained by pollsters are often much older than the population that they are attempting
to make inferences about.
The American National Election Study (ANES) expends considerable effort to con-
struct random samples of the U.S. population based on addresses and then to conduct
face-​to-​face interviews. According to the ANES, the response rate to the study has fallen
from 80% in 1964 to 60% in 2000 to 53% in 2008 to 38% in 2012.2 The Pew Center on
People and the Press conducts the highest quality phone studies possible. That research
organization reports declining, even lower response rates to phone polls. From 1997 to
2012, the response rate to the Pew phone surveys dropped from 36% to just 9%.3
The high nonresponse rates associated with phone and face-​to-​face surveys since
the 1990s created substantial doubts about the validity of the survey enterprise, and
opened the possibility for another approach. Under the usual statistical theory, high
nonresponse rates raise concerns about the confidence in the assumption of pure ran-
domness; after all, most of the people who were randomly selected into the sample have
declined to participate. As a result, researchers must either fall back on the assumption
of ignorability of nonresponse (i.e., assume that those who refused to answer were no
different than those who participated) and noncoverage (i.e., people who cannot be
reached through the survey mode) or attempt to adjust the survey at the data anal-
ysis stage to correct for patterns of nonsampling errors that are nonignorable (i.e., by
weighting the sample). That is, researchers either had to believe that the 60% of people
who refused to respond to the ANES in 2012 were no different than the 40% of people
who did respond, or they had to use statistical methods to “fix” the sample to make those
who responded look like the original random sample.
Even before the transition to online polling began, survey researchers were already
using weighting to deal with the challenges faced by plummeting response rates. This
is not to say that the actual quality of surveys had declined. Rather, the key facts about
declining response rates had led to an increased impetus among survey researchers
to use statistical methods to adjust for the fact that samples violated the ignorability
assumption. These rising concerns about sampling also provided an opening for
survey innovation, a search for alternative modes and new ways of thinking about
survey design. The challenge for new modes, such as the opt-​in Internet survey, was
demonstrating that these new approaches were of sufficiently high quality and lower
cost to justify the move. The main concerns were nonresponse, noncoverage, and the
lack of randomness as a protection against idiosyncratic errors in sample selection.
The costs of surveys can vary considerably across modes and even within modes.
A typical Internet sample of 1,000 respondents costs in the neighborhood of $10 to $20
per interview. Special samples (say of a specific demographic or region) can be consid-
erably more expensive.4 The costs of a random digit dial phone poll are typically at least
50–​100% higher than high-​quality Internet polls of the same population.
The most expensive surveys, by far, are address based samples conducted face-​to-​face,
such as the ANES and the Panel Study of Income Dynamics. The ANES reports that the
cost of fielding its survey (excluding other activities associated with the project) was ap-
proximately $3 million for 2,000 interviews, or a staggering $1,500 per interview. The possible costs
of a national survey of American adults, then, can range from approximately $10 per in-
terview to more than $1,000 per interview.
How should we think about the trade-​off between cost and quality? What are the
benefits of a high-​quality survey, and what are the losses associated with a lower quality
survey? Quantifying those benefits and losses is essential in making a systematic choice
about research design.
Typically, the trade-​off between quality and cost is considered only in relation to a
single study. Given a fixed amount of money, a research team chooses a survey mode
and designs and implements its questionnaire. And in preparing a grant, a research
team must justify its choice of survey methods and modes. In making design decisions,
researchers must consider the consequences of making either Type I or Type II errors.
That is, they must weigh concerns about their wrongly concluding that a hypothesis is
correct when in fact it is not, or wrongly concluding that a hypothesis is wrong when in
fact it is true.
While researchers typically make decisions about mode in relation to a single study,
in academic research it is more fruitful to think about the quality-​cost trade-​off not in
terms of a single survey but in terms of a series of studies that all seek to answer the same
question—​that is, in terms of an entire literature. If a discipline chooses a higher quality
methodology, then scholars can answer a given question or test a given hypothesis or
conjecture more efficiently than if the discipline used less accurate methods.
Suppose we conduct one study under the strong assumptions of random sampling,
with 100% response rate and no misreporting. We use this survey to produce a point
estimate (say approval for the president) and a confidence interval. In that case, the
chances of “getting the answer right” (creating a confidence interval that includes the
true population value) are 95% for a traditional level of confidence. We take that as a
baseline.
One way to quantify the loss associated with an inferior methodology is to ask how
many studies researchers would have to do to reach the same conclusion as a high-​
quality survey with 95% confidence. There are many ways to quantify that specific
criterion. Suppose that we use simple majority rule: Do a majority of studies confirm
or disprove a given estimate or hypothesis? Adding a degree of confidence to that
statement, we seek to establish how many studies of inferior quality researchers would
have to conduct to have a 95% probability that a majority of studies reach the correct
conclusion. We think of this as a quantification of what is meant by a consensus in a sci-
entific community.
Take, as an example, two types of studies. One type of study uses the superior meth-
odology (random sampling, complete and correct responses). From this, one can build a
confidence interval or conduct a hypothesis test that, in a classical statistical framework,
will have a .95 probability of being true. This is our baseline criterion. The other type of
study uses an inferior methodology. Suppose that the inferior approach would confirm
a hypothesis, if the hypothesis is true, with probability .9 (rather than .95).5
How many studies of inferior quality must be done to have 95% confidence that the
body of research arrives at the right result? Assume that a series of three independent
studies is conducted using the inferior methodology. The probability that all three
studies confirm the hypothesis is .729 (.9 × .9 × .9), and the probability that two of
the three confirm the hypothesis is .243. Thus, the probability that a majority (two or
three) of the three studies confirm the hypothesis correctly is .972. Importantly, this cal-
culation assumes that the studies are independent of one another. Positive correlation
among the studies can make this an underestimate of the number of studies needed; by
the same token, negative correlations among studies can actually gain efficiency. Setting
that concern aside, under the assumption of independence, if we conduct three inferior
studies, we have as much confidence that a majority of those studies are correct as we
would if we conducted one study using the superior methodology.
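As a check on this arithmetic, here is a small sketch of the majority-rule calculation; the binomial setup mirrors the chapter's independence assumption, and the function name is ours:

```python
from math import comb

def prob_majority_correct(n_studies, q):
    """Probability that a majority of n independent studies reach the correct
    conclusion when each does so with probability q."""
    majority = n_studies // 2 + 1
    return sum(comb(n_studies, k) * q**k * (1 - q)**(n_studies - k)
               for k in range(majority, n_studies + 1))

# Three inferior-quality studies with q = .9, as in the text:
print(round(0.9**3, 3))                          # all three correct: 0.729
print(round(comb(3, 2) * 0.9**2 * 0.1, 3))       # exactly two correct: 0.243
print(round(prob_majority_correct(3, 0.9), 3))   # majority correct: 0.972
```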
This approach allows us to quantify the quality-​cost trade-​off. A direct cost calcula-
tion is simply the number of surveys that are required to obtain a consensus, given a level
of quality of a survey. Other considerations, such as opportunity costs of researchers,
might be factored in as well. The simple implication of the calculation above is that it is
worth using the superior quality survey only if the cost of doing one such survey is less
than the cost of doing three inferior quality surveys. Likewise, it may be worth using an
inferior survey methodology if the cost of such surveys is less than one-​third the cost of
the superior methodology. We see a similar logic play out when it comes to horse-​race
polling during campaigns. While some very high-​quality surveys are useful indicators
of the state of the race in their own right, most seasoned scholars and pundits focus on
aggregated indicators of the state of the race taken from multiple polls (i.e., polling aver-
ages). The justification for this approach is that one can generally learn at least as much
from averaging multiple inferior polls as from looking at a single poll, even one of very
high quality.
Viewed in this way, it becomes extremely useful to measure the relative quality of
various survey methodologies to contextualize the cost differentials. Denote the
degree of quality of the inferior methodology as q, the probability that the hypothesis is
confirmed using the inferior quality methodology given that the hypothesis is right.
In the calculation above, q = .9, and we ask how many studies with q = .9 must be
performed to have a probability of .95 or higher that the majority of those studies
confirm the hypothesis when that hypothesis is true. Now consider doing the same
thought experiment for lower levels of quality, namely, q = .8, q = .7, and q = .6.
Table 4.1 presents the number of studies needed in a literature to attain a 95% level
of confidence that a majority of the studies conclude that the hypothesis is true, when
in fact it is true. Again, we assume that the studies are independent of one another.
Lessening the level of quality from q = .9 to q = .8 increases the number of studies
needed to reach at least a 95% level of confidence from three to seven. In other words,
if survey quality concerns raise the probability of a false negative from .05 to .20, then
a research community must conduct at least seven studies before a sufficiently strong
consensus is reached. Continuing in that vein, if the level of quality drops to q = .7,
then the research community must conduct at least fifteen studies to reach a con-
sensus, and if the level of quality is as low as q = .6 (meaning there’s a 40% chance
of a false negative on any single survey), then the research community would have
to conduct sixty-​five studies before a majority of studies clearly answers the research
question.
This formalization of the quality-​cost trade-​off has several important implications for
the choice of survey mode.

Table 4.1 Survey Quality and the Number of Studies Needed to Obtain a “Consensus”

Probability Correct*    Number of Studies Needed
q = .9                  3 (at least 2 of 3 correct with probability .95)
q = .8                  7 (at least 4 of 7 correct with probability .95)
q = .7                  15 (at least 8 of 15 correct with probability .95)
q = .6                  65 (at least 33 of 65 correct with probability .95)

* Probability that one will conclude H is true, given that it is true.
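The same helper can be used to check the probability statements reported in Table 4.1; the (q, n) pairs below come from the table, and the exact binomial values land at roughly the .95 consensus level or higher:

```python
from math import comb

def prob_majority_correct(n_studies, q):
    """Same helper as in the previous sketch."""
    majority = n_studies // 2 + 1
    return sum(comb(n_studies, k) * q**k * (1 - q)**(n_studies - k)
               for k in range(majority, n_studies + 1))

# (q, number of studies) pairs taken from Table 4.1
for q, n in [(0.9, 3), (0.8, 7), (0.7, 15), (0.6, 65)]:
    print(f"q = {q}: P(majority of {n} studies correct) = {prob_majority_correct(n, q):.2f}")
```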


First, very high-​quality surveys have a significant edge in choice of research method.
A small degradation of quality, say from q = .95 to q = .90, assuming independence of
surveys, means that multiple studies must be conducted to test with high confidence a hy-
pothesis, or that sample sizes must be increased considerably. In other words, a survey that
has a 10% error rate imposes three times as much cost (three times as many studies need to
be done) as a survey that has a 5% error rate. The cost, in terms of the total number of studies
required to achieve a consensus, grows exponentially as the rate of false negatives grows.
Second, the lower cost of Internet polls has been winning out over the “gold standard”
polls, in part because of the exceptionally high cost of address based sampling, face-​to-​face
polls. Consider the comparison of the costs of conducting the ANES in 2012 and a similar-​
sized Internet poll. The ANES’s face-​to-​face, in-​person survey is more than one hundred
times more expensive to do than a high-​quality Internet poll. In other words, for the cost
of the ANES one could do at least one hundred high-​quality Internet polls. With that cost
differential, it is worth it to the scientific community to use the lower cost modes to answer
research questions, even when the probability of a false negative is as high as 40%!
Third, this framing of the problem raises the natural question of what constitutes a
scientific consensus. Is a .95 probability that a test confirms the hypothesis when that
hypothesis is true too high? Might a research community feel that a consensus emerges
with a lower probability that a majority of studies reach the same conclusion? If a con-
sensus emerges at a lower level of confidence, then the advantage of the higher quality
approach is even less pronounced.
The approach we have sketched here also offers insight into the question of multiple
or mixed modes of survey research. Suppose a research group conducts three surveys to
test a hypothesis. That research group might conduct three superior quality surveys (at
considerable expense) or three inferior quality surveys (at much less cost), or it might
employ a mix of approaches. An analogous calculation to that in Table 4.1 reveals that
there may be an advantage to mixing the modes, or, equivalently, using multiple survey
modes in a research project or literature.
Table 4.2 presents the probabilities that a majority of studies reach the correct
conclusion, for various levels of survey quality and mixes of inferior and superior
methodologies.

Table 4.2 Survey Quality, Mixed Modes, and the Probability That a Majority of Studies Reach the Correct Result

Quality of         3 Superior         2 Superior,          1 Superior,          3 Inferior
Inferior Survey    Quality Surveys    1 Inferior Quality   2 Inferior Quality   Quality Surveys
q = .9             .993               .988                 .981                 .972
q = .8             .993               .979                 .944                 .896
q = .7             .993               .969                 .889                 .784
q = .6             .993               .960                 .816                 .648

If the researchers were to conduct three superior quality surveys, each
of which has a .95 probability of concluding that the hypothesis is correct when in fact
it is, then there is a .993 probability that at least two of three or three of three surveys
reach the correct conclusion. Interestingly, if the researchers were to include an inferior
quality survey along with two superior quality surveys, they would have nearly the same
(very high) level of confidence that a majority of their surveys are correct. If q = .9 for one
of the surveys and .95 for two of the surveys, then the probability of a correct conclusion
among a majority of surveys is .988. Even if the low-​quality survey has a q of just .60, the
probability that a majority of the three surveys is correct is .960. See the third column of
the table. Using multiple surveys provides some protection against false inferences from
inferior quality surveys. That said, quality does harm inference: the lower the quality,
the lower the probability of reaching the correct inference. The drop-​off in confidence
can be quite large with lower quality, especially when all surveys are of the inferior sort.
One important implication of the simple analysis is that not all surveys need to have
the same quality for a scientific consensus to emerge. For example, with one superior
quality survey and two inferior quality surveys (q = .8), the probability that a majority of
surveys yields the correct answer is still approximately .95.
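A sketch of the enumeration behind Table 4.2: each survey is independently correct with its own probability (.95 for the superior mode, q for the inferior mode, following the chapter's setup), and the probabilities of every outcome in which a majority is correct are summed. The function name and the q = .8 illustration are ours.

```python
from itertools import product

def prob_majority_correct_mixed(probs):
    """Probability that a majority of independent surveys reach the correct
    conclusion, where probs[i] is survey i's probability of being correct."""
    total = 0.0
    for outcome in product([0, 1], repeat=len(probs)):  # 1 = correct, 0 = incorrect
        if sum(outcome) > len(probs) / 2:
            weight = 1.0
            for correct, p in zip(outcome, probs):
                weight *= p if correct else 1 - p
            total += weight
    return total

superior, q = 0.95, 0.8
mixes = {
    "3 superior":             [superior] * 3,
    "2 superior, 1 inferior": [superior, superior, q],
    "1 superior, 2 inferior": [superior, q, q],
    "3 inferior":             [q] * 3,
}
for label, probs in mixes.items():
    print(f"{label}: {prob_majority_correct_mixed(probs):.4f}")
# Compare with the q = .8 row of Table 4.2: .993, .979, .944, .896.
```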
This points to a possible cost-​saving approach in research. Having multiple survey
modes allows a research group or an entire field of study to lower the cost of reaching a
scientific consensus. In fact, having all surveys be of very high quality might even be in-
efficient. If, to reach consensus, at least three studies need to be conducted in a literature,
then three very high-​quality surveys will have an extremely high probability of agree-
ment. A discipline, then, can tolerate a mix of higher quality and lower quality surveys
and still attain a high probability that a majority of surveys reach the correct conclusion.
Having multiple survey modes in a research literature also allows for testing of
the validity and quality of the modes. If there are several modes being actively used,
researchers can compare the relative quality of various modes.
As new modes for conducting surveys emerge, the key question, then, is what quality
of results is derived from those new modes. How closely do new modes of conducting
surveys approximate the ideal of a random sample survey with complete and truthful
answers? In the early 2000s, as nascent Internet survey methods began to emerge, that
is precisely the question survey researchers faced. And today, as Mechanical Turk and
other platforms for data collection emerge, the same questions arise.

Quantifying the Quality of Internet Surveys

There are two approaches to quantifying the quality of a result estimated from a partic-
ular method: (1) compare the survey results from that mode with objective indicators
and (2) compare estimated quantities (means, variances, correlations, and regression
coefficients) for identical questions asked in different survey modes (e.g., phone versus
mail or phone versus Internet).
Comparison with objective indicators offers the strongest measure of survey quality
because it allows researchers to compare their survey estimates with the quantity that
they are actually trying to estimate. Suppose that a survey attempts to measure a charac-
teristic of a population, such as the percent of votes won by the Republican candidate for
president in each of several states. The survey samples a few hundred people in each state
and asks for whom they voted for president. The deviation between the survey estimates
and the actual election results (the true or population value) reflects the random and
non-​random errors that occur in the survey process. This is often referred to as the TSE
(Platek and Sarndal 2001). Total survey error includes the deviation of the estimated
value from the actual population value as a result of all parts of the survey, including
nonresponse, misreporting, poorly asked questions, and other problems. These errors
may be random (and add to the variance of the estimate) or systematic (and cause bias in
the estimates). In statistical terms, the TSE is the mean squared error, which equals the
square of the bias of the survey estimate of a given quantity (e.g., a mean, proportion,
or regression coefficient) plus the sampling variance of the estimated quantity
(i.e., the square of the standard error).
To measure TSE, or mean squared error, multiple measures are needed. The deviation
of any one survey’s estimate from the actual value of a given quantity is a single realiza-
tion of the TSE. Suppose the same survey method is repeated many times (either many
different quantities within a single survey or many replications of the same survey), and
the deviation of the survey from the actual value is calculated for each replication. The
average of those deviations gauges the bias, the extent to which the survey instrument
is systematically too high or too low, and the average of the squared deviations estimates
the mean squared error (equivalently, the variance of the deviations plus the squared bias).
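A sketch of how bias and total survey error can be estimated from a set of such deviations; the numbers below are made-up illustrations rather than CCES figures:

```python
import statistics

# Hypothetical deviations (survey estimate minus actual result, in percentage points)
# across several replications or contests; purely illustrative values.
deviations = [1.2, -0.8, 2.5, 0.4, -1.9, 3.1, -0.2, 1.0]

bias = statistics.mean(deviations)                 # systematic over- or understatement
mse = statistics.mean(d**2 for d in deviations)    # mean squared error (the TSE)
rmse = mse ** 0.5
variance = statistics.pvariance(deviations)        # spread of the deviations around the bias

print(f"bias = {bias:.2f} points, RMSE = {rmse:.2f} points")
print(f"decomposition check: MSE = variance + bias^2 -> {mse:.3f} vs {variance + bias**2:.3f}")
```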
The Cooperative Congressional Election Study (CCES) provides an ideal example
and case for measuring the TSE associated with Internet polls.6 The CCES is conducted
every year and is designed to measure the vote choices and political preferences of
American adults. The study employs very large samples, in excess of 30,000 in 2006 and
2008 and in excess of 50,000 in 2010, 2012, and 2014. The large samples make it pos-
sible to estimate the vote in each state for president, U.S. Senate, and governor, and to
compare those estimates to the actual results at the state level. For each state one can
calculate the theoretical (or expected) standard error under the usual assumptions of
sampling theory, such as random sampling or ignorability, and one can calculate the
deviation of the estimate from the actual result. The average deviation (bias), mean
squared error, average number of cases per state, and expected standard error are
presented in Table 4.3 for each statewide race and year for which the CCES provides
estimates (see Ansolabehere and Schaffner 2015, 16–​20). In all there are twelve separate
contests, but each is measured at the state level. Consequently, there are over three hun-
dred individual-​level elections (for each unique combination of state, year, and office)
represented in Table 4.3. The table displays the results aggregated to each year and office
and, at the foot of the table, aggregated across all states, years, and offices.
Table 4.3 presents an overall picture of the accuracy or quality of the CCES, relative to
the ideal survey of the same size. The average bias is 0.4%, which means that, averaging
over every case, the survey overstated the Democratic share of the vote, but
only by four-tenths of 1 percentage point. The average root mean squared error is 3.19%, and
we contrast that with the expected standard error. The average expected standard error, under
the assumption of ignorability or random sampling, is 2.36%. That is approximately 25%
smaller than the root mean squared error, our estimate of the true standard deviation of the TSE.
A further analysis of the data allows us to calculate the quality of the CCES along the
lines of the analysis suggested by Table 4.1. For each office, year, and state we calculate the
squared deviation of the survey result relative to the squared standard error for that state’s
sample. The average of those relative deviations estimates the expected quality of the
survey. It is a multiplier indicating how much larger the true variance of the survey is (the
variance of the TSE) than the variance of the idealized survey. That calculation suggests
that the true standard deviation of the survey is approximately 1.35 times the expected
standard error.

Table 4.3 Comparing Survey and Actual Results: Bias, Mean Squared Error, and Standard Error for the Cooperative Congressional Election Study, 2006–2014

                 Average Error    Root Mean        Average    Expected
                 (Dem. Bias)      Squared Error    Number     Standard Error
2014
  Governor       −0.84%           3.95%            626        3.59%
  U.S. Senate    +0.34%           3.38%            515        4.26%
2012
  President      +2.16%           3.32%            1,069      1.53%
  U.S. Senate    +2.27%           3.66%            1,217      1.94%
  Governor       +1.49%           3.56%            666        1.43%
2010
  Governor       −0.95%           2.22%            982        1.93%
  U.S. Senate    −0.50%           1.30%            882        1.98%
2008
  President      +0.57%           2.89%            940        2.05%
  U.S. Senate    −0.58%           4.04%            638        2.26%
  Governor       +1.12%           3.40%            511        2.35%
2006
  Governor       −0.04%           2.24%            604        2.53%
  U.S. Senate    +0.28%           4.29%            689        2.42%
Average          0.43%            3.19%                       2.36%

We can now use that standard error to construct a test statistic, rather
than the conventional standard error calculation. The implication is that this Internet
survey lowers the quality of inferences somewhat. If the probability of a false negative
is .05 for a test statistic constructed using the usual (expected) standard error, then the
probability of a false negative is approximately .15 using a test statistic constructed using the
estimated square root of the mean squared error as the standard error. A more appropriate
calculation of the quality of the survey relative to the ideal standard compares the estimated
root mean squared error to the expected standard error for each office and year.
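One way to arrive at the .15 figure quoted above is to ask what error rate a test run at a nominal .05 level has when the true dispersion is 1.35 times the expected standard error. The following is a sketch under a normal approximation; the reasoning here is ours rather than a calculation reported in the chapter:

```python
from statistics import NormalDist

nominal_z = 1.96   # critical value for a nominal 95% test
multiplier = 1.35  # estimated ratio of the true standard deviation to the expected standard error

# The nominal critical value corresponds to fewer "true" standard deviations,
# so the error rate rises above the nominal .05.
effective_z = nominal_z / multiplier
error_rate = 2 * (1 - NormalDist().cdf(effective_z))
print(f"effective z = {effective_z:.2f}, implied error rate = {error_rate:.2f}")  # about 0.15
```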
Is that a substantial degradation compared to other surveys? Very few other surveys cal-
culate the TSE associated with their projects. An analogous, but not as expansive, concept
is the design effect, which measures the variation in a survey that comes from clustering
and nonindependence of observations and other features of the design that can produce
higher sampling variances than occur with pure random sampling. The design effect does
not capture biases that occur due to misreporting, nor does it account for whatever bias re-
mains after weights are applied to adjust for nonresponse. The design effect, however, can
be thought of as degradation in quality relative to pure random sampling, as such effects
increase the probability of false negative inferences. The design effect of the ANES has
been estimated to be in the neighborhood of 1.2 to 1.6.7 In other words, the inflation of the
standard error with the Internet sample used by the CCES is approximately on the same
order as the design effect associated with the sampling procedure used by the ANES. This
suggests that there may be little degradation in the ability to draw inferences using Internet
polls relative to traditional random sample, face-​to-​face surveys.
These calculations are presented to demonstrate how researchers may assess the
quality of new survey methods relative to existing methods. In the case of the YouGov
samples relied on by the CCES, there is little evidence of systematic bias and evidence
of some loss of precision relative to the idealized pure random sample survey. However,
no surveys currently match the idealized pure random sample survey. Comparing the
ANES design effects and the TSE of the CCES, there appears to be little degradation in
the ability to draw inferences compared with more traditional sampling modes.8
Any new and untested methodology faces tough scrutiny, and ought to. Total survey
error provides a general framework for assessing the quality of new modes of surveying.
This framework allows us to measure in clear quantitative terms the quality side of the cost-​
quality trade-​off. The example of the CCES offers a good case study of the use of TSE to pro-
vide a critical evaluation of the performance of an Internet survey. Importantly, the analysis
of the CCES over a ten-​year time frame revealed that the study did not, in fact, represent a
significant reduction in quality, compared with the design effects of traditional surveys.
A second manner of assessing the quality of any new approach relative to established
methodologies is a carefully designed study that compares the modes. Unlike TSE, the
framework of a mode study is to compare the estimates yielded by competing modes.
No comparison with an objective reality is usually made, so it is possible that there are
biases that affect all modes. However, a mode study is useful in determining whether a
new mode might alter conclusions we have drawn using established modes.
Studies of mode differences in the early 2000s found substantial differences be-
tween opt-​in Internet samples and random digit dialing phone samples. For example,
in a study of alcohol use among young adults, Link and Mokdad (2005) found that
phone and mail surveys yielded similar results, but that their Internet sample produced
different results. Studies such as this one led a group at AAPOR to conclude in 2010
(Baker et al. 2010) that opt-in Internet samples differed from other modes of inquiry.
More recent research, however, shows few or no significant differences between tra-
ditional modes and opt-​in online survey approaches. Ansolabehere and Schaffner
(2014) conducted a mode study comparing phone, mail, and Internet samples. They
found no substantial differences across modes in reported behaviors, such as voting,
vote preference, donating blood, smoking cigarettes, moving, or owning a home. They
found no significant differences in regression coefficients or correlations across modes
in explaining approval of Obama; approval of Congress; and attitudes about abortion,
affirmative action, gay marriage, Social Security privatization, or taxes. We have also
conducted mode studies comparing the face-​to-​face version of the ANES with versions
conducted in two separate online survey formats. Our results show that in terms of both
point estimates and cross-​item correlations, online surveys track closely with responses
secured through the face-​to-​face sample.
Other studies have reached similar conclusions. The Pew Center for People and the
Press conducted a study in 2015 comparing a survey recruited through random digit di-
aling with nine opt-​in Internet samples (Kennedy et al. 2016). That study found that the
randomly selected sample was exactly in the middle on most of the measures gauged.
The authors concluded that vendor choice matters much more than mode. Other recent
studies reach similar conclusions: the differences between quality opt-​in and random
digit dialing samples have become trivial. What is most important is not which mode
you use, but choosing a high-​quality vendor to execute the selected approach.
Whether the standard is the absolute level of quality (TSE) or the relative level
of quality, the past fifteen years have witnessed a substantial improvement in the
demonstrated quality of opt-​in Internet surveys. Over this time span Internet surveys
turned a significant corner. Although some concerns remain about the use of opt-​in
surveys, high-​quality Internet surveys appear to have gained broad acceptance both
within and beyond academia.
One important lesson from the debate over the quality of Internet surveys is that
quality control is essential for surveys, which are a vital research tool. The need to
maintain quality creates a difficult problem for those organizations that attempt to set
standards for the field, such as AAPOR. Standards seem necessary to maintain quality,
but they also can stifle innovation and the evolution of the field.

Qualitative Differences in the Changing Modality of Surveys

An entirely separate set of issues drove the rise of Internet polls over the past decade: the
qualitative differences between an online interface and interviewer-​led questions.
Online polls present new opportunities for conducting research, including the
ability to show respondents videos and images and to present new question formats.
Experimentation has driven the move online as much as considerations of cost and
quality (Evans and Mathur 2005; Sue and Ritter 2012). While the quality-​cost debate
concerns the validity of population estimates compared with random sample surveys,
the opportunity to conduct experiments has made the Internet survey a natural choice.
For at least a generation, psychologists (working in all fields) have relied on experiments
involving college students to test their ideas. Internet surveys (and newer tools such as
Mechanical Turk) offer a much broader population with which to conduct research.
In this respect we see three different attitudes regarding modality. First, Internet
surveys differ in mode of data collection only. The quality-​cost trade-​off treats the
questionnaires as the same and merely views the Internet as a more convenient and less
expensive mode of data collection. This is an important aspect of the choice of mode,
as researchers do not want to lose the continuity with past research, especially for long-​
lived research projects like the ANES, the General Social Survey, or the Panel Study of
Income Dynamics.
A second view, to borrow Marshall McLuhan's phrasing, is that the medium is the message.
The rise of the Internet and social media has fundamentally changed the way people
communicate. The rise of online polling is simply the adaptation of research on social
attitudes, opinions, and behaviors to changes in technology and society. The random
digit dial phone survey itself was an adaptation to changing communications in society.
Sticking with the landline-​based mentality today amounts to sticking with older ways
of communicating, which are quickly becoming inadequate for the study of society. By
2012 one-​quarter of all people could not be reached by a random digit dial phone survey.
That number is estimated to exceed one-​third of all people in 2016, and it will continue
to grow. Many more people are now accessible online.
Not only do new media reach more people, but they involve fundamentally different
forms of communication. The Internet is a visual medium. Respondents read online
surveys, rather than have the surveys read to them by an interviewer, and the removal of
the interviewer from the process makes progressing through the survey much quicker.
Visuals and video can be embedded in a survey, and it is easier to randomly assign
survey respondents to see different versions of a message or image. These innovations
have opened new ways of asking questions and new ways of analyzing data. The length
of time it takes to answer a question, for example, can be easily recorded and provides
implicit measures of the degree of cognitive effort a respondent expends in answering a
question (Mulligan et al. 2003).
People also interact with new media differently, and that too is part of the survey ex-
perience. For example, the 2008 ANES face-​to-​face survey included a seventy-​three-​
minute pre-​election interview and a ninety-​one-​minute post-​election interview.9 These
were some of the longest conversations that respondents had about politics, especially
with a stranger. Most Internet surveys also allow respondents the flexibility to com-
plete questionnaires at their own convenience and at their own pace. This means the
survey is much less of an intrusion on an individual’s daily routine. Indeed, we have
found that people who answer online polls frequently do other things while they are
working through the survey (Ansolabehere and Schaffner 2015). For example, 15–​20%
of respondents watch television while they answer an online survey. Many respondents
also take breaks to have a conversation with a family member or roommate, to check
email, or to have a phone call. About half of online respondents report doing at least one
other thing during the course of taking their survey. Notably, the interruptions and mul-
titasking do not appear to degrade the quality of responses given by respondents.
The self-​administered nature of online surveys not only provides a benefit by allowing
respondents to finish them at their own pace, but it also means that the responses given
are likely to be more accurate. Studies consistently find that respondents are more
honest when they answer self-​administered surveys, especially those conducted online
(Chang and Krosnick 2009; Kreuter, Presser, and Tourangeau 2008). The presence of
an interviewer (either in person or on the phone) often discourages respondents from
answering sensitive questions truthfully, but when those same individuals can complete
the questionnaire privately, they are more likely to provide honest responses.
Overall, online surveys provide an innovative and flexible interface for collecting
data. Thus, Internet polls can collect a wider array of data more efficiently, more conven-
iently, and more accurately than modes that necessitate the presence of an interviewer.
When combined with the increasing accuracy and continued affordability of Internet
surveys, the flexible and convenient interface is yet another reason that scholars have
increasingly used online polls.

Making Wiser Choices about Survey Mode

Survey modes and methods will continue to change as communications technologies change. Today, online polls have gained wide acceptance, and the ascendancy of this
new, less expensive methodology has put enormous pressure on more expensive modes,
especially face-​to-​face surveys. And so it goes. New, cheaper ways of conducting surveys
replace the old approaches, only to eventually be replaced themselves. Survey researchers
are trying to figure out the best way to gauge public opinion using mobile devices and
social media. Amazon.com’s Mechanical Turk is quickly emerging as a faster and less ex-
pensive platform for conducting experiments that were previously done in conventional
surveys. And as with other new methods, the debate over the quality of that approach
has already begun (e.g., Berinsky, Huber, and Lenz 2012). Mechanical Turk and other
new ways of studying political and social behavior will become accepted, and possibly
even ascendant. Researchers, facing the inevitable question of how to most efficiently
conduct their inquiries, will eventually abandon older methods in favor of newer ones.
That cycle of innovation is inevitable. It is a cycle of creativity: new technologies intro-
duce new ways of reaching people, asking questions, and studying behavior.
We have sought in this chapter to introduce a different way of thinking about the fu-
ture, about what comes next. The debate over methods of studying behavior is often
framed in two ways. Both are informative, but neither is adequate. First, the debate is
often over “this approach or that”; phone or Internet, mail or face-​to-​face, probability
or opt-​in. While that may be the choice that any researcher faces in designing a specific
study, it does not reflect the broader concern of a research literature. That quest is to find
the most efficient way for research to come to a scientific consensus over an important
conjecture or problem. No single mode may be the answer.
Second, the debate over mode is often framed as a debate over a scientific or indus-
trial standard. What are the technical specifications that all researchers must adhere
to in order to gain acceptance by a broader community? Such technical specification
standards are really social norms, as much as actual quality guarantees. In that regard, it
is very important to note that researchers in the United States and the United Kingdom
adhere to very different technical standards for their surveys. The technical specifica-
tion standard that was the norm in the United States for several generations was random
sampling; that is, the method of selection must be unrelated to any characteristic of
individuals. The technical specification standard that has long been the norm in the
United Kingdom is representative sampling; that is, the sample ought to represent the
population along several key characteristics, such as age, gender, and education level.
If a random sample in the United Kingdom is not sufficiently representative, it is unac-
ceptable. If a representative sample is presented at AAPOR, it is suspect because the
sample was not randomly drawn. These are norms that imply a way that surveys must
be done to be acceptable. Such standards serve as a barrier to innovation and barriers to
entry in the marketplace of survey research and marketing firms.
Our framing of the problem is that quality goals, rather than technical specifications,
are essential. From a scientific perspective, the choice of survey research mode weighs two
considerations, cost and quality. If researchers can get higher quality at the same cost, they
should buy the higher quality mode. That approach is good not only for the individual re-
searcher working on a tight budget, but also for the scientific community as a whole, as that
approach will lead more quickly and efficiently to a consensus around the correct conclusion.
However, we do not operate in a world in which the highest quality is the cheapest.
We usually have to make a choice between an expensive but accepted “gold standard”
and a cheap but innovative methodology or technology. First, there is considerable un-
certainty about the quality of new technologies. Established technologies and technical
standards are in place because they won the last fight over methodology. And the re-
search needed to assess the quality of competing modes has rarely been conducted when
a new methodology is just emerging, when the choice between competing modes is most
difficult to make. Second, the economics of the survey business (or any business) often
create a cost difference. The incumbent methodologies are often most expensive because
newer technologies are adapted as innovations in cost and quality and because technical
standards protect incumbent firms (creating a monopoly advantage for those firms).
If the real goal is maximizing the efficiency of the scientific enterprise rather than
conforming to technical standards, how should researchers think about the choice of
which methodologies to use now? The framework we have introduced offers guidance
about a more effective way of proceeding, both the way to develop a healthy research en-
terprise and some cautions.
First, there is a place for standards. Standards can offer a means of quality control
for the entire research program or research activity. As our examination of Table 4.1 re-
vealed, a few very high-​quality studies can be worth dozens of low-​quality surveys. The
high-​quality studies would, then, be the best research method if the costs were not ex-
ceedingly high relative to the lower quality methods.
Second, to assess quality there needs to be continual study of new research modes.
Researchers cannot make informed design decisions unless data are available about the
quality of inferences made using different methodologies. Some of that information
can be gained as studies are conducted. For example, the CCES builds in measures of
quantities that allow for calculation of TSE. Some of that information can be gained by
conducting carefully designed mode comparison studies.
Third, a mix of modes offers efficiency advantages. There is naturally a mix of modes
in a research area. At any time there are new ways of conducting research, and those
new ideas are contending with established approaches. Our examination of Table 4.2
revealed that a mix of modes can be very efficient, allowing an entire field of research
to reach a scientific consensus at a much lower cost. Further, having several different
methodologies at work in a field of study allows researchers to compare the different
approaches and to draw their own conclusions about quality and the trade-​off be-
tween quality and cost. Also, different modes can have different strengths and different
weaknesses. Using many different modes in a research literature can offer a hedge
against the weaknesses of any single mode. Not every survey has to employ a mixed
mode, but a healthy literature has a mix of modes across studies. We should be suspi-
cious of anyone who avers that there is one and only one ideal way of doing research.
Fourth, technical specifications of the “gold standard” survey, although they serve an im-
portant function, can be a poor way of ensuring an efficient development of scientific under-
standing. Technical specifications can force the trade-​off between quality and cost to be made
in one way for all researchers. If every survey is forced, by virtue of technical specifications, to
have the same mode, then the advantages of deploying multiple modes are lost.
Fifth, survey quality can become a public good. Research occurs in a decentralized
fashion. Every research project makes its own decisions about how to conduct surveys,
how to trade off cost against quality. Technical standards can force all researchers to
make the trade-​off in the same way, say toward the high-​quality, high-​cost method, but
in a way that stifles innovation. The opposite problem can emerge as well. If a research
area employs multiple modes of study, there may be a race to the bottom. Every research
team might choose the low-​cost approach and let someone else bear the cost of the very
high-​quality study. As a result, a field might collectively proceed very slowly and ineffi-
ciently if every researcher chooses the cheap, low-​quality option.
The challenge, then, is how to push the boundary outward, how to create inno-
vation in survey quality and survey cost simultaneously. In some respects that al-
ready happens. Internet surveys opened new ways of measuring opinions, attitudes,
and behaviors. Professional association standards can be helpful in creating guidance about where quality improvements are needed and possible with existing
technologies and in maintaining a mix of methodologies so that the rush to a new
methodology does not completely lose the value of what existed before. Government
agencies, such as the National Science Foundation and the National Institutes of
Health, and private research foundations, such as Pew, can serve an important pur-
pose as well. They simultaneously maintain those projects and methods deemed to
be very high quality by a scientific community and invest in new technologies and
methodologies that show promise of emerging as a platform for the research commu-
nity at large, such as Time-sharing Experiments for the Social Sciences. And in this respect,
there is also tremendous value in careful research about survey methods and robust
academic debate about those methods. Professional standards, government and
foundation investment, and academic research about survey mode, however, should
not be about picking winners, but about informing researchers generally about the
quality, the strengths and weaknesses, of alternative ways of studying public opinion
and political behavior.

Notes
1. Thomas Mann and Gary R. Orren, eds., Media Polls in American Politics. Washington,
DC: Brookings, 1992.
2. For data on ANES response rates (AAPOR RR1) from 1952 to 2000, see http://​www.
electionstudies.org/​overview/​dataqual.htm#tab1. For data on response rates in 2008, see
http://​www.electionstudies.org/​studypages/​2008prepost/​2008prepost.htm. For data on re-
sponse rates in 2012, see http://​www.electionstudies.org/​studypages/​anes_​timeseries_​2012/​
anes_​timeseries_​2012_​userguidecodebook.pdf, p. 12.
3. http://​w ww.people-​press.org/​2012/​05/​15/​assessing-​t he-​representativeness-​of-​public-​
opinion-​surveys/​.
4. Personal communication with Samantha Luks, Senior Vice President, YouGov, San
Francisco, CA.
5. There are other ways to formalize this choice. For example, it is analogous to the question
of how many studies we need to include in a meta-​analysis (Valentine et al. 2010). Here we
focus on this simple approach, as it makes clear the quality-​cost trade-​off.
6. The CCES is a large-​N cooperative survey project carried out every fall since 2006. The
survey is conducted by YouGov, using its methodology of matching opt-​in respondents to
a randomly selected target sample. More details about the survey and access to the survey
data can be found at http://​projects.iq.harvard.edu/​cces/​data.
7. Matthew Debell, “How to Analyze ANES Survey Data,” ANES Technical Report Series no.
nes012492 (Ann Arbor: Stanford University and the University of Michigan, 2010), 21.
8. This comparison actually favors ANES, as the design effect captures the inefficiency in the
standard errors to reflect clustering and other features of design, while total survey error
contains both inefficiency and bias.
9. See:  http://​www.electionstudies.org/​studypages/​anes_​timeseries_​2008/​anes_​timeseries_​
2008.
References
Ansolabehere, S., and B. F. Schaffner. 2014. “Does Survey Mode Still Matter? Findings from a
2010 Multi-​Mode Comparison.” Political Analysis 22 (3): 285–​303.
Ansolabehere, S., and B. F. Schaffner. 2015. “Guide to the 2014 Cooperative Congressional
Election Study.” Release 1, June. https://​dataverse.harvard.edu/​dataset.xhtml?persistentId=
doi%3A10.7910/​DVN/​XFXJVY.
Baker, R., et al. 2010. “Research Synthesis: AAPOR Report on Online Panels.” Public Opinion Quarterly 74: 711–781.
Berinsky, A., G. Huber, and G. Lenz. 2012. “Evaluating Online Labor Markets for Experimental
Research: Amazon.com’s Mechanical Turk.” Political Analysis 20: 351–​368. doi: 10.1093/​pan/​
mpr057
Chang, L., and J. A. Krosnick. 2009. “National Surveys via RDD Telephone Interviewing versus
the Internet: Comparing Sample Representativeness and Response Quality.” Public Opinion
Quarterly 73 (4): 641–​678.
Evans, J. R., and A. Mathur. 2005. “The Value of Online Surveys.” Internet Research 15
(2): 195–​219.
Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin. 2004. Bayesian Data Analysis, 2nd ed.
New York, NY: Chapman & Hall.
Gelman, A., and D. Rothschild. 2014. “When Should We Trust Polls from Non-​probability
Samples?” Washington Post, April 11. https://​www.washingtonpost.com/​news/​monkey-​
cage/​wp/​2014/​04/​11/​when-​should-​we-​trust-​polls-​from-​non-​probability-​samples/​.
Groves, R. M., and L. Lyberg. 2010. “Total Survey Error:  Past, Present, and Future.” Public
Opinion Quarterly 74 (5): 849–​879.
Kennedy, C., A. Mercer, S. Keeter, N. Hatley, K. McGeeney, and A. Gimenez. 2016. “Evaluating
Online Nonprobability Surveys.” Pew Research Center. http://​www.pewresearch.org/​2016/​
05/​02/​evaluating-​online-​nonprobability-​surveys/​.
Klecka, W. R., and A. J. Tuchfarber. 1978. “Random Digit Dialing: A Comparison to Personal
Surveys.” Public Opinion Quarterly 42 (1): 105–​114.
Kreuter, F., S. Presser, and R. Tourangeau. 2008. “Social Desirability Bias in CATI, IVR, and
Web Surveys; The Effects of Mode and Question Sensitivity.” Public Opinion Quarterly 72
(5): 847–​865.
Link, M. W., and A. H. Mokdad. 2005. “Effects of Survey Mode on Self-​Reports of Adult
Alcohol Consumption: A Comparison of Mail, Web and Telephone Approaches.” Journal
of Studies on Alcohol 66 (2): 239–​245. http://​www.jsad.com/​doi/​abs/​10.15288/​jsa.2005.66.239
Mulligan, K., J. T. Grant, S. T. Mockabee, and J. Q. Monson. 2003. “Response Latency
Methodology for Survey Research:  Measurement and Modeling Strategies.” Political
Analysis 11 (3): 289–​301.
Pasek, J., and J. A. Krosnick. 2010. “Measuring Intent to Participate and Participation in the 2010 Census and Their Correlates and Trends: Comparisons of RDD Telephone and Non-probability Sample Internet Survey Data.” Statistical Research Division of the US Census Bureau, 15.
Platek, R., and C.-​E. Särndal. 2001. “Can a Statistician Deliver?” Journal of Official Statistics 17
(1): 1–​20.
Rivers, D. 2007. “Sampling for Web Surveys.” Paper presented at the Joint Statistical Meetings. http://yg-public.s3.amazonaws.com/Scientific/Sample+Matching_JSM.pdf.
Sue, V. M., and L. A. Ritter. 2012. Conducting Online Surveys. Thousand Oaks, CA: Sage.
Valentine, J. C., T. D. Pigott, and H. R. Rothstein. 2010. “How Many Studies Do You Need?
A  Primer on Statistical Power for Meta-​analysis.” Journal of Educational and Behavioral
Statistics 35 (2): 215–​247.
Voosen, P. 2014. “As People Shun Pollsters, Researchers Put Online Surveys to the Test.”
Chronicle of Higher Education, August 28. http://​www.chronicle.com/​article/​As-​People-​
Shun-​Pollsters/​148503
Weeks, M. F., R. A. Kulka, J. T. Lessler, and R. W. Whitmore. 1983. “Personal versus Telephone
Surveys for Collecting Household Health Data at the Local Level.” American Journal of
Public Health 73 (12): 1389–​1394.
Chapter 5

Sampling for Studying Context

Traditional Surveys and New Directions

James G. Gimpel

Introduction

Many advances in the quantity and availability of information have social science
researchers reconsidering aspects of research design that were once considered either
settled or without serious alternatives. Sampling from a population is one of these areas,
in which options to simple random sampling and its common variants have emerged,
along with the technology to implement them. In this chapter I discuss sampling designs
in which subjects’ variable level of exposure to relatively fixed aspects of geographic
space are considered important to the research. In these circumstances a random
sample focused on representing a target population alone will not be sufficient to meet
the researcher’s goals. Traditional sampling will certainly be faithful to the density of the
population distribution, concentrating sampled subjects in highly populated areas. For
research that also requires spatial coverage to represent socioeconomic spaces, however,
common surveys are not the best option, even though they have been widely used in
the absence of better designs (Makse, Minkoff, and Sokhey 2014; Johnston, Harris, and
Jones 2007).
Not every survey is well-​suited to testing hypotheses about context.1 Not that long
ago political scientists and sociologists made creative attempts to use the American
National Election Study (ANES) or the General Social Survey (GSS) to reason about
context, while knowing that their sample designs did not represent a very broad range of
contexts (Giles and Dantico 1982; MacKuen and Brown 1987; Firebaugh and Schroeder
2009). In the design for the ANES, as in the GSS, administrative costs are vastly reduced
by adopting sampling strategies clustered in metropolitan areas, largely ignoring lightly
populated nonmetro locations. Resulting studies commonly found respondents to
be residing in less than one-​fifth of the nation’s counties and a limited range of more
granular “neighborhood” areas such as census tracts or block groups (Firebaugh and
Schroeder 2009). Because appropriate data drawn from alternative designs were scarce,
these surveys were commonly accepted as the best, and sometimes only, data available,
and there was no reporting on how well or poorly they captured the diversity of contexts
or living conditions at all—​not even in a note or an appendix. When it came to results,
sometimes context effects appeared, sometimes they didn’t, but one has to wonder how
many Type II errors, or false negatives, appeared due to the paucity of sample points in
many locations that would have added contextual variability. Publication bias against
null findings ensured that many of these investigations never surfaced in journals.
The basic resource deficit social scientists have faced for years is that conventional
survey sampling techniques do not yield the number of subjects necessary to estimate
effects of exposure to stimuli exhibiting geographic variation. Geographic contexts that
are represented are limited to those that underlie the population distribution, which lack
important elements of variability. Consequently, the application of random samples, or
more typically, random samples modified slightly by miscellaneous strata, has led to
errors in the estimates of numerous contextual variables and incorrect conclusions re-
garding the substantive effect of these variables in regression models. What is called for
is a sampling strategy that represents not only the population, but also the variation in
the inhabited environments hypothesized to influence the outcomes of interest. The key
is to allocate sampling effort so as to provide spatial balance to accommodate the need
to estimate exposure to environmental and geographic stimuli even in areas that are less
densely populated. Sometimes we need to represent places, in addition to people.

Location Dependent Nature of Opinion Formation and Socialization

It is not a new observation in social science research that natural environments matter to opinion formation and behavior in significant domains of judgment and decision-making.
A person’s exposure to a hazardous waste dump, a nuclear power plant, an extensive
wildfire, or a devastating hurricane matters greatly to his or her formation of opinions
about it. This is because risk assessment is distance dependent, with subjective levels of
concern varying with citizens’ degree of vulnerability to the hazard (Brody et al. 2008;
Larson and Santelmann 2007; Lindell and Perry 2004; Lindell and Earle 1983). In pro-
cessing news from media sources, communications scholars have found that perceived
susceptibility is an important general cue in processing threatening news, and that the
proximity of the particular threat is a key component of perceived susceptibility (Wise
et al. 2009, 271).
One need not be studying exposure to environmental hazards, weather-​related
catastrophes, or other location-​specific characteristics of the natural environment to see
how distance from the stimulus matters greatly to one’s reaction to it. Social and polit-
ical environments, while highly diverse across space, are also relatively stable—​not in
the fixed sense in which a mountain range or a hurricane’s path of destruction is, but
by the fact that social settings typically change very slowly, over years and even decades
(Downey 2006). Political scientists have noted that the socializing forces to which
people are regularly exposed typically do not exhibit wild volatility in their climates of
political opinion, but maintain stability over long periods (Berelson, Lazarsfeld, and
McPhee 1954, 298; Campbell et al. 1960; Huckfeldt and Sprague 1995). In this manner,
the same places produce similar political outcomes across several generations, even as
conditions elsewhere may change. Genes are inherited, but so also are environments,
meanings, and outlooks. Indeed, it would be surprising to learn that socioeconomic
environments did not shape opinions and viewpoints to some degree. Patterns of political partisanship and opinion across localities in the 1930s predict partisanship and opinion in those same places in the 2000s and 2010s remarkably well.
Habits of allegiance to parties continue for years, long after the original cause of alle-
giance to those parties has been forgotten. In this manner, the content of partisan and
ideological labels may change over time, but the partisan balance of identifiers will stay
much the same even though new citizens enter and exit the local electorate through gen-
erational replacement and migration (Merriam and Gosnell 1929, 26–​27; Miller 1991;
Green and Yoon 2002; Kolbe 1975). Apparently exposure to the stable socializing forces
abiding in a “neighborhood” or place influences political outlook, whereas distance
from them weakens the impression they make.
Although many sources of political learning and socialization are not local in their
ultimate origin, they may still be moderated or mediated by contextual forces meas-
ured at various levels of geography (Reeves and Gimpel 2012). Through the process
of biased information flow and filtering, places exert a socializing influence. But in-
fluential communication is not always so direct, as there is also the indirect process
of “social absorption” whereby individuals learn what is considered to be normal and
appropriate through observation and imitation. This process is also described as a
neighborhood influence or referred to as exercising a “neighborhood effect.” In the
context of political socialization literature, the idea is that what citizens know and
learn about politics is influenced by local settings and the social interactions within
them and is reinforced by repetition and routine (Huckfeldt 1984; Jencks and Mayer
1990). Importantly, a neighborhood effect is an independent causal impact of the local
context on any number of outcomes, controlling for individual attributes (Oakes
2004). The idea is that all other things being equal, the residents of some locations
will behave differently because of the characteristics of their locations (Spielman, Yoo,
and Linkletter 2013; Sampson, Morenoff, and Gannon-​Rowley 2002). When it comes
to surveying populations for conducting studies on neighborhood effects and poli-
tics, there is reason to question whether traditional sampling strategies are useful for
capturing the variation in environmental exposure theorized to have a causal impact
on opinion formation, judgment, and behavior (Cutler 2007; Kumar 2007; Johnston,
Harris, and Jones 2007).
For practitioners studying political behavior from the standpoint of campaign poli-
tics, it is evident from the emergent body of experimental research that local political
environments are widely believed to matter even to short-​term electoral movements.
After all, it is the local political environments that these researchers are attempting to
manipulate. Even if a social scientist could measure every individual personal trait, in-
cluding them all in an explanatory model violates principles of parsimony and commits
the atomistic fallacy by presuming that only individual factors can be causal (Huckfeldt
2014, 47). In addition, it may well be that social and institutional causes of behavior,
those originating out of communities or environments, are more amenable to “policy”
or campaign intervention designed to persuade voters or to stimulate higher turnout.
Changing someone’s personality or fundamental psychological orientation toward pol-
itics may not be within the capacity of any campaign. But it is certainly possible to alter
a voter’s information environment or try other stimuli and communications that might
marginally increase turnout or persuade some share of voters to vote differently than
they would otherwise.
In summary, traditional survey research designs for gathering information on
voter attitudes and behavior usually ignore variability in context in favor of repre-
sentation of a target population. This is understandable given that the usual goal is
to forecast elections, and an accurate measure of the horse race is taken to be the
standard for quality polling. Moreover, through some variation of stratified random
sampling, survey research has become adept at forecasting elections within a few
points. Even the much criticized surveys during the 2012 and 2014 U.S.  general
elections proved to be accurate when they were combined and averaged to balance
out the different information sets derived from slightly varying methods (Graefe
et  al. 2014). When sample sizes are large, these polls also provide reasonably ac-
curate estimates for focal subgroups of the electoral population. In the very act of
achieving those goals, however, scholars frequently eliminate the variations in geo-
graphic context that are likely to matter most to understanding social environments
and the interdependence among voters, limiting variation on such continua as
urban and rural, economic equality and inequality, occupational differences, ex-
posure to physical environmental conditions (e.g., water scarcity, pollution), and a
variety of others.

Examining the Spatial Distribution of a Simple Random Sample

Suppose that the frame for social scientific research was the state of Ohio’s regis-
tered voter population. What if we were to try to use a typically sized random sample
to study contextual effects on these voters? Just how well would that design work? We
might begin by drawing a pollster’s typically sized random sample of, say, one thousand
respondents to survey from the state’s file of registered voters. Of course to be faithful to
the real world, one would start by drawing more than one thousand, since many that we
would attempt to contact would refuse to cooperate or would otherwise fail to respond.2
For purposes of this chapter, we ignore that practical necessity and keep the considera-
tion only to the initial one thousand cases.
The geographic distribution of cases from that example of one thousand cases drawn
from the Ohio voter file from spring 2013 is shown in figure 5.1, with the state’s major
cities displayed in gray outline and the larger counties also labeled. Predictably, the
sample shows a geographic concentration of sample points in exactly the locations
we would expect them to be positioned if we were trying to represent the voter pop-
ulation of the state:  in the three major metropolitan areas running diagonally from
southwest to northeast, Cincinnati, Columbus, and Cleveland, respectively. The black
ellipse on the map summarizes the one standard deviation directional dispersion of the
sampled points around their geographic center. What the ellipse shows is that this typ-
ical random sample achieves very good representation of the geographic distribution
of the state’s electorate. Summary tabulations show that 7%, 10.1%, 11.6%, and 3.9% of
all registered voters from the state’s voter file reside in Hamilton (Cincinnati), Franklin
(Columbus), Cuyahoga (Cleveland), and Lucas (Toledo) Counties, respectively. In turn,
7.8%, 10.4%, 12%, and 4% of the simple random sample from figure 5.1 were distributed
within these four large counties, certainly an acceptably close reflection of the true pop-
ulation proportions.
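A minimal sketch of this exercise, assuming the voter file has been loaded into a pandas DataFrame with one row per registrant and a county field (the file name and column names here are hypothetical), draws the equal-probability sample and compares county shares in the sample with county shares in the file.

```python
import pandas as pd

# Hypothetical file and column names; one row per registered voter.
voters = pd.read_csv("ohio_voter_file_2013.csv")

# Simple random sample of 1,000 registrants, each drawn with equal probability.
srs = voters.sample(n=1000, random_state=42)

# Compare each county's share of the sample with its share of the full file
# (e.g., Hamilton, Franklin, Cuyahoga, and Lucas, as tabulated in the text).
comparison = pd.DataFrame({
    "file_share": voters["county"].value_counts(normalize=True),
    "sample_share": srs["county"].value_counts(normalize=True),
}).fillna(0.0)
print(comparison.sort_values("file_share", ascending=False).head(10))
```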
Simple random samples are important for undergirding statistical theory but are
rarely utilized in the practice of survey research, for well-​known reasons detailed
elsewhere in this volume and in reputable texts (Levy and Lemeshow 2008; Bradburn
and Sudman 1988; Kish 1965). One drawback is that a simple random sample,
selected on the equal-​probability-​of-​selection principle, may not provide enough
cases with a particular attribute to permit analysis. More commonly, random sam-
pling occurs within some subgroups identified by researchers before the sample is
drawn, according to reasonable and compelling strata, and sometimes in more than
one stage.
Across the social sciences, usually the strata chosen for surveys are attributes of
individuals, such as their race, income, age group, or education level. By first stratifying
into educational subgroups, for example, one can ensure that enough high school
dropouts, or college graduates with two-​year degrees, are included to permit compar-
ison with more common levels of educational attainment. When stratified random sam-
pling is sensitive to location, it is usually in the form of an area probability sample in
which selected geographic units are randomly drawn with probabilities proportionate
to estimated populations, and then households are drawn from these units on an equal
probability basis (Cochran 1963; Kish 1965; Sudman and Blair 1999). Ordinarily the
point of such sampling schemes is not to estimate the impact of contextual variation
or opinion formation within the structure of localities. The resulting samples are geo-
graphically clustered in a limited number of locations to reduce costs. As Johnston and
his colleagues (2007) have convincingly demonstrated, stratified samples may ensure a

Figure 5.1  Spatial Distribution of Simple Random Sample of Registered Voters from Ohio Voter File, 2013. [Map of Ohio showing sample points, cities, counties, and the standard deviational ellipse.]

nationally representative survey of voters (after weighting) but do not ensure a repre-
sentative sample of the varied socioeconomic contexts within a state or nation.
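A stripped-down sketch of that two-stage logic is given below (Python, with invented unit names and population counts): first-stage units are drawn with probability proportional to their estimated populations, and an equal-probability draw of households follows within each selected unit. The weighted draw without replacement is a simplified stand-in for the systematic PPS selection used in practice.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical primary sampling units (e.g., counties or tracts) and population counts.
units = ["A", "B", "C", "D", "E", "F"]
pop = np.array([500_000, 120_000, 80_000, 40_000, 30_000, 10_000])

# Stage 1: select units with probability proportional to estimated size.
n_psus, households_per_psu = 3, 50
psus = rng.choice(units, size=n_psus, replace=False, p=pop / pop.sum())

# Stage 2: within each selected unit, draw households with equal probability.
# Household IDs are simulated here; in practice they come from address lists.
sample = {
    psu: rng.choice(int(pop[units.index(psu)]), size=households_per_psu, replace=False)
    for psu in psus
}
print(list(psus), {psu: len(ids) for psu, ids in sample.items()})
```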
Better representation of localities is important. With respect to campaign intensity, it
is well-​recognized that parties and candidates invest far more effort in some places than
in others. One means for capturing some of this variability in resource allocation is to
measure spatial variation in exposure to campaign stimuli by media market area. For
purposes of purchasing advertising, the A. C. Nielsen Company has divided the nation
into designated market areas (DMAs) representing the loci out of which television and
radio stations grouped in a region broadcast to a surrounding population. With only a
couple of exceptions, Nielsen uses the nation’s counties to segment the country into mu-
tually exclusive and exhaustive market regions. Advertisers, including businesses and
political campaigns, use these market boundaries to guide the planning and purchasing
of broadcast advertising.3
Ohio is presently divided into twelve DMAs, three of which are centered in
other states; two in the southeast emanating from Charleston-​Huntington, and
Parkersburg, West Virginia; and a third in the northwest, centered in Fort Wayne,
Indiana, and extending across the border to encompass two rural Ohio counties (see
figure 5.2). By using the DMAs as strata, social science researchers can ensure that
no media markets go entirely unrepresented in a survey. Using simple random sam-
pling, it is possible that no cases could be drawn from the markets that are small in
population.

Proportional Allocation of a Sample

To avoid the possibility that some market areas wind up without any cases at all,
stratifying the sample allocation by DMA makes sense as an initial step. Then allocating
the total sample population proportionally is straightforward:  if the largest media
market contains 50% of a state’s voters, then a total sample of one thousand would al-
locate five hundred survey respondents to that stratum. In the case of Ohio, about 34%
of Ohio voters reside in the Cleveland media market. Stratifying by media market and
allocating the sample proportionally should result in a survey that positions approx-
imately one-​third of sample members within that market. One such sample is shown
in figure 5.3, which places 336 sample points in the Cleveland area, with the sample
populations in other DMAs also closely proportional to their share of the Ohio voter
population. The three major media markets, Cleveland, Columbus, and Cincinnati, are
home to 67.8% of the total sample shown in figure 5.3.
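The allocation step itself is mechanical. The sketch below (Python, reusing the hypothetical voter file from the earlier sketch and assuming it carries a dma column; rounding is ignored, so the total can be off by a case or two) gives each stratum a share of the one thousand interviews equal to its share of the file and then samples within strata.

```python
import pandas as pd

def allocate_proportionally(frame: pd.DataFrame, strata_col: str, n_total: int) -> pd.Series:
    """Allocate the total sample across strata in proportion to their shares of the frame."""
    shares = frame[strata_col].value_counts(normalize=True)
    return (shares * n_total).round().astype(int)

def stratified_sample(frame: pd.DataFrame, strata_col: str, allocation: pd.Series) -> pd.DataFrame:
    """Draw a simple random sample of the allocated size within each stratum."""
    parts = [
        frame[frame[strata_col] == stratum].sample(n=int(n), random_state=1)
        for stratum, n in allocation.items()
    ]
    return pd.concat(parts, ignore_index=True)

# Hypothetical file and column names, as before. With roughly 34% of registrants
# in the Cleveland DMA, about 340 of the 1,000 interviews land in that stratum.
voters = pd.read_csv("ohio_voter_file_2013.csv")
allocation = allocate_proportionally(voters, "dma", 1000)
prop_sample = stratified_sample(voters, "dma", allocation)
print(allocation)
```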
In practice, the results of a stratified sample may not look much different than a simple
random sample, but the stratification with proportional allocation ensures that at least
a few voters will be drawn from each of the twelve DMAs. The standard deviational
ellipse shown in figure 5.3 for the proportionally allocated sample shows slightly more
sensitivity to smaller DMAs than the simple random sample in figure 5.1. Note that the
proportionally allocated sample is less pulled in the direction of the sizable Cleveland
area DMA and is sensitive to the cases in the smaller DMAs in western Ohio. Ideally
the greater sensitivity to the smaller DMAs would permit us to obtain estimates from
some strata that a simple random sample would ignore. Several of the DMAs are very
small, however, and the sample size of one thousand remains too modest to represent
them adequately. The Fort Wayne DMA contains only four sample points (figure 5.3),

Figure 5.2  Ohio Designated Market Area (DMA) Map. [Map of Ohio's twelve designated market areas and major cities.]

and three others still contain fewer than ten, far too few for adequate analysis. This is a
clear reminder that the total sample size should be substantially larger than one thou-
sand in order to obtain more confident estimates of the means for these strata under
proportional allocation. This helps us explain why many polls remain inadequate for
testing contextual effects even under conditions of stratified sampling by geographic
units, whatever those geographic units happen to be (Johnston, Harris, and Jones 2007).
Proportionally allocating samples to strata is certainly effortless. Researchers also
consider it an improvement over simple random sampling from the standpoint of
ensuring that a geographic container such as a metropolitan area or DMA thought to

Figure 5.3  Stratified Random Sample with Proportional Allocation by DMA. [Map of Ohio DMAs showing sample points, cities, and standard deviational ellipses for the stratified and simple random samples.]

unify a population guides sample selection for purposes of generating an estimate. For
especially small DMAs, however, the resulting sample subpopulations are too small to
be useful. Any helpful contextual variation these DMAs might add will remain unac-
counted for because they are not well represented.
Unless the sample is considerably larger, contextual characteristics that capture features closely associated with lower density environments cannot be suitably tested for their impact. These include measures that benchmark important hypothesized causes of a large range of attitudes and behaviors that vary greatly by location: equality and inequality, some dimensions of racial and ethnic diversity, longevity, health, environmental protection, crime, self-employment, social capital, and many others.
Across several decades, researchers have borrowed regularly from surveys designed
for one purpose, representation of a target population, to evaluate theories and test
hypotheses about geographic contexts, without incorporating a proper range of con-
textual variation (Stipak and Hensler 1982). This has sometimes led researchers to con-
clude prematurely that context does not matter or has only substantively trivial effects
once we have properly controlled for individual-​level characteristics (King 1996; Hauser
1970, 1974). Contextual “effects,” by these accounts, are mostly an artifact of specifica-
tion error. Arguably such conclusions were based on reviewing early studies that had
adopted research designs that were ill suited for testing for contextual effects in the first
place. The 1984 South Bend Study by Huckfeldt and Sprague (1995) was among the first
in political science to randomly sample within neighborhoods purposely chosen to rep-
resent socioeconomic diversity. Their sample of fifteen hundred respondents is concen-
trated within sixteen neighborhoods, containing ninety-​four respondents each, and
reflects a broad range of living conditions among the population’s residents (Huckfeldt
and Sprague 1995; Huckfeldt, Plutzer, and Sprague 1993). Could it have been even more
widely sensitive to contextual variation? Yes, perhaps, if it had been “The St. Joseph
County Study,” “The Indiana Study,” or even “The Midwestern Study,” but other costly
features of their program included a three-​wave panel design and a separate survey of
nine hundred associates of the primary respondents. Given the multiple foci of the re-
search, narrowing the geographic scope of the work was a practical and realistic step.
In summary, under the stratified sample with proportional allocation, in order to re-
liably estimate values in all regions of interest, researchers would be required to greatly
enlarge the sample to ensure the number of cases necessary to generate a confident esti-
mate across the full range of contextual circumstances. A less costly alternative might be
to choose some other means for allocating the sample at the start.

Balanced Spatial Allocation of a Sample

As indicated previously, sometimes the research goal is not to generate a forecast of the
coming election, but to understand the impact of context, or changing some aspect of
the local political environment, on an outcome. Perhaps there are hypotheses in the re-
search about media effects, or response to advertising stimuli, some element of locally
tailored campaign outreach, or reaction to public policy adoption. These hypotheses
may be subject to testing via observation or field experimentation, but the research is
carried out within and across particular geographic domains, which should then be
sampled and compared accordingly.
Sometimes campaign researchers are at work fielding experimental manipulations
of campaign messages, varying the content, duration, and other qualities of broadcast
television and radio advertisements (e.g., Gerber et al. 2011). Relying on stratified, pro-
portionally allocated sampling strategies to assess these effects is a poor and potentially
costly alternative to designing a spatially balanced survey that will capably estimate re-
sponse to the market-​by-​market variation being introduced by the research team.
The Ohio map with a sample of one thousand respondents drawn in equal
proportions from each of the state’s twelve media markets is shown in figure 5.4. The
standard deviational ellipse identified as “Strata Equal” indicates the summary of the
spread of sample points from an equal allocation of a random sample across the dozen
DMAs. This ellipse, which strikingly extends outward nearly to the state’s borders,
marks a decided contrast with the diagonal-​shaped ellipse representing the “Strata
Prop” or sample points that were distributed on a population size basis. Quite vis-
ibly, the equally allocated sample is a very different one than either the simple random
sample shown in figure 5.1 or the sample allocated proportionally shown in figure 5.3.
Specifically, figure 5.4 shows that a sample of one thousand when divided equally among
twelve DMAs results in equal groups of eighty-​three respondents positioned within
each market, densely crowding small markets such as Lima and Zanesville, perhaps, but
more sparsely dotting Cleveland and Toledo.
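The contrast between the two allocations can also be quantified. In the sketch below (Python; column names and the demonstration data are hypothetical), the allocation helper splits the sample evenly across strata, and the dispersion summary uses an eigendecomposition of the coordinate covariance matrix as a PCA-style stand-in for the standard deviational ellipses drawn on the maps.

```python
import numpy as np
import pandas as pd

def equal_allocation(frame: pd.DataFrame, strata_col: str, n_total: int) -> pd.Series:
    """Split the total sample evenly across strata (1,000 over 12 DMAs is 83 apiece)."""
    strata = frame[strata_col].unique()
    return pd.Series(n_total // len(strata), index=strata)

def dispersion_summary(points: np.ndarray):
    """Mean center, axis directions, and standard deviations of a cloud of x/y coordinates.

    The eigenvectors and root eigenvalues of the covariance matrix describe the orientation
    and spread of the point cloud, much as a standard deviational ellipse does.
    """
    center = points.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(points, rowvar=False))
    return center, eigvecs, np.sqrt(eigvals)

# Demonstration on synthetic projected coordinates: an elongated cloud (like the
# proportionally allocated sample) has one axis far longer than the other,
# while a spatially balanced cloud does not.
rng = np.random.default_rng(3)
elongated = rng.normal(size=(1000, 2)) @ np.array([[3.0, 1.0], [0.0, 1.0]])
balanced = rng.normal(size=(1000, 2))
print(dispersion_summary(elongated)[2].round(2))  # roughly [0.9, 3.2]
print(dispersion_summary(balanced)[2].round(2))   # roughly [1.0, 1.0]
# Usage with real data would pass each sample's projected x/y coordinates, e.g.:
# allocation = equal_allocation(voters, "dma", 1000)
# equal_sample_sds = dispersion_summary(equal_sample[["x", "y"]].to_numpy())[2]
```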
Clearly the geographically balanced sampling strategy in figure 5.4 would not pass
muster with a traditional pollster aiming for a close geographic representation of the
state’s registered voter population. The pollster’s preference would surely be some-
thing akin to the sample shown in figure 5.3. But for a strategist testing media messages,
having randomized a roll-​out schedule with perhaps two advertisements, airing them
for variable periods over four weeks’ time across the twelve DMAs, a more spatially
sensitive strategy conveys some genuine advantages. For one, it becomes possible to
produce context-​specific regression estimates for all media markets for an individual
opinion (i.e., candidate support) on an individual characteristic (i.e., party identifi-
cation). The traditional pollster, implementing a sample design concentrated in the
largest markets, would only be able to produce an estimate for a few of the state’s media
markets, including Cleveland, Cincinnati, and Columbus, and these happen to be the
most costly and urban ones. Additional experimental variations can be tested in the lower cost markets at far less expense than in Cleveland or Cincinnati, but not if those markets have no experimental subjects included in the survey sample.
Representation of a state’s population is not everything. Sometimes who is predicted to
win an election is one of its less interesting aspects. Researchers are frequently interested
in observing geographic differences in the etiology of opinion about the candidates,
estimating the influence of the survey respondents’ social environments, gauging the
variable impact of issues on candidate support across the state, and evaluating the im-
pact of voters’ social and organizational involvements on their views. These ends are

Figure 5.4  Stratified Random Sample with Spatially Balanced Allocation by DMA. [Map of Ohio DMAs showing sample points, cities, and standard deviational ellipses for the proportionally allocated ("Strata Prop") and equally allocated ("Strata Equal") samples.]


more likely to be met by randomly drawing nearly equal-​sized subsamples from each
market while including questions about the local venues within which citizens’ lives are
organized.
Even if there is no field experimentation underway, past observational research has
suggested many ways in which space and place matter to our everyday lives. Economic,
social, health, and political outcomes are all hypothesized to be shaped by a multi-
level world. Survey samples have only recently become large enough to produce reli-
able estimates of the impact of contextual variables with samples that include areas of
relatively sparse population. In other cases, with some forethought and planning, it is
possible to sample densely enough to represent even sparsely populated locales and
media markets using conventional stratified random sampling. Such samples are costly,
requiring multiple thousands and even tens of thousands of cases, but they are more
available now than in previous decades thanks to easier forms of outreach to potential
respondents. What is not easy to do is to retrospectively extract from the major archives
surveys of size eight hundred, one thousand, and twelve hundred and use these to either
reveal or debunk the existence of neighborhood or contextual effects.
Of consequence to political scientists and campaign professionals, we have long
recognized that differing locations display very different political habits and outcomes.
No one would argue with the notion that variation in rates of voting participation,
political party support, and propensity to donate money or to show up at campaign
rallies is somehow related to the presence of socializing norms, ecological conditions,
and the assemblage of opportunities collected in a locale. Advantaged neighborhoods
offer more optimistic, efficacious, and empowering environments than impoverished
areas. Moreover, voters have been found to perceive accurately the climate of eco-
nomic circumstances and opinion in their proximate environments. Awareness of these
conditions, in turn, appears to be immediately pertinent to the formation of political
judgments (Newman et al. 2015).
Conventional sampling strategies have permitted the accumulation of knowledge about
only a limited set of context effects in social science literature, particularly those going to
racial and ethnic context, and a considerably smaller number that have examined socioec-
onomic status. Given that variation in race/​ethnic environment is often robust within the
major metropolitan areas where low cost samples are frequently clustered, we should not be
surprised to see so many published works addressing the subject. Should social science then
conclude that racial/​ethnic context is the only one that counts? Probably not, until we field
and evaluate more surveys that represent exposure to a far broader range of environments
than we have up to now. The social science convention that attributes important behavioral
outcomes to only one level of influence, usually the most immediate one, is not only mis-
leading, but progressively unnecessary in an era of information abundance.
To conclude, the very limited geographic coverage of traditional samples will not move
us forward without much larger sample populations. Such large samples are becoming
available, and there are also hybrid designs that propose to achieve population represen-
tation and spatial coverage at optimal sample size. These developments promise to ad-
vance the understanding of space and place effects in the formation of political attitudes
and behavior, something that conventionally designed survey samples were ill-​equipped
to do. Across the social sciences more broadly, new study designs promise to contribute
to greater knowledge about the spatial dependency and multilevel causality behind social,
economic, health, and political outcomes. They won’t do so without well-​formulated, mul-
tilevel theories of behavior, though. There are legitimate complaints about the ascension
of data analysis techniques over theory, and these criticisms are surely apt in the study of
place effects on behavior. Analysis should be driven not simply by the level of spatial data
available, but by theoretical considerations governing the etiology of the behavior. The ex-
plosion in the quantity and quality of social and political data dictates that a variety of
perspectives and tools should be brought to social science subject matter. But more com-
plex and realistic designs for data analysis require more sophisticated conceptualizations
of relationships within and across the multiple levels of analysis. Finally, just as the new
techniques for sampling and data analysis are shared by many disciplines, so too are the
theories of the underlying social processes going to draw from sources ranging across
disciplines. Even as relatively youthful social science fields build their own bodies of
knowledge from the rise in information, high-​quality work will require awareness of
developments in other fields. The answers to substantively important problems are in-
creasingly within the reach of social scientific expertise, broadly construed, but probably
out of the reach of those working narrowly within any particular social science field.

Notes
1. Because this chapter has self-​critical aims, I do not cite the work of others as much as I oth-
erwise would. The criticisms apply as much to my own work as to that of others. Where I do
cite the work of others, it should be considered only as a case in point, not as singling out a
particular scholar or study.
2. Contemporary pollsters commonly suggest drawing as many as fifteen or twenty times
the intended number of respondents in order to fulfill the required number of completed
surveys. Failures to respond by phone are generally met with repeated efforts to call back
the selected respondents, unless and until they flatly refuse to cooperate. Many polling
firms are now paying respondents a fee to induce their cooperation.
3. These boundaries are not impermeable, of course, and there are many examples of radio
and television broadcasts that spill over into neighboring markets.

References
Berelson, B. R., P. F. Lazarsfeld, and W. N. McPhee. 1954. Voting: A Study of Opinion Formation
in a Presidential Election. Chicago: University of Chicago Press.
Bradburn, N. M., and S. Sudman. 1988. Polls and Surveys: Understanding What They Tell Us.
San Francisco: Jossey-​Bass Publishers.
Brody, S. D., S. Zahran, A. Vedlitz, and H. Grover. 2008. “Examining the Relationship between
Physical Vulnerability and Public Perceptions of Global Climate Change in the United
States.” Environment and Behavior 40 (1): 72–​95.
Campbell, A., P. E. Converse, W. E. Miller, and D. E. Stokes. 1960. The American Voter.
New York: John Wiley and Sons.
Cochran, W. G. 1963. Sampling Techniques. New York: John Wiley and Sons.
Cutler, F. 2007. “Context and Attitude Formation: Social Interaction, Default Information or
Local Interests.” Political Geography 26 (5): 575–​600.
Downey, L. 2006. “Using Geographic Information Systems to Reconceptualize Spatial
Relationships and Ecological Context.” American Journal of Sociology 112 (2): 567–​612.
Firebaugh, G., and M. B. Schroeder. 2009. “Does Your Neighbor’s Income Affect Your
Happiness?” American Journal of Sociology 115 (3): 805.
Gerber, A. S., J. G. Gimpel, D. P. Green, and D. R. Shaw. 2011. “How Large and Long-​lasting
Are the Persuasive Effects of Televised Campaign Ads? Results from a Randomized Field
Experiment.” American Political Science Review 105 (1): 135–​150.
Giles, M. W., and M. K. Dantico. 1982. “Political Participation and Neighborhood Social
Context Revisited.” American Journal of Political Science 26 (1): 144–​150.
Graefe, A., J. S. Armstrong, R. J. Jones, and A. G. Cuzan. 2014. “Accuracy of Combined
Forecasts for the 2012 Presidential Election: The PollyVote.” PS: Political Science & Politics 47
(2): 427–​431.
Green, D. P., and D. H. Yoon. 2002. “Reconciling Individual and Aggregate Evidence
Concerning Partisan Stability: Applying Time-​Series Models to Panel Survey Data.” Political
Analysis 10 (1): 1–​24.
Hauser, R. M. 1970. “Context and Consex: A Cautionary Tale.” American Journal of Sociology 75
(4, pt. 2): 645–​664.
Hauser, R. M. 1974. “Contextual Analysis Revisited.” Sociological Methods & Research 2
(3): 365–​375.
Huckfeldt, R. R. 1984. “Political Loyalties and Social Class Ties: The Mechanisms of Contextual
Influence.” American Journal of Political Science 28 (2): 399–​417.
Huckfeldt, R. 2014. “Networks, Contexts, and the Combinatorial Dynamics of Democratic
Politics.” Advances in Political Psychology 35 (S1): 43–​68.
Huckfeldt, R., E. Plutzer, and J. Sprague. 1993. “Alternative Contexts of Political
Behavior:  Churches, Neighborhoods, and Individuals.” Journal of Politics 55
(2): 365–​381.
Huckfeldt, R., and J. Sprague. 1995. Citizens, Politics and Social Communication: Information
and Influence in an Election Campaign. New York: Cambridge University Press.
Jencks, C., and S. E. Mayer. 1990. “The Social Consequences of Growing Up in a Poor
Neighborhood.” In Inner-​city Poverty in the United States, edited by M. McGeary, 111–​186.
Washington, DC: National Academy Press.
Johnston, R., R. Harris, and K. Jones. 2007. “Sampling People or People in Places? The BES as an
Election Study.” Political Studies 55: 86–​112.
King, G. 1996. “Why Context Should Not Count.” Political Geography 15 (2): 159–​164.
Kish, L. 1965. Survey Sampling. New York: John Wiley and Sons.
Kolbe, R. L. 1975. “Culture, Political Parties and Voting Behavior: Schuylkill County.” Polity 8
(2): 241–​268.
Kumar, N. 2007. “Spatial Sampling Design for a Demographic and Health Survey.” Population
Research and Policy Review 26 (3): 581–​599.
Larson, K. L., and M. V. Santelmann. 2007. “An Analysis of the Relationship between Residents’
Proximity to Water and Attitudes about Resource Protection.” The Professional Geographer
59 (3): 316–​333.
Levy, P. S., and S. Lemeshow. 2008. Sampling of Populations: Methods and Applications. 4th ed.
New York: John Wiley and Sons.
Lindell, M. K., and T. C. Earle. 1983. “How Close Is Close Enough: Public Perceptions of the
Risks of Industrial Facilities.” Risk Analysis 3 (4): 245–​253.
Lindell, M. K., and R. W. Perry. 2004. Communicating Environmental Risk in Multiethnic
Communities. Thousand Oaks, CA: Sage Publications.
MacKuen, M., and C. Brown. 1987. "Political Context and Attitude Change." American Political
Science Review 81 (2): 471–490.
Makse, T., S. L. Minkoff, and A. E. Sokhey. 2014. “Networks, Context and the Use of Spatially-​
Weighted Survey Metrics.” Political Geography 42 (4): 70–​91.
Merriam, C. E., and H. F. Gosnell. 1929. The American Party System. New York: Macmillan.
Miller, W. E. 1991. "Party Identification, Realignment, and Party Voting: Back to the
Basics." American Political Science Review 85 (2): 557–568.
Newman, B. J., Y. Velez, T. K. Hartman, and A. Bankert. 2015. “Are Citizens ‘Receiving the
Treatment’? Assessing a Key Link in Contextual Theories of Public Opinion and Political
Behavior.” Political Psychology 36 (1): 123–​131.
Oakes, J. M. 2004. "The (Mis)estimation of Neighborhood Effects: Causal Inference for a
Practical Social Epidemiology." Social Science and Medicine 58 (10): 1929–1952.
Reeves, A., and J. G. Gimpel. 2012. “Ecologies of Unease: Geographic Context and National
Economic Evaluations.” Political Behavior 34 (3): 507–​534.
Sampson, R. J., J. D. Morenoff, and T. Gannon-​Rowley. 2002. “Assessing ‘Neighborhood
Effects’:  Social Processes and New Directions in Research.” Annual Review of Sociology
28: 443–​478.
Spielman, S. E., E.-​H. Yoo, and C. Linkletter. 2013. “Neighborhood Contexts, Health, and
Behavior:  Understanding the Role of Scale and Residential Sorting.” Environment and
Planning B: Planning and Design 40 (3): 489–​506.
Stipak, B., and C. Hensler. 1982. “Statistical Inference in Contextual Analysis.” American
Journal of Political Science 26 (1): 151–​175.
Sudman, S., and E. Blair. 1999. “Sampling in the Twenty-​First Century.” Journal of the Academy
of Marketing Science 27 (2): 269–​277.
Wise, K., P. Eckler, A. Kononova, and J. Littau. 2009. “Exploring the Hardwired for News
Hypothesis:  How Threat Proximity Affects the Cognitive and Emotional Processing of
Health-​Related Print News.” Communication Studies 60 (3): 268–​287.
Chapter 6

Questionnaire Science

Daniel L. Oberski

Why It Is Important to Ask Good Questions

In polling, everything hinges on asking good questions. If I tried to measure your
opinion about the current president by asking “How much do you like ice cream?,”
I would not get very far; that question would have no validity. But even if I did ask your
opinion about the president, but did so in such a convoluted way that you would not
know what to make of it, your answer might not be as valuable as it could have been.
Take this made-​up question, for example:

To which extent do you disagree with the statement “the current president’s actions
are not entirely unlike my own actions sometimes but some of his policies are not
often bad”?
2 Not entirely disagree
3 Disagree
−1 Don’t know
−2 Agree somewhat
−3 Agree slightly
−4 Neither agree nor disagree

Is the statement about the president positive or negative, and to what extent? What
"actions" and "policies" come to mind? Which is stronger: "somewhat" or "slightly"?
Is category −1 neutral? These are just a few of the many issues plaguing this unfortu-
nate survey question. When you answer the question, you need to solve these issues
in order to answer, but since the solutions are ambiguous at best, different people will
choose different answer strategies—​even if they had the same opinion about the presi-
dent. If you changed your mind about the president next year, you might even solve the
problem of answering this terrible question differently and give the same answer as you
did previously, even though you changed your opinion. Such differences in answers be-
tween people with the same opinion are called “unreliability” in the literature (Lord and
Novick 1968). So even when a question is about the right topic, the way it is asked still
determines how reliable the answers will be.
Unreliability is important because it strongly biases estimates of relationships (Fuller
1987; Carroll et al. 2006). For example, if I were interested in the relationship between
presidential approval and consumer confidence, I might calculate a correlation between
these two variables; unreliability would then attenuate this correlation downward, while
common method variance would spuriously increase it. So this estimate would be se-
verely biased, and without additional information about the reliability and common
method variance, there is no way of knowing the size and direction of this bias.
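
To see the size of these distortions, here is a minimal numerical sketch in Python; the true correlation, reliabilities, and shared method variance below are invented for illustration, and the linear decomposition is a deliberate simplification.

import numpy as np

# Invented values, for illustration only.
true_corr = 0.50             # correlation between the two underlying opinions
rel_1, rel_2 = 0.70, 0.60    # proportions of variance due to the true opinions
cmv_1, cmv_2 = 0.10, 0.10    # proportions of variance due to a shared method factor

# Unreliability attenuates the true correlation...
attenuated = true_corr * np.sqrt(rel_1 * rel_2)

# ...while shared method variance adds a spurious component on top.
observed = attenuated + np.sqrt(cmv_1 * cmv_2)

print(round(attenuated, 2))  # about 0.32: attenuation alone
print(round(observed, 2))    # about 0.42: attenuation plus spurious inflation

Without outside information on the reliabilities and the shared method variance, an observed correlation of 0.42 could equally well reflect a much weaker or a much stronger true relationship.
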
Unreliability’s effects on estimates of relationships extend to relationships over time,
such as panel or longitudinal data and time series (Hagenaars 1990). Random measure-
ment error will cause spurious shifts in opinion and jumps in time series that are purely
due to the measurement error. Common method variance, on the other hand, can make
opinions appear much more stable than they truly are.
When comparing groups, the measurement error resulting from poor question de-
sign may again bias the analysis. For example, prior research suggests that highly
educated respondents tend to “acquiesce”—​agree to a statement regardless of its
content—​less (Narayan and Krosnick 1996). If we compared the average response to
an agree-​disagree question in Washington, DC, where 49% of adults hold a bachelor’s
degree, to West Virginia, where only 17% do,1 on average we would expect the West
Virginians to agree more with any statement, regardless of its content. A researcher who
found that West Virginians indeed agreed more with her statement would then be at a loss to
say whether this was because of a difference in opinion or one of measurement error.
This incomparability is also called “measurement non-​invariance,” “measurement non-​
equivalence,” or “differential item functioning” in the literature (see Oberski 2012).
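
A small simulation makes the problem concrete. The acquiescence probabilities below are invented for illustration (they are not estimates for any real population), and acquiescence is modeled crudely as a fixed push toward "agree."

import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Both groups hold identical true opinions on a standardized agree-disagree scale.
true_opinions = rng.normal(0.0, 1.0, size=(2, n))

# Invented probabilities of content-free agreeing (acquiescence) in each group.
p_acquiesce = {"more educated group": 0.10, "less educated group": 0.35}

for (group, p), opinions in zip(p_acquiesce.items(), true_opinions):
    acquiescers = rng.random(n) < p
    answers = opinions + np.where(acquiescers, 1.0, 0.0)  # push toward "agree"
    print(f"{group}: observed mean = {answers.mean():+.2f}")

# The output shows a gap of roughly 0.25 scale points between groups whose true
# opinions are identical: a spurious difference driven purely by measurement.
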
My contrived example serves to illustrate how unreliability may result from a
question’s phrasing and other characteristics, and that this unreliability is vital to draw
accurate conclusions about many social phenomena. Of course I purposefully broke
every rule in the book when phrasing the above question. Real polling questions follow
“best practices,” a set of approximate rules handed down by textbooks, or they are
designed by experts. Even so, differences in respondents’ answering strategy still occur,
with the resulting unreliability of answers. And how can we be sure that all the many
issues that could plague a survey question are actually taken care of in its formulation? Is
expert opinion enough?
The remainder of this chapter aims to answer these questions. I argue that deferring
to textbooks and experts is not enough to design the best questions, but that a body
of scientific knowledge about questionnaire design does exist, comprising cognitive
theory, empirical observations, and carefully designed experiments. I then discuss some
examples of scientific knowledge about questionnaire design, including a large meta-​
analysis that has yielded user-​friendly software encoding such knowledge.

What We Do Not Know about Asking Questions

Pollsters and other survey research agencies have vast amounts of experience doing
surveys. Thanks to these researchers’ awareness that everything hinges on asking good
questions, it has become common practice to vet the questions in advance using ques-
tionnaire reviews, pretests, and other such evaluations (see Madans et al. 2011 for an
overview). These procedures are meant to ensure that the right questions are asked
in the best way possible. Regardless of the procedure followed to improve a question,
though, the initial design typically follows “best practices”:  standards for designing
survey questions that have become encoded in the many textbooks now available on
good questionnaire construction.
So what practices are currently considered “best,” and how many of them do survey
researchers actually implement? To answer these questions, I picked up a selection of
well- and lesser-known "how-to" advice books on survey and questionnaire design, as
well as the very comprehensive Handbook of Marketing Scales (Netemeyer et al. 2011),
which contains over 150 meticulously documented examples of vetted questionnaires
used in marketing research. Table 6.1 shows what these books advise regarding negative
questions in a battery (“Negative”), the preferred number of categories (“Categories”),
the use of agree-​disagree questions (“Agree-​disagree”), and double-​barreled questions.
These examples are by no means an exhaustive list of possible design choices, but are all
commonly mentioned in the textbooks and serve to demonstrate how question design
advice is given and taken.
Table 6.1 shows that, broadly, there is a consensus on some of these best practices,
while others are contradictory. For example, all textbooks listed in the table agree that
double-​barreled questions are a bad idea, and most agree that negatively formulated
questions are to be avoided. On the other hand, there is little agreement among these
authors on the use of agree-​disagree questions or the number of categories; here, one
author’s best practice is another’s faux pas.
The bottom row of table 6.1 is meant to give an idea of the actual—​as (possibly)
opposed to “best”—​practices of marketing research surveys from a small sample of the
scales in the Handbook. Where textbook authors agree on the “best” practice, the ac-
tual practice is more often than not the opposite; for example, I found double-​barreled
questions in 60% of the sampled scales, and about half of the scales use the negative
formulations that textbooks agree should be avoided. Moreover, there was very little ac-
tual variation in the number of scale points, most scales using seven-​point scales: here
there is a common practice even though a best practice is not actually agreed upon by
the textbooks. A researcher following Bradburn et al.’s advice (2004, 149) to take existing
questionnaires as a starting point may then be forgiven for thinking that seven-​point
scales represent a consensus best practice.

Table 6.1 Best and Actual Practices for Four Commonly Discussed Question Characteristics

Book | Negative | Categories | Agree-disagree | Double-barreled
Bradburn et al. (2004) | Avoid (p. 325) | 7 (p. 331) | Good (p. 244) | Bad
Dijkstra and Smit (1999) | Avoid (p. 83) | – | Avoid (p. 95) | Bad
Dillman (2011) | Avoid (p. 73) | – | Avoid (p. 62) | Bad
Folz (1996) | – | – | Neutral | Bad
Fink (2009) | Avoid (p. 29) | 4 or 5 | Neutral | Bad
Fowler (2014) | – | – | Avoid (p. 105) | Bad
Marketing Scales* | 50% | 5, 6, or 7 | 67% | 60%

– The aspect is mentioned, but no negative or positive advice is given.
* Based on a random sample of 10 scales from the book (s.e. about 15%).

While very limited, the microreview offered by table 6.1 suggests that (1) some “best”
practices are contradictory; (2) some consensus best practices are not usually followed;
and (3) a strong common practice may be present, absent any actual consensus on the
best practice. In short, to quote Dillman (2011, 50) “the rules, admonitions, and prin-
ciples for how to word questions, enumerated in various books and articles, present a
mind-​boggling array of generally good but often conflicting and confusing directions
about how to do it”; deferring to common or “best” practices is clearly not enough to
warrant trustworthy conclusions from our surveys.

Beyond Agreeing to Disagree: What We Do Know

If best practices are so conflicting, is question design a matter of taste? After all, the title
of one of the most classic of all question design textbooks, Payne’s The Art of Asking
Questions (1951), directly suggests exactly that. And if that is true, this arbitrary nature
of survey question design would detract from the trustworthiness of conclusions based
on such questions. Fortunately, though, we can decide which practices truly are “best”
under specific circumstances by experimenting with them, and there is now a substan-
tial literature arbitrating among such practices.
As an example, consider one of the design choices of some apparent contention among
textbooks: the agree-​disagree scales that proved so popular in existing questionnaires.
There are three good reasons to think that agree-​disagree scales are, in fact, a bad idea.
First are theoretical reasons. Cognitive psychology suggests that agree-​disagree scales
place an unnecessary cognitive burden on the respondent that causes respondents to
“satisfice”—​that is, to take shortcuts when answering the questions. Révilla et al. (2013)
compared the process needed to answer an agree-​disagree question such as “to what
extent do you agree or disagree that immigration is bad for the economy?” with that
needed to answer an “item-​specific” question such as “how good or bad for the economy
is immigration?” The latter, a well-​known model of cognitive survey response suggests,
is answered in several stages: comprehension of the question, retrieval of relevant infor-
mation, judgment of this information, and response (Tourangeau et al. 2000).
In the example question “how good or bad for the economy is immigration?,”
the respondent would first read and understand words such as “immigration,”
“economy,” “good,” and “bad,” as well as the grammatical structure of the sentence
that gives it meaning—​for example, the presence of the WH word “how,” turning
the phrase into a request for graded information. If the respondent is satisficing, the
phrase might not be read, but the answer categories might be read directly instead.
These might say something like “immigration is very good for the economy,” a sen-
tence that communicates the required meaning on its own. Subsequently, informa-
tion stored in memory about relevant concepts is retrieved until the respondent has
had enough. When satisficing, the respondent may only retrieve the most salient in-
formation: things that he or she may have heard just recently or very often. In the
next stage, the theory suggests, this information is weighed and the actual opinion
formed. Again, instead of weighing all the pros and cons as a professional economist
might do, a respondent trying to get through the questionnaire may use simple rules
to reach a judgment. Finally, the opinion must be mapped onto the response scale. If
the respondent’s internal idea about his or her opinion matches the labels closely, this
can be a matter of “choosing the option that comes closest,” as we often instruct our
respondents. A satisficing respondent may choose a different strategy. For example, he
or she may choose one side of the issue and opt for the most extreme response on that
side. This is known in the literature as “extreme response style.” Thus, at each stage
there is a potential for satisficing.
Our hypothetical journey through a survey question-​and-​answer process shows that
answering a question is a complicated cognitive process. Because it is so complicated,
different respondents holding the same opinion could give different answers. The higher
the cognitive burden of answering a question, the more respondents will satisfice, and
the more their answers will differ erroneously and correlate spuriously.
And that is precisely the theoretical problem with the agree-​disagree format, such as
“to what extent do you agree or disagree that immigration is bad for the economy?”: its
cognitive burden is higher than that of the direct question. At the response stage, it
is not enough for the respondent to simply find the response option closest to his or
her opinion. Instead, the respondent must create a mental scale of opinions, locate the
statement on it, locate his or her own opinion on it, and then decide how the distance
between them maps onto an agreement scale (e.g., Trabasso et al. 1971). If this process
sounds incredibly burdensome, you are right. To avoid this burden, respondents often
satisfice. Thus, we think that agree-​disagree questions simply involve a higher cognitive
burden, because respondents take much longer to answer an agree-​disagree question
than to answer the corresponding direct question, and when they do, we observe more
satisficing behaviors.
The psychologist Rensis Likert (1903–​1981), who is often said to have invented
agree-​disagree questions, was well aware of this potential problem. His solution to
the problem was to authoritatively assume it away: “It is quite immaterial what the
extremes of the attitude continuum are called. . . . [I]‌t makes no difference whether the
zero extreme is assigned to ‘appreciation of ’ the church or ‘depreciation of ’ the church”
(Likert 1932, 48). We now know this to be false. Experiments show that varying the ex-
tremeness of the statement or negating it with the word “not,” which Likert thought
would not make any difference, can in fact radically shift the answers people give (e.g.,
Schuman and Presser 1981). Worse still, the effect seems to differ across respondents,
causing random errors.
This brings us to the second set of reasons to discard agree-​disagree scales: they are
less valid and less reliable than direct questions. “Unreliable” means there will be vari-
ations in the answers of people who we suspect have the exact same opinion. After all,
if two people have the same opinion, the ideal, perfectly reliable, opinion poll would
yield equal answers. Similarly, known differences should be reflected in the answers. For
example, a question about the role of women in society should at least on average be re-
lated to gender. An invalid question, which does not measure the intended opinion, will
fail such tests.
Unfortunately, a person’s “true opinion” cannot be observed. We can, however, trans-
late the two requirements of reliability and validity into numbers that can be estimated
from observable data. There are various approaches to doing so, all of which involve
taking not just one but several measures of the same phenomenon to make statements
about reliability and/​or validity. Commonly used approaches are the quasi-​simplex
model (Heise and Bohrnstedt 1970; Wiley and Wiley 1970; Alwin 2007, 2011), in
which each respondent is asked the same question in multiple waves of a panel, and
the multitrait-​multimethod (MTMM) approach (Campbell and Fiske 1959; Andrews
1984; Saris and Gallhofer 2007b; Saris et al. 2012), in which a within-​persons experiment
is performed on the question format. Various studies performed in several countries
suggest that both the reliability and the validity of questions estimated in this way in an
agree-disagree format are lower than in other formats (Krosnick and Fabrigar 2001;
Saris et al. 2010).
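
To illustrate the quasi-simplex idea, the following sketch simulates a three-wave panel under that model's assumptions (invented reliability and stability values) and recovers the reliability of the middle wave from the three observed correlations; the simple ratio works only because of the model's Markov assumption.

import numpy as np

rng = np.random.default_rng(1)
n = 200_000

rel = 0.64                 # invented reliability (proportion of true-score variance)
stab12, stab23 = 0.9, 0.8  # invented true-opinion stabilities between adjacent waves

# Quasi-simplex data-generating model: the wave 3 opinion depends on wave 2 only.
t1 = rng.normal(size=n)
t2 = stab12 * t1 + np.sqrt(1 - stab12**2) * rng.normal(size=n)
t3 = stab23 * t2 + np.sqrt(1 - stab23**2) * rng.normal(size=n)

def observe(t):
    # Observed answer = true score plus independent random measurement error.
    return np.sqrt(rel) * t + np.sqrt(1 - rel) * rng.normal(size=n)

y1, y2, y3 = observe(t1), observe(t2), observe(t3)
r12 = np.corrcoef(y1, y2)[0, 1]
r23 = np.corrcoef(y2, y3)[0, 1]
r13 = np.corrcoef(y1, y3)[0, 1]

# Under the Markov assumption the stabilities cancel in this ratio,
# leaving the reliability of the wave 2 measure.
print(round(r12 * r23 / r13, 2))  # close to 0.64
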
The third and final reason to discard agree-​disagree scales might form an explana-
tion for the empirical finding that these scales are less valid and reliable: acquiescence.
Acquiescence is the empirical finding that “some respondents are inclined to agree with
just about any assertion, regardless of its content” (Révilla et  al. 2013). For example,
Krosnick (2009) reported that 62–​70% of respondents agree with the question “do you
agree or disagree with this statement?” This question measures nothing, but people
lean toward agreeing with it anyway. Other studies have found that a sizable group of
people will agree with both a statement and its opposite (e.g., Selznick and Steinberg
1969). Furthermore, pointless agreement is more common among low-​education
groups, younger people, and tired respondents (e.g., Narayan and Krosnick 1996). So
the tendency to agree with anything varies across respondents. This not only creates
random differences between people, but also spuriously correlates any questions that
are asked in the agree-​disagree format, since part of their shared variance will be shared
acquiescence.
The agree-​disagree format is an example of a common practice on which survey de-
sign textbooks do not agree, even though the theoretical and empirical evidence against
it, of which this section has only scratched the surface, is impressive. Reviewing that
body of evidence is not a trivial task, however. What’s more, the agree-​disagree format
is just one of the many choices a researcher is faced with when asking a question; the
number of categories, use of negative formulations, and double-​barreled phrases were
already mentioned. But there are many more: whether to balance the request, for ex-
ample by asking “is immigration good or bad for the economy?,” rather than just “bad
for the economy,” is another example, famously studied by Schuman and Presser (1981).
Other choices are the complexity of the sentences used, the grammatical structure of the
sentences, whether to give further information or definitions to the respondent, where
to place the question in the questionnaire, the choice of answer scale, the choice of labels
if response categories are used, and so forth.
To get a feel for these choices, refer to figure 6.1, and—​without reading the footnote at
the end of this paragraph—​try to spot the differences among the three versions. Some
are obvious, such as the number of scale points. Others are less so. For example, versions

Version A. The next 3 questions are about your current job. Please choose one of the following to describe how varied your work is.
  Not at all varied
  A little varied
  Quite varied
  Very varied

Version B. Please indicate, on a scale of 0 to 10, how varied your work is, where 0 is not at all varied and 10 is very varied. Please tick the box that is closest to your opinion
  Not at all varied   0  1  2  3  4  5  6  7  8  9  10   Very varied

Version C. Now for some questions about your current job.
  Would you say your work is… [Interviewer: READ OUT]
  1 …not at all varied,
  2 a little varied,
  3 quite varied,
  4 or, very varied?
  8 (Don't know)

Figure 6.1 Three ways to ask a question, all tried in the European Social Survey (2002).

Topic: • Domain • Concept • Social desirability • Centrality to respondent • Fact vs. opinion • Past/present/future

Wording: • Direct question vs. other formulations • Period or date • WH word used • Use of gradation • Balance of the request • Encouragement in question • Emphasis on subjective opinion • Other peoples' opinion given • Stimulus or statement • Absolute/comparative • Knowledge or definitions • Avg. words/sentence • Avg. syllables/word • No. subordinate clauses • No. nouns • No. abstract nouns • Introduction used • Avg. words/sentence, intro • Avg. syllables/word, intro • No. subordinate clauses, intro • No. nouns, intro • No. abstract nouns, intro

Administration: • Computer assisted • Interviewer present • Oral/visual • Showcard used • Showcard horizontal/vertical • Showcard pictures • Showcard letters/numbers • Showcard labels overlap • Interviewer instruction • Respondent instruction • Position in the questionnaire • Country • Language

Response scale: • Type of response scale • Number of categories • Labels full, partial, or no • Labels full sentences • Order of labels • Numbers correspond to labels • Unipolar/bipolar; theoretical • Unipolar/bipolar; used • Neutral category • No. fixed reference points • Don't know option

Figure 6.2 Some choices made when formulating a question and coded in SQP 2.0.

A and C are very similar, but could in fact be considered to differ on at least six aspects
that the literature has suggested may matter for their reliability and validity.2
Clearly the number of choices made whenever we ask a respondent a question is con-
siderable. Figure 6.2 shows a number of these choices, which the literature has suggested
make a difference to the reliability and validity of the question (Saris and Gallhofer
2007a). While knowing of their existence is useful, this knowledge does not immedi-
ately lead to better survey questions; it would be an insurmountable task for a researcher
to go through the literature on each of these issues or do his or her own experiments for
every single question asked. Moreover, as the example in figure 6.1 illustrates, it may not
be so easy to recognize every single relevant choice made. Without a tool to code these
choices, we are at risk of focusing on issues that happen to be highly studied or that
experts happen to have a strong opinion on, to the possible detriment of other choices
that are less eye-​catching but equally crucial to obtaining adequate measures of people’s
opinions. What we need to make informed, evidence-​based decisions is a structured
summary of the literature on these issues: a meta-​analysis of what makes a better or
worse survey question.

A Meta-​Analysis of Survey Experiments

One such meta-​analysis is a multiyear project we performed in 2011 (Saris et al. 2012)
on several thousand questions that were a part of the European Social Survey, as well
as others that were part of a project executed in the United States and several European countries
(these questions were also included in Andrews 1984; Scherpenzeel 1995; Saris and
Gallhofer 2007b). Other analyses can be found in Alwin and Krosnick (1991) and Alwin
(2007). In this project, we took the following steps:

1. Estimated the reliability and common method variance (together: "quality") of a large number of questions.
2. Coded characteristics of the questions that literature suggests relate to question
quality.
3. Predicted question quality from question characteristics (meta-​analysis).
4. Created a freely available online web application that allows researchers to
input their question and obtain its predicted quality; the “Survey Quality
Predictor” (SQP).

The following subsections briefly explain each of these steps, focusing most attention on
the practical tool for applied survey researchers, SQP.

Estimating Question Quality


There are several possible indicators of how good a question is. Two highly impor-
tant indicators of quality are the reliability and common method variance. Both reli-
ability and method variance can be expressed as numbers between 0 and 1 and can be
interpreted as proportions of variance explained (R²): true-score variance (reliability) and
method variance, respectively.
The reliability of a question is the correlation that answers to the question will have
with the true values (or “true score”). For example, when asking about the number of
doctors’ visits, reliability is the correlation between the number of times the respondents
claim to have visited the doctor on the one hand, and the actual number of times they
visited the doctor on the other hand. When dealing with opinions, a true value is dif-
ficult to define; instead, a “true score” is defined as the hypothetical average answer
that would be obtained if the same question were repeated and there were no memory
(for more precise explanations of these concepts see Lord and Novick 1968; Saris and
Gallhofer 2007a).
The common method variance of a question is the proportion of variance explained by
measurement effects, such as acquiescence, that the question has in common
with other, similar questions. This shared measurement error variance causes spurious
correlations among question answers. For example, if a question has a common method
variance of 0.2, it can be expected to correlate 0.2 with a completely unrelated question
asked in the same manner (“method”; Saris and Gallhofer 2007a).
Campbell and Fiske (1959) suggested an experimental design to study both relia-
bility and common method variance simultaneously: the MTMM design. Procedures
to estimate reliability and method variance of survey questions directly using struc-
tural equation models (SEM) were subsequently applied by Andrews (1984). Each such
experiment crosses three survey questions to be studied (“traits”) with three methods
by which these questions can be asked (“methods”). By applying decomposition of var-
iance using SEM, we can then disentangle what part of the survey questions’ variance is
due to the question, what part is due to how it was asked, and what part is not reproduc-
ible across repetitions (random error). A deeper explanation of MTMM experiments
from a within-​persons perspective can be found in Cernat and Oberski (2017).
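
The logic of these experiments can be sketched with a short simulation. Here two completely unrelated opinions are measured with the same method; the reliability, validity, and method coefficients follow the notation used later in this chapter, but their values are invented for illustration.

import numpy as np

rng = np.random.default_rng(7)
n = 500_000

# Invented standardized coefficients: reliability r, validity v, method effect m,
# with v**2 + m**2 = 1 for each question.
r1, r2 = 0.85, 0.80
m1, m2 = 0.30, 0.30
v1, v2 = np.sqrt(1 - m1**2), np.sqrt(1 - m2**2)

# Two unrelated traits, one method factor shared by both questions, random errors.
T1, T2, M = rng.normal(size=(3, n))
e1, e2 = rng.normal(size=(2, n))

y1 = r1 * (v1 * T1 + m1 * M) + np.sqrt(1 - r1**2) * e1
y2 = r2 * (v2 * T2 + m2 * M) + np.sqrt(1 - r2**2) * e2

print(round(np.corrcoef(y1, y2)[0, 1], 3))  # observed correlation of unrelated items
print(round(r1 * m1 * r2 * m2, 3))          # spurious part implied by the shared method

Crossing three traits with three methods, as in a full MTMM experiment, supplies enough observed correlations for a structural equation model to estimate these coefficients rather than assume them.
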
Already in 1984, Frank Andrews (1935–​1992) suggested performing not just one, but
several MTMM experiments on survey question format and summarized the results by
comparing the quality of questions in different formats with each other. Over a period
of several decades, this idea was subsequently expanded and improved upon by Saris
and his colleagues (Saris and Andrews 1991; Költringer 1995; Scherpenzeel 1995; Oberski
et al. 2004; Saris and Gallhofer 2007a, 2007b; Saris et al. 2010, 2012; Révilla et al. 2013).
They performed hundreds of MTMM experiments, obtaining estimates of the reliability
and method variance of thousands of survey questions. These efforts led to a large data-
base of 3,483 questions—​among them the “job variety” questions shown in figure 6.1—​
on which approximately sixty characteristics that are thought to affect question quality
in the literature have been coded. Most of these characteristics are shown in figure 6.2.
Not all issues are included, such as the usage of double-​barrelled requests or negative
formulations. However, many issues found in the literature are addressed in this coding
scheme (see Saris and Gallhofer 2007b for more information on the coding scheme and
its development).

Coding Question Characteristics


The questions were coded by two experts as well as a group of trained coders at the
Pompeu Fabra University, Spain. The codes for questions in languages unfamiliar to the
experts were compared to those for the English versions of the questionnaires, and any
differences were reconciled. The resulting database of questions with their codes was
cleaned and merged with a database of estimates of the reliability and common method
variance from MTMM experiments. In these experiments, each respondent answered
two different versions of the same question, with about an hour of interview time in
between—​for example, versions A  and B from figure 6.1. The same respondent also
answered different questions in these same versions A and B—​for example, on satisfac-
tion with wages and health and safety. By combining the answers to different opinion
questions asked in the same way with different methods of asking about the same
opinion, confirmatory factor analysis can be used to separate the effects of the opinion
(reliability) from those of the method (common method variance). (Sometimes the
complement of common method variance is called “validity” in the MTMM literature.
I avoid that term here to prevent confusion with other, perhaps more familiar uses of
that term.) The end result was a large database of questions with two pieces of informa-
tion: the MTMM reliability and common method variance, and the characteristics of
these questions that might predict the reliability and method variance.

Predicting Quality from Characteristics


Machine learning techniques were then applied to predict the MTMM reliability and
method variance of a question from its characteristics. By using random forests of re-
gression trees (Breiman 2001), 65% of the variance in reliability across the questions and
84% of the variance in the common method variance could be explained in questions
that were in the “testing sample”—​that is, not used in the estimation of the model.
Figure 6.3 shows an example of one regression tree. The “leaves” of this tree can be
followed downward, according to the characteristics of the question, to come to a predic-
tion of the reliability (shown in logits). For example, the leaf that is second from the left
shows that a question on health issues (domain = 3) that uses a gradation in the question
(“how much,” “to which extent”) is predicted to have a reliability of invlogit(1.198) = 0.768,
or about 80% reliability. There were seventy-​two such questions in this training sample.
These regression trees are, however, prone to overfitting. A random forest therefore ran-
domly samples cases to be in either the training or testing sample. Furthermore, many
variables may be strongly collinear (confounded) with one another. To counter this, the
algorithm samples a random subset of the characteristics as well. This doubly random
sampling is performed fifteen hundred times, and a regression tree is learned on each of
the training sets. Combining the fifteen hundred predictions obtained from each of the

[Figure 6.3 shows an example regression tree for the reliability coefficient. The root node (logit 1.955, n = 1,988) splits first on the question's domain; further splits use the concept, the use of gradation, the position in the questionnaire, and the number of categories, ending in leaves that give predicted reliabilities in logits, including the leaf at 1.198 (n = 72) discussed in the text.]

Figure 6.3 Example of a regression tree predicting the reliability of a question from a selection of its characteristics. The random forest consists of 1,500 such trees.

trees by taking their average then yields the final prediction from the forest. The same
procedure was applied to predict the common method variance.
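
The prediction step can be mimicked with standard machine learning libraries. The following scikit-learn sketch uses a made-up coded-question table and a fabricated outcome; the column names and coefficients are hypothetical stand-ins, not the actual SQP database or code.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Hypothetical coded characteristics, one row per question.
X = pd.DataFrame({
    "domain": rng.integers(1, 15, n),
    "gradation": rng.integers(0, 2, n),
    "ncategories": rng.integers(2, 12, n),
    "position": rng.integers(1, 500, n),
    "agree_disagree": rng.integers(0, 2, n),
})

# Fabricated outcome: the estimated reliability coefficient on the logit scale.
y = (1.5 + 0.10 * X["ncategories"] - 0.40 * X["agree_disagree"]
     - 0.001 * X["position"] + rng.normal(0, 0.3, n))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
forest = RandomForestRegressor(n_estimators=1500, random_state=0)
forest.fit(X_train, y_train)

print("R^2 on held-out questions:", round(forest.score(X_test, y_test), 2))

# Translate a predicted logit back into a reliability coefficient, as in figure 6.3.
pred = forest.predict(X_test.iloc[[0]])[0]
print("predicted reliability coefficient:", round(1 / (1 + np.exp(-pred)), 2))
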
The random forest yields a method that can predict the reliability and method variance
of a question from its characteristics. However, following the procedure described here
will be a tedious task for a survey researcher. This is why the results of the meta-​analysis
have been included in an online tool that is free to use. The following section describes
this tool, developed to allow researchers to code their question characteristics and obtain
a prediction from the random forest about the question’s reliability and common method
variance.

Using the Results of the Meta-analysis to Guide Question Design Using the SQP 2.0

The SQP 2.0 (http://​sqp.upf.edu/​) is an online web application that is free to use. Its
goals are to

• allow survey researchers to code their questions in the coding system of Saris
and Gallhofer (2007a), becoming aware of the many choices made in designing a
question;
• predict from the meta-​analysis the reliability and common method variance of
the survey question, so that the researcher can get an idea of the adequacy of the
question for the research purpose; and
• tentatively suggest improvements based on the meta-​analysis.

It does not

• estimate average bias in the question, for example due to social desirability;
• predict other measures of a question’s quality, such as the appropriateness of the
question for the research topic or the number of missing responses;
• include every possible characteristic of a question—​although it does include many
of them;
• provide information about cause and effect, since changing characteristics may not
always result in the predicted improvement; or
• give highly accurate predictions for questions about behaviors and facts. The main
focus has been questions on opinions, feelings, evaluations, and so forth.

A final caveat is that SQP has not been tested extensively on questions in web
surveys, although research suggests that web and other self-​administration modes
do not differ in reliability and method variance (Révilla 2012a, 2012b; Révilla and
Saris 2012), so that the predictions using self-​administration as the mode may be rea-
sonably adequate.

In spite of these limitations, SQP can be a very useful tool for survey designers. To
demonstrate the working of the program, I have coded version A of the “job variety”
question into the system.
The first step is to enter the question text itself into the system. Figure 6.A.1 in the
chapter appendix shows that this text is split up into three parts: the introduction, “re-
quest for an answer,” and answer scale. Each of these choices is explained on the page
itself. As the name implies, the request for an answer refers to the request itself, while
the introduction is any leading text, such as “now for some questions about your health.”
After entering the question text, the coding system appears, as shown in figure 6.A.2 in
the chapter appendix. Clicking the “Begin coding” button begins the coding process.
As figure 6.4 demonstrates, the characteristic will appear on the left while coding,
together with an explanation of it. The user then chooses a value, which is subsequently
displayed on the right and can be amended at any time. Where possible, some charac-
teristics are coded automatically. For questions asked in English and a few other lan-
guages, for example, natural language processing (part-​of-​speech tagging) is applied
automatically to the texts to count the number of nouns and syllables, as figure 6.A.3 in
the chapter appendix shows. The full list of choices made for this question is provided in
the chapter appendix.
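
Such automatic coding can be approximated with off-the-shelf natural language processing tools. The sketch below uses NLTK's tokenizer and part-of-speech tagger together with a crude vowel-group heuristic for syllables; it is not the procedure SQP itself uses, and it assumes the usual NLTK resources (the "punkt" tokenizer and the English tagger) have been downloaded.

import re
import nltk  # assumes the "punkt" and English POS tagger resources are installed

def count_syllables(word):
    # Crude heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def wording_features(text):
    words = [t for t in nltk.word_tokenize(text) if t.isalpha()]
    tags = nltk.pos_tag(words)
    return {
        "n_words": len(words),
        "n_nouns": sum(tag.startswith("NN") for _, tag in tags),
        "avg_syllables_per_word": round(sum(map(count_syllables, words)) / len(words), 2),
    }

print(wording_features(
    "Please choose one of the following to describe how varied your work is."
))
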
After finishing the coding process, some predictions are shown with their uncer-
tainty. The reliability coefficient, “validity coefficient” (complement of the method

Figure  6.4  Coding the characteristics of the questions in the system. More information on
their precise meaning is given with each characteristic.

effect), and their product, the “quality coefficient” (Saris and Gallhofer 2007a), are
shown (as in figure 6.5). The quality coefficient squared indicates the proportion of
variance in the answers to the questions that we can expect to be due to the person’s
true opinion. The reliability coefficient of 0.8 in figure 6.5 suggests that any true
correlations the answers to this question might have with other variables will be at-
tenuated (multiplied) by 0.8. This includes relationships over time, so that any time
series of this variable will jitter up and down randomly by at least 20% more than it does
in reality. A "validity coefficient" of 0.99 indicates that two questions asked in this
same manner can be expected to correlate spuriously by a very small amount (this
spurious additional correlation can be calculated from the "validity" coefficient as
1 − 0.985² = 0.0298). Common method variance is therefore predicted not to be a great
concern with this question.
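
The arithmetic behind these interpretations can be verified directly from the rounded coefficients quoted above; the snippet simply restates the chapter's calculations in code.

reliability_coef = 0.80   # predicted reliability coefficient
validity_coef = 0.985     # predicted "validity" coefficient (complement of the method effect)

# Proportion of answer variance attributable to the true opinion.
quality = reliability_coef * validity_coef
print(round(quality**2, 2))               # about 0.62

# A true correlation of, say, 0.5 with another variable would appear attenuated.
print(round(0.5 * reliability_coef, 2))   # 0.4

# Spurious correlation expected between two unrelated questions sharing this method.
print(round(1 - validity_coef**2, 3))     # about 0.030
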
In an MTMM experiment performed in the European Social Survey, the relia-
bility coefficient of this particular question was also estimated directly from data.3
These estimates from an actual MTMM experiment can be compared to the SQP
predictions shown in figure 6.5. In this MTMM experiment the reliability coefficient
of this version of the question was estimated as 0.763 and the method effect as 0.038.
Both are close to the predictions of these numbers obtained with SQP.

Figure 6.5  When the coding is complete, a prediction of the MTMM reliability and “validity”
(complement of method effect) coefficients is given, together with the uncertainty about these
predictions.

Figure 6.6  SQP can look into its database of experiments to examine the differences in predic-
tion that would occur if one aspect of the question were changed. The above suggests that creating
numbers to correspond with the labels might help.

Finally, a tentative feature of SQP is that suggestions for potential improvement of the
question are given. This is done by examining the “what-​if ” prediction that would be
obtained from the random forest if one characteristic were coded differently. Figure 6.6
shows the suggestions made by SQP 2.0: if the phrasing were simpler, in the sense of using
fewer syllables per word and fewer words, the question would be predicted to have a higher
quality. It is difficult to see how the question’s phrasing (see figure 6.1), which is already
very simple, could be made even simpler. What could be changed is the “scale correspond-
ence.” This is the degree to which the numbers with which the answer options are labeled
correspond to the meaning of the labels. In version A of the question, the labels are not
numbered at all, so this correspondence has been coded as “low.” By introducing numbers
0, 1, 2, and 3 to go with the labels “not at all,” “a little,” “quite,” and “very,” the scale corre-
spondence could be coded as “high” and the predicted quality would improve somewhat.
This process could in principle be repeated until the question is thought to be of “ac-
ceptable” quality or no further sensible improvements can be made. However, note that
there may be good reasons not to make a possible suggested improvement when such
an “improvement” does not make sense in the broader context of the questionnaire.
Furthermore, note that since the meta-​analysis does not directly address causality, there
is no guarantee that this improvement in quality after changing the question will actually
be realized. Addressing the causality of these changes remains a topic for future research.
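
In code, such a "what-if" suggestion amounts to two predictions from one fitted model that differ in a single coded characteristic. The fragment below continues the hypothetical scikit-learn forest sketched earlier (the objects forest and X_test come from that example); the characteristic flipped here is one of the invented columns, not SQP's actual "scale correspondence" code, and none of this is the SQP implementation.

import numpy as np
# forest and X_test are the hypothetical objects from the earlier sketch.

question = X_test.iloc[[0]].copy()
question["agree_disagree"] = 1       # suppose the question is coded as agree-disagree

what_if = question.copy()
what_if["agree_disagree"] = 0        # ...and we consider an item-specific reformulation

def invlogit(x):
    return 1 / (1 + np.exp(-x))

print("as coded:     ", round(invlogit(forest.predict(question)[0]), 2))
print("reformulated: ", round(invlogit(forest.predict(what_if)[0]), 2))

# As with SQP's suggestions, the comparison is descriptive: the forest summarizes
# associations in the meta-analysis, not the causal effect of rewording a question.
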
The SQP should be placed in the much wider context of questionnaire science. For ex-
ample, the meta-​analysis finds that complicated phrasings are bad for reliability, some-
thing that others have also suggested and found (see Graesser et al. 2006). But additional
explanations can also clarify meaning and narrow the range of possible interpretations
of a question, reducing error (Fowler 1992; Holbrook et al. 2006). This serves as a small
demonstration that much more work needs to be done to synthesize the literature than
could be achieved in this book chapter.

Conclusion

The quest continues. We are far from understanding everything about how to ask
the best possible questions, but can see that the only road to such knowledge is well-​
developed cognitive theory, careful empirical observation and experiment, and system-
atic synthesis of the body of knowledge. Steps on this road are taken in almost every
issue of journals such as Public Opinion Quarterly, Survey Research Methods, and Journal
of Survey Statistics and Methodology. Neither these individual steps, nor SQP, nor any
textbook can give the definitive final word on questionnaire science. But all of these can
help the researcher do better research, keeping in mind this chapter’s counsels:

• We make a bewildering array of choices every time we formulate a survey question.
• Our personal experience does not guarantee knowledge about the optimal choices.
• Experts often have good advice to offer, but are not exempt from the human ten-
dency to overgeneralize.
• What is considered “best practice” differs among people and organizations and
may not correspond to actual best practice as observed in experiments.

In conclusion: always ask for the evidence. There may be plenty of it, or there may be
little. Both cases offer an exciting chance to learn more about the science of surveys.

The Future
The year of this writing marks the two hundredth anniversary of the invention of a rev-
olutionary new human measurement instrument. In 1816 René Théophile Hyacinthe
Laennec, a young physician from a remote provincial town in France, found him-
self practicing in Paris. When a young Parisian lady entered his practice with heart
problems, the modest young doctor hesitated to put his ear directly on her breast, as
was the usual practice. Instead, he rolled a piece of paper into a cylinder, with which he
could hear his patient’s heartbeat “much more neatly and distinctly” than he ever had
before (Laennec 1819, 8–​9). This new measurement method, the stethoscope, replaced
the previous ones.
Today Laennec’s stethoscope remains ubiquitous. Newer methods, such as X-​rays
and magnetic resonance imaging (MRI), have not replaced it, but have complemented
it. After all, a measurement method that is practical, fast, and cost-​effective is hard to
replace. The survey question is such a method in the social sphere. It therefore seems
unlikely that newer measurement methods will fully replace the survey question in
the foreseeable future. However, survey researchers and other students of human
opinion and behavior should ponder the possible ways in which other measurements
can be used to complement surveys. Furthermore, as argued in this chapter, the survey
question still warrants improvement using modern methods of investigation. I briefly


elaborate on these two points below.
First, it is clear that the questionnaire is experiencing competition from other meas-
urement instruments, old and new. Implicit association tests (Greenwald et al. 1998), for
example, intend to measure prejudice with reaction times; functional MRI and other
brain imaging techniques show how the brain reacts to certain stimuli (Raichle and
Mintun 2006); genome-​wide genetic sequencing has become feasible (Visscher et al.
2012); and data from companies’ and governments’ administrative registers provide
some of the information we are after through record linkage (Wallgren and Wallgren
2007). The use of everyday technology to measure human behavior is also becoming
more popular. Monitoring smartphone usage with an app may be a better measure
of smartphone usage than a questionnaire (Révilla et al. 2016); monitoring the global
positioning system in peoples’ cars may be a better measure of their movements during
the day (Cui et al. 2015); and Facebook (an online social network application from the
early twenty-​first century) “likes” strongly correlate with various personal characteris-
tics (Kosinski et al. 2013).
All of these other measurement instruments are sometimes touted as being more
“objective.” I personally believe that this is not a helpful way to think about measure-
ment (see also Couper 2013). As we have seen, answers to questions have their biases and
unreliabilities. But so do fMRI (Ramsey et al. 2010), genome-​wide association studies
(Visscher et  al. 2012), administrative registers (Groen 2012; Bakker and Daas 2012;
Kreuter and Peng 2014), and “big data” such as Facebook posts or monitoring studies
(Manovich 2011; Fan et al. 2014). Furthermore, validity is often an issue with such meas-
ures: What if we were not interested in the person’s movements and Internet use, but in
their political opinions, their desire to have children, or the people they fall in love with?
A more helpful way of thinking about these other instruments is as attempting to
measure the same things that survey questions intend to measure. Which is the best way
of doing that, or whether perhaps several ways should be combined to obtain the best
picture, is then an empirical matter that pertains to a particular research question. For
example, Révilla et al. (2016) claimed that smartphone monitoring is better for meas-
uring the amount of Internet usage on a person’s phone—​no more, no less. Scientific
experiments should then be used in the same way that we have been using them to
look at the quality of survey measures alone. In short, no single measurement method
is perfect. Instead, social researchers would do well to take a page from the medical
practitioners’ book and use a variety of measurement methods, old and new, cheap and
expensive, and more or less reliable, valid, and comparable (Oberski 2012), to zero in on
the phenomenon being studied.
Aside from the inevitable opportunities and challenges afforded by the combination
of surveys with other types of data, the survey question itself still warrants considerable
improvement. This has been the topic of the current chapter, and SQP is discussed as
one attempt at such an improvement. However, this attempt is of necessity limited in
scope and application. First, it has been applied only to a subset of questions, to specific
groups of people, in a subset of countries, languages, and settings, during a particular
time period. Second, it is only as good as the method used to measure the quality of
survey questions, the MTMM experiment in this case. Third, it accounts for only certain
aspects of the survey process and question characteristics. While the SQP project made
every effort to widen its scope in each of these aspects and does so over an impressive
range of countries, settings, questions, and so forth, no project can cover every conceiv-
able angle. Therefore, I see SQP’s general philosophy, contributed by its fathers Frank
Andrews and Willem Saris, as one of its most important contributions to the future of
social research: that social measurement can be investigated scientifically.
In my ideal future, the Andrews-​Saris approach to social research would become standard
across the social sciences. Any way of measuring opinions, behavior, or characteristics of
people would be studied by experiment and the experiments summarized by meta-​analyses
that would be used to determine the best way to move forward. An example of a recent meta-​
analysis relating to nonresponse rather than measurement error is Medway and Fulton
(2012). To ensure that such meta-​analyses afford an appropriate picture of scientific evi-
dence, we would also take into account lessons about the appropriate way to conduct science
that are being learned in the emerging field of “meta-​research.”4 In particular, in addition to
all the usual considerations for conducting good research, all conducted experiments should
be published (Ioannidis 2005), and preferably preregistered (Wagenmakers et  al. 2012),
conducted collaboratively (“copiloted”; Wicherts 2011), and fully open and reproducible
(Peng 2011). When we all join in this effort, questionnaire science in particular, and the inves-
tigation of human opinion and behavior in general, will make a huge leap forward.

Notes
1. http://​en.wikipedia.org/​wiki/​List_​of_​U.S._​states_​by_​educational_​attainment.
2. In terms of the coding scheme used in this section, these are direct question (C) vs. other
(A); use of a WH word (“how”); complexity of the request (A has more words and more
syllables per word); interviewer instruction (C); labels are numbers (C)  vs. boxes (A);
presence of a “don’t know” category. There may be more.
3. Program input and output for the MTMM analysis can be found at http://​github.com/​daob/​
ess-​research/​blob/​master/​input/​mplus/​Job/​jobmtmm.out.
4. See, e.g., http://​metrics.stanford.edu/​ and http://​www.bitss.org/​.

References
Alwin, D. 2007. Margins of Error: A Study of Reliability in Survey Measurement. New York:
Wiley-​Interscience.
Alwin, D. 2011. “Evaluating the Reliability and Validity of Survey Interview Data Using the
MTMM Approach.” In Question Evaluation Methods: Contributing to the Science of Data
Quality, edited by J. Madans, K. Miller, A. Maitland, and G. Willis, 263–​293. New York: Wiley
Online Library.
Alwin, D. F., and J. A. Krosnick. 1991. “The Reliability of Survey Attitude Measurement: The
Influence of Question and Respondent Attributes.” Sociological Methods & Research 20
(1): 139–​181.
Andrews, F. 1984. “Construct Validity and Error Components of Survey Measures: A Structural
Modeling Approach.” Public Opinion Quarterly 48 (2): 409–​442.
Bakker, B. F., and P. J. Daas. 2012. “Methodological Challenges of Register-​Based Research.”
Statistica Neerlandica 66 (1): 2–​7.
Bradburn, N. M., B. Wansink, and S. Sudman. 2004. Asking Questions: The Definitive Guide
to Questionnaire Design—​ for Market Research, Political Polls, and Social and Health
Questionnaires. Rev. ed. San Francisco: Jossey-​Bass.
Breiman, L. 2001. “Random Forests.” Machine Learning 45 (1): 5–​32.
Campbell, D., and D. Fiske. 1959. “Convergent and Discriminant Validation by the Multitrait-​
Multimethod Matrix.” Psychological Bulletin 56: 81–​105.
Carroll, R., D. Ruppert, L. Stefanski, and C. Crainiceanu. 2006. Measurement Error in Nonlinear
Models: A Modern Perspective. CRC Monographs on Statistics & Applied Probability, vol.
105. Boca Raton, FL: Chapman & Hall.
Cernat, A., and D. L. Oberski. 2017. “Extending the Within-​persons Experimental Design: The
Multitrait-​Multierror (MTME) Approach.” In Experimental Methods in Survey Research, ed-
ited by P. J. Lavrakas. New York: John Wiley & Sons.
Couper, M. P. 2013. “Is the Sky Falling? New Technology, Changing Media, and the Future of
Surveys.” Survey Research Methods 7: 145–​156.
Cui, J., F. Liu, J. Hu, D. Janssens, G. Wets, and M. Cools. 2015. “Identifying Mismatch between
Urban Travel Demand and Transport Network Services Using GPS Data: A Case Study in
the Fast Growing Chinese City of Harbin.” Neurocomputing 181: 4–​18.
Dijkstra, W., and J. H. Smit. 1999. Onderzoek met vragenlijsten:  Een praktische handleiding
[Survey research: A practical guide]. Amsterdam: VU University Press.
Dillman, D. A. 2011. Mail and Internet Surveys: The Tailored Design Method—​2007 Update with
New Internet, Visual, and Mixed-​Mode Guide. New York: John Wiley & Sons.
Fan, J., F. Han, and H. Liu. 2014. “Challenges of Big Data Analysis.” National Science Review 1
(2): 293–​314.
Fink, A. 2009. How to Conduct Surveys: A Step-​by-​Step Guide. 4th ed. Los Angeles: Sage.
Folz, D. H. 1996. Survey Research for Public Administration. Los Angeles: Sage.
Fowler, F. J. 1992. “How Unclear Terms Affect Survey Data.” Public Opinion Quarterly 56
(2): 218–​231.
Fowler, F. J. 2014. Survey Research Methods. Los Angeles: Sage.
Fuller, W. 1987. Measurement Error Models. New York: John Wiley & Sons.
Graesser, A. C., Z. Cai, M. M. Louwerse, and F. Daniel. 2006. “Question Understanding Aid
(Quaid) a Web Facility That Tests Question Comprehensibility.” Public Opinion Quarterly
70 (1): 3–​22.
Greenwald, A. G., D. E. McGhee, and J. L. Schwartz. 1998. “Measuring Individual Differences
in Implicit Cognition:  The Implicit Association Test.” Journal of Personality and Social
Psychology 74 (6): 1464.
Groen, J. A. 2012. “Sources of Error in Survey and Administrative Data: The Importance of
Reporting Procedures.” Journal of Official Statistics (JOS) 28 (2): 173–​198.
Hagenaars, J. A. P. 1990. Categorical Longitudinal Data: Log-​Linear Panel, Trend, and Cohort
Analysis. Newbury Park, CA: Sage.
Heise, D., and G. Bohrnstedt. 1970. “Validity, Invalidity, and Reliability.” Sociological
Methodology 2: 104–​129.
Holbrook, A., Y. I. Cho, and T. Johnson. 2006. “The Impact of Question and Respondent
Characteristics on Comprehension and Mapping Difficulties.” Public Opinion Quarterly 70
(4): 565–​595.
Ioannidis, J. P. 2005. “Why Most Published Research Findings Are False.” PLOS Medicine 2
(8): e124.
Költringer, R. 1995. “Measurement Quality in Austrian Personal Interview Surveys.” In The
Multitrait-​Multimethod Approach to Evaluate Measurement Instruments, edited by W. Saris
and A. Münnich, 207–​225. Budapest: Eötvös University Press.
Kosinski, M., D. Stillwell, and T. Graepel. 2013. “Private Traits and Attributes Are Predictable
from Digital Records of Human Behavior.” Proceedings of the National Academy of Sciences
110 (15): 5802–​5805.
Kreuter, F., and R. D. Peng. 2014. “Extracting Information from Big Data: Issues of Measurement, Inference and Linkage.” In Privacy, Big Data, and the Public Good: Frameworks for Engagement, edited by J. Lane, V. Stodden, S. Bender, and H. Nissenbaum, 257. Cambridge: Cambridge University Press.
Krosnick, J. 2009. “The End of Agree/​Disagree Rating Scales: Acquiescence Bias and Other
Flaws Suggest a Popular Measurement Method Should Be Abandoned.” European Survey
Research Association 2009 Conference, Warsaw, Poland.
Krosnick, J., and L. Fabrigrar. 2001. Designing Questionnaires to Measure Attitudes.
Oxford: Oxford University Press.
Laennec, R. T. H. 1819. Traité de l’auscultation médiate, et des maladies des poumons et du coeur,
vol. 1. Paris: J.-​A. Brosson et J.-​S. Chaudé libraires.
Likert, R. 1932. “A Technique for the Measurement of Attitudes.” Archives of Psychology 22: 55.
Lord, F. M., and M. R. Novick. 1968. Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.
Madans, J., K. Miller, A. Maitland, and G. Willis. 2011. Question Evaluation
Methods: Contributing to the Science of Data Quality. New York: Wiley.
Manovich, L. 2011. “Trending: The Promises and the Challenges of Big Social Data.” Debates in
the Digital Humanities 2: 460–​475.
Medway, R. L., and J. Fulton. 2012. “When More Gets You Less: A Meta-​analysis of the Effect
of Concurrent Web Options on Mail Survey Response Rates.” Public Opinion Quarterly 76
(4): 733–​746.
Narayan, S., and J. A. Krosnick. 1996. “Education Moderates Some Response Effects in Attitude
Measurement.” Public Opinion Quarterly 60 (1): 58–​88.
Netemeyer, R. G., K. L. Haws, and W. O. Bearden. 2011. Handbook of Marketing Scales: Multi-​
Item Measures for Marketing and Consumer Behavior Research. 3rd ed. Los Angeles: Sage.
Oberski, D. 2012. “Comparability of Survey Measurements.” In Handbook of Survey Methodology
for the Social Sciences, edited by L. Gideon, 477–​498. New York: Springer-​Verlag.
Oberski, D., W. E. Saris, and S. Kuipers. 2004. “SQP:  Survey Quality Predictor.” Computer
software.
Payne, S. L. 1951. The Art of Asking Questions. Oxford, UK: Princeton University Press.
Peng, R. D. 2011. “Reproducible Research in Computational Science.” Science 334 (6060): 1226.
Raichle, M. E., and M. A. Mintun. 2006. “Brain Work and Brain Imaging.” Annual Review of
Neuroscience 29: 449–​476.
Ramsey, J. D., S. J. Hanson, C. Hanson, Y. O. Halchenko, R. A. Poldrack, and C. Glymour. 2010.
“Six Problems for Causal Inference from fMRI.” Neuroimage 49 (2): 1545–​1558.
Révilla, M., C. Ochoa, and G. Loewe. 2016. “Using Passive Data from a Meter to Complement
Survey Data in Order to Study Online Behavior.” Social Science Computer Review.
doi: 10.1177/​0894439316638457.
Révilla, M. A. 2012a. “Impact of the Mode of Data Collection on the Quality of Answers to
Survey Questions Depending on Respondent Characteristics.” Bulletin de Méthodologie
Sociologique 116: 44–​60.
Révilla, M. A. 2012b. “Measurement Invariance and Quality of Composite Scores in a Face-​to-​
Face and a Web Survey.” Survey Research Methods 7 (1): 17–​28.
Révilla, M. A., and W. E. Saris. 2012. “A Comparison of the Quality of Questions in a Face-​to-​
Face and a Web Survey.” International Journal of Public Opinion Research 25 (2): 242–​253.
Révilla, M. A., W. E. Saris, and J. A. Krosnick. 2013. “Choosing the Number of Categories in
Agree–Disagree Scales.” Sociological Methods & Research 43 (1): 73–97.
Saris, W. E., and F. M. Andrews. 1991. “Evaluation of Measurement Instruments Using a
Structural Modeling Approach.” In Measurement Errors in Surveys, edited by P. Biemer, R.
Groves, L. Lyberg, N. Mathiowetz, and S. Sudman, 575–​599. New York: John Wiley & Sons.
Saris, W., and I. N. Gallhofer. 2007a. Design, Evaluation, and Analysis of Questionnaires for
Survey Research. New York: Wiley-​Interscience.
Saris, W. E., and I. Gallhofer. 2007b. “Estimation of the Effects of Measurement Characteristics
on the Quality of Survey Questions.” Survey Research Methods 1: 29–​43.
Saris, W. E., J. A. Krosnick, and E. M. Shaeffer. 2010. “Comparing Questions with Agree/​
Disagree Response Options to Questions with Item-​Specific Response Options.” Survey
Research Methods 4 (1): 61–​79.
Saris, W. E., D. L. Oberski, M. Révilla, D. Z. Rojas, L. Lilleoja, I. Gallhofer, and T. Gruner.
2012. “Final Report about the Project JRA3 as Part of ESS Infrastructure (SQP 2002-​2011).”
Technical report, RECSM. Barcelona, Spain: Universitat Pompeu Fabra.
Scherpenzeel, A. 1995. A Question of Quality:  Evaluating Survey Questions by Multitrait-​
Multimethod Studies. Amsterdam: Royal PTT Nederland NV.
Schuman, H., and S. Presser. 1981. Questions and Answers in Attitude Surveys: Experiments on
Question Form, Wording, and Context. Thousand Oaks, CA: Sage.
Selznick, G. J., and S. Steinberg. 1969. The Tenacity of Prejudice: Anti-​Semitism in Contemporary
America. Oxford, UK: Harper & Row.
Tourangeau, R., L. Rips, and K. Rasinski. 2000. The Psychology of Survey Response. Cambridge,
UK: Cambridge University Press.
Trabasso, T., H. Rollins, and E. Shaughnessy. 1971. “Storage and Verification Stages in
Processing Concepts.” Cognitive Psychology 2 (3): 239–​289.
Visscher, P. M., M. A. Brown, M. I. McCarthy, and J. Yang. 2012. “Five Years of GWAS
Discovery.” American Journal of Human Genetics 90 (1): 7–​24.
Wagenmakers, E.-​J., R. Wetzels, D. Borsboom, H. L. van der Maas, and R. A. Kievit. 2012.
“An Agenda for Purely Confirmatory Research.” Perspectives on Psychological Science 7
(6): 632–​638.
Wallgren, A., and B. Wallgren. 2007. Register-​Based Statistics: Administrative Data for Statistical
Purposes. New York: Wiley.
Wicherts, J. M. 2011. “Psychology Must Learn a Lesson from Fraud Case.” Nature 480: 7.
Wiley, D., and J. A. Wiley. 1970. “The Estimation of Measurement Error in Panel Data.”
American Sociological Review 35 (1): 112–​117.
Appendix

Full List of Choices Made in SQP 2.0


The following chart contains the full list of choices I  made for the characteristics of
the “job variety” question in figure 6.1 using SQP 2.0 (http://​sqp.upf.edu/​). Further
explanations about the precise meaning of these codes can be found while coding on the
website as well as in Saris and Gallhofer (2007a).

SQP Screenshots

Characteristic | Choice | Code
Domain | Work | 7
Domain: work | Other | 11
Concept | Evaluative belief | 1
Social desirability | A bit | 1
Centrality | A bit central | 1
Reference period | Present | 2
Formulation of the request for an answer: basic choice | Indirect requests | 1
WH word used in the request | WH word used | 1
“WH” word | How (quantity) | 9
Request for an answer type | Imperative | 2
Use of gradation | Gradation used | 1
Balance of the request | Unbalanced | 1
Presence of encouragement to answer | No particular encouragement present | 0
Emphasis on subjective opinion in request | No emphasis on opinion present | 0
Information about the opinion of other people | No information about opinions of others | 0
Use of stimulus or statement in the request | No stimulus or statement | 0
Absolute or comparative judgment | An absolute judgment | 0
Response scale: basic choice | Categories | 0
Number of categories | 4 | 4
Labels of categories | Fully labeled | 3
Labels with long or short text | Short text | 0
Order of the labels | First label negative or not applicable | 1
Correspondence between labels and numbers of the scale | Low correspondence | 3
Theoretical range of the scale bipolar/unipolar | Theoretically unipolar | 0
Number of fixed reference points | 0 | 0
Don’t know option | DK option not present | 3
Interviewer instruction | Absent | 0
Respondent instruction | Present | 1
Extra motivation, info, or definition available? | Absent | 0
Introduction available? | Available | 1
Number of sentences in introduction | 1 | 1
Number of words in introduction | 9 | 9
Number of subordinated clauses in introduction | 0 | 0
Request present in the introduction | Request not present | 0
Number of sentences in the request | 1 | 1
Number of words in request | 13 | 13
Total number of nouns in request for an answer | 2 | 2
Total number of abstract nouns in request for an answer | 1 | 1
Total number of syllables in request | 17 | 17
Number of subordinate clauses in request | 0 | 0
Number of syllables in answer scale | 16 | 16
Total number of nouns in answer scale | 0 | 0
Total number of abstract nouns in answer scale | 0 | 0
Show card used | Showcard not used | 0
Computer assisted | Yes | 1
Interviewer | Yes | 1
Visual presentation | Oral | 0
Position | 50 | 50
Figure 6.A.1  Entering the “job variety” question into the SQP system.

Figure 6.A.2  The SQP opening screen to begin coding the question.


Figure 6.A.3  Some characteristics, such as the number of nouns and syllables, are detected au-
tomatically using natural language processing techniques. Others must be coded by hand.
Part II

DATA COLLECTION
Chapter 7

Exit Polling Today and What the Future May Hold

Anthony M. Salvanto

Imagine a hypothetical election night. You tune in to the news broadcast to get the
results, and in the span of the next few minutes you see all that an exit poll can provide. It
is a sequence that will repeat time and again over the next few hours, as it does—​and as it
has—​in some form, across many networks and many such nights.
A network’s Decision Desk is ready to project the winner in a key race, and anchors
deliver that breaking news to viewers. The polls have closed, though not all the votes
have been counted yet, but the exit poll has collected enough data that the analysts can
be confident who will win. As the topic turns to why it happened—​what was on voters’
minds—​“we talked to voters” is the authoritative introduction, and the data might
show, for example, how voters were concerned about the economy and how that con-
cern is driving the vote results you see. Then more perspective is provided. Perhaps this
electorate is much older than in previous years, or much younger, and we see that by
comparing exit polls then and now; this is key historical context. All of this analysis is
fueled by exit poll data.
It is all in a day’s work—​a long night’s work, too, really—​for the exit poll, which is not
just among the most visible pieces of research used anywhere, but is perhaps the ulti-
mate multitasker of the polling world. And it is unique, the only operation of its kind
undertaken across the United States each cycle, though its users offer many different and
valuable interpretations of the data.
This chapter considers exit polls from a researcher’s perspective, pointing out how
they compare in terms of operation and sampling to more conventional pre-​election
polling and speculating about what exit polling in the United States might look like in
the future. Taking it as a research study in itself, we consider how it might adapt over
time, in the context of the explosion in new data sources, lists, and new technologies;
importantly, we account for changes in the way Americans go to the polls, which is
increasingly not on Election Day at all, but in the days or weeks before, or by mail or ab-
sentee ballot.

The Roles of the Exit Poll

First let us review the exit poll’s more prominent roles and how it fits in amid the various
types of other valuable polls and election studies. We see that exit polls serve at least five
major, and in many ways distinctive, functions.
First among these is unmatched timeliness, as exit polls continually report and
update their results as Election Day unfolds. This is critical for the news media cov-
ering the event, because everyone wants to know what is happening as soon as pos-
sible. As Mitofsky and Edelman (2002) describe their designs in the early days of exit
polling, “As each new precinct in a state reported its vote we were going to make a new
estimate.”
In the second function, adding to their real-​time value, exit polls’ design gives analysts
the potential to project final results much sooner after poll closing than the counted
vote would usually permit, and to do so with a high degree of statistical confidence.
Importantly, such projections require a great deal of additional modeling and analysis
of the data, but the exit poll is designed to facilitate that modeling and analysis through
its sampling approach, reporting structure, and large scale. We discuss this more below.
In the third function, unlike many other studies, this is not entirely about likely voters
(i.e., those who tell pollsters they plan to vote), but rather about voters interviewed in
person right at the polling place and asked questions with little time between when they
voted and when they get the questionnaire. (There are some important exceptions to
this, which we discuss in detail below.) From a survey research point of view, this adds
confidence to the measurements; from the editorial point of view, it adds equity as we
describe the findings. “We talked to voters” means we went right to the source, which is
a nice thing for both reporters and survey researchers.
The fourth function has to do with the polls’ enormous size and scope. In 2016’s
General Election, exit pollsters conducted more than 110,000 interviews with voters and
processed more than 70,000 questionnaires in national and state surveys in November,
including over 16,000 telephone interviews with absentee and early voters. In the
primaries more than 100,000 questionnaires were processed.
Compare that, for example, to a typical national survey, which might involve a thou-
sand or so interviews. Exit polls are a lot larger in sample size than conventional polls,
not just to help with accuracy and projections, but also to help us explore subgroups
with robust findings. If you want to know how independents voted, a lot of polls can
estimate that, because independents are a sizable portion of voters. But if you want to
know, with confidence, how independents who were also conservative voted, you are
now starting to break out smaller and smaller pieces of the electorate and need large
total samples to analyze. And, yes, the exit polls need to help Decision Desks make a
series of estimates to cover the night, which in the United States is not a single national
vote but simultaneous ones in states and districts.
In the fifth function (which really follows directly from the rest), exit polls become
one important measure of record, the most robust and comprehensive study of voters’
voices for the election and a valuable go-​to reference for analysis in the days and years
afterward. Later, when everything has been counted and people want to know what the
election “means” and what happens next for governance, the exit poll results offer em-
piricism amid what might otherwise be just conjecture or spin. The continuity of the
exit poll, its similar methodology, and its comparable questions each Election Day allow
it to offer historical context as well, which is often a key part of understanding what
elections mean.
But there are both challenges and possibilities for the future of exit polling as a re-
search study—​for collecting data about voters, from voters, as fast and accurately as pos-
sible up to and including Election Day. Voters are changing the way they vote—​earlier,
and in more ways now than ever before—​while at the same time improvements in com-
puter and database files offer more data on voters than ever before.

Design and Methods

This section includes a short primer on how U.S.  exit polls used by the major news
networks are designed, so that the reader can consider them in the context of other
forms of voter surveys and data collection. Importantly, what people often call the “exit
poll” as currently constructed might just as well be called the “exit operation,” because
it involves more than just a survey of voters. It includes a very large-​scale field and tabu-
lation effort to process it in real time and collects reported votes from precinct officials
along with responses to questionnaires.
The design of an exit poll begins with the process of sampling voting locations, with
the aim of selecting a representative sample of places to send the interviewers, so the
sampling frame—​that is, the list of things we sample—​for any given state exit poll is a
list of all of a given state’s voter precincts. Compiling this list requires advance research,
and this process is different than for a conventional telephone poll, which often begins
with a sample of phone numbers and from there might randomly select people within
households (or perhaps just the person on the end of the line if it is a cell phone with
one user).
A recent past race is selected, and the precinct list is assembled to reflect the state’s
precincts as they existed at that time. Wherever possible, the past race used for the past
vote data in the precinct is an analogous and recent contest. Prior to sampling, the state
and precincts are stratified by geographic region based on county—​usually four or five
strata depending on the size of the state—​and ordered such that the range of precinct-​
level partisan vote share in that past race—​Republican versus Democratic and vice
versa—​will also be represented in the subsequent sample. (This and further discussion
of sampling can be found in Mitofsky and Edelman 2002; Mitofsky and Edelman 1995;
Mitofsky 1991; Edelman and Merkle 1995; Merkle and Edelman 2000; Merkle and
Edelman 2002.)
In the hypothetical case that these precincts behave differently than they have in the
past, the exit poll should pick that up; it reports the vote as voters describe it and ul-
timately uses the current precinct counts. Moreover, model estimates can also com-
pare the current data to the past race and estimate the differences and where they are
occurring, which can be useful as well. It is a statewide sample; there are no “key” or
“bellwether” precincts on which the sample hinges.
A sample of precincts is drawn such that the chance of inclusion for a precinct is pro-
portional to its number of voters. From this sample of precincts is then drawn the list of
places where reporters will collect vote counts from precinct officials (“reported vote”
precincts) at poll closing and in which interviewers will be stationed for voter interviews,
subsequently called “survey” or “interview” precincts. The national survey is designed
to estimate the national vote; its precincts are sampled such that all the precincts repre-
sent their proper proportion of the national vote.
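As a rough illustration of the general logic of probability-proportional-to-size selection within ordered strata (not the actual National Election Pool procedure), the following sketch draws a systematic PPS sample from an invented list of precincts sorted by stratum and by past Republican share:

import random

precincts = [
    # (precinct id, stratum, past total vote, past Republican share); invented
    ("P01", 1, 900, 0.41), ("P02", 1, 1200, 0.58), ("P03", 1, 700, 0.36),
    ("P04", 2, 1500, 0.62), ("P05", 2, 800, 0.47), ("P06", 2, 1100, 0.52),
    ("P07", 3, 600, 0.30), ("P08", 3, 1300, 0.55), ("P09", 3, 1000, 0.44),
]

def pps_systematic(precincts, n_sample, seed=1):
    # Sort by stratum, then by past partisan share, so one systematic pass
    # spreads the sample across regions and across the partisan range.
    ordered = sorted(precincts, key=lambda p: (p[1], p[3]))
    total = sum(p[2] for p in ordered)
    interval = total / n_sample
    start = random.Random(seed).uniform(0, interval)
    targets = [start + k * interval for k in range(n_sample)]
    chosen, cum = [], 0.0
    for p in ordered:
        lo, cum = cum, cum + p[2]        # this precinct covers the range [lo, cum)
        if any(lo <= t < cum for t in targets):
            chosen.append(p[0])
    return chosen

print(pps_systematic(precincts, n_sample=3))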
Sometimes commentators discuss “swing” and “bellwether” counties or precincts in
the course of a night that “indicate” which way a race will go, but those might be based
on their own editorial judgments or separate analyses. They are not part of the exit poll,
which is a probability sample. (For a discussion of older methods, including quota sam-
pling used in the late 1960s, see Mitofsky and Edelman 2002.)
For the state polls, precinct sample sizes typically vary from state to state and year
to year, depending on editorial coverage needs; in this regard the exit poll’s resource
allocations are much like other polling research we see during an election, as pollsters
invariably concentrate more on the states that are highly competitive, not just due to
editorial interest but also because competitiveness creates closer contests, which might
need larger samples to properly estimate. In the presidential voting year 2012, for ex-
ample, the most hotly contested “battleground” states had at least forty and often fifty
exit poll interviewing precincts and more than eighty reported vote precincts, including
those survey precincts.
On Election Day the interviewer at the polling place is tasked with subsampling
voters at the precinct. It would not be feasible to have the interviewer approach every
single voter who is exiting, as this would surely overload her or him or invariably result
in unaccounted misses. So the interviewer is given an interviewing rate to subsample
voters randomly, counting off every nth voter to approach; the rate is based on expected
turnout in the precinct, such that one can expect to get 100 to 125 completes per pre-
cinct for the day and may have to account for the fact that some physical locations host
voters from multiple precincts. The rate is computed based on the number of voters in
the past race for the precinct and designed to produce the desired number of interviews
for a precinct of that size. The rate can be adjusted from headquarters during the day if
needed, depending on turnout.
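A minimal illustration of how such an interviewing rate might be derived, using invented figures for expected turnout, target completes, and cooperation:

expected_turnout = 1400          # voters expected at this precinct today (invented)
target_completes = 110           # desired completed questionnaires
expected_cooperation = 0.55      # share of approached voters expected to participate

approaches_needed = target_completes / expected_cooperation
rate = max(1, round(expected_turnout / approaches_needed))
print(f"approach every {rate}th voter")          # every 7th voter in this example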
The interviewer will invariably miss some voters—​ perhaps because a selected
voter hurries by or heads off in another direction—​while others may directly refuse.
Interviewers record, based on their own best evaluation and training, the age range,
gender, and race of those voters who have refused or whom they have missed, and those
tallies are incorporated into adjustments made in the exit poll results, such that the
weighting of completed interviews within each age, race, and gender category accounts
for refusals and misses within those categories statewide. While this coding system,
like any, is expected to be subject to some error, it allows the poll to estimate something
about the noncompletes, which in turn allows important adjustments to be made in the
overall estimates. For example, if for some reason men were refusing to take the survey
at a significantly greater rate than women, weighting of the completed questionnaires
from men could adjust for that differential nonresponse when producing the estimate
of all voters.
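The sketch below illustrates the general form of such a cell-level nonresponse adjustment, using invented tallies; actual exit poll weighting is considerably more elaborate and also incorporates the vote-count information described later.

tallies = {
    # category: (completed questionnaires, refusals plus misses tallied); invented
    ("male", "18-29"):   (60, 45),
    ("male", "30+"):     (220, 140),
    ("female", "18-29"): (75, 40),
    ("female", "30+"):   (260, 120),
}

weights = {}
for cell, (completes, noncompletes) in tallies.items():
    # Each completed questionnaire stands in for itself plus that cell's share
    # of the voters who were missed or who refused.
    weights[cell] = (completes + noncompletes) / completes

for cell, w in sorted(weights.items()):
    print(cell, round(w, 2))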
Compare this, for example, to a random-​digit dial telephone survey, which is not al-
ways apt to know such things about the voter who does not pick up the phone but might
incorporate regional demographics into the final weighting. (Other kinds of samples,
like those drawn from a voter list, might know more.) We do not know the demo-
graphics of who turns out or not at the precincts on Election Day until Election Day, of
course (though we discuss this below, too). But that is something the exit poll is trying
to estimate.
The respondent is given a small two-​sided paper questionnaire to fill out, of which
there may be—​and often are—​different versions, so that more questions can be
administered across the survey. The questionnaires always ask about votes in the races
being covered and some basic demographics. The voter places the survey privately in a
box when finished.
In terms of training, many of the interviewers hired each year are people who have
worked on exit polling and similar style intercept interviewing before. Whether they
have or not, all interviewers go through training beforehand. They receive a full manual
with instructions, along with online training—​as this is an operation with interviewers
in just about every corner of the country—​including an interactive component and a
video. To successfully complete the training and be stationed at a polling place, the in-
terviewer has to complete the training course and correctly answer a series of questions
on the material. There is also a full rehearsal of the call-​in procedure using test data,
which takes place during the week before the real Election Day. Questionnaires are
available to voters in English or offered in Spanish-​language versions in states with over
15% Hispanic populations.
As thousands of voters fill out questionnaires and the election is in full swing, the exit
poll data begin flowing into centralized computations. This allows the team to monitor
the data and ensure quality control throughout the day, as any pollster would during
the field period, making sure interviewers are in place and collecting data at the correct
rate, and to monitor any issues that may have arisen with election officials at the polling
place, the permission for which is arranged in advance. Interviewers tabulate their
results, and at three points—​usually in the morning, once in mid-​or late afternoon, and
again near the very end of the voting period—​the results of the questionnaires from all
precincts for the state are reported by the interviewer to the call center via telephone.
This is a massive amount of data—Mitofsky (1991) called the logistics “staggering”—to be compiled.
After the second round of data collection the survey can begin to give an indication
of what is on voters’ minds and can help plan for later that night during the prime time
of election coverage, after polls have closed. Even then, the data are still preliminary, as
there is still another wave of voter interviews to come.
As in any poll, there are potential sources of error, and the researcher needs an
approach for estimating its size and direction. (For a general discussion see, e.g., Groves
et  al. 2002; Groves, Biemer, et  al. 1988; Brick 2011; Lepkowski, Tucker, et  al. 2008.)
Sampling issues can produce error but this can be quantified through statistical theory;
there is a possibility of differential response between demographic categories correlated
with vote or between supporters of the candidates (see Mitofsky 1991; Mitofsky
International and Edison Research 2004; Mitofsky and Edelman 2002; Blumenthal
2004; Best and Kruger 2012). In more conventional surveys, researchers might look at
the poll’s performance as data are collected to evaluate completions and refusals, and
also to reference outside data such as the census parameters for various demographic
groups and how the sample compares, for example. The real vote totals are the best avail-
able information on hand for the exit poll operation, which can take advantage of getting
the vote information at the precinct level as well as having sampled at the precinct level.
After the third round of reporting their data from exit poll questionnaires,
interviewers get the candidate vote tallies from precinct officials at their interviewing
precincts, as well as the total number of voters who actually cast ballots that day, as
soon as those numbers are available. Additional reporters collect data from additional
sample precincts. This collection is arranged in advance, part of an effort that involves
outreach to elections and precinct officials well before Election Day and credentialing
and training interviewers, that comprises such a large and important part of the exit
poll setup. Not every county and every state makes these reports available; however,
the majority do. It is usually known in advance which states and counties can provide
data. This is not reflected in the sampling beforehand, but analysts can adjust their
expectations.
For precincts with both reported and survey vote, the difference between the
weighted survey vote and the actual reported vote can then be computed, and once enough
such precincts are available that the two counts can be compared with confidence, an
adjustment can be made in the statewide estimate that reflects any estimated overstate-
ment of a candidate, if there is one, throughout the surveyed precincts. This adjustment
can help the poll maintain its accuracy in the time between poll closing and when a
large amount of the official vote becomes available. Part of the role of the analysts in the
newsroom and at decision desks is to evaluate the possibility and magnitude of such
occurrences throughout the night. Official vote data are therefore incorporated from
precincts into survey weightings, estimation models, and ultimately the networks’ elec-
tion estimates. The exit poll estimates weight to the best available estimate for each can-
didate based on the models and including reported votes at the regional and eventually
state levels, when that information becomes available later. This vote count component
thus delivers both improved estimates on election night and an improved statewide esti-
mate in its final form.
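A stylized version of that comparison, with invented numbers, is sketched below: the signed gap between the weighted exit poll share and the counted share in paired precincts (the within-precinct error) is averaged and used to shift the survey-only statewide estimate. Real election-night models are considerably more involved.

paired_precincts = [
    # (weighted exit poll share for candidate A, officially counted share); invented
    (0.54, 0.51), (0.48, 0.47), (0.61, 0.57), (0.44, 0.43), (0.52, 0.50),
]

errors = [poll - official for poll, official in paired_precincts]
mean_overstatement = sum(errors) / len(errors)

survey_only_estimate = 0.53                      # exit-poll-only statewide share
adjusted = survey_only_estimate - mean_overstatement
print(f"mean overstatement of candidate A: {mean_overstatement:+.3f}")
print(f"adjusted statewide estimate: {adjusted:.3f}")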
Even with this procedure, a poll is still a poll, and estimates for subgroups in
particular—​that is, smaller populations of voters—​will be expected to have somewhat
higher margins of error. A New York Times Upshot blog (Cohn 2016) compared the exit
poll results to those of the Census’s Current Population Survey and to the records of
voter files—​each of which is itself also subject to possible error (e.g, Ansolabehere and
Hirsh 2012; Belli et al. 1999)—​and suggested that there could be relatively more voters of
higher education in exit polls than there are in the electorate.

Models and Estimates

A decision team will have at its disposal a number of models running concurrently on
the data, designed to assist in making statewide estimates for each candidate and in
making evaluations of the exit poll performance early in the night, before a lot of county
and precinct vote information is available. Some of these estimators group the precincts
by past partisanship (measured by the past Democratic vote and past Republican vote)
before forming statewide estimates, while others group precincts geographically. The
exit poll design also allows analysts to examine the correlations at the precinct level be-
tween the current race and a range of selected past races; typically these correlations are
high, and if they are not, some of the models are not likely to perform as well. But even
if they do not, that does not necessarily mean the exit poll as a whole is off. Each cam-
paign is different, and one also needs to evaluate it in context, such as the possibility
that a campaign is in fact doing unusually well or poorly in a particular geographic area
(e.g., the candidate’s home region), or look for consistencies in the findings across sim-
ilar precincts, regardless of geography.
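As a rough sketch of that diagnostic, with invented precinct margins, one can compute the precinct-level correlation between the current exit poll margin and the past-race margin:

past_margin    = [0.12, -0.05, 0.20, -0.15, 0.03, 0.08, -0.22, 0.17]
current_margin = [0.10, -0.02, 0.18, -0.12, 0.05, 0.06, -0.25, 0.15]

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

r = pearson(past_margin, current_margin)
print(f"precinct-level correlation with the past race: {r:.2f}")
# A low correlation warns that estimators keyed to the past race may perform
# poorly, even if the exit poll data themselves are sound.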
Finally, the discussion is not complete—​and the exit poll is not complete—​without
accounting for absentees. The telephone poll component is designed to survey absentee
and early voters, who will not be intercepted at the precinct. All respondents are asked
the demographic and selection weighting questions (number of telephones, adults),
the survey is weighted to known population parameters, and subsequently the early/​
absentee vote is estimated from the subset of voters who said they have voted or are
voting absentee or early. The phone poll is done via traditional random digit dialing
(RDD) sampling, includes cell phones (the percentage of completes on cell phones will
vary), and is eventually combined with the in-​person data such that the absentee poll
respondents are represented in the same proportion as absentee voters are statewide.
Initially these estimated proportions are drawn from research, past vote, and election
officials. State exit polls in which both in-person interviews and a phone poll are conducted
are thus multimode surveys, meaning two different methods of data collection are
combined. The questions asked of respondents in each are the same. Interviewing
for these polls continues through the final weekend before the election.
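The arithmetic of combining the two modes can be illustrated with invented figures: the mode-specific estimates are mixed in proportion to the estimated absentee share of all ballots cast.

in_person_share_A = 0.49      # candidate A among Election Day voters (invented)
absentee_share_A = 0.55       # candidate A among absentee/early voters (invented)
absentee_fraction = 0.35      # estimated share of all ballots cast early/absentee

combined = (1 - absentee_fraction) * in_person_share_A + absentee_fraction * absentee_share_A
print(f"combined statewide estimate for candidate A: {combined:.3f}")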
On Election Day one of the important tasks is to ascertain the final size of the absentee
vote, but the ease or difficulty of this function can vary by state, depending on the way
in which states or counties report counts of early and absentee votes. Some counties re-
port absentee totals as separate, virtual precincts that hold only absentee counts. Others
combine the counts of absentees at the polling place with Election Day voters. This can
affect within-​precinct error calculations early in the evening if the reported vote counts,
when obtained, might include ballots of voters whom the interviewer did not have the
opportunity to approach, so the analyst has to know the counting methods state by state
when making evaluations.

What the Future May Hold

Those absentees are as a good a place as any to jump off into thoughts about the fu-
ture,1 because the number of absentee voters is growing. If trends continue, soon almost
four in ten voters nationally will vote early or absentee. The use of so-​called conven-
ience voting methods—​early voting; absentee voting in many forms, such as same-​day
request and drop-​off; permanent absentee lists whose voters automatically get a ballot in
the mail—​has jumped dramatically and across most states in the last ten years. In 2000,
16% of the nation’s ballots were cast early or absentee, according to estimates provided by
the Associated Press. In 2004 that portion went up to 22%; it jumped to 34% in 2008 and
roughly matched that level (35%) in 2012.
Some states, like Colorado, joining the ranks of Washington and Oregon, have now
moved to voting by mail, so there are no conventional polling precincts at which to in-
terview voters.2 In these places the exit poll data are currently collected by telephone
poll, in the days just before Election Day (or perhaps we should call it “counting” day.)
Whereas years ago absentee voters were a small segment (and conventional wisdom
was that they leaned conservative, often being older voters), today the absentee votes
much more closely resemble the wider electorate, and in recent elections Democrats
appeared to have the edge with absentees. For example, President Obama in 2012
won states by winning the absentee/​early vote despite losing the precinct Election
Day vote.
In this regard the world is quite different now than when exit poll methods were de-
veloped, and the changes have accelerated in the last ten to fifteen years. In the late 1960s
and 1970s, when much of this methodology was developed, and even through the late
1990s and early 2000s, it was perfectly reasonable to simply describe most Americans’
voting patterns as precinct based; more than nine in ten cast ballots in a local precinct,
on the day of the election.3
Accompanying this rise in absentee voting are some improvements in voter list com-
pilation and maintenance, especially in the last decade (Alvarez et al. 2012), and there
are more publicly available voter lists in many states (see, e.g., Green and Gerber 2008;
Eisenberg 2012), as well as many publicly available, real-​time state records of absentee
and early voters during pre-​election periods. In some states it is possible to obtain data
on the voters who requested and returned early/​absentee ballots.4
In the aggregate, these data could be routinely incorporated into the phone portion of
exit polling to help estimate the size and geographic distribution of absentee votes before
Election Day or to help guide adjustments to demographic targets for the population of
known absentee voters (because age or gender are often known from the voter file).
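One hypothetical form such an adjustment could take, sketched here with invented counts, is a simple post-stratification of phone respondents to the known regional totals of returned ballots:

returned_ballots = {"north": 120_000, "metro": 310_000, "south": 95_000}   # invented
respondents      = {"north": 180,     "metro": 420,     "south": 150}      # invented

total_ballots = sum(returned_ballots.values())
total_resp = sum(respondents.values())

weights = {
    region: (returned_ballots[region] / total_ballots) / (respondents[region] / total_resp)
    for region in respondents
}
for region, w in weights.items():
    print(f"{region}: weight {w:.2f}")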
The primary sampling unit across most exit polling is the precinct. Might the need to
account for more and more absentee voters lead to changing that, so that the voter is the
sampling unit? List-​based sampling is not currently used for the phone portion of exit
polls, but it could be considered. That might make the study more efficient, confirm or
provide geographic and vote history information, and provide other data that could en-
hance the study.5
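A sketch of what such list-based selection might look like, using an invented voter file and proportional allocation by region; a real design would also need telephone matching and nonresponse adjustments.

import random

voter_file = [
    {"voter_id": f"{region}-{i}", "region": region}
    for region, count in [("north", 4000), ("metro", 9000), ("south", 3000)]
    for i in range(count)
]

def stratified_sample(records, n_sample, seed=7):
    rng = random.Random(seed)
    by_region = {}
    for rec in records:
        by_region.setdefault(rec["region"], []).append(rec)
    total = len(records)
    sample = []
    for region, recs in by_region.items():
        k = round(n_sample * len(recs) / total)   # proportional allocation
        sample.extend(rng.sample(recs, k))
    return sample

selected = stratified_sample(voter_file, n_sample=800)
print(len(selected), "listed absentee voters selected for telephone interviewing")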
The exit poll at present does not attempt to validate reported absentee votes, but one
might do so with these lists. This is not an issue at the precinct intercept, but in phone
polls there is, in theory, a possibility that the voter never really turned in a ballot or in-
correctly remembered what he or she did or was thinking while filling it out, with a
greater time distance between the vote and the interview, especially if the absentee pe-
riod is lengthy.
There are, however, issues in all these possibilities that would need to be addressed.
Unlike RDD samples, which start with phone numbers we can append and dial, voter
list samples do not always include phone numbers, so phone numbers must be found
for selected voter records, which in turn usually involves a secondary database and a
matching algorithm (see, e.g., Ansolabehere and Hirsh 2012), whose ability to match
may vary from state to state or may create biases in the sample. And voter lists and
the specific information on them—​as well as the accuracy of the information—​vary;
missing or incorrect information could also be a source of error. On the other hand, it is
possible to estimate something about those not matched or included from other infor-
mation already on the list, and weighting could be designed accordingly as well.
This would necessitate a state-​by-​state design in sampling methodology to account
for state-​by-​state differences. Under the current approach the same basic RDD meth-
odology can be applied to each state. This would also require understanding any
differences arising from these differing approaches when comparing states, accounting
for the availability of the lists from one year to the next, and keeping in mind any large
differences in sample design when comparing results.
Next consider the in-​person early vote. In many states traditional polling places, with
voting machines, are set up days or weeks in advance of the traditional Election Day. The
voters in these locations are currently covered by the phone polls, but they could con-
ceivably be interviewed in person. The early voting locations are known in advance and
could be sampled. Because the early voting period runs for many days, the researcher
might sample days or sample hours across days to minimize any time of day effects,
to station an in-​person interviewer. Years ago Murray Edelman and Warren Mitofsky
discussed the terms “voter poll” versus “exit poll”; sampling voters in person at these
locations would be one way to interview these voters upon exiting, also (Edelman 2015).
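One hypothetical way to spread interviewing across a multi-day early voting period, sketched with invented days and time blocks, is to sample day-and-time slots at each sampled site:

import random

days = [f"day_{d:02d}" for d in range(1, 13)]     # a 12-day early voting period (invented)
blocks = ["morning", "midday", "afternoon", "evening"]
slots = [(day, block) for day in days for block in blocks]   # 48 possible shifts

rng = random.Random(3)
shifts = rng.sample(slots, 8)                     # 8 interviewing shifts at this site
for day, block in sorted(shifts):
    print(day, block)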
But one hurdle to this approach, besides the increased cost, would be sampling based
on size of place or past vote. Recall that precinct size and past vote are both involved in
the precinct sampling, but there are usually fewer early voting locations than there are
precincts, such that they cover wider geographic areas, and it would be more difficult to
know which voters are showing up at which location this time, as opposed to in the last
election. It is not unknown for states or counties to use consolidated voting centers—​for
example, places where any voters within a county can vote on Election Day, regardless
of their home precincts. One potential design could station exit poll interviewers at that
center, but comparing the center’s reported vote to past vote would pose a challenge if
the past vote is only known by precinct, or county, and not by this central location.
In a similar vein, added interviewers could be assigned to any given polling place on
Election Day. Although the sampling rate should give interviewers time and space to
complete the interviews, it is possible that at more crowded locations and busy times
a second interviewer could help get additional completes. This, however, would bring
with it added costs.
In the 2015–​2016 cycle the Associated Press and GfK, doing research under a grant
from the Knight Foundation, conducted experiments in polling with the expressed in-
tention of looking for alternatives to the exit poll. Their report cited “the rise of early and
absentee voting” and the need for exit polling to do phone polling to cover that, which
added to the exit polls’ costs and, the report asserted, added challenges to its accuracy
(Thomas et al. 2015, 2016). In their pilot studies they described using a probability-​based
online panel to interview voters, with multiday field periods, and they were able to dem-
onstrate accuracy in estimating final vote results. One key methodological difference
between these experiments and the exit poll, of course, is that the traditional exit poll
had the in-​person component on Election Day, whereby voters who did vote at the
polling place were interviewed leaving it. For context, the reader will note that many
pre-​election polls, conducted by phone and online, are able to gauge the outcome of an
election, whereas the exit poll has historically been an interview at the polling place, for
the voters who must or choose to vote in person.
Continuing with thoughts about the use of external voter data, consider the potential
for precinct-​level research that is based on public voter file information. The exit poll
ascertains gender and race of voters by asking them or by interviewer coding. Would it
make sense to incorporate into the estimate precinct-​level data such as age distributions
(known from the voter file) in the precinct or the precinct’s known absentee voters? It
might, but the survey researcher needs to be aware of changing parameters in the pop-
ulation, too—​in this case, how the characteristics of actual voters might differ from
those of all registered voters in the precinct. Unless that was accounted for, a poll could
be making assumptions in the weighting or sampling that introduce error. States with
same-​day registration could introduce added difficulties.
It is possible that precinct-​level demographic characteristics could be used or incor-
porated into the initial sampling as well, but the researcher would have to be mindful of
the risk of introducing error, if there were differences in turnouts among those groups
or errors in the source data, whereas past vote and total turnout have often (though not
always) been consistent and uniform in their swings across most areas in most states.
Still, one can certainly imagine a role for added precinct-​level data in exit poll estimates,
and that this could help estimates in states where such past races have not been uniform.
A statewide multivariate estimate using added demographic variables at the precinct
level could certainly be estimated, provided one was confident in the estimates of sub-
group turnout in the precincts compared to the precinct’s composition. In other words,
if a precinct is, say, 60% female, and one was using that data in a multivariate estimate,
it would be important to know with some confidence whether actual turnout was in
fact 40% female or 70% female. This would be difficult to validate on the fly beyond the
questionnaire responses. Remember that the exit poll ultimately incorporates the ac-
tual vote results into its estimates, which it gets from the precinct officials, but unlike
the vote counts, officials do not and cannot provide other information about the day’s
voters, such as age, race, or gender. That information can be gleaned later from voter
files, but that takes months, and the exit poll is a real-​time operation. So any multivariate
estimates need to account for some remaining uncertainty if they use those parameters.
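The concern can be made concrete with a small worked example, using invented support levels: weighting the same exit poll responses to the voter-file composition rather than to actual turnout composition shifts the estimate.

share_A_women, share_A_men = 0.58, 0.44   # exit poll support for candidate A (invented)

def weighted_estimate(pct_women):
    return pct_women * share_A_women + (1 - pct_women) * share_A_men

assumed_from_file = weighted_estimate(0.60)   # the voter file says 60% female
actual_turnout    = weighted_estimate(0.50)   # but today's voters are 50% female
print(f"estimate weighted to the file: {assumed_from_file:.3f}")
print(f"estimate weighted to actual turnout: {actual_turnout:.3f}")
# The gap (about 1.4 points here) is error introduced by the assumption alone.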
In this era, when tablet computers seem ubiquitous and cellular networks allow
fast transfer of data, the exit poll is done on paper and read in over the phone for
processing and tabulation, as we noted previously. This raises the question of what
possibilities—​or what gains or disadvantages—​might open up if the poll were done
on a tablet or electronic device. Several come to mind. On a tablet, the exit poll
might offer more timely or late-​breaking questions, because paper questionnaires
take time for printing and distribution to the interviewers, at least a few days before
Election Day. This rarely raises any issues, and single breaking news questions can
be written in by hand, but this process has obvious limitations in format. A major
event, or one that needed to be probed in greater detail, could be covered with new
questions if the questions could be downloaded right onto a device, maybe even
on Election Day. Question rotation could be programmed and readily randomized.
The layout could be more malleable, with fewer limits than the edges of a piece of
paper. Data transmission might be faster, so more interviews could be completed, or
processed instantaneously, which would afford more opportunities to analyze
interviews between the currently prescribed call-in times. New updates and new
data would be available to everyone faster, and after all, fast is a big part of what the
exit poll is all about.
Joe Lenski, executive vice president of Edison Media, the firm that fields the exit poll
for the networks and performs many other intercept polls for private research, notes
(Lenski 2015; Lenski 2016) that when electronic devices are used in other studies, people
do not seem put off by technology, including older voters—​perhaps part of that same
ubiquity of the devices—​so there might be less worry that only tech-​savvy young people
would participate. Lenski also notes that in these studies global positioning system
(GPS) technology can help interviewers find their correct interviewing locations. If
applied to voting, GPS technology could offer an important time saver on Election Day,
and real-​time monitoring by supervisors could be easier, to ensure performance or ad-
just interviewing rates.
But technology brings issues, too, that paper does not have. One is security: paper is
not going to be a target for thieves the way an electronic device might. And paper does
not need recharging. The exit poll interviewer is out all day, from poll opening in the
early morning to closing sometimes after 8:00 or 9:00 p.m., and delays in recharging
might lead to time-​of-​day errors in the interviewing. It is difficult to get a device’s bat-
tery to last all day without charging when it is not in constant use, let alone to get it to
last a few hours with the screen turned up and transmitting data. Battery life would be
yet something else for an interviewer to monitor. Moreover, given differences in types of
polling locations and distance requirements, not all interviewers are inside near a usable
power outlet. This issue could introduce a bias toward polling places that have available
charging facilities and against those where the interviewer is standing in a parking lot.
In addition, not every place in the United States has data coverage, and sampling only
precincts that do have it could introduce error.
And then there is the weather. If rain falls on a piece of paper, it wrinkles but might
still be usable. With an electronic device, there might be a very different story unless
some weather protection is devised. November can bring a lot of uncertain weather in
much of the United States, and exit polling takes place in rain, snow, cold, and heat.
Speaking of heat, in bright sunlight there are issues with visibility of tablet and smart-
phone screens, as anyone who has used either outdoors can attest, whereas there is no
such problem with a piece of paper. Some kind of shade might also be needed, which
would be yet another item for the interviewer to take care of.
Also consider that device failure could cancel out an entire precinct’s worth of data,
which would be a difficult loss on Election Day, and using backup devices would dra-
matically increase costs (always a concern to any researcher) as well as burden the inter-
viewer with having to store the backup and keep it charged and safe.
This is hardly the first situation in which information and technology have offered
challenges and potential benefits for exit polling (see, e.g., Frankovic 1992). And of
course in an increasingly wired, social-​media-​obsessed world, it seems there is more
and more discussion—​much of it anecdotal—​on Election Day about things like turnout
and what’s happening at polling places. However, exit polls do not speculate on these
topics: they are not used to characterize the state of a race until after all the polls have
closed in a given state, so that people have a chance to vote, and only then can we all start
discussing what the voters chose and why.
Exit polls remain an essential part of understanding U.S. elections, both while they
are happening and afterward. They help project races, but can we project their future?
One can imagine them continuing to improve and change along with accessibility of
“big data” and broader trends in voting.

Notes
1. Please note that the descriptions and discussions of exit polls here are presented by the
author only for promoting scholarly and theoretical consideration of the practice of exit
polling more generally—​that is, to consider them as a survey research project in itself. No
part of the discussion about future methodology should be taken as directly describing or
evaluating current or planned projects, proposals, or procedures of the National Elections
Pool or any of its contractors or vendors.
2. More discussion and turnout data including early vote can be found at http://​www.
electproject.org/​2014_​early_​vote.
3. The National Election Pool has contracted with Edison Research since 2003 (along with
Mitofsky International, until 2006) to conduct exit polling on its behalf; before that exit
polling was done through arrangement with Voter News Service and, prior to that, with
Voter Research and Surveys (VRS).
4. For examples of uses, see, e.g., Salvanto et al. (2003) and Cohn (2014).
5. For the results of a study of this for exit polling, see Mitofsky, Bloom, Lenski, Dingman,
and Agiesta (2005); for more discussion, see, e.g., Brick (2011); Green and Gerber (2006);
Butterworth, Frankovic, Kaye, Salvanto, and Rivers (2004).

References
Alvarez, R. M., et  al. 2012. “Voting:  What’s Changed, What Hasn’t, & What Needs
Improvement.” Report of the CalTech/​MIT Voting Technology Project. http://​vote.caltech.
edu/​content/​voting-​what-​has-​changed-​what-​hasnt-​what-​needs-​improvement.
Ansolabehere, S., and E. Hirsh. 2012. “Validation:  What Big Data Reveal About Survey
Misreporting and the Real Electorate.” Political Analysis 20 (4):  437–​ 459. http://​pan.
oxfordjournals.org/​content/​early/​2012/​08/​27/​pan.mps023.short.
Belli, R. F., M. W. Traugott, M. Young, and K. A. McGonagle. 1999. “Reducing Vote
Overreporting in Surveys.” Public Opinion Quarterly 63 (1): 90–​108.
Best, S. J., and B. S. Kruger. 2012. Exit Polls: Surveying the American Electorate. Washington,
DC: CQ Press.
Blumenthal, M. 2004. “Exit Polls: What You Should Know.” Mystery Pollster Blog, November
2. http://​www.mysterypollster.com/​main/​2004/​11/​exit_​polls_​what.html.
Brick, M. J. 2011. “The Future of Survey Sampling.” Public Opinon Quarterly 75 (5): 872–​888.
Butterworth, M., K. Frankovic, M. Kaye, A. Salvanto, and D. Rivers. 2004. “Strategies for Surveys Using RBS and RDD Samples.” Paper presented at the annual meeting of AAPOR, Phoenix, AZ, May 13–17.
Cohn, N. 2014. “Early Voting Returns for Midterms in Colorado, North Carolina and Georgia.”
New  York Times, The Upshot (blog), October 31. http://​www.nytimes.com/​2014/​10/​31/​up-
shot/​early-​voting-​election-​results.html?abt=0002&abg=1.
Cohn, N. 2016. “There Are More White Voters Than People Think.” New  York Times, The
Upshot (blog), June 9. http://​www.nytimes.com/​2016/​06/​10/​upshot/​there-​are-​more-​white-​
voters-​than-​people-​think-​thats-​good-​news-​for-​trump.html?_​r=0.
Edelman, M. 2015. Interview with author, January.
Edelman, M., and D. M. Merkle. 1995. “The Impact of Interviewer Characteristics and Election
Day Factors on Exit Poll Data Quality.” Paper presented at the annual conference of the
American Association for Public Opinion Research, Fort Lauderdale, FL, May 18–​21.
Eisenberg, S. 2012. The Victory Lab:  The Secret Science of Winning Campaigns. New  York:
Broadway Books.
“Evaluation of Edison/​Mitofsky Election System 2004.” 2005. Prepared by Edison Media
Research and Mitofsky International for the National Election Pool (NEP). January 19.
Frankovic, K. A. 1992. “Technology and the Changing Landscape of Media Polls.” In Media
Polls in American Politics, edited by T. Mann and G. Orren, 32–​54. Washington, DC: The
Brookings Institute.
Green, D. P., and A. S. Gerber. 2006. “Can Registration-​Based Sampling Improve the Accuracy
of Midterm Election Forecasts?” Public Opinion Quarterly 70 (2, Summer): 197–​223.
Green, D. P., and A. S. Gerber. 2008. Get Out the Vote. Washington, DC: Brookings Institute
Press.
Groves, R. M., D. A. Dillman, J. L. Eltinge, and R. J. A. Little, eds. 2002. Survey Nonresponse.
New York: John Wiley & Sons.
Groves, R. M., P. P. Biemer, et al., eds. 1988. Telephone Survey Methodology. New York: John
Wiley & Sons.
Lenski, J. 2015. Interviews conducted with author, Edison Media Research, January.
Lenski, J. 2016. Interviews conducted with author, Edison Media Research, December.
Lepkowski, J. M., C. Tucker, et  al., eds. 2008. Advances in Telephone Survey Methodology.
New York: Wiley and Sons.
Merkle, D., and M. Edelman. 2000. “A Review of the 1996 Voter News Service Exit Polls from a
Total Survey Error Perspective.” In Election Polls, the News Media, and Democracy, edited by
P. J. Lavrakas and M. W. Traugott. New York: Chatham House.
Merkle, D., and M. Edelman. 2002. “Nonresponse in Exit Polls: A Comprehensive Analysis.” In
Survey Nonresponse, edited by R. M. Groves, D. A. Dillman, et al., 243–258. New York: John
Wiley & Sons.
Mitofsky, W. 1991. “A Short History of Exit Polls.” In Polling and Presidential Election Coverage, edited by P. Lavrakas and J. Holley, 83–99. Newbury Park, CA: Sage.
Mitofsky, W., J. Bloom, J. Lenski, S. Dingman, and J. Agiesta. 2005. “A Dual Frame RDD/​
Registration-​Based Sample Design:  Lessons from Oregon’s 2004 National Election Pool
Survey.” In Proceedings of the Survey Research Methods Section of the American Statistical
Association, Alexandria, VA, 3929–​3936.
Mitofsky, W., and M. Edelman. 1995. “A Review of the 1992 VRS Exit Polls.” In Presidential Polls
and the News Media, edited by P. J. Lavrakas, M. Traugott, and P. Miller, 81–​99. Boulder,
CO: Westview Press.
Mitofsky, W., and M. Edelman. 2002. “Election Night Estimation.” Journal of Official Statistics
18 (2): 165–​179.
Salvanto, A. 2003. “Making Sure Absentees Aren’t Absent.” Paper presented at the annual
meeting of the American Political Science Association, Philadelphia, PA.
Thomas, R. K., F. M. Barlas, L. McPetrie, A. Weber, M. Fahimi, and R. Benford. 2015. “Report
for the Associated Press: November 2015 Election Studies in Kentucky and Mississippi.” GfK
Custom Research, December.
Thomas, R. K., F. M. Barlas, L. McPetrie, A. Weber, M. Fahimi, and R. Benford. 2016. “Report
for the Associated Press:  March 2016 Presidential Preference Primary Election Study in
Florida.” GfK Custom Research, May. https://​www.ap.org/​assets/​documents/​fl_​2016_​re-
port.pdf.
Chapter 8

Sampling Hard-to-Locate Populations

Lessons from Sampling Internally Displaced Persons (IDPs)

Prakash Adhikari and Lisa A. Bryant

At its heart, survey research is about people. It is about capturing and tracking the
preferences, beliefs, opinions, and experiences of individuals. For government officials,
surveys provide a link between policymakers and those affected by policies. For social
scientists, surveys provide an understanding of how and why people behave as they do.
For the public, surveys provide an opportunity to share their experiences, voice opinions
about important issues, and in some cases influence change in policies and programs
(Brehm 1993; Tourangeau 2004; Dillman, Smyth, and Christian 2009). The accuracy of
surveys in providing this information depends on asking clear questions and collecting pertinent information, but it also depends on the representativeness of the sample, the size of the sample, and how respondents are selected. Researchers have long been concerned about these issues, but as the science of survey methodology has advanced and questions have become more nuanced, ensuring that both the sample population and the sample respondents are representative, in order to prevent bias in the results and error in their interpretation, has become an even more pressing issue. Poorly defined sampling
frames and underrepresentation are threats to the reliability, validity, generalizability,
and usefulness of the data.
While many populations are somewhat easy to identify and are fairly accessible,
researchers can face formidable sampling issues and methodological challenges in
acquiring valid data for hard-​to-​survey populations. For example, if one is trying
to collect information on the homeless, simply identifying these populations and
creating a sampling frame poses a challenge, given that we have inaccurate census
data on the homeless population in most areas (Kearns 2012). Traditional contact
methods such as the Internet, telephone surveys, and mail surveys are likely out of
the question, as these individuals often have little to no access to technology and have
no stable residence or mailing address. Researchers might be able to leave surveys at
shelters or take to the streets to conduct face-​to-​face interviews, or they might have
to rely on convenience or snowball sampling to capture enough individuals to have
meaningful results. Similarly, acquiring accurate data on victims of natural disaster
and armed conflict, who are displaced, is extremely challenging if not impossible.
In such cases, it is often unknown exactly how many people were driven from their
homes or where they relocated. While this chapter focuses on displaced persons, it is
important to note that identifying and collecting quality information from hard-​to-​
locate populations is a large and important problem that affects researchers working
with a wide variety of populations. Contacting hard-​to-​locate populations is an issue
for epidemiologists who study communicable diseases; nonprofits and nongovern-
mental organizations (NGOs) that are trying to provide clean drinking water, health-
care services, and even shelter to those in need; marketers who are trying to get
pharmaceuticals and medical devices to vulnerable populations in underdeveloped
parts of the world; and environmental scientists and those in agriculture and natural
resource management, among others.
Strictly academic literature illustrating the nature of these difficulties in survey re-
search and ways to address the challenges researchers face in accurately sampling and
surveying such populations is still somewhat limited. However, there is a large and ever-​
growing body of research on these problems that is produced by government agencies,
such as the U.S. Census Bureau (Avenilla 2012; Durante 2012) and the Department of
Labor (Gabbard and Mines 1995), as well as nonprofits and NGOs, including the United
Nations (van der Heijden et al. 2015), that are implementing a variety of techniques to
learn about and deal with important issues such as sex trafficking, slavery, poverty, the
spread of disease, and terrorism, to name a few. This chapter discusses the use of non-​
random sampling for these hard-to-survey populations, addresses the challenges faced
in enumerating hard-​to-​reach populations, and develops guidelines for best practices
in sampling such populations. Using a study that surveyed internally displaced persons
(IDPs) in the aftermath of the Maoist insurgency in Nepal that began in 1996, we dem-
onstrate the application of some best practices in studying hard-​to-​survey populations.
Overall, we demonstrate that the challenges of studying hard-​to-​survey populations can
be overcome with good planning and a little extra effort, and by being attentive to local
conditions, language, and culture.

Categories of Hard-to-Survey Populations

In an ideal situation, researchers have a sample frame that includes the complete list of
all members of a population from which they can draw a sample to interview. However,
there are a number of important groups with unknown or uncertain populations to
which standard sampling and estimation techniques are simply not applicable. For ex-
ample, in some cases it may be difficult to calculate or identify the sample population
based on population estimates, as is the case with the LGBT community, for whom the
true population is unknown, but that is estimated to make up approximately 3.8% of the
population in the United States (Newport 2015), and for whom sampling populations
cannot be identified based on standard demographic questions such as gender, age,
ethnicity, or even religion. Other groups may be hard to estimate because they do not
want to be identified by a certain characteristic, such as undocumented immigrants,
documented immigrants who hold green cards or visas, victims of sexual assault or
child abuse, or even individuals who commit crimes or engage in other forms of illegal
behavior. Certain populations are simply harder to reach due to geographical or contex-
tual issues, including war and natural disasters, or have remote living quarters in hard-​
to-​reach locations, such as mountainous or jungle villages with little to no physical (e.g.,
roads, transportation) or technological (e.g., electricity, phones, Internet) infrastructure
in place.
Reasons that some populations are more difficult to reach and survey can generally be
grouped into five broad categories: (1) hard to identify, (2) hard to sample, (3) hard to lo-
cate, (4) hard to persuade, and (5) hard to interview (Tourangeau 2014). Different sam-
pling strategies can be utilized to recruit from each of these hard-​to-​reach categories.1
These sampling techniques include the use of special lists or screening questions, mul-
tiple frames (Bankier 1986; Kalton 2003; Lohr and Rao 2000, 2006), disproportionate
stratification (Groves et al. 2009; Kalton 2001; Stoker and Bowers 2002), multiplicity
sampling (Lavrakas 2008; Rothbart et  al. 1982), snowball sampling (Atkinson and
Flint 2001; Browne 2005; Cohen and Arieli 2011; Noy 2008; Welch 1975), multipur-
pose surveys (Fumagalli and Sala 2011; Groves and Lyberg 2010), targeted sampling
(TS) (Watters and Biernacki 1989), time-​location (space) sampling (TLS) and facility
based sampling (FBS) (Magnani et al. 2005), sequential sampling (Myatt and Bennett
2008), chain referral sampling and respondent-​driven sampling (RDS) (Aronow and
Crawford 2015; Goel and Salganik 2010; Heckathorn 1997, 2002, 2007; Platt et al. 2006;
Salganik and Heckathorn 2004; Volz and Heckathorn 2008; Wejnert and Heckathorn
2008; Wejnert 2009), indigenous field worker sampling (IFS) (Platt et al. 2006), conven-
tional cluster sampling (CCS) and adaptive cluster sampling (ACS) (Seber and Salehi
2012; Thompson 1997; Thompson and Seber 1994), and capture recapture (CR) sampling
(Aaron et al. 2003; Fisher et al. 1994; LaPorte 1994). There have also been innovative and
significant advances using geospatial tools for sampling, using high dimensional data
for locating IDPs in places such as Darfur in Sudan and in Colombia (Florance 2008);
for mapping of disease patterns (Tatem et al. 2012); and for carrying out UN peace-
keeping missions (MacDonald 2015). This chapter cannot include a complete review of
all of these methods; however, we offer a brief discussion of how some of these methods
could be used in practice and provide an in-​depth example of how a combination of
CCS, snowball sampling, and non-​random sampling techniques were used to survey
hard-​to-​reach populations in Nepal.

Hard-​to-​Identify Populations
One reason that populations may be hard to survey is the difficulty in identifying
members of certain populations. There are a variety of reasons that some groups do not
want to be identified. One reason may be that there is some sort of “stigma” attached
to or associated with the identification of certain populations (Tourangeau 2014).
Another can stem from something seemingly simple, such as defining a population
of interest based on a specific characteristic (Tourangeau 2014) where underreporting
is a common problem, such as being an adopted child or receiving federal assistance.
There are also issues that are very complex, such as identifying cultural or religious
minorities, immigrants in the United States (Hanson 2006; Massey 2014), and vulner-
able populations like former child soldiers or victims of domestic violence. A variety of
tools such as including screening questions and various recruitment techniques can be
used to address many of these types of issues.
People are often reluctant to disclose certain characteristics or demographic in-
formation such as income or age, and in these cases undercoverage can be a major
issue. (See Horrigan, Moore, Pedlow, & Wolter 1999 for one of the best examples of
underreporting that occurred in the National Longitudinal Survey of Youth in 1997.)
One way to get access to a household member of the desired age group is to include
screener questions. Some screening methods may work better than others, and in 2012
Tourangeau et al. (2014) carried out an experiment to capture underreporting when
respondents were asked to disclose their age. Respondents were divided into three
groups, and each group was asked to respond to a somewhat varied version of the same
question. One group of households was directly asked (a) “Is anyone who lives there
between the ages of 35 and 55?” Another group was asked (b) “Is everyone who lives
there younger than 35? Is everyone who lives there older than 55?” (c) A third household group was administered a battery of questions asking each member of the family to report his or her “sex, race, and age.” The final approach returned a higher response
rate, 45%, than the first direct approach (32%) and the second (35%). The third approach
is called a “full roster approach” and is recommended for overcoming issues associ-
ated with age-​related underreporting (Tourangeau 2014). The full roster approach
also provides the added benefit of not revealing to respondents the specific group
of interest; however the additional questions can sometimes come at the expense of
increased interviewing time and the possibility of lower response rates (Tourangeau,
Kreuter, and Eckman 2015), especially in telephone surveys, in which it is easier to cut a
survey short than it is in a face-to-face setting. To prevent early termination, the number of screening questions should therefore be kept to a minimum so that respondents are not irritated.
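To make the logic of the full roster approach concrete, the sketch below shows how eligibility can be determined afterward from a roster of sex, race, and age, without the screener ever revealing the target age range. The function, field names, and example household are illustrative assumptions, not the instrument used in the experiment described above.

```python
# A minimal sketch of full-roster screening logic (illustration only, with assumed field names).

def eligible_members(roster, min_age=35, max_age=55):
    """Return household members in the target age range.

    The respondent is only asked for each member's sex, race, and age, so the
    35-55 range of interest from the example above is never revealed to them.
    """
    return [m for m in roster if m.get("age") is not None and min_age <= m["age"] <= max_age]

# Example household reported through the full roster questions.
household = [
    {"sex": "F", "race": "white", "age": 62},
    {"sex": "M", "race": "white", "age": 44},
    {"sex": "F", "race": "white", "age": 17},
]
print(eligible_members(household))  # -> only the 44-year-old falls in the 35-55 target group
```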
Finally, individuals who are victims of conflict, such as IDPs, victims of wartime
sexual violence, persons engaging in criminal activity, or people who are carriers of
communicable diseases may not want to be identified, creating yet another identifi-
cation problem. As discussed in detail in the next section, people who have fled their
homes due to intimidation by rebel groups do not want to be identified for fear of
reprisal. Snowball sampling and RDS are common techniques used to deal with the
issue of hard-​to-​identify populations and provide an immediate personal connection
for researchers to draw in respondents.
Snowball sampling, or chain referral sampling, is a nonprobability sampling tech-
nique that depends on a referral (or referrals) from initially known subjects (or seeds) to
recruit additional subjects into the sample (Coleman 1958; Shaghaghi, Bhopal & Sheikh
2011). This method uses a chain of social networks to reach the targeted population. For
example, a study on adoption might begin by recruiting one person or a few people who
are known to the researchers to have adopted children, a population that can be elusive
due to legal privacy concerns or issue sensitivity. The adoptive parents included in the
study are asked to recommend additional participants they know through their support
groups or social networks. When interviewed, those participants will be asked to rec-
ommend and provide an introduction to their associates as additional participants,
and so on, until the desired sample size is reached. One major assumption in snowball
sampling is that there are links between initial subjects and other known subjects in
the population of interest (Biernacki and Waldorf 1981; Shaghaghi et al. 2011). Snowball
sampling is especially useful in trying to recruit members of vulnerable populations
where trust might be required to encourage participation. One clear concern with this
technique is that findings are not easily generalized to the target population, only to the
network studied (Shaghaghi et al. 2011).
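As a rough sketch of the mechanics, chain-referral recruitment can be thought of as a walk outward from the seeds through the referral network. The network, seed list, and referral limit below are invented purely for illustration and do not come from any study cited here.

```python
import random

# Hypothetical referral network: each recruited subject can name acquaintances
# who share the characteristic of interest (e.g., adoptive parents).
referrals = {
    "seed_A": ["p1", "p2"], "seed_B": ["p3"],
    "p1": ["p4", "p5"], "p2": [], "p3": ["p5", "p6"],
    "p4": [], "p5": ["p7"], "p6": [], "p7": [],
}

def snowball_sample(seeds, target_n, max_referrals=3, rng=random.Random(42)):
    """Follow referral chains outward from the seeds until target_n subjects are recruited."""
    sampled, queue = [], list(seeds)
    while queue and len(sampled) < target_n:
        person = queue.pop(0)
        if person in sampled:
            continue                      # do not interview the same person twice
        sampled.append(person)            # "interview" the recruit
        named = referrals.get(person, [])
        queue.extend(rng.sample(named, min(max_referrals, len(named))))
    return sampled

print(snowball_sample(["seed_A", "seed_B"], target_n=6))
```

Note that such a sample can only ever contain people reachable from the initial seeds, which is precisely the generalizability limitation noted above.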
Respondent-​driven sampling was developed by Heckathorn (1997, 2007, 2011) to
address the concerns of inference and generalizability caused by snowball sampling.
The technique is frequently employed for sampling hidden or stigmatized populations
such as illegal immigrants or illicit drugs users because it allows participants to mask or
protect the identities of their connections (Salganik 2012; Tourangeau 2014). Similar to
snowball sampling, RDS is a peer-​to-​peer recruitment technique, in which researchers
start with a few known subjects who are the study “seeds” (Heckathorn 1997, 179). The
seeds are then offered incentives to recruit contacts into the study; however, the number
of referrals is limited to minimize sample bias. There are two major differences between snowball sampling and RDS: whereas snowball sampling rewards subjects only for participation, RDS provides dual incentives, for participation as well as for recruitment; and participants in RDS can remain anonymous.2
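Because recruitment flows through social ties, RDS data are usually analyzed with estimators that adjust for each respondent's network size. One commonly used option, stated here only as a reference point rather than as the sole choice, is the Volz–Heckathorn estimator, which weights each respondent inversely to his or her reported degree:

\[
\hat{\mu}_{\mathrm{VH}} = \frac{\sum_{i=1}^{n} y_i / d_i}{\sum_{i=1}^{n} 1 / d_i},
\]

where \(y_i\) is the survey outcome for respondent \(i\), \(d_i\) is the number of people respondent \(i\) reports knowing in the target population, and \(n\) is the number of respondents.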

Hard-​to-​Sample Populations
A sampling method typically begins with the list or sample frame that includes all
members of the population. However, for some populations there is no complete or even
partial list from which to draw a sample because the true population may be unknown,
may be uncertain or wish to remain “hidden,” may be highly mobile and hard to locate
(Dawood 2008; Kish 1987, 1991; Sudman and Kalton 1986), or may simply be rare in the
total population or have rare characteristics of interest (Tourangeau 2014; Sudman 1972;
Kalton 2001). Examples of such groups are intravenous drug users and prostitutes, for
whom true numbers and identities are unknown, and who are somewhat hidden and
therefore difficult to find in the population; those with various infectious diseases who
may wish to hide their illness; the nomadic Roma (or Romani) in Europe, who are highly
mobile; people who visited Disneyland in the last year; Native Americans who relocated
from reservations to urban areas; people born in 1992 (or any particular year), who
make up a fairly small percentage of the total population; and people who are part of po-
litical protests such as Occupy Wall Street or Arab Spring events in Tunisia or Egypt. The
problem becomes even more complicated when the target population is out of the reach
of telephone or mail. At this point, standard survey sampling methods such as random
digit dialing (RDD) or address based sampling (ABS) become less useful, and face-​to-​
face interviews based on area probability sampling techniques, such as the one used in
this chapter, can be utilized to provide more flexibility and options for researchers.
Rare populations, or those representing a small fraction of the larger population
frame, pose two challenges that make them hard to sample (Kalton and Anderson
1986; Sudman, Sirken and Cowan 1988; Tourangeau 2014). First, rare populations may be concentrated much more heavily in certain geographic areas than in others, or their prevalence may vary widely across areas. In some cases this problem can be addressed by disproportionate stratification, or density sampling: oversampling strata where the rare population is relatively concentrated and undersampling areas where it is relatively sparse (Kalton 2001; Tourangeau 2014).3 This method, also known as disproportionate sampling, can be cost effective be-
cause it reduces the need for screening questions at the beginning of the survey in the
strata with higher concentrations of the rare population (Kalton 2001). This method
has also been used with filter questions to narrow respondents to the target popula-
tion more efficiently, such as with Latino/​Hispanic populations in the United States
(Brown 2015). There are trade-​offs to using this approach, and one possible negative
effect of this method is the introduction of coverage error, because not all members of
the population have a non-​zero probability of being included in the sample. A second
possible challenge in using this approach is that populations in particular areas, espe-
cially when they represent a larger share of the population, may display attitudes and
behaviors that are different than those who live outside of those geographic locations,
which could lead to bias in the data. For example, Latinos/​Hispanics who live in
areas with a high concentration of co-​ethnics have different opinions, attitudes, and
behaviors than Latinos/​Hispanics who live in areas with a lower number of co-​ethnics
(Garcia-​Bedolla 2005; Abrajano and Alvarez 2012). When researchers are applying a
density sampling design, these issues with heterogeneity need to be examined carefully;
however, post-​survey adjustment such as applying weights to the data can help to re-
solve some of these issues.
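A minimal sketch of the arithmetic, using invented strata rather than figures from any study cited here: strata where the rare group is denser are sampled at a higher rate, and base design weights (stratum population divided by stratum sample size) are retained so that weighted estimates still represent the full population.

```python
# Disproportionate (density) stratified sampling, illustrated with made-up strata.
# Each stratum: population size N, assumed share of the rare group, and sampling rate.
strata = {
    "high_density": {"N": 20_000,  "rare_share": 0.15, "sampling_rate": 1 / 50},
    "low_density":  {"N": 180_000, "rare_share": 0.01, "sampling_rate": 1 / 500},
}

for name, s in strata.items():
    n_sampled = round(s["N"] * s["sampling_rate"])        # interviews drawn in the stratum
    expected_rare = round(n_sampled * s["rare_share"])    # expected interviews with the rare group
    design_weight = s["N"] / n_sampled                    # base weight restoring population shares
    print(f"{name}: n = {n_sampled}, expected rare cases = {expected_rare}, "
          f"base weight = {design_weight:.0f}")
```

Oversampling the dense stratum yields many more interviews with the rare group per screening contact, at the price of larger weights, and hence more variance, for cases drawn from the sparse stratum.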
Second, when the distribution of the rare population is evenly spread or is unknown (Kalton
2001; Smith 2014; Tourangeau 2014), the cost of locating the population of interest tends
to be high and can quickly exceed actual interviewing budgets (Sudman, Sirken, and
Cowan 1988). The cost of obtaining responses from those populations is likely to be even
higher if one considers issues such as nonresponse rates and difficulties in accessing
geographic locations where the rare population resides (see Kalton 2009 for gaining
sampling efficiency with rare populations).
In addition to rare populations, “elusive populations” or “hidden populations” such
as the homeless, migrant workers, and street children present particular challenges for
sampling (Sudman, Sirken, and Cowan 1988; Neugebauer and Wittes 1994; Kalton 2001,
2009, 2014). These populations are often mobile and dynamic in nature, which means
that researchers need to pay special attention to how they change in size and composi-
tion over short periods of time and how such changes affect sampling and inference. In
these cases, standard clustered sampling methods cannot be employed.
Typically, for populations like these researchers rely on location sampling, space sam-
pling, or FBS to overcome such challenges. These sampling techniques involve sampling
places where the members of the elusive population are likely to be found rather than
sampling the members of the population directly (Kalton 2001; Shaghaghi et al. 2011;
Tourangeau 2014). Under these methods, sets of locations are identified “such that a
high proportion of the target population will visit one or more of these locations during
the data collection period” (Kalton 2014, 415). For example, Everest climbers may be
sampled at Everest Base Camp when they go there to prepare for climbing, and the
homeless are likely to visit soup kitchens or shelters. This method has also been used
for sampling “very rare” populations, such as street prostitutes in Los Angeles County
(Kanouse, Berry, and Duan 1999), where researchers sampled 164 streets known to have
previous prostitution activity, and in a study of men who have sex with men (although they do
not necessarily identify as homosexual) in locations such as “gay bars, bathhouses and
bookstores” (Kalton 2009, 137). A serious limitation of this approach is that equal access
to these locations is not guaranteed to the entire population of interest, and as Kalton
(2009) points out, it “fails to cover those who do not visit any of the specified locations in
the particular time period” (137). The approach also requires that researchers include a
way to account for repeat visitors to the sampled locations (Kalton 2014, 415), so without
careful record keeping, there is a risk of response bias due to replication in the sample.
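One simple bookkeeping device, sketched here with invented venue-visit data rather than figures from any study cited above, is to ask each intercepted respondent how many of the sampled location-days he or she attended during the data collection period and to weight responses inversely to that count, so that frequent visitors are not overrepresented.

```python
# Hypothetical time-location sample: each intercepted respondent reports how many
# of the sampled location-days they visited during the data collection period.
respondents = [
    {"id": "r1", "uses_shelter_services": True,  "visits": 1},
    {"id": "r2", "uses_shelter_services": False, "visits": 4},   # frequent visitor
    {"id": "r3", "uses_shelter_services": True,  "visits": 2},
]

for r in respondents:
    r["weight"] = 1.0 / r["visits"]       # more visits = more chances of being intercepted

total = sum(r["weight"] for r in respondents)
share = sum(r["weight"] for r in respondents if r["uses_shelter_services"]) / total
print(f"Weighted estimate of the share using shelter services: {share:.2f}")
```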

Hard-​to-​Locate Populations
In addition to populations that may be reluctant to be identified, there are also hard-​
to-​locate populations. Tourangeau (2014) identifies the following four types of mobile
populations that may be hard to locate:

a) Members of traditionally nomadic cultures (such as the Bedouins of Southwest Asia, the Tuareg of North Africa);
b) Itinerant minorities (such as the Romani (Roma) in Europe or the Travelers in
Ireland);
c) Persons who are temporarily mobile or displaced (recent immigrants, homeless
persons, refugees); and
d) Persons at a mobile stage in their life cycle (college students).

Populations that would also fit under the category of hard to locate but might not fit
into one of these groups are those who are forcibly moved from place to place (e.g.,
those involved in human trafficking, slaves) or who live in conditions where they may
not want to be discovered (e.g., prostitutes, criminals, terrorists, runaways). Trying to
locate these individuals can not only pose a challenge to researchers, but in some cases
can be quite risky.
In this chapter we are particularly interested in the third population group identified
by Tourangeau, which includes forced migrants. Forced migrants are constantly on
the move, making them hard to locate. Forced migrants typically fall into one of two
categories: (1) refugees, who leave their homes and cross an international border, or
(2) IDPs, who flee their homes but do not cross the border into another country. There
are a number of reasons that people flee from their homes, including both man-​made
and natural disasters, but it is extremely difficult to obtain accurate data on such forced
migrants because it is typically difficult to know where they go when they flee. A va-
riety of economic, political, and social factors may affect the choices individuals make
of where to go when they are forced to leave their homes. In some cases, such as with
Hurricane Katrina, governments may move IDPs to a temporary relocation destination
(or destinations), while others leave on their own, without government assistance or
direction, and choose where to relocate. In many cases, there is no official government
communication or direction about where IDPs should relocate. This may especially be
the case when there is conflict or violence involved and the government has little control
over or is the perpetrator of violence.
Among forced migrants, IDPs are one of the most mobile populations. According to the
Internal Displacement Monitoring Center (IDMC) of the Norwegian Refugee Council,
40.3 million people worldwide were internally displaced by conflict and violence at the
end of 2016 (IDMC 2017). Unlike refugees who cross borders and often live in refugee
camps, IDPs may be more difficult to locate because they are constantly on the move in
response to physical threats or the social stigma of being called domestic refugees or being
identified as “outsiders” in their own country.4 While people attached to a community
by marriage or employment are more likely to remain in one place and are more likely
to be found (Tourangeau 2014), people displaced by natural disaster or conflict are ex-
tremely difficult to find (Pennell et al. 2014; Mneimneh et al. 2014) and even more diffi-
cult to recontact, making it extremely unlikely that they are included in longitudinal or
panel surveys (Couper and Ofstedal 2009; Tourangeau 2014). Given these complications,
standard survey techniques often fail to capture a usable sample of IDPs. The next
section describes some techniques that can be used to overcome the challenges faced by
researchers in locating hard-​to-​find populations, such as IDPs, in greater depth.

Hard-​to-​Persuade Populations
Even when we are able to identify and reach vulnerable or stigmatized populations
of interest, respondents may still be unwilling to take part in surveys. The steady rise
in nonresponse rates is a clear indicator of the problem of getting people to agree
to take part in surveys. According to Tourangeau (2014), two factors are associated
with the resistance to surveys:  “the sense of busyness that seems to pervade con-
temporary life and falling levels of civic engagement” (see also Brick and Williams
2013). Several studies have been conducted to understand factors associated with low
or falling response rates in the United States. The main variables of interest in these
studies include availability of time or “busyness,” civic engagement, and voting and
volunteering behaviors. In general, people involved in civic engagement or voting
and volunteering are found to be more willing to participate in surveys (Abraham,
Maitland, and Bianchi 2006; Groves, Singer, and Corning 2000; Tourangeau, Groves,
and Redline 2010; Abraham, Helms, and Presser 2009). Social exchange theory
suggests that people traumatized by natural and/or man-made disasters also tend to
be more willing to participate in surveys to share their stories of hardship and view
surveys as an opportunity for a social interaction with strangers (Bradburn 2016;
Dillman, Smyth, and Christian 2009).
Research also finds that response rates may vary with the “specifics” of the survey.
Whether or not people decide to participate in a survey can depend on the respondents’
evaluation of the features of the survey, such as the “topic of the survey, the survey
sponsor, its length, or the incentives it offers” (Groves et al. 2000). A variety of techniques
have been suggested to increase response rates. These include making questionnaires
short, ensuring that the survey relates to the interests and stories of the respondents,
highlighting the value of the survey to society, and monetary incentives (Gideon 2012).
Monetary incentives are commonly used in two different ways. One is to send the re-
ward after the survey has been completed and returned to the researcher. This method
promises the respondents a reward in return for their time. The second method rewards
the respondent before the survey is completed. Here respondents are provided with
an incentive that is not contingent upon completion of the survey. Usually, surveys are
mailed to the respondents with a small sum of cash or a gift as a token of appreciation
for participation. The gift is intended to create a sense of obligation for the respondent to return the question-
naire (Gideon 2012).5

Hard-​to-​Interview Populations
The final group of hard-​to-​survey populations includes those that are hard to inter-
view. A variety of factors can make certain populations hard to interview. For example,
the population of interest may include those considered vulnerable populations, such
as children, prisoners, or people who engage in illegal activities; those who have psy-
chological problems; or those who may not speak the language in which the survey is
written and administered (Tourangeau 2014). Some respondents may not be able to
read the questionnaire even if they speak the language. Despite these difficulties, it is
possible to collect data on these populations. For example, children can be interviewed
with consent from parents, questionnaires can be translated into the language spoken
by respondents, and enumerators can orally administer surveys to gather data from
respondents who are unable to read.
Researchers have developed several metrics to overcome all of these difficulties. In the
United States, the U.S. Census Bureau uses a hard-​to-​count measure created by Bruce
and Robinson (2003) to identify and account for the reasons that people are missed in
surveys. The hard-​to-​count score is calculated using “twelve tract-​level variables known
to be associated with mail return rates in the 2000 Census” (Tourangeau 2014, 16).6 The
procedure was specifically developed to account for the underrepresentation of young
children in Census data, who are often not able to respond themselves or be interviewed
directly, but could be applied to a variety of vulnerable or at-​risk populations that are
hard to count or access.
While many of the sampling methods and metrics discussed here are useful for
addressing the issue of underreporting in developed countries, they often tend to be
ineffective in surveying hard-​to-​survey populations in less-​developed or developing
countries. The next section describes in greater depth the problems associated with
surveying forced migrants and methods used to overcome these challenges.

Lessons from a Survey on Forced Migration

The Nepal Forced Migration Survey (NFMS) sought to explore the problem of con-
flict-​induced displacement (see Adhikari 2011).7 Each year, millions of people around
the world are forced to make decisions on whether or not to abandon their homes due
to conflict. While a decision to stay indicates that choice is available even when life is
at risk, empirically this raises important questions about individual behavior during
conflict: Why do some people choose to stay while others choose to leave, and how
do those who choose to stay cope with conflict? Furthermore, only a fraction of those
who abandon their homes ever return, even after the conflict has ended. This raises
even more questions, including why some people do not return home after they are dis-
placed and under what conditions they are able to resettle in their new locations. This
study was motivated by puzzles such as these. One of the major challenges in this study,
and in most studies of displaced persons, was to identify and reach these populations
of individuals who never fled, those who fled and returned, and those who fled and
never returned. This called for a method that would enable us to reach the population of
interest.
Any study of flight behavior is incomplete without an understanding of why some
people choose not to flee. Therefore, conditional on an individual’s decision to leave or
stay, we want to know the different choices that are at the disposal of individuals and the
factors that contributed most in their making a particular choice. For those who stayed,
what choices did they make in coping with the conflict situation, and what conditions or
resources allowed them to make those choices? How were they different than those who
chose to flee? The survey was designed to help understand those factors that contribute
to an individual’s decision to leave or stay put in the face of violent conflict. In addition,
the survey sought to understand why some people do not want to return once they have
decided to flee, while others are willing to return. When identifying the population of
interest and designing a sampling frame, the research questions should be taken into
account. To address these interesting questions, we designed a method that coordinates
two samples, pairing individuals who decided to stay and those who decided to leave
within the same contextual environment. This is discussed in more detail in the next
section.
Scholars in the field acknowledge that it is extremely difficult to obtain accu-
rate data on forced migration caused by conflict (Crisp 1999), and this research
was no exception. Executing the survey with the population of IDPs posed formi-
dable challenges that required additional efforts and planning, and answering the
questions we posited required a multistage research design (described in detail in the
following section). This study reveals a number of interesting and useful lessons for
future research in the field. Despite the challenges one faces in enumerating the hard-​
to-​survey population of forced migrants, limited resources, and physical challenges
presented by rough terrain, one can reach respondents with careful planning and
thoughtful strategy.

The Maoist Insurgency in Nepal and Challenges of Identifying Forced Migrants

Beginning in 1996, Nepal went through a decade of Maoist insurgency, in which over
13,000 people were killed, thousands were displaced, and many more disappeared.
Similar to other conflicts, figures on displacement during the Maoist insurgency in
Nepal vary dramatically. Only one organization, the Informal Sector Service Center
(INSEC), a national human rights organization operating throughout Nepal since
1988, made a concerted effort to document and verify displacement figures. Their
work was conducted on a subnational, district-​by-​district basis. According to INSEC,
50,356 people were displaced from across the seventy-​five districts by the end of 2004.
There are strong reasons to believe that the data collected by INSEC are the most re-
liable and accurate. Because INSEC operates in all seventy-​five districts of Nepal, the
data collected by their district offices are more reliable than other national estimates.
In addition, INSEC was the only organization to collect data on displacements at the
level of the village development committee (VDC), the smallest administrative unit in
Nepal. Knowing the population estimate and distribution of displaced persons was im-
portant when determining the sampling frame, deciding on a sampling method, and
creating and distributing the survey instrument. Use of INSEC data as a sampling frame
is also consistent with the practice of using NGO data for conducting surveys in similar
circumstances (see Mneimneh et al. 2014).

Nepal Forced Migration Survey: Design and Implementation

As previously stated, the survey was intended to capture factors and conditions that
contributed to individual choices about whether to remain in a conflict area or to flee
to an uncertain destination. Questions in the survey focused on the perceived level of
violence and threat to one’s life, economic conditions, coping mechanisms, size and mo-
bility of the family, and additional demographic information. To answer these questions,
several important steps were taken to ensure adequate response rates and reliability of
the results.
First, once the survey questionnaire was created, an initial trip was made to Nepal to
pretest the survey instrument before administering it to the entire sample population.
Pretesting the survey instrument provided an opportunity not only to revise it, but also
to gauge the accessibility of the respondent population to the research team.8 This step
is incredibly important when surveying hard-​to-​survey populations, whether dealing
with populations that are hard to locate or hard to persuade. A pretest allows the re-
searcher to adjust the survey instrument as necessary to avoid depressed response rates
and to reassess the sampling frame and sampling method if necessary. While pretesting
instruments may pose a challenge, this step is especially important in surveys of hard-​
to-​locate populations because it is even less likely that researchers can easily recontact
and resurvey the respondents through traditional survey means such as telephone, mail,
and online surveys. If they are transient or temporarily relocated, it might be difficult to
find them again, so it is important to identify flaws in the survey as early as possible. If
researchers do not have the funds or ability to pretest the instrument among the pop-
ulation of interest, they could attempt to find a population or experts familiar with the
topic of interest through nonprofits, service providers, or community organizations to
get feedback prior to administering the survey.
Related to this, it is recommended that researchers become familiar with local
conditions such as geography, local language, and the cultural norms of the populations
they are researching or hire assistants or enumerators who are native to the area or very
familiar with the region; this might make it easier to navigate the terrain and locate
potential respondents. For example, familiarity with local geography in Nepal helped
researchers budget time and resources, as some of the sampling units (villages or wards)
took several days of trekking on foot to reach. One of the authors of this chapter is from
Nepal, which was an asset; however, sometimes researchers may have to use outside
resources to proofread and provide feedback on their survey instruments, help with
translation, act as guides in the region, or provide insights into local social factors that
may influence the ability to successfully contact participants.9 While collecting data for
the study, it was important to be sensitive to the wartime suffering and experiences of
the respondents, which is not critical for locating respondents but can be especially im-
portant for those who are difficult to persuade.
Finally, if traveling to a remote location or conducting interviews or distributing
surveys outdoors, researchers must be aware of and familiar with local environmental
factors and weather patterns, such as monsoon and harvesting seasons, especially when
conducting research in developing countries like Nepal. Most of the research for this
study was conducted during the summer months, and the monsoons arrived during
data collection. Most Nepalis are farmers and return to their fields with the onset of
the monsoon season. This made it extremely challenging to find respondents, further
straining already limited resources. In extreme cases, surveys were administered in the
fields, alongside farmers as they were planting their crops. If the authors had the oppor-
tunity to conduct another survey in Nepal or a similar country and were able to choose
the time, it would probably be in the winter, when villagers would be less busy. Despite
the challenges, we were able to successfully administer the survey and achieve an overall
response rate of 86%.

Sampling Frame and Method

The Nepal Forced Migration Survey was designed to test three different aspects of
conflict-​induced displacement. Accordingly, the survey was divided into three main
sections. In the first section, the questionnaire was designed to investigate in detail
the causal factors leading to internal displacement at the individual level. The second
section was devoted to explaining the choice of coping mechanisms individuals used for
staying behind. The final section focused on the IDPs themselves, with an emphasis on
understanding the factors that affected their choice to resettle, return home, or remain
in limbo. Again, the main objective of the survey was to study causes of displacement
during the Maoist insurgency in Nepal and the ability of individuals to cope with their
situations under conflict. Empirical findings from this research are published elsewhere
(see Adhikari 2012, 2013; Adhikari, Hansen, and Powers 2012; Adhikari and Samford
2013; Adhikari and Hansen 2013).
During the survey, individuals were asked about the violence they had experienced
and whether or not they had engaged in activities such as paying rent to the Maoists
to help them stay in their homes. Individuals who had experienced violence should be
more likely to flee their homes, while those who had not directly experienced violence
should be more likely to stay. The more often individuals paid rent to the Maoists, the
more likely they would be able to stay in their homes, so those with more economic re-
sources would be more likely to stay, rather than flee.10 Other coping activities included
joining the Maoists by changing party affiliation; participating in protests, rallies, or
other political activities organized by the Maoists; and joining a community organiza-
tion. Community organizations such as the community forest users’ group, mothers’
group, or small farmers’ development programs provided a mechanism for people to
come together, enabling them to cope with the difficulties of war.
Understanding causal factors leading to forced migration, choice of coping
mechanisms for those who stayed back, and factors associated with a decision to return
or resettle required that three types of populations be included in the sample: (1) those
displaced by conflict, (2) those never displaced, and (3) those who returned after dis-
placement. This knowledge influenced which sampling techniques were necessary to
capture a representative sample. The study sought to ensure that the sample represented
(a) districts that were hard hit during the conflict, (b) all three topographical regions,11
(c) all five economic development regions, and (d) both rural and urban parts of the
country, and included the three population types mentioned previously. The study
was conducted in two phases. The first phase captured rural interviews outside of the
capital of Kathmandu. The second phase captured urban displaced persons living in
Kathmandu. To define a sampling frame for the first phase, selection criteria were based
on secondary data provided by INSEC that included the number of people killed and
displaced from each district between 1996 and 2006 due to conflict. In the first stage, a
CCS technique was utilized, and all districts that had recorded at least 500 casualties or
500 displacements during the conflict were selected. A total of nineteen districts met
this threshold. Four of the five economic development regions contained exactly two
districts that met the threshold, and they varied topographically, so these eight districts
were chosen. The remaining districts were all located in the fifth region, the midwestern
region where the fighting originated and there was widespread displacement. One dis-
trict was randomly chosen from each of the three topographical regions located within
the midwestern region, which resulted in a total of eleven districts from which to sample.
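The first-stage selection rule is simple enough to state as a filter. The sketch below uses made-up district records purely to illustrate the threshold; it does not reproduce INSEC's actual figures.

```python
# Sketch of the first-stage district screen: a district qualifies with at least
# 500 recorded casualties or 500 recorded displacements during the conflict.
districts = [
    {"name": "District A", "killed": 640, "displaced": 1200},
    {"name": "District B", "killed": 120, "displaced": 300},
    {"name": "District C", "killed": 90,  "displaced": 510},
]
qualifying = [d["name"] for d in districts if d["killed"] >= 500 or d["displaced"] >= 500]
print(qualifying)  # -> ['District A', 'District C']
```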
The sample was drawn using a complex sample design. Because some people were no
longer able to be located at the address provided, we used sampling with replacement,
meaning we would randomly select another displaced person from within the same
VDC. Once a respondent was included in the survey, we then asked that person to pro-
vide the names of six additional displaced persons for us to survey (snowball sampling).
Households were selected from 226 sampling units, called wards, from across the eleven
districts. A weighted multistage cluster sampling technique was used to go from region,
to district, to VDC, to ward level, and then two samples were randomly drawn: one of
individuals at the ward level and another of displaced persons originating from those
wards. Use of wards as the sampling units had the advantage of offering a paired design
of individuals who decided to stay and those who decided to leave within the same con-
textual environment.
Given time and resource constraints, the total number of interviewees for the first
phase of the survey was set at 1,500 for the eleven districts, with a target of 1,000 displaced
and 500 nondisplaced persons. The number of displaced persons was further divided
into two groups: 500 interviewees who were still displaced and 500 interviewees who
had returned home. In each of the eleven districts, the target number of interviewees
was determined by the proportion of displaced persons identified by INSEC in each dis-
trict. Each district in Nepal is divided into VDCs, with each VDC further subdivided
into nine wards. Only VDCs with ten or more displaced persons were used in the sam-
pling of respondents. Out of several VDCs meeting this threshold in each district, five
were randomly selected, and the targeted number of respondents was determined by
the proportion of displaced persons in each of these VDCs. Next, the targeted number
of respondents from each of the five VDCs was randomly sampled from the wards in
proportion to the number of displaced in each ward. Displaced respondents, which
included both males and females, were randomly selected from a list maintained by
INSEC of all displaced persons originating from these wards. The 500 nondisplaced
respondents were selected from the same districts/​VDCs/​wards from which the dis-
placed originated, and systematic sampling was used, whereby interviewers surveyed
every third house in a sample area. Target numbers of nondisplaced from each ward
were based on the same proportions used for sampling the displaced.
The full data set gathered in this study consists of a national sample of 1,804 re-
spondent households from fifty-​six VDCs, representing the eleven districts in Nepal
included in the sampling frame for the first phase, plus displaced persons living in tem-
porary shelters in the capital of Kathmandu, which is not located in one of the eleven
districts in the sampling frame, but was home to a large number of displaced persons
after the insurgency and was the area of concentration in the second phase of the study.12
Table 8.1 lists the eleven districts included in phase one of the study, including information about the economic development region and topographic zone where each is located. The table also includes the target number and actual number of displaced respondents in the district, based on the proportion of displaced persons originating in each of the districts out of the total number of displaced persons identified in the eleven districts. For example, Rolpa had 1,817 displaced out of the total 17,386 displacements in the eleven districts, resulting in a target number of 105 (1,817/17,386 × 1000 = 105) displaced interviewees and 52 (1,817/17,386 × 500 = 52) nondisplaced interviewees. Rolpa is then further divided into the five randomly selected VDCs. Based on the proportion of actual displacement in each of the five VDCs, a target number of interviewees is given, along with the actual number of displaced persons interviewed and the number of nondisplaced persons interviewed.

Table 8.1 Eleven Districts Selected for Sampling with Target (and Actual) Number of Respondents Interviewed, by Topographic Zone and Economic Development Region

Mountains: Bajura, Far West: 84 (70); Kalikot, Midwest: 203 (218); Taplejung, East: 44 (50)
Hills: Rolpa, Midwest: 105 (96); Lamjung, Western: 49 (47); Ramechhap, Central: 73 (88)
Plains: Kailali, Far West: 118 (124); Bardiya, Midwest: 94 (108); Kapilbastu, Western: 152 (151); Chitwan, Central: 48 (43); Jhapa, East: 30 (17)

Note: The five VDCs sampled within Rolpa were Thawang, Kureli, Uwa, Mirul, and Bhawang.
There are a total of fifty-​one VDCs in Rolpa, from which five, with ten or more IDPs,
were selected; 363 people were displaced from these five VDCs, with 99 coming from
Thawang, 94 from Kureli, 85 from Uwa, 74 from Mirul, and 11 from Bhawang. The
targeted number of respondents from each of the five VDCs was determined by the pro-
portion of displaced persons in each of the VDCs (e.g., Thawang: 99/​363 × 105 = 28).
Next, the targeted number of respondents from each of the 5 VDCs was randomly
sampled from the wards in proportion to the number of displaced in each ward. These
numbers are shown in Table 8.2.

Table 8.2 Rolpa as an Example of the Sampling Process

VDC       Proportion of      Target      Interviewed   Response     Target          Interviewed     Response Rate
          Displacement       (IDPs)      (IDPs)        Rate (IDPs)  (Non-IDPs)      (Non-IDPs)      (Non-IDPs)
          in VDC*
Thawang   0.27               28          19            68%          22              28              127%
Kureli    0.26               27          37            137%         18              12              67%
Uwa       0.23               24          20            83%          15              11              73%
Mirul     0.20               21          15            71%          14              7               50%
Bhawang   0.03               3           5             167%         4               2               50%
Total     1.00               105         96            91%          73              60              82%

* These were the five randomly selected VDCs for the Rolpa district.
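The proportional allocation described above and summarized in Tables 8.1 and 8.2 reduces to simple arithmetic. The sketch below reproduces the Rolpa figures reported in the text purely as a check on the logic; the exact published targets may differ by a case or two depending on the rounding convention used.

```python
# Proportional allocation of interview targets, using figures reported in the text.

def allocate(counts, total_target):
    """Allocate a total interview target in proportion to displacement counts."""
    total = sum(counts.values())
    return {unit: round(total_target * c / total) for unit, c in counts.items()}

# District level: Rolpa's share of the 17,386 recorded displacements in the 11 districts.
rolpa, all_districts = 1_817, 17_386
print(round(1_000 * rolpa / all_districts))   # ~105 displaced interview targets for Rolpa
print(round(500 * rolpa / all_districts))     # ~52 nondisplaced interview targets for Rolpa

# VDC level within Rolpa: the five sampled VDCs and their displacement counts.
vdcs = {"Thawang": 99, "Kureli": 94, "Uwa": 85, "Mirul": 74, "Bhawang": 11}
print(allocate(vdcs, total_target=105))
# -> roughly {'Thawang': 29, 'Kureli': 27, 'Uwa': 25, 'Mirul': 21, 'Bhawang': 3}
```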

The target and actual numbers of interviewees differ somewhat for each VDC because INSEC's and the Nepali government's identification and documentation of displaced persons, as well as of people injured, killed, and disappeared, were still ongoing at the time the interviews were conducted, so the identification of conflict-induced displacement was uncertain. For example, while INSEC had information on the number of people displaced from each VDC, a complete list of names was not always available, and on occasion the randomly drawn subject could not be found when enumerators approached the identified households. While some of the randomly drawn subjects had moved to new places after being displaced or returned, others were simply hard to locate. Some returned individuals refused to participate when first approached by enumerators due to social stigma as well as from fear of reprisal.13 Of these, returnees
were the most difficult to locate, for two main reasons. First, because the survey was
conducted in the immediate aftermath of an insurgency, some people were not readily
willing to disclose their identity for fear of reprisal from the party that had displaced
them in the first place. Second, many people who had left their villages had started a job
or business in the city to which they fled. They were still in the process of moving back
into their old houses and were not available for interview when enumerators first vis-
ited them. Under these difficulties, along with time and resource constraints that some-
times prevented repeated attempts to interview subjects, the targeted number could
not always be reached. A combination of snowball sampling and RDS was utilized to
overcome these difficulties. Once a respondent was located, he or she was asked to pro-
vide contact information for six other people whom that person knew had been dis-
placed. Attempts were made to locate those individuals until the target was met. With
the overall district targets in mind, VDC targets were sometimes exceeded in villages
where the number of available displaced subjects appeared to exceed the original INSEC
figures. These are just a few of the types of challenges that can arise when attempting to
study hard-​to-​locate populations, and researchers should be prepared to be flexible and
come up with creative solutions while in the field.
While the overall Response Rate 6 (RR6) was 85.8%, it varied by district (AAPOR 2016).14 For example, the response rate was 100% in the districts of Bardiya and Ramechhap; 99% in Kalikot; over 80% in Taplejung and Kapilbastu; over 70% in Bajura and Lamjung; and over 60% in the districts of Chitwan, Jhapa, and Rolpa. The re-
sponse rate was lowest in the eastern district of Jhapa (60%). This area was one of the
least affected during the insurgency and residents possibly did not have the same level
of interest in social exchange compelling them to participate and share their stories.
For Rolpa, the district where the conflict began, the response rate was 69%. The survey
sought to ensure a fair representation of female respondents. Nepal is a patriarchal so-
ciety, and it can be quite challenging to interview women, especially in private. To over-
come this challenge, we included a female enumerator for all the districts covered in
the survey. Females constitute 40% of the total respondents, which is fairly high for a
developing country with patriarchal traditions. Female enumerators conducted 23% of
the surveys and interviewed around 10% of the female respondents. Even though they
conducted a small proportion of the total number of interviews with women, female
enumerators were required because some women are not allowed by their husbands to
talk to a male interviewer, a common cultural practice in the region.
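For readers unfamiliar with the AAPOR notation, Response Rate 6 counts partial interviews as respondents and assumes there are no cases of unknown eligibility; following the AAPOR Standard Definitions it can be written as

\[
\mathrm{RR6} = \frac{I + P}{(I + P) + (R + NC + O)},
\]

where \(I\) denotes complete interviews, \(P\) partial interviews, \(R\) refusals and break-offs, \(NC\) noncontacts, and \(O\) other eligible nonrespondents.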
Looking at the demographics of those displaced during the insurgency, we obtained
a reasonably representative sample using a multistage sampling design. According to
INSEC records, 8% of the people displaced during the insurgency came from the Eastern
Development Region, followed by 10% from the Central, 12% from the Western, 58%
from the Midwest, and 13% from the Far-​Western region of Nepal. In our sample, 7% of
the respondents were from the Eastern region, followed by 13% from the Central, 16%
from the Western, 44% from the Midwest, and 20% from the Far-​Western region. Our
sample approximates the distribution of the IDP population in other indicators as well,
such as topographic region and economic indicators. Overall, through careful attention
to details and a complex, multistage sample strategy, we were able to obtain good re-
sponse rates and secure representative data.

Discussion and Conclusion

The study in this chapter demonstrated a few lessons from the field in trying to survey
hard-​to-​locate populations. Based on our experiences and the lessons learned from
surveying hard-​to-​survey populations, we recommend some best practices. First, it is
important to pretest the survey instrument before it is actually administered to the full
sample population. This provides an opportunity not only to revise the survey instru-
ment, but also to gauge the accessibility of the respondent population.
It is helpful, and in some cases essential, for researchers to become familiar with local
conditions, such as geography, language, and culture, when dealing with hard-​to-​locate
populations. There may be societal norms that, if violated, could make it very difficult to
conduct interviews or survey respondents. In addition, when in areas where travel may
be difficult, becoming familiar with the region or working with those who are familiar with it will help
researchers stay on schedule and on budget.
As a researcher it is important to be sensitive to the experiences of vulnerable
populations, such as those who have experienced wartime suffering; have been victims
of sex trafficking; are living with a life-​threatening disease; or may have stigmatizing
professions, personal experiences, or characteristics. For example, when approaching
respondents in a country that has undergone decades of conflict, researchers must bear
in mind that these individuals have been interrogated several times by rebels as well as
the security forces and may be skeptical about why they are being interviewed again.
Working with vulnerable populations may require a high degree of perseverance and
patience on the part of the researcher.
Researchers may need to adopt a flexible approach and be prepared to adjust survey
instruments as well as research schedules while surveying hard-​to-​reach populations,
especially under changing political circumstances. Surveying or interviewing hard-​
to-​locate populations or hard-​to-​survey populations of any type may require a mixed
method approach to obtain samples that may be used to make meaningful inferences.
Surveying with replacement, convenience sampling, snowball sampling, location-based sampling, and geographic information system (GIS)-assisted sampling
techniques are just a few of the possibilities that could help researchers locate and access
the populations of interest. There is no way to determine which approach will produce
the highest level of compliance or garner the most representative sample, and successful
use of any of these approaches depends on knowledge about the target population.
Finally, researchers need to invest time in training enumerators. Quality training
will go a long way to ensure quality data. Whether they are undergraduates, grad-
uate students, or paid personnel, enumerators or interviewers should practice going
through the survey several times with various individuals before they are placed in the
field. Specific terms or ideas that are important to data collection but may not be used
frequently outside of a research environment should be discussed in detail and fully
explained to those fielding the survey. Depending on the location and conditions avail-
able, there are a variety of ways to collect surveys, and the use of portable technology
such as tablets is making it easy to collect and transfer responses directly into survey
software from the field as well as to use complex survey designs that could be quite diffi-
cult on paper (Benstead, Kao, et al. 2016; Bush and Prather 2016). Again, it is important
for researchers to be aware of the conditions in the area where they are conducting re-
search in regard to availability of electricity and the safety of researchers when sending
electronics into the field. In the case of Nepal, used in this study, there is sparse access to
electricity, there is no Internet connection available in most of the country, and sending
enumerators into the field with a tablet device would likely result in robbery, possibly
placing them in harm’s way. All of this should be considered when deciding the best
mechanism (electronic or paper) for collecting survey data.
This study demonstrates how sampling hard-​to-​survey populations in general can be
difficult. Members may be reluctant to participate, highly mobile, and rare or otherwise hard to locate
in the larger population, and sampling frames are often unavailable. This makes
standard probability-based sampling techniques inappropriate and difficult to use, as
well as extremely costly. Knowing something about the population from existing records
can be useful in narrowing down who should be included in the sample population and
help researchers determine the best sampling methods to use to acquire a representative
sample. Being thoughtful and careful in the research design and data collection process
can result in fruitful, quality data for even the most difficult populations of interest.
Many important questions and issues will need to be addressed in the future
involving hard-​to-​survey populations. Whether one is studying displaced persons
because of conflict or climate change, looking for people with rare experiences or
characteristics, or trying to answer questions about populations that are widely, non-​
systematically dispersed, this study reveals a number of interesting and useful lessons
for future research in the field. Researchers today are armed with a wide variety of
approaches to answer questions regarding hard-​to-​survey populations. It should be
noted that conditions can vary widely by location, and lessons learned from a partic-
ular location or in a particular context may not be useful for all contexts. The sampling
technique that is most appropriate and will provide the most representative results
is highly question dependent, but with careful planning, thoughtful strategy, good
research design, and some willingness to take creative approaches, one can reach
respondents and obtain quality results.

Acknowledgments
Funding for this research came from the U.S. National Science Foundation (SES-​0819494). We
thank Wendy L. Hansen and Lonna R. Atkeson for their support during the process of this
survey as well as the Informal Sector Service Center (INSEC) in Nepal for institutional support
during the fieldwork.

Notes
1. This section draws from categorizations laid out by Tourangeau (2014).
2. See Goodman (2011) for a comparison of snowball sampling and respondent-​driven
sampling.
3. For example, if a researcher were trying to capture Hmong in a survey of Asian Americans,
who are relatively rare in the United States, with a population estimate of only .08%
(280,000) of the total population, they would want to oversample Asians in Sacramento,
CA, Fresno, CA, and Minneapolis, MN, which are the primary areas in the United States
where Hmong reside (Pfeifer et al. 2012).
4. IDPs may also be more difficult to identify based on physical appearance for techniques
such as snowball sampling or place-​based sampling, because unlike refugees, who often
differ in appearance from the population in their new place of residence (e.g., Syrians
in Germany in 2015, Guatemalans in Canada in the 1980s), IDPs are still in their home
country and resemble the population at large.
5. Gideon also refers to social exchange theory as an explanation for an increase in response
rate resulting from this incentive method.
6. The score is calculated using the following percentages (all of these are taken from
Tourangeau 2014): (1) "Percent of dwelling units that were vacant"; (2) "Percent that were
not single-​family units”; (3) “Percent of occupied units that were occupied by renters”;
(4) “Percent of occupied units with more than 1.5 persons per room”; (5) “Percent of
households that were not husband/​wife families”; (6) “Percent of occupied units with
no telephone service”; (7) “Percent of persons below the poverty line”; (8) “Percent of
households getting public assistance”; (9) “Percent of persons over 16 who were unem-
ployed”; (10) “Percent of households where none of the adults (over 14) spoke English
well”; (11) “Percent of households that moved in the past year” and (12) “Percent of adults
without a high-​school education.” Each tract receives a score ranging between 0 and
11 “on each of these indicators, depending on which of twelve categories that tract fell
into for each variable" (p. 16). The overall hard-to-count scores range between 0 and 132
(see Tourangeau 2014 for details); a minimal scoring sketch appears after these notes.
7. This section draws heavily from Adhikari (2013) (with permission from the publisher).
8. The insurgency ended with the signing of a comprehensive peace accord (CPA) between
the rebels and the government in November 2006. The pretest was conducted in the fall of
2007 when the political climate was in flux. The rebel disarmament process was ongoing.
We were not sure if all the districts of the country were accessible given the destruction of
infrastructure such as roads, bridges, and airports. The pretest was aimed at assessing fea-
sibility of movement for researchers as well as willingness of respondents to participate in
the survey given the recently ended conflict.
9. For example, before surveying illicit drug users, a researcher may want to talk to drug
counselors about trigger issues or terms to avoid including in the survey.
10. This is an interesting finding and a good example of how contextual factors make certain
groups more or less socially vulnerable (Cutter et al. 2008). In many cases, such as envi-
ronmental disasters, an increase in economic resources might enable people to flee, as was
the case in New Orleans when Hurricane Katrina hit and the levees breeched (Brodie et al.
2005), but the opposite was true in Nepal.
11. Topographically, Nepal is divided into three regions: mountain, hill, and plain.
12. A total of 1,515 respondents from the eleven districts were surveyed during the summer
of 2008 in the first phase of the study, and a random sample of displaced persons living
temporarily in Kathmandu was surveyed during follow-​up fieldwork in fall 2008. The
Kathmandu sample consists of respondents from twenty-​nine districts (see Adhikari and
Hansen 2013).
13. The hard-​to-​locate respondents were replaced by others via the method described below.
In calculating response rate, the hard-​to-​locate respondents are treated as “Non-​contact
(2.20)” and those refusing as “Refusal and break-​off (2.10)” (AAPOR 2016).
14. All houses in the sample were contacted at least twice in order to get a response. People
in these regions are often farmers and are either home or in the fields, so the response
rates are much higher than if the survey were administered by more traditional means
(phone/​mail) or in primarily urban areas. We used the most conservative approach in
calculating response rate, treating “Interviews” with 50–​80% of all applicable questions
answered as “partial” (P) and more than 80% as “complete” (I). We use Response Rate 6
(RR6) in estimating the response rate because there are no “unknown cases” in our sample
(AAPOR 2016, 62).
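The response rate calculation described in the preceding note can be illustrated with a minimal sketch in Python. It assumes the AAPOR (2016) final dispositions have already been tallied into complete interviews (I), partial interviews (P), refusals (R), non-contacts (NC), and other non-interviews (O); the counts in the example are hypothetical, not the study's actual dispositions.

def aapor_rr6(complete, partial, refusal, non_contact, other):
    """AAPOR Response Rate 6: (I + P) / (I + P + R + NC + O).

    RR6 is appropriate only when there are no cases of unknown eligibility,
    as note 14 states was true of the Nepal sample. Interviews with more than
    80% of applicable questions answered are treated as complete (I); those
    with 50-80% answered are treated as partial (P).
    """
    numerator = complete + partial
    denominator = complete + partial + refusal + non_contact + other
    return numerator / denominator

# Hypothetical dispositions: 900 completes, 60 partials, 30 refusals,
# 40 non-contacts (e.g., hard-to-locate respondents), 5 other non-interviews.
print(round(aapor_rr6(900, 60, 30, 40, 5), 3))  # prints 0.928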
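The hard-to-count score described in note 6 can be sketched in a similar spirit. Tourangeau's (2014) exact category cutoffs are not reproduced in the note, so the sketch below simply bins each of the twelve tract-level indicators into twelve ordered groups scored 0 to 11 and sums them to a 0-132 total; the indicator names and data are hypothetical.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
indicators = [f"indicator_{i}" for i in range(1, 13)]  # the twelve tract measures
tracts = pd.DataFrame(rng.uniform(0, 100, size=(500, 12)), columns=indicators)

def hard_to_count_score(df, columns):
    """Sum of per-indicator scores (0-11), yielding a 0-132 total per tract."""
    scores = pd.DataFrame(index=df.index)
    for col in columns:
        # Rank-based binning into twelve categories; higher values score higher.
        scores[col] = pd.qcut(df[col].rank(method="first"), 12, labels=False)
    return scores.sum(axis=1)

tracts["htc_score"] = hard_to_count_score(tracts, indicators)
print(tracts["htc_score"].describe())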

References
Aaron, D. J., Y. F. Chang, N. Markovic, and R. E. LaPorte. 2003. “Estimating the Lesbian popu-
lation: A Capture-​Recapture Approach.” Journal of Epidemiology and Community Health 57
(3): 207–​209.
Abraham, K. G., S. Helms, and S. Presser. 2009. "How social processes distort measurement: the
impact of survey nonresponse on estimates of volunteer work in the United States.”
American Journal of Sociology 114 (4): 1129–​1165.
Abraham, K. G., A. Maitland, and S. M. Bianchi. 2006. “Nonresponse in the American Time
Use Survey: Who is missing from the data and how much does it matter." Public
Opinion Quarterly 70 (5): 676–​703.
Abrajano, M., and R. M. Alvarez. 2012. New Faces, New Voices:  The Hispanic Electorate in
America. Princeton, NJ.: Princeton University Press.
Adhikari, P. 2011. “Conflict-​Induced Displacement: Understanding the Causes of Flight.” PhD
diss., University of New Mexico. http://​repository.unm.edu/​handle/​1928/​13117.
Adhikari, P. 2012. “The Plight of the Forgotten Ones:  Civil War and Forced Migration.”
International Studies Quarterly 56 (3): 590–​606.
Adhikari, P. 2013. “Conflict-​Induced Displacement, Understanding the Causes of Flight.”
American Journal of Political Science 57 (1): 82–​89.
Adhikari, P. and W. L. Hansen. 2013. “Reparations and Reconciliation in the Aftermath of Civil
War.” Journal of Human Rights 12 (4): 423–​446.
Adhikari, P., W. L. Hansen, and K. L. Powers. 2012. “The Demand for Reparations: Grievance,
Risk and the Pursuit of Justice in Civil War Settlement.” Journal of Conflict Resolution 56
(2): 183–​205.
Adhikari, P., and S. Samford. 2013. “The Dynamics of the Maoist Insurgency in Nepal.” Studies
in Comparative International Development 48 (4): 457–​481.
American Association for Public Opinion Research (AAPOR). 2016. Standard Definitions: Final
Dispositions of Case Codes and Outcome Rates for Surveys. 9th edition. http://​www.aapor.
org/​AAPOR_​Main/​media/​publications/​Standard-​Definitions20169theditionfinal.pdf
Aronow, P. M., and F. W. Crawford. 2015. “Nonparametric Identification for Respondent-​
Driven Sampling.” Working Paper 106. Cornell University.
Atkinson, R., and J. Flint. 2001. “Accessing Hidden and Hard-​to-​Reach Populations: Snowball
Research Strategies.” Social Research Update 33 (1): 1–​4.
Avenilla, L. R. 2012. “Enumerating Persons Experiencing Homelessness in the 2010
Census:  Methodology for Conducting Service-​Based Enumeration.” In Proceedings of
the International Conference on Methods for Surveying and Enumerating Hard to Reach
Populations, October 31–​November 3, New Orleans, LA. http://​www.eventscribe.com/​2012/​
ASAH2R/​assets/​pdf/​49898.pdf.
Bankier, M. D. 1986. “Estimators Based on Several Stratified Samples with Applications to
Multiple Frame Surveys.” Journal of the American Statistical Association 81 (396): 1074–​1079.
Benstead, L. J., K. Kao, P. F. Landry, E. M. Lust, and D. Malouche. 2016. “Using Tablet
Computers to Implement Surveys in Challenging Environments.” Unpublished manuscript,
Portland State University, OR.
Biernacki, P., and D. Waldorf. 1981. “Snowball Sampling: Problems and Techniques of Chain
Referral Sampling.” Sociological Methods & Research 10 (2): 141–​163.
Bradburn, N. M. 2016. “Surveys as Social Interactions.” Journal of Survey Statistics and
Methodology 4 (1): 94–​109.
Brehm, John. 1993. The Phantom Respondents: Opinion Surveys and Political Representation.
Ann Arbor: University of Michigan Press.
Brick, J. M., and D. Williams. 2013. “Reason for Increasing Nonresponse in the U.S. Household
Surveys.” Annals of the American Academy of Political and Social Science 645: 36–​59.
Brodie, M., E. Weltzien, D. Altman, R. J. Blendon, and J. M. Benson. 2005. “Experiences of
Hurricane Katrina Evacuees in Houston Shelters:  Implications for Future Planning.”
American Journal of Public Health 96 (8): 1402–​1408.
Brown, A. 2015. “The Unique Challenges of Surveying U.S. Latinos.” Pew Research Center.
November 12. http://​www.pewresearch.org/​2015/​11/​12/​the-​unique-​challenges-​of-​
surveying-​u-​s-​latinos/​.
Browne, K. 2005. “Snowball Sampling: Using Social Networks to Research Non‐heterosexual
Women.” International Journal of Social Research Methodology 8 (1): 47–​60.
Bruce, A., and J. G. Robinson. 2003. Tract Level Planning Database with Census 2000 Data.
Washington, DC: US Census Bureau.
Bush, S., and L. Prather. 2016. “An Experiment on the Use of Electronic Devices to Collect
Survey Data.” Paper presented at Visions in Methodology Conference, May 16–​ 18,
Davis, CA.
Cohen, N., and T. Arieli. 2011. “Field Research in Conflict Environments:  Methodological
Challenges and Snowball Sampling.” Journal of Peace Research 48 (4): 423–​435.
Coleman, J. S. 1958. “Relational Analysis:  The Study of Social Organization with Survey
Methods.” Human Organization 17: 28–​36.
Couper, M. P., and M. B. Ofstedal. 2009. “Keeping in Contact with Mobile Sample Members.”
In Methodology of Longitudinal Surveys, edited by P. Lynn, 183–​203. Chichester, UK: John
Wiley & Sons.
Crisp, Jeff. 1999. “ ‘Who Has Counted the Refugees?’ UNHCR and the Politics of Numbers.”
New Issues in Refugee Research, Working Paper No. 12.
Cutter, S. L., L. Barnes, M. Berry, C. Burton, E. Evans, E. Tate, and J. Webb. 2008. "A place-based
model for understanding community resilience to natural disasters.” Global Environmental
Change 18: 598–​606.
Dawood, M. (2008). “Sampling rare populations.” Nurse Researcher 15 (4): 35–​41.
Dillman, D. A., J. D. Smyth, and L. M. Christian. 2009. Internet, mail, and mixed-​mode
surveys: The tailored design method (3rd ed.). New York, NY: John Wiley & Sons.
Durante, D. 2012. "Enumerating persons experiencing homelessness in the 2010
Census: identifying service-based and targeted on-shelter outdoor locations." In H2R/2012
Proceedings. Alexandria, VA: American Statistical Association.
Fisher, N., S. W. Turner, R. Pugh, and C. Taylor. 1994. “Estimating Numbers of Homeless and
Homeless Mentally Ill People in North East Westminster by Using Capture-​Recapture
Analysis.” British Medical Journal 308: 27–​30.
Florance, P. 2008. “The Use of Geospatial Technology to Survey Urban Internally Displaced
Persons.” Paper presented at the GDEST 2008 Conference. https://​2001-​2009.state.gov/​g/​
stas/​events/​110756.htm.
Fumagalli, L., and E. Sala. 2011. The Total Survey Error Paradigm and Pre-​election Polls: The Case
of the 2006 Italian General Elections. Institute for Social and Economic Research, University
of Essex: London.
Gabbard, S. M., and R. Mines. 1995. “Farm Worker Demographics: Pre-​IRCA and Post-​IRCA
Field Workers.” In Immigration Reform and U.S. Agriculutre 3358:  63–​72. University of
California, Division of Agriculture and Natural Resources, Oakland, CA.
Garcia-​Bedolla, L. 2005. Fluid Borders:  Latino Power, Identity and Politics in Los Angeles.
Berkeley: University of California Press.
Gideon, L. 2012. Handbook for Survey Methodology for the Social Sciences. New  York:
Springer.
Goel, S., and M. J. Salganik. 2010. “Assessing Respondent-​Driven Sampling.” Proceedings of the
National Academy of Sciences 107 (15): 6743–​6747.
Goodman, L. A. 2011. “Comment: On Respondent-​Driven Sampling and Snowball Sampling
in Hard-​to-​Reach Populations and Snowball Sampling in Not Hard-​to-​Reach Populations.”
Sociological Methodology 41 (1): 347–​353.
Groves, R. M., F. J. Fowler Jr., M. P. Couper, J. M. Lepkowski, E. Singer, and R. Tourangeau.
2009. Survey Methodology. Vol. 561, Wiley Series in Survey Methodology. New York: John
Wiley & Sons.
Groves, R. M., and L. Lyberg. 2010. “Total Survey Error:  Past, Present, and Future.” Public
Opinion Quarterly 74 (5): 849–​879.
Groves, R. M., E. Singer, and A. Corning. 2000. “Leverage-​salience theory of survey participa-
tion.” Public Opinion Quarterly 64 (3): 299–​308.
Hanson, G. H. 2006. “Illegal Migration from Mexico to the United States.” Journal of Economic
Literature 44 (4): 869–​924.
Heckathorn, D. D. 1997. “Respondent-​Driven Sampling:  A New Approach to the Study of
Hidden Populations.” Social Problems 44 (2): 174–​179.
Heckathorn, D. D. 2002. “Respondent-​ Driven Sampling II:  Deriving Valid Population
Estimates from Chain-​ Referral Samples of Hidden Populations.” Social Problems 49
(1): 11–​34.
Heckathorn, D. D. 2007. “Extensions of Respondent-​Driven Sampling: Analyzing Continuous
Variables and Controlling for Differential Recruitments.” In Sociological Methodology, ed-
ited by Y. Xie, 151–​207. Boston: Blackwell.
Heckathorn, D. D. 2011. "Comment: Snowball versus Respondent-Driven Sampling." Sociological Methodology 41 (1): 355–366.
Horrigan, M., Moore, W., Pedlow, S., and Wolter, K. 1999. “Undercoverage in a large national
screening survey for youths?” In Joint Statistical Meetings Proceedings, Survey Research
Methods Section. Alexandria, VA: American Statistical Association.
Informal Sector Service Center (INSEC). 2004. Nepal Human Rights Year Book.
Kathmandu: INSEC.
Internal Displacement Monitoring Center (IDMC). 2017. Global Report on Internal
Displacement 2017. Norwegian Refugee Council. http://​www.internal-​displacement.org/​
global-​report/​grid2017/​pdfs/​2017-​GRID.pdf
Kalton, G. 2001. “Practical Methods for Sampling Rare and Mobile Populations.” In Proceedings
of the Annual Meeting of the American Statistical Association, August 5–​9, 2001. http://​www.
websm.org/​uploadi/​editor/​1397549292Kalton_​2001_​Practical_​methods_​for_​sampling.pdf
Kalton, G. 2003. “Practical Methods for Sampling Rare and Mobile Populations.” Statistics in
Transition 6 (4): 495–​501.
Kalton, G. 2009. “Methods for Oversampling Rare Populations in Social Surveys.” Survey
Methodology 35 (2): 125–​141.
Kalton, G. 2014. “Probability Sampling Methods for Hard-​to-​Sample Populations.” In Hard-​to-​
Survey Populations, edited by R. Tourangeau, B. Edwards, T. P. Johnson, K. M. Wolter, and N.
Bates, 401–​423. Cambridge: Cambridge University Press.
Kalton, G., and D. W. Anderson. 1986. “Sampling Rare Populations.” Journal of the Royal
Statistical Society: Series A (General) 149 (1): 65–​82.
Kanouse, D. E., S. H. Berry, and N. Duan. 1999. “Drawing a Probability Sample of Female Street
Prostitutes in Los Angeles County.” Journal of Sex Research 36: 45–​51.
Kearns, B. 2012. “Down for the Count:  Overcoming the Census Bureau’s Neglect of the
Homeless.” Stanford Journal of Civil Rights and Civil Liberties 8: 155.
Kish, L. 1987. Statistical Designs for Research. New York: John Wiley & Sons.
Kish, L. 1991. “Taxonomy of Elusive Populations.” Journal of Official Statistics 7 (3): 339–​347.
LaPorte, R. E. 1994. “Assessing the Human Condition:  Capture-​ Recapture Techniques.”
BMJ: British Medical Journal 308: 5.
Lavrakas, P. J. Ed. 2008. Encyclopedia of Survey Research Methods. Thousand Oaks, CA: Sage
Publications.
Lohr, S. L., and J. N.  K. Rao. 2000. “Inference from Dual Frame Surveys.” Journal of the
American Statistical Association 95 (449): 271–​280.
Lohr, S. L., and J. N.  K Rao. 2006. “Estimation in Multiple-​Frame Surveys.” Journal of the
American Statistical Association 101 (475): 1019–​1030.
MacDonald, A. 2015. “Review of Selected Surveys of Refugee Populations, 2000–​2014.” Paper
written for the United Nations High Commissioner for Refugees. http://www.efta.int/sites/
default/​files/​documents/​statistics/​training/​Review%20of%20surveys%20of%20refugee%20
populations.pdf.
Magnani, R., K. Sabin, T. Saidel, and D. Heckathorn. 2005. “Review of Sampling Hard-​to-​
Reach and Hidden Populations for HIV Surveillance.” AIDS 19: S67–​S72.
Massey, D. 2014. “Challenges to surveying immigrants.” In Hard-​to-​Survey Populations,
edited by R. Tourangeau, B. Edwards, T. P. Johnson, K. M. Wolter, and N. Bates.
Cambridge: Cambridge University Press.
Mneimneh, Z. N., W. G. Axinn, D. Ghimire, K. L. Cibelli, and M. S. Alkaisy. 2014.
“Conducting Surveys in Areas of Armed Conflict.” In Hard-​to-​Survey Populations,
edited by R. Tourangeau, B. Edwards, T. P. Johnson, K. M. Wolter, and N. Bates, 134–156. Cambridge: Cambridge University Press.
Myatt, M., and D. E. Bennett. 2008. “A Novel Sequential Sampling Technique for the
Surveillance of Transmitted HIV Drug Resistance by Cross-​sectional Survey for Use in Low
Resource Settings.” Antiviral Therapy 13: 37–​48.
Neugebauer, R., and J. Wittes. 1994. "Annotation: Voluntary and involuntary capture-recapture samples—Problems in the estimation of hidden and elusive populations." American Journal of Public Health 84 (7): 1068–1069.
Newport, F. 2015. “Americans Greatly Overestimate Percent Gay, Lesbian in U.S.” Social Issues,
Gallup. May 21. http://​www.gallup.com/​poll/​183383/​americans-​greatly-​overestimate-​
percent-​gay-​lesbian.aspx.
Noy, C. 2008. “Sampling Knowledge: The Hermeneutics of Snowball Sampling in Qualitative
Research.” International Journal of Social Research Methodology 11 (4): 327–​344.
Pennell, B. E. D., Y. Eshmukh, J. Kelley, P. Maher, J. Wagner, and D. Tomlin. 2014. “Disaster
Research:  Surveying Displaced Populations.” In Hard-​to-​Survey Populations, ed-
ited by R. Tourangeau, B. Edwards, T. P. Johnson, K. M. Wolter, and N. Bates, 111–​133.
Cambridge: Cambridge University Press.
Pfeifer, M. E., Sullivan, J., Yang, K. and Yang, W. 2012. “Hmong Population and Demographic
Trends in the 2010 Census and 2010 American Community Survey.” Hmong Studies Journal
13 (2): 1–​31.
Platt, L., M. Wall, T. Rhodes, A. Judd, M. Hickman, L. G. Johnston, . . . A. Sarang. 2006. "Methods to Recruit Hard-to-Reach Groups: Comparing Two Chain Referral Sampling Methods of Recruiting Injecting Drug Users across Nine Studies in Russia and Estonia." Journal of Urban Health 83 (1): 39–53.
Rothbart, G. S., M. Fine, and S. Sudman. 1982. “On Finding and Interviewing the Needles
in the Haystack:  The Use of Multiplicity Sampling.” Public Opinion Quarterly 46
(3): 408–​421.
Salganik, M. J., and D. D. Heckathorn. 2004. “Sampling and Estimation in Hidden Populations
Using Respondent-​Driven Sampling.” Sociological Methodology 34 (1): 193–​240.
Salganik, M. J. 2012. “Commentary:  Respondent-​ Driven Sampling in the Real World.”
Epidemiology 23 (1): 148–​150.
Seber, G. A. F., and M. M. Salehi. 2012. Adaptive Sampling Designs: Inference for Sparse and
Clustered Populations. New York: Springer Science & Business Media.
Shaghaghi, A., Bhopal, R. S., and Sheikh A. 2011. “Approaches to Recruiting ‘Hard-​To-​Rearch’
Populations into Research:  A Review of Literature.” Health Promotion Perspectives 1
(2): 86–​94.
Smith, T. W. 2014. “Hard-​to-​Survey Populations in Comparative Perspective.” In Hard-​to-​
Survey Populations, edited by R. Tourangeau, B. Edwards, T. P. Johnson, K. M. Wolter, and N.
Bates, 21–​36. Cambridge: Cambridge University Press.
Stoker, L., and J. Bowers. 2002. “Designing Multi-​level Studies: Sampling Voters and Electoral
Contexts.” Electoral Studies 21 (2): 235–​267.
Sudman S. 1972. “On Sampling Very Rare Human Populations.” Journal of the American
Statistical Association 67: 335–​339.
Sudman S., and G. Kalton. 1986. “New Developments in the Sampling of Special Populations.”
Annual Review of Sociology 12: 401–​429.
Sudman, S., M. G. Sirken, and C. D. Cowan. 1988. “Sampling Rare and Elusive Populations.”
Science, n.s., 240 (4855): 991–​996.
Tatem, A. J., S. Adamo, N. Bharti, et al. 2012. “Mapping Populations at Risk: Improving Spatial
Demographic Data for Infectious Disease Modeling and Metric Derivation.” Population
Health Metrics 10: 8. doi: 10.1186/​1478-​7954-​10-​8
Thompson, S. K. 1997. “Adaptive Sampling in Behavioral Surveys.” NIDA Research Monographs
167: 296–​319.
Thompson, S. K., and G. A.  F. Seber. 1994. “Detectability in Conventional and Adaptive
Sampling.” Biometrics 50 (3): 712–​724.
Tourangeau, R. 2004. “Survey Research and Societal Change.” Annual Review of Psychology
55: 775–​801.
Tourangeau, R. 2014. “Defining Hard-​to-​Survey Populations.” In Hard-​to-​Survey Populations,
edited by R. Tourangeau, B. Edwards, T. P. Johnson, K. M. Wolter, and N. Bates, 3–​20.
Cambridge: Cambridge University Press.
Tourangeau, R., R. M. Groves, and C. D. Redline. 2010. "Sensitive topics and reluctant
respondents: demonstrating a link between nonresponse bias and measurement error."
Public Opinion Quarterly 74 (3): 413–​432.
Tourangeau, R., F. Kreuter, and S. Eckman. 2015. “Motivated Misreporting: Shaping Answers
to Reduce Survey Burden.” In Survey Measurements: Techniques, Data Quality and Sources of
Error, edited by Uwe Engel, 24–​41. Campus Verlag GmbH: Frankfurt-​on-​Main.
van der Heijden, P. G. M., L. de Vries, D. Böhning, and M. Cruyff. 2015. “Estimating the Size
of Hard-​to-​Reach Populations Using Capture-​Recapture Methodology, with a Discussion
of the International Labour Organization’s Global Estimate of Forced Labour.” In Forum
on Crime and Society:  Special Issue—​Researching Hidden Populations:  Approaches to and
Methodologies for Generating Data on Trafficking in Persons, New  York:  United Nations
Office on Drugs and Crime (UNODC), Vol 8: 109–​136. https://​www.unodc.org/​documents/​
data-​and-​analysis/​Forum/​Forum_​2015/​15-​00463_​forum_​ebook_​E.pdf
Volz, E., and D. D. Heckathorn. 2008. “Probability Based Estimation Theory for Respondent
Driven Sampling.” Journal of Official Statistics 24 (1): 79–​97.
Watters, J. K. and P. Biernacki. 1989. “Targeted Sampling: Options for the Study of Hidden
Populations.” Social Problems 36 (4): 416–​430.
Wejnert, C. 2009. “An Empirical Test of Respondent-​Driven Sampling:  Point Estimates,
Variance, Degree Measures, and Out-​of-​Equilibrium Data.” Sociological Methodology 39
(1): 73–​116.
Wejnert, C., and D. D. Heckathorn. 2008. “Web-​Based Network Sampling:  Efficiency and
Efficacy of Respondent-Driven Sampling for Online Research." Sociological Methods and
Research 37: 105–​134.
Welch, S. 1975. “Sampling by Referral in a Dispersed Population.” Public Opinion Quarterly 39
(2): 237–​245.
Chapter 9

Reaching Beyond Low-Hanging Fruit
Surveying Low-Incidence Populations

Justin A. Berry, Youssef Chouhoud, and Jane Junn

Introduction

An increasingly diverse U.S.  population presents survey researchers with new and
multifaceted challenges. Those seeking to map American attitudes and behaviors with
more precision and gradation can expect, for example, myriad difficulties attendant
on surveying groups that constitute a relatively small portion of the populace. Such
low-​incidence populations can be characterized by recency of immigration, foreign-​
language dominance, racial and ethnic minority status, and geographic idiosyncrasies
(i.e., whether the population of interest is relatively dispersed or concentrated in a given
location). Thus, many of the characteristics related to higher unit and item nonresponse
in polls are often more prevalent among these groups. Difficult as it is to identify and
survey low-​incidence populations, however, the descriptive and inferential findings
gleaned from these efforts add valuable nuances to general population trends, allow for
informative intra-​and intergroup comparisons, and elaborate subgroups of particular
political or theoretical importance.
This chapter outlines strategies for reaching beyond the “low-​hanging fruit” of
populations that are relatively easy to identify and survey. We argue for creative and
targeted strategies rather than a one-​ size-​
fits-​
all approach to capturing informa-
tion on low-​incidence populations, beginning with consideration of the character-
istics that make populations difficult to sample, interview, and analyze. To illuminate
our approach, we utilize three cases of low-​incidence populations in the United States
characterized by religion, race and ethnicity, and political behavior. We begin by
conceptualizing low-incidence populations and highlighting the existing empirical literature on these populations. We then turn our attention to framing the challenges of
polling low-​incidence populations, with an overview of sampling, contacting, and an-
alytical strategies. In this section we highlight the inherent trade-​offs of each approach
and point to the factors that have to be considered when determining which strategy
is best suited to particular research questions. Next we detail polling efforts designed
to capture attitudes and behaviors of three low-​incidence populations in the United
States:  (1) American Muslims, (2)  Asian Americans, and (3)  nonelected political
activists. We conclude with a discussion of fruitful polling practices for conducting re-
search on low-​incidence populations in the United States. Ultimately, we argue that the
approach to polling these populations must be equally informed by the unique charac-
teristics of the target group and the analytical conclusions one seeks to draw.

Low-​Incidence Populations

Often referred to in the polling literature as "rare" or "special," low-incidence populations can be thought of as a subset of difficult-to-reach populations. By low incidence we mean
a group of individuals who share a common characteristic and make up a relatively
small proportion of the broader population. Although difficult-​to-​reach populations
may also have low incidence rates, these two traits are not necessarily intertwined. For
example, corporate CEOs constitute a low-​incidence population that is often difficult to
reach. Alternatively, young adults between eighteen and twenty-​nine years of age form
a large segment of the population, but they can be quite difficult to reach, and when
contacted are less likely to cooperate (De Leeuw et al. 2007; Curtin, Presser, and Singer
2005). Young adults are less likely to live in homes with landlines, reside in single-​unit
homes, remain in the same residence for an extended period of time, or be registered to
vote, all of which makes it less likely that they will be adequately covered in a sampling
frame that relies on data tied to these characteristics (Blumberg and Luke 2009).
Although empirical studies on low-​incidence populations often focus on racial or
ethnic minorities, this line of research also targets groups on the basis of, for example,
specific types of illness, military service, or socioeconomic status. Studies based on
samples of racial and ethnic low-​incidence populations have been done about American
Indians (Lavelle, Larsen, and Gundersen 2009), American Jews (Reed 1975; Lazerwitz
1978; Shor 2000), Afro-​Caribbean blacks (Greer 2013), young black females (Ericksen
1976), non-​English-​speaking Chinese (Elliott et al. 2012), and Cambodian immigrants
(Elliott et  al. 2009). In addition, researchers have compiled national samples of mi-
nority populations, including the Pilot Asian American Political Survey (Lien, Conway,
and Wong 2004), the National Asian American Survey (Wong et al. 2011), the National
Survey of Black Americans (Jackson and Gurin 1987, 1999; Jackson and Neighbors 1997),
the National Black Election Study (Jackson, Gurin, and Hatchett 1984; Tate 1997), the
National Politics Study (Jackson et al. 2004), the Latino National Survey (Fraga et al.
2006), and the Latino National Political Survey (De la Garza et  al. 1998). Multiple
studies have also analyzed groups who suffer from a rare illness (Czaja et al. 1998; Sirken,
Graubard, and McDaniel 1978), are at a greater risk of contracting an infectious disease
(Watters and Biernacki 1989), and other at-​risk populations (Biernacki and Waldorf
1981; O’Donnell et  al. 1976; Rossi et  al. 1987). Finally, research has investigated low-​
incidence populations on the basis of common military service (Rothbart, Fine, and
Sudman 1982), and membership in an elite circle (Rossi and Crain 1968).
Each of the aforementioned studies focuses on low-​incidence populations, but the
particular characteristics of each population vary considerably. Some of the important
differences include the extent to which the unifying rare characteristic is identifiable
to the researcher, whether the group is geographically concentrated or dispersed, the
level of preexisting research on the group, and finally the degree of uniformity among
its members. The unique characteristics of one’s population, coupled with the inferences
one seeks to draw, ought to inform a study’s approach to sampling, contacting, and
analyzing a target population. We identify three particular challenges to polling low-​
incidence populations and discuss each in turn below.

Sampling Low-Incidence Populations


One of the central challenges of sampling low-​incidence populations is identifying
and locating individuals who share the characteristics in question. Low-​incidence
populations are often not characterized by either an observable trait or one that is re-
corded in official records. In our discussion of cases of religious and behaviorally de-
fined groups below, we detail ways researchers have addressed the challenges of
identifying and locating the low-​incidence populations of American Muslims and po-
litical activists who do not serve in elective office. In these cases, a priori and robust
measures of religious affiliation and political engagement are not available in official
government data. Aside from certain historical spans when the U.S. racial taxonomy
included categories for Jews and “Hindus,” for example, religious affiliation has not been
officially enumerated for the U.S. population (see Nobles 2000; Hattam 2007). Similarly,
when interested in selecting a sample of politically active Americans, records of be-
havioral traits such as taking part in community-​based political events are not readily
available. In addition, and in partial contrast to religious affiliation (except for conver-
sion and the sometimes fluid designation of a largely religious identity; see, e.g., Pew
Research Center 2013), participatory behavior is a dynamic, moving target, changing
with context, environment, and time.
Even when characteristics of low-​incidence populations are observable and recorded,
for example in racial enumeration records, identifying and locating groups that match
a specific trait is complicated by geographic dispersion and heterogeneity within ra-
cial groups. Polling Asian Americans is a third case we examine in greater detail below.
While grouped together racially in the official U.S. taxonomy, Asian Americans are a
remarkably diverse set of people with a wide range of both immigrant trajectories and
sending countries. Asian immigration to the U.S. is relatively recent, a function of pre-​
1965 federal immigration policies barring new entrants to the United States from Asian
nations. As a result, Asian Americans today are a largely immigrant population, with
nearly eight in ten adults born abroad. Immigration from Asia has not been dominated
by a single nation, and Asian Americans come from a multitude of countries and speak
a wide variety of languages. While family names may be distinctively “Asian” for East
Asian nations such as China, Korea, and Japan, surnames for Asian Americans with co-
lonial histories such as Filipino Americans and South Asian Americans are more dif-
ficult to distinguish from Americans whose racial and ethnic backgrounds are Latino
or Arab American. The distinct surnames, coupled with the diversity of languages
spoken, pose significant challenges to researchers who wish to poll this low-​incidence
population.
Recognizing these inherent difficulties in locating and identifying low-​incidence
populations, researchers have utilized three approaches to sampling individuals in these
groups: (1) stratified designs, (2) list-​based selection, and (3) density strategies. We pro-
vide an overview of each of the sampling strategies, weigh the associated trade-​offs,
and highlight the factors to be considered when determining which approach is best
suited for research. As we further illustrate in the case studies following this section,
researchers must tailor their sampling strategies to the unique characteristics of their
target populations and the type of inferences they seek to draw.
Stratified sampling (sometimes referred to as “purposive” or “multistage”) is one
probability technique available to surveyors of low-​incidence groups. To construct a
stratified sample, researchers first must identify the characteristics by which they wish to
stratify along with the frequency at which these strata occur in the target populations,
and subsequently sample individuals within these strata at random until the preset
frequency is reached (Neyman 1934). Selection is often made based on demographic,
socioeconomic, or geographic traits. This approach enables researchers to address
the additional challenges associated with low incidence while still obtaining a repre-
sentative sample of the target population. In addition, if a stratified sample is chosen at
random, the researcher can better guard against potential selection threats. An addi-
tional benefit of this sampling strategy is that by setting a target sample size during the
design phase, researchers can better ensure that their sample is large enough for the type
of analysis they wish to conduct.
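As a rough illustration of the proportionate allocation just described, the sketch below draws a random sample within each stratum until that stratum's preset share of the total is reached. It assumes a hypothetical sampling frame with a column identifying each person's stratum; the column names and target size are illustrative, not drawn from any survey discussed in this chapter.

import pandas as pd

def stratified_sample(frame, stratum_col, n_total, seed=42):
    """Proportionate stratified sample: each stratum is sampled at random
    in proportion to its share of the frame until the preset total is met."""
    shares = frame[stratum_col].value_counts(normalize=True)
    pieces = []
    for stratum, share in shares.items():
        target = round(n_total * share)  # preset frequency for this stratum
        members = frame[frame[stratum_col] == stratum]
        # Rounding may leave the final total a few cases off the target.
        pieces.append(members.sample(n=min(target, len(members)), random_state=seed))
    return pd.concat(pieces)

# Hypothetical usage:
# frame = pd.read_csv("target_population_frame.csv")
# sample = stratified_sample(frame, stratum_col="stratum", n_total=1200)

In practice, the stratum shares would typically come from a benchmark source such as the Census rather than from the frame itself, which is exactly the requirement discussed below.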
While this approach has a number of advantages, it has significant drawbacks and
is not appropriate for many low-​incidence populations. First and foremost, this sam-
pling strategy can be costly, and the cost increases with the relative rarity of the low-​
incidence population. Researchers who lack sufficient financial resources are likely to
find the costs of building an adequate size sample prohibitive. For example, the prin-
cipal investigators of the 2008 National Asian American Survey attempted to utilize a
telephone interviewing strategy through random digit dialing and yielded a very small
number of successful contacts with Asian Americans from several thousand numbers.
The relatively low incidence of the Asian American population (5%) and the high rate of
English as a second language made this sampling strategy particularly inefficient.
Second, stratified sampling requires a “benchmark” survey, such as the U.S. Census,
to ensure that the size and diversity of the low-​incidence population is representative
of the target population. As previously discussed, low-​incidence populations are often
classified by a shared characteristic—​such as religion, immigration status, sexual pref-
erence, political activity, or illness—​that is not accurately recorded in government data.
Thus it may be difficult to ensure that one’s stratified sample accurately represents the
actual size and diversity of the target population.
Considering these drawbacks, stratified sampling may be better suited for intra-​as
opposed to intergroup analysis. If researchers seek only to ensure that their sample
includes a subsample that is reflective of the group's low incidence within the larger
population, stratified sampling may be an effective strategy. On the other hand, if in-
stead they seek to better understand the low-​incidence population itself, it may be best
to employ an alternative sampling strategy that increases the sample’s size and diversity.
Since individuals who are contacted often vary considerably from those who are diffi-
cult to contact—​and often in theoretically significant ways—​a small sample is unlikely
to be representative of the target population. Researchers who lack the necessary finan-
cial resources, are interested in a particularly rare population, or are seeking to con-
duct intergroup analysis are likely to find stratified random sampling ill-​suited for their
research.
Another approach to studying low-​incidence populations is the use of list sampling
(Green and Gerber 2006; Sudman and Kalton 1986; Link et al. 2008; Gentry et al. 2010;
Lazerwitz 1978; Brick, Williams, and Montaquila 2011). List sampling requires access to
a record that provides enough information to identify and contact eligible members of
the low-​incidence population. In essence this catalog, which may be a combination of
multiple lists, serves as a single sampling frame (Sirken 1972). Lists may be constructed
to serve a particular public function, for instance, voter registration (Green and Gerber
2006) or delivery of mail via the U.S. Postal Service (Brick et al. 2011; Link et al. 2008;
Gentry et al. 2010). Potential sampling frames of particular populations may also be
constructed by civic organizations, unions, special interest groups, or commercial firms
and may prove very useful for empirical work on low-​incidence populations (Wong
2006; Greer 2013; Lazerwitz 1978; Shor 2000). Finally, if a list of the broader population
includes information that enables one to identify eligible members of a low-​incidence
population, one may remove ineligible members and randomly sample individuals
who remain. While this approach still requires interviewers to screen respondents on
their initial contact, it nevertheless reduces the cost of screening and greatly increases
contact rates.
Researchers often incorporate multiple lists to increase the coverage of their sam-
pling frame (Kalton and Anderson 1986; Lohr and Rao 2006). One may also make use of
samples from preexisting surveys (Reed 1975; Sudman and Kalton 1986) or may incor-
porate lists with a known high frequency of low-​incidence population within a larger
representative sample of the broader population (Kalton and Anderson 1986). List sam-
pling can dramatically decrease the cost of sampling a low-​incidence population, while
at the same time enabling researchers to increase the size of their sample.
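A minimal sketch of the frame-construction step just described follows. It assumes several hypothetical lists that share a person identifier: the lists are combined, each person's list multiplicity is recorded (it matters for weighting, as discussed below), duplicates are dropped, and the sample is drawn from the screened frame. All column and variable names are illustrative.

import pandas as pd

def build_list_frame(lists, id_col="person_id"):
    """Combine multiple lists into a single sampling frame, recording how
    many lists each person appears on and dropping duplicate records."""
    combined = pd.concat(
        [df.assign(list_id=i) for i, df in enumerate(lists)], ignore_index=True
    )
    combined["n_lists"] = combined.groupby(id_col)["list_id"].transform("nunique")
    return combined.drop_duplicates(subset=id_col)

# Hypothetical usage with, say, a voter file and an organizational roster:
# frame = build_list_frame([voter_file, org_roster])
# eligible = frame[frame["eligible_flag"] == 1]      # screen before fielding
# sample = eligible.sample(n=1000, random_state=7)   # draw the sample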
An additional advantage of list sampling is that if eligible members of a group are identified prior to contact, researchers may design the survey protocol in a way to
maximize response rates. For instance, one may alter the description of the survey to
cater to the interests or assuage the concerns of particular populations. Research has
demonstrated that potential respondents are far more likely to respond if they have
confidence in the sponsor of the survey, perceive the survey topic to be salient and
relevant, or anticipate their participation in the survey will be rewarding and mean-
ingful (Groves, Singer, and Corning 2000; Groves et al. 2006). Furthermore, one may
match potential respondents with interviewers who share their characteristics or language, which can further increase response rates. List
samples provide researchers prior knowledge of the potential respondents, enabling
them to design the survey and method of data collection in a way that can maximize
the participation of the population they seek to analyze. The size of one’s sample and
the associated costs of polling are not merely a function of the number of potential
respondents one contacts, but also of the percentage of those who comply and com-
plete the survey. List sampling may provide researchers with a more effective way to
accomplish both.
While list sampling provides a cost-​efficient and practical way to construct a suf-
ficiently large sample of a low-​incidence population, it presents a number of meth-
odological trade-​offs. One of the drawbacks to the list-​sampling approach is that one
cannot be sure that the frame completely covers the population, possibly introducing
noncoverage bias. Second, there may be an increased risk that lists developed by civic
organizations or special interest groups do not meet the requirement that respondents
in a sample be independent of one another. This approach may result in over-​coverage,
meaning individuals have an unequal probability of being selected, making the con-
struction of robust sample weights particularly challenging. This problem may be
further compounded by the fact that multiple lists are often used to ensure broader
coverage. Third, if one constructs a list from individuals who were sampled in pre-
existing surveys, in addition to facing the challenge of duplicates, each individual
survey is likely to have distinct sampling protocols, again complicating the weighting
methodology. Finally, due to issues pertaining to privacy or commercial concerns,
organizations may not be willing to share lists or may only make them available at a
considerable cost.
A final sampling method researchers may employ is density sampling, which is also
referred to as “area” or “clustered” sampling (Waksberg 1978; Ericksen 1976; Hedges
1979; Waksberg, Judkins, and Massey 1997; Lien, Conway, and Wong 2004; Fraga et al.
2006; Blair and Czaja 1982). While low-​incidence populations are by their very defini-
tion small in size, they may also be concentrated within a particular geographic area.
This heavy concentration of a particular subgroup may be the result of segregation
and isolation or of self-selection. And thus, while the targeted group may have a low incidence within the broader population, it may have a high incidence within a more
narrowly restricted area. The density sampling approach seeks to take advantage of this
concentration to increase contact rates and consequently lower the greater cost typically
associated with surveying a low-​incidence population.
Density sampling is a multistage process that is similar to stratified sampling. As
previously discussed, stratified sampling begins by identifying characteristics that
researchers believe are important indicators of the outcomes they seek to measure. The
population is then divided into these strata and is sampled in a manner to reflect how the
broader population is stratified along these lines (Neyman 1934). In density sampling,
a researcher identifies particular geographic regions such as neighborhoods, census
blocks, metropolitan statistical areas, states, or larger regions that have a higher con-
centration of a low-​incidence population. Once these areas, or clusters, are identified—​
typically through the use of enumeration or previous reliable survey data—​researchers
may either randomly sample individuals from this primary sampling unit or further di-
vide the area into smaller strata and randomly sample at a lower level of observation
(Kalton and Anderson 1986; Hedges 1979; Waksberg 1978). If a low-​incidence popula-
tion is geographically concentrated within a defined area, density sampling can signifi-
cantly increase contact rates and consequently significantly reduce the associated costs
of polling. Furthermore, if the vast majority of the target population is located within
the area sampled, and the researchers have no a priori reason to suspect that those out-
side this defined area vary in theoretically significant ways, they may construct a sample
that is both representative and of sufficient size to conduct analysis.
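The two-stage logic described above can be sketched as follows. The sketch assumes a hypothetical table of geographic areas carrying an estimated incidence of the target group (taken, for example, from Census enumeration or a prior survey) and a hypothetical person-level frame tagged by area; all names and thresholds are illustrative.

import pandas as pd

def density_sample(persons, areas, incidence_threshold=0.10, n_per_area=50, seed=11):
    """Two-stage density sample: keep areas where the target group is
    concentrated, then draw individuals at random within each kept area."""
    # Stage 1: primary sampling units with a high estimated incidence.
    psus = areas.loc[areas["est_incidence"] >= incidence_threshold, "area_id"]

    # Stage 2: simple random sample of individuals within each selected area.
    in_psus = persons[persons["area_id"].isin(psus)]
    return (in_psus.groupby("area_id", group_keys=False)
                   .apply(lambda g: g.sample(n=min(n_per_area, len(g)),
                                             random_state=seed)))

# Hypothetical usage:
# sample = density_sample(person_frame, tract_table, incidence_threshold=0.15)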
As do all forms of sampling, density sampling has its drawbacks, and researchers
must determine if it is the most appropriate sampling approach for their research. First,
the increased efficacy of density sampling, as well as the researchers’ ability to con-
struct weights that properly adjust for disproportionate sampling, are dependent on the
researchers’ ability to accurately estimate the prevalence of low-​incidence populations
at the appropriate level of observation (Kalton and Anderson 1986). This requirement
may pose a significant hurdle because low-​incidence populations tend to be underrep-
resented in larger surveys. This problem is not necessarily mitigated through a reliance
on benchmark Census enumeration, because the unifying characteristic of the low-​
incidence population may not be recorded. Furthermore, given the Census’s infrequent
collection, it may not accurately represent the extant prevalence of a low-​incidence pop-
ulation within a given geographic region.
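To make the weighting point concrete, the sketch below computes base weights for a design that oversamples high-density areas: each respondent's weight is the inverse of the selection probability in his or her stratum, so the accuracy of the weights hinges on the prevalence estimates just discussed. The counts are hypothetical.

# Stratum: (estimated population size, number of completed interviews).
strata = {
    "high_density_tracts": (40_000, 800),   # oversampled
    "all_other_tracts": (960_000, 400),
}

# Base weight = inverse of the selection probability = population / sample size.
base_weights = {name: pop / n for name, (pop, n) in strata.items()}

print(base_weights)
# {'high_density_tracts': 50.0, 'all_other_tracts': 2400.0}
# Each high-density respondent represents 50 people; each respondent drawn
# from the rest of the population represents 2,400.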
An additional drawback to this sampling approach is that there is no assurance that
members of the subpopulation who live within these densely populated clusters do
not differ systematically from those who do not. Although low-​incidence populations
present the challenge of detection, they equally present the challenge of inference: To
what extent can the population of the sample be generalized to the subgroup as a whole
(Venette, Moon, and Hutchison 2002)? Consequently, the use of density sampling is not
well suited for all low-​incidence populations. It is a more effective means of polling if the
population surveyors seek to research is geographically clustered in a densely populated
area, and they do not have a priori reason to believe that members of the population in-
side the clustered areas vary significantly from those outside.
Surveying and Gaining Cooperation with Low-Incidence Populations

In addition to the challenges associated with sampling, researchers polling low-​
incidence populations face an additional hurdle. While it is a concern for any form
of survey research, gaining cooperation with individuals in relatively rare groups
comes with specific challenges. Precisely because of their relatively low frequency
in an overall population, and as a result of the more complex sampling strategies
undertaken, surveying and gaining cooperation with low-​incidence populations must
be approached with additional care. To this end, Groves and coauthors tested a theory
they refer to as "leverage-saliency theory" (Groves, Singer, and Corning 2000). They
hypothesize that during the screening phase of a survey, individuals will evaluate the
perceived costs and benefits of participating in the survey, which will impact their pro-
pensity to respond. For instance, potential respondents may assign differing levels of
benefits to participation due to perceived legitimacy of the sponsor, material incentives,
and the perceived saliency and importance of the topic, as well as potential costs (e.g.,
the length of the survey; cognitive or language-​based burden; questions that are deemed
to be embarrassing, invasive, or socially undesirable). Thus, one must tailor the design
of the study to maximize the perceived benefits and minimize the perceived costs of the
particular population one seeks to poll. This theory becomes particularly relevant for
researchers who seek to include the surveying of a low-​incidence population within a
study of the larger population. If the perceived benefits and/​or costs for members of
the low-​incidence population vary considerably from those for the larger population,
researchers may face a significant hurdle to maximizing the level of cooperation of
respondents. Our case studies of American Muslims and Asian Americans further illus-
trate this point.
One of the most common, and meaningful, ways in which the associated costs of a
survey may differ between the general population and smaller subpopulations is lan-
guage. A  respondent’s low level of English fluency may pose a significant hurdle to
completion of the survey, and a researcher’s failure to adequately account for this
difficulty may significantly reduce the representation of particular low-incidence
populations. A broad literature that has sought to identify the factors that contribute
to increased levels of unit nonresponse has identified the potential barriers that may
reduce a respondent’s propensity to cooperate. In addition to identifying the role that
perceived burden plays in reducing cooperation rates—​such as the length of a survey,
the level of knowledge that is required, or the risk of being forced to answer embar-
rassing questions—​scholars have also identified language as an important barrier to co-
operation (Groves, Presser, and Dipko 2004). If significant portions of a low-​incidence
population are systematically eliminated from a sample due to their inability to com-
plete the survey in the language in which it is offered—​such as recent immigrant groups
for whom English is their second language—​the resulting sample may not be repre-
sentative of the broader population, and nonresponse bias may result. Since nativity,
length of time in the United States, levels of education, and racial and ethnic identity
are correlated with both response propensity and many political outcomes of interest—​
for example, public opinion, voting, voter registration, civic engagement—​there is an
increased risk of nonresponse bias.
Furthermore, if steps are not taken to alter the selection mechanism—​in this case, a
match between the language of the potential respondent and the survey instrument—​
then neither oversampling nor back-​end statistical adjustments are likely to reduce the
level of bias. For instance, if non-​English-​speaking Asian Americans vary systematically
from English-​speaking Asian Americans, even if one constructs a sample that is pro-
portional to the size of the group in the broader population, the respondents within the
subgroup may not be representative of the subgroup as a whole. Failure to correct for
the selection mechanism will not only potentially bias population estimates, but also
prevent accurate subgroup analysis. Furthermore, statistical adjustments on the back
end will be difficult, because researchers will be unable to place greater weight on respondents with a low response propensity on the basis of language, as such respondents are likely to have been eliminated from the sample entirely.
Recognizing the significant challenges associated with a bilingual population in the
United States, researchers have increasingly conducted surveys in languages other than
English. However, while researchers have increased the number of surveys conducted
in Spanish, the population is increasingly multilingual with an ever-​growing number
of languages being spoken. According to the U.S. Census, the population of adults who
speak a language other than English at home increased from 13.8% in 1990 to 17.8% in
2000. If we extend our analysis to the entire population in the most recent Census (five
years or older), 20% of Americans speak a language other than English at home, and of
this population, 22.4% speak English either “not well” or “not at all” (2010 U.S. Census).
Asian Americans, who now make up the largest share of immigrants, account for
15% of those who speak a language other than English, but represent a higher per-
centage of those who speak English “not well” or “not at all.” Focusing on four of the
largest co-​ethnic Asian American groups, 29.6% of Chinese, 28.4% of Koreans, 33.1% of
Vietnamese, and 15.1% of Japanese Americans speak English “not well” or “not at all.” In
addition to problems associated with nonresponse, the inclusion of respondents who
complete the survey in a language in which they are not fully proficient may increase
measurement error that may similarly bias results. For these reasons, it is essential that
effective protocols be established to ensure that both questionnaires and surveyors are
reflective of the target population of the study. While translation is both a costly and
an arduous process, it is likely to reduce total survey error by increasing both contact
and cooperation rates and reducing the degree of measurement error. Strategies that
have been implemented to increase response rates, for instance advance letters or
prescreening phone messages, will be ineffective if they do not reflect the diverse lan-
guages of the target population.
In an effort to combat these challenges, surveys that focus on low-​incidence
populations, as well as larger surveys seeking a nationally representative sample,
typically have translators available in call centers. However, fielding a bilingual or
multilingual poll can be both challenging and costly. Matching potential respondents
with the correct foreign-​language interviewer and conducting the survey with a
translated instrument is a more costly and difficult process in multiple languages than
it is when the survey is done only in English and Spanish. If the languages spoken
by the translators do not represent the diversity of languages spoken by the survey population, translation alone may not eliminate the potential for nonresponse bias. Furthermore, if
screening calls are still conducted in English, there is an increased risk that the poten-
tial respondent may terminate the interview before the interviewer is able to match
the respondent with an interviewer who can conduct the survey in the appropriate
language. While the percentage of respondents who are lost during the transition
to the translator, and the associated bias that transition may induce, are unknown,
evidence in an analogous area suggests it may pose a problem. More specifically, a
significant number of interviews are terminated when respondents are transferred from a live interviewer to the interview and data collection system
(Tourangeau, Groves, and Redline 2010).
While maximizing response rates with an unbiased sample of respondents is ideal,
it is not always possible within the constraints of budget and time. When conducting
a baseline study of a set of behaviors or attitudes of the population in question (such
as the National Asian American Survey, discussed below) for which earlier systematic
data are not available, incurring the time and expense of maximum coverage of a low-​
incidence population is imperative. Subsequent studies and other efforts, however, can
utilize baseline studies conducted with maximum-​coverage designs to provide some
measure of comparison when full coverage of a low-​incidence population is not feasible.
Nevertheless, comparisons to baseline data should be conducted with care given the dy-
namism of low-​incidence populations such as Asian Americans.

Drawing Inferences from Data Collected from Low-Incidence Populations
After clearing the multiple hurdles associated with sampling and surveying low-​
incidence populations, researchers face additional challenges in analyzing the data. On
the back end of a survey, a survey methodologist may take additional steps to adjust
for the potential of nonresponse bias. The distinct statistical strategies are informed
by the logic of response propensity. One commonly used statistical strategy is post-​
stratification. Akin to stratified sampling, in this strategy researchers attempt to make
the sample more representative of the target population. Researchers identify charac-
teristics they believe are likely to correlate with the outcome measurements of interest—​
typically demographic, socioeconomic, or geographic in nature—​and make statistical
adjustments so that the sample matches the characteristics of a "benchmark survey,"
such as the U.S. Census, or those of the sampling frame (Brehm 1993; Olson 2006). These
adjustments are typically accomplished by increasing the weight of responses from
individuals whose characteristics match those of a subgroup population that responded
at lower rates than their population proportion.
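A minimal sketch of this calculation, using an invented stratum (interview language) and invented benchmark shares rather than figures from any study discussed here, divides each stratum's population share by its share in the realized sample:

```python
import pandas as pd

# Hypothetical respondent-level data: one row per completed interview,
# with the stratum recorded for each respondent (interview language here).
sample = pd.DataFrame({
    "resp_id": range(1, 11),
    "stratum": ["english"] * 8 + ["non_english"] * 2,
})

# Invented benchmark shares for the target population (e.g., from the Census).
benchmark = {"english": 0.70, "non_english": 0.30}

# Share of each stratum actually observed in the realized sample.
observed = sample["stratum"].value_counts(normalize=True)

# Post-stratification weight: population share divided by sample share, so
# the underrepresented stratum (non-English speakers) is weighted up.
sample["weight"] = sample["stratum"].map(lambda s: benchmark[s] / observed[s])

print(sample.groupby("stratum")["weight"].first())
```

As emphasized above, however, a stratum that never entered the sample at all (for example, speakers of a language the survey did not offer) receives no such weight and cannot be recovered this way.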
Another strategy employed by survey researchers is the use of propensity score
weights (Groves et  al. 2006; Peytchev, Peytcheva, and Groves 2010; Groves 2006;
Olson 2006; Tourangeau, Groves, and Redline 2010). This back-​end adjustment tech-
nique is analogous to post-​stratification. However, rather than matching respondents
to the general population along theoretically important strata, one is attempting to
match respondents in the sample to nonrespondents based on their shared low pro-
pensity to respond. In employing propensity scores one is attempting to limit poten-
tial nonresponse bias by modeling the response process. If researchers can identify the
particular predispositions that increase or decrease an individual’s propensity, they can
assign every respondent within their sample a propensity score ranging from 0 to 1. If
the propensity scores are accurate, the survey researchers can place greater weight on
respondents who have a relatively low propensity to respond. By modeling response
propensity, researchers can adjust weights to account for unequal selection rates, as well
as unequal response rates that may bias their results.
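The same logic can be sketched in code. The example below is generic and illustrative only: simulated data, an arbitrary logistic specification, and simple inverse-propensity weights, not the procedure of any study cited here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Simulated frame-level data: one row per sampled case, with covariates
# known for respondents and nonrespondents alike.
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "age": rng.integers(18, 80, 500),
    "non_english": rng.integers(0, 2, 500),
    "urban": rng.integers(0, 2, 500),
})

# Simulated response indicator in which non-English speakers respond less.
true_p = 1 / (1 + np.exp(-(0.5 + 0.01 * frame["age"] - 1.2 * frame["non_english"])))
frame["responded"] = rng.binomial(1, true_p)

# Model the propensity to respond from the covariates available on the frame.
covariates = ["age", "non_english", "urban"]
model = LogisticRegression().fit(frame[covariates], frame["responded"])
frame["propensity"] = model.predict_proba(frame[covariates])[:, 1]

# Respondents with a low estimated propensity receive larger weights.
respondents = frame[frame["responded"] == 1].copy()
respondents["weight"] = 1 / respondents["propensity"]
print(respondents[["non_english", "weight"]].groupby("non_english").mean())
```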
Nevertheless, the effectiveness of propensity scores, like that of post-​stratification,
depends on survey researchers’ knowledge of which characteristics best explain re-
sponse propensity. Thus, postsurvey adjustments are based on assumptions about the
relationship between response propensity and the survey estimate in question. Some
survey methodologists argue that nonresponse is a complex and interactive process that
includes a multitude of individual-​and survey-​level characteristics, which are likely to
vary across distinct survey items, and thus caution against overreliance on back-​end sta-
tistical adjustments (Brehm 1993; Brick 2011; Olson 2006; Groves et al. 2006).
A similar technique involves the identification of “high-​effort cases” (Curtin, Presser,
and Singer 2005; Keeter et al. 2000; Keeter et al. 2006; Teitler, Reichman, and Sprachman
2003; Stinchcombe, Jones, and Sheatsley 1981; Sakshaug, Yan, and Tourangeau 2010).
The theory is that if researchers identify respondents in the sample who required mul-
tiple attempts to contact and/​or were initially unwilling to cooperate, those people
can serve as proxies for those who did not respond. If these high-​effort cases share no
unifying characteristic, then nonresponse may be random, thereby minimizing the
threat of bias from this source. On the other hand, if they do share a unifying character-
istic, researchers can account for it.
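One simple way to operationalize this check, assuming call-record paradata with a count of contact attempts per completed case (a hypothetical layout used only for illustration), is to compare high-effort and low-effort respondents on key characteristics and outcomes:

```python
import pandas as pd

# Hypothetical respondent file with call-record paradata.
df = pd.DataFrame({
    "attempts":     [1, 1, 2, 1, 6, 7, 1, 2, 8, 1],
    "foreign_born": [0, 0, 1, 0, 1, 1, 0, 0, 1, 0],
    "voted":        [1, 0, 1, 1, 0, 0, 1, 1, 0, 1],
})

# Treat cases that required five or more contact attempts as "high-effort"
# proxies for the nonrespondents the survey never reached.
df["high_effort"] = df["attempts"] >= 5

# Large differences between the two groups on key characteristics or outcomes
# suggest that nonresponse is not ignorable for those measures.
print(df.groupby("high_effort")[["foreign_born", "voted"]].mean())
```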
However, a body of research suggests that high-effort cases do not resemble
nonresponders along key demographic lines and thus may not be effective in correcting
for nonresponse bias (Brick 2011; Lin and Schaeffer 1995; Teitler, Reichman, and
Sprachman 2003; Olson 2006; Groves and Couper 1998). These concerns have led
survey researchers to suggest that bias resulting from nonresponse may be more suc-
cessfully dealt with at the design phase and during the process of data collection (Olson
2006; Brehm 1993; Brick 2011; Sakshaug, Yan, and Tourangeau 2010; De Leeuw et al.
2007). Low-​incidence populations, due to a combination of their low prevalence and the
many challenges of researching them, are typically underrepresented in the empirical
literature. As a result, we often lack the empirical data and associated theory required to
make accurate post-data collection adjustments. It is for this reason that we argue that such back-end adjustments are unlikely to overcome deficiencies in the survey design and the protocol employed during the phase of data collection.
Taken together, these three challenges facing researchers interested in low-​incidence
populations—​drawing a sample, gaining cooperation, and analyzing the data—​present
a high bar indeed for successful polling. At the same time, the benefits of gathering sys-
tematic and high-​quality data for low-​incidence populations are well worth the effort.
In the next section we detail attempts by researchers to survey three specific types of
low-​incidence populations: the religious group of Muslim Americans, the racial and
ethnic group of Asian Americans, and a behaviorally distinguished group of polit-
ical activists who do not hold official elective office. Our discussion of these three case
studies is meant to serve as a reference for those who have a substantive interest in these
three groups, as well as to elucidate the various factors one must consider when de-
signing survey research on low-​incidence populations. The combination of strategies
undertaken to sample, contact, and poll members of low-​incidence groups must equally
reflect the unique characteristics of the groups, the resources at the researchers’ disposal,
and the type of inferences they seek to draw. Each low-​incidence population presents
unique challenges and opportunities, and researchers must tailor their survey research
accordingly.

Successful Studies of Low-Incidence Populations

Although there are standard difficulties that come with designing and implementing
a survey targeting low-​incidence groups, each population will naturally pose its own
unique challenges. Researchers therefore often require a priori knowledge of the pop-
ulation of interest to anticipate any sampling or analytical hurdles they will need to
clear, or at least sidestep. Yet, this universal prescription varies in its ease of applicability.
Surveyors of low-​incidence populations must therefore implement a tailored approach
that accounts for the trade-​offs accompanying key strategic decisions (Dillman, Smyth,
and Christian 2014). The following assessment of efforts to sample low-​incidence groups
begins with the relatively rare population of American Muslims.

American Muslims
Researchers compiling original data on American Muslim attitudes and behaviors face
difficult decisions when formulating a sampling strategy. Like all surveyors, they aim to
minimize survey error while working within time and financial constraints. However,
the calculus undergirding these core considerations can shift dramatically when
targeting low-​incidence groups. In seeking a national probability sample, for instance,
the total number of contacts needed to secure an adequate number of respondents can
quickly grow as incidence rate decreases, putting a substantial strain on resources. Yet as
principal investigators move away from an ideal research design to relieve cost burdens,
the already elevated risks of sampling error and myriad biases can become too difficult
to manage or fully account for. Striking a balance between these competing interests is
made all the more challenging in light of complications particular to American Muslims.
Beyond a very low incidence rate,1 researchers face additional legal, demographic,
and social challenges in surveying American Muslims. The chief obstacle to sampling
this community stems from a lack of any official data, as both the Census Bureau and
Immigration and Naturalization Service are legally barred from compiling statistics on
religious affiliation. This limitation naturally puts those researching any religious group
at a disadvantage compared to those surveying ethnic or racial communities, for whom
population data are readily available. Yet American Muslims’ linguistic and ethnic diver-
sity makes sampling them even more complex than, say, American Jews or Mormons.
Indeed, the most reliable estimates peg Muslims as the most ethnically and racially di-
verse religious minority in America (Pew Research Center 2007, 2011; Gallup 2009),
with a linguistic variety perhaps rivaled only by Asian Americans (Junn et al. 2011).
Moreover, while 80% of American Jews reside in five states (Pew Research Center
2013), and over one-​half of Mormons live in the Mountain West (Pew Research Center
2012a), Muslim congregations are found across all major geographic regions, with Islam
constituting the second largest religion in as many as twenty states (Grammich et al.
2012). There are certainly areas that are home to comparatively large Muslim populations
(the New York metro region; Dearborn, Michigan; and Southern California, to name a
few); on the whole, however, this community is not nearly as geographically concen-
trated as other religious minority groups. Such multidimensional heterogeneity and
wide distribution means that even well-​resourced surveys of American Muslims will
face acute design trade-​offs. Some of these bargains, whether made consciously or ac-
cepted ex post, are highlighted below.
The most comprehensive studies on American Muslims to date are those carried
out by Pew (2007, 2011) and Gallup (2009, 2011)—​though the two organizations took
notably distinct methodological tacks. More specifically, Pew’s (2007) original study
used two sampling frames—​a random digit dial (RDD) sample gleaned from geo-
graphic and list strata, which was coupled with a recontact frame drawn from Muslim
respondents to previous national surveys—​to produce the first national probability
sample of American Muslims. Its representativeness, on the one hand, is bolstered by
the interviews being conducted in four different languages (English, Arabic, Urdu, and
Farsi), but on the other hand, is somewhat undermined by the RDD frame not including
a cell phone component despite estimates at the time that 13.5% of U.S. households with
telephones were wireless only (Blumberg and Luke 2009).2
The focus of the Pew (2007) study was likewise a double-​edged sword. More specifi-
cally, concentrating the survey on Muslims in America allowed the researchers to field
a questionnaire partially tailored to this community. That is, in addition to obtaining
data on standard demographics, civic participation, political affiliation, and so forth,
the survey also asked about respondents’ experience with discrimination following the
September 11, 2001, attacks; belief in official accounts of this terrorist plot; matters of
religiosity particular to Islam; and other such issues that are especially informative for
this population. Yet this greater flexibility in questionnaire design is mitigated by the
heightened sensitivity that respondents may have had to the questions and the conse-
quent threat such a posture poses to measurement validity. In turn, Pew researchers
took several steps to preemptively alleviate nonresponse and social-​desirability bias.
These tactics included extensive pretesting of the questionnaire and an uncommon dis-
closure of the study’s focus early in each interview. This latter decision, however, poten-
tially traded one form of bias for another, further emphasizing the calibrations that belie
a one-​size-​fits-​all model for surveying low-​incidence groups.
Gallup’s (2009, 2011) survey methodology differed from Pew’s (2007, 2011) in several
key respects. Rather than targeting American Muslims, Gallup selected self-​identified
Muslim respondents from the Gallup Poll Daily survey, which tracks a general sample
of American households. That is, rather than an oversample, Gallup aggregated the
responses of 946 Muslims drawn from a database of nearly 320,000 adults across the
United States. One of the more significant analytical advantages of this strategy is the
ability to organically compare the opinions of American Muslims to other segments
of the broader public, given the identical questionnaires and prompts used across all
respondents. In addition, the extensive coverage of this technique is reinforced through
a dual-​mode RDD frame that included both landline and cellular numbers. While this
methodology may have produced the “first-​ever nationwide representative random
sample of Muslim Americans” (Gallup 2009, 16), there were nonetheless several limita-
tions inherent in the design.
Given that the Gallup Poll Daily targets a general population, the survey in turn
had to be generally applicable. As such, many questions specific or more relevant to an
American Muslim sample—​arguably the very questions that scholars and policymakers
most often seek answers to—​were not included in the questionnaire. This broad scope
also meant that there was no incentive to offer interviews in languages other than
English and Spanish, which is especially problematic given that Arabic, Urdu, and Farsi
interviews constituted 17% of Pew’s (2007) sample. Certainly, however, a survey that
does not specifically focus on American Muslim opinions may increase the response
rate among this wary population. Yet a high response rate in itself does not guard against
nonresponse bias (Groves and Peytcheva 2008), and Gallup’s (2009) report, given
the expansive sample it is drawn from, does not provide the same analysis of survey
nonresponse as Pew’s (2007). Ultimately, while a random national sample of American
Muslims may be a significant achievement, it is no panacea for addressing the difficulties
of low-​incidence sampling.
If well-​funded organizations are nonetheless forced to make certain concessions from
a theoretically ideal sampling design, then naturally academic researchers and smaller
institutes working within significantly tighter resource constraints will fare no better.
Indeed, due to the numerous challenges discussed above, the vast majority of studies
featuring original survey data on American Muslims, whether academic (Patterson,
Gasim, and Choi 2011; Muedini 2009; Sharif et al. 2011) or institutional (Council on
American Islamic Relations 2006, 2012; Muslim Public Affairs Council 2005), are effec-
tively drawn from convenience samples or, at best, are representative of a local popula-
tion (Bagby 2004). A number of projects with far more modest budgets than either Pew
or Gallup have, however, sought (with varying degrees of success) to obtain a nationally
representative sample of American Muslims.
Zogby International (2001, 2004) compiled arguably the most extensive accounts
of the American Muslim population prior to the Pew (2007) study. The methodology
employed across both of Zogby’s surveys targeted Muslim respondents by randomly
selecting 300 Islamic centers and drawing from a listing of Muslim surnames in the
surrounding area to populate an RDD frame. Additional in-​person interviews sought
out African American Muslims in New York, New York, Washington, D.C., Atlanta,
Georgia, and Detroit, Michigan, to compensate for this subgroup’s likely underrepre-
sentation in the RDD sample. The reports offer no details, however, on how the sam-
pling for the in-​person interviews was undertaken, nor do they provide a rationale for
not including Philadelphia, Pennsylvania, among the cities visited, given its high con-
centration of African American Muslims. Thus, despite conducting more interviews (n
≈ 1,800) than either the Pew or Gallup polls discussed above, the lack of methodological
clarity (in addition, there is no mention of whether the interviews were carried out in
languages other than English) makes it difficult to take the reports’ claims of representa-
tiveness at face value (Zogby 2001, 2004).
Another project that cast a wide net was the Muslim American Public Opinion Survey
(MAPOS) (2010). For this study, researchers recruited local Muslim enumerators in
twenty-​two locations across the United States (eleven cities; two locations in each city) to
hand out two-​page “exit-​poll style” surveys following weekly Friday services and semian-
nual Eid celebrations. The benefits of this strategy include the ability to employ multilingual,
self-​administered surveys, which cut down on nonresponse and social desirability bias.
Reliance on a single sampling mode and the exclusion of cities with large Muslim centers
are among the study’s limitations; but despite these drawbacks, the authors’ contention that
their sample is nonetheless representative is not without merit. The validity of this claim
stems from the study’s central questions, which gauge how religiosity influences political
participation within this minority population. As such, the more informative opinions for
the authors’ purposes are those of the more religiously inclined that this sampling strategy
targets. This method of using a study’s motivating questions as a reference for calibrating
resource allocation, as the concluding section of this chapter discusses, constitutes another
rare universal prescription for pollsters targeting low-​incidence populations.

Asian Americans
In comparison to Muslim Americans—​who can be of any racial or ethnic background—​
the official American racial taxonomy classifies and enumerates Asian American races
as a function of national origin. In 1960, and prior to the reopening of the United States
to immigration from Asia in 1965 with the Immigration and Nationality Act, the size
of the Asian American population was fewer than one million and represented only a
fraction of the entire U.S. population. Subsequent years of increasing immigration to the
United States from Asia have driven the size of the Asian American population to more
than 5% of all Americans. Until the 1990s, Asian Americans were heavily of East Asian
national origins, particularly Chinese and Japanese. But in subsequent decades, immi-
gration from Asia to the United States has expanded to include large numbers of new
Chinese, South Asian Indians, Filipinos, Vietnamese, Koreans, and Southeast Asians.
Because the vast majority of immigrants are recent, they speak native languages in-
cluding Mandarin, Cantonese, Hindi, Bengali, Tagalog, Vietnamese, Korean, and Thai,
among others. This degree of variation makes matching targeted Asian Americans to a
language of interview a complex process requiring expertise in Asian culture. Similarly, because the Asian American population is heavily concentrated in some states, its geographic settlement patterns create challenges for researchers attempting
to survey this low-​incidence population.
Two recent national studies of Asian American opinion and political behavior pro-
vide excellent guidance for researchers interested in polling Asian Americans. The
National Asian American Survey (NAAS) of 2008 was conducted over the telephone
with 5,159 respondents (Wong et al. 2011). The largest national origin groups—​Chinese,
South Asian, Filipino, Vietnamese, Korean, and Japanese—​were interviewed in the
language of their choice. Selection of the sample was accomplished by a combination
of techniques including utilizing lists, RDD, and stratified design, as well as density
sampling. Because the researchers were interested in drawing inferences about Asian
Americans in the United States overall as well as national origin groups and also par-
ticular states that were considered political “battleground” states in 2008, the principal
investigators began the process of drawing the sample at the county level by selecting
locations classified as high and low immigration as well as new and old immigrant
destinations.
Identifying Asian Americans was accomplished primarily through a national list
based on surnames, but the NAAS researchers supplemented the known
universes with both RDD (to test the frequency of incidence and resulting cost of
attempting to screen from a random sample to capture Asian Americans) and lists
constructed specifically to capture Filipino Americans. Many Filipino surnames have
origins in the Spanish colonial experience and therefore are often conflated with Latino
and Hispanic ethnic origin. In addition, NAAS researchers utilized a proprietary name-​
matching database to predict the ethnic origin and therefore the language preference
of potential subjects. As discussed previously, if the initial contact with a respondent is
not made in that person’s language, the likelihood of completing the interview is sub-
stantially reduced. Therefore, the selected sample was coded for likely national origin
and assigned separately to bilingual interviewers who spoke Mandarin and English,
or Tagalog and English, for example. All interviewers were bilingual in English and an
Asian language.
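The general shape of that assignment step can be illustrated with a toy lookup. The surname-to-language mapping below is entirely hypothetical and merely stands in for the proprietary name-matching database described above:

```python
import pandas as pd

# Toy stand-in for a proprietary surname-to-origin database: a handful of
# hypothetical surnames mapped to a likely interview language.
surname_to_language = {
    "nguyen": "Vietnamese",
    "kim": "Korean",
    "wang": "Mandarin",
    "santos": "Tagalog",
}

sample = pd.DataFrame({
    "case_id": [101, 102, 103, 104, 105],
    "surname": ["Nguyen", "Kim", "Wang", "Santos", "Lee"],
})

# Predict the likely interview language; unmatched names default to English
# and can be rescreened by a multilingual interviewer at first contact.
sample["likely_language"] = (
    sample["surname"].str.lower().map(surname_to_language).fillna("English")
)

# Route each case to an interviewer pool bilingual in that language.
for language, cases in sample.groupby("likely_language"):
    print(language, list(cases["case_id"]))
```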
The resulting data collection from more than five thousand Asian Americans in the
NAAS represented the first national sample of Asian American political attitudes and
behavior conducted in the United States. Benchmarks for partisan affiliation, voting
turnout, and vote choice are established in these data not only for Asian Americans
nationally, but also for particular national origin groups. Subsequent studies of
Asian Americans have followed this multi-​approach research design pioneered by
the NAAS.
A second important national survey of Asian Americans was conducted by the Pew Research Center in 2012 (Pew Research Center 2012b). The Pew Asian-American Survey completed interviews with 3,511 respondents identified as Asian Americans. Similar to the NAAS, the Pew study conducted telephone interviews using bilingual interviewers and asked
a series of questions about political activity, opinion, attitudes about politics, and socio-​
demographics. Many of the results are similar to the findings from the 2008 survey, even
though the emphasis of the Pew study was on an overall portrait of Asian American so-
cial attitudes rather than on political activism and public opinion as it was in the NAAS.
The Pew study utilized samples from its previous national studies to locate potential
Asian American respondents in addition to existing lists and targeted density sampling.
This study is another example of a creative use of multiple strategies of identifying,
locating, sampling, analyzing, and surveying a low-​incidence population. It is impor-
tant to note, however, that this sampling strategy is conditioned on successful contact
and willingness to cooperate in the previous study, rendering the eventual sample of
respondents a conglomerate panel of earlier sampling frames. As a result, the survey
sample in the Pew study is composed of a subset of the larger population interviewed
successfully once before, and this underlying bias should be taken into account when
conducting analysis.

Political Activists
A final case of a low-​incidence population in the United States is defined by a high de-
gree of activism in politics. Despite a range of levels of government in which to par-
ticipate—​from the local community council and school board, to city hall, to the state
house, to the federal government—​most Americans take part in politics by engaging
in one political act every four years, and that is voting in a presidential election. While
voter turnout increased in the high-​interest election of 2008, less than 60% of the eli-
gible population of voters cast a ballot in the 2012 U.S. presidential election. Other forms
of activity in the electoral arena, including working for campaigns or attending rallies,
occur even less frequently, though larger proportions of Americans report having made
a contribution to a candidate or political cause. Despite the old adage that “all politics is
local,” many fewer Americans vote in municipal or statewide elections than in federal
elections, and a relatively small proportion report taking part in activities at the local
level. Even the self-​interested act of contacting an elected official for help in solving a
problem has a low incidence in the U.S. population.
Thus political activists are rare in the United States, and finding those who are en-
gaged in politics without being elected officials is a task that can only be accomplished
by embracing the dynamism of politics and the fluidity of political behavior. While
there are multiple examples of researchers surveying political activists—​from political
party convention delegates to citizens who attend local town hall meetings—​the most
substantial and comprehensive effort to assess the motivations and attitudes of ordi-
nary Americans who are actively engaged in politics is a study conducted by the po-
litical scientists Sidney Verba, Kay Schlozman, Henry Brady, and Norman Nie (Verba,
Schlozman, and Brady 1995; Nie, Junn, and Stehlik-​Barry 1996). The Citizen Participation
Study began with a large RDD “screener” of 15,053 respondents. The screener sample
was nationally representative, and interviews were conducted by telephone. Based on
analysis of the screener data, which asked a range of questions on political and civic en-
gagement in voting; electoral politics; community-​based activity; contacting officials,
local boards, and councils; and protest activity, a smaller set of respondents was selected
for reinterview in a subsequent study.
The follow-​up survey was conducted with 2,517 respondents in person and asked
respondents about the specific political activities that they engaged in and the reasons
they took part, along with a wide range of attitudinal and demographic questions.
Oversamples of activists specializing in specific types of activities, such as protesters or
campaign workers, were drawn in addition to a sample of ordinary Americans who were
active in politics in multiple ways. This stratified design allowed researchers to analyze
a randomly selected sample of different types of activists as well as view the U.S. popu-
lation as a whole by employing post-​stratification weights in analysis. The creative de-
sign employed by Verba and colleagues in their study of this particular low-​incidence
population has continued to pay dividends for researchers interested in understanding
the dynamics of political participation. While difficult, expensive, and time-​consuming,
the Citizen Participation Study has yielded important insights into why Americans do and do not take part in the politics of their nation.
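The weighting logic behind such a screener-plus-oversample design can be sketched with invented stage counts (not the actual figures from the Citizen Participation Study): each group's design weight is simply the inverse of its second-stage retention rate.

```python
# Invented two-stage counts: activists screened at stage one are retained
# for the in-person reinterview at a much higher rate than non-activists.
screener_n  = {"activist": 1500, "non_activist": 13500}   # stage-one cases
follow_up_n = {"activist": 1200, "non_activist": 1300}    # reinterviewed cases

# Probability that a screened case of each type was retained at stage two.
retention = {g: follow_up_n[g] / screener_n[g] for g in screener_n}

# Design weight = inverse of that retention probability; applied in analysis,
# it scales the oversampled activists back toward their share of the
# nationally representative screener sample.
design_weight = {g: 1 / retention[g] for g in retention}
print(design_weight)
```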

Discussion and Conclusion

Surveys provide an indispensable tool to describe and explain characteristics of a
broader public. While it is typically not feasible to contact every individual one seeks
to describe, advances in probability theory, increased computing power, and continual
improvements in modes of communication have enabled survey researchers to generate
valid and reliable measures of a population from the observations of a sample. In the an-
nual presidential address to AAPOR, Cliff Zukin made the claim, “our methodology is
built on the notion—​and science—​of sampling. That is, we select and interview a small
group of people to represent an underlying population” (2006, 428). Sidney Verba goes
even further in extolling the virtues of sampling, comparing the empirical approach
with the normative goals we seek to measure. Verba contends, “Surveys produce just
what democracy is supposed to produce—​equal representation of all citizens. The
sample survey is rigorously egalitarian; it is designed so that each citizen has an equal
chance to participate and an equal voice when participating” (1995, 3).
However, the validity and reliability of one’s inferences depend on the extent to which
the sample one observes is, in fact, representative of the broader population one seeks to
describe. Gathering systematic and high-​quality data from low-​incidence populations
presents substantial if not insurmountable challenges to pollsters. Low-​incidence
groups are characterized not only by their relative rarity in the population, but also by
the accompanying fact that these individuals are both more difficult to identify and con-
tact. Yet despite the difficulties low-​incidence populations present, it is essential for the
surveyor to develop effective strategies to meet these challenges.
Similar to difficult-​to-​reach populations more generally, if there is a theoretical
reason to believe that subgroups differ significantly from the broader population along
outcomes of interest, then their omission may bias a study’s estimates. The extent of bias
depends on the relative size of the low-​incidence population to the total population,
as well as the extent to which the low-​incidence population differs from the total pop-
ulation on the measures of interest. Thus, one might argue that due to their inherent
small size, low-​incidence groups are unlikely to bias estimates of the general population.
However, as smaller segments of the public grow in size (e.g., immigrant groups; cell-​
phone users), the omission of these increasingly prevalent individuals raises the risk of
bias. This challenge is further complicated by the fact that low-​incidence populations
tend to be underrepresented in most survey samples, and thus we often lack the empir-
ical evidence to assess the extent to which they differ. Furthermore, in order to conduct
meaningful subgroup analysis, researchers must have adequate sample sizes. In addition
to learning more about the particular subgroup, intergroup comparison will enable
us to test the generalizability of theories. How do groups differ? What explains these
differences? Answers to these questions will enable us to develop conditional theories
that more accurately depict our diverse population.
As this chapter has highlighted time and again, however, researchers face tough
decisions when it comes to resource allocation. Limited time and funding necessarily
entail compromises. Although the difficulties particular to surveying low-​incidence
populations by and large belie one-​size-​fits-​all prescriptions, two broad considerations
should factor into any sampling design. First, mixed-​mode data collection techniques
offer researchers a way to potentially reduce costs and/​or reduce nonresponse (Dillman,
Smyth, and Christian 2014). For example, maximizing the number of responses attained
through a relatively cheap mode (say, a mail-​in or some other self-​administered
survey) before moving on to a more expensive mode (usually one requiring trained
enumerators) is a generally optimal practice that is particularly beneficial to pollsters
surveying rare groups, where the costs associated with coverage can be particularly bur-
densome. Moreover, when collecting data on populations that include a large portion
of members whose first language is not English, the coverage advantages of face-​to-​
face or telephone surveys can be easily outweighed by the nonresponse attendant on
English-​only interviews. In this scenario, adding a self-​administered mode with several
translations of the questionnaire could be far more cost effective than training multilin-
gual interviewers. Indeed, a mixed-​mode strategy is all the more advantageous given
that cultural and linguistic minority groups may be more suspicious of interviewers,
particularly if they are not members of their community (Harkness et al. 2014), yet if
given the opportunity to share their opinions in the language of their choosing, such
minorities may be willing to participate just as often as the majority population (Feskens
et al. 2007).
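As a rough sketch of the cost logic behind sequencing a cheap mode ahead of an expensive one (all figures are invented and will vary widely in practice):

```python
# Purely illustrative cost comparison for a sequential mixed-mode design
# versus a single expensive mode, targeting 1,000 completed interviews.
target_completes = 1000
cost_per_complete = {"mail": 12.0, "phone": 55.0}
share_completed_by_mail_first = 0.60   # assumed yield of the cheap mode

mixed_mode_cost = (
    target_completes * share_completed_by_mail_first * cost_per_complete["mail"]
    + target_completes * (1 - share_completed_by_mail_first) * cost_per_complete["phone"]
)
phone_only_cost = target_completes * cost_per_complete["phone"]

print(f"sequential mixed mode: ${mixed_mode_cost:,.0f}")
print(f"phone only:            ${phone_only_cost:,.0f}")
```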
Second, the inevitable trade-​offs should be optimized with regard to the study’s
core research questions. This tailored approach is, again, applicable to polling gener-
ally, although its advantages are more acute in the case of low-​incidence populations.
For example, researchers with the MAPOS (2010) study aimed to elaborate the role of
mosque attendance in social and civic participation; thus they opted for a sampling
strategy—​polling congregants as they left communal prayers—​that likely skewed any
resultant bias in representation toward the particular subgroup of interest within their
target population. As obtaining a national probability sample of American Muslims
would have been prohibitively expensive, the coordinators for this project maximized
their resources by focusing on the sources of error they could best guard against: first
by providing self-​administered questionnaires in multiple languages, handed out by
Muslim enumerators—​which cut down on nonresponse and social desirability bias—​
and second, and more to the point, by tailoring their coverage priorities to the research
questions motivating the study.
Beyond these front-​end calibrations in research design, there are two meaningful
actions researchers of low-​incidence populations can take to improve the back-​end
data analysis. First, a meta-​analysis of national probability surveys featuring mean-
ingful numbers of the group of interest can provide researchers with more reliable
demographic baselines. These more valid metrics would help researchers design more
effective sampling strategies and apply more accurate post-​stratification weighting. This
approach has successfully been utilized by pollsters studying American Jews (Tighe
et al. 2010) and can effectively curb overreliance on the demographic picture painted by
a single survey.3
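A minimal version of such cross-survey pooling, assuming only that each survey reports an estimated proportion and a sample size for the group of interest (all numbers invented), weights each estimate by the inverse of its sampling variance:

```python
# Precision-weighted pooling of a demographic estimate (say, the share of the
# rare group holding a college degree) across several surveys that each
# captured enough members of the group. Numbers are invented.
surveys = [
    {"p": 0.42, "n": 180},   # survey A
    {"p": 0.37, "n": 320},   # survey B
    {"p": 0.45, "n": 95},    # survey C
]

# Weight each estimate by the inverse of its sampling variance, p(1 - p) / n.
weights = [s["n"] / (s["p"] * (1 - s["p"])) for s in surveys]
pooled = sum(w * s["p"] for w, s in zip(weights, surveys)) / sum(weights)

print(round(pooled, 3))
```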
Second, researchers and surveyors should be more forthcoming with detailed
appraisals of their methodology. This goes beyond a general ethos of transparency to ac-
knowledge that, as has been shown, nuanced decisions can have quite meaningful effects.
One concrete measure that this can translate into is asking in-​person enumerators, such
as those of the MAPOS (2010) survey, to keep track of—​and report—​descriptive data
on those individuals who opt not to take the survey, in order to paint a fuller picture
of nonresponse error. These reports should include objective traits—​such as age range,
sex, location of contact, and so forth—​but even more subjective inferences regarding the
reasons behind their refusal to participate could prove useful (for example, whether it
was because they were too busy or merely felt suspicious of the enumerators’ motives).
Noting those respondents who required extra cajoling to participate would similarly be
of benefit to this end.
Since it is typically impractical (often close to impossible) to observe every unit of
interest, scholars carefully attempt to construct a sample that is generally represen-
tative of the target group. In turn, the validity and reliability of one’s inferences de-
pend on the extent to which the resultant sample meets this criterion. This chapter
discussed the heightened obstacles that researchers of low-​incidence populations face
in this regard and the possible paths they may take in meeting these added challenges.
While there is no methodological silver bullet, each conscientious contribution helps
to fill gaps and advance a more holistic understanding of not just rare populations, but
society at large.

Notes
1. The proportion of adults in America who are Muslim is a contested matter (see Smith
2002 for a review), although Pew Research Center (2007, 2011) places the share at about
0.5 percent.
2. This latter concern eventually led Pew methodologists to alter their sampling strategy in a
follow-​up survey of American Muslims (2011), amending the RDD frame to include both
cellular and landline numbers.
3. See, for example, Dana, Barreto, and Oskooii (2011); Djupe and Calfano (2012); and
Patterson, Gasim, and Choi (2011); all comparing original data on American Muslims to
Pew’s (2007) sample to gauge representativeness.

References
Bagby, I. 2004. A Portrait of Detroit Mosques: Muslim Views on Policy, Politics, and Religion.
Detroit, MI: Institute of Social Policy and Understanding.
Biernacki, P., and D. Waldorf. 1981. “Snowball Sampling: Problems and Techniques in Chain-​
referral Sampling.” Sociological Methods & Research 10: 141–​163.
Blair, J., and R. Czaja. 1982. “Locating a Special Population Using Random Digit Dialing.” Public
Opinion Quarterly 46 (4): 585–​590.
Blumberg, S. J., and J. V. Luke. 2009. “Reevaluating the Need for Concern Regarding
Noncoverage Bias in Landline Surveys.” American Journal of Public Health 99 (10): 1806–​1810.
Brehm, J. 1993. The Phantom Respondents. Ann Arbor: University of Michigan Press.
Brick, J. M. 2011. “The Future of Survey Sampling.” Public Opinion Quarterly 75: 872–​878.
Brick, J. M., D. Williams, and J. M. Montaquila. 2011. “Address-​ Based Sampling for
Subpopulation Surveys.” Public Opinion Quarterly 75 (3): 409–​428.
Council on American-​Islamic Relations. 2006. American Muslim Voters:  A Demographic
Profile and Survey of Attitudes. Washington, DC: Council on American-​Islamic Relations.
Council on American-​ Islamic Relations. 2012. American Muslim Voters and the 2012
Elections:  A Demographic Profile and Survey of Attitudes. Washington, DC:  Council on
American-​Islamic Relations.
Curtin, R., S. Presser, and E. Singer. 2005. “Changes in Telephone Survey Nonresponse over the
Past Quarter Century.” Public Opinion Quarterly 69 (1): 87–​98.
Czaja, A. J., G. L. Davis, J. Ludwig, and H. F. Taswell. 1998. “Complete Resolution of
Inflammatory Activity Following Corticosteroid Treatment of HBsAg-​Negative Chronic
Active Hepatitis.” Hepatology 4 (4): 622–​627.
Dana, K., M. A. Barreto, and K. A. R. Oskooii. 2011. “Mosques as American Institutions: Mosque
Attendance, Religiosity and Integration into the Political System among American
Muslims.” Religions 2 (4): 504–​524.
De la Garza, R., A. Falcon, F. C. Garcia, and J. A. Garcia. 1998. Latino National Political Survey,
1989–​1990. Ann Arbor, MI: Inter-​university Consortium for Political and Social Research.
De Leeuw, E., M. Callegaro, J. Hox, E. Korendijk, and G. Lensvelt-​Mulders. 2007. “The
Influence of Advance Letters on Response in Telephone Surveys a Meta-​Analysis.” Public
Opinion Quarterly 71 (3): 413–​443.
Dillman, D. A., J. D. Smyth, and L. M. Christian. 2014. Internet, Phone, Mail, and Mixed-​Mode
Surveys: The Tailored Design Method. 4th ed. Hoboken, NJ: Wiley.
Djupe, P. A., and B. R. Calfano. 2012. “American Muslim Investment in Civil Society
Political Discussion, Disagreement, and Tolerance.” Political Research Quarterly 65
(3): 516–​528.
Elliott, M. N., W. S. Edwards, D. J. Klein, and A. Heller. 2012. “Differences by Survey Language
and Mode among Chinese Respondents to a CAHPS Health Plan Survey.” Public Opinion
Quarterly 76 (2): 238–​264.
Elliott, M. N., D. McCaffrey, J. Perlman, G. N. Marshall, and K. Hambarsoomians. 2009. “Use
of Expert Ratings as Sampling Strata for a More Cost-​Effective Probability Sample of a Rare
Population.” Public Opinion Quarterly 73 (1): 56–​73.
Ericksen, E. P. 1976. “Sampling a Rare Population: A Case Study.” Journal of American Statistical
Association 71: 816–​822.
Feskens, R., J. Hox, G. Lensvelt-​Mulders, and H. Schmeets. 2007. “Nonresponse Among Ethnic
Minorities: A Multivariate Analysis.” Journal of Official Statistics 23 (3): 387–​408.
Fraga, L. R., J. A. Garcia, R. Hero, M. Jones-​Correa, V. Martinez-​Ebers, and G. M. Segura.
2006. Latino National Survey (LNS), 2006. ICPSR 20862. Ann Arbor, MI: Inter-​university
Consortium for Political and Social Research [distributor], 2013-​06-​05. http://​doi.org/​
10.3886/​ICPSR20862.v6.
Gallup. 2009. Muslim Americans: A National Portrait. Washington, DC: Gallup.
Gallup. 2011. Muslim Americans: Faith, Freedom, and the Future. Abu Dhabi: Abu Dhabi Gallup
Center.
Gentry, R., M. Cantave, N. Wasikowski, and Y. Pens. 2010. “To Mail or to Call: How to Reach the
Hard-​to-​Reach.” Paper presented at the 65th Annual Meeting of the American Association
for Public Opinion Research, Chicago.
Grammich, C., Hadaway, K., Houseal, R., Jones, D. E., Krindatch, A., Stanley, R., and Taylor,
R. H. 2012. 2012 U.S. Religion Census:  Religious Congregations and Membership Study.
Association of Statisticians of American Religious Bodies. Nazarene Publishing House.
www.nph.com/​nphweb/​html/​nph/​itempage.jsp?itemid=9780615623443.
Green, D. P., and A. S. Gerber. 2006. “Can Registration-​Based Sampling Improve the Accuracy
of Midterm Election Forecasts?” Public Opinion Quarterly 70 (2): 197–​223.
Greer, C. 2013. Black Ethnics:  Race, Immigration, and the Pursuit of the American Dream.
New York: Oxford University Press.
Groves, R. M. 2006. “Nonresponse Rates and Nonresponse Bias in Household Surveys: What
Do We Know about the Linkage between Nonresponse Rates and Nonresponse Bias?” Public
Opinion Quarterly 70 (5): 646–​675.
Groves, R. M., and M. Couper. 1998. Nonresponse in Household Interview Surveys.
New York: Wiley.
Groves, R. M., M. P. Couper, S. Presser, E. Singer, R. Tourangeau, G. P. Acosta, and L. Nelson.
2006. “Experiments in Producing Nonresponse Bias.” Public Opinion Quarterly 70
(5): 720–​736.
Groves, R. M., and E. Peytcheva. 2008. “The Impact of Nonresponse Rates on Nonresponse
Bias A Meta-​Analysis.” Public Opinion Quarterly 72 (2): 167–​189.
Groves, R. M., S. Presser, and S. Dipko. 2004. “The Role of Topic Interest in Survey Participation
Decisions.” Public Opinion Quarterly 68: 2–​31.
Groves, R. M., E. Singer, and A. Corning. 2000. “Leverage-​Saliency Theory of Survey
Participation: Description and an Illustration.” Public Opinion Quarterly 64 (3): 299–​308.
Harkness, J., M. Stange, K. I. Cibelli, P. Mohler, and B. E. Pennell. 2014. “Surveying Cultural
and Linguistic Minorities.” In Hard-​to-​Survey Populations, edited by R. Tourangeau, B.
Edwards, T. P. Johnson, K. M. Wolter, and N. Bates, 245–​269. Cambridge, UK: Cambridge
University Press.
Hattam, V. 2007. In the Shadow of Race. Chicago: University of Chicago Press.
Hedges, B. M. 1979. “Sampling Minority Populations.” In Social and Educational Research in
Action, edited by M. J. Wilson, 244–​261. London: Longman.
Jackson, J. S., and G. Gurin. 1987. National Survey of Black Americans, 1979–​1980. Vol. 8512. Ann
Arbor, MI: Inter-​University Consortium for Political & Social Research.
Jackson, J. S., and G. Gurin. 1999. “National Survey of Black Americans, 1979–​1980 [computer
file]. ICPSR08512-​v1.” Ann Arbor, MI: Inter-​university Consortium for Political and Social
Research [distributor].
Jackson, J. S., P. Gurin, and S. J. Hatchett. 1989. National Black Election Study, 1984. ICPSR08938-​
v1. Ann Arbor, MI: Inter-​university Consortium for Political and Social Research [distrib-
utor]. http://​doi.org/​10.3886/​ICPSR08938.v1.
Jackson, J. S., V. L. Hutchings, R. Brown, and C. Wong. 2004. National Politics Study.
ICPSR24483-​v1. Ann Arbor, MI:  Inter-​university Consortium for Political and Social
Research [distributor], 2009-​03-​23. http://​doi.org/​10.3886/​ICPSR24483.v1.
Jackson, J. S., and H. G. Neighbors. 1997. National Survey of Black Americans, Waves 1–​4,
1979–​1980, 1987–​1988, 1988–​1989, 1992. ICPSR06668-​v1. Ann Arbor, MI:  Inter-​university
Consortium for Political and Social Research [distributor]. http://​doi.org/​10.3886/​
ICPSR06668.v1.
Junn, J., T. S. Lee, K. Ramakrishnan, and J. Wong. 2011. “Asian‐American Public Opinion.”
In The Oxford Handbook of American Public Opinion and the Media, edited by Robert Y.
Shapiro and Lawrence R. Jacobs, 520–​534. Oxford, New York: Oxford University Press.
Kalton, G., and D. W. Anderson. 1986. “Sampling Rare Populations.” Journal of the Royal
Statistical Society 149 (1): 65–​82.
Keeter, S., C. Miller, A. Kohut, R. Groves, and S. Presser. 2000. “Consequences of Reducing
Nonresponse in a Large National Telephone Survey.” Public Opinion Quarterly, 64: 125–​148.
Keeter, S., C. Kennedy, M. Dimock, J. Best, and P. Craighill. 2006. “Gauging the Impact of
Growing Nonresponse on Estimates from a National RDD Telephone Survey.” Public
Opinion Quarterly 70 (5): 737–​758.
Lavelle, B., M. D. Larsen, and C. Gundersen. 2009. “Strategies for Surveys of American
Indians.” Public Opinion Quarterly 73 (2): 385–​403.
Lazerwitz, B. 1978. “An Estimate of a Rare Population Group: The U.S. Jewish Population.”
Demography 15 (3): 389–​394.
Lien, P., M. M. Conway, and J. S. Wong. 2004. The Politics of Asian Americans: Diversity and
Community. New York: Routledge.
Lin, I., and N. Schaeffer. 1995. “Using Survey Participants to Estimate the Impact of Non-​
participation.” Public Opinion Quarterly 59: 236–​258.
Link, M. W., M. P. Battaglia, M. R. Frankel, L. Osborn, and A. H. Mokdad. 2008. “A Comparison
of Address-​Based Sampling (ABS) versus Random Digit Dialing (RDD) for General
Population Surveys.” Public Opinion Quarterly 72: 6–​27.
Lohr, S., and J. N.  K. Rao. 2006. “Estimation in Multiple-​Frame Surveys.” Journal of the
American Statistical Association 101 (475): 1019–​1030.
Muedini, F. 2009. “Muslim American College Youth: Attitudes and Responses Five Years After
9/​11.” The Muslim World 99 (1): 39–​59.
Muslim American Public Opinion Survey (MAPOS). 2010. http://​www.muslimamericansurvey.
org/​.
Muslim Public Affairs Council. 2005. Religion & Identity of Muslim American Youth Post-​
London Attacks. Washington, DC: Muslim Public Affairs Council.
Neyman, J. 1934. “On the Two Different Aspects of the Representative Method: The Method of
Stratified Sampling and the Method of Purposive Selection.” Journal of the Royal Statistical
Society 97 (4): 558–​625.
Nie, N. H., J. Junn, and K. Stehlik-​Barry. 1996. Education and Democratic Citizenship in
America. Chicago: University of Chicago Press.
Nobles, M. 2000. Shades of Citizenship: Race and the Census in Modern Politics. Palo Alto,
CA: Stanford University Press.
O’Donnell, J. A., H. L. Voss, R. R. Clayton, G. T. Slatin, and R. G. Room. 1976. Young Men
and Drugs: A Nationwide Survey; National Institute on Drug Abuse Research Monograph.
Washington, DC: US Department of Health and Human Services.
Olson, K. 2006. “Survey Participation, Nonresponse Bias, Measurement Error Bias, and Total
Bias.” Public Opinion Quarterly 70 (5): 737–​758.
Patterson, D., G. Gasim, and J. Choi. 2011. “Identity, Attitudes, and the Voting Behavior of
Mosque-​ Attending Muslim-​ Americans in the 2000 and 2004 Presidential Elections.”
Politics and Religion 4 (2): 289–​311.
Pew Research Center. 2007. Muslim Americans:  Middle Class and Mostly Mainstream.
Washington, DC: Pew Research Center.
Pew Research Center for the People & the Press. 2011. Muslim Americans: No Signs of Growth in
Alienation or Support for Extremism. Washington, DC: Pew Research Center.
Pew Research Center. 2012a. Mormons in America: Certain in Their Beliefs, Uncertain of Their
Place in Society. Washington, DC: Pew Research Center.
Pew Research Center. 2012b. The Rise of Asian Americans. Washington, DC:  Pew Research
Center.
Pew Research Center. 2013. A Portrait of Jewish Americans. Washington, DC: Pew Research
Center.
Peytchev, A., E. Peytcheva, and R. M. Groves. 2010. “Measurement Error, Unit Nonresponse,
and Self-​Reports of Abortion Experiences.” Public Opinion Quarterly 74 (2): 319–​327.
Reed, J. S. 1975. “Needles in Haystacks: Studying ‘Rare’ Populations by Secondary Analysis of
National Sample Surveys.” Public Opinion Quarterly 39 (4): 514–​522.
Rossi, P. H., and R. Crain. 1968. “The NORC Permanent Community Sample.” Public Opinion
Quarterly 32 (2): 261–​272.
Rossi, P. H., J. D. Wright, G. A. Fisher, and G. Willis. 1987. “The Urban Homeless: Estimating
Composition and Size.” Science 235: 1136–​1141.
Rothbart, G. S., M. Fine, and S. Sudman. 1982. “On Finding and Interviewing the Needles in the
Haystack: The Use of Multiplicity Sampling.” Public Opinion Quarterly 46 (3): 408–​421.
Sakshaug, J. W., T. Yan, and R. Tourangeau. 2010. “Nonresponse Error, Measurement Error, and
Mode of Data Collection: Tradeoffs in a Multi-​Mode Survey of Sensitive and Non-​Sensitive
Items.” Public Opinion Quarterly 74 (5): 907–​933.
Sharif, A., H. Jawad, P. Nightingale, J. Hodson, G. Lipkin, P. Cockwell, S. Ball, and R. Borrows.
2011. “A Quantitative Survey of Western Muslim Attitudes to Solid Organ Donation.”
Transplantation 9 (10): 1108–​1114.
Shor, R. 2000. “Jewish Immigrant Parents from the Former Soviet Union:  A Method for
Studying their Views of How to Respond to Children’s Misbehavior.” Child Abuse & Neglect
24 (3): 353–​362.
Sirken, M. G. 1972. “Stratified Sample Surveys with Multiplicity.” Journal of the American
Statistical Association 67 (3): 224–​227.
Sirken, M. G., B. I. Graubard, and M. J. McDaniel. 1978. “National Network Surveys of
Diabetes.” Proceedings of the Section on Survey Research Methods, American Statistical
Association, 631–​635.
Smith, T. W. 2002. “Review: The Muslim Population of the United States; The Methodology of
Estimates.” The Public Opinion Quarterly 66 (3): 404–​417.
Stinchcombe, A., L. C. Jones, and P. B. Sheatsley. 1981. “Nonresponse Bias for Attitude
Questions.” Public Opinion Quarterly 45 (3): 359–​375.
Sudman, S., and G. Kalton. 1986. “New Developments in the Sampling of Special Populations.”
Annual Review of Sociology 12: 401–​429.
Tate, K. 1997. National Black Election Study, 1996. ICPSR version. Columbus, OH: Ohio State
University [producer]; Ann Arbor, MI: Inter-​university Consortium for Political and Social
Research [distributor], 2004. http://​doi.org/​10.3886/​ICPSR02029.v1.
Teitler, J. O., N. E. Reichman, and S. Sprachman. 2003. “Costs and Benefits of Improving
Response Rates for a Hard-​ to-​Reach Population.” Public Opinion Quarterly
67: 126–​138.
Tighe, E., D. Livert, M. Barnett, and L. Saxe. 2010. “Cross-Survey Analysis to Estimate Low-
Incidence Religious Groups.” Sociological Methods & Research 39 (1): 56–​82.
Tourangeau, R., R. M. Groves, and C. D. Redline. 2010. “Sensitive Topics and Reluctant
Respondents: Demonstrating a Link between Nonresponse Bias and Measurement Error.”
Public Opinion Quarterly 74 (3): 413–​432.
Venette, R. C., R. D. Moon, and W. D. Hutchison. 2002. “Strategies and Statistics of Sampling
for Rare Individuals.” Annual Review of Entomology 47: 143–​174.
Verba, S. 1995. “The Citizen as Respondent:  Sample Surveys and American Democracy
Presidential Address, American Political Science Association.” American Political Science
Review 90 (1): 1–​7.
Verba, S., K. L. Schlozman, and H. E. Brady. 1995. Voice and Equality: Civic Voluntarism in
American Politics. Cambridge, MA: Harvard University Press.
Waksberg, J. 1978. “Sampling Methods for Random Digit Dialing.” Journal of the American
Statistical Association 73: 40–​46.
Waksberg, J., D. Judkins, and J. T. Massey. 1997. “Geographic-​ based Oversampling in
Demographic Surveys of the United States.” Survey Methodology 23: 61–​72.
Watters, J. K., and P. Biernacki. 1989. “Targeted Sampling: Options for the Study of Hidden
Populations.” Social Problems 36 (4): 416–​430.
Wong, J. S. 2006. Democracy’s Promise:  Immigrants and American Civic Institutions. Ann
Arbor: University of Michigan Press.
Wong, J. S., K. Ramakrishnan, T. Lee, and J. Junn. 2011. Asian American Political
Participation: Emerging Constituents and Their Political Identities. New York: Russell Sage
Foundation.
Zogby, J. 2001. Muslims in the American Public Square. Washington, DC: Zogby International.
Zogby, J. 2004. Muslims in the American Public Square: Shifting Political Winds and Fallout
from 9/​11, Afghanistan, and Iraq. Washington, DC: Zogby International.
Zukin, C. 2006. “Presidential Address: The Future Is Here! Where Are We Now? and How Do
We Get There?” Public Opinion Quarterly 70 (3): 426–​442.
Chapter 10

Improving the Quality of Survey Data Using CAPI Systems in Developing Countries

Mitchell A. Seligson
and Daniel E. Moreno Morales

Introduction

If it can be said that advancement in science depends on improvement in the preci-
sion of measurement, then the development of modern survey research can easily be
counted as one of the, if not the, greatest advances in social science in the twentieth
century. Notwithstanding that claim, researchers also must admit that survey data are
plagued by error, from a variety of sources. Since error can attenuate true relationships
that are in the data, we constantly risk making Type II errors: reporting that there is no
relationship, when in fact there is. In surveys there are so many different sources of error,
and error is so common in each stage of survey research, that the fact that researchers ob-
serve any statistically significant relationships between variables is truly an impressive
demonstration of the robustness of this form of research. Yet just because researchers
have made enormous progress in using surveys, that does not mean survey data are free
of error.1
Because of its pervasiveness, error takes its toll on the quality of our research. Given
that these errors are mostly unsystematic (not the product of a particular bias), they
result in noise that weakens the statistical relationship among variables. Bivariate
correlations are attenuated by error, affecting the precision of survey results. Yet some of
the error is indeed systematic, the results of which can produce statistically significant
findings that are misleading (a Type I error). The most important of these systematic
errors in survey research are those that emerge from interviewing individuals and en-
tire regions that were not intended to form part of the sample. When this happens, as
we suspect it often does, researchers can no longer be certain that each element in the
sample (in this case the respondent) has a known probability of selection, which is the
sine qua non of any scientifically drawn probability sample.
For decades, face-​to-​face surveys were based on paper and pen interviews (which are
sometimes called PAPI surveys).2 Indeed, even today interviewer-​conducted surveys
that are recorded on paper still represent the largest proportion of all face-​to-​face
surveys conducted in developing countries. But in our experience, paper-​based surveys
are responsible for much survey error. Surveys conducted using paper and pencil tech-
nology are prone to a number of different forms of error, both systematic and unsystem-
atic, with consequent negative effects on the precision and accuracy of results.

Questionnaire Application Error
Error can come from the interviewer improperly applying the questionnaire. As
most professionals with experience in the field know, interviewers can sometimes
skip questions, either intentionally (to save time or to avoid complicated or sensitive
items) or unwittingly (because their eyes skipped a row on the page, or they mistakenly
thought they had already filled in the answer). In our experience, both types of error are
all too common, especially when field conditions are difficult (e.g., poor lighting, threat-
ening surroundings). Interviewers can also incorrectly fill in the answers for filtered or
conditioned questions, producing inconsistent response patterns. That is, it is not un-
common to find blocks of questions that are to be administered only to females, or only
to respondents of specific age cohorts, being asked of all respondents. Sometimes, be-
cause pages of surveys can stick together, interviewers can skip entire pages unknow-
ingly as they move from one page to the next in a paper questionnaire. Blank answers are
usually coded as missing data by office coders, which results in a lower N for the skipped
items and thus a reduced chance of finding statistically significant results. When groups
of items that should have been skipped are asked, office coding has to be done to filter
out those responses, but even then, inconsistency can emerge between those who were
asked the correct batteries and those who were asked batteries that should have been
skipped. For example, if a battery on domestic violence that is to be asked only to women
is inadvertently asked to men, those respondents may condition their answers to subse-
quent questions in ways that differ from those men who were not asked those batteries.

Coding Error
But of all the errors in survey data, probably one of the most frequent and damaging
occurs not in the field but back in the home office, when coders incorrectly record the
results in the response columns of the paper surveys, and data entry clerks add error
by entering the data incorrectly. While verification (i.e., double entry) of 100% of data
entry is typically required in most survey contracts, systematic violation of that require-
ment is commonplace in a world in which survey firms attempt to maximize profit by
minimizing costs (the work conducted by data entry clerks is costly and adds to the
overall cost of the survey). Even in nonprofit settings, where presumably the quality of
the data is more important than the “bottom line” of the firm, the drudgery of double
entry of survey data quite likely causes all too many instances of data sets being partially
or entirely unverified.
One vignette from our own experience drives home this point. Some years ago the
senior author of this chapter contracted with a very well-​known survey firm in Latin
America to carry out the fieldwork for a survey. At the end of the project, he received
the “data” from the survey, which turned out to be no more than a series of tabulations.
When he explained to the firm that he would be doing an extensive multivariate anal-
ysis of the data, and that he needed the individual-​level survey data, the head of the firm
responded, “OK, but nobody has ever asked us for that before.” When the data were
examined and compared against the tabulations, discrepancies of all kinds emerged.
The most common was that the tabulations were all neatly coded, with no codes being
out of range. But the data set was filled with out-​of-​range codes. When the author asked
for an explanation of the inconsistency, he was told, “Oh, it is our standard practice to
sweep all out-​of-​range codes into the missing category.” In other words, not only was no
double entry performed, but the firm never went back to the original paper survey to
find out what the true answers were.
Yet not all error is attributable to the coding/​data entry phase. Interviewers can also
easily mark an answer incorrectly, because they misheard or misunderstood the an-
swer, or simply wrote it down wrong. They can also sometimes mark the answer into
the coding box for a different question printed on the page in front of them. Some of
this error is ultimately unavoidable, but paper questionnaires provide no range checks
and therefore allow the entry of impossible responses for age, income, and education.
Hence, interviewers can report a respondent of 239 years of age, when the correct an-
swer should have been 39, or income of 3,000, when 300 was the actual response, or
education of 16 years rather than 6 years.3 Some of these responses can be corrected in
the office, but more often than not one is not certain what the correct answer should
be. We cannot be certain if the correct response was 16 years of education or 6 years,
although we can make a guess based on other items, such as occupation, income, or
other variables.
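To illustrate the kind of checks that paper cannot enforce but software can, the following minimal Python sketch applies range and cross-field consistency rules at the point of data entry or during office cleaning. The variable names, bounds, and the example record are illustrative assumptions for this sketch, not the rules used by LAPOP or any particular CAPI system.

```python
# Illustrative range and consistency checks of the kind a CAPI system can
# enforce at entry time (and that paper questionnaires cannot). The variable
# names and bounds below are assumptions for the sketch, not ADGYS's rules.

RANGE_CHECKS = {
    "age": (15, 110),            # plausible respondent ages
    "education_years": (0, 25),
    "monthly_income": (0, 50000),
}

def check_response(record):
    """Return a list of problems found in a single interview record."""
    problems = []
    for field, (low, high) in RANGE_CHECKS.items():
        value = record.get(field)
        if value is not None and not (low <= value <= high):
            problems.append(f"{field}={value} outside plausible range {low}-{high}")
    # Simple cross-field consistency check: schooling cannot exceed age.
    age, edu = record.get("age"), record.get("education_years")
    if age is not None and edu is not None and edu > age:
        problems.append(f"education_years={edu} exceeds age={age}")
    return problems

if __name__ == "__main__":
    print(check_response({"age": 239, "education_years": 16, "monthly_income": 300}))
    # -> ['age=239 outside plausible range 15-110']
```

Run on the hypothetical record above, the check flags only the impossible age, leaving the research team to decide whether recontact or correction is warranted.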
Even when there is no problem of skipping, incorrect filtering, or incorrect re-
cording of responses, there is often a subtler problem related to the style of delivery of
the question itself. In order to move quickly through the interview and to save time,
some interviewers systematically abbreviate the text of the questions they are required
to ask. For example, the question might read, “How much would you say you trust the
people of this town or village; would you say you trust them (1) a lot, (2) somewhat, or
(3) not at all?” Interviewers who are trying to complete the survey quickly might just ask,
“Do you trust people or not?” Such distortion of questions is common, yet it affects the
comparability of the responses, as the questions asked of different interviewees are not
exactly the same.

Fraud
The most serious errors involve fraud, a problem that can be greatly attenuated by the
new technology we describe later in this chapter. Interviewers perpetrate fraud by par-
tially or completely filling out questionnaires on their own without reference to a gen-
uine respondent, in effect self-​interviewing, providing random answers to questions
in an effort to shirk the often tedious and sometimes dangerous work of carrying out
door-to-door surveys, while maximizing (though fraudulently) their earnings in
a given period of time. Some of this fraud can be caught by attentive supervisors and
partial recalls, but collusion between interviewers and supervisors is also possible, in
which both parties benefit from the fraud (perhaps splitting the earnings from fraud-
ulent interviews). Another type of fraud occurs when poorly supervised interviewers
“outsource” the work to others (e.g., a younger brother or sister), thus allowing the
interviews to be conducted by untrained personnel.

Sample Error
Other sources of error can produce biased survey estimates. An example of this is failing
to interview the individual who was selected via the random procedures that guarantee
lack of bias. Paper questionnaires place a heavy burden on interviewers to correctly im-
plement the household selection process. Without proper fieldwork tools, interviewers
can over-​or undersample some segments of the population (e.g., gender or age groups),
resulting in a data set that produces biased averages.
Interviewers can also visit the wrong geographic area, either knowingly or unknow-
ingly, conducting the survey in a place other than where the sample was selected. Ceteris
paribus, interviewers will usually visit easier to reach places, resulting in the population
that lives in harder to reach or more dangerous areas having less opportunity to be in-
cluded in the sample, and thus potentially biasing the results of the survey.
In survey research conducted in developing countries, many of these error sources
are exacerbated by contextual conditions. One of the main issues is the quality of work
that interviewers perform and the difficulties in supervision. For many individuals
involved in the activity, interviewing is a part-​time and occasional source of income.
They rarely have a permanent contract with the polling company, and their earnings
are based on a combination of daily and per-​interview wages. Interviewers usually have
low levels of education and, despite receiving training, are likely to make errors while
administering the questionnaire. Under these conditions, interviewers’ work has to be
closely supervised to minimize error and fraud. But field supervisors may also work part
time and therefore suffer many of the same limitations as the interviewers.
Another factor that defines the conditions under which survey research is conducted
in developing countries is the absence of complete and updated geographical informa-
tion and maps. Census offices and other government sources of official information
often do not have complete listings of residential and building areas, and mapping is
seldom up to date and complete. In other instances, where census maps are available,
government agencies may refuse to make them available to researchers. This makes
it difficult for interviewers to locate a selected area or home to start the interview
according to the sample design.
Finally, some relevant infrastructure limitations need to be considered. One is poor
quality roadways, which make it hard for interviewers to visit some areas, particularly
during rainy or winter seasons. Another is the lack of complete phone coverage; the
fact that not every home has a phone makes personal, face-​to-​face interviewing in the
respondent’s home the only option to produce a probability sample of the national pop-
ulation in many developing countries. Cell phone numbers, of course, are not geocoded,
so a phone with an exchange for a rural area might actually be in the possession of
someone from the capital city.
To a limited extent, many of the errors noted above can be prevented or attenuated
using conventional methodologies. Foremost among them is increasing the inten-
sity and quality of field supervision. Well-​trained, responsible, and motivated field
supervisors can make a world of difference in the quality of surveys, but this is a costly
element that can significantly increase the overall budget of a project. In small sample
projects, having the Principal Investigator (P.I.) in the field supervising a small team
of interviewers is perhaps the best guarantee of quality. Yet in large-​scale surveys such
means are impractical, lest the fieldwork extend over many, many months, and only
rarely would a P.I. have the time for such an effort. Further, the field supervisor cannot
be in all households at the same time, leaving some interviewers to get it right only
when under direct supervision. Finally, there is no ultimate guarantee that the field
supervisors have not colluded with interviewers to cheat.

CAPI Surveys: Benefits and Costs

In light of these grim realities of the survey fieldwork process using paper questionnaires,
the question is how to reduce or minimize each of these sources of error and deal with
the contextual obstacles while conducting survey research so the results are as precise
and reliable as possible. Academics, survey professionals, survey data users, and others
interested in survey results care about the quality of the data, and they should under-
stand the paramount importance of the survey collection process to guaranteeing that
quality.
One strategy for dealing with these sources of error and limitations is to use com-
puter assisted personal interview (CAPI) systems in handheld devices provided to the
interviewers who conduct the fieldwork (this approach is sometimes referred to as
MCAPI, mobile computer assisted personal interviews). The CAPI surveys can help by
displaying the questionnaire in a way that is less prone to error than paper, showing one
question at a time per screen and automatically including logical checks and skip patterns.
These systems also produce paradata, information about the context and the conditions
in which an interview was performed, allowing for better control of the fieldwork process
and facilitating the supervision of the interviews (Couper 2005; Olson 2013).
Since advancements in computer technologies have made CAPI systems possible, so-
cial researchers and survey professionals have looked at their potential benefits for the
quality and speed of data collection (Tourangeau 2005). Research has been conducted
comparing computer assisted interviews with traditional paper-​based surveys; some re-
count the differences in large government studies that started applying CAPI systems as
soon as they became available, such as the British Household Panel Study (Banks and
Laurie 2000) and the U.S. General Social Survey (Smith and Kim 2003). Some others re-
call the experience of innovating the use of CAPI data collection methods in developing
countries (Caviglia-​Harris et al. 2012; Shirima et al. 2007). Most of these studies con-
clude that CAPI surveys reduce error compared to paper and pen interviews, and that
they reduce the length of the data collection process (De Leeuw, Hox, and Snijkers 1998;
van Heerden, Norris, Tollman, and Richter 2014).
One of these systems is the Android Data Gathering System (ADGYS). It was devel-
oped by a team working in Cochabamba, Bolivia, in close partnership with LAPOP,
the Latin American Public Opinion Project at Vanderbilt University, and Ciudadanía,
Comunidad de Estudios Sociales y Acción Pública, LAPOP’s local academic partner
in Bolivia. The beta version of the software was developed in 2011 and used in the
AmericasBarometer survey of 2012 in Bolivia. Since then the software has been
improved, and new versions have been developed and used, with a new version of the
system becoming commercially available in 2015. The software programming company
in charge of the development is GENSO Iniciativas Web, based in Cochabamba.4
ADGYS has a client-server architecture, with a mobile application and a Web server. On the server side, ADGYS was designed using open source technologies, including the Scala programming language and the Liftweb framework. The databases are managed under MySQL and MongoDB. The client side was designed under W3C standards and uses HTML5, CSS3, jQuery, and Bootstrap. ADGYS mobile is a native Android application that uses Java technology and SQLite for database management. Synchronization with the Web server is via RESTful Web services, and all data are encrypted during transmission and while stored on the mobile devices.
The software was designed to deal with the needs and challenges arising from the kind
of work that LAPOP carries out in Latin America and with some of the most common
problems of field survey research enumerated earlier in this chapter. ADGYS was
designed entirely from scratch, making use of available technological resources. This
means that the system was specifically conceived to comply with specific requisites and
demands, including (1) administering complex questionnaires with logical checks and
conditional skips, (2) being able to manage complex samples and quota assignments,
(3) using inexpensive smartphones and tablets, and (4) providing enough information
to allow extensive control of the quality of the fieldwork.
ADGYS allows each survey to include multiple language versions, an important fea-
ture in countries that are language diverse. In the United States, for example, the system
would allow the interviewer to change from English to Spanish when encountering
respondents who feel more comfortable in, or can only speak, that language. In
Guatemala, one of the countries in which LAPOP works, a wide variety of indigenous
languages is spoken, and each of those can be programmed into ADGYS and be avail-
able for the same survey simultaneously.
The ADGYS mobile application works on devices using the Android operating system,
versions 2.2 and newer, and was programmed using Android compatible Java technology.
Since Android is currently on version 5, compatibility with the system back to 2.2 allows for
the use of older, less expensive smartphones and tablets, rather than using only state-​of-​the-​
art, and hence more costly, systems. This feature is crucial for conducting work in low-​income
countries, where the cost of electronic devices is often quite high because of import duties.
Interviewers can use the application to conduct an interview with the device either
online or offline; this feature partially deals with the limitation of not having complete
cell phone coverage over a given territory (which is common not only in developing
countries, but also in remote areas even in developed countries). Unlocking new sample
areas for an interviewer can be done online or by entering a code generated by the
system for each survey area (e.g., a sample segment). Uploading the data to the server
can, of course, only be done while the mobile device is connected to an Internet provider
(either via Wi-​Fi or using a data connection plan from the cell phone service provider).
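The general store-and-forward pattern behind this kind of offline capability can be sketched as follows; the endpoint URL, table layout, and payload format are hypothetical illustrations, not a description of ADGYS's actual synchronization protocol.

```python
# A generic store-and-forward pattern for offline CAPI data collection:
# completed interviews are saved locally (here, SQLite) and uploaded to a
# central server whenever connectivity is available. The endpoint and schema
# are hypothetical; this is not ADGYS's actual sync implementation.
import json
import sqlite3
import urllib.request

DB = sqlite3.connect("interviews.db")
DB.execute("CREATE TABLE IF NOT EXISTS queue "
           "(id INTEGER PRIMARY KEY, payload TEXT, uploaded INTEGER DEFAULT 0)")

def save_interview(responses):
    """Always store the completed interview locally first."""
    DB.execute("INSERT INTO queue (payload) VALUES (?)", (json.dumps(responses),))
    DB.commit()

def sync(server_url="https://example.org/api/interviews"):  # hypothetical endpoint
    """Try to upload any queued interviews; keep them queued if offline."""
    rows = DB.execute("SELECT id, payload FROM queue WHERE uploaded = 0").fetchall()
    for row_id, payload in rows:
        req = urllib.request.Request(server_url, data=payload.encode("utf-8"),
                                     headers={"Content-Type": "application/json"})
        try:
            urllib.request.urlopen(req, timeout=10)
        except OSError:
            return  # no connectivity or server unreachable: try again later
        DB.execute("UPDATE queue SET uploaded = 1 WHERE id = ?", (row_id,))
        DB.commit()
```

The key design choice is that local storage is the primary record and the upload is retried opportunistically, so a dropped connection never costs a completed interview.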
The mobile application requires a personalized login for interviewers and other levels
of users, such as field supervisors, so that each user is properly noted and tracked. The
sample assignment, defined for each interviewer, is also downloaded onto the phones
or tablets using the application. This means that each member of a team of interviewers
may log into the application and will only see and work on his or her unique personal
assignment of interviews, including different studies (or survey projects). With this fea-
ture, all of the information generated using ADGYS is produced and reported to the
server under the personalized settings for each user.
The second element in ADGYS is the Internet-based server, which is hosted at www.adgys.com. The server is the most important part of the system, storing and managing
the data uploaded from the mobile devices. Questionnaire construction and sample de-
sign programming are done from the server, as well as user creation and editing, in-
cluding assigning new sample areas and quotas to specific users.
The server allows users personalized login with different levels of access. Higher
level users can produce a number of reports on the advance of the fieldwork process,
including reports on sample completion by interviewer or area. Authorized users can
also generate the complete data set at any moment, even if the survey project is still in
the field. This feature makes it possible to get virtually real-​time information from the
field, an important element when using ADGYS in disaster reporting and assessment
surveys. A separate data set with the duration of each question for each case is also avail-
able for download from the system.
The server also produces an Excel spreadsheet or an Internet-​based form, unique
for each survey project, that allows the client to program a questionnaire according
to the specific goals of that particular study. This feature enables different types of
questions with various levels of measurement to be included in the electronic form
the interviewer sees. Logical checks and conditional skips can be used here, as well as
random assignment of questions and other tools that allow experimental research to be
conducted using the system.
Besides the cost of purchasing Android phones or tablets, the use of ADGYS and
other CAPI systems for fieldwork has some other costs, related to licensing of the soft-
ware and server and data traffic and storage. These costs are absent in PAPI surveys, but
researchers conducting paper and pen interviews need to budget the cost of printing
and transporting the questionnaires to/​from the field, and the data entry and data veri-
fication phase, which also adds considerable time to the process, not to mention the cost
of errors in the final survey results. These costs can vary from one context to another;
depending on the local availability and costs of labor and copies, paper interviews could
be less expensive in some areas, while in other places they can cost more than CAPI
studies. However, once the initial investment in equipment is made, CAPI surveys are
almost certain to be less costly and more convenient for most polling companies.
There are two other common concerns related to the use of CAPI systems in handheld
devices by interviewers. The first is usability of the system, considering interviewers’ po-
tential lack of familiarity with personal computers, particularly among older and poorly
educated interviewers (Couper 2000). The second is safety concerns for the interviewers
carrying expensive equipment in the field. Both concerns are at least partially solved
with the use of an Android-​based CAPI system, such as ADGYS. Given the almost uni-
versal penetration of cell phones (and smartphones over the last few years), Android
mobile devices such as phones and even small tablets are inconspicuous when they are
carried and employed by interviewers. And almost all interviewers own and operate a
cell phone on a daily basis, so they are already familiar with the operating system and
how one of these devices works.
LAPOP’s experience with ADGYS shows that, as happens with most other consumer
electronics, younger interviewers get used to the ADGYS interface more quickly than
their older counterparts do, but in the end all interviewers are able to use the system
without difficulty. Further, we have found that the number of interviewers mugged or
robbed in the field has not increased with the use of Android devices when compared
to previous rounds of the AmericasBarometer survey, in which paper and pencil
interviews were used, so concerns about interviewer safety are unfounded.

Using ADGYS to Improve the Quality of Survey Data in LAPOP Studies

LAPOP used the ADGYS system extensively in its 2014 round of the AmericasBarometer.
The system was employed by LAPOP and its local partners in nineteen of twenty-​seven
national surveys conducted as part of that AmericasBarometer.
LAPOP’s experience with ADGYS reveals five ways in which this CAPI system can
help improve the quality of survey data. Two are defined ex ante and condition how interviewers administer the survey. The other three employ the paradata
produced by the ADGYS system to develop mechanisms for quality control.

Conditioning Ex Ante How the Survey Is Administered


There are two ways in which the use of a CAPI system on a handheld device during
the interview has improved the quality of the data from a survey. First, it displays in
electronic format the questions and response choices in a way that is much less prone
to error than paper and pen questionnaires. Second, it assigns sample segments to specific interviewers.
ADGYS displays one question at a time and does not allow interviewers to move to
the next one until a proper response has been entered for that particular item. A “proper
response” means a substantive answer to the question, a “don’t know,” or “no reply.”
Absent one of these choices, the next question cannot be asked and is not displayed on
the screen of the mobile device. This format therefore substantially mitigates the error
caused by the interviewer skipping questions or entire pages, or entering responses in
the wrong location in the questionnaire. If properly programmed, this feature of CAPI
systems can also eliminate the inconsistent response patterns that occur as a result of the
incorrect use of skips in the questionnaire by the interviewer.
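This logic is straightforward to express in code. The sketch below, with an invented three-item questionnaire and invented special codes, shows how advancing only after a "proper response" and evaluating skip conditions programmatically removes those decisions from the interviewer; it illustrates the general approach rather than ADGYS's implementation.

```python
# A minimal sketch of "one question per screen" logic: the interview cannot
# advance until a valid response (a substantive answer, "don't know," or
# "no reply") is entered, and skip conditions are evaluated automatically.
# The questionnaire content and special codes are illustrative only.
DONT_KNOW, NO_REPLY = 88, 98  # assumed special codes

QUESTIONNAIRE = [
    {"id": "sex", "text": "Respondent's sex (1=male, 2=female)", "valid": {1, 2}},
    {"id": "dv1", "text": "Domestic violence item (1-4)", "valid": {1, 2, 3, 4},
     "ask_if": lambda ans: ans.get("sex") == 2},  # asked only of women
    {"id": "trust", "text": "Trust in town/village (1-3)", "valid": {1, 2, 3}},
]

def run_interview(get_input=input):
    answers = {}
    for q in QUESTIONNAIRE:
        ask_if = q.get("ask_if")
        if ask_if is not None and not ask_if(answers):
            continue  # skip pattern applied automatically, not by the interviewer
        while True:  # cannot advance without a proper response
            raw = get_input(f"{q['text']}: ")
            try:
                code = int(raw)
            except ValueError:
                continue
            if code in q["valid"] or code in (DONT_KNOW, NO_REPLY):
                answers[q["id"]] = code
                break
    return answers
```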
Assigning specific segments of the sample to each interviewer reduces the chances
that two interviewers will cover the same area, or that one area will be left uncovered
during fieldwork. ADGYS allows gender, age, or other socioeconomic quotas to be
assigned to interviewers, which improves the chances of having an unbiased sample at
the end of fieldwork. While this form of sample and quota assignment is also possible
using paper questionnaires, it is greatly facilitated by the use of handheld devices that
only display the areas assigned to the particular interviewer.

Employing Paradata for Controlling the Quality of the Fieldwork Process
Paradata, or the data that refer to the conditions in which a specific interview was
conducted, can be automatically produced by CAPI systems and represent a valuable
opportunity to reduce error and improve the quality of the data. Paradata can be used in
at least three different forms to control data quality: accessing GPS information for each
interview, reviewing the total time of the interview, and the time for each question.
Geographical coordinates can be produced by smartphones and other handheld
devices in the field using the Global Positioning System radio (GPS) existing in most
devices. The ADGYS application turns the GPS radio on automatically, without involve-
ment of the interviewer, and records the coordinates using the satellite information
as well as cell phone signal via the device’s Assisted-​GPS or A-​GPS functions. Under
proper conditions (clear skies and a good cell phone signal), all interviews will have a
proper GPS reading recorded. This information can be used by the supervisory team to
make sure that the interviews were conducted in the place where they were supposed to
have been carried out.5
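As an illustration, a supervision team could compare each recorded coordinate against the centroid of the assigned sample segment and flag interviews that fall implausibly far away. The distance threshold and data layout in the following sketch are assumptions made for the example, not LAPOP's actual verification rule.

```python
# One way GPS paradata can support supervision: flag interviews whose recorded
# coordinates fall implausibly far from the centroid of the assigned sample
# segment. The 2 km threshold and data layout are illustrative assumptions.
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def flag_out_of_segment(interviews, segment_centroids, max_km=2.0):
    """Return IDs of interviews recorded more than max_km from their segment centroid."""
    flagged = []
    for iv in interviews:
        lat0, lon0 = segment_centroids[iv["segment"]]
        if haversine_km(iv["lat"], iv["lon"], lat0, lon0) > max_km:
            flagged.append(iv["id"])
    return flagged
```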
There are some variations in average duration times between interviewers that
can be attributed to their age cohort and familiarity with smartphone technology
(Böhme and Stöhr 2014), but in general the total duration of the interview can be
seen as a proxy for the quality of that interview. Most interviews should fall close to
the average time of a particular study (every questionnaire has a minimum duration
time, which should include the amount of time it takes to read the complete wording
of each question, plus the appropriate response time for the interviewee). Interview
time is usually recorded by CAPI systems using the device’s internal clock. ADGYS
records interview time automatically as part of the paradata recorded for each inter-
view. Interviews that fall under this minimum time, or that exceed it significantly,
should be closely scrutinized and, more often than not, be excluded from the data-
base and replaced.
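A simple duration screen of this kind might look like the following sketch, in which the minimum feasible time and the "unusually long" multiplier are illustrative values that would need to be set for each questionnaire.

```python
# A simple screen on total interview duration: interviews shorter than the
# questionnaire's minimum feasible time, or far longer than is typical, are
# flagged for review. Thresholds are illustrative assumptions.
from statistics import median

def flag_durations(durations_min, minimum_feasible=25, long_factor=3.0):
    """durations_min: dict of interview_id -> total duration in minutes."""
    typical = median(durations_min.values())
    flagged = {}
    for iv_id, minutes in durations_min.items():
        if minutes < minimum_feasible:
            flagged[iv_id] = "too short: possible fabrication or abbreviated wording"
        elif minutes > long_factor * typical:
            flagged[iv_id] = "unusually long: review interviewer notes"
    return flagged
```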
Partial question time is the number of seconds that the screen for every item in the
questionnaire was displayed. This information can be used to identify odd patterns in
the flow of the questionnaire. In some cases, it can be used to identify interviewers who
attempt to perpetrate fraud, but understand the importance of keeping their total inter-
view time within the expected range.
Partial question time can also be used for improving the quality of the questionnaire
and its design, by providing information that can be related to the time the respondent
takes to understand and answer a particular question or a series of them within a ques-
tionnaire. Mean values across a relatively large number of cases in a survey can reliably
show the flow of the interaction between interviewer and respondent during the inter-
view and suggest corrections in the design of the data collection instrument.
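For example, mean display times per item can be compared across interviewers to spot items that a given interviewer consistently "reads" far faster than the rest of the field team, a telltale sign of question shortening. The field names and the cutoff in this sketch are assumptions for illustration.

```python
# A sketch of how per-question display times might be summarized: compute the
# mean time per item across all interviews, then flag interviewer-item pairs
# whose average is well below it. Field names and the 0.5 cutoff are assumed.
from collections import defaultdict

def item_time_report(records, cutoff=0.5):
    """records: iterable of dicts with 'interviewer', 'item', and 'seconds' keys."""
    overall, by_interviewer = defaultdict(list), defaultdict(list)
    for r in records:
        overall[r["item"]].append(r["seconds"])
        by_interviewer[(r["interviewer"], r["item"])].append(r["seconds"])

    overall_mean = {item: sum(v) / len(v) for item, v in overall.items()}
    flags = []
    for (interviewer, item), secs in by_interviewer.items():
        mean_secs = sum(secs) / len(secs)
        if mean_secs < cutoff * overall_mean[item]:
            flags.append((interviewer, item, round(mean_secs, 1)))
    return flags  # candidates for question shortening or rushed administration
```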
Beyond these ways in which CAPI systems have been and are being used, uses are
also emerging that could further expand their utility. First, the increasingly large screens
on smartphones, as well as the declining costs of tablets, open many possibilities to the
survey researcher for respondent–​screen interaction. It is now possible to consider
showing the respondent small video or voice clips and then ask questions about what he
or she saw. These clips could be randomly altered for some experiments or be selected
based on prior questions in the survey. For example, if a respondent were to identify her-
self as belonging to a certain ethnic group, the video or voice clip chosen could focus on
that group. Male respondents might receive one clip, females another.
Virtually all Android devices contain cameras, of varying quality. With the permis-
sion of the respondent, photos could be taken of the home, which could then later be
coded in terms of the appearance of its quality. However, taking photos in the home
could sharply raise respondent suspicions (fear that the survey was really a ruse to set
up a future home break-​in). Therefore, one would have to proceed very carefully, and
with full respondent permission, before photos in the home could be taken. Further,
Institutional Review Board (IRB) requirements would almost certainly mandate the
removal of such photographs before the data set is publicly released.
This expansion in the possibilities of capturing different forms of paradata also
increases the potential ethical implications related to the privacy of respondents.
While informed consent from the respondent should be necessary for gathering
these data, it does not seem to be sufficient to protect the identity of respondents.
The authors of this chapter want to highlight the responsibility of the researchers for
protecting the subjects who make their research possible by their willingness to an-
swer a survey interview, and that protection depends on the anonymity of responses.
All necessary efforts should be made by both the polling company and the research
team to ensure that the anonymity of respondents is guaranteed and their identities
fully protected, even if they have agreed to the recording of certain data that could
put them at risk.

Conclusion

The experience of using a CAPI system in a large, hemisphere-wide public opinion study in the Americas offers substantial evidence of the advantages of this mode of re-
search for the quality of the data produced by surveys in developing countries. LAPOP’s
use of ADGYS offers a good example of the specific pros and cons of this mode of data
collection in survey studies.
By constraining ex ante the way in which the interviewer sees the items and the ques-
tionnaire and by forcing the interviewer to enter one response for each question, CAPI
systems reduce the chances that the interviewer might add error to the study. CAPI sys-
tems prevent the inclusion of some error that is caused by the interviewer at the moment
of conducting the interview and entering the data.
By providing information related to the conditions in which the interview was
conducted, particularly GPS coordinates and partial and total interview time,
CAPI systems provide the team in charge of a survey study with the opportunity
to exert wider control over the field process. Paradata analysis drastically reduces
the opportunities for the interviewers to select and collect data from areas not in-
cluded in the sample. Interview duration can also help control fieldwork by giving
the team in charge a better understanding of how data are really collected in the
field. As a result of these possibilities, paradata discourage fraud being committed by
interviewers.
While CAPI surveys do not solve all problems related to fieldwork or prevent all
sources of error in a survey study, they provide useful resources for improving the
quality of the data in surveys conducted in developing countries. As computer tech-
nology and cell phone infrastructure and connectivity advance swiftly, researchers
should take advantage of the increasing opportunities for improving the conditions
under which data are collected.
Notes
1. For an ample discussion of error in survey studies see Biemer et al. (1991); for a more specific
discussion of error in studies conducted in developing countries see the methodological re-
port prepared by the United Nations (2005).
2. For a review of the modes of data collection and the error associated with each of them see
Couper (2011) and Lyberg et al. (1997).
3. Some techniques that deal with this type of inconsistency have been developed and are
available to survey researchers (Herzog, Scheuren, and Winkler 2007). While the different
techniques available can improve the quality of a data set, they do so only partially and
cannot be considered a replacement for good data coding and verified data entry.
4. LAPOP surveys can be accessed via the Internet at www.lapopsurveys.org. The research
conducted by Ciudadanía is available at www.ciudadaniabolivia.org. Genso Iniciativas Web
can be visited at www.genso.com.bo.
5. There are ethical implications regarding the collection of paradata, as it could poten-
tially lead to the identification of respondents. Human subject protection standards
recommended by professional associations such as WAPOR and enforced by most insti-
tutional IRB offices require that all information that could potentially lead to the identi-
fication of the respondent of an anonymous survey (as is the case in most public opinion
studies) be removed from the public database. LAPOP and the ADGYS administration
comply with this standard and do not include GPS data or any other information that,
combined with the responses in the questionnaire, could lead to the identification of indi-
vidual respondents, their homes, or their families.

References
Banks, R., and H. Laurie. 2000. “From PAPI to CAPI: The Case of the British Household Panel
Survey.” Social Science Computer Review 18 (4): 397–​406.
Biemer, P., R. Groves, L. Lyberg, N. Mathiowetz, and S. Sudman, eds. 1991. Measurement Errors
in Surveys. New York: Wiley.
Böhme, M., and T. Stöhr. 2014. “Household Interview Duration Analysis in CAPI Survey
Management.” Field Methods 26 (4): 390–​405.
Caviglia-​Harris, J., S. Hall, K. Mullan, C. Macintyre, S. C. Bauch, D. Harris, . . . H. Cha. 2012.
“Improving Household Surveys Through Computer-​ Assisted Data Collection:  Use of
Touch-​Screen Laptops in Challenging Environments.” Field Methods 24 (1): 74–​94.
Couper, M. 2000. “Usability Evaluation of Computer-​Assisted Survey Instruments.” Social
Science Computer Review 18 (4): 384–​396.
Couper, M. 2005. “Technology Trends in Survey Data Collection.” Social Science Computer
Review 23 (4): 486–​501.
Couper, M. 2011. “The Future of Modes of Data Collection.” Public Opinion Quarterly 75
(5): 889–​908.
De Leeuw, E., J. Hox, and G. Snijkers. 1998. “The Effect of Computer-Assisted Interviewing on Data Quality: A Review.” In Market Research and Information Technology: Application and Innovation, edited by B. Blyth. Amsterdam: ESOMAR.
Lyberg, L., P. Biemer, M. Collins, E. De Leeuw, C. Dippo, N. Schwarz, and D. Trewin, eds. 1997.
Survey Measurement and Process Quality. New York: Wiley Interscience.
Olson, K. 2013. “Paradata for Nonresponse Adjustment.” Annals of the American Academy of
Political and Social Sciences 645: 142–​170.
Shirima, K., O. Mukasa, J. Armstrong-​Schellenberg, F. Manzi, D. John, A. Mushi,  .  .  .  D.
Schellenberg. 2007. “The Use of Personal Digital Assistants for Data Entry at the Point
of Collection in a Large Household Survey in Southern Tanzania.” Emerging Themes in
Epidemiology 4 (5).
Smith, T., and S. Kim. 2003. “A Review of CAPI-​Effects on the 2002 General Social Survey.” GSS
Methodological Report 98.
Tourangeau, R. 2005. “Survey Research and Societal Change.” Annual Review of Psychology
55: 775–​801.
United Nations. 2005. Household Sample Surveys in Developing and Transition Countries.
New York: United Nations.
van Heerden, A., S. Norris, S. Tollman, and L. Richter. 2014. “Collecting Health Research
Data:  Comparing Mobile Phone-​Assisted Personal Interviewing to Paper-​and-​Pen Data
Collection.” Field Methods 26 (4): 307–​321.
Chapter 11

Survey Research in the Arab World

Lindsay J. Benstead
Introduction

Survey research has expanded in the Arab world since the first surveys were conducted
there in the late 1980s.1 Implemented in authoritarian regimes undergoing political
liberalization, early studies conducted by research institutes and scholars broke new
ground. At the same time, they also left many theoretical and policy-​related questions
unanswered. Survey items measuring topics such as vote choice, support for illegal or re-
pressed Islamist movements, and beliefs about some unelected government institutions
were not included in early questionnaires due to political sensitivity.2 Over time, how-
ever, additional countries and numerous questions on gender attitudes, corruption, and
attitudes toward the West were added. By 2010, on the eve of the Arab spring, at least
thirty surveys had been fielded in thirteen Arab countries, Turkey, and Iran, increasing
the total number of surveys included in the Carnegie data set (Tessler 2016) from two
in 1988 to thirty in 2010 (see Figure 11.1).
The Arab spring marked a watershed for survey research. Surveys were conducted for
the first time in two countries—​Tunisia and Libya—​following their revolutions. Tunisia
in particular became rich terrain for social scientists as it transitioned to a minimalist
democracy. Countries such as Morocco and Jordan experienced more limited polit-
ical reform, but public opinion also reacted to regional changes. Support for democ-
racy, for example, declined in several countries, enlivening new scholarly and policy
debates about the processes shaping attitudes toward regimes in transition (Benstead
and Snyder 2016).3
Indeed, in some cases opportunities to conduct surveys were fleeting. Egypt returned
to authoritarian rule, and civil war continued in Libya. Yet dozens of underexploited
[Figure 11.1 appears here: two bar charts, “Frequency of surveys conducted in 17 MENA countries” and “Cumulative number of surveys conducted in 17 MENA countries,” plotted by year from 1988 to 2014.]

Figure 11.1  Carnegie Middle East Governance and Islam Dataset Surveys as of Mid-2014.

Figure 11.1 shows the growth of survey research in the Middle East and North Africa (MENA) region as shown by the countries included in the Carnegie Middle East Governance and Islam Dataset (Tessler 2016). See http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/32302.

data sets exist for political scientists to address theoretical and policy questions. As
shown in Table 11.1, almost every Arab country is now included in at least one wave of
a major cross-​national survey, including the World Values Survey, Afrobarometer, and
Arab Barometer (Jamal and Tessler 2008; Tessler, Jamal, and Robbins 2012). Numerous
other projects, including the Transitional Governance Project (2016) and the Program
on Governance and Local Development (2016) surveys, greatly increase our ability to
assess data quality, because of replication.4
Interest in survey research is also increasing among scholars of Middle Eastern social
science. The number of papers using survey data that have been presented at the Middle
East Studies Association (2016) annual meetings increased from twelve in 2009 to thirty-​
three in 2016, as shown in Figure 11.2, an almost threefold increase. Survey experiments
combining random probability sampling with random assignment to conditions
involving different photos, question wording, frames, or endorsements are increasingly
popular (Benstead, Jamal, and Lust 2015; Benstead, Kao, and Lust 2014; Bush and Jamal
2014; Corstange and Marinov 2012; Corstange 2014; Shelef and Zeira 2015).
Table 11.1 Nationally Representative Surveys Conducted in the Arab World

Country | World Values Survey | Arab Barometer | Afrobarometer | Transitional Governance Project (TGP) | Program on Governance and Local Development (GLD)
Morocco/Western Sahara | 2001, 2007 & 2011 | 2006 (Wave 1) & 2013–2014 (Wave 3) | 2013 & 2017 | – | –
Algeria | 2002 & 2013 | 2006 (Wave 1), 2011 (Wave 2) & 2013 (Wave 3) | 2013 & 2017 | – | –
Tunisia | 2013 | 2011 (Wave 2) & 2013 (Wave 3) | 2013 & 2017 | 2012 & 2014 | 2015
Libya | 2014 | 2014 (Wave 3) | – | – | –
Egypt | 2001, 2008 & 2013 | 2011 (Wave 2) & 2013 (Wave 3) | 2013 & 2017 | 2011 & 2012 | –
Jordan | 2001, 2007 & 2014 | 2006 (Wave 1), 2010 (Wave 2) & 2012–2013 (Wave 3) | – | – | 2014
Iraq | 2004, 2006 & 2012 | 2011 (Wave 2) & 2013 (Wave 3) | – | – | –
Syria | – | – | – | – | –
Palestinian Territories | 2013 | 2006 (Wave 1), 2010 (Wave 2) & 2012 (Wave 3) | – | – | –
Lebanon | 2013 | 2007 (Wave 1), 2011 (Wave 2) & 2013 (Wave 3) | – | – | –
Kuwait | 2014 | 2014 (Wave 3) | – | – | –
Qatar | 2010 | – | – | – | –
United Arab Emirates | – | – | – | – | –
Bahrain | 2014 | 2009 (Wave 9)¹ | – | – | –
Oman | – | – | – | – | –
Saudi Arabia | 2003 | 2011 (Wave 2) | – | – | –
Yemen | 2014 | 2007 (Wave 1), 2011 (Wave 2) & 2013 (Wave 3) | – | – | –
Sudan | – | 2010–2011 (Wave 2) & 2013 (Wave 3) | 2013 | – | –

¹ Small sample of 500, listed in Carnegie (2016) documentation.
[Figure 11.2 appears here: a bar chart of the number of survey-based papers presented at MESA annual meetings, by year from 2009 to 2016.]

Figure 11.2  Survey Research Presented at Middle East Studies Association Meetings.

Figure 11.2 shows the growth in number of papers using surveys presented at MESA annual meetings between 2009 and 2016. It is based on a search for the term “survey” in abstracts, where the term refers to public opinion surveys rather than surveys of archival material or other quantitative methodologies. See https://mesana.org/mymesa/meeting_program.php.

However, future attention to data quality is needed, including honest discussions about the extent of, sources of, and solutions for quality issues such as high rates of missingness, family members present during the interview, and sampling error. Regionally
specific issues—​especially the controversial nature of the survey topics and the contin-
uation of authoritarianism or instability in many countries—​raise concerns about so-
cial desirability and underscore the need for methodological research. Technological
advances, including computer assisted personal interviewing (CAPI) using laptop and
tablet computers, are increasing (Benstead, Kao, Landry, et al. forthcoming) and offer
possibilities for real-​time monitoring and methodological research that could prove
crucial for improving the quality of data sets. Yet apart from a handful of studies on
interviewer effects, anchoring vignettes, and a few other topics, almost no research
systematically assesses the impact of the survey methods used on data quality in the
Arab world.
Advances in survey research also bring new ethical challenges in a region where
concerns about protecting subjects have always been important. In more democratic
spaces like Tunisia, work is needed to improve quality as well as promote public under-
standing and acceptance of polls and the role they can play in democracies. This chapter
draws on insights gleaned from the author’s experience conducting surveys in Morocco,
Algeria, Tunisia, Libya, Jordan, and Malawi and frames a substantive and methodological
research agenda for utilizing and advancing social science surveying in the Arab world.
Assessment of Data Quality

The cumulative body of research described in Table 11.1 and Figure 11.1 was conducted
by numerous scholars and research institutes as part of several cross-​national surveys.
While this accumulation of surveys raises questions about data quality, very few system-
atic efforts have been made to assess the surveys’ comparability.
One approach to assessing data quality is to compare survey findings across studies
conducted at similar times. To this end, Figure 11.3 shows the mean level of disagree-
ment that democracy is the best form of government for all Arab countries in the
Carnegie data set (Tessler 2016), Transitional Governance Project (TGP 2016), and
Program on Governance and Development (GLD 2016), as long as at least two surveys
have been conducted in a given country. The data show a high degree of comparability
in the results across the surveys—​perhaps more than expected. For example, in 2011 in
Egypt, the Arab Barometer estimated a mean of 1.9 for disagreement that democracy
is best, while one year earlier the World Values Survey found a mean of 1.5 (a 0.4-​unit
difference). This was the largest such difference in a one-​year period.
In general, however, very few surveys conducted within a one-​year period showed
large fluctuations in attitudes toward democracy. The 2014 TGP survey in Tunisia
estimated mean disagreement to be 2.1. A  year earlier, Arab Barometer researchers
estimated it to be 1.8, while the 2012 TGP survey in Tunisia found a mean of 1.7. This
shift may reflect a trend of decreasing support for democracy in transitional Tunisia
(Benstead and Snyder 2016).
Other studies show limited change over time, though as noted and shown in Figure
11.3, there is a general trend of declining support for democracy in the Arab region since
the early 2000s. The 2007 Arab Barometer in Yemen found 1.9, while a year earlier the
World Values Survey estimated 1.8. The 2006 Arab Barometer found the mean level
of disagreement to be 1.5 in Morocco, while a year earlier the Tessler National Science
Foundation survey (Tessler 2016) estimated disagreement to be 1.7; the GLD estimated
it to be 2.0 in Jordan in 2010, while a year later the Arab Barometer found it to be 1.9.
A more comprehensive set of comparisons should be done, but these findings are an
encouraging example of the comparability of data sets now available to answer research
questions.
More concern is warranted when it comes to missing data, the amount of which is
high in some surveys. Missingness greatly decreases the number of observations in
analyses of Arab public opinion, reducing the efficiency of estimates and possibly also
biasing coefficients. More than half of the surveys had 10% or fewer cases missing.
However, 36% of responses were missing in Morocco in 2011, while 30% were also
missing in Morocco in 2005. Missingness is also particularly high (over 20%) for the
disagreement with democracy question in some surveys in Tunisia, Saudi Arabia, Iraq,
and Algeria.
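Comparisons of this kind are straightforward to reproduce once the surveys are pooled. The following sketch computes, for each survey, the mean of the democracy item and the share of missing responses, as summarized in Figure 11.3; the column names are assumptions about how a pooled file might be organized, not the actual variable names in these data sets.

```python
# A sketch of the comparison summarized in Figure 11.3: for each survey, the
# mean of the "democracy is best" item (1 = strongly agree to 4 = strongly
# disagree) and the share of missing responses. Column names are assumed.
import pandas as pd

def democracy_quality_table(df, survey_col="survey", item_col="dem_best"):
    grouped = df.groupby(survey_col)[item_col]
    return pd.DataFrame({
        "mean_disagreement": grouped.mean().round(2),   # NaN excluded by default
        "pct_missing": (grouped.apply(lambda s: s.isna().mean()) * 100).round(1),
        "n": grouped.size(),                            # includes missing cases
    })
```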
[Figure 11.3 appears here: two panels listing each country-survey-year (e.g., Libya AB 2014, Tunisia TGP 2014, Jordan WVS 2001), with the mean level of rejection of democracy in the left panel and the percentage of missing responses in the right panel.]

Figure 11.3  Mean Rejection of Democracy and Proportion of Missing Responses.

Figure 11.3 shows mean level of rejection of democracy, where a higher number is lower support for democracy: “Despite its problems, democracy is the best form of government. Strongly agree = 1 to strongly disagree = 4.” Source: Tessler (2016); GLD (2016); and TGP (2016). Data are unweighted. This question was not asked in the Afrobarometer (2016).
New Directions in Theoretical and Policy Research

Much existing literature drawn from public opinion surveys focuses on a few topics,
leaving many other research questions underexplored. Table 11.2 lists topics included
in the Arab Barometer (Wave 1)  and provides a good representation of the types of
questions that have been repeated in other surveys. The most popular topic in Arab
public opinion literature examines support for democracy (Tessler 2002a, 2002b; Tessler,
Jamal, and Robbins 2012; Tessler and Gao 2005; Tezcür et al. 2012; Ciftci 2013; Tessler,
Moaddel, and Inglehart 2006; Benstead 2015; Benstead and Snyder 2016; Benstead and
Tessler 2016).5 Attitudes toward gender equality and social trust have also received some
coverage. However, much less work has explored the development of and cross-​national
differences in political values, including why citizens are more or less accepting of po-
litical competition and debate or want to see different rights and freedoms included in
their constitution, which was later included in the second and third waves of the Arab
Barometer.
Many projects shed light on the gender gap in civil society participation (Bernick
and Ciftci 2014) or examine political participation, especially as it relates to the
interrelationships among civil society membership, social trust, and support for democ-
racy in authoritarian regimes (Jamal 2007a, 2007b). Some research has also examined
boycotting (Benstead and Reif 2015, 2016). However, limited research examines voter
choice—​for example, why voters support Islamist, secular, or other parties—​or media
consumption—​such as why citizens choose different media sources and how that choice
shapes their attitudes.
Other researchers have used the Arab Barometer and related surveys to examine
citizens’ perceptions of and experiences with the state. For instance, some litera-
ture examines how perceptions of government performance and experiences with
corruption and clientelism shape support for democracy (Benstead and Atkeson
2011). A  limited number of studies also assess citizens’ access to services (Program
on Governance and Local Development 2015; Benstead 2016b), as well as the degree
to which women and other minorities are able to contact parliamentarians and local
councilors to access services (Abdel-​Samad and Benstead 2016; Benstead 2015, 2016a,
2016b). At the same time, there is still a need to understand how clientelism and
corruption affect citizens’ interpersonal trust and confidence in state institutions and
how these outcomes affect demand for freer elections (Benstead and Atkeson 2011).
Some studies also examine values and identity, with most of this literature focusing
on gender equality (Alexander and Welzel 2011; Norris 2009) and identity (Benstead
and Reif 2013). Yet there is a dearth of research that examines and explains social and po-
litical values in the Arab world, like levels of tolerance, religiosity, and attitudes toward
controversial issues such as lotteries, women’s dress, apostasy, political rights of religious
and ethnic minorities, and state violations of human rights in the name of security.
Table 11.2 Topics in the Arab Barometer (Wave 1)
(a. Topics and literature; b. Theoretical and policy questions)

Attitudes toward political regimes

a. Preferences for political regimes (Tessler 2002a, 2002b; Tessler, Jamal & Robbins 2012; Tessler & Gao 2005; Tezcür et al. 2012; Ciftci 2013; Benstead 2015)
b. Why does support for democracy develop and change? Why do citizens define democracy differently? Why do citizens demand secular versus religious democracy? Why are some political and economic reforms likely to be more effective than others for strengthening support for democracy?

a. Political values
b. Why are some citizens more supportive of greater political competition and debate?

Political participation

a. Civil society membership, political knowledge, and engagement (Jamal 2007a, 2007b; Bernick & Ciftci 2014)
b. How does civic participation relate to trust, government legitimacy, and support for democracy? What explains participation in campaign rallies, petitions, and protests, including gender gaps in these forms of engagement? Why do gender gaps exist in political knowledge, and how does this impact participation?

a. Voting
b. Why do voters support Islamist, secular, or other parties, and what explains why some voters switch their support in subsequent elections? What are the extent and impact of vote buying and clientelism? Are men or women more or less likely to sell their votes or to vote based on clientelistic relationships?

a. Political knowledge and the media
b. Who consumes various media sources, and how does this choice impact values and partisanship?

Citizen engagement with the state and social institutions

a. Institutional trust and perceptions of government performance (Benstead & Atkeson 2011)
b. Why do some citizens evaluate government performance more positively than others? To what extent do citizens see their governments as democratic? Why do evaluations of government performance change over time? How do clientelism and corruption affect social trust, regime legitimacy, and support for democracy?

a. Governance and service provision (Benstead et al. 2015). State-society linkages and representation (Abdel-Samad & Benstead 2016; Benstead 2016b)
b. What explains effectiveness and equity in access to services, such as security, dispute resolution, healthcare, and education?

Individual orientations and identity

a. Gender equality (Alexander & Welzel 2011; Norris 2009)
b. What explains attitudes toward different dimensions of gender inequality, such as women’s status, mobility, wages, and political involvement?

a. Identity (Benstead & Reif 2013)
b. How does identity shape culture and political attitudes?

a. Tolerance and values
b. Why are some citizens more supportive of greater political competition and debate?

a. Religiosity and interpretations of Islam
b. Why does religiosity vary within and across societies? What are individuals’ views on matters such as lotteries, women’s dress, apostasy, Islam and democracy, and minority political rights?

a. Controversial issues
b. To what extent does the public accept state violations of human rights to achieve security?

International affairs

a. Attitudes about international and regional issues (Tessler & Robbins 2007; Benstead & Reif 2016; Tessler & Warriner 1997; Tessler, Jamal & Robbins 2012; Tessler, Moaddel & Inglehart 2006)
b. To what extent do citizens see foreign countries like Iran and the United States as democratic? How do they evaluate the Arab League and other international organizations? Why do citizens assess differently the reasons for economic and political challenges in the Arab world? Do citizens support armed operations against the United States elsewhere? Why do anti- and pro-American attitudes vary across the Arab world? To what extent do citizens support a two-state solution in Israel/Palestine? How does living in Western countries impact social and political attitudes?

Attitudes about international and regional issues have been the subject of some
studies (e.g., Tessler and Robbins 2007; Benstead and Reif 2016; Tessler and Warriner
1997), but despite their timeliness, much more work should be done on attitudes toward
other international issues and bodies like the Arab League. Research might also ex-
plore how citizens assess the reasons for economic and political challenges in the Arab
world, their perceptions of the motivations for and effectiveness of U.S. democracy-​
promotion efforts, the extent to which citizens support a two-​state solution in Israel
and Palestine, and how living in Western countries impacts social and political
attitudes.
In addition, since the Arab uprisings, a number of new questions have been added
to the Arab Barometer and surveys such as the TGP and GLD, which offer snapshots of
Survey Research in the Arab World    229

transitional politics in Tunisia, Libya, and Egypt. With these surveys, scholars might ex-
plore the following questions:

• What explains voter behavior and support for Islamist and non-​Islamist parties?
• How do regimes reconsolidate in transitions? Are the same voters engaged before
and after the revolution?
• What explains who protested in the Arab uprisings and why?
• What explains electability of candidates with different identities, including gender,
ethnicity, and political ideologies?
• To what extent does vote buying exist, and under what conditions will citizens re-
spond to clientelistic and programmatic appeals?

Survey Research Challenges in the Arab World

To answer these questions, it is critical to understand challenges that arise when conducting research in the Arab world and, when possible, to conduct methodological
research needed to improve data quality. While the data quality assessment depicted in
Figure 11.3 offers cause for confidence in existing research, it also highlights problems of
missing data. Other data quality problems may exist as well. Table 11.3 summarizes these
challenges and makes recommendations for assessing quality issues and improving data
quality.

The Survey Genre
The survey genre is still unfamiliar to many respondents in some authoritarian regimes
and transitional democracies. This may create unique challenges for obtaining a repre-
sentative sample. For example, having lived under dictatorship and participated little in
formal politics throughout her life, an elderly Libyan woman answering a survey for the
first time may suggest the interviewer speak with her husband or son instead of to her.
Others unfamiliar with standardized surveys may wonder why they have been selected
or may find the survey interaction unnatural, particularly when interviewers employ
techniques such as probing nonresponse by repeating the question exactly as worded.
This may lead to lower participation rates among some subpopulations.
These challenges may be addressed through an introductory script explaining
the sampling and question-​asking procedure and reminding the respondent that
there are no right or wrong answers.6 However, the impact of scripts on data quality
(e.g., participation and item response rates) should be studied through experiments
as well as behavior coding.
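To illustrate how such an experiment might be analyzed, the following minimal sketch (in Python, with invented counts and illustrative variable names not drawn from any study cited here) compares participation rates between households randomly assigned to a standard or an expanded introductory script.

import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Invented completion counts and contact attempts by script condition.
completed = np.array([412, 455])   # [standard script, expanded script]
contacted = np.array([600, 610])

stat, pval = proportions_ztest(count=completed, nobs=contacted)
rates = completed / contacted
print(f"Participation rate, standard script: {rates[0]:.3f}")
print(f"Participation rate, expanded script: {rates[1]:.3f}")
print(f"Two-proportion z-test: z = {stat:.2f}, p = {pval:.3f}")

Item response rates could be compared in the same way, and behavior coding would add information about how the scripts are actually delivered in the field.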
Training, Capacity, and Monitoring


The newness of the survey genre and the high levels of missingness in Middle East and
North Africa (MENA) region surveys underscore the importance of training interviewers
in standard methods such as probing and feedback—​techniques designed to encourage
the respondent to give thoughtful responses to the survey questions without offering
other information, thereby allowing the interviewer to move efficiently through the ques-
tionnaire and obtain responses to as many questions as possible. Yet the extent of training
and capacity-​building varies. Some firms do not train and may not adequately control
interviewers or implement agreed-​upon methodology. Reports from researchers suggest
that some firms or interviewers sample in public places rather than households; use quota
sampling or improperly implement Kish tables; or ask only a portion of the questionnaire,
filling in the remaining questions after leaving the interview. Poorly monitored interviewers
may falsify data to support their political views. To avoid these problems, before contracting,
research firms should be interviewed about their procedures, and researchers should col-
laborate closely with firms throughout the fieldwork process to support capacity building.
The CAPI procedures provide a new means of monitoring data immediately after data
collection and thus of identifying the small number of interviewers who may be generating
many of the quality problems (Benstead, Kao, Landry, et al. forthcoming).
The extent and impact of data falsification and shortcuts in the survey process on data
quality are unknown and should be studied. Even firms and teams conducting high-​
quality research may benefit from additional capacity building with specialists from
other Arab or Western countries. Randomized studies testing the impact of training and
supervision are also needed, as summarized in Table 11.3.
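As a rough illustration of the kind of post-collection monitoring CAPI makes possible, the sketch below (Python, with invented paradata and hypothetical column names) flags interviewers whose interviews are unusually short or have unusually high item missingness, two common signatures of shortcuts or fabrication.

import pandas as pd

# Invented paradata: one row per completed CAPI interview.
interviews = pd.DataFrame({
    "interviewer_id": ["A01", "A01", "A02", "A02", "A03", "A03"],
    "duration_min":   [42, 39, 12, 14, 45, 41],
    "items_missing":  [3, 5, 0, 1, 28, 31],
    "items_total":    [120] * 6,
})
interviews["missing_rate"] = interviews["items_missing"] / interviews["items_total"]

# Summarize by interviewer and flag outliers relative to the field-wide distribution
# (quartile-based cutoffs here are purely illustrative).
summary = interviews.groupby("interviewer_id").agg(
    n_interviews=("duration_min", "size"),
    mean_duration=("duration_min", "mean"),
    mean_missing_rate=("missing_rate", "mean"),
)
summary["flag_short"] = summary["mean_duration"] < interviews["duration_min"].quantile(0.25)
summary["flag_missing"] = summary["mean_missing_rate"] > interviews["missing_rate"].quantile(0.75)
print(summary)

Flags of this kind identify interviewers for follow-up verification; they do not by themselves establish falsification.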

Weak Incentives to Improve Quality


Varying incentive structures across research groups and countries may help account for un-
even quality. For-​profit organizations may have weaker incentives to implement stringent
sampling and control procedures than do nonprofit groups. This is a particular risk when
authoritarian governments require research authorization or when political connections
are needed to obtain authorization because it limits competition between firms.
Unprofessional firms may shape results to favor the political party paying for the research.7
Academic centers and for-​profit firms in countries with greater political freedom and more
competitive business environments may conduct high-​quality research, but attention to
data quality and better understanding of survey error can help support best practices.

The Survey Environment


Violence and instability pose challenges and sometimes make research impossible.8
Even in more stable countries like Tunisia, some sampled units must be replaced when
interviewers cannot safely access the areas, potentially generating concerns about sampling error.

Table 11.3 Survey Research Challenges and Recommendations

The survey genre and general environment
  Challenge: Some respondents are unfamiliar with surveys.
  Recommendation: Develop introductory scripts to explain sampling and question-asking procedures. Train interviewers in standard probing and feedback to guide the respondent through the interaction. Evaluate the impact of introductory scripts, probing, and feedback on data quality through behavior coding, cognitive interviewing, and experimental research.

  Challenge: Some potential participants experience fear.
  Recommendation: Fears could be allayed through increased professionalism (e.g., name badges, tablet computers), introductory scripts explaining confidentiality practices, or CAPI. Introductory scripts should be used emphasizing that participation is voluntary and confidential and participants are not obligated to answer any question with which they are uncomfortable.

Training, capacity, and monitoring
  Challenge: Interviewer training (e.g., sampling, standard question asking, probing, clarification, feedback) and monitoring may not be sufficient.
  Recommendation: Take a hands-on approach to training interviewers and survey implementation if possible. Study impact of training and monitoring on measurement and representation error.

Weak incentives to improve quality
  Challenge: Some firms may have weak incentives to conduct quality research.
  Recommendation: Carefully evaluate firms by interviewing them, ideally at their offices, before contracting. Communicate with other researchers about past experience with firms.

The survey environment: insecurity, diglossia, and interviewer-respondent relationship
  Challenge: Interviewers may use ad hoc translations or deviate from standard question asking.
  Recommendation: A best practice is to write questionnaires in darja and train interviewers in standard question-asking methods, requiring them to read questions exactly as worded. Back-translation, pretesting, and if possible, behavior coding and cognitive interviewing, are needed to improve questionnaires. Studies should examine the impact of language practices, as well as training in standard probing, clarification, and feedback on data quality. These techniques also enrich discussions of causal mechanisms in research reports and offer possibilities for publishing on methodological topics.

  Challenge: Interviewers may know respondents.
  Recommendation: Test the impact of respondent-interviewer relationships on data quality, including by examining the results of list experiments comparing results across groups of respondents who know and do not know the interviewer.

And in settings like Lebanon, not all firms have access to areas controlled
by local groups. For instance, in refugee camps, respondents may give incorrect infor-
mation if they believe the survey firm is acting out of “entitlement” or does not have a full
picture of the refugee community’s history or circumstances. In countries like Libya and
Lebanon, where there are patchworks of local authorities, firms must have buy-​in from
local leaders if they want to obtain a representative sample of the country as a whole.
Other features unique to the Arab world and developing areas exist as well. Diglossia and
high in-​group solidarity present challenges for survey research because of their potential to
produce measurement and representation error. Diglossia exists when a language commu-
nity uses a lower form of speech—​the spoken dialect—​alongside a higher, written form of
language used in education, government, or business (Ferguson 1959). In Arab countries cit-
izens speak dialects (e.g., Tunisian darja) and use Modern Standard Arabic (MSA) in written
documents, including forms and questionnaires, resulting in a diglossic environment.
Survey researchers in Arab societies must choose and often debate which language
to use. Often the questionnaire is written in the spoken dialect. However, this prac-
tice has drawbacks, as darja is not a codified language and thus, when written, may
not convey the formality or importance of the study. But reading questions in MSA is
awkward because respondents with less formal education in Arabic or native speakers
of Tamazight,9 in particular, may not discuss or think about politics in MSA. In some
instances, Arabic dialects also vary within countries.
Given these considerations, the best practice is frequently to write questionnaires in
the spoken dialect and train interviewers to follow standard question-​asking methods,
requiring them to read questions exactly as worded. Pretesting, and if possible cognitive
interviewing, are also critical for evaluating translations. Yet diversity in spoken language
and divergence from standard interviewing practices, including informal translations of
definitions, present unexplored consequences for data quality. In reality, almost nothing
is known about how questions are read in the field, especially in the absence of inter-
viewer training in standardized interviewing techniques and monitoring (i.e., behavior
coding), which is an unexplored area for methodological research (see Figure 11.3).
Other concerns relate to standard definitions (i.e., Q x Qs), which are rarely used in
the region’s surveys.10 Some interviewers report that they explain the meaning of the
question to the respondent without formal Q x Qs. It is likely that complex terms like
democracy, in the absence of Q x Qs, are defined or explained differently by individual
interviewers, which leads to measurement error. Researchers should use experimental
methods to examine the impact of different language practices, as well as training in
standard probing, clarification, and feedback, on data quality.
Another feature of survey research in Arab societies is that respondents and
interviewers often know one another, especially in rural areas, in part because
interviewers are commonly recruited from regions in the sample, and communities
tend to be tight-​knit. In the 2012 TGP in Tunisia, 7% of interviews were conducted by
an interviewer who knew the respondent; in the 2014 wave, the figure was 12%. These
differences stem from interviewer recruitment methods. In 2014 researchers were
recruited from sampled areas, rather than from a national call for applications via
networks of youth organizations.
Employing interviewers from a given region has a number of advantages. First,
interviewers can more easily locate sampled blocks and manage transportation and
lodging. Second, interviewers will purportedly be more able to establish trust, allay fear
and suspicion, and obtain higher response rates. Interviewers will also be safer. In cer-
tain neighborhoods in Tunis, for example, outsiders are unwelcome.11
Yet how and why social networks affect data quality, including refusal and item
nonresponse and measurement error (e.g., social desirability and conformity bias),
have not been explored.12 Citizens of authoritarian and transitional regimes may fear
researchers are collecting information on behalf of the state. Some firms do. Or they
may suspect political parties or other organizations are doing the same, through secretly
recording details in the home such as alcohol possession, as a means to report on a
society’s morals. Reportedly, a religious organization did this in Tunisia.
Still, the claim that the interviewer should know participants—​directly or indi-
rectly—to conduct quality research raises important questions. Are respondents
more or less likely to report sensitive data if the interviewer and respondent know one
another? What measurement or sampling errors arise from interviewer-​respondent
networks and insider-​outsider statuses of interviewers? Without methodological
research, it is difficult to know how and why survey responses and nonresponse are
affected by these factors. And it is likely that the answer depends on the circumstances.
For example, it is possible that using interviewers from the region could induce lower
reporting of socially desirable behaviors and attitudes, while in others it could lead
to more truthful answers. One way to test this would be through a list experiment
comparing the findings across interviewers who knew or did not know the respondent
or to randomize interviewers to respondents or households.
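A minimal sketch of the basic list-experiment estimator, computed separately for respondents who do and do not know the interviewer, is shown below (Python; the data and column names are invented for illustration, and a real analysis would require a much larger sample and design-based standard errors).

import pandas as pd

# Invented list-experiment data: 'count' is the number of list items endorsed,
# 'treat' marks the long list containing the sensitive item, and 'knows_interviewer'
# records whether the respondent reported knowing the interviewer.
df = pd.DataFrame({
    "count":             [2, 3, 1, 2, 3, 4, 2, 3, 1, 2, 2, 3],
    "treat":             [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "knows_interviewer": [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1],
})

def list_estimate(group):
    # The difference in mean item counts between long-list and short-list respondents
    # estimates the prevalence of the sensitive item in that group.
    treated = group.loc[group["treat"] == 1, "count"].mean()
    control = group.loc[group["treat"] == 0, "count"].mean()
    return treated - control

for knows, group in df.groupby("knows_interviewer"):
    label = "knows interviewer" if knows else "does not know interviewer"
    print(f"Estimated prevalence ({label}): {list_estimate(group):.2f}")

A gap between the two estimates would suggest that familiarity with the interviewer changes reporting of the sensitive item.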

Methodological Research Agenda

Methodological studies are needed not only to pinpoint and minimize total survey error
(Groves et al. 2009), but also to understand social processes such as intergroup conflict
(Benstead 2014a, 2014b; Koker 2009). This section and Table 11.4 summarize methodo-
logical research that can be conducted to measure, understand, and mitigate two main
components of total survey error: measurement and representation error.

Measurement Error
While some research currently examines measurement error in MENA surveys, much
more is needed. Measurement error can stem from the questionnaire and the interviewer-​
respondent interaction, including individuals’ observable traits such as gender or dress
style. It can also result from researchers rushing questionnaires into the field, often
without back translation or pretesting, let  alone using time-​consuming but valuable
techniques such as cognitive interviewing and behavior coding.

Table 11.4 Suggested Methodological Research by Source of Total Survey Error*

Measurement
  Measurement error arising from the instrument: Cognitive interviewing and behavior coding; anchoring vignettes
  Measurement error arising from interviewer traits: Recording, reporting, and controlling for observable and nonobservable interviewer traits

Representation
  Coverage error: Compare coverage error across modes and sampling approaches
  Sampling error: Test impact of sampling approaches and degree of sampling discretion given to interviewers
  Nonresponse error: Examine nonparticipation and item nonresponse

* Table 11.4 summarizes the components of total survey error discussed in the chapter.

As a consequence,
respondents may feel frustrated by poorly written questions that fail to offer a response
choice capturing their views. Interviewers may paraphrase clumsy questions or shorten
long questions, deviating from standard processes and producing measurement error.
In addition to pretesting and focus groups, one way to address these challenges is through
behavior coding. While costly, this technique offers unparalleled information for refining
the questionnaire and developing qualitative sections of research about the survey findings.
Behavior coding was developed by Cannell in the 1970s (Willis 1999) as a method for
evaluating interviewer performance and pretesting questionnaires. A method by which an
observer records observations about the interview without interacting with the respondent,
it is used to record interviewer behavior (e.g., question asking, probing, clarification,
feedback) and, if desired, respondent behavior (e.g., asks for clarification; answers “don’t
know”). Behavior coding can be implemented live, with a second individual recording data
about the interaction while the interviewer conducts the interview, or through recording
and subsequent coding of the interaction. It allows researchers to identify interviewers who
need more training and questions respondents find difficult or unclear.
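As a simple illustration of how recorded behavior codes might be tabulated, the sketch below (Python, with invented codes and hypothetical column names) computes the share of problematic administrations by question and by interviewer.

import pandas as pd

# Invented behavior-coding records: one row per question administration, with a code
# such as 'exact' (read exactly as worded), 'major_change' (question altered), or
# 'resp_clarification' (respondent asked for clarification).
codes = pd.DataFrame({
    "interviewer_id": ["A01", "A01", "A02", "A02", "A02", "A03"],
    "question":       ["Q12", "Q13", "Q12", "Q13", "Q12", "Q13"],
    "code": ["exact", "major_change", "exact", "resp_clarification",
             "major_change", "exact"],
})

codes["problem"] = codes["code"].isin(["major_change", "resp_clarification"])
print(codes.groupby("question")["problem"].mean())        # questions needing revision
print(codes.groupby("interviewer_id")["problem"].mean())  # interviewers needing retraining

High problem rates concentrated in particular questions point to wording revisions, while rates concentrated in particular interviewers point to retraining.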
Cognitive interviewing differs from behavior coding in that it involves interaction
between the observer and the respondent, who are asked questions about their thought
processes when answering survey questions. Cognitive interviewing helps pinpoint
problems of comprehension, reveals the meaning of concepts and terms to the inter-
viewer, and avoids question problems such as excessive cognitive burden. Like beha-
vior coding, cognitive interviewing can be recorded by a second person during the
interview or subsequently by using audio or video recordings of the interview.13
There are two main cognitive interviewing techniques:  think-​aloud and verbal
probing (Willis 1999). Stemming from the research of Ericsson and Simon (1993), think-​
aloud prompts respondents to verbalize thought processes as they respond to the ques-
tionnaire. Verbal probing, in contrast, involves direct questions asked of the respondent
after he or she answers each question. Table 11.5 offers examples of verbal probing for the
MENA context. For example, the interviewer may ask, “What does democracy mean to
you?” in order to elicit more about how respondents define the term and whether they
are thinking of similar things. He or she may ask, “How did you get to that answer?” to
learn more about the respondent’s circumstances and to assess cognitive burden.
Cognitive interviewing and behavior coding have not been used extensively in the
Arab world for several reasons, despite their value and potential to improve data quality.
There are few native Arabic speakers trained to implement these techniques, and both
require additional funding beyond the survey budget. Audio recording is not generally
used in the Arab context due to authoritarianism, fear, and lack of familiarity with the
research process. However, interviews could be recorded in some circumstances, es-
pecially in the laboratory in freer countries like Tunisia, so long as appropriate human
subjects protocols and consent are followed.
Existing research: anchoring vignettes.  Another tool used to develop valid and reliable
measures of critical and abstract concepts that respondents interpret differently within
or across countries or regions is the anchoring vignette (King et al. 2004; King and Wand
2007). This technique is most useful for variables measured on ordinal scales, such as
level of democracy in one’s country, economic satisfaction, and political efficacy.
Mitchell and Gengler (2014) developed and tested anchoring vignettes in Qatar to
develop reliable measures of economic satisfaction and political efficacy. First, they
asked respondents a self-​assessment question, “How would you rate the current ec-
onomic situation of your family? (Very good, good, moderate, weak, very weak),”
followed by questions rating two hypothetical families’ economic well-​being: one family
with a monthly income of $8,000 and another with $16,500. Rather than creating more
concrete personal assessment questions, respondents’ own self-assessment is used, and differences in the meaning of incongruent levels of the concept are subtracted based on assessments of the anchoring vignettes (King 2015).

Table 11.5 Types and Examples of Questions in a “Verbal Probing” Cognitive Interview

Comprehension: What does democracy mean to you?
Paraphrasing: Can you repeat the question in your own words?
Confidence: Are you certain you met with a parliamentarian during the last year?
Recall probe: How do you remember that you experienced two water cuts during the last year?
Specific probe: Why do you think economic equality is the most important element of democracy?
General probe: How did you get to that answer? Was this a difficult or easy question to answer? I noticed you were unsure—why was this?

Source: Adapted from Willis (1999, 6).
Using this technique, Mitchell and Gengler (2014) found that Qatari citizens
overestimated their economic situation, while Qatari women overestimated political
efficacy when anchoring vignettes were not used. Bratton (2010) illustrates how this
technique corrects for incomparability of assessments of democracy level in African
countries; his work offers an example that could be implemented in the Arab world.
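The rescaling logic can be sketched as follows (Python; the function is a simplified version of the nonparametric recoding described by King et al. 2004, under the assumption that self-assessments and the two vignettes are rated on the same numeric scale with higher values meaning better off, and the example values are invented).

def vignette_position(self_rating, v_low, v_high):
    # Recode the self-assessment relative to two anchoring vignettes that the
    # researcher orders from less to more well-off. Ties across respondents and
    # order violations (v_low > v_high) need explicit rules in a real analysis.
    if self_rating < v_low:
        return 1   # places own situation below the poorer hypothetical family
    if self_rating == v_low:
        return 2
    if self_rating < v_high:
        return 3   # between the two vignettes
    if self_rating == v_high:
        return 4
    return 5       # places own situation above the richer hypothetical family

# Two respondents give the same self-rating (3) but anchor it differently,
# so their rescaled positions differ.
print(vignette_position(3, v_low=2, v_high=4))  # -> 3
print(vignette_position(3, v_low=3, v_high=5))  # -> 2

The rescaled variable, rather than the raw self-assessment, is then compared across respondents or modeled.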

Existing research: interviewer effects.  A substantial literature focuses on bias arising
from observable interviewer traits, including religious dress and gender, in responses
to survey questions about religion (Turkey, Egypt, Morocco, and Tunisia), women’s
status (Morocco), and voting for secular and Islamist parties (Tunisia). Two of the
first such studies focused on how interviewer dress affected reported religiosity,
both utilizing same-​sex interviewing and thus holding interviewer gender constant
(Blaydes and Gillum 2013; Koker 2009). Using experimental designs, these studies
found that respondents, facing social desirability pressure and the desire to avoid
sanction or embarrassment, responded strategically to conform to the socially stereo-
typed views of the interviewer, signaled by dress style.
These same studies also found that the degree of bias depended on the level of intergroup
conflict during the study, as well as respondent vulnerability. In field experiments in three
Turkish cities conducted at three points in time, Koker (2009) found that Islamist and
secularist symbols worn by interviewers affected reported religiosity, but the size of the
effect depended on the strength of Islamism at the time of the study, which was greatest
in 2004. In a survey of twelve hundred women in Cairo, Blaydes and Gillum (2013) found
that when the interviewer wore a headscarf, Muslim women expressed higher religiosity
and adherence to cultural practices (e.g., female genital mutilation and forced sex with
spouse), while Christian women expressed lower religiosity and higher adherence to these
practices. Effects were greatest for respondents from socially vulnerable segments of so-
ciety, including younger, less educated, and poorer women.
Benstead (2014b) also examined effects of interviewer dress on reported reli-
giosity in a 2007 observational study of eight hundred Moroccans. Using mixed-​
gender interviewing, she tested whether the impact of interviewer dress depended
on interviewer gender or respondent religiosity. She found that interviewer traits
systematically affected responses to four religiously sensitive questions, and that the
presence and size of effects depended on the religiosity of the respondent. Religious
respondents—​marginalized by the largely secular elite in Morocco—​faced greatest
pressure to amend their responses. Religious Moroccans provided less pious responses
to secular-​appearing interviewers, whom they may link to the secular state, and more
religious answers to interviewers wearing hijab, in order to safeguard their reputa-
tion in a society that values piety. Effects also depended on interviewer gender for
questions about religious dress, a gendered issue closely related to interviewer dress.
In another study Benstead (2014a) examined the impact of interviewer gender on
gender-​sensitive items, focusing specifically on gender dynamics in Moroccan society
five years after family code reform. Using survey data from four hundred Moroccans, she
found interviewer gender affected responses for questions related to women and politics
for male respondents, who reported more egalitarian views to female interviewers.
Benstead and Malouche (2015) also examined the impact of traits on past and future
vote choice in transitional Tunisia. Using a nationally representative survey of 1,202
Tunisians conducted in 2012, they found interviewers’ religious dress increased the like-
lihood of respondents’ stating that they had voted for the Islamist En-​Nahda party in the
2011 Constituent Assembly elections, as well as reporting that they planned to do so in
the next elections.
This literature underscores the need for researchers to code, report, and control for
religious dress, particularly in electoral polls in the post-​Arab-​uprising context, to re-
duce bias and gain insights into social identity and intergroup conflict. Yet the im-
pact of interviewer traits on electoral polls has been underexplored. Future studies
should employ experimental designs and a larger pool of interviewers. Behavior
coding, cognitive interviewing, and qualitative interviews are needed to help eluci-
date the underlying effects of causal mechanisms and social processes. New research
should also examine additional interviewer traits such as race, ethnicity, or class and
nonobservable interviewer attitudes and behaviors by surveying the interviewers
about their own views and investigating whether they inadvertently influence answers.
In addition, survey experiments are needed to test the impact of mode, including web,
phone, and CAPI, on interviewer effects and reporting of sensitive information.14
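One common way to quantify such effects is a multilevel model with a random intercept for each interviewer and fixed effects for observable interviewer traits. The sketch below (Python, using statsmodels) runs on simulated stand-in data; the variables, the built-in effect size, and the trait coded are purely illustrative and are not estimates from any study discussed here.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in data: 90 respondents nested in 6 interviewers, three of whom
# wear hijab; an interviewer-trait effect is built in purely for illustration.
rng = np.random.default_rng(1)
n, n_int = 90, 6
df = pd.DataFrame({
    "interviewer_id": np.repeat(np.arange(n_int), n // n_int),
    "resp_female":    rng.integers(0, 2, n),
    "resp_age":       rng.integers(18, 75, n),
    "resp_education": rng.integers(0, 5, n),
})
df["interviewer_hijab"] = (df["interviewer_id"] < 3).astype(int)
df["religiosity"] = (
    5.0 + 0.8 * df["interviewer_hijab"]
    + rng.normal(0, 1, n)
    + rng.normal(0, 0.5, n_int)[df["interviewer_id"]]
)

# Random intercepts capture clustering of responses within interviewers;
# the fixed effect for interviewer_hijab is the trait effect of interest.
model = smf.mixedlm(
    "religiosity ~ interviewer_hijab + resp_female + resp_age + resp_education",
    data=df, groups=df["interviewer_id"],
)
print(model.fit().summary())

With only a handful of interviewers, as is typical in the region’s surveys, such trait effects are weakly identified, which is one reason larger interviewer pools are recommended above.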

Representation Error
There is limited research examining the impact of methods on representation error,
such as error stemming from coverage, sampling, and nonresponse, which can impact
the accuracy of inferences drawn from the data.
Coverage error arises from a mismatch between the sampling frame and the population
of interest, where the error is systematically related to survey items. At times a sampling
frame of addresses may be available, but it is old and may not include homes constructed
during the previous ten years. Often no sampling frame is available. Probability pro-
portional to size (PPS) sampling using old census figures probably introduces sampling
error, especially in countries like Jordan, Lebanon, and Libya, where substantial popula-
tion movement within and across borders has occurred due to civil war.
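To make the mechanics concrete, the sketch below (Python) shows systematic probability-proportional-to-size selection of enumeration areas from census household counts; the unit names and counts are invented, and the point of the surrounding discussion is precisely that outdated size measures make these selection probabilities diverge from the current population.

import numpy as np

# Invented enumeration areas with (possibly outdated) census household counts.
units = ["EA-01", "EA-02", "EA-03", "EA-04", "EA-05", "EA-06"]
sizes = np.array([1200, 450, 800, 300, 2200, 950])

def systematic_pps(units, sizes, n_select, seed=0):
    # Cumulative size totals give each unit an interval proportional to its size;
    # a random start plus a fixed skip then selects units with probability
    # proportional to their recorded size (very large units can be hit twice).
    rng = np.random.default_rng(seed)
    cumulative = np.cumsum(sizes)
    skip = cumulative[-1] / n_select
    points = rng.uniform(0, skip) + skip * np.arange(n_select)
    return [units[np.searchsorted(cumulative, p)] for p in points]

print(systematic_pps(units, sizes, n_select=3))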
Efforts to sample houses not found in the sampling frame are therefore needed. One
solution to this problem is using light maps, as researchers in the 2014 Governance
and Local Development survey in Tunisia did to draw an area probability sample of
households. Another issue relates to the implementation of the random walk and other
techniques by the interviewer. Homes are often not built in a linear way, but rather in
clusters, particularly in rural areas. This requires specific instructions for interviewers
about how to implement random walks. The CAPI methods using tablets are increas-
ingly employed and allow interviewers to use Global Positioning System (GPS) to
ensure that the sampled households fall within the enumeration area (Benstead, Kao,
Landry, et al. forthcoming).
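The sketch below (Python, using the shapely library with invented coordinates) illustrates the kind of automated check this makes possible, assuming the enumeration-area boundary is available as a polygon of longitude/latitude points.

from shapely.geometry import Point, Polygon

# Invented enumeration-area boundary (longitude, latitude corners) and the GPS fix
# recorded by the tablet at the sampled household.
enumeration_area = Polygon([(10.10, 36.80), (10.14, 36.80),
                            (10.14, 36.83), (10.10, 36.83)])
household_fix = Point(10.12, 36.81)

if enumeration_area.contains(household_fix):
    print("Interview location falls inside the sampled enumeration area.")
else:
    print("Flag for supervisor review: GPS fix is outside the enumeration area.")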
Even when probabilistic sampling is used, there are unique conditions and practices
in the region that could produce sampling error. These errors can be introduced through
the process of choosing households or respondents within households.
At the level of household selection, at least four challenges arise. First, some firms or
interviewers use convenience sampling of individuals in cafés and other public places,
even when the reported methodology is household selection. Researchers must pay
attention to this possibility when screening research firms. Second, challenges associ-
ated with choosing sampling blocks within larger primary or secondary sampling units
have been resolved in some countries better than others through the use of low-​level maps
to probabilistically sample down to the level of the neighborhood, for example. When
interviewers are given too much freedom to choose sample units, they may simply go to
the center of a large town or to a single apartment building, where they conduct all surveys
in the sampling block, or worse, to homes or other places where they know friends and
family. Third, random walk patterns are hindered by housing areas in rural and urban areas
that do not fall on blocks in the same way as developed countries. This makes establishing
a trajectory and random walk difficult. More than one family may live in a housing unit,
requiring survey managers to define the household and create a means for selection.
Fourth, some sampled areas may be difficult to reach without transportation or in dan-
gerous places too arduous to enter. Rules for replacing these units are needed; managers
should determine the rules and ideally have supervisors direct interviewers to sampled
housing units, rather than giving interviewers the freedom to select a replacement.
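Related to the need for a clear selection rule within housing units, a simplified stand-in for the within-household selection step is sketched below (Python); a true Kish procedure uses pre-assigned selection tables keyed to the case number rather than an on-device random draw, but the logic of listing eligible adults in a fixed order and selecting one reproducibly is the same.

import random

def select_adult(household_adults, case_number):
    # Eligible adults are listed in a fixed order (e.g., oldest to youngest);
    # seeding the draw with the case number keeps the selection auditable.
    rng = random.Random(case_number)
    return rng.choice(household_adults)

adults = ["woman, 62", "man, 34", "woman, 29"]  # listed oldest to youngest
print(select_adult(adults, case_number=1047))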
Nonresponse error can arise from nonparticipation as well as item nonresponse.
Nonparticipation rates are higher among urban than rural residents in the Arab world.
Older and less-​educated citizens are systematically more difficult to recruit, likely because
of differential response rates and sampling design biased toward urban areas. As noted
above, tailored scripts may be needed to help recruit women, less educated, and older
individuals, who may have more difficulty understanding a survey about politics and may
feel their opinions are not important. To reduce bias generated from sampling as well as
nonresponse error, post-​stratification weights are typically applied, but studies are needed
to understand how patterns of nonparticipation and nonresponse affect survey estimates.
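The basic cell-based weighting step can be sketched as follows (Python; the population shares and sample counts are invented, and real applications typically use more cells or raking across margins).

import pandas as pd

# Invented census shares and achieved sample counts for four weighting cells.
population_share = pd.Series({
    ("urban", "18-39"): 0.34, ("urban", "40+"): 0.26,
    ("rural", "18-39"): 0.22, ("rural", "40+"): 0.18,
})
sample_counts = pd.Series({
    ("urban", "18-39"): 420, ("urban", "40+"): 310,
    ("rural", "18-39"): 150, ("rural", "40+"): 120,
})

sample_share = sample_counts / sample_counts.sum()
weights = population_share / sample_share   # weight given to each respondent in a cell
print(weights.round(2))

Cells that are underrepresented in the sample, such as the rural cells here, receive weights above one; the open question raised in the text is whether weighting of this kind actually removes the bias created by differential nonresponse.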
Existing research: refusal and item nonresponse.  Several studies have examined the im-
pact of observable interviewer traits on refusal and item nonresponse. While not all evi-
dence finds that traits affect item nonresponse (e.g., Benstead 2014b), male interviewers
tended to have lower item nonresponse rates, possibly due to their higher authority in a
patriarchal context (Benstead 2014a). In contrast, in many studies in Western countries fe-
male interviewers have higher participation rates and lower item nonresponse rates due to
their increased ability to establish interpersonal trust.15
Corstange (2014) examined how the apparent sponsor of the research affects re-
fusal and item nonresponse. Corstange found citizens were more likely to respond if
they believed the survey was being conducted by a university, even if American, than
by a foreign government. Participation rates varied across sects and generated bias that
could not be fully corrected by weighting.
Few researchers track refusal or systematically analyze why it occurs. To do so,
scholars should use cover pages (see Appendix 2), which are filled out and coded for
all contacts, including noncontacts and refusals. As noted, interviewer characteristics
should be recorded for completed and noncompleted interviews. By doing so, patterns
of response and nonresponse can be gathered.
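The sketch below (Python) shows how coded cover pages might be tallied into simple field-outcome rates; the disposition categories loosely follow the result codes in Appendix 2, the counts are invented, and the calculation is a simplified completion rate rather than a full AAPOR outcome rate (noncontacts, for instance, are all treated as eligible).

from collections import Counter

# Invented final dispositions derived from cover-page result codes.
dispositions = Counter({
    "completed": 812, "partial": 45, "refusal": 160,
    "noncontact": 230, "ineligible": 70,
})

eligible = sum(v for k, v in dispositions.items() if k != "ineligible")
completion_rate = dispositions["completed"] / eligible
refusal_rate = dispositions["refusal"] / eligible
print(f"Completion rate among cases treated as eligible: {completion_rate:.2%}")
print(f"Refusal rate: {refusal_rate:.2%}")

Crossing these dispositions with the interviewer characteristics and informant traits recorded on the cover page is what allows patterns of refusal to be analyzed.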

Ethical Issues and Conclusions

As survey research has expanded in the Arab world, new ethical questions, summarized
in Table 11.6, remain pertinent. Several relate to new technologies such as CAPI, which streamlines data collection, boosts interviewer professionalism, and may offer respondents increased confidentiality (Benstead, Kao, Landry, et al. forthcoming). But tablets iden-
tify the GPS location of interviews, raising concerns about confidentiality. Problems can
arise when uploading or storing surveys or when these coordinates are not removed
from the data file before public release. Because data must be uploaded to the Web, there
is potential for data interception by governments or other parties. Researchers must
address these risks in human subjects protocols, ensure data are securely uploaded and
stored, and remove identifiable GPS from released data files.
Tablets also allow interviewers to take photos of streets or houses for interviewer su-
pervision or coding neighborhood socioeconomic level, which also increases concerns
about confidentiality. Through their recording capability, tablets can be useful for monitoring, training, and questionnaire development (e.g., cognitive interviewing). Recording might, with consent, be used in more limited field or laboratory settings. Human subjects protocols must incorporate use of photo or audio recording capabilities and properly reflect benefits and potential harm of tablets.

Table 11.6 Ethical Concerns and Recommendations

Concern: Tablets record GPS interview location. Data on location of interview are uploaded and stored.
Recommendation: Researchers must address these risks in human subjects protocols and implement procedures to ensure data are securely uploaded and stored and that identifiable GPS data are removed from data files before release.

Concern: Tablets can be used to take photos of neighborhoods or make audio recordings.
Recommendation: Human subjects protocols must incorporate these techniques to inform potential participants, maximize potential benefits, and minimize potential harm.

Concern: Survey firms might conduct research for academic groups and governments.
Recommendation: Researchers must ensure the research sponsor is clearly and accurately stated in the informed consent script, bolster content about respondents’ rights (e.g., voluntary participation), and agree to preclude government/unauthorized access to raw data unless revealed to potential participants and incorporated into human subjects protocols.
Ethical issues also arise when survey firms conduct research for social science
projects in authoritarian countries, as well as for governments, who may wish to pre-
dict electoral outcomes, measure and respond to economic demands, or tap opinions.
There are recent examples of survey organizations compelling respondents to pro-
duce cards showing the district in which they are registered to vote. This raises eth-
ical concerns for the broader research community, such as data use by a third party,
the extent to which citizens feel compelled to answer questions, or possible anxiety
experienced by respondents surrounding their participation or nonparticipation.
Further discussion of these ethical issues is critical. Best practices might include
ensuring that the sponsor of the research is clearly and accurately stated in the informed
consent script, paying extra attention to and bolstering content about respondents’
rights (e.g., emphasizing that participation is voluntary), and specifying in agreements
with research firms that government cannot access data or cannot do so until after
ensuring protocols for safeguarding confidentiality and preparing files for release.
Despite the many challenges involved in conducting research in the region, substan-
tial, high-​quality, and underexploited survey data exist. Survey researchers in the Arab
world should continue to use existing data, even while they expand the body of surveys
and conduct new methodological research. Scholars should also continue to focus on
outreach efforts, helping to reinforce the role of survey research in the political process
and supporting the capacity among civil society organizations, political parties, and
media to make use of survey data.

Acknowledgments
I thank Lonna Atkeson for helpful feedback and Tanai Markle, Anna Murphy, Ursula Squire,
Jim Mignano, Narttana Sakolvittaynon, and Anthony Stine for research assistance.

Notes
1. For a list of publicly available surveys conducted in Arab countries, see Carnegie Middle
East Governance and Islam Dataset (Tessler 2016) and Appendix 1 in this chapter.
2. For example, prior to the Arab spring, researchers probed attitudes about sharia law
and the role of religion in the state, but not questions about past or future vote choice,
due to their sensitivity. A measure of respondents’ preferred party appears to have been
contemplated for the first wave of the Arab Barometer, conducted in 2006–​2008 in six
countries (Morocco, Algeria, Lebanon, Jordan, Palestine, and Yemen), but this variable
was not added to the final data set.
3. Figure 11.3 in this chapter shows declines in support for democracy in Jordan between 2011
and 2014; Palestine between 2003 and 2012; Algeria between 2002 and 2013; Morocco be-
tween 2001 and 2013; Kuwait between 2005 and 2014; Yemen between 2006 and 2013; Iraq
between 2004 and 2013; Egypt 200 and 2013; Saudi Arabia between 2003 and 2011; and
Tunisia between 2011 and 2014. Support for democracy remains high in the region as a whole
(Robbins and Tessler 2014) and did not appear to decline in Lebanon, Sudan, or Libya.
4. The TGP (2016) was launched by Ellen Lust, Lindsay Benstead, and collaborators
following the Arab spring in part to study and explain electoral behavior and involves
a series of public opinion surveys in transitional Tunisia, Libya, and Egypt. A founda-
tional project of the Program on Governance and Local Development (GLD), the Local
Governance Performance Index (LGPI) was developed by Lindsay Benstead, Pierre
Landry, Dhafer Malouche, and Ellen Lust. It maps public service provision and trans-
parency at the municipal level and has been conducted in Tunisia and Malawi. This
allows for comparisons of public service provision in areas including education, health,
and municipal services across areas.
5. Early publications on Arab public opinion include Grant and Tessler (2002); Nachtwey
and Tessler (2002); Tessler (2000); and Tessler and Warriner (1997).
6. I am grateful to Kristen Cibelli for sharing this idea.
7. As a result of accusations of politically motivated polling, a Tunisian law banned firms
from conducting, and the media from disseminating, polls or surveys during the 2014 elec-
toral campaign for parliament.
8. For a discussion of challenges facing researchers in insecure environments, see Mneimneh,
Axinn, et al. (2014).
9. Dialects spoken by indigenous inhabitants of North Africa, especially in Morocco and
Algeria.
10. Q x Qs are a list of standard definitions by survey question number that interviewers are
allowed to give when asked for clarification. In standard methodology, interviewers are
not allowed to offer any other definitions. If no definition is given for a term, the inter-
viewer may say, “Whatever _​_​_​_​_​means to you.”
11. There is some evidence from Malawi of higher response rates if citizens know their partic-
ipation brings work to interviewers from the area (Dionne 2015).
12. For additional work on detecting social desirability bias, see Mneimneh, Axinn, et  al.
(2014) and Mneimneh, Heeringa, et al. (2014).
13. Both require appropriate human subjects protocols and informed consent.
14. Phone surveys are also being used in a number of Arab countries, such as Tunisia and
Qatar. These surveys have a similar potential to exclude citizens without mobile or land-
line phones. Studies are needed to assess coverage error issues in face-​to-​face, phone, and
increasingly, web-​based surveys.
15. Item nonresponse in social surveys is higher for female respondents in general (e.g.,
Rapoport 1982), but studies of interviewer gender effects find either that item nonresponse
is unrelated to interviewer traits (Groves and Fultz 1985) or the relationship is weak
(Kane and Macaulay 1993). In U.S. samples, item nonresponse and refusal rates are lower
for female interviewers. Benney, Riesman, and Star (1956) found lower rates of item
nonresponse for female interviewers, and Webster (1996) found that female interviewers
had fewer omitted items, particularly from male respondents, who were likely to “work
hard” in interviews with females. Hornik (1982) found lower unit nonparticipation rates in
mail surveys when participants received a prestudy call from a female.

References
Abdel-​Samad, M., and L. J. Benstead. 2016. “Why Does Electing Women and Islamist
Parties Reduce the Gender Gap in Service Provision?” Paper presented at the After the
Uprisings: Public Opinion, Gender, and Conflict in the Middle East Workshop, Kansas State
University, May 5.
Afrobarometer. 2016. Home page. http://​www.afrobarometer.org/​.
Alexander, A. C., and C. Welzel. 2011. “Islam and Patriarchy: How Robust Is Muslim Support
for Patriarchal Values?” International Review of Sociology 21 (2): 249–​276.
Arab Barometer. 2016. Home page. http://​www.arabbarometer.org/​.
Benney, M., D. Riesman, and S. A. Star. 1956. “Age and Sex in the Interview.” American Journal
of Sociology 62 (2): 143–​152. http://​dx.doi.org/​10.1086/​221954.
Benstead, L. J. 2014a. “Does Interviewer Religious Dress Affect Survey Responses?
Evidence from Morocco.” Politics and Religion 7 (4):  734–​760. http://​dx.doi.org/​10.1017/​
S1755048314000455.
Benstead, L. J. 2014b. “Effects of Interviewer-​Respondent Gender Interaction on Attitudes to-
ward Women and Politics: Findings from Morocco.” International Journal of Public Opinion
Research 26 (3): 369–​383. http://​dx.doi.org/​10.1093/​ijpor/​edt024.
Benstead, L. J. 2015. “Why Do Some Arab Citizens See Democracy as Unsuitable for
Their Country?” Democratization 22 (7):  1183–​ 1208. http://​dx.doi.org/​10.1080/​
13510347.2014.940041.
Benstead, L. J. 2016a. “Why Quotas Are Needed to Improve Women’s Access to Services
in Clientelistic Regimes.” Governance 29 (2): 185–205. http://​dx.doi.org/​10.1111/​
gove.12162.
Benstead, L. J. 2016b. “Does Electing Female Councillors Affect Women’s Representation?
Evidence from the Tunisian Local Governance Performance Index (LGPI).” Paper
presented at the Annual Conference of the Midwest Political Science Association, Chicago,
April 7–​10.
Benstead, L. J., and L. Atkeson. 2011. “Why Does Satisfaction with an Authoritarian Regime
Increase Support for Democracy? Corruption and Government Performance in the Arab
World.” Paper presented at Survey Research in the Gulf: Challenges and Policy Implications,
Doha, February 27–​March 1.
Benstead, L. J., A. A. Jamal, and E. Lust. 2015. “Is It Gender, Religion or Both? A Role Congruity
Theory of Candidate Electability in Transitional Tunisia.” Perspectives on Politics 13 (1): 74–​
94. http://​dx.doi.org/​10.1017/​S1537592714003144.
Benstead, L. J., K. Kao, P. Landry, E. Lust, and D. Malouche. Forthcoming. “Using Tablet
Computers to Implement Surveys in Challenging Environments.” Survey Practice. http://​
www.surveypractice.org/​index.php/​SurveyPractice.
Benstead, L. J., K. Kao, and E. Lust. 2014. “Why Does It Matter What Observers Say?
The Impact of International Monitoring on the Electoral Legitimacy.” Paper
presented at the Middle East Studies Association Annual Meeting, Washington, DC,
November 22–​25.
Benstead, L. J., and D. Malouche. 2015. “Interviewer Religiosity and Polling in Transitional
Tunisia.” Paper presented at the Annual Conference of the Midwest Political Science
Association, Chicago, April 7–​10.
Benstead, L. J., and M. Reif. 2013. “Polarization or Pluralism? Language, Identity, and Attitudes
toward American Culture among Algeria’s Youth.” Middle East Journal of Culture and
Communication 6 (1): 75–​106.
Benstead, L. J., and M. Reif. 2015. “Coke, Pepsi or Mecca Cola? Why Product Characteristics
Shape Collective Action Problems and Boycott Success.” Politics, Groups, and Identities
(October 1): 1–​22. http://​dx.doi.org/​10.1080/​21565503.2015.1084338.
Benstead, L. J., and M. Reif. 2016. “Hearts, Minds, and Pocketbooks: Anti-​Americanisms and
the Politics of Consumption in the Muslim World.” Unpublished manuscript.
Benstead, L. J., and E. Snyder. 2016. “Is Security at Odds with Support for Democracy? Evidence
from the Arab World.” Unpublished manuscript.
Benstead, L. J., and M. Tessler. 2016. “Why Are Some Ordinary Citizens in Partly-​Free
Countries ‘Security Democrats’? Insights from a Comparison of Morocco and Algeria.”
Unpublished manuscript.
Bernick, E. M., and S. Ciftci. 2014. “Utilitarian and Modern:  Clientelism, Citizen
Empowerment, and Civic Engagement in the Arab World.” Democratization 22 (7): 1161–​
1182. http://​dx.doi.org/​10.1080/​13510347.2014.928696.
Blaydes, L., and R. M. Gillum. 2013. “Religiosity-​of-​Interviewer Effects:  Assessing the
Impact of Veiled Enumerators on Survey Response in Egypt.” Politics and Religion 6
(3): 459–​482.
Bratton, M. 2010. “Anchoring the ‘D-​Word’ in Africa.” Journal of Democracy 21 (4): 106–​113.
http://​dx.doi.org/​10.1353/​jod.2010.0006.
Bush, S. S., and A. A. Jamal. 2014. “Anti-​Americanism, Authoritarian Politics, and Attitudes
about Women’s Representation:  Evidence from a Survey Experiment in Jordan.”
International Studies Quarterly 58 (4): 34–​45. http://​dx.doi.org/​10.1111/​isqu.12139.
Ciftci, S. 2013. “Secular-​Islamist Cleavage, Values, and Support for Democracy and Shari’a in
the Arab World.” Political Research Quarterly 66 (11):  374–​394. http://​dx.doi.org/​10.1177/​
1065912912470759.
Corstange, D. 2014. “Foreign-​Sponsorship Effects in Developing-​World Surveys:  Evidence
from a Field Experiment in Lebanon.” Public Opinion Quarterly 78 (2): 474–​484.
Corstange, D., and N. Marinov. 2012. “Taking Sides in Other People’s Elections: The Polarizing
Effect of Foreign Intervention.” American Journal of Political Science 56 (3): 655–​670.
Dionne, K. Y. 2015. “The Politics of Local Research Production:  A Case Study of Ethnic
Competition.” Politics, Groups, and Identities 2 (3):  459–​ 480. http://​doi.org/​10.1080/​
21565503.2014.930691.
Ericsson, K., and H. A. Simon. 1993. Protocol Analysis:  Verbal Reports as Data. Rev. ed.
Cambridge, MA: MIT Press.
Ferguson, C. A. 1959. “Diglossia.” In Language in Social Context, edited by P. P. Giglioli, 232–​257.
Middlesex, UK: Penguin.
Grant, A. K., and M. A. Tessler. 2002. “Palestinian Attitudes toward Democracy and Its
Compatibility with Islam: Evidence from Public Opinion Research in the West Bank and
Gaza.” Arab Studies Quarterly 24 (4): 1–​20.
Groves, R. M., F. J. Fowler Jr., M. P. Couper, J. M. Lepkowski, E. Singer, and R. Tourangeau.
2009. Survey Methodology. 2nd ed. Hoboken, NJ: Wiley.
Groves, R. M., and N. H. Fultz. 1985. “Gender Effects among Telephone Interviewers in a
Survey of Economic Attitudes.” Sociological Methods & Research 14 (1): 31–​52. http://​dx.doi.
org/​10.1177/​0049124185014001002.
Hornik, J. 1982. “Impact of Pre-​call Request Form and Gender Interaction on Response to a
Mail Survey.” Journal of Marketing Research 19 (1): 144–​151.
Jamal, A. A. 2007a. Barriers to Democracy: The Other Side of Social Capital in Palestine and the
Arab World. Princeton, NJ: Princeton University Press.
Jamal, A. A. 2007b. “When Is Social Trust a Desirable Outcome? Examining Levels of Trust in
the Arab World.” Comparative Political Studies 40 (11): 1328–​1349. http://​dx.doi.org/​10.1177/​
0010414006291833.
Jamal, A. A., and M. A. Tessler. 2008. “Attitudes in the Arab World.” Journal of Democracy 19
(1): 97–​110. http://​dx.doi.org/​10.1353/​jod.2008.0004.
Kane, E. W., and L. J. Macaulay. 1993. “Interviewer Gender and Gender Attitudes.” Public
Opinion Quarterly 57 (1): 1–​28. http://​dx.doi.org/​10.1086/​269352.
King, G. 2015. “Anchoring Vignettes Website.” http://​gking.harvard.edu/​vign.
King, G., C. J. L. Murray, J. A. Salomon, and A. Tandon. 2004. “Enhancing the Validity and
Cross-​Cultural Comparability of Measurement in Survey Research.” American Political
Science Review 98 (1): 191–​207. https://​doi.org/​10.1017/​S000305540400108X.
King, G., and J. Wand. 2007. “Comparing Incomparable Survey Responses: Evaluating and
Selecting Anchoring Vignettes.” Political Analysis 15 (1): 46–​66.
Koker, T. 2009. “Choice under Pressure:  A Dual Preference Model and Its Application.”
Yale Economics Department Working Paper No. 60. http://​www.dklevine.com/​archive/​
refs4814577000000000264.pdf.
Middle East Studies Association. 2016. Middle East Studies Association Annual Meeting
Program. https://​mesana.org/​mymesa/​meeting_​program.php.
Mitchell, J. S., and J. J. Gengler. 2014. “What Money Can’t Buy: Wealth, Status, and the Rentier
Bargain in Qatar.” Paper presented at the American Political Science Association Annual
Meeting, Washington, DC, August 28–​31.
Mneimneh, Z. N., W. G. Axinn, D. Ghimire, K. L. Cibelli, and M. S. Al-​Kaisy. 2014. “Conducting
Surveys in Areas of Armed Conflict.” In Hard-​to-​survey populations, edited by R Tourangeau
et al., 134–​156. Cambridge, UK: Cambridge University Press.
Mneimneh, Z. N., S. G. Heeringa, R. Tourangeau, and M. R. Elliott. 2014. “Bridging
Psychometrics and Survey Methodology:  Can Mixed Rasch Models Identify Socially
Desirable Reporting Behavior?” Journal of Survey Statistics and Methodology 2 (3): 257–​282.
Nachtwey, J., and M. Tessler. 2002. “The Political Economy of Attitudes toward Peace among
Palestinians and Israelis.” Journal of Conflict Resolution 46 (2): 260–​285. http://​www.jstor.
org/​stable/​3176175.
Norris, P. 2009. “Why Do Arab States Lag the World in Gender Equality?” September 16. http://​
www.hks.harvard.edu/​fs/​pnorris/​Acrobat/​Why_​do_​Arab_​States_​Lag3.pdf.
Program on Governance and Local Development. 2015. “The Tunisian Local Governance
Performance Index Report.” University of Gothenburg. http://​gld.gu.se/​media/​1107/​lgpi-​
report-​eng.pdf.
Program on Governance and Local Development. 2016. University of Gothenburg. http://​gld.
gu.se/​.
Rapoport, R. 1982. “Sex Differences in Attitude Expression:  A Generational Explanation.”
Public Opinion Quarterly 46 (1): 86–​96.
Robbins, M., and M. Tessler. 2014. “Arab Views on Governance after the Uprisings.” Monkey
Cage (blog), Washington Post, October 29. https://​www.washingtonpost.com/​blogs/​
monkey-​cage/​wp/​2014/​10/​29/​arab-​views-​on-​governance-​after-​the-​uprisings/​.
Shelef, N. G., and Y. Zeira. 2015. “Recognition Matters! UN State Status and Attitudes towards
Territorial Compromise.” Journal of Conflict Resolution (August 12). http://​dx.doi.org/​
10.1177/​0022002715595865.
Tessler, M. 2000. “Morocco’s Next Political Generation.” Journal of North African Studies 5
(1): 1–​26.
Tessler, M. 2002a.“Do Islamic Orientations Influence Attitudes toward Democracy in the
Arab World? Evidence from Egypt, Jordan, Morocco, and Algeria.” International Journal
of Comparative Sociology 43 (3): 229–249. http://​dx.doi.org/​10.1177/​002071520204300302.
Tessler, M. 2002b. “Islam and Democracy in the Middle East:  The Impact of Religious
Orientations on Attitudes toward Democracy in Four Arab Countries.” Comparative Politics
34 (3): 337–​354. http://​dx.doi.org/​10.2307/​4146957.
Tessler, M., and E. Gao. 2005. “Gauging Arab Support for Democracy.” Journal of Democracy 16
(3): 83–​97. http://​dx.doi.org/​10.1353/​jod.2005.0054.
Tessler, M., A. Jamal, and M. Robbins. 2012. “New Findings on Arabs and Democracy.” Journal
of Democracy 23 (4): 89–​103. http://​dx.doi.org/​10.1353/​jod.2012.0066.
Tessler, M., M. Moaddel, and R. Inglehart. 2006. “Getting to Arab Democracy: What Do Iraqis
Want?” Journal of Democracy 17 (1): 38–​50.
Tessler, M., and M. D. H. Robbins. 2007. “What Leads Some Ordinary Arab Men and Women
to Approve of Terrorist Acts against the United States?” Journal of Conflict Resolution 51
(2): 305–​328.
Tessler, M., and I. Warriner. 1997. “Gender, Feminism, and Attitudes toward International
Conflict: Exploring Relationships with Survey Data from the Middle East.” World Politics 49
(2): 250–​281.
Tessler, M. A. 2016. Carnegie Middle East Governance and Islam Dataset, 1988–​2014. Inter-​
university Consortium for Political and Social Research. April 28. http://​doi.org/​10.3886/​
ICPSR32302.v6.
Tezcür, G. M., T. Azadarmaki, M. Bahar, and H. Nayebi. 2012. “Support for Democracy in Iran.”
Political Research Quarterly 65 (2): 235–​247. http://​dx.doi.org/​10.1177/​1065912910395326.
Transitional Governance Project (TGP). 2016. Home page. http://​
transitionalgovernanceproject.org/​.
Webster, C. 1996. “Hispanic and Anglo Interviewer and Respondent Ethnicity and Gender: The
Impact on Survey Response Quality.” Journal of Marketing Research 33 (1): 62–​72. http://​
dx.doi.org/​10.2307/​3152013.
Willis, G. B. 1999. “Cognitive Interviewing: A ‘How to’ Guide.” Short course presented at the
Meeting of the American Statistical Association. http://​appliedresearch.cancer.gov/​archive/​
cognitive/​interview.pdf.
World Values Survey. 2016. Home page. http://​www.worldvaluessurvey.org/​wvs.jsp.

Appendix 1
Public Opinion Data Sources

Publicly Available Data from Arab Countries


Arab Barometer: http://​www.arabbarometer.org/​
World Values Survey: http://​www.worldvaluessurvey.org/​wvs.jsp
Afrobarometer: http://​www.afrobarometer.org/​
ICPSR: https://​www.icpsr.umich.edu/​icpsrweb/​landing.jsp (See in particular Carnegie
Middle East Governance and Islam Dataset, http://​www.icpsr.umich.edu/​icpsrweb/​
ICPSR/​studies/​32302, which includes individual-​level and country-​level variables for
surveys conducted by Mark Tessler and collaborators since 1988.)
Pew Research Center has conducted surveys since 2001 in Morocco, Tunisia, Lebanon,
Jordan, Egypt, and Kuwait. Available online at http://​www.pewglobal.org/​question-​
search/​.
Other Survey-​Related Websites


Transitional Governance Project: http://​transitionalgovernanceproject.org/​
Program on Governance and Local Development:  http://​campuspress.yale.edu/​pgld/​
and http://​gld.gu.se/​

Research Centers and Institutes


The Social & Economic Survey Research Institute: http://​sesri.qu.edu.qa/​ (Qatar)
Palestinian Center for Policy and Survey Research: http://​www.pcpsr.org/​ (Palestine)
Center for Strategic Studies: http://​www.jcss.org/​DefaultAr.aspx (Jordan)
A number of non-​and for-​profit marketing and survey firms and research groups in the
region also conduct surveys.

Appendix 2
Sample Cover Page for Interviewer Effects Study
in Tunisia

VARIABLES TO BE FILLED IN BEFORE GOING TO DOOR


COVER PAGE
Please fill out I1 to I8 before going to the door.

I1. Case number |_​_​| |_​_​| |_​_​| |_​_​|

I2. Governorate |___________|
I3.1 Mu’atamdiya or El-Imada or local government area equivalent |__________|
I3.2 Municipality |__________|
I4.1 Interviewer name |_________|
I4.2 Interviewer number |________|
I6. Total number of adults living in household |_____|
I13.3 Block number |_​_​_​_​_​_​_​_​_​_​_​| I13.4 Random start |_​_​_​_​_​_​_​_​_​_​_​|


I13.5 Random walk |_​_​_​_​_​_​_​_​_​_​_​|
I8a. What is the socioeconomic status of the housing based upon the external
appearance of the house and neighborhood?
1 Lower class
2 Lower middle class
3 Middle class
4 Upper middle class
5 Upper class
I8b. Do you know anyone in this household? 1. Definitely not 2. Possibly 3. Yes
I8c. What is the nature of your relationship with one or more members of this house-
hold? 1. Friends 2. Family 3. Classmates, coworkers, 4. Tribe/​clan 5. Association, reli-
gious organization 6. Other: _​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​
Please fill out I9 to I15 after the survey is complete or refused.

Case Result I.9 |_​_​| Completed Interview |_​_​| Declined


I5. Number of adults selected [See Kish table next page] |___|
I10. Interview completed by interviewer who selected respondent? |__| Yes |__| No
I11. Result code [Enter number 1–10 and visit number] |__|   (1st visit / 2nd visit / 3rd visit)
1. Interview completed
2. Interview partially completed
3. Interview delayed
4. No eligible person
5. Premises closed (unsure whether eligible person lives here because no one answered the door)
6. Nonresidential unit (e.g., office, doctor’s office)
7. Premises empty (it is clear that no one lives there)
8. Refusal (i.e., selected respondent refuses to participate) [Go to I12a]
9. Refusal (i.e., cannot meet with selected respondent for interview) [Go to I12a]
10. Other

Information on refusals:
If I11 is 8 or 9 (refusal):
I12a. What is the gender of the informant who answered the door?
Male 1. Female
I12b. Religious clothing/​appearance of the informant who answered the door or others in
the home:
3. Very religious 2. Religious 1. Not religious at all 96. Unknown
I12c. Were you able to choose a participant using the Kish table?
No 1. Yes
I12d. What is the gender of selected participant who declined the interview?
Male 1. Female 2. No participant was selected because refusal occurred before
household listing
I12e. Religious clothing/​appearance of the one who declined the interview:
3. Very religious 2. Religious 1. Not religious at all 96. Unknown
I12f. Any reason given for refusal? (if reason guessed or known): _​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​
_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​_​
Completed Questionnaire Information

I13. Interview Date |__| |__| / |__| |__| / |__| |__|  (M/D/Y)
Interview Start Time |__| |__| : |__| |__|   Interview End Time |__| |__| : |__| |__|
I14. Total Minutes Spent on Interview |__| |__| |__|
I15. Field Supervisor’s Name   Inspection Date by Field Supervisor |__| |__| / |__| |__| / |__| |__|  (M/D/Y)
Chapter 12

The Language-Opinion Connection

Efrén O. Pérez

Introduction

Can language affect survey response? This is a simple question that should be easy to
answer. All that is seemingly necessary to test this proposition are measures of people’s
opinions and the language they interview in. From there, researchers can statistically
estimate the association between individuals’ reported views and the language of inter-
view (while of course adjusting these estimates for other influences) (Lee 2001; Lee and
Pérez 2014; Lien et al. 2004; Welch et al. 1973). In fact, some scholars might wish to assess
this relationship experimentally (Druckman et al. 2011; Shadish et al. 2002). Better to
fully isolate language’s influence via randomization. Only then can one confidently infer
language’s causal impact on survey response. Simple, right?
Yes, but also utterly deceptive. Because no matter how elegant the research design,
survey researchers have yet to answer basic questions about this “language-​opinion
connection.” First, there is a conceptualization issue: identifying which aspects of lan-
guage influence what facets of survey response (Swoyer 2014). Given that languages
vary by their grammatical structure, scholars should specify how such nuances affect
people’s articulation of opinions, if at all (Slobin 1996). Researchers should also clarify
whether language shapes individual opinions by influencing what a person remembers
and/​or how they remember it (Marian and Neisser 2000). These are not mutually exclu-
sive points, but they are often omitted from existing research on language and survey
response (Lee and Pérez 2014; Welch et al. 1973).
Second, there is a measurement problem. To claim that interview language affects
survey response, scholars must ensure that speakers of different tongues interpret
survey questions equivalently (Boroditsky 2001). Individuals who interview in dis-
tinct tongues may express dissimilar opinions because they construe survey questions
differently (Stegmueller 2011; Pérez 2009; Davidov 2009). Hence, language-​opinion
gaps might stem from uneven item performance across tongues, rather than language
nuances per se.
Third, it is challenging to select a research design to appraise language’s influence
on survey response. One popular option is to observationally assess this relationship
with national survey data that offer respondents different interview language options
(Dutwin and Lopez 2014; Garcia 2009; Lee and Pérez 2014). This approach is strong
on representativeness: one can be confident that a correlation between language and
opinion(s) reflects a real phenomenon in a mass public. But the same framework is
weak on causality: it is impossible to rule out that other (un-​)observed traits among
individuals, besides language, drive this pattern (Clarke 2005).
Enter experiments, in which interview language can be randomized among bilinguals.
This method isolates language, thus allowing tests of its mechanism(s) (cf. Boroditsky
2001; Marian and Kaushanskaya 2007). Yet such experiments can be criticized on sev-
eral grounds, including, paradoxically, internal validity (i.e., whether language causally
impacts survey response) (McDermott 2011). For example, bilinguals often use one of
their tongues in specific contexts. In surveys, this manifests as a preference to inter-
view in a certain language (Lee and Pérez 2014). However, if one is randomly assigned
to complete a survey in a tongue that is not preferred for that context, this treatment can
become reactive in all the wrong ways, such as angering or worrying people, with down-
stream effects on their opinions (Brader and Marcus 2013).
But the last challenge is the most fundamental. Conceptualization can be addressed,
measurement issues can be overcome, and optimal research designs can be selected.
Yet all of this is for naught if there is weak theory to explain and anticipate language
effects on the varied components of survey response (e.g., recall, reporting) (Tourangeau
et al. 2000). There is no silver bullet here. This must be a sustained effort to answer
questions such as how language shapes survey response, when it does so, and among
whom these effects are strongest. Such an effort might begin by considering whether
language differences produce framing effects, or shifts in how “people develop a par-
ticular conceptualization of an issue or reorient their thinking about an issue” (Chong
and Druckman 2007, 104). It is plausible that by increasing the relative salience of some
considerations, language affects how respondents answer survey questions. For ex-
ample, does interviewing in a tongue that makes explicit gender distinctions (e.g., “his”
versus “hers”) lead respondents to report less support for policy proposals to tackle
gender inequality?
So the simple question about whether language shapes survey response is not
so simple after all. In fact, it is not really one question, but several. This means there
is plenty of work for survey researchers to do. In order to clarify just how much work
there is, I use the next section to pinpoint what scholars know and do not know about
the language-​opinion connection. I  show that while a thick layer of evidence has
accumulated on language’s association with mass opinion, explicit theories about why
this pattern emerges are thin and sparse.
I then explain several findings from cognitive and social psychology, two fields with
rich implications for how survey researchers might strengthen their theoretical grip
on the mechanism(s) behind the language-opinion connection (Boroditsky and Gaby
2010; Marian and Kaushanskaya 2007; Ogunnaike et al. 2010). I argue that any successful
attempt to illuminate language’s influence on survey response should consider heeding
what psychologists have learned, since most of their insights have not been applied to
this domain. The opportunity for synergy across these disciplines is therefore ample
and ripe.
Finally, I  round out these sections with a detailed discussion of how researchers
might make headway on building theories to explain the impact of language on survey
response, while addressing issues of conceptualization and measurement along the
way. What I say in that section is not meant to be exhaustive. My more modest goal is to
highlight what I see as a pressing need to illuminate the microfoundations of language
effects on survey response. Let me begin by providing a better sense of what we are up
against.

What Survey Researchers (Do Not) Know about the Language-Opinion Connection

With the rise of mass surveys, researchers began assessing public opinion across coun-
tries, thus encouraging the development of questionnaires in needed tongues (Almond
and Verba 1963; Ervin and Bower 1952; Stern 1948). Today, scholars increasingly use
multilingual polls to gauge opinions in nations where immigration has brought about
population growth (Davidov and Weick 2011; de la Garza et al. 1992; Lien et al. 2004;
Tillie et al. 2012; Wong et al. 2011). Take the United States, where Asians and Latinos have
arrived in large numbers since 1965. Ryan (2013) reports that about forty million people
in the United States speak Chinese or Spanish at home, even though more than 70% of
them report speaking English “well” or “very well.”1
Giving respondents the opportunity to interview in different languages allows
researchers to yield more representative portraits of mass opinion (cf. de la Garza et al.
1992; Dutwin and Lopez 2014; Fraga et al. 2010; Hochman and Davidov 2014; Inglehart
and Norris 2003; Lien et al. 2004; Tillie et al. 2012; Wong et al. 2011). In many poly-
glot nations, some people will speak only one language, though it may not be the one
used to administer a poll. Some will speak two or more languages, but may prefer to be
interviewed in a tongue also not offered by a survey. But others will speak the tongue
provided by a poll, although they represent but one stratum in the population. Yet to
sample only this last segment because it is easier and cheaper is to misrepresent the
opinions of the larger population (Dutwin and Lopez 2014), especially if those prefer-
ring to interview in certain tongues display varied attitudes and beliefs (Lee and Pérez
2014). Thus, as societies become (even) more linguistically diverse, the use of multilin-
gual polls will likely continue to grow.
But even as researchers increasingly poll members of mass publics in different
tongues, a dense fog hangs over why and how language affects survey response. This am-
biguity is reflected in leading explanations about how people articulate opinions (Lodge
and Taber 2013; Tourangeau et al. 2000; Zaller 1992). These frameworks suggest that
survey response depends on the question being asked and the considerations it evokes
(Zaller 1992). Specifically, survey questions activate concepts in long-​term memory,
which is associatively organized (Lodge and Taber 2013). This means concepts are linked
to each other in a lattice-​like network, in which stimulation of one energizes others via
spreading activation (Collins and Loftus 1975). Once relevant concepts are aroused, they
are recruited from long-​term memory into working memory—​the “top of the head”—​
where one assembles them into a response (Zaller 1992). Yet nowhere in these theoret-
ical accounts does language explicitly play a role.
This omission is at odds with what some survey researchers are finding. Several
studies show that public opinion is reliably associated with interview language (Lien
et  al. 2004; Pérez 2011; Welch et  al. 1973). Lee (2001) reports robust correlations be-
tween interview language and opinions on several topics in the Latino National Political
Survey (LNPS; 1988–​1989), a seminal study of U.S. Latinos. Lee and Pérez (2014) reveal
that such patterns also emerge in newer data sets, like the Latino National Survey (LNS;
2006). For example, LNS respondents interviewing in English report 10% more knowl­
edge about U.S. politics than those interviewing in Spanish. Moreover, Garcia (2009)
finds that about a fifth of LNS respondents changed interview languages—​from English
to Spanish or Spanish to English—​with this switching affecting people’s opinion reports.
These associations between individual opinions and language of interview are gener-
ally robust to statistical controls and reproducible across several data sets and different
populations that are linguistically diverse (e.g., Asian Americans) (cf. Lee 2001; Lee
and Pérez 2014; Lien et al. 2004). Yet their interpretation remains open to debate—​and
for precisely some of the reasons I discussed in the introduction. Let us start with the
question of research design.

Correlations, Correlations, Correlations


Most evidence affirming a language-​opinion connection derives from correlational
studies of survey data that are representative of populations like Asian Americans or
Latinos (Garcia 2009; Lee 2001; Lee and Pérez 2014; Lien et al. 2004; Welch et al. 1973).
Finding that individual opinions are correlated with interview language is remarkable,
because it implies that what survey respondents report is shaped by the tongue they use
to complete a poll. But the correlational nature of these studies raises strong concerns
about omitted variable bias (Clarke 2005), since interview language is self-​selected by
respondents, not randomly assigned. Scholars have dealt with this by adjusting estimates
of language effects for a litany of observed covariates (e.g., age, education, language pro-
ficiency) (Garcia 2009; Lee and Pérez 2014; Welch et al. 1973). But this ignores unob-
served differences between respondents and makes the generated results increasingly
model dependent (Clarke 2005). Clearer and stronger evidence, then, is needed to bol-
ster the claim that language independently influences survey response.
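To make the omitted-variable concern concrete, consider the simulation sketch below (written in Python with hypothetical variable names; it illustrates the logic of the critique, not a reanalysis of any survey cited above). An unobserved trait, labeled acculturation here, drives both the choice to interview in English and the reported opinion, so a naive model attributes to language an effect that adjusting for observed covariates only partially removes.

```python
# Illustration of omitted-variable bias in a correlational language-opinion analysis.
# All variable names are hypothetical; this is simulated, not real, survey data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5_000

# Unobserved trait (e.g., acculturation) drives both interview language and opinion.
acculturation = rng.normal(size=n)
education = rng.normal(size=n)

# Respondents self-select into an English-language interview.
p_english = 1 / (1 + np.exp(-(0.8 * acculturation + 0.4 * education)))
english = rng.binomial(1, p_english)

# True data-generating process: language itself has NO effect on the opinion.
opinion = 0.5 * acculturation + 0.3 * education + rng.normal(size=n)

df = pd.DataFrame({"opinion": opinion, "english": english, "education": education})

naive = smf.ols("opinion ~ english", data=df).fit()
adjusted = smf.ols("opinion ~ english + education", data=df).fit()

# The apparent "language effect" shrinks once education is controlled, but it does
# not vanish, because acculturation remains unobserved -- the pattern described above.
print(naive.params["english"], adjusted.params["english"])
```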

Apples and Oranges


A thornier issue involves measurement: specifically, ensuring that speakers of different
tongues sense the same reality. This is formally known as measurement equivalence,
or what Horn and McArdle (1992, 117)  refer to as “whether or not, under different
conditions of observing and studying phenomena, measurement operations yield meas-
ures of the same attribute” (Davidov 2009; Harkness et al. 2003; Jacobson et al. 1960;
Pérez 2009). Applied to the case of language and survey response, measurement equiv-
alence is achieved if survey questions capture the same attitude, belief, value, and so
forth from respondents who interview in different languages. Consider the assessment
of group identity in a survey. Attaining measurement equivalence here demands that
items appraising this construct do, in fact, capture the same form of identity, to the same
degree, across respondents interviewed in different tongues. If these conditions are not
met, scholars risk comparing “apples” to “oranges” (Stegmueller 2011).2
Despite painstaking questionnaire translations, however, speakers of varied tongues
often interpret survey items differently (Harkness et al. 2003; Pérez 2009; Stern 1948).
Survey questions aim to measure opinions that are latent and not directly observed. This
means a person’s observed score (yi) on a survey question is conditional on their true
opinion score (η) and nothing else. When F(yi | η) holds, observed differences in an-
swering a question reflect true opinion differences. But if speakers of varied tongues in-
terpret a survey item differently, a person’s response to a question is conditional on his or
her opinion and language group (gi)—that is, F(yi | η, gi).3
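Formally, this contrast can be stated with a minimal linear measurement model (standard factor-analytic notation, offered as a sketch rather than the specification used in the studies cited here):

```latex
% A minimal measurement model for item j answered by respondent i in language group g:
y_{ijg} = \tau_{jg} + \lambda_{jg}\,\eta_i + \varepsilon_{ijg},
\qquad \varepsilon_{ijg} \sim N(0,\theta_{jg}).
% Measurement equivalence requires, at a minimum, equal loadings and intercepts across groups,
\lambda_{jg} = \lambda_j \quad\text{and}\quad \tau_{jg} = \tau_j \quad\text{for all } g,
% so that F(y \mid \eta, g) collapses to F(y \mid \eta).
```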
When this happens, language-​opinion differences are conflated with item quality
differences, making it harder to pin a causal effect to language (Stegmueller 2011).
Moreover, if questions are more difficult for some language speakers, then they will
misreport their opinion level. Pérez (2011) shows that even at equal levels of political
knowledge, Spanish interviewees were less likely than English interviewees to correctly
report which candidate won the most votes in their state in the 2004 presidential elec-
tion, due to item bias. Similar results arise in items measuring other traits, with item bias
yielding “false positives” in sample survey data. More reassurance is thus needed that
any language-​opinion gap is real rather than a measurement artifact.4
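One generic diagnostic for this kind of item bias, sketched below on simulated data (it is not the specific procedure used in the studies just cited), is a logistic-regression check for differential item functioning: at equal levels of overall knowledge, interview language should not predict a correct answer to any single item.

```python
# Differential item functioning (DIF) check for one knowledge item, on simulated data.
# All names are hypothetical; 'spanish' = 1 if the respondent interviewed in Spanish.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 3_000
knowledge = rng.normal(size=n)            # stand-in for an overall knowledge scale
spanish = rng.binomial(1, 0.5, size=n)
# Build in item bias: the item is harder for Spanish interviewees at equal knowledge.
p_correct = 1 / (1 + np.exp(-(1.2 * knowledge - 0.6 * spanish)))
df = pd.DataFrame({"correct": rng.binomial(1, p_correct),
                   "knowledge": knowledge, "spanish": spanish})

# A reliable 'spanish' coefficient, conditional on knowledge, flags uniform DIF;
# the interaction term checks for non-uniform DIF.
uniform = smf.logit("correct ~ knowledge + spanish", data=df).fit(disp=0)
nonuniform = smf.logit("correct ~ knowledge * spanish", data=df).fit(disp=0)
print(uniform.params["spanish"], nonuniform.pvalues["knowledge:spanish"])
```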

Where’s the (Theoretical) Beef?
But even if the preceding methodological challenges are resolved, there is the issue
of theory—​or rather, a relative lack of it. Most research on the language-​opinion
connection focuses on detecting this relationship and ruling out alternative influences.
Less emphasis is placed on why language is even linked to survey response in the first
place (Garcia 2009; Lien et al. 2004; Welch et al. 1973; Wong et al. 2011). For example,
Lee and Pérez argue that language gaps “cannot be reduced to a technical matter about
omitted variable bias, measurement error, or status deference” (Lee and Pérez 2014,
20). But studies like these neglect to clarify how language shapes which aspect of survey
response. Hence, a more convincing case still needs to be made about the pathway(s)
linking language nuances to individual differences in survey response.
So, evidence on the language-​opinion connection is assailable on several fronts. Yet
my sense is that these challenges can be conquered by looking beyond established results
in public opinion research. One area worthy of attention is the work of psychologists,
which illuminates the micromechanisms behind language effects on thinking. Indeed, if
our target is to develop more agile theories to explain the language-​opinion connection,
then heeding these psychological insights stands to make survey researchers sharper
marksmen. Let me explain why.

Language and Thinking: The View from Psychology

The idea that language affects people’s thinking is often associated with the amateur lin-
guist Benjamin Whorf, who (in)famously claimed that people’s thoughts are completely
determined by language (i.e., linguistic determinism) (Swoyer 2014; Boroditsky et al.
2003). In one of his stronger expositions of this view, Whorf (1956, 221) asserted that
users of varied tongues are led to “different . . . observations . . . and . . . evaluations of
externally similar acts . . ., and hence are not equivalent as observers but must arrive
at . . . different views of the world.”
While certainly intriguing and ambitious, Whorf ’s hypothesis slowly ran aground
on several shoals of criticism, eventually sinking his outlook on language and thinking.
Some of the distress experienced by his hypothesis was self-​inflicted. For all of its bra-
vado, Whorf ’s contention was remarkably light on evidence, with most support based
on anecdote and personal observation of the languages he studied. Some of the trouble,
though, arose from unpersuaded skeptics, who marshalled evidence that shredded the
hull of Whorf ’s hypothesis. Especially damaging here were studies showing that non-​
English speakers could learn English color categories, even though their native tongue
had few words for color (Heider 1972; Rosch 1975).
Consequently, many scholars have found Whorf ’s hypothesis untenable and
unfalsifiable (Boroditsky 2003; Swoyer 2014). But a new generation of psychologists has
refashioned his claim into weaker, but still interesting and testable, versions (Boroditsky
2001; Fuhrman et al. 2011; Marian and Neisser 2000). These researchers have threaded
together varied theoretical accounts about language’s cognitive effects, with their
findings yielding a rich tapestry of evidence. Let us examine some of the parts making
up this whole.

Grammatical Differences and “Thinking for Speaking”


One way psychologists have rehabilitated Whorf ’s hypothesis is by drawing on Slobin’s
notion of “thinking for speaking.” Slobin (1996) argues that languages vary in their
grammatical organization, which obliges speakers to focus on varied aspects of their ex-
perience when using a given tongue. As he explains, the “world does not present ‘events’
and ‘situations’ to be encoded in language. Rather, experiences are filtered through lan-
guage into verbalized events.” For example, gender-​less languages, like Finnish, do not
require speakers to designate the gender of objects. In fact, even the word for “he” and
“she” is the same in these tongues. In sharp contrast, gendered tongues, like Spanish, re-
quire speakers to differentiate genders and assign it to objects. For example, to say that
“the sun is rising,” Spanish speakers must denote the masculinity of the “sun” by using
the definite article el, as in “el sol va saliendo.”
Using this framework, psychologists have gathered new and more convincing evi-
dence that language can affect various aspects of cognition, including how people rep-
resent objects in memory (Boroditsky et al. 2003) and how they distinguish between
shapes and substances (Lucy and Gaskins 2001). One research stream studies how
quirks of grammar yield nuances in “thinking for speaking” and thus, variations in how
people sense or judge phenomena (Boroditsky and Gaby 2010; Boroditsky et al. 2003;
Cubelli et al. 2011; Fuhrman et al. 2011; Vigliocco et al. 2005). Here Boroditsky (2001)
teaches us that languages, like English and Mandarin, vary by how they conceptualize
time. English speakers delineate time horizontally with front/​back terms, as in “what
lies ahead of us” and “that which is behind us.” Mandarin speakers employ front-​back
terms, too, but they also use vertical metaphors, as in earlier events being “up” and later
events being “down.”
Such differences should hardly matter, right? Yet careful research shows that these
language differences can become important when individuals think about time. For ex-
ample, Boroditsky (2001) primed English and Mandarin speakers with horizontal cues
(e.g., a visual of a black worm ahead of a white worm) or vertical ones (e.g., a visual of
a black ball above a white ball). Remarkably, she found that Mandarin speakers were
milliseconds faster in confirming that March precedes April when primed vertically
rather than horizontally.
Other scholars have shown that “thinking for speaking” affects people’s sense of spa-
tial locations (Li and Gleitman 2002). For example, Dutch and Tzeltal are tongues that
describe spatial relations in relative and absolute terms, respectively. Seizing this nuance,
Levinson (1996) sat Dutch and Tzeltal speakers at a table with an arrow pointing right
(north) or left (south). He then rotated subjects 180 degrees to a new table with arrows
pointing left (north) and right (south), asking them to choose the arrow that was like the
earlier one. Dutch speakers generally chose in relative terms. If the first arrow pointed
right (north), then they chose the arrow that pointed right (south). In contrast, Tzeltal
speakers generally chose in absolute terms. If the first arrow pointed north (right), then
they chose an arrow that pointed north (left).

Language and the Encoding Specificity Principle


The studies discussed above powerfully illustrate how “thinking for speaking” can
clarify the influence of language on cognition. But this is not the only way to explain
language’s influence over people’s minds. Other research has drawn inspiration
from what psychologists call the encoding specificity principle, the idea that people
recall information more easily when there is a match between how they learned it
(i.e., encoding) and how they draw it from memory (i.e., retrieval) (Tulving and
Thomson 1973; cf. Godden and Baddeley 1975; Grant et al. 1998).
Accordingly, Marian and associates argue that language facilitates memory re-
call when the tongue used to retrieve information (e.g., childhood memories)
matches the tongue in which the content was acquired (Marian and Neisser 2000).
For example, Marian and Fausey (2006) taught Spanish-​English bilinguals infor-
mation about history, biology, chemistry, and mythology in both tongues. Subjects’
memories were more accurate, and their recall faster, when they retrieved the mate-
rial in the language they learned it in. Similarly, Marian and Kaushanskaya (2007)
asked Mandarin-​English bilinguals to “name a statue of someone standing with
a raised arm while looking into the distance.” Subjects were more likely to say the
Statue of Liberty when cued in English, but more likely to identify the Statue of Mao
Zedong if cued in Mandarin.
Rounding out this research, Marian and her colleagues have also demonstrated
that memories encoded in a specific language are more emotionally intense
when retrieved in that tongue. Marian and Kaushanskaya (2004) asked Russian-​
English bilinguals to narrate a life event that came to mind when given a prompt,
with the researchers tape-​recording all narrations. Two raters coded all the
narrations for their emotional intensity. In line with the encoding specificity
principle, the authors found that subjects articulated narratives that were more
emotionally intense when the language of encoding was congruent with the lan-
guage of retrieval.

The Interface Between Language and Culture


Another fruitful research area examines the bond between language and culture. Social
psychologists have found a strong link between varied tongues and specific cultures,
in which any “two languages are often associated with two different cultural systems”
(Hong et al. 2000, 717; cf. Bond and Yang 1982; Ralston et al. 1995). The paradigmatic
example is research on the private and collective self (Triandis 1989). This work suggests
a private and collective self exists in all of us, with the former revealed in thoughts about
the individual person (e.g., “I am great”) and the latter in thoughts about a person’s
group membership(s) (e.g., “I am a family member”) (Triandis 1989). Yet the relative
emphasis a person places on these selves varies between cultures, with people in individ-
ualist cultures like the United States reporting more private self-​cognitions than peers in
collectivist cultures like China (Trafimow and Smith 1998; Trafimow et al. 1991). For ex-
ample, Ross and colleagues (2002) randomly assigned Chinese-​born subjects in Canada
to complete a study in Chinese or English. Revealingly, subjects who participated in
Chinese reported more cognitions about the self in relation to others (“I am a family
member”) than did those participating in English.

The Automatic Influence of Language on Thought


Finally, within this sea of studies on language and thought there is an isle of work
suggesting that language automatically shapes people’s attitudes (Danziger and
Ward 2010; Ogunnaike et al. 2010). What makes this research compelling is that the
attitudes people express in these studies are not self-​reported, but implicit—​that is,
nonverbalized, spontaneously activated, and difficult to control (Pérez 2013). This
implies that language’s cognitive influence is sparked well before people start to cobble
together an opinion to report (Lodge and Taber 2013).
Danziger and Ward (2010), for example, had Arab Israeli undergraduate students
complete an Implicit Association Test (IAT), a millisecond measure that assesses how
quickly people associate different objects like racial groups with words of varied positive
or negative valence (Pérez 2013). The IAT here measured automatic associations between
Arabs (Jews) and words with negative (positive) valence. Subjects completed the IAT in
either Arabic or Hebrew on a random basis. Strikingly, Arab-​Israeli bilinguals evaluated
Arabs less positively when completing the IAT in Hebrew than in Arabic. Yes, you read
that right: people’s spontaneous judgment of ethnic groups shifted with the language used
to evaluate them.
This tantalizing result does not seem to be a fluke, for other researchers have detected
a similar pattern, not once, but twice—​and in different samples, no less. In a study of
Moroccan Arabic-​French bilinguals, Ogunnaike and associates (2010) found that
subjects automatically evaluated Arabic names more positively than French names
when completing an IAT in Arabic. Not to be outdone, a second study revealed that U.S.
Spanish-​English bilinguals automatically evaluated Spanish names more positively than
English names when completing an IAT in Spanish.
These studies are also crucial for another reason. We learned earlier that comparing
the opinions of varied language speakers is difficult because people may construe survey
questions differently. One solution is to establish language effects on nonlinguistic tasks
(Boroditsky 2001), which do not require the use of language (or very little of it). By
showing language effects on the IAT, in which no verbal expression of attitudes occurs,
Danziger and Ward (2010) and Ogunnaike and colleagues (2010) bolster the claim that
language yields nuances in people’s thinking.

Toward the Psychology of Language Effects on Survey Response

Clearly, cognitive and social psychologists have amassed a trove of theoretical insights,
complete with empirical evidence, about how language can affect people’s thinking.
But is any of this relevant for survey response? I would like to think so, but the situ-
ation is a little more complex than that. First, most of the evidence we just discussed
is from small-​scale experiments (N < 50)  with convenience samples (Boroditsky
2001; Cubelli et al. 2011; Fuhrman et al. 2011; Marian and Neisser 2000). Low statis-
tical power thus becomes a concern. With so few observations, the deck is stacked
against finding a true effect in these tiny samples; and, when an effect is detected, the
likelihood that it is real and not due to chance is worryingly low (Button et al. 2014;
Cohen 1992).
Second, these challenges are compounded by the “college sophomore” problem
(Sears 1986). Most studies of language effects center on undergraduate college students,
which raises concerns about external validity or whether language can influence
thinking across different subjects, research settings, timings, treatments, and outcomes
(McDermott 2011; Shadish et al. 2002). College students are a thin slice of any popu-
lation, which is a problem insofar as scholars wish to make claims about whether lan-
guage affects survey response in the mass public, where the public entails more than
just “college sophomores.” Thus, one way to increase the external validity of language
effects research is to conduct experimental tests in nonlab settings, with more varie-
gated samples, and with survey response as a dependent variable—​in other words, in a
public opinion survey.
Third, there is a tangled knot between language and culture. Those who do studies
on language and thinking find it difficult to rule out that the main driver of observed
differences between varied language speakers is the tongues they use, not the cultures
they inhabit (Bond and Yang 1982; Ralston et al. 1995; Ross et al. 2002; Trafimow et al.
1991). An even bigger specter, perhaps, is that language might be endogenous to culture,
which would make it hard to sustain the claim that language causes shifts in people’s
survey reports (King et al. 1994).
These are all delicate issues that complicate the wholesale transfer of psycholog-
ical insights to the realm of survey response. But they are not insurmountable, and
they should not detract from formulating theories to explain the language-​opinion
connection. For example, low statistical power is easy to “fix.” Increasing any study’s
power simply demands that researchers be more explicit about the effect sizes they an-
ticipate a priori, while collecting enough observations to be able to detect effects of that
size if they do, in fact, exist.
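As a rough illustration of that arithmetic, the sketch below uses the power routines in Python's statsmodels package to ask how many respondents per language condition a simple two-group comparison would need at conventional thresholds; the effect sizes are the small, medium, and large benchmarks of Cohen (1992), not estimates from any particular study.

```python
# Required sample size per group for a two-group comparison at 80% power, alpha = .05.
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower()
for d in (0.2, 0.5, 0.8):   # small, medium, large effects (Cohen 1992)
    n_per_group = power.solve_power(effect_size=d, alpha=0.05, power=0.8)
    print(f"d = {d}: about {n_per_group:.0f} respondents per language condition")
```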
Public opinion researchers can also credibly address the “college sophomore” issue,
though the solution is not as obvious as it might seem. If the problem is that most studies
of language effects are based on students, then the temptation is to run such experiments
on samples that are representative of a population of interest. But the broader issue here
is external validity: the extent to which a language-​opinion connection arises, not just
in larger and more heterogeneous samples, but also across varied research settings,
timings, treatments, and outcomes (McDermott 2011; Shadish et  al. 2002). For the
language-​opinion connection, this entails answering basic questions, such as whether
language shapes survey response across varied samples and data-​collection modes (i.e.,
online, telephone, and face-​to-​face surveys).
Ruling out that language is endogenous to culture can also be overcome with heavy
conceptual lifting. “Culture” is a loaded term that means different things to different
people. Hong and colleagues (2000, 710) note that a common but static view of cul-
ture defines it as a “highly general structure, such as an overall mentality, worldview,
or value orientation.” Yet a more dynamic view of culture deems it a shared mental
map that includes “unstated assumptions, tools, norms, values, habits about sampling
the environment, and the like” (Triandis and Suh 2002: 136), which can be activated by
speaking a specific tongue (Ross et al. 2002; Trafimow et al. 1991). If one views culture
statically, then distinguishing it from the tongue one speaks will involve manipulating
language across distinct cultures and observing its effect on similar outcomes, which
would reveal whether language comparably affects thinking in varied cultural settings
(McDermott 2011). But if one views culture fluidly, the influence of language on it does
not need disentangling, since language is a trigger to cultural knowledge. It all depends
on one’s perspective.
Ultimately, however, resolving these challenges only clears the path for the more diffi-
cult task that is theory building. In particular, public opinion researchers who are inter-
ested in language effects must still clarify how, when, and among whom survey response
is affected by the tongue in which individuals interview.

How Does Language Affect Survey Response?


The most fundamental question to answer, as I see it, concerns how the tongue one
speaks influences survey response. This is a lot more difficult than it seems, because it
requires researchers to specify what aspect of survey response is affected by language.
For instance, does language affect the content of people’s attitudes, beliefs, and values?
Does it affect how those considerations are retrieved? Or does it influence how they are
expressed?
One promising avenue to pursue is to draw explicitly on Slobin’s (1996) notion of
“thinking for speaking.” This is the idea that languages vary in their grammatical organ­
ization, which obliges speakers to focus on different aspects of their experience when
using a given tongue. As Slobin (1996, 75) explains, this is “the thinking that is carried
out, on-​line, in the process of speaking.” It is the act of encountering the contents of the
mind in a way that is consistent with the grammatical demands of one’s tongue. The
trick here, then, is to precisely identify how such quirks of language can affect survey
response.

That grammar might shape survey responses is not farfetched. For example, Pérez
and Tavits (2015) study the grammatical nuances between gendered and gender-​less lan-
guages to study public attitudes toward gender inequality. They argue that speaking a
gender-​less tongue promotes gender equity by failing to distinguish between male and
female objects. Speakers of a gender-​less language should thus find it harder to perceive
a “natural” asymmetry between the sexes, which leads them to be more supportive of
efforts to combat gender inequality.
To test this, Pérez and Tavits (2015) randomly assign the interview language in a survey
of Estonian-​Russian bilingual adults in Estonia, in which Estonian is a gender-​less lan-
guage and Russian a gendered tongue. Compared to Russian interviewees, Estonian
interviewees are more likely to support making family leave policy flexible so that a fa-
ther can stay home with a baby. They are also more likely to endorse efforts to recruit
more women to top government posts and the nomination of a female defense minister.
Across these outcomes, the boost in the probability of support induced by interviewing
in Estonian ranges between 6% and 8%, which is noteworthy because all other differences
between bilinguals are held constant via randomization. Further, these authors rule out
the possibility that support for efforts to combat gender inequality comes at men's expense; that is, gender-less language speakers do not become pro-female simply by becoming anti-male.
Yet not all public policy proposals break down along gender lines, so perhaps
“thinking for speaking” has limited applicability beyond this crucial, but narrow, do-
main. But recall that “thinking for speaking” variations arise in other areas, like
conceptions of time and space (Boroditsky 2001; Boroditsky and Gaby 2010), which are
incredibly important for how the public evaluates policy proposals. Let me illustrate
with temporal conceptualizations.
Some tongues differ by how future oriented they are. Chen (2013) explains that lan-
guages vary in the degree to which they dissociate the future from the present. Tongues
with a strong future-​time reference (FTR) crisply distinguish the future from the pre-
sent, while weak FTR languages equate the future and present. Chen (2013) argues that
weak-​FTR languages should lead people to engage more in future-​oriented behaviors,
because those tongues conflate “today” with “tomorrow,” finding that speakers of weak-​
FTR tongues save more, retire with more wealth, smoke less, practice safer sex, and are
less obese.
But how might such insights explain people’s policy attitudes? One possibility
acknowledges that time horizons play a major role, as evidenced by research on the tem-
poral dynamics of public opinion (Gelman and King 1993; Stimson 2004). Language
nuances in time perception could plausibly affect public support for policies with
long-​run consequences, such as ones addressing climate change (Pérez and Tavits n.d.;
Villar and Krosnick 2011). Net of one’s ideology or attention to the issue, support for
such policies might be weaker among speakers of tongues with a strong FTR, since they
can more easily discount the future, when climate change consequences will be more
pressing than they are now.
The same is true of public support for changes to entitlement programs (e.g., pensions,
health insurance). Mass publics in many nations face the prospect of reforming
expensive entitlement programs today, so that their governments can remain solvent
tomorrow (Pew Research Center 2014). But perhaps to people who speak a tongue that
allows them to more easily brush off the future, government insolvency does not feel like
an immediate problem. Thus, public resistance to such reforms might partly arise from
language, with speakers of strong FTR tongues expressing less support, since it is easier
for them to downplay the future.
Of course, these last two examples offer more promise than fact. Yet I highlight them
to illustrate how “thinking for speaking” can help public opinion researchers assess not
only whether language can affect survey response, but also in which domains.

When Does Language Affect Survey Response?


Another useful question to consider is when language impacts survey response. One
way to do this is by pushing on the boundaries of what we already know about this phe-
nomenon in a world where language does not seem to matter. There, people express
opinions on the basis of considerations evoked by survey questions (Zaller 1992). Think
of framing effects in which simple changes in the phrasing of survey items generate
noticeable changes in people’s responses (Chong and Druckman 2007). Smith (1987),
for example, shows that survey respondents are much more supportive of spending on
“assistance for the poor” than on “welfare.” That basic word changes affect individual
preferences, by evoking varied considerations, implies that people’s opinions might be
shaped by the very language they use to report those opinions. After all, Marian and
colleagues suggest that individual recall of information is facilitated when the tongue
used to retrieve a memory matches the tongue in which a memory was learned (Marian
and Kaushanskaya 2007; Marian and Fausey 2006; Marian and Neisser 2000).
Drawing on Marian and associates’ insights, Pérez (2014) argues that political
concepts, such as U.S. political facts, are more associated with some languages (e.g.,
English) than others (e.g., Spanish). Hence, some political concepts will be more men-
tally accessible on the basis of interview language. Randomizing the language of an in-
terview among a national sample of English-​Spanish bilingual Latino adults (N = 530),
Pérez (2014) finds, inter alia, that English interviewees report up to 8% more polit-
ical knowledge than Spanish interviewees. That is, just interviewing in English allows
people to report more knowledge about American politics, because those facts are more
strongly tied to English. By the same token, English interviewees report reliably lower
levels of national origin identity (e.g., “I am pleased to be Mexican”), since the nation of
origin is a concept that is more strongly tied to Spanish.
Pérez (2014) then buttresses these results in three ways. First, he analyzes his
survey items to establish that such language gaps are not measurement artifacts (i.e.,
multigroup confirmatory factor analysis) (Davidov 2009; Pérez 2009; Stegmueller
2011). Second, he shows that these language-​opinion gaps are not mediated by bilinguals
experiencing strong emotional reactions (i.e., anxiety, anger, and pride) to interviewing
in one of their tongues (Brader and Marcus 2013). Third, he demonstrates that opinion
differences by language do not stem from English interviewees feeling more efficacious
by interviewing in a dominant tongue, which would motivate them to more thoroughly
search their memories for relevant content.
Nevertheless, Pérez’s (2014) insights stem from an online survey experiment.
True, opinion data are increasingly gathered on the Web, but increasingly is not the
same as always. Many researchers still assess opinions via telephone, face-​to-​face,
and mixed designs (Dutwin and Lopez 2014; Fraga et  al. 2010; Wong et  al. 2011),
and what analysts find in online polls is unlikely to wholly transfer to phone or in-​
person surveys. For example, online polls are anonymous compared to phone or in-​
person surveys, which can affect the prevalence of reported attitudes and behaviors
(e.g., Piston 2010). Once scholars veer into contexts in which interviewees interact
with live interviewers on the phone or face-​to-​face, the relative anonymity of on-
line surveys is replaced with interpersonal pressures arising from respondents
communicating their opinions to an actual person. With live interviewers, it is plau-
sible that respondents will use a survey to “prove” their skill as a speaker of the in-
terview language, perhaps especially when the interviewer is a member of their own
race/​ethnicity (Davis 1997). Alternatively, respondents might use a survey context to
show they are more skilled than the interviewer in the language of survey response;
again, perhaps especially when a respondent and interviewer share the same race/​
ethnicity.5
Scholars can also exploit survey mode differences to shed light on when language
effects are independent of culture (Swoyer 2014). To clarify this, one can imagine
manipulating respondents’ interview language and their assignment to an online or
face-​to-​face survey. The assumption here is that if a survey context shifts from an
anonymous online setting to a face-​to-​face context, the pressure to adhere to cul-
tural norms strengthens, because one is directly observed by an interviewer. If the
language-​opinion connection is independent of culture, one should observe re-
liable opinion differences by interview language, with small differences between
survey modes.
Finally, researchers can further explain when language affects survey response by
clarifying how the tongue one speaks maps onto specific domains. Recall that Pérez
and Tavits (2015) argue that interviewing in a nongendered tongue (i.e., Estonian)
liberalizes one’s views about gender inequality. However, they also show this effect
is less likely when strong social norms surround a topic (e.g., people should disagree
that “men are better political leaders than women”). In the absence of strong norms,
language has a wider berth to affect survey response. Scholars can extend this in-
sight by ascertaining whether the language-​opinion connection also depends on how
crystallized one’s attitudes are, with less crystallized attitudes being more malleable.
Here Zaller (1992) and others (Lodge and Taber 2013; Tourangeau et al. 2000) remind
us that individuals do not possess ready-​made opinions on many matters, leading
people to often report opinions formed on the basis of accessible considerations.
Language effects might therefore be more likely when one’s opinion on a topic is not
preformed.

Whose Survey Response Is Affected by Language?


Most research on language’s influence on cognition focuses on average treatment effects,
that is, on whether nuances between tongues causally impact an evaluation or judgment
(cf. Boroditsky 2001; Marian and Neisser 2000; Lee and Pérez 2014; Ross et al. 2002).
Less explored is whether such language effects are heterogeneous, which demands the
identification of moderators and their integration into research designs.
At least two possibilities come to mind. The first is cognitive sophistication, a
workhorse of public opinion research (Delli Carpini and Keeter 1996; Luskin 1987).
Sophisticated persons possess more and better organized attitudes and beliefs—​
all considerations that they are more adept at tying to their judgments. Language-​
opinion gaps might thus widen across sophistication levels, because experts might
be more skilled at “thinking for speaking” (Slobin 1996) or smoother at retrieving
relevant considerations (Marian and Neisser 2000). Such possibilities can be tested
by measuring sophistication levels and entering them as a moderator in observa-
tional/​experimental analyses, or by blocking experiments on their basis. Either way,
a clearer sense of where scholars are most likely to uncover language effects should
emerge.
Another possible moderator draws on the immigrant origins of many bilin-
gual communities:  generational status. This attribute reflects how far removed
one is from the immigrant experience (Abrajano and Alvarez 2010; Portes and
Rumbaut 2006). First-​generation individuals are foreign born. Second-​generation
individuals are born in a host society to foreign-​born parents. Members of the
third generation or later are born in a host society to native-​born parents. Seizing
on this characteristic, one might reason that the accessibility of American identity
increases among later generation individuals, who are more likely to speak English.
Since American identity is conceptually associated with the English language (Pérez
2014), interviewing in English should make this identity more accessible across gen-
erational status, thereby producing a gap in American identity levels within immi-
grant groups.
The question of whose opinions are swayed by language differences can also be
answered by tinkering with the bilingual populations that are studied. Not all
bilinguals are created equal. For example, among U.S. Latinos, bilinguals typically
speak English and Spanish. But some of these individuals learn Spanish first, and
then English, whereas others will learn both languages in the opposite sequence.
Hence, the order in which bilinguals learn their languages, and their standing prefer-
ence for one of them, might affect the strength of language effects. I stress, however,
that there is no “perfect” sample of bilinguals. Instead, heterogeneity in bilinguals’
language repertoires might be usefully exploited to establish boundary conditions
for language effects. That is, among what types of bilinguals do we (not) find lan-
guage effects?
These conditions can be probed by considering how degrees of bilingualism
among self-​ reported bilinguals qualify language effects. Bilinguals are often
identified through self-​reports of skill in two languages (e.g., “Would you say you
can read a newspaper or book in Spanish [English]?”). But this approach lends it-
self to slippage:  people may (un)intentionally misreport their level of skill in two
languages. Determining the reliability of the language-​opinion connection will ulti-
mately depend on whether scholars can consistently uncover it across studies whose
subjects’ degree of bilingualism varies. Yet before we get to that chain of studies,
single investigations will be the order of the day. Figuring out how reliable the
language-​opinion connection is in single studies will require scholars to validate the
self-​reported data they collect from bilinguals. One way is to gauge attitudes with
multiple items so that measurement error can be diagnosed, with lower degrees of
“noise” validating the self-​reported data.
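One simple way to carry out that diagnosis, offered here as a generic sketch rather than a procedure drawn from the studies cited, is to compute an internal-consistency estimate such as Cronbach's alpha over the multi-item battery; low values signal the kind of noise that should temper confidence in bilinguals' self-reports.

```python
# Cronbach's alpha for a battery of k self-report items (rows = respondents).
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2-D array of shape (n_respondents, k_items)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Hypothetical example: four 5-point self-reports of language skill, simulated so
# that each item reflects a common underlying skill plus item-specific noise.
rng = np.random.default_rng(2)
true_skill = rng.normal(size=(500, 1))
battery = np.clip(np.rint(3 + true_skill + rng.normal(scale=0.7, size=(500, 4))), 1, 5)
print(round(cronbach_alpha(battery), 2))   # higher values indicate less measurement noise
```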
With so much emphasis on bilinguals, it is easy to forget that language effects also
imply an influence on monolinguals. Acknowledging this can help scholars make better
sense of puzzling empirical patterns in public opinion research. For example, why is it
that in the United States, Latinos report substantially lower levels of knowledge about
American politics, even after holding constant individual differences in established
correlates of political information (e.g., age, education, political interest)? Well, if facts
about U.S. politics are generally more associated with the English language (Pérez 2014),
then for the many Latinos who prefer to interview in Spanish, this information will be
systematically less accessible, thus contributing to the observed deficit in Latino knowl­
edge about U.S. politics.
Methodologically, researchers can gain a better grip on language’s influence on survey
response by actively incorporating monolinguals into experimental designs. One way
is for researchers to employ monolinguals as something of a control group, allowing
scholars to make better sense of language effects (Ross et al. 2002). Here researchers
can use monolinguals to see how the opinions of bilinguals from the same culture com-
pare. For example, are the opinions of Latino bilinguals who interview in English com-
parable to those of Latinos who are English monolinguals? Researchers might also
consider using monolinguals from different cultures, such as whites who are English
monolinguals, and compare them to Latino bilinguals who interview in English. If the
opinions of the former resemble those of the latter, then it is harder to say that culture
drives opinion differences.
Finally, most psychological and political science studies of language effects inves-
tigate differences between individuals, usually bilinguals within specific national
contexts. But bilinguals are a unique subpopulation, which calls into question the gen-
eralizability of such results to a larger context beyond their specific setting. One way to
further validate these results is by analyzing cross-​national differences in the language-​
opinion connection. This can involve examining the impact of aggregate language
differences on aggregate indicators of opinion. It can also entail analyzing differences
between individuals from nations that primarily speak different tongues. Finding fur-
ther evidence like this (Chen 2013) can bolster the case that observed language effects
are not a strict function of the within-​nation analysis of bilinguals usually undertaken
by researchers.

Conclusion: So What, and What to Do?

The preceding pages underscore that failure to include language in models of survey re-
sponse risks distorting our conceptual understanding about how people form opinions,
since language can affect what is activated in people’s minds, what people retrieve from
memory, and what individuals ultimately report in surveys. But some might be tempted
to ask: So what? Many of the language effects I have discussed seem subtle, to the point
of perhaps being negligible.
That is one way to interpret the evidence I have discussed. Another way is to evaluate
the empirical record in terms of effect sizes and their possible implications. For example,
using Cohen’s d as a yardstick, where d is a mean difference divided by its standard de-
viation, language effects on the mental accessibility of attitudes, beliefs, and so forth are
often large (d ≈ .80) (cf. Ogunnaike et al. 2010). This implies that some of language’s
biggest impact occurs at a deep, automatic level, influencing what is initially activated in
memory (Lodge and Taber 2013).
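For readers who want the yardstick spelled out, the standard two-group version of Cohen's d divides the difference in group means by a pooled standard deviation:

```latex
d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\text{pooled}}},
\qquad
s_{\text{pooled}} = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}.
```

By this metric, for example, a 0.10 difference in reported support between interview-language groups whose pooled standard deviation is 0.50 corresponds to d = .20, the lower bound of the reported-opinion effects discussed next.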
When we turn to reported opinions, effect sizes across observational and experimental re-
search often run between small (d ≈ .20) (Pérez and Tavits 2015) and medium (d ≈ .50) (Pérez
2011). Should analysts care about modest language effects like these? Yes, because even if they
are small, they can have large ramifications. Take gender inequality (Pérez and Tavits 2015),
in which language effects are reliable but lower in size (d ≈ .20). Such effects help to illumi-
nate why gender inequality persists in many nations despite aggregate improvements in their
socioeconomic development, which is known to narrow gender gaps (Doepke et al. 2012).
What, then, should researchers do in light of small and large language effects, espe-
cially since interview language is generally omitted from statistical models of survey
response? One might be tempted here to minimize, if not eliminate, the influence of
language by design:  for example, by pairing rigorous questionnaire translation with
cognitive interviews before the full survey goes into the field. Such efforts, however, are
effective only at ensuring that survey responses are comparable across different lan-
guages (i.e., measurement equivalence). That is, they are a fix to a methodological nui-
sance. Yet the influence of language on survey response is a theoretical proposition, one
backed by scores of psychological studies and some emerging political science research.
The real question, then, is how scholars can empirically account for this theoretical rela-
tionship between language and survey response.
One answer is to include interview language as a covariate in regression models of
survey response. But given the challenges of this approach—​for example, bloated
specifications that overadjust statistical estimates—​scholars might use inclusion of in-
terview language to streamline statistical models of survey response. For example,
in models of Latino opinion, native-​born and citizenship status could plausibly be
reinterpreted as proxies for language’s distal influence, thus substituting one variable
for two. Beyond simply treating language as a covariate, scholars might also consider
conceptualizing language as a moderator of survey response (Baron and Kenny 1986),
with interview language strengthening (weakening) the relationship between another
factor (e.g., national identity) and survey response (e.g., opposition to immigration).
Nevertheless, these strategies only address the direct association between re-
ported opinions and interview language. They do nothing about language effects fur-
ther up people’s cognitive stream, where the ingredients of individual opinions first
come to mind (Lodge and Taber 2013; Zaller 1992). This requires looking at different
outcomes, such as millisecond differences in the activation of people’s mental contents.
It also entails different modeling strategies, such as mediational analyses, to investi-
gate whether the impact of language on survey response is channeled through these
differences in activation (Baron and Kenny 1986; Imai et al. 2011).
In the end, however, survey researchers should care about language for theoretical,
not methodological reasons. Indeed, without a more concerted effort to engage and in-
tegrate language’s manifold cognitive effects into models of survey response, researchers
risk misinterpreting why people report what they do in public opinion polls.

Acknowledgments
I am indebted to Cindy Kam for her incisive feedback, which enabled me to clarify key ideas in
this chapter. I am also grateful for Kristin Michelitch’s helpful reactions and advice on an early
version of this project, as well as the editors’ thoughtful and constructive feedback on the initial
draft of this manuscript. I also appreciate Marc Trussler’s assistance in editing this paper. Finally,
I thank my young sons, Efrén III and Emiliano, for providing me with an even deeper apprecia-
tion for the power of language.

Notes
1. On the relevance of language for politics beyond the United States, see Laitin (1992); May
(2012); Schmid (2001).
2. Many studies of measurement equivalence have a cross-​national focus, since comparisons
of countries on latent traits are an area in which a lack of equivalence is likely (Davidov 2009;
Stegmueller 2011). But in this research, language is only one of many possible reasons for a
lack of equivalence. Still, the logic and criteria guiding cross-​national analyses of measure-
ment equivalence also guide similar tests in cross-​language settings (Pérez 2011).
3. To diagnose measurement equivalence (e.g., multigroup confirmatory factor analysis),
researchers often need multiple measures of a trait. Yet such data are scarce, since scholars
must weigh the inclusion of multiple items for single traits against space limitations, re-
spondent fatigue, and so forth. Further, even when such data exist, analyses of equivalence
only reveal whether items meet this criterion (Davidov 2009). Some methods can statisti-
cally correct a lack of equivalence (Stegmueller 2011), but these do not fully clarify what lan-
guage features yield nonequivalence.
4. This entails formally verifying measurement equivalence across languages. Scholars can
also work toward measurement equivalence in the design stage by appraising the quality
of their questionnaire via pretesting, such as cognitive interviews with a small set of
respondents, which can identify translation problems (Harkness et al. 2003).
5. This is not to mention the possible complex interactions between these groups and lan-
guage (e.g., as in a phenotypically light Latino interviewing a phenotypically dark Latino in
English versus Spanish).

References
Abrajano, M. A., and R. M. Alvarez. 2010. New Faces, New Voices: The Hispanic Electorate in
America. Princeton, NJ: Princeton University Press.
Almond, G. A., and S. Verba. 1963. The Civic Culture: Political Attitudes and Democracy in Five
Nations. Princeton, NJ: Princeton University Press.
Baron, R. M., and D. A. Kenny. 1986. “The Moderator-​Mediator Variable Distinction in Social
Psychological Research:  Conceptual, Strategic, and Statistical Considerations.” Journal of
Personality and Social Psychology 51 (6): 1173–​1182.
Bond, M. H., and K. Yang. 1982. “Ethnic Affirmation versus Cross-​Cultural Accommodation:
The Variable Impact of Questionnaire Language on Chinese Bilinguals from Hong Kong.”
Journal of Cross-​Cultural Psychology 13: 169–​185.
Boroditsky, L. 2001. “Does Language Shape Thought? Mandarin and English Speakers’
Conceptions of Time.” Cognitive Psychology 43: 1–​22.
Boroditsky, L. 2003. “Linguistic Relativity.” In Encyclopedia of Cognitive Science, edited by L.
Nadel, pp. 917–​921. London: Macmillan Press.
Boroditsky, L., and A. Gaby. 2010. “Remembrances of Times East:  Absolute Spatial
Representations of Time in an Australian Aboriginal Community.” Psychological Science 21
(11): 1635–​1639.
Boroditsky, L., L. A. Schmidt, and W. Phillips. 2003. “Sex, Syntax, and Semantics.” In Language
in Mind: Advances in the Study of Language and Cognition, edited by D. Gentner and S.
Goldin-​Meadow, pp. 61–​79. Boston: MIT Press.
Brader, T., and G. E. Marcus. 2013. “Emotion and Political Psychology.” In The Oxford
Handbook of Political Psychology, edited by L. Huddy, D.O. Sears, and J.S. Levy, pp. 165–​204.
Oxford: Oxford University Press.
Button, K. S., J. P. A. Ioannidis, C. Mokrysz, B. A. Nosek, J. Flint, E. S. J. Robinson, and M.
R. Munafo. 2014. “Power Failure:  Why Small Sample Size Undermines Reliability of
Neuroscience.” Nature Reviews Neuroscience 14: 365–​376.
Chen, M. K. 2013. “The Effect of Language on Economic Behavior: Evidence from Savings
Rates, Health Behaviors, and Retirement Assets.” American Economic Review 103
(2): 690–​731.
Chong, D., and J. N. Druckman. 2007. “Framing Theory.” Annual Review of Political Science
10: 103–​126.
Clarke, Kevin A. 2005. “The Phantom Menace:  Omitted Variable Bias in Econometric
Research.” Conflict Management and Peace Science 22 (4): 341–​352.
Cohen, J. 1992. “Statistical Power Analysis.” Current Directions in Psychological Science 1
(3): 98–​101.
Collins, A. M., and E. F. Loftus. 1975. “A Spreading-​Activation Theory of Semantic Processing.”
Psychological Review 82: 407–​428.
Cubelli, R., D. Paolieri, L. Lotto, and R. Job. 2011. “The Effect of Grammatical Gender on Object
Categorization.” Journal of Experimental Psychology: Learning, Memory, and Cognition 37
(2): 449–​460.
Danziger, S., and R. Ward. 2010. “Language Changes Implicit Associations Between Ethnic
Groups and Evaluation in Bilinguals.” Psychological Science 21 (6): 799–​800.
Davidov, E. 2009. “Measurement Equivalence of Nationalism and Constructive Patriotism in
the ISSP: 34 Countries in a Comparative Perspective.” Political Analysis 17 (1): 64–​82.
Davidov, E., and S. Weick. 2011. “Transition to Homeownership Among Immigrant Groups
and Natives in West Germany, 1984–​2008.” Journal of Immigrant and Refugee Studies
9: 393–​415.
Davis, D. W. 1997. “The Direction of Race of Interviewer Effects Among African
Americans: Donning the Black Mask.” American Journal of Political Science 41 (1): 309–​322.
de la Garza, R., L. DeSipio, F. Garcia, J. Garcia, and A. Falcon. 1992. Latino
Voices: Mexican, Puerto Rican, and Cuban Perspectives on American Politics. Boulder,
CO: Westview Press.
Delli Carpini, M. X., and S. Keeter. 1996. What Americans Know about Politics and Why It Matters. New Haven, CT: Yale University Press.
Doepke, M., M. Tertilt, and A. Voena. 2012. “The Economics and Politics of Women’s Rights.”
Annual Review of Economics 4: 339–​372.
Druckman, J. N., D. P. Green, J. H. Kuklinski, and A. Lupia. 2011. “Experiments: An Introduction
to Core Concepts.” In Cambridge Handbook of Experimental Political Science, edited by J.
N. Druckman, D. P. Green, J. H. Kuklinski, and A. Lupia, pp. 15–​26. New York: Cambridge
University Press.
Dutwin, D., and M. H. Lopez. 2014. “Considerations of Survey Error in Surveys of Hispanics.”
Public Opinion Quarterly 78 (2): 392–​415.
Ervin, S., and R. T. Bower. 1952. “Translation Problems in International Surveys.” Public
Opinion Quarterly 16 (4): 595–​604.
Fraga, L., J. Garcia, R. Hero, M. Jones-​Correa, V. Martinez-​Ebers, and G. Segura. 2010. Latino
Lives in America: Making It Home. Philadelphia: Temple University Press.
Fuhrman, O., K. McCormick, E. Chen, H. Jian, D. Shuaimei, S. Mao, and L. Boroditsky. 2011.
“How Linguistic and Cultural Forces Shape Concepts of Time: English and Mandarin in 3D.”
Cognitive Science 35: 1305–​1328.
Garcia, J. A. 2009. “Language of Interview: Exploring Patterns and Consequences of Changing
Language During Interview.” Paper presented at the Annual Meeting of the Western Political
Science Association, Vancouver, BC.
Gelman, A., and G. King. 1993. “Why Are American Presidential Election Campaign
Polls So Variable When Votes Are So Predictable?” British Journal of Political Science 23
(4): 409–​451.
Godden, D. R., and A. D. Baddeley. 1975. “Context-​Dependent Memory in Two Natural
Environments: On Land and Underwater.” British Journal of Psychology 66 (3): 325–​331.
Grant, H. M., L. C. Bredahl, J. Clay, J. Ferrie, J. E. Groves, T. A. McDorman, and V. J. Dark. 1998.
“Context-​Dependent Memory for Meaningful Material: Information for Students.” Applied
Cognitive Psychology 12: 617–​623.
Harkness, J. A., F. J. R. Van de Vijver, and P. Mohler. 2003. Cross-​Cultural Survey Methods.
New York: John Wiley.
Heider, E. R. 1972. “Universals in Color Naming and Memory.” Journal of Experimental
Psychology 93 (1): 10–​20.
Hochman, O., and E. Davidov. 2014. “Relations Between Second-​Language Proficiency and
National Identification: The Case of Immigrants in Germany.” European Sociological Review
30 (3): 344–​359.
Hong, Y., M. W. Morris, C. Chiu, and V. Benet-Martínez. 2000. “Multicultural Minds: A
Dynamic Constructivist Approach to Culture and Cognition.” American Psychologist 55
(7): 709–​720.
Horn, J. L., and J. J. McArdle. 1992. “A Practical and Theoretical Guide to Measurement
Invariance in Aging Research.” Experimental Aging Research 18: 117–​144.
Imai, K., L. Keele, D. Tingley, and T. Yamamoto. 2011. “Unpacking the Black Box of
Causality:  Learning About Causal Mechanisms from Experimental and Observational
Studies.” American Political Science Review 105 (4): 765–​789.
Inglehart, R., and P. Norris. 2003. Rising Tide: Gender Equality and Cultural Change Around the
World. Cambridge, UK: Cambridge University Press.
Jacobson, E., H. Kumata, and J. E. Gullahorn. 1960. “Cross-​Cultural Contributions to Attitude
Research.” Public Opinion Quarterly 24 (2): 205–​223.
King, G., R. O. Keohane, and S. Verba. 1994. Designing Social Inquiry: Scientific Inference in
Qualitative Research. Princeton, NJ: Princeton University Press.
Laitin, D. 1992. Language Repertoires and State Construction in Africa. Cambridge,
UK: Cambridge University Press.
Lee, T. 2001. “Language-​of-​Interview Effects and Latino Mass Opinion.” Paper presented at the
Annual Meeting of the Midwest Political Science Association, Chicago, IL.
Lee, T., and E. O. Pérez. 2014. “The Persistent Connection Between Language-​of-​Interview and
Latino Political Opinion.” Political Behavior 36 (2): 401–​425.
Levinson, S. 1996. “Frames of Reference and Molyneux’s Question: Cross-​Linguistic Evidence.”
In Language and Space, edited by P. Bloom, M. Peterson, L. Nadel, and M. Garrett, pp. 109–​
169. Cambridge: MIT Press.
Li, P., and L. Gleitman. 2002. “Turning the Tables: Language and Spatial Reasoning.” Cognition
83 (3): 265–​294.
Lien, P., M. Conway, and J. Wong. 2004. The Politics of Asian Americans. New  York:
Routledge.
Lodge, M., and C. Taber. 2013. The Rationalizing Voter. Cambridge, UK:  Cambridge
University Press.
Lucy, J., and S. Gaskins. 2001. “Grammatical Categories and the Development of Classification
Preferences: A Comparative Approach.” In Language Acquisition and Conceptual Development,
edited by L. Bowermand and S. Levinson, pp. 265–​294. Cambridge: Cambridge University Press.
Luskin, R. C. 1987. “Measuring Political Sophistication.” American Journal of Political Science 31
(4): 856–​899.
Marian, V., and C. M. Fausey. 2006. “Language-​Dependent Memory in Bilingual Learning.”
Applied Cognitive Psychology 20: 1025–​1047.
Marian, V., and M. Kaushanskaya. 2004. “Self-​Construal and Emotion in Bicultural Bilinguals.”
Journal of Memory and Language 51(2): 190–​201.
Marian, V., and M. Kaushanskaya. 2007. “Language Context Guides Memory Content.”
Psychonomic Bulletin and Review 14: 925–​933.
Marian, V., and U. Neisser. 2000. “Language-​ Dependent Recall of Autobiographical
Memories.” Journal of Experimental Psychology: General 129: 361–​368.
May, S. 2012. Language and Minority Rights: Ethnicity, Nationalism, and the Politics of Language.
New York: Routledge.
McDermott, R. 2011. “Internal and External Validity.” In Cambridge Handbook of Experimental
Political Science, edited by J. N. Druckman, D. P. Green, J. H. Kuklinski, and A. Lupia, pp.
27–​40. New York: Cambridge University Press.
Ogunnaike, O., Y. Dunham, and M. R. Banaji. 2010. “The Language of Implicit Preferences.”
Journal of Experimental Social Psychology 46: 999–​1003.
Pérez, E. O. 2009. “Lost in Translation? Item Validity in Bilingual Political Surveys.” The Journal
of Politics 71 (4): 1530–​1548.
Pérez, E. O. 2011. “The Origins and Implications of Language Effects in Multilingual Surveys: A
MIMIC Approach with Application to Latino Political Attitudes.” Political Analysis 19
(4): 434–​454.
Pérez, E. O. 2013. “Implicit Attitudes:  Meaning, Measurement, and Synergy with Political
Science.” Politics, Groups, and Identities 1 (2): 275–​297.
Pérez, E. O. 2014. “Accented Politics: How Language Shapes Public Opinion.” Paper presented
at the Fall Meeting of the Symposium on the Politics of Immigration, Race, and Ethnicity
(SPIRE) at the University of Pennsylvania.
Pérez, E. O., and M. Tavits. 2015. “His and Hers: How Language Shapes Public Attitudes Toward
Gender Equality.” Paper presented at the Annual Meeting of the Midwest Political Science
Association.
Pérez, E. O., and M. Tavits. n.d. “Today Is Tomorrow:  How Language Shifts People’s Time
Perspective and Why It Matters for Politics.” Unpublished manuscript, Vanderbilt
University.
Pew Research Center. 2014. “Attitudes about Aging: A Global Perspective.” http://​pewrsr.ch/​
1eawAIB
Piston, Spencer. 2010. “How Explicit Racial Prejudice Hurt Obama in the 2008 Election.”
Political Behavior 32 (4): 431–​451.
Portes, A., and R. G. Rumbaut. 2006. Immigrant America: A Portrait. Berkeley: University of
California Press.
Ralston, D. A., M. K. Cunniff, and D. J. Gustafson. 1995. “Cultural Accommodation: The Effect
of Language on the Responses of Bilingual Hong Kong Chinese Managers.” Journal of Cross-​
Cultural Psychology 26: 714–​727.
Rosch, E. 1975. “Cognitive Representations of Semantic Categories.” Journal of Experimental
Psychology: General 104 (3): 192–​233.
Ross, M., W. Q. E. Xun, and A. E. Wilson. 2002. “Language and the Bicultural Self.” Personality
and Social Psychology Bulletin 28 (8): 1040–​1050.
Ryan, C. 2013. Language Use in the United States: 2011. Washington, DC: U.S. Census Bureau.
Schmid, C. L. 2001. The Politics of Language:  Conflict, Identity, and Cultural Pluralism in
Comparative Perspective. Oxford: Oxford University Press.
Sears, D. O. 1986. “College Sophomores in the Laboratory: Influences of a Narrow Data Base on
Social Psychology’s View of Human Nature.” Journal of Personality and Social Psychology 51
(3): 515–​530.
Shadish, W. R., T. D. Cook, and D. T. Campbell. 2002. Experimental and Quasi-​Experimental
Designs for Generalized Causal Inference. Boston: Houghton-​Mifflin.
Slobin, D. 1996. “From ‘Thought and Language’ to “Thinking for Speaking.” In Rethinking
Linguistic Relativity, edited by J. Gumperz and S. Levinson, pp. 70–​ 96. Cambridge,
UK: Cambridge University Press.
Smith, T. W. 1987. “That Which We Call Welfare by Any Other Name Would Smell Sweeter: An
Analysis of the Impact of Question Wording on Response Patterns.” Public Opinion
Quarterly 51 (1): 75–​83.
Stegmueller, D. 2011. “Apples and Oranges? The Problem of Equivalence in Comparative
Research.” Political Analysis 19: 471–​487.
Stern, E. 1948. “The Universe, Translation, and Timing.” Public Opinion Quarterly 12: 711–​7 15.
Stimson, J. A. 2004. Tides of Consent: How Public Opinion Shapes American Politics. Cambridge,
UK: Cambridge University Press.
Swoyer, C. 2014. “Relativism.” In The Stanford Encyclopedia of Philosophy, edited by E. N. Zalta.
http://​plato.stanford.edu/​archives/​sum2014/​entries/​relativism/​.
Tillie, J., M. Koomen, A. van Heelsum, and A. Damstra. 2012. “EURISLAM—​Final Integrated
Report.” European Commission:  Community Research and Development Information
Service (CORDIS). http://​cordis.europa.eu/​result/​rcn/​59098_​en.html.
Tourangeau, R., L. J. Rips, and K. Rasinski. 2000. The Psychology of Survey Response.
New York: Cambridge University Press.
Trafimow, D., and M. D. Smith. 1998. “An Extension of the “Two-​Baskets” Theory to Native
Americans.” European Journal of Social Psychology 28: 1015–​1019.
Trafimow, D., H. C. Triandis, and S. G. Goto. 1991. “Some Tests of the Distinction Between the
Private Self and the Collective Self.” Journal of Personality and Social Psychology 60: 649–​655.
Triandis, H. C. 1989. “The Self and Social Behavior in Different Cultural Contexts.”
Psychological Review 93: 506–​520.
Triandis, H. C., and E. M. Suh. 2002. “Cultural Influences on Personality.” Annual Review of
Psychology 53: 133–​160.
Tulving, E., and D. Thomson. 1973. “Encoding Specificity and Retrieval Processes in Episodic
Memory.” Psychological Review 80 (5): 352–​373.
Vigliocco, G., D. P. Vinson, F. Paganelli, and K. Dworzynski. 2005. “Grammatical Gender
Effects on Cognition: Implications for Language Learning and Language Use.” Journal of
Experimental Psychology: General 134 (4): 501–​520.
Villar, A., and J. A. Krosnick. 2011. “Global Warming vs. Climate Change, Taxes vs. Prices: Does
Word Choice Matter?” Climatic Change 105: 1–12.
Welch, S., J. Comer, and M. Steinman. 1973. “Interviewing in a Mexican-​ American
Community: An Investigation of Some Potential Sources of Response Bias.” Public Opinion
Quarterly 37(1): 115–​126.
Whorf, B. L. 1956. Language, Thought, and Reality. Cambridge, MA: MIT Press.
Wong, J., S. K. Ramakrishnan, T. Lee, and J. Junn. 2011. Asian American Political
Participation: Emerging Constituents and Their Political Identities. New York, NY: Russell
Sage Foundation.
Zaller, J. 1992. The Nature and Origins of Mass Opinion. Cambridge, UK:  Cambridge
University Press.
Part III

Analysis and Presentation
Chapter 13

Issues in Polling Methodologies
Inference and Uncertainty

Jeff Gill and Jonathan Homola

Introduction

Researchers working with survey research and polling data face many methodological
challenges, from sampling decisions to interpretation of model results. In this chapter
we discuss some of the statistical issues that such researchers encounter, with the goal of
describing underlying theoretical principles that affect such work. We start by describing
polling results as binomial and multinomial outcomes so that the associated properties
are correctly described. This leads to a discussion of statistical uncertainty and the proper
way to treat it. The latter involves the interpretation of variance and the correct treatment
of the margin of error. The multinomial description of data is extended into a composi-
tional approach that provides a richer set of models with specified covariates. We then
note that the dominant manner of understanding and describing survey research and
polling data models is deeply flawed. This is followed by some examples.
The key purpose of this chapter is to highlight a set of methodological issues, clarifying
underlying principles and identifying common misconceptions. Many practices are
applied without consideration of their possibly deleterious effects. Polling data in particular generate challenges that require careful scrutiny.

Polling Results as Binomial and Multinomial Outcomes

Survey research and public opinion data come in a wide variety of forms. However, for almost all kinds of polling data, two statistical distributions
provide the background that is necessary to fully understand them and to be able
to interpret the associated properties correctly:  the binomial and multinomial
distributions.
The binomial distribution is usually used to model the number of successes in
a sequence of independent yes/​no trials. In the polling world, it is most useful when
analyzing data that can take on only two different values, such as the predicted vote
shares in a two-​candidate race. The binomial probability mass function (PMF) and its
properties are given by

• PMF: p(x \mid n, p) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, \ldots, n, \; 0 < p < 1.
• E[X] = np
• Var[X] = np(1-p)

So the binomial count X has mean np and standard deviation \sqrt{np(1-p)}, and the estimated proportion p̂ = X/n has standard deviation \sqrt{p(1-p)/n}. If we
assume that np and n(1 − p) are both bigger than 5, then we can use the normal approximation for p̂. One common application of the binomial distribution to public opinion
data is its use in the construction of confidence intervals (CI) around a given estimate or
prediction. More specifically, if we are interested in building a 1 − α confidence interval
around the unknown value of p (i.e., the vote share of candidate 1), then we start with
p̂ = Ȳ and define

SE(\hat{p}) = \sqrt{\hat{p}(1 - \hat{p})/m}    (1)

by substituting the estimate p̂ for p in the formula for the SE, where m is the number
of separate binomial trials. With m such trials we can standardize using the z-score for p̂:

z = \frac{\hat{p} - p}{SE(\hat{p})}    (2)

As an example, consider a study with 124 respondents who self-identified as Republican
and gave 0.46 support to a particular Republican candidate. The standard error of the
proportion estimate is then given by

SE(\hat{p}) = \sqrt{\frac{0.46(1 - 0.46)}{124}} = 0.0447.    (3)

The 95% confidence interval for π, the true population proportion, is calculated as
follows:

CI_{95\%} = [0.46 - 1.96 \times 0.0447 : 0.46 + 1.96 \times 0.0447] = [0.372 : 0.548].    (4)
This means that nineteen times out of twenty we expect to cover the true population
proportion with this calculation. More generally, a confidence interval is an interval
that contains the true value of the parameter in (1 − α) × 100% of replications, on
average.
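These calculations are easy to verify directly. The following is a minimal R sketch that reproduces the standard error and confidence interval from equations (3) and (4) for the 124-respondent example.

# OBSERVED PROPORTION AND NUMBER OF RESPONDENTS FROM THE EXAMPLE ABOVE
p.hat <- 0.46
m <- 124
# STANDARD ERROR OF THE PROPORTION ESTIMATE, EQUATION (3)
se.p <- sqrt(p.hat*(1 - p.hat)/m)
# 95% CONFIDENCE INTERVAL, EQUATION (4); APPROXIMATELY [0.372, 0.548]
c(p.hat - 1.96*se.p, p.hat + 1.96*se.p)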
The multinomial distribution is a generalization of the binomial distribution that
allows for successes in k different categories. In the polling context, this is useful when
working with data that can take on more than just two different values, for example vote
shares in a multicandidate race such as primary elections. The multinomial probability
mass function and its properties are given by

• PMF: p(x \mid n, p_1, \ldots, p_k) = \frac{n!}{x_1! \cdots x_k!}\, p_1^{x_1} \cdots p_k^{x_k}, \quad x_i = 0, 1, \ldots, n, \; 0 < p_i < 1, \; \sum_{i=1}^{k} p_i = 1.
• E[X_i] = np_i
• Var[X_i] = np_i(1 - p_i)
• Cov[X_i, X_j] = -np_i p_j \;\; (i \neq j)

One useful application of the multinomial distribution is making predictions based on
its properties. Three months before Donald Trump began his presidential campaign, an
ABC News/Washington Post telephone survey of 444 Republicans conducted between
March 26 and March 29, 2015,1 gave the proportions for the listed candidates shown in
table 13.1.2

Table 13.1 Support for Republican Candidates for President

Candidate          Proportion
Jeb Bush           0.20
Ted Cruz           0.13
Scott Walker       0.12
Rand Paul          0.09
Mike Huckabee      0.08
Ben Carson         0.07
Marco Rubio        0.07
Chris Christie     0.06
Other              0.10
Undecided          0.04
None               0.03
Table 13.2 Covariance with Rand Paul

Candidate          Covariance with Rand Paul
Jeb Bush           −72.0
Ted Cruz           −78.3
Scott Walker       −79.2
Mike Huckabee      −82.8
Ben Carson         −83.7
Marco Rubio        −83.7
Chris Christie     −84.6

If we assume that this was a representative sample of likely Republican primary
voters, then we can use the properties of the multinomial to make predictions. For
example, suppose we intend to put another poll into the field with a planned sample
of one thousand likely Republican primary voters and wanted to have an expected
covariance of Rand Paul with each of the other candidates from Cov[X_i, X_j] = −np_i p_j.
This is shown in table 13.2. What we see here is that there is not much difference in
covariances relative to the scale of the differences in the proportions. Notice also that
they are all negative from this equation. This makes intuitive sense, since increased
support for a chosen candidate has to come from the pool of support for all of the
other candidates.
So in all such contests, gains by a single candidate necessarily come at the ex-
pense of other candidates from the multinomial setup. This is less clear in the bi-
nomial case, since the PMF is expressed through a series of trials in which a stated
event happens or does not happen. With k categories in the multinomial, we get a
covariance term between any two outcomes, which is useful in the polling context to
understand which candidates’ fortunes most affect others. Of course any such calcu-
lation must be accompanied by a measure of uncertainty, since these are inferential
statements.
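As a sketch of how such a prediction is computed, the R fragment below constructs the covariance matrix implied by the multinomial formulas above for a planned poll of n = 1,000. It illustrates the calculation itself; the subset of candidates listed here is only for illustration, and the fragment is not intended to reproduce the exact entries of table 13.2.

# PLANNED SAMPLE SIZE FOR THE NEXT POLL
n <- 1000
# CANDIDATE PROPORTIONS (SUBSTITUTE THE FULL SET OF VALUES FROM TABLE 13.1 AS DESIRED)
p <- c(Bush = 0.20, Cruz = 0.13, Walker = 0.12, Paul = 0.09, Huckabee = 0.08)
# MULTINOMIAL COVARIANCE MATRIX: OFF-DIAGONAL -n*p_i*p_j, DIAGONAL n*p_i*(1 - p_i)
Sigma <- -n * outer(p, p)
diag(Sigma) <- n * p * (1 - p)
# EXPECTED COVARIANCES OF RAND PAUL'S COUNT WITH THE OTHER CANDIDATES' COUNTS
Sigma["Paul", ]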

Explaining Uncertainty

This section discusses how uncertainty is a necessary component of polling and how to
properly manage and discuss it. Since pollsters are using samples to make claims about
populations, certain information about the uncertainty of this linkage should always be
supplied. As part of its “Code of Ethics,” the American Association for Public Opinion
Research (AAPOR) provides a list of twenty disclosure items for survey research,
ranging from potential sponsors of a given survey to methods and modes used to ad-
minister the survey to sample sizes and a description of how weights were calculated.3
Most important, and also reflected in AAPOR’s Transparency Initiative, which was
launched in October 2014, the association urges pollsters and survey researchers to pro-
vide a number of indicators that allow informed readers to reach their own conclusions
regarding the uncertain character of the reported data.4 As this section will show, only
a few additional items of information about data quality are crucial in allowing us to in-
terpret results appropriately and with the necessary caution.5
Unfortunately, it is common to provide incomplete statistical summaries. Consider
the LA Times article “Two years after Sandy Hook, poll finds more support for gun
rights,” which appeared on December 14, 2014 (Kurtis Lee). This short piece described
percentages from a Pew Research Center poll6 that took place December 3–​7, 2014. In
describing the structure of the poll, the article merely stated: “Pew’s overall poll has a
margin of error of plus or minus 3 percentage points. The error margin for subgroups is
higher.” Additional important information was omitted, including a national sample size
of 1,507 adults from all fifty states plus the District of Columbia, 605 landline contacts,
902 cell phone contacts, the weighting scheme, α = 0.05, and more. This section clarifies
the important methodological information that should accompany journalistic and ac-
ademic accounts of polling efforts.

Living with Errors
Whenever we make a decision, such as reporting an opinion poll result based on statis-
tical analysis, we run the risk of making an error, because these decisions are based on
probabilistic not deterministic statements. Define first δ as the observed or desired effect
size. This could be something like a percentage difference between two candidates or a
difference from zero. Conventionally label the sample size as n. With hypothesis testing,
either implicit or explicit, we care principally about two types of errors. A Type I Error is
the probability that the null hypothesis of no effect or relationship is true, and we reject
it anyway. This is labeled α, and is almost always set to 0.05 in polling and public opinion
research. A Type II Error is the probability that the null hypothesis is false, and we fail
to reject it. This is labeled β. Often we care more about 1 –​β, which is called power, than
about β. The key issue is that these quantities are always traded off by determination of
α, δ, β, n, meaning that a smaller α implies a larger β holding δ and n constant, a larger
n leads to smaller α and β plus a smaller detectable δ, and fixing α and β in advance (as
is always done in prospective medical studies) gives a direct trade-​off between the effect
size and the data size.
Furthermore, these trade-​offs are also affected by the variance in the data, σ2.
Increasing the sample size decreases the standard error of statistics of interest in
proportion to 1/√n. So the variance can be controlled with sampling to a desired level.
The implication for the researcher with sufficient resources is that the standard errors
can be purchased down to a desired level by sampling enough cases. This, however,
assumes that we know or have a good estimate of the true population variance. In ad-
vance, we usually do not know the underlying variance of the future data generating
process for certain. While academic survey researchers often have at least a rough idea
of the expected variance from previous work (be it their own or that of others), their
counterparts in the media often have very good estimates of the variance due to much
more repetition under similar circumstances.
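For readers who want to see these trade-offs concretely, base R's power.prop.test() solves for whichever of the quantities is left unspecified. The sketch below is illustrative only; the proportions are assumed values rather than figures from the text.

# FIX THE EFFECT SIZE (0.50 VS. 0.55), ALPHA, AND POWER; SOLVE FOR n PER GROUP
power.prop.test(p1 = 0.50, p2 = 0.55, sig.level = 0.05, power = 0.80)
# FIX n INSTEAD AND LET THE REALIZED POWER BE REPORTED
power.prop.test(n = 500, p1 = 0.50, p2 = 0.55, sig.level = 0.05)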
Most polling results are expressed as percentages, summarizing attitudes to-
ward politicians, proposals, and events. Since percentages can also be expressed as
proportions, we can use some simple tools to make these trade-​offs between objectives
and determine the ideal values for α, δ, β, or n respectively (always depending on the
others). Suppose we want to estimate the population proportion that supports a given
candidate, π, and we want a standard error that is no worse than σ = 0.05. To test an effect
size (support level) of 55%, we hypothesize p = 0.55. This is a Bernoulli setup, so we have
a form for the standard error of some estimated p:

σ = p(1 − p) / n , (5)

with a mathematical upper bound of 0.5/√n. Using the hypothesized effect size,
p = 0.55, this means

\sigma = 0.05 = \sqrt{0.55(1-0.55)/n} = 0.49749/\sqrt{n}.    (6)

Rewriting this algebraically means that n = (0.49749/0.05)^2 = 98.999. So 99 respondents
are necessary to test an effect size of 55% with a standard error that is 0.05 or smaller.
Again, notice that sample size is important because it is controllable and affects all of the
other quantities in a direct way.
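This calculation is easily scripted. A minimal R sketch of the sample size determination just described:

# HYPOTHESIZED SUPPORT LEVEL AND THE LARGEST ACCEPTABLE STANDARD ERROR
p <- 0.55
target.se <- 0.05
# SOLVE sigma = sqrt(p*(1 - p)/n) FOR n
n <- p*(1 - p)/target.se^2
ceiling(n)   # 99 RESPONDENTS, MATCHING THE CALCULATION IN THE TEXT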
Now suppose that we just want evidence that one candidate in a two-​candidate
race is in the lead. This is equivalent to testing whether π > 0.5, and is tested with
the sample proportion p = x / n, where x is the number of respondents claiming to
support the candidate of interest. This time we do not have a value for p, so we will use
the value that produces the largest theoretical standard error as a way to be as cautious
as possible:

\sigma = \sqrt{(0.5)(0.5)/n} = \frac{0.5}{\sqrt{n}},    (7)

which maximizes σ due to the symmetry of the numerator. The 95% margin of error is
created by multiplying this value by the α = 0.05 critical value under a normal distri-
bution assumption:

MOE_{\alpha=0.05} = CV_{\alpha=0.05} \times \sigma = 1.96 \times \frac{0.5}{\sqrt{n}}.    (8)
This is used to create a reported 95% confidence interval:

\left[\hat{p} \pm 1.96\,\frac{0.5}{\sqrt{n}}\right].    (9)

To understand whether there is evidence that our candidate is over 50%, we care about
the lower bound of this confidence interval, which can be algebraically isolated,
L = \hat{p} - 1.96\,\frac{0.5}{\sqrt{n}} \quad\Rightarrow\quad n = \left(\frac{0.98}{L - \hat{p}}\right)^2,    (10)

so at p = 0.55 we need n = 384, and at p = 0.65 we need only n = 43. This highlights an
important principle: the higher the observed sample value, the fewer the respondents
needed. If our hypothetical candidate is far in the lead, then we do not need to sample
many people, but if both candidates are in a very close race, then more respondents are
required to make an affirmative claim at the α = 0.05 level. Now what is the power of the
test that the 95% CI will be completely above the comparison point of 0.5? Using a simple
Monte Carlo simulation in R with one million draws, hypothesizing p0 = 0.55, and using
n = 99, we calculate

# SET THE SIMULATION SAMPLE SIZE
m <- 1000000
# GENERATE m NORMALS WITH MEAN 0.55 AND STANDARD DEVIATION sqrt(0.55*(1 - 0.55)/99)
p.hat <- rnorm(m, 0.55, sqrt(0.55*(1 - 0.55)/99))
# CREATE A CONFIDENCE INTERVAL MATRIX THAT IS m * 2 BIG
p.ci <- cbind(p.hat - 1.96*0.5/sqrt(99), p.hat + 1.96*0.5/sqrt(99))
# GET THE PROPORTION OF LOWER BOUNDS GREATER THAN ONE-HALF
sum(p.ci[,1] > 0.5)/m
[1] 0.16613

showing that the probability that the complete CI is greater than 0.5 is 0.16613,
which is terrible. More specifically, this means that there is only an approximately
17% chance of rejecting a false null hypothesis. Note that we fixed the sample size
(99), fixed the effect size (0.55), fixed the significance level (α = 0.05), and got the
standard error by assumption, but let the power be realized. How do we improve
this number?
Suppose that we were dissatisfied with the result above and wanted n such that 0.8 of the
95% CIs do not cover 0.5 (80% power). We want the scaled difference between the expected
lower bound and the threshold to equal the 0.8 quantile of the standard normal cumulative
distribution function (CDF):

\Phi^{-1}(0.8) = \frac{L - 0.5}{\sigma/\sqrt{n}},    (11)

where Φ^{-1}(0.8) = 0.84162 is the 0.8 quantile of the standard normal distribution and L is the
expected lower bound of the interval.
Rewriting this gives

L = 0.5 + \Phi^{-1}(0.8)\,(\sigma/\sqrt{n}).    (12)

Since L = p − z α /2 (σ / n ) by definition of a confidence interval for the mean, then

0.8
p − z α /2 (σ / n ) = 0.5 +
∫ −∞
f N (x )dx(σ / n ).
0.6 − 1.96(σ / n ) = 0.5 + 0.84162(σ / n )
0.5 + 1.96(0.5 / n ) = 0.6 − 0.84162(0.5 / n )

So we can calculate n by solving the equation:

Threshold + 95% CV × Standard Error = Assumed Mean − Φ^{-1}(0.8) × Standard Error

0.50 + 1.96(0.5/\sqrt{n}) = 0.55 - 0.84162(0.5/\sqrt{n}),

meaning that n = 785, using the cautious σ = 0.5.7 Notice that we needed a considerably
greater sample size to get a power of 0.8, which is standard in many academic disciplines
as a criterion. We can also use R to check these calculations:

# SET THE SAMPLE SIZE
n <- 785
# SET THE NUMBER OF SIMULATIONS
m <- 1000000
# CALCULATE THE ESTIMATE OF p
p.hat <- rnorm(m, 0.55, sqrt(0.55*0.45/n))
# CALCULATE THE CONFIDENCE INTERVAL
p.ci <- cbind(p.hat - 1.96*0.5/sqrt(n), p.hat + 1.96*0.5/sqrt(n))
# RETURN THE PROPORTION OF LOWER BOUNDS GREATER THAN 0.5
sum(p.ci[,1] > 0.5)/m
[1] 0.80125

Here we fixed the power level (1–​β = 0.8), fixed the effect size (using 0.55), fixed the
significance level (α = 0.05), and got the standard error by the binomial assumption,
but let the sample size be realized. What are the implications of this power stipulation?
Anyone who considers (or is actively) expending resources to collect samples should at
least understand the power implications of the sample size selected. Perhaps a few more
cases would considerably increase the probability of rejecting a false null. Researchers
who are not themselves collecting data generally cannot stipulate a power level, but
it should still be calculated in order to fully understand the subsequent inferences
being made.
To further illustrate the importance of sample size, suppose we are interested in testing
whether support for a candidate is stronger in one state over another. The standard error
for the difference of proportions is

\sigma_{diff} = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}},    (13)

or more cautiously, if we lack information we assume that p1 = p2 = 0.5 to get

\sigma_{diff} = 0.5\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}.    (14)

Restricting the sample sizes to be equal gives σ_diff = 0.5√(2/n), where n is the sample size in
each group. Then for α = 0.05 and 1 − β = 0.8, in the approach where we do not know p1
and p2, we get n = [2.8/(p1 − p2)]^2. However, if we have the necessary information, this
becomes n = 2[p1(1 − p1) + p2(1 − p2)][2.8/(p1 − p2)]^2. Let us assume that we suspect that
our candidate has 7.5% more support in California than in Arizona in a national election,
and that we want to run two surveys to test this. If the surveys are equal in size,
n, how big must the total sample size be such that there is 80% power and significance
at 0.05, if the true difference in proportions is hypothesized to be 7.5%? For the 7.5%
to be 2.8 standard errors from zero, we need n > (2.8/0.075)^2 = 1393.8. What if the true
difference in proportions is hypothesized to be 15%? Now, for the 15% to be 2.8 standard
errors from zero, we need n > (2.8/0.15)^2 = 348.44. Going the other way, what about a
hypothesized 2.5% lead? Then n > (2.8/0.025)^2 = 12544. This shows again the principle
that larger sample sizes are required to reliably detect smaller effect sizes with fixed
α and β.
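The required sample sizes for these hypothesized differences can be computed in a single line of R, using the cautious rule just described:

# HYPOTHESIZED DIFFERENCES IN SUPPORT BETWEEN THE TWO STATES
delta <- c(0.075, 0.15, 0.025)
# CAUTIOUS RULE FROM THE TEXT: THE DIFFERENCE MUST BE 2.8 STANDARD ERRORS FROM ZERO,
# WHERE 2.8 IS APPROXIMATELY 1.96 + 0.84 FOR ALPHA = 0.05 AND POWER = 0.8
ceiling((2.8/delta)^2)   # ABOUT 1394, 349, AND 12544, RESPECTIVELY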
More generally, suppose we state the sample sizes proportionally, q and (1 –​q), such
that qn is the size of the first group and (1 –​q)n is the size of the second group. Now the
standard error for difference of proportions is given by

\sigma_{diff} = \sqrt{\frac{p_1(1-p_1)}{qn} + \frac{p_2(1-p_2)}{(1-q)n}},    (15)

which has a cautious upper bound of


\sigma_{diff,max} = 0.5\,[q(1-q)]^{-1/2}/\sqrt{n}.    (16)

With a little rearranging, we get


n = \left(\frac{[q(1-q)]^{-1/2}/2}{\sigma_{diff,max}}\right)^2.    (17)
If we actually have more information, p_1 and p_2:

n = \left(\frac{[q(1-q)]^{-1/2}\,[p_1(1-p_1)(1-q) + p_2(1-p_2)q]^{1/2}}{\sigma_{diff}}\right)^2.    (18)

But this has σdiff in the denominator, which relies on some information about sample
size besides proportional difference, which we do not have. This means that we need to
rely on an approximation, the Fleiss (1981) equation:
n = \frac{1}{\delta^2}\left[z_{1-\alpha/2}\sqrt{(p_1+p_2)\left(1 - \tfrac{1}{2}(p_1+p_2)\right)} + z_{1-\beta}\sqrt{p_1(1-p_1) + p_2(1-p_2)}\right]^2.    (19)

Since this is an estimate rather than a precise calculation, it has additional uncertainty
included as part of the process. Unfortunately, since we are missing two quantities
(n and σdiff ), we need to resort to such a strategy. Obviously this should be noted in any
subsequent write-​up.
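A hedged R sketch of equation (19) follows; the proportions in the example call are assumptions chosen only to illustrate the calculation, not values taken from the text.

# SAMPLE SIZE FROM THE FLEISS (1981) APPROXIMATION IN EQUATION (19)
fleiss.n <- function(p1, p2, alpha = 0.05, power = 0.80) {
  delta <- abs(p1 - p2)
  z.a <- qnorm(1 - alpha/2)
  z.b <- qnorm(power)
  term1 <- z.a*sqrt((p1 + p2)*(1 - (p1 + p2)/2))
  term2 <- z.b*sqrt(p1*(1 - p1) + p2*(1 - p2))
  (term1 + term2)^2/delta^2
}
# ILLUSTRATIVE (ASSUMED) SPLIT OF 52.5% VERSUS 47.5%
ceiling(fleiss.n(0.525, 0.475))   # ABOUT 1,569 UNDER THESE ASSUMPTIONS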
This section discussed the overt and proper ways that errors should be accounted for
and discussed with survey and polling data. When statements are made about statis-
tical analysis of such data, there is always some level of uncertainty, since the results are
based on some unknown quantities. Furthermore, the data size, the sample variance,
the (observed or desired) effect size, α, and power (1 –​β) are all interacting quantities,
and trade-​offs have to be made. Therefore all aspects of the analysis with regard to these
quantities should be reported to readers.

Treating the Margin of Error Correctly


This section describes in more detail issues that come up regarding understanding the
margin of error in reported results. Polling in advance of the 2016 National Democratic
Primary, a YouGov poll for the Economist asked 325 Democratic registered voters
between May 15 and May 18, 2015, to identify their choice,8 producing the percentages
shown in table 13.3.
Recall that a margin of error is half of a 95% confidence interval, defined by

[\hat{\theta} - 1.96 \times \sqrt{Var(\hat{\theta})} : \hat{\theta} + 1.96 \times \sqrt{Var(\hat{\theta})}],    (20)

where Var(θ̂) comes from previous polls, is set by assumption, or is based on the actually
observed sample proportions. Note that θ̂ is the random quantity and θ is fixed but
unknown. Note further that given the varying sample proportions in a poll such as the
one reported by YouGov, the individual estimates will have individual margins of error
Table 13.3 Support from Democratic Registered Voters

Candidate          Democratic Registered Voters
Clinton            60%
Sanders            12%
Biden              11%
Webb               3%
O’Malley           2%
Other              1%
Undecided          11%

N = 325

associated with them. For example, for Hillary Clinton, the 95% confidence interval
would be calculated as follows:

 (0.60)(0.40) (0.60)(0.40) 
CI0.95 = 0.60 − 1.96 × : 0.60 + 1.96 × 
 325 3 25  (21)
= [0.60 − 0.053 : 0.60 + 0.053]

= [0.547: 0.653]

Since 95% is a strong convention in media polling, we restrict ourselves to this level.9
Accordingly, the margin of error for Hillary Clinton’s estimate would be roughly 5.3
points. However, for her potential competitor, Jim Webb, the margin of error would be
considerably smaller. More specifically, we would get

 (0.03)(0.97) (0.03)(0.97) 
CI0.95 = 0.03 − 1.96 × : 0.03 + 1.96 × 
 325 3 25  (22)
= [0.03 − 0.019 : 0.03 + 0.019]

= [0.011 : 0.049]

In other words, the margin of error would only be 1.9 points in this case. Despite
these differences in margins of error for different statistics in the same poll, media
reports of polling results will often only report one margin of error. Per convention,
that margin reflects the maximum possible margin of error, which would theoreti-
cally only apply to observed sample proportions that are exactly even. While this is
a conservative convention that is unlikely to drastically distort results, there is un-
fortunately also widespread confusion about the interpretation of confidence and
margins of error in media reporting, which can be more dangerous. As an example,
the following is a generic statement that regularly accompanies polling reports in the
New York Times:

In theory, in 19 cases out of 20, the results from such polls should differ by no more
than plus or minus four to five percentage points from what would have been
obtained by polling the entire population of voters.

This is correct, but misinterpretations are unfortunately extremely common as well.


In a piece from the Milwaukee Journal Sentinel by Craig Gilbert, rather tellingly titled
“Margin of error can be confusing” (October 11, 2000), we find this seemingly similar
statement:

When a poll has a margin of error of 3 percentage points, that means there’s a 95 per-
cent certainty that the results would differ by no more than plus or minus 3 points
from those obtained if the entire voting age population was questioned.

This is not true because of the word certainty. Instead, it means that in 95% of
replications, we would expect the true parameter to fall into that confidence interval on
average. And it gets worse (from the same article):

Let’s say George W. Bush is up by 5 points. It sounds like this lead well exceeds the
3-​point margin of error. But in fact, Bush’s support could be off by three points in ei-
ther direction. So could Al Gore’s. So the real range of the poll is anywhere from an
11-​point Bush lead to a 1-​point Gore lead.

Here the author assumes that the candidates’ fortunes are independent. However,
since losses by one candidate clearly imply gains by others, there is no such independ­
ence. This is called compositional data.
To illustrate the inconsistencies that can arise when ignoring the presence of
compositional data and to clarify the correct way of interpreting the margin of error
in such settings, consider a poll with three candidates: Bush, Gore, and other. The
correct distributional assumption is multinomial with parameters [p1, p2, p3], for the
true proportion of people in each group. Define [s1, s2, s3] as the sample proportions
from a single poll. We are interested in the difference s1  –​ s2 for the two leading
candidates. The expected value of this difference is p1 –​ p2, and the variance is

Var(s_1 - s_2) = Var(s_1) + Var(s_2) - 2Cov(s_1, s_2)
             = \frac{s_1(1-s_1)}{n} + \frac{s_2(1-s_2)}{n} - 2\left(-\frac{s_1 s_2}{n}\right)    (23)
             = \frac{s_1(1-s_1) + s_2(1-s_2) + 2 s_1 s_2}{n}
where the standard deviation of the difference between the two candidates is the
square root of this. Note the cancellation of minus signs. Multiplying this by 1.96 gives
the margin of error at the 95% confidence level. For specific hypothesis testing of a
difference, the z-​score is

z = \frac{s_1 - s_2}{\sqrt{Var(s_1 - s_2)}},    (24)

which is a simple calculation.


For example, assume that a poll with n  =  1,500 respondents reports sBush  =  0.47,
sGore = 0.42, and sOther = 0.11. The newspaper claims that there is a 5 point difference with
a 3% margin of error, so “the real range of the poll is anywhere from an 11-​point Bush
lead to a 1-​point Gore lead.” The actual variance is produced by

Var(s_{Bush} - s_{Gore}) = \frac{(0.47)(0.53) + (0.42)(0.58) + 2(0.47)(0.42)}{1500} = 0.000591667    (25)
under the assumption that lost votes do not flow to the “other” candidate. The
square root of this variance is 0.0243242. Finally, the margin of error, 1.96  ×
0.0243242  =  0.04767543  ≈ 0.0477, is slightly less than the observed difference of
0.05, and therefore Gore could not actually be leading in terms of the 95% confi-
dence interval. In fact, we should instead assume Bush’s lead to be anywhere be-
tween 5 − 4.77 = 0.23 and 5 + 4.77 = 9.77 percentage points. The formal hypothesis
test (which gives the exact same information in different terms) starts with calculating
z = 0.05/0.0243242 ≈ 2.06 on the proportion scale, meaning that the observed difference
is just over two standard errors from zero and supports a difference at the conventional
α = 0.05 level, though not at much smaller α values. With n = 1,500 respondents the
standard error of the difference is small, so even this modest 5 point gap is statistically
distinguishable from zero. Suppose we wanted to calcu-
late the power of this test with α = 0.01? Use the simulation method from above as
follows:

# SET THE SIMULATION SAMPLE SIZE
m <- 1000000
# GENERATE m NORMALS WITH MEAN 0.47 AND SD sqrt(0.47*(1 - 0.47)/1500)
p.hat <- rnorm(m, 0.47, sqrt(0.47*(1 - 0.47)/1500))
# CREATE A 0.01 CONFIDENCE INTERVAL MATRIX THAT IS m * 2 BIG
p.ci <- cbind(p.hat - 2.5758*0.5/sqrt(1500), p.hat + 2.5758*0.5/sqrt(1500))
# GET THE PROPORTION OF LOWER BOUNDS GREATER THAN GORE
sum(p.ci[,1] > 0.42)/m
[1] 0.90264
So we have a 90% chance of rejecting a false null that the two candidates have identical
support.
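The compositional calculation just described takes only a few lines of R. The sketch below reproduces the variance from equation (23), the resulting margin of error, and the z-score on the proportion scale for the Bush and Gore figures.

# SAMPLE PROPORTIONS AND SAMPLE SIZE FROM THE BUSH-GORE EXAMPLE
s.bush <- 0.47
s.gore <- 0.42
n <- 1500
# COMPOSITIONAL VARIANCE OF THE DIFFERENCE, EQUATION (23)
v.diff <- (s.bush*(1 - s.bush) + s.gore*(1 - s.gore) + 2*s.bush*s.gore)/n
sqrt(v.diff)                       # SD OF THE DIFFERENCE, ABOUT 0.0243
1.96*sqrt(v.diff)                  # MARGIN OF ERROR, ABOUT 0.0477 (4.77 POINTS)
(s.bush - s.gore)/sqrt(v.diff)     # Z-SCORE ON THE PROPORTION SCALE, ABOUT 2.06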
The purpose of this section has been to carefully describe the margin of error and how
it is calculated. Since the margin of error is one-​half of a confidence interval, its calcu-
lation is straightforward, even though the interpretation of the confidence interval is
often mistaken. More subtly, with compositional data such as proportions of candidate
support, the calculations must be done differently to account for the restriction that they
sum to one. Failing to do so yields incorrect summaries that mislead readers.

Understanding Proportions as Compositional Data

The data type represented by proportions of groups, by candidates, parties, and so forth
is compositional. This means that the size of each group is described by a numerical ratio
to the whole, and that these relative proportions are required to sum to one. Therefore,
not only is the range of possible values bounded, the summation constraint also imposes
relatively high (negative) correlations among values, since gains by one group neces-
sarily imply aggregate losses by the others.
The statistical analysis of compositional data is much more difficult than it would
initially appear. Since it is impossible to change a proportion without affecting at
least one other proportion, these are clearly not independent random variables, and
the covariance structure necessarily has negative bias. In fact the “crude” covariance
matrix formed directly from a sample compositional data set will have the property
that each row and column sum to zero, meaning that there must be at least one neg-
ative covariance term in every row and column. This means that correlations are not
actually free to take on the full range of values from –​1 to 1. Why is this important?
Suppose we saw a correlation coefficient of 0.25. Most people would interpret this as
indicating a weak relationship (subject to evaluation with its corresponding standard
error, of course). However, it is possible that the structure of the compositional
data at hand limits this correlation to a maximum of 0.30. Then it would be a strong
effect, reaching 5/​6 of its maximum possible positive value. Aitchison (1982) notes
that these reasons lead to a lack of satisfactory parametric classes of distributions for
compositional data.
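This property is easy to verify numerically. The sketch below simulates compositional data by normalizing gamma draws (a common device, not a feature of any particular survey) and shows that every row of the crude sample covariance matrix sums to zero up to floating-point error.

set.seed(123)
# SIMULATE 500 FOUR-PART COMPOSITIONS BY NORMALIZING GAMMA DRAWS
raw <- matrix(rgamma(500*4, shape = 2), ncol = 4)
comp <- raw/rowSums(raw)
# CRUDE SAMPLE COVARIANCE MATRIX OF THE COMPOSITIONS
S <- cov(comp)
# EACH ROW (AND COLUMN) SUMS TO ZERO, UP TO FLOATING-POINT ERROR
rowSums(S)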
There are several approaches in the methodological literature that have attempted
but failed to develop useful parametric models of compositional data. One of the
most common is to apply the Dirichlet distribution (Conner and Mosimann 1969;
Darroch and James 1974; Mosimann 1975; James and Mosimann 1980; James 1981),
a higher dimension counterpart to the beta distribution for random variables
bounded by zero and one. This is a very useful parametrization, but it assumes that
each of the proportions is derived from an independent gamma distributed random
variable. In addition, the covariance matrix produced from a Dirichlet assumption
has a negative bias, because it does not account for the summation restriction.
Applying a multinomial distribution is unlikely to prove useful, since it also does
not account for the summation requirement and focuses on counts rather than
proportions (although this latter problem can obviously be solved with additional
assumptions). Finally, linear approaches such as principal components analysis,
principal components regression, and partial least squares will not provide satisfac-
tory results because the probability contours of compositional data are not linear
(Hinkle and Rayens 1995).
The best manner for handling compositional data is Aitchison’s (1982) log-​ratio con-
trast transformation. This process transforms the bounded and restricted compositions
to Gaussian normal random variates. The primary advantage of this approach is that the
resulting multivariate normality, achieved through the transformation and an appeal to
the Lindeberg-​Feller variant of the central limit theorem, provides a convenient inferen-
tial structure even in high dimensional problems.

The Log-​Ratio Transformation of Compositional Data


Compositional data with d categories on the unit interval are represented by a d − 1 dimensional
simplex: \mathcal{S}^d = \{(x_1, x_2, \ldots, x_d) : x_1, x_2, \ldots, x_d > 0;\; x_1 + x_2 + \cdots + x_d = 1\}. This
composition vector actually represents only a single data value and is therefore indexed
by cases as well (xi1, xi2, . . ., xid) for a collected data set. A single composition with d
categories defines a point in an only d –​1 dimensional space, since knowledge of d –​1
components means the last can be obtained by the summation requirement. Often these
compositions are created by normalizing data whose sample space is the d-​dimensional
positive orthant, but in the case of group proportions within an organization, the data
are usually provided as racial, gender, or other proportions.
Aitchison (1982) introduced the following log-ratio transformation of the compositions
on \mathcal{S}^d to the (d − 1)-dimensional real space, \mathbb{R}^{d-1}:

y_i = \log\left(\frac{x_i}{x_g}\right), \quad i = 1, \ldots, d \;(i \neq g),    (26)

where xg is an arbitrarily chosen divisor from the set of categories. In the case of a data
set of compositions, this transformation would be applied to each case-​vector using the
same reference category in the denominator. One obvious limitation is that no com-
positional value can equal zero. Aitchison (1986) deals with this problem by adding a
small amount to zero values, although this can lead to the problem of “inliers”: taking
the log of a very small number produces a very large negative value. Bacon-Shone (1992)
provides a solution that involves taking the log-​ratio transformation on scaled ranks to
prevent problems with dividing or logging zero values. In practice, it is often convenient
to collapse categories with zero values into other categories. This works because these
categories are typically not the center of interest.
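A minimal R sketch of the transformation in equation (26) follows, with the small additive constant for zero components mentioned above; the function name and the example composition are illustrative assumptions.

# ADDITIVE LOG-RATIO TRANSFORM OF ONE COMPOSITION, EQUATION (26)
alr <- function(x, g = length(x), eps = 1e-6) {
  x <- x + eps*(x == 0)    # CRUDE FIX FOR ZERO COMPONENTS (SEE TEXT)
  log(x[-g]/x[g])
}
# EXAMPLE: A THREE-CANDIDATE VOTE SPLIT, USING THE THIRD CATEGORY AS DIVISOR
alr(c(0.47, 0.42, 0.11))   # RETURNS log(0.47/0.11) AND log(0.42/0.11)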
The log-​ratio transformation shares the well-​known linear transformation theory of
multinomial distributions and has the class-​preservation property that its distributional
form is invariant to the choice of divisor category (Aitchison and Shen 1980). This means
that the researcher can select the divisor reference category without regard for distribu-
tional consequences. The sample covariance matrix for the log-​ratio transformed com-
position is mathematically awkward, so Aitchison (1982) suggests a “variation matrix”
calculated term-​wise by

\tau_{ij} = Var\left[\log\left(\frac{x_i}{x_j}\right)\right],    (27)

which is symmetric with zeros on the diagonal. This is now a measure of variability for
xi and xj, which are vectors of proportions measured over time, space, or a randomized
block design. Note that there is now no truncating on the bounds of the values of the co-
variance matrix, as there had been in the untransformed compositional form.
Aitchison further suggests that inference can be developed by appealing to the cen-
tral limit theorem such that Y ~ MVN (μ, Σ). This is not an unreasonable appeal, since
the Lindeberg-​Feller central limit theorem essentially states that convergence to nor-
mality is assured, provided that no variance term dominates in the limit (Lehmann 1999,
app. A1). This is guaranteed, since we start with bounded compositional data prior to the
transformation.
To illustrate the application of Aitchison’s log-​ratio contrast transformation,
we use survey data from the fourth module of the Comparative Study of Electoral
Systems (CSES).10 More specifically, in order to study the popular question of
whether parties benefit from positioning themselves close to the mean voter position
along the left-​right scale, we employ two different questions that ask respondents to
place themselves and each of their national parties on an eleven-​point scale ranging
from 0 (left) to 10 (right).11 Based on these questions, we first determine the mean
voter position for a given country election by averaging all respondents’ left-​right
self-​placements. We then compute party positions by calculating each party’s av-
erage placement. Our covariate of interest is then simply the absolute policy distance
between each party’s position and the mean voter position in the respective election.
Previous studies have repeatedly shown that as this policy distance increases, parties
in established democracies tend to suffer electorally (Alvarez et al. 2000; Dow 2011;
Ezrow et al. 2014).
To measure our outcome variable (party success), we employ two different techniques.
The first is simply a given party’s observed vote share in the current lower house election.
The second relies on the CSES surveys and is based on a question in which respondents
indicate their vote choice in the current lower house election.12 Based on all nonmissing
responses, we calculate each party’s vote share by dividing the number of respondents
who indicated that they voted for a given party by the number of all respondents who in-
dicated that they voted for any party in the respective country. We then apply Aitchison’s
log-​ratio transformation to both measures of party success, using the first party in each
country’s CSES coding scheme (usually the largest party) as the reference category. Table
13.4 lists all these measures for the U.S. presidential election in 2012.
Table 13.5 presents the results of four OLS models that regress the different measures
of party success on a party’s distance to the mean voter position. As expected, in all four
model specifications, the coefficient estimate for policy distance is negative, indicating
that as a party’s distance from the mean voter position increases, that party tends to lose
public support. However, the more interesting part of this exercise is the effect of the log-​
ratio transformation on the results: using both the observed vote share and the CSES-​
based measure of indicated vote share, accounting for the compositional nature of the
data by applying Aitchison’s transformation leads to a loss in reliability of the estimated
coefficients. In other words, with this specific data set and model specification, not

Table 13.4 (Transformed) Vote Shares and Indicated Vote Shares, CSES USA 2012

Party              Vote Share   Transformed   CSES Indicated   CSES Indicated   Transformed CSES
                                Vote Share    Vote (N)         Vote (%)         Indicated Vote
Democratic Party   48.40        0             921              69.09            0
Republican Party   47.10        −.027         412              30.91            −.804
Missing                                       596

Table 13.5 The Effect of Policy Distance on Vote Share (CSES)

                   Vote Share        Transformed      CSES Indicated   Transformed CSES
                                     Vote Share       Vote (%)         Indicated Vote
Policy Distance    −2.35 (1.30)      −.11 (.08)       −.03 (.01)       −.07 (.14)
                   [−5.18; .48]      [−.29; .07]      [−.06; −.00]     [−.37; .22]
Constant           18.54 (2.88)      −1.09 (.24)      .21 (.03)        −1.36 (.28)
                   [12.27; 24.82]    [−1.62; −.57]    [.15; .27]       [−1.97; −.75]
Observations       81                81               87               87

Note: The table reports estimated coefficients from OLS regressions and robust standard errors
(clustered by election) in parentheses. 95% confidence intervals are reported in brackets. The four
different outcome variables are defined in the text.
considering the compositional characteristics of the data at hand would lead journalists
or scholars to potentially overestimate the reliability of their findings.13
Extending the previous discussion of the multinomial setup, this section has
highlighted the unique challenges that researchers and journalists face when
working with compositional data such as vote shares or proportions of party
support. The summation constraint of compositional data requires different
techniques if we want to convey results and the uncertainty associated with them
correctly. Aitchison’s log-​ratio contrast transformation offers one such approach,
which we recommend here.14
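
To make the transformation concrete, the following minimal Python sketch (our own illustration, not part of the original analysis) applies the additive log-ratio transformation to the two-party figures reported in Table 13.4, using the first-listed party as the reference category; the function name alr is ours.

```python
import numpy as np

def alr(shares, ref=0):
    """Additive log-ratio transform: the log of each component relative
    to the component in position `ref` (the reference category)."""
    shares = np.asarray(shares, dtype=float)
    return np.log(shares / shares[ref])

# Observed two-party vote shares from Table 13.4 (Democratic, Republican)
print(alr([48.40, 47.10]))   # approximately [ 0.000, -0.027]

# CSES indicated vote shares (percent of respondents), same ordering
print(alr([69.09, 30.91]))   # approximately [ 0.000, -0.804]
```

Because only ratios enter the transformation, it makes no difference whether the inputs are expressed as percentages or as proportions summing to one.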

The Null Hypothesis Significance Test

This section discusses problems with the frequently used Null Hypothesis
Significance Test (NHST). The key problem is that this procedure does not in-
form results in the way that many people assume. Such interpretation problems
cause readers to believe that results are more reliable than they likely are. This
was first discussed in political science by Gill (1999), followed by Ward et  al.
(2010) and Rainey (2014). Objections to the use of the NHST go all the way back
to Rozeboom (1960), who described it as a "strangle-hold," and Bakan (1966),
who called it “an instance of the kind of essential mindlessness in the conduct
of research.” Most of the early objections came from scholars in psychology, who
have generated literally hundreds of articles and book chapters describing the
problems with the NHST. Yet it endures and dominates in studies with survey
research and polling data. Why? There are two main reasons. First, “it creates
the illusion of objectivity by seemingly juxtaposing alternatives in an equivalent
manner” (Gill 1999). So it looks and feels scientific. Second, faculty unthinkingly
regurgitate it to their graduate students (and others), who graduate, get jobs, and
repeat the cycle. Hardly a Kuhnian (1996) path of scientific progress. So the NHST
thrives for pointless reasons.
To get a better understanding of the problems that commonly arise with respect to the
NHST, we briefly describe some of the major flaws:

1. The basis of the NHST is the logical argument of modus tollens (denying the con-
sequent), which makes an assumption, observes some real-​world event, and
then determines the consistency of the assumption by checking it against the
observation:
If X, then Y.
Y is not observed.
Therefore, not X.
The problem of modus tollens as part of NHST is that its usual certainty statements
are replaced with probabilistic ones:
If X, then Y is highly likely.
Y is not observed.
Therefore, X is highly unlikely.

While this logic might seem plausible at first, it actually turns out to be a fallacy.
Observing data that are atypical under a given assumption does not imply that
the assumption is likely false. In other words, almost a contradiction of the null
hypothesis does not imply that the null hypothesis is almost false. The following
example illustrates the fallacy:
If a person is an American, then it is highly unlikely that she is the President of the
United States.
The person is the President of the United States.
Therefore, it is highly unlikely that she is an American.

2. The inverse probability problem highlights a common misinterpretation of
the NHST. It is a widespread belief that the smaller the p-value, the greater the
probability that the null hypothesis is false. According to this incorrect inter-
pretation, the NHST produces P(H0|D), the probability of H0 being true given
the observed data D. However, the NHST actually first assumes H0 as true and
then asks for the probability of observing D or more extreme data. This is clearly
P(D|H0). However, P(H0|D) would in fact be the more desirable test, as it could be
used to find the hypothesis with the greatest probability of being true given some
observed data. Bayes’s law allows for a better understanding of the two unequal
probabilities:

$$ P(H_0 \mid D) = P(D \mid H_0)\,\frac{P(H_0)}{P(D)} \qquad (28) $$

As a consequence, P(H0|D) = P(D|H0) is only true if P(H0) = P(D), for which we
usually do not have any theoretical justification. Unfortunately P(H0|D) is what
people want from an inferential statement. A practical consequence of this mis-
understanding is the belief that three stars behind a coefficient estimate imply that
the null is less likely than if the coefficient had only one star, although the whole
regression table itself is created under the initial assumption that the null is in
fact true.
3. There are two common misconceptions about the role of sample size in NHST.
First is the belief that statistical significance in a large sample study implies sub-
stantive real-​world importance. This is a concern in polling and public opinion,
because it implies a bias against work on small or difficult to reach populations that
inherently allow only for smaller sample sizes and thus less statistical power. The correct
interpretation is that as the sample size increases, we are able to distinguish
progressively smaller population effect sizes. Second is the interpretation that for a given
p-​value in a study that rejects the null hypothesis, a larger sample size implies a
more reliable result. This is false, as two studies that reject the null hypothesis with
the same p-​value are equally likely to make a Type I error, which is independent of
their sample size.15
4. A fourth criticism of the NHST is based on its asymmetrical nature. If the test sta-
tistic is sufficiently atypical given the null hypothesis, then the null hypothesis is
rejected. However, if the test statistic is not sufficiently atypical, then the null hy-
pothesis is not accepted. In other words, H1 is held innocent until shown guilty,
whereas H0 is held guilty until shown innocent. As a consequence, failing to re-
ject the null hypothesis does not rule out an infinite number of other competing
research hypotheses. A  nonrejected null hypothesis essentially provides no in-
formation about the world. It means that given the observed data, one cannot
make any assertion about a relationship. There is a serious misinterpretation that
can arise as a consequence of this asymmetry: the incorrect belief that finding a
nonstatistically significant effect is evidence that the effect is zero. However, lack
of evidence of an effect is not evidence of a lack of an effect. If published, such an
incorrect statement (that the hypothesized relationship does not exist) damages future
knowledge: unless later researchers are clearly aware of the error, they will be
discouraged from investigating the effect with other data or models, or from exploring
other versions of the relationship, and will move on to new hypothesized relationships,
since the initial effect has already been "shown" not to exist.

There are more problems with the NHST, including the arbitrariness of α, its bias
in the model selection process, the fallacy of believing that one minus the p-value is
the probability of replication, the problems it causes for cross-validation studies, and
its detachment from actual substantive significance (see Gill 1999 or Ziliak and
McCloskey 2008). However, the four problems highlighted here and the examples in
the next section should be enough to demonstrate the flawed nature of the NHST and
to warrant either very cautious use of it or, even better, a switch to principled
alternatives.
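
To see the inverse probability problem (flaw 2) in numbers, consider the short Python sketch below. It is a constructed toy example, not taken from the chapter: a one-sided test of a proportion returns a p-value below .05, yet under an assumed 50/50 prior over the null and a single rival hypothesis, Bayes's law leaves the null more likely than not.

```python
from scipy.stats import binom

# Toy example: H0 says a proportion equals 0.5, a rival H1 says it equals 0.7.
# We observe 60 successes in 100 trials.
n, k = 100, 60

# What NHST reports: P(data at least this extreme | H0), the p-value.
p_value = binom.sf(k - 1, n, 0.5)            # about 0.03, "significant" at .05

# What people usually want: P(H0 | data). With equal prior weight on the two
# hypotheses (an assumption of this sketch), Bayes's law gives:
like_h0 = binom.pmf(k, n, 0.5)
like_h1 = binom.pmf(k, n, 0.7)
post_h0 = like_h0 / (like_h0 + like_h1)      # above 0.5: H0 remains more likely

print(p_value, post_h0)
```

The two numbers answer different questions, which is precisely why a small p-value cannot be read as a small probability that the null hypothesis is true.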

Polling Examples

To illustrate some of the mistakes that are commonly made when scholars and
journalists encounter nonrejected null hypotheses, we analyzed all twenty issues of
Public Opinion Quarterly (POQ) published over the last five years (volume 74 in 2010
to volume 78 in 2014). More specifically, we searched for the expression “no effect” in
all articles, research notes, and research syntheses and found it in 31 of 168 manuscripts
(18.5%).16 Not all of those cases are necessarily problematic. In fact, many of them are
referring to previous research and summarize earlier studies as finding no effects for a
given hypothesized relationship.
Nonetheless, a number of cases are directly related to nonrejected null hypotheses and
draw either implicit or explicit conclusions. While some are more carefully worded than
others, all are methodologically problematic. Examples of somewhat careful wordings
include formulations that do not unequivocally rule out any effect at all, but are a bit
more cautious in describing their results. For example, in an article on voting tech-
nology and privacy concerns, the authors find that “being part of the political minority
had little to no effect on post-​election privacy judgments” (POQ 75; emphasis added).
Similarly, in their study on different survey devices, another set of authors conclude that
“[a]‌mong those who did not answer one or more items, there appears to be no effect from
device on the number of items not answered” (POQ 75; emphasis added).
Other articles contain both cautiously and not so cautiously worded conclusions. For
example, in an analysis of interviewer effects, the authors first describe a model that “also
accounts for area variables, which have virtually no effect on either the interviewer-​level
variance or the DIC diagnostic,” but then later on incorrectly claim that “interviewer
gender has no effect among male sample units” (POQ 74; emphasis added). A similarly
problematic combination of conclusions can be found in another article on modes of
data collection, in which the authors first correctly state “that very few of the study char-
acteristics are significantly correlated with the observed positivity effect,” but then in the
very next sentence wrongly state that “there are no effects on [odds ratios] for the nega-
tive end of the scale” (POQ 75; emphasis added).
These types of absolutist conclusions that claim a null effect based on a nonrejected
null hypothesis are the most problematic, and we find them in POQ articles in each of
the last five years. In 2010 a study claimed that "[h]ousehold income has no effect on
innumeracy" (POQ 74). In 2011 a set of authors concluded that "[r]esidential racial context
had no effect on changes in perception" (POQ 75). The next year, an article stated that
"the number of prior survey waves that whites participated in had no effect on levels of
racial prejudice" (POQ 76), and in the subsequent year two authors claimed that "fear has
no effect on [symbolic racism]" (POQ 77). Examples from 2014 include the conclusions
that “[f]or low-​sophistication respondents who were unaware of the ACA ruling, con-
servatism has no effect at all on Supreme Court legitimacy”; that “[attitude importance]
had no effect on the balance of pro and con articles read”; and that “[g]ender and marital
status have no effect on perceptions of federal spending benefit” (POQ 78).
However, there are also articles that correctly deal with nonrejected null hypotheses.
For example, in a study on the effect of issue coverage on the public agenda, the author
correctly interprets the analysis with conclusions such as "[t]he null hypothesis that
Clinton coverage had no effect cannot be rejected,” or “we cannot confidently reject the
null hypothesis that President Clinton’s coverage had no effect on public opinion” (POQ
76). This is exactly how failing to reject the null hypothesis should be interpreted. Given
the asymmetrical setup of NHST, a nonstatistically significant effect does not imply that
the effect is (near) zero. Instead, it merely allows us to conclude that we cannot reject the
null hypothesis.
The implication of the errors outlined here is that less savvy readers (or even sophis-
ticated readers under some circumstances) will take away the message that the corre-
sponding data and model have “shown” that there is no relationship. Returning to the
quoted example above, “interviewer gender has no effect among male sample units,” the
incorrect message is that interviewer gender does not matter, whereas it could matter
with different but similar data/​models, under different interviewing circumstances,
when the questions are about gender, in different age or race groups, and so forth. As
stated previously, publishing this mistake will have a chilling effect on future research
unless the future researchers are clearly aware that the statement is in error. Errors of
this kind may result from general sloppiness by authors, but the resulting effect is exactly
the same.

Conclusion

Survey research and polling are done by both academics and practitioners.
Methodological training varies considerably between these groups. Here we have
attempted to explain some underlying statistical principles that improve the interpre-
tation of results from models and summaries. We have also tried to describe prob-
lematic procedures and practices that lead to misleading conclusions. Some of these
are relatively benign, but others change how readers of the subsequent work interpret
findings. A major theme in this process is correctly considering uncertainty that is
inherent in working with these kinds of data. This uncertainty comes from sampling
procedures, instrument design, implementation, data complexity, missingness, and
model choice. Often it cannot be avoided, which makes it all the more important to
analyze and discuss it appropriately. A second theme is the correct manner of un-
derstanding and reporting results. All statistical tests involve Type I  and II errors,
effect sizes, and a set of assumptions. Not considering all of these appropriately leads
to unreliable conclusions about candidate support, the effect of covariates on choice,
trends, and future predictions. We hope that we have provided some clarity on these
issues.

Notes
1. http://​elections.huffingtonpost.com/​pollster/​polls/​abc-​post-​21963.
2. The reported margin of sampling error was ±3.5 percentage points.
3. http://​www.aapor.org/​Standards-​Ethics/​AAPOR-​Code-​of-​Ethics.aspx.
4. http://​www.aapor.org/​transparency.aspx.
5. The mathematical discussion below is based on the assumption that random samples do
indeed reflect a random sample of the population of interest. While this assumption is
commonly made, it clearly does not hold for opt-​in Internet-​based surveys and can be
seriously doubted for conventional surveys with high levels of nonresponse. A  recent
discussion of the problems that can arise in these settings can be found at http://​www.
huffingtonpost.com/​2015/​02/​03/​margin-​of-​error-​debate_​n_​6565788.html and http://​
www.washingtonpost.com/​blogs/​monkey-​cage/​wp/​2015/​02/​04/​straight-​t alk-​about-​
polling-​probability-​sampling-​can-​be-​helpful-​but-​its-​no-​magic-​bullet/​.
6. http://​www.people-​press.org/​2014/​12/​10/​growing-​public-​support-​for-​gun-​rights/​.
7. This can be calculated in R by using qnorm(0.8) = 0.84162, which in turn is the value $z$ that solves $\Phi(z) = \int_{-\infty}^{z} f_N(x)\,dx = 0.8$.
8. http://​elections.huffingtonpost.com/​pollster/​polls/​yougov-​economist-​22155.
9. However, it is important to note that there is nothing theoretical or fundamental about this
number; it is simply a common convention.
10. Our data come from the second advance release of Module 4 from March 20, 2015, which
covers election studies from a total of seventeen different countries. http://​www.cses.org/​
datacenter/​module4/​module4.htm.
11. The exact question wording is: “In politics people sometimes talk of left and right. Where
would you place [YOURSELF/​PARTY X] on a scale from 0 to 10 where 0 means the left
and 10 means the right?”
12. For the 2012 French and U.S. elections, we used the respondents’ vote choice in the first
round of the current presidential elections.
13. Moreover, the eclectic collection of countries covered in this advance release of the CSES
Module 4 and the overly simplistic model specifications might cause the effects described
above to be weaker than one would usually expect.
14. For a far more comprehensive discussion of the field and different techniques, see
Pawlowsky-​Glahn and Buccianti (2011).
15. This misconception results from a misunderstanding of Type II errors. If two studies are
identical in every way apart from their sample size, and both fail to reject the null
hypothesis, then the larger sample size study is less likely to make a Type II error.
16. When also including the four special issues of POQ that were published during that time,
we find 34 of 203 articles include the term (16.7%).

References
Aitchison, J. 1982. “The Statistical Analysis of Compositional Data.” Journal of the Royal
Statistical Society, Series B 44: 139–​177.
Aitchison, J. 1986. The Statistical Analysis of Compositional Data. London: Chapman & Hall.
Aitchison, J., and S. M. Shen. 1980. “Logistic-​Normal Distributions:  Some Properties and
Uses.” Biometrika 67: 261–​272.
Alvarez, R. M., J. Nagler, and S. Bowler. 2000. "Issues, Economics, and the Dynamics of
Multiparty Elections: The 1997 British General Election." American Political Science Review
94 (1): 131–149.
Bacon-​Shone, J. 1992. “Ranking Methods for Compositional Data.” Applied Statistics 41
(3): 533–​537.
Bakan, D. 1966. "The Test of Significance in Psychological Research." Psychological Bulletin
66: 423–437.
Conner, R. J., and J. E. Mosimann. 1969. “Concepts of Independence for Proportions with a
Generalization of the Dirichlet Distribution.” Journal of the American Statistical Association
64: 194–​206.
Darroch, J. N., and I. R. James. 1974. “F-​Independence and Null Correlations of Continuous,
Bounded-​ Sum, Positive Variables.” Journal of the Royal Statistical Society, Series B
36: 467–​483.
Dow, Jay K. 2011. “Party-​System Extremism in Majoritarian and Proportional Electoral
Systems.” British Journal of Political Science 41: 341–​361.
Ezrow, L., J. Homola, and M. Tavits. 2014. “When Extremism Pays: Policy Positions, Voter
Certainty, and Party Support in Postcommunist Europe.” Journal of Politics 76: 535–​547.
Fleiss, J. L. 1981. Statistical Methods for Rates and Proportions. 2nd ed. New York: Wiley.
Gill, J. 1999. “The Insignificance of Null Hypothesis Significance Testing.” Political Research
Quarterly 52: 647–​674.
Hinkle, J., and W. Rayens. 1995. “Partial Least Squares and Compositional Data: Problems and
Alternatives.” Chemometrics and Intelligent Laboratory Systems 30: 159–​172.
James, I. R. 1981. “Distributions Associated with Neutrality Properties for Random
Proportions." In Statistical Distributions in Scientific Work, edited by C. Taillie, G. P. Patil, and
B. Baldessari, 4:125–136. Dordrecht, Holland: D. Reidel.
James, I. R., and J. E. Mosimann. 1980. “A New Characterization of the Dirichlet Distribution
Through Neutrality.” Annals of Statistics 8: 183–​189.
Kuhn, T. S. 1996. The Structure of Scientific Revolutions. 3rd ed. Chicago:  University of
Chicago Press.
Lehmann, E. L. 1999. Elements of Large-​Sample Theory. New York: Springer-​Verlag.
Mosimann, J. E. 1975. “Statistical Problems of Size and Shape: I, Biological Applications and
Basic Theorems.” In Statistical Distributions in Scientific Work, edited by G. P. Patil, S. Kotz,
and J. K. Ord, 187–217. Dordrecht, Holland: D. Reidel.
Pawlowsky-Glahn, V., and A. Buccianti. 2011. Compositional Data Analysis: Theory and
Applications. Chichester, UK: Wiley.
Rainey, C. 2014. “Arguing for a Negligible Effect.” American Journal of Political Science
58: 1083–​1091.
Rozeboom, W. W. 1960. “The Fallacy of the Null Hypothesis Significance Test.” Psychological
Bulletin 57: 416–​428.
Ward, M. D., B. D. Greenhill, and K. M. Bakke. 2010. “The Perils of Policy by P-​Value: Predicting
Civil Conflicts.” Journal of Peace Research 47: 363–​375.
Ziliak, S. T., and D. N. McCloskey. 2008. The Cult of Statistical Significance: How the Standard
Error Costs Us Jobs, Justice, and Lives. Ann Arbor: University of Michigan Press.
Chapter 14

Causal Inference with Complex Survey Designs

Generating Population Estimates Using Survey Weights

Ines Levin and Betsy Sinclair

Introduction

Public opinion surveys are a highly valuable resource for social scientists, as they allow
researchers to learn about the determinants of attitudes toward diverse issues and to
test hypotheses about political behavior. Using survey data to arrive at estimates of
causal effects that are generalizable to the target population of interest (i.e., making
population-level inferences) can be challenging, however, because it requires taking into
account complex sampling designs and data collection issues such as unit nonresponse.
Researchers who disregard important elements of the survey design run the risk of
obtaining measures of causal effects that do not apply to the target population.
Survey weights are routinely included in survey data sets and can be used by
researchers to account for numerous features of the survey design, including sampling
with differential selection probabilities; unit nonresponse during the data collection
process; and post-​stratification performed with the objective of ensuring that the sample
distribution of demographic attributes resembles known distributions in the target pop-
ulation. While survey weights are typically taken into account during the computa-
tion of basic descriptive statistics, they are most often ignored during the application of
standard causal inference techniques such as propensity score matching.
In this chapter we review methods for combining survey weighting and propensity
score matching that enable researchers to estimate population average treatment effects
(PATEs). The propensity score–​based matching methods discussed in this chapter in-
clude nearest-​neighbor matching, subclassification matching, and propensity score
weighting. After reviewing approaches for incorporating survey weights into each
of these procedures, we conduct a Monte Carlo simulation study to demonstrate how
ignoring survey weights may lead to biased estimators of treatment effects. Finally, we
illustrate the differences between sample-level inferences (computed ignoring survey
weights) and population-level inferences (computed by incorporating survey weights)
using real-​world data from the 2012 panel of The American Panel Survey (TAPS). In the
last applied section, we make sample-​and population-​based inferences about the effects
of social media usage on civic engagement.

Causal Inference with Complex Survey Data

Survey data are commonly used by political scientists to learn about political attitudes
and behaviors, including, for instance, the dynamics and determinants of public
opinion on policy issues (Alvarez and Brehm 2002; Feldman 1988; Page, Shapiro,
and Dempsey 1987) and the relationship between individual characteristics and
self-​reported and observed behavior (Alvarez and Nagler 1995; Ansolabehere and Hersh
2012). In particular, researchers often use survey data to make causal inferences—​that
is, to study whether exposure to a presumed cause may drive individuals to hold certain
attitudes or to behave in particular ways. Common causal inference techniques include
experiments, regression analysis, and matching methods, among others (Imai 2014;
Keele 2015). When applied to survey data without taking into account characteristics of
the survey design, these methods allow making sample-level inferences (i.e., measuring
effects that apply to the sample at hand), but not necessarily making population-level
inferences (i.e., measuring effects that apply to the target population of interest).
Why is it important to incorporate information about the survey design into data
analyses when using data from complex surveys? Polling organizations routinely take
steps to ensure that sampled individuals are representative of the population of interest.
This goal, however, is rarely achieved, for a number of reasons. In the case of probability
surveys, cost-​benefit considerations or the need to oversample specific segments of the
population may drive polling organizations to use sampling techniques involving une-
qual selection probabilities (Groves et al. 2009). In the case of nonprobability Internet
surveys, self-​selection of respondents into online panels can lead to overrepresenta-
tion of technologically savvy individuals (Iyengar and Vavreck 2012). Issues arising at
the data-​collection stage, such as failure to contact and nonresponse, may further bias
the demographic characteristics of the sample (Brehm 1993). The latter are pervasive
problems in survey research and threaten the validity of survey-​based inferences re-
gardless of the method used to conduct the survey (i.e., in person, by telephone, or on-
line). For instance, Jackman and Spahn (2014) found that nonresponse is responsible for
much of the positive bias in estimates of voter turnout in the face-​to-​face component of
the 2012 American National Election Study (ANES). To help researchers deal with these
problems, most survey data sets include information about the survey design or about
discrepancies between respondents’ characteristics and average characteristics of the
target population, which can be used to adjust the survey sample to resemble the target
population.
In particular, survey data sets typically include data records termed “weights” that
allow researchers to make population-level inferences when the demographics of the
selected sample do not mirror the characteristics of the target population. Examples of
procedures that have been designed to deal with some of the above-​mentioned issues
(Groves et al. 2009) are weighting for differential selection probabilities, to adjust for de-
liberate oversampling of specific demographic subgroups; weighting to adjust for unit
nonresponse, to adjust for lower (or higher) response rates within specific demographic
subgroups; and post-​stratification weights, to ensure that the distribution of impor-
tant demographic variables in the adjusted sample resembles the distribution of these
variables known to exist in the population. In the case of nonprobability online surveys,
researchers have developed propensity score–​based methods to select representative
samples of survey participants and to generate post-​stratification weights (Rivers and
Bailey 2009).
To illustrate the usefulness of survey weights, consider the 2012 ANES Time Series
Study. The study was conducted using two survey modes:  face-​to-​face and online
interviewing. Respondents were selected into the face-​to-​face sample using a multistage
cluster design, with probability proportional to population within the primary sampling
units and with an oversampling of blacks and Hispanics (Jackman and Spahn 2014);
respondents were selected into the online sample by drawing from GfK Knowledge
Networks online panel (ANES 2015). While according to the unweighted face-​to-​face
sample 23% of respondents considered themselves Spanish, Hispanic, or Latino,
only 11% did so in the weighted face-​to-​face sample, a contrast that reflects the over-
sampling of Hispanics in the face-​to-​face component of the 2012 ANES. According to
the unweighted online sample, 30% of respondents reported reviewing news on the Internet on a
daily basis, compared to 26% in the weighted online sample; this difference, though
small, is consistent with the expectation that online survey panelists should be more
regular Internet users than typical adult Americans. This simple example illustrates
how ignoring survey weights can lead to a distorted portrait of the population. In this
chapter we demonstrate how ignoring weights can also lead to inaccurate population-
level inferences.
Although survey weights are available in most data sets, they are often left out of re-
gression analyses and applications of causal inference techniques. Researchers tend to
assume that, conditional on the variables included in the analysis, the characteristics
of the survey design are ignorable (Gelman 2007; Winship and Radbill 1994). Under
this ignorability assumption, incorporating survey weights into the analysis should be
inconsequential. This assumption, however, is violated when factors that affect sam-
pling probabilities are omitted from the analysis. As noted by Gelman (2007, 154), “all
variables should be included that have an important effect on sampling or nonresponse,
if they also are potentially predictive of the outcome of interest.” Although researchers
might go to great lengths to incorporate all variables thought to affect both probabilities
of selection and the outcome of interest, they may inadvertently fail to do so. It might be
impossible, for instance, to account for some of the drivers of nonresponse. In the rest
of this chapter, we examine whether (and how) the common practice of ignoring survey
weights may affect the accuracy of estimates of causal effects, focusing on one particular
collection of causal inference techniques: propensity score matching methods.
Though numerous matching procedures have been developed in recent years
(Morgan and Winship 2015), the intuition underlying these techniques remains the
same: measuring causal effects by observing how the outcome of interest varies between
subsamples of respondents who are similar in all relevant ways, except for having been
exposed to a given causal state. These methods differ in the procedure used to construct
look-​alike samples of respondents who have and have not been exposed to a presumed
cause or treatment. Propensity score matching methods (Rosenbaum and Rubin 1983),
in particular, construct balanced samples of treated and control respondents based on a
measure of the likelihood of being exposed to the treatment, called the propensity score.
The three propensity score matching techniques examined in this chapter differ, in turn,
on how this measure is used to construct balanced samples. Following is a brief descrip-
tion of the three matching algorithms reviewed in this chapter:

• Propensity score weighting: For all respondents, constructs weights equal to the
inverse of the probability of assignment to treatment, then uses these weights to
estimate treatment effects—​for instance, by computing weighted differences in
means or running weighted regressions.
• Nearest-​ neighbor propensity score matching:  Before estimating treatment
effects, preprocesses the data by matching treated respondents to respondents in
the control group with similar probabilities of assignment to treatment. After that,
estimates treatment effects in the matched sample.
• Subclassification matching:  Before estimating treatment effects, preprocesses
the data by classifying respondents into strata based on their probabilities of
assignment to treatment, such that respondents allocated into the same strata have
approximately the same probability of exposure to the presumed cause. After that,
estimates the overall treatment effect by averaging over strata-​specific estimates,
weighting by the number of respondents within each strata.

In previous studies, researchers reviewed and applied procedures for incorporating
survey weights into the above techniques. Zanutto (2006), for instance, used
subclassification matching to estimate the impact of gender on information tech-
nology (IT) salaries, including and excluding survey weights, and found that,
depending on the specific IT occupation, the unweighted analysis may either exag-
gerate or underestimate the gender salary gap—​a result she attributes to “the dif-
ferential underrepresentation of lower paid men and women” (2006, 84)  in the
sample. DuGoff, Schuler, and Stuart (2014) used the three methods listed above to
measure the effect of having a specialist physician (as opposed to a primary care
physician) as a usual source of care on annual average healthcare spending and
found that—​although all procedures suggest that having a specialist physician as a
usual source of care leads to higher healthcare spending—the magnitude of estimated
treatment effects varies considerably depending on whether survey weights are taken
into account. DuGoff and colleagues also conducted a simulation study to assess the
performance of the three propensity score matching techniques listed above, ex-
cluding and including survey weights, and concluded that “using propensity score
methods without survey design elements yields large bias and very low coverage”
(2014, 292).

Methodology

Propensity score matching procedures are typically implemented in two
stages: estimating the propensity score model and matching treatment and control units
on the basis of an estimated distance measure. Since the propensity score model is not
used for making population-​level inferences but for constructing a balancing score (i.e., a
distance measure that can later be used to match treatment and control units and in doing
so produce balanced samples), survey weights can be safely omitted from the first stage
of the procedure (DuGoff et al. 2014; Zanutto 2006). We do incorporate survey weights
as a predictor in the propensity score model, however, as doing so may help account for
potentially relevant individual attributes (DuGoff et al. 2014, 289). To be consistent with
previous studies, we also introduce survey weights into the second stage, when propen-
sity score matching algorithms are used to estimate PATEs. Next we describe the steps
that we take to incorporate survey weights into three different propensity score matching
techniques. In the next section we introduce definitions and further explain these
techniques.1
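
As a sketch of the first stage, the snippet below fits a logistic propensity score model with statsmodels and, in line with the practice just described, enters the survey weight as an additional predictor rather than as an estimation weight. The data frame and column names (treat, x1, x2, weight) are placeholders, not variables from any particular survey.

```python
import statsmodels.formula.api as smf

def estimate_pscore(df):
    """First stage: estimate each respondent's probability of treatment.
    Survey weights enter only as a predictor here; they are applied as
    weights later, in the second-stage estimators."""
    model = smf.logit("treat ~ x1 + x2 + weight", data=df).fit(disp=0)
    return model.predict(df)  # estimated propensity scores in (0, 1)
```

The returned scores feed the three second-stage procedures described next.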

Nearest-​Neighbor Matching
We first match each treatment unit to the nearest control unit(s), using nearest-​
neighbor matching with replacement. After that, we compute weighted differences
in means for our outcomes of interest between the treatment and matched control
groups. While treatment units are weighted using their original survey weights, control
units are weighted using the survey weights corresponding to their counterpart in the
treatment group. This procedure ensures that the weighted distribution of covariates in
the matched control resembles the weighted distribution of covariates in the matched
treatment and thus can be used to learn about the impact of the treatment among the
treated in the target population (i.e., it yields an estimate of the population average
treatment effect on the treated, or PATT).
The specific formula used to estimate PATT is

$$ \widehat{\mathrm{PATT}}_{nn} = \frac{\sum_{i \in T} w_i\, y_i}{\sum_{i \in T} w_i} - \frac{\sum_{i \in C} w_{m(i)}\, y_i}{\sum_{i \in C} w_{m(i)}} $$
where yi and wi denote the outcome of interest and survey weight for individual i,
respectively; wm(i ) denotes the survey weight corresponding to i’s counterpart in the
treatment group (only relevant for control units); T indicates the matched treatment
group (equivalent to the original treatment group since all treated units are kept in
the matched sample); and C indicates the matched control group. Because control
units with no counterpart in the treatment group are dropped from the analysis, this
nearest-​neighbor matching procedure cannot be used for obtaining estimates that
apply to the entire population, such as PATEs (DuGoff et al. 2014, p. 288). Since we
keep all treatment units, however, it can be used to estimate the PATTs.
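
In code, the estimator above reduces to the difference of two weighted means. The sketch below is our own and uses hypothetical array names; it assumes the matching step has already been run, so that y_c and w_match hold, for each matched control unit, its outcome and the survey weight of its treated counterpart.

```python
import numpy as np

def patt_nn(y_t, w_t, y_c, w_match):
    """Weighted nearest-neighbor PATT: treated units carry their own survey
    weights; matched controls carry the weight of their treated counterpart."""
    return (np.sum(w_t * y_t) / np.sum(w_t)
            - np.sum(w_match * y_c) / np.sum(w_match))
```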

Subclassification Matching
We stratify the sample into S subclasses, on the basis of the estimated propensity score.
Then we compute weighted differences in outcomes of interest between treatment and
control groups within each subclass. Because we are interested in comparing the perfor-
mance of this technique with that of nearest-​neighbor matching, we focus on estimating
the PATT. This quantity of interest is estimated by averaging strata-​specific treatment
effects across subclasses, accounting for the weighted number of treatment units within
each subclass. The procedure used to estimate the PATT can be summarized thus:

$$ \widehat{\mathrm{PATT}}_{sub} = \frac{\sum_{S} w_S \left( \frac{\sum_{i \in T_S} w_i\, y_i}{\sum_{i \in T_S} w_i} - \frac{\sum_{i \in C_S} w_i\, y_i}{\sum_{i \in C_S} w_i} \right)}{\sum_{S} w_S} $$
where yi and wi again denote the outcome of interest and survey weight for individual i, re-
spectively; Ts and Cs indicate the subset of treated and control individuals in subclass S, re-
spectively; and ws denotes subclass weights, which are proportional to the total number of
treated individuals in subclass S. If we were instead interested in estimating the PATE, sub-
class weights would account for the weighted number of both control and treatment units
within each subclass. For a more in-​depth discussion of the importance of introducing
survey weights into subclassification matching procedures, see Zanutto (2006).
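
A minimal sketch of the strata-averaging step follows, assuming a pandas DataFrame with columns y, w, treat, and subclass (the last produced by cutting the estimated propensity score into strata); the column names are ours, not those of any particular data set.

```python
import numpy as np
import pandas as pd

def patt_sub(df: pd.DataFrame) -> float:
    """Average strata-specific weighted effects, weighting each subclass
    by its weighted count of treated units, as in the formula above."""
    effects, subclass_weights = [], []
    for _, g in df.groupby("subclass"):
        treated, control = g[g["treat"] == 1], g[g["treat"] == 0]
        if treated.empty or control.empty:
            continue  # a stratum needs both treated and control units
        effects.append(np.average(treated["y"], weights=treated["w"])
                       - np.average(control["y"], weights=control["w"]))
        subclass_weights.append(treated["w"].sum())
    return float(np.average(effects, weights=subclass_weights))
```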

Propensity Score Weighting


We first compute composite survey weights by multiplying estimated propensity score
weights by the original survey weights. This allows estimating weighted differences
in the outcomes of interest between treatment and control groups that adjust for pre-​
treatment differences between the characteristics of treated and control units and that
apply to the target population. Since we focus on estimating the PATT, we construct
propensity score weights such that the control group resembles the treatment group as
closely as possible. Thus, propensity score weights equal one for treatment units and
equal the odds of being assigned to treatment for control units. The formula used to
compute the PATT under this procedure is

$$ \widehat{\mathrm{PATT}}_{psw} = \frac{\sum_{i \in T} w_i\, y_i}{\sum_{i \in T} w_i} - \frac{\sum_{i \in C} (w_i / p_i)\, y_i}{\sum_{i \in C} (w_i / p_i)} $$

where yi, wi, and pi denote the outcome of interest, survey weight, and propensity
score for individual i, respectively; and T and C indicate the original treatment and con-
trol groups, respectively. The inverse of the propensity score, 1/​pi, is multiplied by survey
weights (wi) among control individuals in order to produce a weighted distribution of
covariates in the control group resembling the one in the treatment group.
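
The same quantity can be computed directly from four arrays. The sketch below uses hypothetical variable names and mirrors the formula above: treated units keep their survey weights, while each control unit's survey weight is divided by its estimated propensity score.

```python
import numpy as np

def patt_psw(y, w, pscore, treat):
    """Propensity-score-weighted PATT with composite weights: survey weight
    times 1 for treated units, survey weight over p for control units."""
    t, c = treat == 1, treat == 0
    w_c = w[c] / pscore[c]
    return (np.sum(w[t] * y[t]) / np.sum(w[t])
            - np.sum(w_c * y[c]) / np.sum(w_c))
```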
For each technique, in addition to computing weighted differences in average
outcomes between treatment and control units, we conduct postmatching model-​based
adjustment by running weighted regressions in matched samples (in the case of nearest-​
neighbor matching), within subclasses (in the case of subclassification matching),
and in the entire sample (in the case of propensity score weighting). The advantage
of postmatching model-​based adjustment is that it can be used to estimate treatment
effects while controlling for imbalances in covariates that might remain after the imple-
mentation of the matching algorithm (Ho et al. 2007; Rosenbaum and Rubin 1983).

Simulation Study

This section discusses the results of a Monte Carlo simulation study that we conducted
to illustrate the importance of incorporating survey weights into propensity score
matching procedures. We start by describing the characteristics of the simulation pro-
cess, including population assumptions and the method used to generate synthetic
survey data. After that we assess the success of the different propensity score matching
procedures in recovering true treatment effects.
We first generated data for a hypothetical population in which an outcome of interest
(Y) was assumed to be affected by exposure to a binary treatment (T), a uniformly dis-
tributed covariate (X), and membership in a subpopulation (indicated by a binary indi-
cator S1). The mathematical formula used to generate the outcome is

$$ Y = \alpha_0 + \alpha_1 X + \alpha_2 S_1 + (\beta_0 + \beta_{S_1} S_1)\, T + \varepsilon $$

where α0 = −0.25, α1 = 0.60, α2 = 0.30, β0 = 1.00, and βS1 = 0.75, and where ε designates
a white noise error term following a standard-​normal distribution. Exactly 10% of the
population was coded as belonging to S1 (i.e., as having S1 = 1), and the remaining 90%
were all coded as having S1 = 0. The above expression implies the existence of heteroge-
neous treatment effects, as the impact of T on Y depends on the value of S1. Specifically,
the treatment effect equals 1 when S1 = 0 and 1.75 when S1 = 1, leading to a PATE of
1.075 = 1 + (0.75 × 0.1).
In addition, we assumed that assignment to treatment T is positively affected by
covariate X and membership in S1, as indicated by the following mathematical expres-
sion used to compute the probability of assignment to treatment (pT):

$$ \mathrm{logit}(p_T) = \gamma_0 + \gamma_1 X + \gamma_2 S_1 $$

where γ0 = −1, γ1 = 2, and γ2 = 1. Since S1 is assumed to have a positive influence on T, the
realized proportion of individuals with S1 = 1 was larger in the treatment group than in
the control group. As a consequence—​since S1 is also assumed to have a positive influ-
ence on Y—​the PATT is larger than the PATE: approximately 1.112 compared to 1.075.
We repeatedly sampled 1,000 respondents from a hypothetical population of size
100,000 with the above characteristics, using random sampling within strata defined by
membership in S1. We did so by first randomly selecting 500 individuals for whom S1 = 1,
and 500 additional individuals for whom S1 = 0. While only 10% of individuals in the
population possess attribute S1, 50% of those in the sample do so. Thus, our survey design
led to oversampling of individuals with attribute S1 and undersampling of individuals
without this attribute. This sampling procedure is analogous to the one often used by
researchers to oversample racial minorities. To correct for this oversampling in subse-
quent analyses, we created survey weights given by the inverse of selection probabilities.
More formally, for each sampled individual, survey weights equaled

$$ W = (1 - S_1)\,\frac{p_{S_1}}{p_{\bar{S}_1}} + S_1 $$

where pS1 = nS1 / N S1 is the probability of selection for members of S1, given by the number
of individuals with S1 = 1 in the sample (nS1 ) divided by the number of individuals with
S1 = 1 in the population (NS1), and $p_{\bar{S}_1}$ is the probability of selection for individuals who
do not belong to S1, computed analogously. Accordingly, individuals in S1 were assigned
a survey weight equal to 1, and individuals outside S1 were assigned a survey weight
equal to 9.
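
The data-generating process and sampling scheme just described can be written down directly. The following sketch is our own code; it uses the parameter values given above to build the population, draw one stratified sample of 500 + 500, and attach the survey weights of 1 and 9.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Population: S1 marks a 10% subgroup, X is uniform, T depends on X and S1.
S1 = (np.arange(N) < N // 10).astype(float)
X = rng.uniform(size=N)
pT = 1 / (1 + np.exp(-(-1 + 2 * X + 1 * S1)))   # logit(pT) = -1 + 2X + S1
T = rng.binomial(1, pT)
Y = -0.25 + 0.60 * X + 0.30 * S1 + (1.00 + 0.75 * S1) * T + rng.normal(size=N)

# Stratified sample: 500 respondents with S1 = 1 and 500 with S1 = 0.
idx1 = rng.choice(np.where(S1 == 1)[0], size=500, replace=False)
idx0 = rng.choice(np.where(S1 == 0)[0], size=500, replace=False)
sample = np.concatenate([idx1, idx0])

# Survey weights: inverse selection probabilities, normalized so members
# of S1 receive weight 1 and everyone else receives weight 9.
p_in, p_out = 500 / (N // 10), 500 / (N - N // 10)
W = np.where(S1[sample] == 1, 1.0, p_in / p_out)
```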
After generating 1,000 synthetic data sets using stratified random sampling—​each
one with a sample size of 1,000—​we estimated sample average treatment effects on
the treated (SATTs) and PATTs within each data set, by doing a naïve comparison
of unweighted and weighted outcomes in the treatment and control groups and
by applying the three propensity score matching techniques described in the pre-
vious section, excluding and including survey weights. This procedure allowed us to
compare the success of each technique, with and without regression adjustment, in
recovering true quantities of interest before and after incorporating survey weights
into the analysis.
In estimating propensity score models and conducting regression-​based adjustments,
we assume that S1 (a variable that affects sampling probabilities, treatment assignment,
and treatment effects) is not observed, and we therefore exclude it from the set of
predictors. We do so for exposition purposes, as it is only when a variable that affects
selection probabilities is omitted from the analysis that failing to incorporate survey
weights into the analysis may lead to incorrect estimates of PATEs (Gelman 2007).
Table 14.1 summarizes the results of the simulation study. The columns of this table
give the following information for the estimated average treatment effects on the treated
(ATTs): mean, bias (average of the absolute value of the difference between the true and
estimated effect), mean squared error or MSE (average of the squared difference between
the true and estimated effect), and coverage probability (proportion of simulations
in which the true effect falls inside the 95% credible interval for the estimated effect).
These results suggest that incorporating survey weights into any of the three propensity
score matching procedures leads to lower bias, lower MSE, and higher coverage prob-
ability. Figure 14.1 depicts the distribution of estimated ATTs, excluding and including
survey weights, for the three propensity score matching procedures. The vertical line in
each plot indicates true ATTs. Consistent with the results shown in Table 14.1, ignoring
survey weights leads to biased estimators of treatment effects.
If factors that influence sample selection are excluded from the calculation of survey
weights, however, then weighted causal effects estimators may perform poorly. Suppose,
within the context of the previous example, that there is an additional binary variable
S2 that affects sample selection, such that individuals with S2 equal to 1 are not appro-
priately represented in the sample, and that S2 is not used in developing survey weights.

Table 14.1 Simulation Study, Summary Statistics

                                                           95% Confidence Interval
                                  ATT    Bias    MSE       2.50%    97.50%   Coverage
Naïve              unweighted     1.83   0.72    0.52      1.69     1.95        0.0
                   weighted       1.46   0.35    0.13      1.29     1.63        0.1
Prop. Score        unweighted     1.52   0.41    0.17      1.36     1.68        0.2
Weighting          weighted       1.11   0.00    0.01      0.92     1.30       53.5
  with reg.        unweighted     1.52   0.41    0.17      1.37     1.68        0.2
  adjustment       weighted       1.10  –0.01    0.01      0.91     1.30       79.6
Nearest-Neighbor   unweighted     1.52   0.41    0.18      1.34     1.71        0.7
                   weighted       1.11   0.00    0.01      0.88     1.35       50.1
  with reg.        unweighted     1.52   0.41    0.18      1.34     1.71        0.6
  adjustment       weighted       1.10  –0.01    0.01      0.88     1.34       82.5
Subclassification  unweighted     1.53   0.42    0.18      1.37     1.68        0.2
                   weighted       1.11   0.00    0.01      0.91     1.31       50.3
  with reg.        unweighted     1.52   0.41    0.17      1.36     1.68       60.4
  adjustment       weighted       1.11   0.00    0.01      0.92     1.30      100.0
Figure 14.1 Distribution of Estimated ATTs in Simulation Study. (Six density plots compare unweighted and weighted estimates for propensity score weighting, nearest-neighbor matching, and subclassification matching, each without and with regression adjustment; N = 1,000 simulated data sets per panel.)

Suppose further that treatment effects vary as a function of S2 as determined by the
following mathematical formula for the outcome (Y′):

$$ Y' = \alpha_0 + \alpha_1 X + \alpha_2 S_1 + (\beta_0 + \beta_{S_1} S_1 + \beta_{S_2} S_2)\, T + \varepsilon $$

where βS2 captures how the influence of T on Yʹ varies as a function of S2 (i.e., hetero-
geneous effects of S2). Depending on the magnitude of βS2 and the degree of misrep-
resentation, excluding S2 from the calculation of survey weights—​that is, developing
weights based on S1 only, using the same formula for W as before—​can produce inaccu-
rate inferences.
To demonstrate the relevance of the proper development of survey weights, we
conducted two additional Monte Carlo simulation studies using a similar procedure to
the one described earlier on in this chapter, but in which (1) the treatment effect varies as
a function of S2 as indicated by βS2 in the last equation, and (2) S2 is not used in the devel-
opment of survey weights. The first simulation study—​designed such that individuals
with S2 equal to 1 are three times as likely to be included in the sample as those with S2
equal to 0—​illustrates how the magnitude of the bias varies as a function of βS2 .2 Figure
14.2, panel A, shows how, for the six weighted estimators considered in this chapter, the ab-
solute value of the bias increases from close to 0 to almost 0.40 when βS2 decreases from
0 to –​1.6 or increases from 0 to 1.6. The second simulation study helps illustrate how the
magnitude of the bias varies as a function of the degree of misrepresentation when βS2 is
held constant at –​0.5.3 Figure 14.2, panel B, shows how, for the six weighted estimators,
the absolute value of the bias increases from 0 to more than 0.20 when individuals with
S2 equal to 1 are severely under-​or overrepresented in the sample, as measured by the
ratio of proportions of individuals with S2 equal to 1 in the population relative to the
sample. A  remarkable aspect of these last two simulation studies is that the specific
weighted causal effects estimator used for calculating ATTs is much less consequential
than the proper development of survey weights, especially when the excluded variable
strongly influences the treatment effect or when there is severe misrepresentation.

Application: Social Media Usage and Political Participation

A number of scholars have argued that social media will increase political participa-
tion. To a large extent, the theory behind this argument is founded on two concepts.
First, social media have the potential to decrease the information gap between more-​
informed and less-​informed citizens by allowing less-​informed citizens greater access
to their more-​informed peers, who may provide political expertise (Huckfeldt and
Sprague 1995; Schlozman, Verba, and Brady 2013). Increasing citizens’ information
should decrease their cost of participation. Second, social media decrease the social cost
of disagreement, allowing people to more anonymously access new ideas (Gentzkow
and Shapiro 2011). Again, by allowing people access to more ideas and information,
their participation should increase. Countering these claims, others have argued that
increasing political disagreement in personal relationships will directly decrease par-
ticipation (Mutz 2006), and moreover, decreasing common experiences will decrease
interest in politics (Sunstein 2007). Researchers have generally concluded that despite
lower barriers for participation via these channels, political participation has not dra-
matically increased since the advent of social media (Bimber and Davis 2003; Bimber
2001, 2003; Jennings and Zeitner 2003).
We test the extent to which survey respondents are likely to report differing levels of
political engagement based on their social media usage. Data for this analysis are drawn
Figure 14.2 Biased Estimators Due to Misspecified Survey Weights. A: Influence of changes in heterogeneous effects of S2 (absolute bias plotted against βS2). B: Influence of changes in representation of S2 in the sample (absolute bias plotted against the proportion with S2 = 1 in the sample relative to the population).
Note: The six lines in plots A and B correspond to different propensity score–based causal effects estimators (weighting, subclassification, and nearest-neighbor; with and without regression adjustment).

from TAPS, a monthly online survey of about two thousand people. Panelists were
recruited as a national probability sample with an address-​based sampling frame in the
fall of 2011 by GfK-​Knowledge Networks for the Weidenbaum Center at Washington
University.4 We quantify political engagement based on responses to eleven political
participation questions included in the June wave of the 2012 TAPS panel. Each of the
eleven political participation items has a binary response such that it takes 1 if the re-
spondent has engaged in the activity in the last few months and takes 0 otherwise. The
political activities range from contacting an elected official to having signed a petition
to having discussed politics with other people. The overall civic engagement scale is
computed by summing up the binary responses from the eleven political participation
items. For example, a scale of 11 is assigned to those respondents who have engaged in all
of the eleven activities, while a scale of 0 is assigned to those respondents who have
engaged in none of the activities. Our primary explanatory variable is a binary variable
that takes 1 if the respondent uses social networking websites (SNSs) to communicate
with his or her friends and family and 0 otherwise. We match using age, gender, income,
education, and frequency of Internet usage. Since these variables do not account for all
the factors that were used to construct survey weights, it is likely that ignoring survey
weights may lead to inaccurate estimates of treatment effects.
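
As a small illustration of how these variables could be assembled, the sketch below builds the 0-11 civic engagement scale and the SNS treatment indicator from a respondent-level DataFrame; the column names (part_1 ... part_11, sns_use) are placeholders, not the actual TAPS variable names.

```python
import pandas as pd

def build_variables(taps: pd.DataFrame) -> pd.DataFrame:
    """Sum eleven 0/1 participation items into a 0-11 engagement scale and
    code the SNS-usage treatment indicator. Column names are placeholders."""
    out = taps.copy()
    items = [f"part_{k}" for k in range(1, 12)]       # eleven binary items
    out["civic_engagement"] = out[items].sum(axis=1)  # 0-11 scale
    out["treat"] = (out["sns_use"] == 1).astype(int)  # 1 = uses SNSs
    return out
```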
We estimated treatment effects in multiple ways, by comparing overall levels of political
participation (as measured by the civic engagement scale) for respondents who do and do
not use social networking sites. We did so naïvely (i.e., without controlling for preexpo-
sure differences between SNS users and nonusers), as well as by matching on a number
of individual attributes using the propensity score matching techniques described in pre-
vious sections. We focused on estimating ATTs—that is, effects among individuals who
resemble social media users. For each procedure, the SATT and the PATT were calculated by ignoring
and incorporating survey weights into the analysis, respectively. The following is a brief
description of the implementation of each propensity score matching technique:

• Nearest-neighbor matching was implemented by first matching each SNS user
to the nearest two nonusers, with the distance between respondents measured in
terms of the estimated likelihood of SNS usage, and then computing differences
in average levels of political participation between SNS users and nonusers in the
matched sample, weighting using survey weights for estimating the PATT.
• Subclassification matching was implemented by first splitting the sample into eight
subclasses based on the likelihood of SNS usage, then computing differences in average
levels of political participation between SNS users and nonusers within each subclass,
and last averaging across subclasses (weighting by the number of SNS users within each
subclass), incorporating survey weights into the procedure when estimating the PATT.
• Propensity score weighting was implemented by computing weighted differences
in average levels of political participation between SNS users and nonusers in the
entire sample, with weights equal to one for users and given by the odds of SNS
usage for nonusers, using adjusted weights (equal to propensity score weights times
the original survey weights) for estimating the PATT.

Table 14.2 summarizes the results of our analysis. It reports estimates of the effects of so-
cial media usage on users (i.e., ATTs) found by computing naïve differences and using the
three propensity score matching procedures described above, with and without regres-
sion adjustment. Unweighted quantities approximate the SATT, and weighted quantities
approximate the PATT. The naïve comparison suggests that there are no significant
differences between groups. After controlling for individual attributes (particularly age)
using any of the three matching methods, it becomes apparent that usage of social net-
working sites has a positive and generally significant effect on the political involvement
of sampled users, with the magnitude of the estimated SATT ranging between .21 and .30
depending on the exact method. This positive influence, however, only holds in the un-
weighted sample. Following the incorporation of survey weights, we find that the impact of
Table 14.2 Influence of Social Network Usage on Political Participation

                                  Difference   Std. Err.
Naïve              unweighted       –0.07        0.14
                   weighted         –0.07        0.14
Prop. Score        unweighted        0.29        0.16
Weighting          weighted          0.14        0.16
  with reg.        unweighted        0.28        0.13
  adjustment       weighted          0.16        0.12
Nearest-Neighbor   unweighted        0.21        0.16
                   weighted         –0.03        0.16
  with reg.        unweighted        0.22        0.16
  adjustment       weighted          0.06        0.16
Subclassification  unweighted        0.27        0.14
                   weighted          0.07        0.14
  with reg.        unweighted        0.30        0.41
  adjustment       weighted          0.14        0.38

social networking sites on the political involvement of population users (i.e., the PATT)
is statistically indistinguishable from zero. This result is consistent with previous findings
about the influence of social media on political participation (Bimber 2001, 2003; Bimber
and Davis 2003; Jennings and Zeitner 2003) and illustrates the importance of accounting
for features of the survey design when using survey data to elicit causal effects.

Conclusion

This chapter illustrates the importance of accounting for key characteristics of the sampling design, such as probabilities of selection into the survey, when using complex survey data as input for making population-level inferences. Using a Monte Carlo
simulation study, we demonstrated how ignoring survey weights may lead to biased
estimators of PATEs. Then, using data from TAPS, we illustrated how ignoring survey
weights would cause us to conclude that social media usage has positive and signifi-
cant effects on political participation, when these effects are not actually apparent in the
target population.
A number of caveats are in order, though, as survey weights included in most public
opinion surveys are estimated quantities that may carry considerable uncertainty. If
poorly constructed, survey weights may fail to accurately account for differential selec-
tion probabilities and nonresponse. This may happen, for instance, when researchers are
uncertain about the causes of nonresponse or about the true characteristics of the target
population, in which case post-​stratification weights cannot be relied on for recovering
population-level quantities of interest. Jackman and Spahn (2014), for example, find that
although nonresponse is one of the main drivers of turnout overestimation in the ANES,
survey weights do not solve the problem but actually worsen it (p. 3). Another limitation
of survey weights is that the presence of extreme weights for some units may exacerbate
the variance of estimates of causal effects, an issue that is often dealt with by trimming
survey weights (Elliott and Little 2000).
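As a rough illustration of this kind of ad hoc trimming (the percentile cutoff and variable names below are hypothetical; Elliott and Little (2000) discuss model-based alternatives), weights above a chosen percentile can be capped and the result rescaled to preserve the original total:

import numpy as np

def trim_weights(weights, upper_percentile=99):
    """Cap extreme survey weights at a percentile, then rescale to the original total."""
    w = np.asarray(weights, dtype=float)
    capped = np.minimum(w, np.percentile(w, upper_percentile))
    return capped * (w.sum() / capped.sum())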
Finally, it has been observed that the tasks of calculating survey weights, using
them to estimate quantities of interest, and estimating the appropriate standard
errors may carry considerable complexity (Gelman 2007; Winship and Radbill
1994). In general, it is possible to encounter weights that account for sampling frame
errors (incomplete frames), that correct for varying selection probabilities (e.g.,
when a group is oversampled because members are more difficult to reach), and
that account for unit nonresponse, as well as post-​stratification weights that aim to
correct for undercoverage and nonresponse and to reduce sampling variance. There is a set of best
practices for statistically adjusting survey weights for particular inferences using
standard socioeconomic and demographic covariates available in known population
distributions (DeBell and Krosnick 2009; Gelman 2007; Henderson et al. 2010), but
they tend to be fairly laborious for a typical user of these data, and thus researchers
often use the design-​based weight for each respondent to provide a snapshot of na-
tional opinion. These design-​based weights may, even in canonical and publicly
available data sets, be too large for a researcher to feel comfortable focusing on the
inferences drawn about a particular subgroup. They may also have been calculated
erroneously, so before using weights a researcher should attend to any documenta-
tion surrounding their estimation. We call here for a theory-​and evidence-​based
debate on this issue, and for professional organizations, such as the American
Association for Public Opinion Research (AAPOR) and the Society for Political
Methodology (SPM), to provide guidance on the correct development and usage of
survey weights. We are particularly interested in debate surrounding the relationship
between design-​based weights and descriptions of sampling error in opt-​in surveys.
Until clearer standards are introduced, researchers should bear in mind the potential
drawbacks of design-​based weights and decide, on a case-​by-​case basis, the suita-
bility of incorporating them into their analyses.
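To give a sense of what constructing post-stratification (raking) weights involves in practice, the following sketch implements a bare-bones version of iterative proportional fitting toward known population margins. It is a simplified illustration under assumed inputs (the column names and margins are hypothetical), not a substitute for the documented procedures cited above.

import numpy as np
import pandas as pd

def rake_weights(df, margins, base_weight=None, max_iter=100, tol=1e-6):
    """Rake weights so sample margins match known population shares.

    `margins` maps a column name (e.g., "age_group") to a dictionary of
    {category: population share}. Starting from the design weights (or
    uniform weights), each dimension is adjusted in turn until the largest
    adjustment factor is close to one.
    """
    w = (df[base_weight].to_numpy(dtype=float) if base_weight is not None
         else np.ones(len(df)))
    w = w / w.sum()
    for _ in range(max_iter):
        max_change = 0.0
        for col, targets in margins.items():
            for category, share in targets.items():
                mask = (df[col] == category).to_numpy()
                current = w[mask].sum()
                if current > 0:
                    factor = share / current
                    w[mask] *= factor
                    max_change = max(max_change, abs(factor - 1.0))
        if max_change < tol:
            break
    return w * len(df)  # rescale so the weights average to one

# Hypothetical usage:
# margins = {"gender": {"male": 0.49, "female": 0.51},
#            "age_group": {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}}
# df["poststrat_weight"] = rake_weights(df, margins, base_weight="design_weight")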

Notes
1. All data and code used in this chapter are available at the project’s Dataverse, at https://​
dataverse.harvard.edu/​dataverse/​cicsd.
2. For this simulation study, we generated 1,000 synthetic data sets for each of 17 values of βS2, with values of βS2 ranging between –1.6 and 1.6.
3. For this simulation study, we generated 1,000 synthetic data sets for 19 different levels of repre-
sentation of individuals with S2 equal to 1, with representation ratios ranging between 0.05
and 0.95. Representation ratios were calculated by dividing the proportion of individuals
with S2 equal to 1 in the sample by the proportion of individuals with S2 equal to 1 in the
population.
4. More specifically, the frame is drawn from the U.S. Postal Service’s computerized delivery
sequence file of mailing addresses. Access to this file allows TAPS to reach approximately
97% of all physical addresses in the country, including P.O. boxes and rural route addresses.
To improve the sampling process, residences that are determined to be seasonal or va-
cant are identified and removed. Through a third-​party vendor, the frame is able to match
identified physical addresses with landline telephone numbers and with a certain level of
accuracy in identifying the race, age, number, and type of individuals in the residence, as
well as home ownership status. Since some demographic groups are more difficult to iden-
tify and recruit by the third-​party vendor, the sample is stratified to target young adults and
Hispanic persons in addition to the balance of the general population. Thus, these groups
are slightly oversampled to anticipate their predicted likelihood of underparticipation in
probability samples. Once panelists have been selected for the survey, they complete a pro-
file survey that captures key demographic variables, followed by monthly waves of the panel.
Those individuals without Internet access were provided a laptop and Internet service at the
expense of the Weidenbaum Center. In a typical month, over 1,600 of the panelists complete
the online survey. The data for this project come from monthly surveys collected between
November 2011 and November 2014. More technical information about the survey is avail-
able at http://​taps.wustl.edu.

References
Alvarez, R. M., and J. Brehm. 2002. Hard Choices, Easy Answers:  Values, Information, and
American Public Opinion. Princeton, NJ: Princeton University Press.
Alvarez, R. M., and J. Nagler. 1995. “Economics, Issues and the Perot Candidacy: Voter Choice
in the 1992 Presidential Election.” American Journal of Political Science 39 (3): 714–​744.
American National Election Studies (ANES). 2015. “User’s Guide and Codebook for the ANES
2012 Time Series Study.” The University of Michigan and Stanford University.
Ansolabehere, S., and E. Hersh. 2012. “Validation:  What Big Data Reveal about Survey
Misreporting and the Real Electorate.” Political Analysis 24 (4): 437–​459.
Bimber, B. 2001. “Information and Political Engagement in America: The Search for Effects of
Information Technology at the Individual Level.” Political Research Quarterly 54: 53–​67.
Bimber, B. 2003. Information and American Democracy. Cambridge, UK:  Cambridge
University Press.
Bimber, B., and R. Davis. 2003. Campaigning Online. New York: Oxford University Press.
Brehm, J. 1993. The Phantom Respondents: Opinion Surveys and Political Representation. Ann
Arbor: University of Michigan Press.
DeBell, M., and J. A. Krosnick. 2009. “Computing Weights for American National Election
Study Survey Data.” ANES Technical Report series, no. nes012427. Ann Arbor, MI, and Palo
Alto, CA: American National Election Studies.
DuGoff, E. H., M. Schuler, and E. A. Stuart. 2014. “Generalizing Observational Study
Results: Applying Propensity Score Methods to Complex Surveys.” Health Services Research
49 (1): 284–​303.
Elliott, M. R., and R. J.  A. Little. 2000. “Model-​Based Alternatives to Trimming Survey
Weights.” Journal of Official Statistics 16 (3): 191–​209.
Feldman, S. 1988. “Structure and Consistency in Public Opinion: the Role of Core Beliefs and
Values.” American Journal of Political Science 32 (2): 416–​440.
Gelman, A. 2007. “Struggles with Survey Weighting and Regression Modeling.” Statistical
Science 22 (2): 153–​164.
Gentzkow, M., and J. Shapiro. 2011. “Ideological Segregation Online and Offline.” Quarterly
Journal of Economics 126: 1799–​1839.
Groves, R. M., F. J. Fowler Jr., M. P. Couper, J. M. Lepkowski, E. Singer, and R. Tourangeau.
2009. Survey Methodology. Hoboken, NJ: John Wiley & Sons, Inc.
Henderson, M., D. S. Hillygus, and T. Tompson. 2010. “‘Sour Grapes’ or Rational Voting? Voter Decision Making among Thwarted Primary Voters in 2008.” Public Opinion Quarterly 74
(3): 499–​529.
Ho, D. E., K. Imai, G. King, and E. A. Stuart. 2007. “Matching as Nonparametric Preprocessing
for Reducing Model Dependence in Parametric Causal Inference.” Political Analysis 15
(3): 199–​236.
Huckfeldt, R., and J. Sprague. 1995. Citizens, Politicism and Social Communication. Cambridge,
UK: Cambridge University Press.
Imai, K. 2014. “Introduction to the Virtual Issue: Past and Future Research Agenda on Causal
Inference.” Political Analysis Virtual Issue 2: 1–​4.
Iyengar, S., and L. Vavreck. 2012. “Online Panels and the Future of Political Communication
Research.” In The SAGE Handbook of Political Communication, edited by H. A. Semetko and
M. Scammel, 225–​240. Thousand Oaks, CA: SAGE Publications Inc.
Jackman, S., and B. T. Spahn. 2014. “Why Does the American National Election Study
Overestimate Voter Turnout?” Paper presented at the 31st annual meeting of the Society for
Political Methodology. http://​polmeth.wustl.edu/​mediaDetail.php?docId=1514.
Jennings, M. K., and V. Zeitner. 2003. “Internet Use and Civic Engagement.” Public Opinion
Quarterly 67: 311–​334.
Keele, L. 2015. “The Statistics of Causal Inference:  The View from Political Methodology.”
Political Analysis 23 (3): 313–​335.
Morgan, S. L., and C. Winship. 2015. Counterfactuals and Causal Inference:  Methods and
Principles for Social Research. New York: Cambridge University Press.
Mutz, D. 2006. Hearing the Other Side. Princeton, NJ: Princeton University Press.
Page, B. I., R. Y. Shapiro, and G. R. Dempsey. 1987. “What Moves Public Opinion?” American
Political Science Review 81 (1): 23–​43.
Rivers, D., and D. Bailey. 2009. “Inference from Matched Samples in the 2008 US National
Elections.” In Proceedings of the Joint Statistical Meetings, Survey Research Methods Section.
Alexandria, VA: American Statistical Association, 627–​639.
Rosenbaum, P. R., and D. B. Rubin. 1983. “The Central Role of the Propensity Score in
Observational Studies for Causal Effects.” Biometrika 70 (1): 41–​55.
Schlozman, K. L., S. Verba, and H. E. Brady. 2013. The Unheavenly Chorus. Princeton,
NJ: Princeton University Press.
Sunstein, C. R. 2007. Republic.com 2.0. Princeton, NJ: Princeton University Press.
Winship, C., and L. Radbill. 1994. “Sampling Weights and Regression Analysis.” Sociological
Methods & Research 23 (2): 230–​257.
Zanutto, E. L. 2006. “A Comparison of Propensity Score and Linear Regression Analysis of
Complex Survey Data.” Journal of Data Science 4 (1): 67–​91.
Chapter 15

Aggregating Survey Data to Estimate Subnational Public Opinion

Paul Brace

Introduction

The study of subnational public opinion presents special opportunities. The funda-
mental benefit offered by measuring public opinion at the subnational level is that it
affords uncommon opportunities to gauge the nature of the connections of opinions to
their political and socioeconomic contexts, on the one hand, and the linkage of these
opinions to subnational governmental outcomes, on the other. Systematic comparative
analyses of the causes and consequences of public opinion across governmental units
allow us to focus on the nature of the linkages between mass publics and governmental
outcomes.
For many years the study of public opinion by political scientists rested on the unex-
plored assumption that it influenced government leaders and ultimately public policy
(Shapiro 2011). The rise of modern polling techniques gave researchers a way to reg-
ularly measure people’s privately held opinions. Pioneered famously by Campbell,
Converse, Miller, and Stokes in The American Voter (1960), these surveys were subse-
quently administered during every presidential election, albeit with modifications, in
the American National Election Studies. These new data stimulated intense analytical
effort concerning the correlates of participation and vote choice, levels of information
respondents exhibited, and the consistency of their answers across questions (see, e.g.,
Converse 1964; Popkin et al. 1976; Zaller 1992; Popkin 1994).
Questions about linkages between opinion and policy became more salient as
mounting research on the content and forces operating on public opinion revealed
low levels of information and interest in political issues (Delli Carpini and Keeter
1996). Impressive advances occurred in survey methodologies but lacked an important
dimension: the basis for systematic comparative analyses across governmental units.
For years we have been awash in an ever-​expanding sea of national and independent
subnational surveys, but no attention has been paid to systematizing these surveys in
a manner that would make them analytically comparable. Then and now, subnational
surveys were conducted by different polling organizations, at different times, using
different question wording (Parry, Kisida, and Langley 2008).
Just as the measurement and analysis of subnational public opinion offers special
opportunities for linking opinions to contexts and outcomes, it also presents special
and formidable challenges. The question of linkage has largely been ignored or has been
approached by using crude surrogates for opinion due to the lack of an analytical in-
frastructure needed to produce survey-​based measures of subnational opinion. While
arguably one of the most significant questions in the study of government, linkage
simply did not lend itself to rigorous empirical analysis, given the available data and
methodologies. Such analyses would require systematic comparative data that could
link well-​measured opinion to similarly well-​measured indicators of government
actions.
A notable pioneering strategy was to study how voters’ preferences within particular
constituencies (e.g., congressional districts) connect to the behavior of policymakers
for that constituency (e.g., roll-​call votes). The “dyadic representation” model pioneered
by Miller and Stokes (1963) reported modest and variable linkages across policy areas.
This study also highlighted a fundamental difficulty with analyzing linkages within
subnational domains: the number of survey responses available from national surveys to
gauge constituency opinion within subnational units was exceedingly small. The small
number of observations within districts, in the face of modest and variable correlations
between opinion and politicians’ behavior, left questions about whether the observed
relationships were truly modest or an artifact of low reliability of opinion estimates
owing to small numbers of observations within districts. Using different measures of
opinion or methodological assumptions, subsequent studies extended the foundations
of Miller and Stokes, reporting stronger evidence of opinion-​policy linkages (e.g.,
Achen 1975, 1977; Erikson 1978; Page, Shapiro, Gronke, and Rosenberg 1984; Bartels 1991;
McDonagh 1992).
The Miller and Stokes study has been enormously influential, but also highlights
core methodological and inferential challenges embodied in investigating constituency
opinion and policy outcomes at the subnational level. These challenges are persistent
obstacles that researchers have sought to address by employing new data or leveraging
existing data in increasingly sophisticated ways.
Another core debate within this evolving literature concerns how the “quality”
of public opinion is tied to the “effects” it has on political outcomes. A  continuing
question concerns the extent to which citizen opinions shape outcomes or are instead
led, manipulated, or informed by political leaders, the mass media, or other forces in
the political environment. Burstein (2010) observes that our measures of opinion about
specific policies derived from national surveys are generally quite poor. Researchers
are commonly forced to use opinions on (arguably) related topics (e.g., self-​proclaimed
political ideology [Erikson et al. 1993], “policy mood” [Stimson et al. 1995]), but this
leaves lingering questions of interpretation: Is the observed relationship or lack of re-
lationship “genuine,” or is it an artifact of using surrogate opinion measures? Burstein
(2010) argues that such measures of public opinion “provide no information at all as to
what specifically the public wants” (2010, 69). Moreover, Page (2002) argues that our
studies overestimate the impact of opinion on policy because of sampling bias: public
opinion polls focus on issues that are important to the public, and it is on such issues that
democratic governments are most likely to do what the public wants (2002, 232–​235).
In general, researchers studying subnational opinion using national survey data are
forced to work with survey items that are less than ideal for gauging linkages on specific
policies. Instead, global or general measures have been employed that, while showing
substantial evidence of linkage, do not provide insight into how specific opinions trans-
late into specific policies. This is supported by voluminous evidence showing that most
of the public does not hold opinions, or maintain consistent opinions, on many specific
issues most of the time (see, e.g., Converse 1964; Zaller 1992).
If citizen opinion is absent on specific issues, what could explain the observed
linkages between general opinions and specific policy outcomes? It is possible our
causal arrows need to be reversed. Gabriel Lenz observes that after decades of research,
“[d]‌etermining whether citizens lead their politicians or follow them turns out to be a
lot harder than it sounds. Basic correlations between citizen policy views and their vote
choice or policy outcomes does not allow researchers to disentangle which came first,
citizen attitudes or electoral or policy outcome. Such correlations derived from cross-​
sectional research designs cannot tell these two very different outcomes apart because
they are observationally equivalent” (Lenz 2012, 7; see also Norrander 2000). To unpack
the causal sequence requires that we examine not only differences between units, but
also differences within units between cause and effect. Moreover, it requires variations
in the magnitude of the causal variable to measure the magnitude of effect on political
outcomes, if any. While correlations between opinion and political outcomes are a nec-
essary condition for inferring democratic responsiveness, it is not sufficient, because
this correlation could just as easily result from outcomes driving opinions. Ultimately,
the sufficient condition for democratic responsiveness requires that changes of variable
magnitude in opinions precede and translate into changes in variable magnitude in po-
litical outcomes.
As illustrated in the following review, research on linkages between subnational
opinion and political outcomes highlights central and recurring concerns:

• The first concern is the sources of data to measure subnational opinion. In the ab-
sence of suitable subnational surveys, researchers are forced to make pragmatic
decisions about alternative sources of data to gauge subnational opinion. Studies
have employed surrogates or used observations obtained from national surveys,
producing ever-​improving but still less than ideal measures of specific opinions.
• The second concern is the sufficiency of the number of observations used to es-
timate subnational opinion. The number of observations available in national
surveys for specific subnational constituencies (e.g., states, counties, congressional districts) varies tremendously across subunits. Given the very small number of observations, or none at all, available for some or many subunits, reliable comparisons of opinions
and their effects across many subunits are commonly limited to those subunits
with sufficient observations. As a consequence, studies of subnational linkage
commonly must focus on a subset of the more populous (and thus more sampled)
subunits while ignoring subunits with smaller populations. This becomes partic-
ularly problematic when there are relatively few subunits, such as states, which
together exhibit considerable variety in their politics and policies, that would be
ignored by focusing on a handful of highly populated (and sampled) states.
• The third and related concern involves the data needed for research designs that can
embrace the causal sequences involved in the opinion-​policy linkage. Ultimately
researchers must consider longitudinal features of opinion within subnational
units. If opinion drives policy, changes in opinion must translate into changes in
policy, but this requires not only sufficient observations within subnational units
in general, but also sufficient observations within subunits over time to measure
opinion change.
• Finally, the substance of our measures of opinion commonly derives from prag-
matic choices based on available data, but these often fall short of the specificity
needed to elaborate the processes whereby specific opinions translate into specific
policy changes.

Overall, the evolution of the study of subnational opinion has involved progressive
improvements, using new data and methodologies to produce more reliable and specific
measures of subnational opinion, based on more observations, making comparisons
among more subunits possible, and allowing for longitudinal analyses of subnational
opinion that will ultimately be necessary for articulating the causal connections be-
tween opinion and policy across subnational units. This is a dynamic area of research
that has attracted significant and sustained scholarly interest, one that promises to yield
impressive dividends in the future.

Opinion-​Policy Linkage
Data and analytical demands for studying opinion-policy linkages have served as major impediments to progress; these demands also defined the research frontiers that successive innovations and methodological advances had to surmount. The study of subnational public opinion
has been characterized by increasingly sophisticated methodologies for surmounting
the vexing challenges of not having specific subunit survey data by leveraging various
sources of available demographic and survey data.
In a democracy, policy is supposed to be linked to the preferences of the public. This
linkage has served as a motivation for a wealth of studies. Most typically, studies have
illustrated correlations between public opinion, measured various ways, and public
policies, across units measured within a constant time period. Although such studies commonly report significant opinion-policy linkages, their findings are vulnerable to the criticism that the observed relationships admit rival causal interpretations.
Notably, elites may be shaping opinion. Jacobs and Shapiro (2000) argue that
elected officials have an incentive to convert skeptical constituents to their own posi-
tion. Alternatively, opinions may come to reflect policy through migration. Studies
of subnational taxation and expenditure point to the importance of voting with one’s
feet (Tiebout 1956). From this perspective, strong correlations between opinion and
policy simply reflect the result of geographic sorting as citizens move to jurisdictions
with policies in line with their preferences. In the end, cross-​sectional correlations be-
tween opinion and policy do not preclude these rival explanations. Cross-​sectional
correlations represent opinion-policy congruence, but nothing more.
Ultimately, convincing studies of linkage between opinion and policy require
investigating the causal dynamics by which the preferences of constituencies cause the
behavior of representatives, independently of elite persuasion, voter mobility, and geo-
graphic sorting. It requires the exploration of the temporal order of opinion and policy
data, in which current public opinion changes significantly and systematically relate to
future public policy changes. When (if) such opinion change leads to policy change, this
dispels skeptics’ concerns that policy might lead opinion. If current opinion predicts
future policy change, independent of current policy that presumably reflects current
elite preferences, it is difficult to argue that opinion was not influential. Moreover, such
findings render the notion that voter mobility is driving the process implausible, be-
cause it would require vast migrations of voters in and out of jurisdictions in advance of
policy changes.
An examination of the historical development of studies of public opinion in
subnational jurisdictions reveals a progressive research frontier that has advanced only
after solving vexing measurement and data issues. Why is this different than other areas
of inquiry? Most commonly, where theories point to important questions, data are col-
lected to answer those questions. We could imagine the collection of state-​level surveys
that were coordinated and archived across states. Ultimately, such data could provide
valid and reliable estimates of public opinion within states that were comparable across
states. Unfortunately, “[p]‌ublic opinion data of the subnational sort have proved partic-
ularly elusive” (Parry, Kisida, and Langley 2008, 197). The resources and rewards for such
systematization and coordination do not exist: the design and execution of common
questions across states detract from polling directors’ other duties, while archiving these
data in a common repository is typically viewed as too cumbersome (Parry, Kisida, and
Langley 2008, 211).
While we might hope that these impasses could be somehow surmounted in the fu-
ture, the reality is that even if they were, it would be many, many years before such co-
ordinated effort could produce enough state level surveys to answer any but the most
preliminary questions. Moreover, such data would not allow us to examine even re-
cent history. Hence, while the spread of electronic data collection and archiving has
advanced the study of state politics (see Brace and Jewett 1995), and despite the fact
that technologies have created “robust” state polling enterprises, “opportunities for
multi-state analysis remain daunting” (Parry, Kisida, and Langley 2008, 210), and these
advances have not included state or subnational public opinion.
Given this impasse, creative and methodologically innovative utilization of imperfect
or incomplete data to create reliable and valid measures of subnational opinion is more
than a stopgap measure; it is the only way forward unless and until we develop the in-
frastructure to routinely coordinate, collect, organize, and archive genuine state-​level
polls. Given the practical obstacles involved and the historical state-​level polling that
can either never be obtained or does not coordinate with other state-​level polls, our un-
derstanding of the role of comparative public opinion in the subnational domain will
necessarily be based on the thoughtful and critical conversion of what we have into what
we need.

Early Studies of Opinion and Policy in the States: Surrogates, Electoral Returns, Simulations, and Validity Issues

The comparative study of state politics dates to V. O. Key’s magisterial study, Southern
Politics (1949), or earlier. By the 1960s and into the early 1970s, the comparative study
of state politics had hit its stride with a stream of influential studies at the leading edge
of political science inquiry (e.g., Dawson and Robinson 1963; Dye 1965, 1969a, 1969b;
Hofferbert 1966; Sharkansky 1968; Sharkansky and Hofferbert 1969; Cnudde and
McCrone 1969; Fry and Winters 1970; Godwin and Shepard 1976; and others). By the
late 1970s, however, interest and effort in the area began to fade (see Brace and Jewett
1995). As Cohen (2006) observes, a factor that depressed enthusiasm for comparative
state studies was the lack of public opinion data across the states. While scholars had
developed many innovative and useful measures of aspects of state politics and policy—​
including policy outputs, political structures, institutional capacity, electoral competi-
tion, as well as state demographic and economic profiles—​sound, direct measures of
state public opinion remained elusive.

Surrogates
A long tradition exists of using indirect measures to capture state public opinion in lieu
of survey responses. For instance, scholars have used demographics (Boehmke and
Witmer 2004; Mooney and Lee 2000; Norrander and Wilcox 1999), simulations based
on the demographic characteristics of state residents (Weber et al. 1972), and measures
based on policy makers who represent a state (Berry et al. 1998, 2007; Holbrook-​Provow
and Poe 1987). The limitations of these indirect measures have been debated elsewhere
(Brace et al. 2004, 2007; Erikson, Wright, and McIver 1993).

Surrogate Demographic Variables.


One of the most common approaches used in studies of policy responsiveness in the
U.S. House of Representatives is to measure constituency policy preferences using sur-
rogate demographic variables. Usually this involves estimating a model in which legis-
lative roll-​call behavior is depicted as a function of a wide range of district demographic
characteristics obtained from the U.S. Census. The demographic variables employed in
such studies typically include indicators of racial composition, education, income, age,
social class, occupational distribution, urbanization, homeownership, and family com-
position (Pool, Abelson, and Popkin 1965; Sinclair-​Deckard 1976; Weber and Shaffer
1972). In a more general analysis, Peltzman (1984) used six demographic variables meas-
ured at the county level to tap politically relevant, economic characteristics of senators’
constituencies. Kalt and Zupan’s (1984) analyzed specific industries capturing members
of Congress: in their analysis of Senate voting on strip-​mining regulation, they took
state-​level data on membership in pro-​environmental interest groups and the size of
various state coal producer reserves in BTUs expressed as fractions of state personal
income.
Scholars adopting such an approach make some important assumptions about the
political meaning of demographic characteristics. In particular, they assume that

(1) individuals’ demographic characteristics are related systematically to their policy preferences,
(2) legislators are aware of the demographic composition of their districts and take
those characteristics (or at least how they interpret those characteristics) into
account when making roll-​call decisions, and
(3) such a relationship holds when one moves across levels of analysis (i.e., from the
individual level to the aggregate level).

The first assumption is quite reasonable. Numerous studies document the demographic
underpinnings of public opinion and political behavior; citizens’ general ideology and
their views on public policy matters are often related to their demographic characteris-
tics. Such a relationship may be due to the degree to which self-​interest is reflected in cit-
izens’ demographic characteristics, or else demographic characteristics might represent
how different groups in society acquire different sets of symbolic attitudes through the
socialization process.
Second, it does not seem unreasonable that legislators are aware of the demographic
characteristics of the constituents that they represent and interpret these characteris-
tics in such a way as to permit the demographic flavor of a district to affect their roll-​
call decisions (e.g., Fenno 1978). The final assumption—​that the relationship between
aggregate demographic characteristics and aggregate policy preferences is a reflection
of the same relationships at the individual level—​is less certain, since making such
an assumption has the potential of violating classic notions of the ecological fallacy.
Simply put, processes that operate at the aggregate level need not be at work at the
individual level.
Although relationships found at the individual level often persist at the aggregate
level, one must clearly take great care in making inferences about political processes
across levels of analysis. Ultimately, studies that rely on demographic variables to repre-
sent constituency influences are quite limited. There is at best an imperfect relationship
between demographic characteristics and policy preferences among individual citi-
zens. Although demographic variables might have a significant impact on individuals’
policy preferences, they typically explain only a small amount of the variance in such
preferences, and this means that roll-​call models that simply rely on demographic
variables are missing a substantial portion of the effect of constituency preferences.
Moreover, the uncertainty surrounding the policy implications of demographic
variables means that the policy signals directed at legislators by their constituents’ dem-
ographic characteristics are somewhat ambiguous. Knowing, for instance, that a district
has a high proportion of citizens with a college education does not necessarily give a leg-
islator clear, unambiguous signals about the policy preferences of constituents, since this
demographic characteristic, like others, is not perfectly related to policy preferences.

Presidential Election Results.  Other scholars have used election returns to estimate dis-
trict preferences (e.g., Canes-​Wrone, Cogan, and Brady 2002; Erikson and Wright 1980).
Explicitly based on electoral behavior and updated with each election, election results
have the advantage of being available across all states and districts (Kernell 2009).
Election returns are popular and easily accessed proxies for district partisanship. For
instance, Canes-​Wrone, Cogan, and Brady (2002), Ansolabehere, Snyder, and Stewart
(2001), and Erikson and Wright (1980) all use district-​level presidential election
returns as a proxy for district partisanship in models of legislative politics. Constituent
behavior (vote choices) is the basis for the proxy and links to the partisan or ideolog-
ical continuum that generally underlies electoral competition. Thus, it is reasonable
to assume that a measure of district or state partisanship utilizing vote shares has high
validity. Numerous scholars have also relied on presidential election results as a sur-
rogate measure of district ideological orientation (Fleisher 1993; Glazer and Robbins
1985; Johannes 1984; LeoGrande and Jeydel 1997; Nice and Cohen 1983). The logic
underlying this is grounded in standard spatial models of electoral choice. Arguably,
many citizens cast their votes in presidential elections by comparing their own ideolog-
ical positions with those of the competing candidates. Insofar as aggregate presidential
election results reflect ideological voting in the electorate, scholars should be able to
utilize presidential election results at the district level as a proxy measure of district
ideology.
Unfortunately, there are shortcomings and trade-​offs to this approach. Presidential
vote shares in any given election may be products of short-​term forces; for instance,
different issues are more or less salient in any given election, and particular candidates
are more or less popular. Most observers agree that certain presidential elections are
highly ideological and that the presidential election results from those elections reflect
the ideological characteristics of constituencies; the 1964, 1972, and 1988 elections come
immediately to mind as elections in which support for the Democratic and Republican
presidential candidates was differentiated by ideological considerations. On the other
hand, we know some elections are detached from ideology; the 1968 and 1976 elections
were somewhat less ideological than other elections. Clearly, not all presidential
elections are equally ideological, and this affects the degree to which scholars can use
district-​level presidential election results as a surrogate for district ideology. Finally,
presidential vote shares do not offer insight into preferences of constituencies on par-
ticular policies, nor can they measure the preferences of district subconstituencies (e.g.,
the preferences of Democrats or Latinos).
LeoGrande and Jeydel (1997) explore the possibility of utilizing presidential election
results as a surrogate for district ideology. They find only moderate correlations for pres-
idential election results between adjacent elections, suggesting that the reliability of the
aggregate presidential vote is not extremely high. Ultimately, presidential vote shares in
any given election may be largely the product of short-​term forces (Levendusky, Pope,
and Jackman 2008).

Referenda Results.  In referenda elections, voters confront one or more specific policy
positions on which they can express their preferences. A number of states hold referenda
elections on a regular basis, and scholars have found it possible to utilize district-​level
data on referenda election results to estimate the policy preferences and/​or ideological
orientation of a given constituency.
The use of referenda data as a surrogate measure of constituency policy preferences is
best represented by the work of Kuklinski (1977) and McCrone and Kuklinski (1979). In
both studies, the authors utilize data from California referenda to estimate the positions
of district constituencies on three dimensions that emerge from a factor analysis of the
referenda data. While these scholars find that referenda data can provide quite reliable
measures of district ideology, unfortunately such data are available for only a limited
number of states, and vary from year to year.

Simulations
Another innovation in the measurement of district opinion and constituency policy
preferences is the use of simulated district opinion, a technique developed by Weber
and Shaffer (1972) and subsequently utilized by several legislative scholars (Erikson
1978; Sullivan and Minns 1976; Sullivan and Uslaner 1978; Uslaner and Weber 1979). This
approach takes advantage of demographic data that are available at the district level, as
well as knowledge concerning the relationship between individuals’ demographic char-
acteristics and their policy positions. In traditional simulations of constituency opinion,
scholars utilize what we refer to as a “bottom-​up” simulation—​that is, using data from
a lower level of aggregation (i.e., from individual-level surveys) to simulate opinion at a
higher level of aggregation (e.g., the district or state level).
In such a simulation, citizen groups are identified based on their combinations of so-
cial and economic characteristics: race, income, education level, and so forth. Using na-
tional surveys, items are selected that match the grouping characteristics, and opinions
of members of these combinations or groupings are obtained. Using regression, the re-
lationship between socioeconomic and demographic characteristics and opinions is
estimated. Using this model, the mean values of the socioeconomic and demographic
characteristics for the district or state are then plugged in, and the model is used to sim-
ulate estimates of the district’s or state’s opinions based on the sizes of groups within the
state or district.
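In code, this bottom-up logic reduces to two steps, sketched below with hypothetical file and variable names (actual applications used much richer group definitions and survey items matched to the grouping characteristics):

import pandas as pd
from sklearn.linear_model import LinearRegression

# Step 1: estimate the individual-level relationship between demographic
# characteristics and opinion from a national survey.
demographics = ["black", "college", "income", "age", "urban"]
national = pd.read_csv("national_survey.csv")        # one row per respondent
model = LinearRegression().fit(national[demographics], national["opinion"])

# Step 2: plug in each district's (or state's) demographic profile, taken
# from the census, to simulate that unit's opinion.
districts = pd.read_csv("district_census.csv")       # one row per district
districts["simulated_opinion"] = model.predict(districts[demographics])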
On the face of it, this approach appears to be quite reasonable. The logic underlying
the approach seems to be sensible, and simulated measures of opinion have a stronger
association with roll-​call behavior than measures based on small-​sample estimates
(Erikson 1978). Most importantly, the general availability of demographic and political
variables with which to simulate public opinion means the approach allows estimating
opinion across a wide range of subunits and across time.
Perhaps the most important concern that one might have about this approach is
that the individual-​level regressions from which the simulations derive often ex-
hibit exceedingly low levels of fit to the data. With adjusted R2 levels that often
fall below .20, measures of simulated district-​level opinion have a significantly
large amount of random error associated with them. This is not necessarily a sur-
prise, since the level of measurement error in individual-​level survey data is often
much higher than that found in aggregate-​level data. Ultimately, while bottom-​
up simulated measures may be an improvement over those obtained from other
analytical approaches, they remain imprecise indicators of constituency opinion
(Seidman 1973).

Disaggregation of National
Surveys: Using Survey Data to Map
Subnational Differences in Opinion

Can we study subnational linkages using data from national surveys? Famously, Miller
and Stokes (1963) were the first to tackle this question. Disaggregating opinion data from
national election studies at the congressional district level, they examined the linkages
of these district-​level opinions with the preferences of members of Congress and with
their legislative votes. The survey observations for each district, drawn from the early National Election Studies (NES), were very small in number and far from representative cross-sectional samples; these were paired with the corresponding congresspersons’ roll-call votes and their responses to a separate survey of their political attitudes and perceptions of their constituents’ opinions.

Miller and Stokes found moderate linkages for opinion, but these relationships varied
across issues: stronger connections for civil rights and weaker connections for foreign
policy.
Beyond its substantive findings, the Miller and Stokes study also highlights many of
the fundamental methodological challenges to studying linkage. It revealed the severe
threats to reliability in estimates of subnational opinion using sparse numbers of survey
observations in subunits (congressional districts, in this case). Almost all survey-​based
disaggregation methods suffer from a profound design challenge, sometimes referred
to as the “Miller-​Stokes” problem. The survey data they had for any individual congres-
sional district were extremely sparse; their study used a national probability sample that
had an average of only thirteen respondents per congressional district (see Achen 1977;
Erikson 1978).
Miller and Stokes, and subsequent studies using disaggregated survey observations at
the subnational level, reveal that the success of disaggregation hinges on the represent-
ativeness and size of the disaggregated opinion data. James Gibson (1988) made clever
use of the large Stouffer survey study of tolerance, revealing that there was some corre-
lation between public opinion and the repressiveness of the anticommunist legislation
that states adopted.
In Statehouse Democracy (1993), Erikson, Wright, and McIver reinvigorated state
politics research on public opinion. They showed that one could combine survey
observations from multiple years on opinions that were stable across time and then
disaggregated to the subnational unit (in their case states). By combining survey
observations from the same polling organization from multiple years, they were able to
obtain more observations per state and more reliable measures of opinion.
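The pooling-and-disaggregation step itself is straightforward; a minimal sketch with hypothetical file and column names follows (the analytical work lies in verifying that the item is stable across years and that state samples are adequate):

import pandas as pd

# Pool comparable national polls from many years, then disaggregate by state.
polls = pd.concat(
    [pd.read_csv(f"national_poll_{year}.csv") for year in range(1977, 2008)],
    ignore_index=True,
)

state_opinion = (
    polls.groupby("state")["ideology"]
         .agg(mean_opinion="mean", n_respondents="size")
         .reset_index()
)
# Small pooled samples in less populous states still produce noisy estimates,
# which motivates the model-based approaches discussed later in the chapter.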
Erikson, Wright, and McIver gauge state opinion based on a question about self-​
proclaimed political ideology. Their ideology measure has become widely used in
studies of state politics and policymaking. This general measure of opinion is strongly
and significantly related to general features of governmental outcomes across the
states. These include spending on education, the scope of Medicaid and Aid for
Families with Dependent Children, the legalization of gambling, passage of the Equal
Rights Amendment, capital punishment, and issues related to state spending and
tax effort and progressivity (e.g., Lascher et al. 1996; Camobreco 1998; Mooney and
Lee 2000).
The pooling methodology pioneered by Erikson, Wright, and McIver (1993) has also
been extended to other surveys to measure specific issue opinions (e.g., Brace et  al.
2002). This has allowed scholars to address questions about linkages between specific
policies and issues at the subnational level (e.g., Arceneaux 2002; Brace et al. 2002; Brace
and Jewett 1995; Burstein 2010; Johnson, Brace, and Arceneaux 2005; Brace and Boyea
2008; Norrander and Wilcox 1999).
Disaggregation of national survey data has advanced the study of subnational linkage
by producing more valid and reliable measures of subnational opinion. This approach
is not without limitations, however. Notably, a problem with national surveys is that
the amount of information per state is directly proportional to state population. Less
populous states tend to have inadequate sample sizes. For example, if using CBS/​NYT
polls from 1977 to 2007 to measure party identification, there are 436 respondents from
Illinois (the fifth most populous state), 180 from Kentucky (the median state), and only
32 from Delaware (the fifth least populous state) in a typical year. In addition, some years
(e.g., 2005) have less information than others, leading to very small samples for the less
populous states in certain years.
The aggregation method also does not address nonrepresentative samples resulting
from the survey design. Many national surveys use primary sampling units (PSUs) that
are not fully representative subnational sampling frames. The crucial point is that while
the design may be unbiased in terms of expected values at the national level, any partic-
ular implementation of the sampling design could produce a nonrepresentative selec-
tion of PSUs for a particular subunit.
These problems are mitigated to a large extent. As Brace et al. (2002) illustrate, more
populous states also have more PSUs and thus are less vulnerable to bias. Alternatively,
less populous states exhibit much less variation in opinion, and in this more homog-
enous environment, bias is less likely. As Brace et al. (2002) note, the risk of bias is
greatest in less populated states (low population coverage) with substantial variation
in public opinion (low population homology). Depending on the issue, this situation is
rare. In sum, while there are fewer PSUs in less populous states, there is also less diver-
sity of opinion in these states, and even an unrepresentative PSU could be represen-
tative. Alternatively, in populous states where there is substantial diversity of opinion
across geographical areas, there are more PSUs to capture this diversity.
The disaggregation of national surveys has produced measures of subnational
opinion of heightened reliability and validity that have contributed to major advances
in our understanding of linkages of opinion and policy in subnational settings. This
method, however, has intrinsic limitations. The success of disaggregation across years
depends on stable underlying attitudes. This necessarily limits research focus to survey
items that exhibit stable opinions over the short or not so short run.
Using disaggregation, scholars have been limited to using attitudes shown to be
stable across time to produce cross-​sectional measures of opinion. This precludes
many issues about which opinion is volatile. It limits the substantive breadth of the
types of policies and opinions that are suitable for study. More important, the sta-
bility required for suitable disaggregation also means that longitudinal analyses
are largely not possible. Disaggregated opinion data are suited to addressing cross-​
sectional correlations between suitably stable opinions and related measures of state
policies.
Cross-​sectional research afforded by disaggregated opinion measures has revealed
strong and convincing correlations between suitably stable measures of subnational
opinion and subnational policies. While these links are quite strong, correlation is not
causality. Cross-​sectional analyses cannot unravel the many complex temporal patterns
embodied in the opinion-​policy nexus that produces these correlations.

Multilevel Regression and Post-stratification: Expanding the Scope of Issues and the Longitudinal Analysis of Opinion Change

Disaggregation of national survey observations to subnational units has produced convincing measures of subnational opinion on an array of issues. Measures devel-
oped from this methodology have established strong and statistically significant cross-​
sectional differences in opinions across the states or other subunits that in turn reveal
connections to elite behavior and/​or policy. These endeavors have established clearly the
necessary condition for inferring linkage: opinions vary across states and correlate with
state policies.
Without this strong foundation, it would make little sense to explore complex
questions about opinion-​ policy linkages:  if opinion, convincingly measured, did
not correlate with policy, further analyses would be unwarranted. Given the strong
correlations, it then makes sense to “unpack” the causal sequences that underpin the
observed correlations between opinions and policies. From this perspective, disag-
gregation and resulting research form an important building block in pursuit of a cu-
mulative and systematic understanding of the opinion-​policy nexus. Disaggregation
has its limits, but they do not undermine the utility of the measures derived from this
technique. Unlike measures of subnational opinion developed from surrogates or
simulations, where the measures suffered from intractable flaws, disaggregated opinion
measures suffer limits, but not fundamental flaws.
The fundamental limit of disaggregated measures of subunit opinion is that they are
limited to cross-​sectional analyses of the opinion-​policy linkage. These cross-​sectional
findings, while important, remain vulnerable to rival causal interpretations. As noted
above, elites have an incentive to convert skeptical constituents to their own opinion; if
so, elites may be shaping opinion rather than the opposite (Jacobs and Shapiro 2000).
In addition, subunit opinions may come to reflect policy through population migra-
tion. Strong correlations between opinion and policy could simply reflect the result
of geographic sorting as citizens move to jurisdictions with policies in line with their
preferences.
Ultimately, the next chapters of exploring the linkage between opinion and policy re-
quire investigating the causal dynamics by which the preferences of constituencies cause
the behavior of representatives, independent of elite persuasion, voter mobility, and ge-
ographic sorting. It requires the exploration of the temporal order of opinion and policy
data, in which current public opinion changes significantly relate to future public policy
changes. When (if) such opinion change leads to policy change, this dispels skeptics’
concerns that policy might lead opinion. If current opinion predicts future policy
change, independent of current policy that presumably reflects current elite preferences,
it is difficult to argue that opinion was not influential. Moreover, such findings render
the notion that voter mobility is driving the process implausible, because it would re-
quire vast migrations of voters in and out of jurisdictions in advance of policy changes.
At present, many of the most compelling questions about opinion-​policy linkages
concern temporal processes and highlight the need for convincing measures of
subnational opinion that vary over time. In light of the obstacles described to this
point concerning measurement of subnational opinion, this may seem a very tall order.
Where once we had no survey-​based measures of subnational opinion, extensive effort
produced survey-​based, cross-​sectional measures of subnational opinion. Given that
there have been no dramatic changes in the general qualities and quantities of data avail-
able to researchers, the question is how we can leverage existing data to produce con-
vincing measures of subnational opinion that can vary between states and within states
over time.
The latest advanced technique used to estimate state-​level public opinion, as well
as public opinion at other levels of aggregation (especially legislative districts but also
others), builds on the simulation methods that used national-​level survey data in con-
junction with state-​level census data. This multilevel regression and post-​stratification
method (MRP), developed by Park, Gelman, and Bafumi (2006), incorporates demo-
graphic and geographic information to improve survey-​based estimates of each ge-
ographic unit’s public opinion on individual issues. It improves upon the estimation
of the effects of individual-​and state-​level predictors by employing recent advances
in multilevel modeling, a generalization of linear and generalized linear modeling, in
which relationships between grouped variables are themselves modeled and estimated.
This partially pools information about respondents across states to learn about what
drives individual responses. Whereas the disaggregation method copes with insufficient
samples within states by combining surveys, MRP compensates for small within-​state
samples by using demographic and geographic correlations.
Unlike earlier simulation methods, MRP uses the location of the respondents to es-
timate state-​level effects on responses, using state-​level predictors such as region or
state-​level (aggregate) demographics (e.g., those not available at the individual level) to
model these unit-​level effects. In this way, all individuals in the survey, no matter their
location, yield information about demographic patterns that can be applied to all state
estimates, and those residents from a particular state or region yield further information
about how much predictions within that state or region vary from others, after control-
ling for demographics. In the final step, post-​stratification weights the estimates for each
demographic-​geographic respondent type (post-​stratified) by the percentages of each
type in the actual state populations.
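The post-stratification step can be sketched compactly. Assuming the multilevel model has already produced a predicted opinion for every demographic-geographic respondent type (the file and column names below are hypothetical), the state estimate is simply a population-weighted average of the cell predictions:

import pandas as pd

# One row per demographic-geographic cell: the model's prediction ("pred")
# and the cell's population count from the census ("n_pop").
cells = pd.read_csv("poststrat_cells.csv")

cells["weighted_pred"] = cells["pred"] * cells["n_pop"]
totals = cells.groupby("state")[["weighted_pred", "n_pop"]].sum()
totals["mrp_estimate"] = totals["weighted_pred"] / totals["n_pop"]

The gains over simple disaggregation come from the multilevel (partial pooling) stage that produces the cell predictions; the post-stratification step then ensures those predictions are combined in proportion to each state's actual population.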
This multilevel model allows us to use many more respondent types than classical
methods would allow. This improves accuracy by incorporating more detailed popula-
tion information. An additional benefit of MRP is that modeling individual responses
is itself substantively interesting, in that one can study the relationship between demo-
graphics and opinion and inquire what drives differences between states: demographic
composition or residual cultural differences.

Recent studies have highlighted the virtues of MRP measures compared to other
approaches (Lax and Phillips 2009b; Park, Gelman, and Bafumi 2004, 2006; Pacheco
2011). Lax and Phillips illustrate the trade-​offs between disaggregation and MRP to con-
sider whether the latter is worth the additional analytical and implementation costs.
MRP offers clear gains when subunit sample sizes are small to medium, but for very large samples the additional implementation costs may outweigh any additional benefits. They
also illustrate how additional demographic information improves estimation, and that
MRP can be employed successfully even on small samples, such as a single national poll.
Most recently, Warshaw and Rodden (2012) show that MRP produces more accurate
estimates of district-​level public opinion on individual issues than either disaggregation
of national surveys or presidential vote shares.
The MRP method has been used on a large scale by Lax and Phillips (2009a), who
showed how state policies toward gay rights were responsive to public opinions about
these rights—​more so than any effect of liberal-​conservative ideology. Extending this to
thirty-​nine policies covering eight issue areas—​abortion, education, electoral reform,
gambling, gay rights, health care, immigration, and law enforcement—​they found that
state policies are highly responsive to state publics’ issue-​specific preferences, statisti-
cally controlling for other variables.
Scholars have only just begun to extend these innovations to other subnational
jurisdictions. These pioneering studies have illustrated levels of responsiveness to cit-
izen preferences. In municipal politics research, scholars confronted the same obstacles
as others studying subnational politics, namely a lack of suitable surveys to weigh public
preferences (Palus 2010; Trounstine 2010). Urban politics scholars used crude demographic
surrogates for citizen preferences, which suffer from the same weaknesses such surrogates
exhibit elsewhere. Others narrowed their focus to cities with large survey samples
(Palus 2010). While useful, this work leaves open questions about whether results from
these select, typically large cities generalize to smaller cities.
Largely because of the lack of satisfactory measures of citizen opinions in cities, until
recently there had been no systematic studies of the responsiveness of city policies to
the preferences of their citizens. Tausanovitch and Warshaw (2014) surmounted this ob-
stacle using seven large-​scale surveys containing over 275,000 respondents with MRP to
produce estimates of citizen opinion for 1,600 cities and towns across the United States.
Notably, they found that city governments are responsive to their citizens’ preferences
across a wide range of policy areas, with many substantive impacts that are quite large.
They also found that liberal cities spend over twice as much per capita as conservative
cities, with higher and less regressive tax systems than their conservative counterparts.
At an even more local level, Michael Berkman and Eric Plutzer have explored the
linkages of citizen preferences to school board politics (2005). To surmount the lack
of suitable survey data at the school district level, the authors devised small polity in-
ference, a statistical technique that combines elements of the simulation approach,
aggregation, and Bayesian hierarchical models with post-​stratification. Among many
interesting findings, these authors discovered that school funding decisions were most
responsive to citizen preferences not where there were independently elected school
boards, but instead when these decisions were made by the more professional politicians
in city or county government, and where more professional politicians appoint school
board members (2005, 156–​157).

Conclusion

This review of the past half century of the study of subnational public opinion has illus-
trated a progressive research program. In the beginning, students of opinion presumed
that opinion influenced politics, but rarely if ever looked at connections. Students of
comparative policy sought linkages to opinion, but had no convincing measures of
subnational public attitudes. In the absence of subnational opinion data, the compar-
ative study of survey-​based measures of opinion and subnational indicators of govern-
ment action languished. A large reason for this stasis was the daunting research obstacles
that questions of linkage entailed: examination of patterns of opinion and patterns of
policy was required. Either comparative analyses of the connections between opinion
and policy across subunits or longitudinal analyses of opinion change and policy change
within single units were also required.
Most generally, the limitations of this early period are quite clear. Convincing opinion
measures derived from information at the subnational level were simply not available.
Even fifty years later, we do not have a repository of systematic survey observations col-
lected at the state or subnational levels. To break this impasse required the development
of innovative approaches capable of using limited data in a convincing manner. The
last twenty-​five years have witnessed a revolution in important innovations that have
facilitated the development of subnational measures of opinion that are derived from
national survey data.
Disaggregation of national surveys to subnational units produces valid estimates of
state opinion. The reliability of these estimates hinges on the numbers of observations
available within subunits. Pooling more national surveys can increase reliability if the
opinions measured exhibit statistically demonstrable stability across the pooled na-
tional samples. While enhancing reliability, particularly in smaller states with typically
few observations in single national surveys, the requirement that only stable opinion
indicators be pooled also means that this approach is unsuitable for longitudinal
analyses on subnational opinion and policy change. As such, disaggregation has been
instrumental in establishing strong patterns of cross-​sectional correspondence between
opinion and policy in subnational units, but is inadequate for moving on to more com-
plex questions concerning the processes that connect opinion and policy. This is the
new frontier of the study of subnational opinion and policy.
The new frontier of opinion-​policy research focuses not only on the breadth of linkages across
different policies, but also on the forces that promote change in opinion and policy,
and how change in opinion relates to change in policy across subunits. In this latest stage
in the evolution of the study of opinion-​policy linkage, the data demands should be
apparent. We not only need valid and reliable estimates of subnational opinion; we need
them over time as well.
To date, MRP has been the most fruitful method for producing valid and reliable lon-
gitudinal measures of subnational opinion. Combining the advantages of disaggregating
national survey observations to the subnational level, this approach also employs ideas
from simulation studies to integrate demographic information to produce valid and re-
liable estimates of subnational opinion. Where subnational units have large numbers
of observations, MRP differs little from simple disaggregation. More important, in the
many subunits where there are few observations, MRP has been shown to be demon-
strably superior to disaggregation.
These characteristics of MRP offer attractive benefits that will hasten progress. By
producing superior estimates for small sample subunits, analyses can better integrate
patterns between opinion and policy across more subunits. The MRP method can also
produce estimates across a wider array of policies because, unlike disaggregation, MRP
does not limit inquiry to opinions that are stable across the period pooled. Finally, and
relatedly, just as MRP can produce estimates of opinion across more subunits with fewer
data, it can also produce annual estimates of opinion for subunits, also not possible with
disaggregation.
MRP, or any future method that can produce reliable and valid measures of
subnational opinion on a wide array of issues over time, will advance the study of public
opinion generally, and linkage specifically, by providing the means to address important
lingering questions. By expanding the breadth of issues available for study, researchers
can expand our knowledge of the substantive dimensions of linkage and assay differen-
tial levels of public interest and elite responsiveness. By allowing for analyses of longi-
tudinal change in opinion within states, researchers can explore the forces promoting
change in subunit opinions and the consequences of those changes on elite behavior
and government outcomes. Scholars may explore the conditions in which elites re-
spond to public opinion and those in which they may seek to manipulate it to their ends
(Jacobs and Shapiro 2000), or in which policy attenuates public concern (Wlezien 2004;
Johnson, Brace, and Arceneaux 2005).

References
Achen, C. H. 1975. “Mass Political Attitudes and the Survey Response.” American Political
Science Review 69 (4): 1218–​1231.
Achen, C. H. 1977. “Measuring Representation: Perils of the Correlation Coefficient.” American
Journal of Political Science 21 (4): 805–​815.
Ansolabehere, S., J. M. Snyder Jr., and C. Stewart. 2001. “Candidate Positioning in U.S. House
Elections.” American Journal of Political Science 45 (1): 136–​159.
Arceneaux, K. 2002. “Direct Democracy and the Link Between Public Opinion and State
Abortion Policy.” State Politics & Policy Quarterly 2 (4): 372–​387.
Ardoin, P. J., and J. G. Garand. 2003. “Measuring Constituency Ideology in U.S. House
Districts: A Top-​Down Simulation Approach.” Journal of Politics 65 (4): 1165–​1189.
Bartels, L. M. 1991. “Constituency Opinion and Congressional Policy Making:  The Reagan
Defense Buildup.” American Political Science Review 85: 457–​474.
Beck, P. A., and T. R. Dye. 1982. “Sources of Public Opinion on Taxes: The Florida Case.” Journal
of Politics 44 (1): 172–​182.
Boehmke, F. J., and R. Witmer. 2004. “Disentangling Diffusion: The Effects of Social Learning
and Economic Competition on State Policy Innovation and Expansion.” Political Research
Quarterly 57 (1): 39–​51.
Berkman, M., and E. Plutzer. 2005. Ten Thousand Democracies: Politics and Public Opinion in
America’s School Districts. Washington, DC: Georgetown University Press.
Berry, W. D., E. J. Ringquist, R. C. Fording, and R. L. Hanson. 1998. “Measuring Citizen and
Government Ideology in the American States, 1960–​93." American Journal of Political
Science 42 (1): 327–​348.
Berry, W. D., E. J. Ringquist, R. C. Fording, and R. L. Hanson. 2007. “A Rejoinder:  The
Measurement and Stability of State Citizen Ideology.” State Politics & Policy Quarterly 7
(2): 160–​166.
Brace, P., K. Arceneaux, M. Johnson, and S. Ulbig. 2004. “Does State Political Ideology Change
over Time?” Political Research Quarterly 57 (4): 529–​540.
Brace, P., K. Arceneaux, M. Johnson, and S. Ulbig. 2007. “Reply to ‘The Measurement and
Stability of State Citizen Ideology’.” State Politics and Policy Quarterly 7 (2): 133–​140.
Brace, P., and B. Boyea. 2008. “State Public Opinion, the Death Penalty and the Practice of
Electing Judges.” American Journal of Political Science 52 (2): 360–​372.
Brace, P., and A. Jewett. 1995. “The State of State Politics Research.” Political Research Quarterly
48 (3): 643–​681.
Brace, P., and M. Johnson. 2006. “Does Familiarity Breed Contempt? Examining the Correlates
of State-​Level Confidence in the Federal Government.” In Public Opinion in State Politics,
edited by J. E. Cohen, 19–​37. Stanford, CA: Stanford University Press.
Brace, P., K. Sims-​Butler, K. Arceneaux, and M. Johnson. 2002. “Public Opinion in the
American States:  New Perspectives Using National Survey Data.” American Journal of
Political Science 46 (1): 173–​189.
Burstein, P. 2010. “Public Opinion, Public Policy, and Democracy.” In Handbook of
Politics and Society in Global Perspective, edited by K. T. Leicht and J. C. Jenkins, 63–​79.
New York: Springer.
Camobreco, J. F. 1998. “Preferences, Fiscal Policies, and the Initiative Process.” Journal of
Politics 60 (3): 819–​829.
Campbell, A., P. Converse, W. Miller, and D. Stokes. 1960. The American Voter. New York: John
Wiley and Sons.
Canes-​Wrone, B., J. F. Cogan, and D. W. Brady. 2002. “Out of Step, Out of Office: Electoral
Accountability and House Members’ Voting.” American Political Science Review 96
(1): 127–​140.
Carsey, T. M., and J. J. Harden. 2010. “New Measures of Partisanship, Ideology, and Policy
Mood in the American States.” State Politics & Policy Quarterly 10 (2): 136–​156.
Citrin, J. 1979. “Do People Want Something for Nothing:  Public Opinion on Taxes and
Government.” National Tax Journal Supplement 32 (June): 113–​130.
Cnudde, C. F., and D. J. McCrone. 1966. “The Linkage between Constituency Attitudes and
Congressional Voting Behavior: A Causal Model.” American Political Science Review 60 (1): 66–​72.
Cnudde, C. F., and D. J. McCrone. 1969. “Party Competition and Welfare Policies in the
American States.” American Political Science Review 63 (3): 858–​866.
Cohen, J. E., ed. 2006. Public Opinion in State Politics. Stanford, CA: Stanford University Press.
Converse, P. E. 1964. “The Nature of Belief Systems in Mass Publics.” In Ideology and Discontent,
edited by D. Apter, 206–​261. New York: Free Press.
Dawson, R. E., and J. A. Robinson. 1963. “Inter-​Party Competition, Economic Variables, and
Welfare Policies in the American States.” Journal of Politics 25 (2): 265–​289.
Delli Carpini, M. X., and S. Keeter. 1996. What Americans Know About Politics and
Why It Matters. New Haven, CT: Yale University Press.
Dye, T. R. 1965. "Malapportionment and Public Policy in the States." Journal of Politics 27
(3): 586–​601.
Dye, T. R. 1969a. “Income Inequality and American State Politics.” American Political Science
Review 63 (1): 157–​162.
Dye, T. R. 1969b. “Executive Power and Public Policy in the States.” Western Political Quarterly
22 (4): 926–​939.
Erikson, R. S. 1976. “The Relationship Between Public Opinion and State Policy: A New Look
Based on Some Forgotten Data.” American Journal of Political Science 20 (1): 25–​36.
Erikson, R. S. 1978. “Constituency Opinion and Congressional Behavior: A Reexamination of
the Miller-​Stokes Representation Data.” American Journal of Political Science 22 (3): 511–​535.
Erikson, R. S. 1981. “Measuring Constituency Opinion:  The 1978 Congressional Election
Study.” Legislative Studies Quarterly 6 (2): 235–​545.
Erikson, R. S., and G. C. Wright. 1980. “Policy Representation of Constituency Interests.”
Political Behavior 2 (1): 91–​106.
Erikson, R. S., G. C. Wright, and J. P. McIver. 1993. Statehouse Democracy: Public Opinion and
Policy in the American States. New York: Cambridge University Press.
Erikson, R. S., G. C. Wright, and J. P. McIver. 2006. “Public Opinion in the States: A Quarter
Century of Change and Stability.” In Public Opinion in State Politics, edited by J. E. Cohen,
228–​253. Stanford, CA: Stanford University Press.
Fenno, R. F. 1978. Homestyle: House Members in Their Districts. Boston: Little, Brown.
Fleisher, R. 1993. “Explaining the Change in Roll-​Call Voting Behavior of Southern Democrats.”
Journal of Politics 55 (2): 327–​341.
Fry, B. R., and R. F. Winters. 1970. “The Politics of Redistribution.” American Political Science
Review 64 (2): 508–​522.
Gelman, A., and J. Hill. 2007. Data Analysis Using Regression and Multilevel Hierarchical
Models. Cambridge, UK: Cambridge University Press.
Gelman, A., and T. C. Little. 1997. “Poststratification into Many Categories Using Hierarchical
Logistic Regression.” Survey Methodology 23 (2): 127–​135.
Gibson, J. 1988. “Political Intolerance and Political Repression During the McCarthy Red
Scare.” American Political Science Review 82 (2): 511–​529.
Glazer, A., and M. Robbins. 1985. “Congressional Responsiveness to Constituency Change.”
American Journal of Political Science 29 (2): 259–​273.
Godwin, R. K., and W. B. Shepard. 1976. “Political Processes and Public Expenditures: A Re-​
examination Based on Theories of Representative Government.” American Political Science
Review 70 (4): 1127–​1135.
Green, D. P., and A. E. Gerken. 1989. “Self-​Interest and Public Opinion Toward Smoking
Restrictions and Cigarette Taxes.” Public Opinion Quarterly 53 (1): (Spring): 1–​16.
Hofferbert, R. I. 1966. “The Relation between Public Policy and Some Structural and
Environmental Variables in the American States.” American Political Science Review 60
(1): 73–​82.
Holbrook-​Provow, T. M., and S. C. Poe. 1987. “Measuring State Political Ideology.” American
Politics Quarterly 15 (3): 399–​416.
Jennings, E. T., Jr. 1979. “Competition, Constituencies, and Welfare Policies in the American
States.” American Political Science Review 73 (2): 414–​429.
Jacobs, L. R., and R. Y. Shapiro. 1994. “Studying Substantive Democracy.” PS: Political Science
and Politics 27 (1): 9–​17.
Jacobs, L. R., and R. Y. Shapiro. 2000. Politicians Don’t Pander: Political Manipulation and the
Loss of Democratic Responsiveness. Chicago: University of Chicago Press.
Johannes, J. R. 1984. To Serve the People: Congress and Constituency Service. Lincoln: University
of Nebraska Press.
Johnson, M., P. Brace, and K. Arceneaux. 2005. “Public Opinion and Dynamic Representation
in the American States: The Case of Environmental Attitudes.” Social Science Quarterly 86
(1): 87–​108.
Jones, R. S., and W. E. Miller. 1984. “State Polls: Promising Data Sources for Political Research.”
Journal of Politics 46 (4): 1182–​1192.
Joslyn, R. A. 1980. “Manifestations of Elazar’s Political Subcultures: State Public Opinion and
the Content of Political Campaign Advertising.” Publius 10 (2): 37–​58.
Kalt, J. P., and M. A. Zupan. 1984. “Capture and Ideology in the Economic Theory of Politics.”
American Economic Review 74 (3): 279–​300.
Kastellec, J. P., J. R. Lax, and J. H. Phillips. 2010. “Public Opinion and Senate Confirmation of
Supreme Court Nominees.” Journal of Politics 72 (3): 767–​784.
Kernell, G. 2009. “Giving Order to Districts: Estimating Voter Distributions with National
Election Returns.” Political Analysis 17(3): 215–​235.
Key, V. O. 1949. Southern Politics in State and Nation. New York: Knopf.
Kuklinski, J. H. 1977. “Constituency Opinion: A Test of the Surrogate Model.” Public Opinion
Quarterly 41 (1): 34–​40.
Lascher, E. L., M. G. Hagen, and S. A. Rochlin. 1996. "Gun Behind the Door? Ballot Initiatives, State
Policies, and Public Opinion.” Journal of Politics 58 (3): 760–​775.
Lax, J. R., and J. H. Phillips. 2009a. “Gay Rights in the States:  Public Opinion and Policy
Responsiveness.” American Political Science Review 103 (3): 367–​386.
Lax, J. R., and J. H. Phillips. 2009b. “How Should We Estimate Public Opinion in the States?”
American Journal of Political Science 53 (1): 107–​121.
Lenz, G. S. 2012. Follow the Leader? How Voters Respond to Politicians’ Policies and Performance.
Chicago: University of Chicago Press.
LeoGrande, W., and A. S. Jeydel. 1997. “Using Presidential Election Returns to Measure
Constituency Ideology: A Research Note.” American Politics Quarterly 25 (1): 3–​19.
Levendusky, M. S., J. C. Pope, and S. D. Jackman. 2008. “Measuring District-​Level Partisanship
with Implications for the Analysis of U.S. Elections.” Journal of Politics 70 (3): 736–​753.
McCrone, D. J., and J. H. Kuklinski. 1979. “The Delegate Theory of Representation.” American
Journal of Political Science 23 (2): 278–​300.
McDonagh, E. L. 1992. “Representative Democracy and State Building in the Progressive Era.”
American Political Science Review 86: 938–​950.
Miller, W. E., and D. E. Stokes. 1963. “Constituency Influence in Congress.” American Political
Science Review 57 (1): 45–​56.
Mooney, C. Z., and M.-​H. Lee. 2000. “The Influence of Values on Consensus and
Contentious Morality Policy: U.S. Death Penalty Reform, 1956–​1982.” Journal of Politics
62 (1): 223–​239.
Nice, D., and J. Cohen. 1983. “Ideological Consistency among State Party Delegations to the
U.S. House, Senate, and National Conventions.” Social Science Quarterly 64 (4): 871–​879.
Nicholson, S. P. 2003. “The Political Environment and Ballot Proposition Awareness.”
American Journal of Political Science 47 (3): 403–​410.
Norrander, B., and C. Wilcox. 1999. “Public Opinion and Policymaking in the States: The Case
of Post-​Roe Abortion Policy.” Policy Studies Journal 27(4): 707–​722.
Norrander, B. 2000. "The Multi-Layered Impact of Public Opinion on Capital Punishment
Implementation in the American States.” Political Research Quarterly 53 (4): 771–​793.
Norrander, B., and C. Wilcox. 2001. “Measuring State Public Opinion with the Senate National
Election Study.” State Politics & Policy Quarterly 1 (1): 111–​125.
Pacheco, J. 2011. “Using National Surveys to Measure Dynamic U.S. State Public Opinion: A
Guideline for Scholars and an Application.” State Politics & Policy Quarterly 11 (4): 415–​539.
Page, B. 2002. “The Semi-​Sovereign Public.” In Navigating Public Opinion, edited by J. Manza,
F. L. Cook, and B. I. Page, 325–​344. New York: Oxford University Press.
Page, B. I., and R. Y. Shapiro. 1992. The Rational Public: Fifty Years of Trends in Americans’ Policy
Preferences. Chicago: University of Chicago Press.
Page, B. I., R. Y. Shapiro, P. W. Gronke, and R. M. Rosenberg. 1984. “Constituency, Party and
Representation in Congress.” Public Opinion Quarterly 48 (4): 741–​756.
Palus, C. K. 2010. “Responsiveness in American Local Governments.” State and Local
Government Review 42 (2): 133–​150.
Park, D. K., A. Gelman, and J. Bafumi. 2004. “Bayesian Multilevel Estimation with
Poststratification:  State-​ Level Estimates from National Polls.” Political Analysis 12
(4): 375–​385.
Park, D. K., A. Gelman, and J. Bafumi. 2006. “State-​ Level Opinions from National
Surveys: Poststratification Using Multilevel Logistic Regression.” In Public Opinion in State
Politics, edited by J. Cohen, 209–​228. Palo Alto, CA: Stanford University Press.
Parry, J. A., B. Kisida, and R. E. Langley. 2008. “The State of State Polls: Old Challenges, New
Opportunities.” State Politics & Policy Quarterly 8 (2): 198–​216.
Peltzman, S. 1984. “Constituent Interest and Congressional Voting.” Journal of Law and
Economics 27 (1): 181–​210.
Percival, G. L., M. Johnson, and M. Neiman. 2009. “Representation and Local Policy: Relating
County-​Level Public Opinion to Policy Outputs.” Political Research Quarterly 62 (1): 164–​177.
Pool, I. D. S., and R. Abelson. 1961. “The Simulmatics Project.” Public Opinion Quarterly 25
(2): 167–​183.
Pool, I. D. S., R. P. Abelson, and S. Popkin. 1965. Candidates, Issues and Strategies. Cambridge, MA: MIT Press.
Popkin, S. 1994. The Reasoning Voter: Communication and Persuasion in Presidential Elections.
Chicago: University of Chicago Press.
Popkin, S., J. Gorman, C. Phillips, and J. Smith. 1976. “Comment:  What Have You Done
for Me Lately? Toward a Theory of Voting.” American Political Science Review 70
(September): 779–​805.
Seidman, D. 1973. “Simulation of Public Opinion:  A Caveat.” Public Opinion Quarterly 39
(3): 331–​342.
Shapiro, R. Y. 2011. “Public Opinion and American Democracy.” Public Opinion Quarterly 75
(5): 982–​1017.
Sharkansky, I. 1968. Spending in the American States. Chicago: Rand McNally.
Sharkansky, I., and R. I. Hofferbert. 1969. "Dimensions of State Politics, Economics, and Public Policy."
American Political Science Review 63 (3): 867–​880.
Sinclair-​Deckard, B. 1976. “Electoral Marginality and Party Loyalty in the House.” American
Journal of Political Science 20 (3): 469–​481.
Stimson, J. A., M. B. MacKuen, and R. S. Erikson. 1995. "Dynamic Representation." American
Political Science Review 89(3): 543–​565.
Stouffer, S. A. 1955. Communism, Conformity, and Civil Liberties: A Cross-​Section of the Nation
Speaks. Garden City, NY: Doubleday.
Sullivan, J. L., and D. R. Minns. 1976. “Ideological Distance between Candidates: An Empirical
Examination.” American Journal of Political Science 20 (3): 439–​469.
Sullivan, J. L., and E. M. Uslaner. 1978. “Congressional Behavior and Electoral Marginality.”
American Journal of Political Science 22 (3): 536–​553.
Tausanovitch, C., and C. Warshaw. 2013. “Measuring Constituent Policy Preferences in
Congress, State Legislatures and Cities.” Journal of Politics 75 (2): 330–​342.
Tausanovitch, C., and C. Warshaw. 2014. “Representation in Municipal Government.”
American Political Science Review 108 (3): 605–​641.
Tiebout, C. 1956. "A Pure Theory of Local Expenditures." Journal of Political Economy 64
(5): 416–​424.
Trounstine, J. 2010. “Representation and Accountability in Cities.” Annual Review of Political
Science 13: 407–​423.
Uslaner, E. M., and R. E. Weber. 1979. “U.S. State Legislators’ Opinions and Perceptions of
Constituency Attitudes.” Legislative Studies Quarterly 4 (4): 563–​585.
Warshaw, C., and J. Rodden. 2012. “How Should We Measure District-​Level Public Opinion on
Individual Issues?” Journal of Politics 74 (1): 203–​219.
Weber, R. E., A. H. Hopkins, M. L. Mezey, and F. J. Munger. 1972. “Computer Simulation of
State Electorates.” Public Opinion Quarterly 36 (4): 549–​565.
Weber, R. E., and W. R. Shaffer. 1972. "Public Opinion and American State Policymaking."
Midwest Journal of Political Science 16 (4): 683–​699.
Whittaker, M., G. M. Segura, and S. Bowler. 2005. “Racial/​Ethnic Group Attitudes toward
Environmental Protection in California: Is ‘Environmentalism’ Still a White Phenomenon?”
Political Research Quarterly 58 (3): 435–​447.
Wlezien, C. 1995. “The Public as Thermostat: Dynamics of Preferences for Spending.” American
Journal of Political Science 39 (4): 981–​1000.
Wlezien, C. 2004. “Patterns of Representation: Dynamics of Public Preferences and Policy.”
Journal of Politics 66 (1): 1–​24.
Wlezien, C. 2011. Public Opinion and Public Policy in Advanced Democracies. Oxford
Bibliographies Online. Oxford, UK: Oxford University Press.
Wright, G., R. S. Erikson, and J. P. McIver. 1985. “Measuring State Partisanship and Ideology
Using Survey Data.” Journal of Politics 47 (2): 469–​489.
Wright, G., and J. P. McIver. 2007. “Measuring the Public’s Ideological Preferences in the
50 States:  Survey Responses versus Roll Call Data.” State Politics & Policy Quarterly 7
(2): 141–​151.
Zaller, J. R. 1992. The Nature and Origins of Mass Opinion. New York: Cambridge University Press.
Chapter 16

Latent Constructs in Public Opinion

Christopher Warshaw

Introduction

Many of the most important constructs in public opinion research are abstract, la-
tent quantities that cannot be directly observed from individual questions on surveys.
The accurate measurement of these concepts “is a cornerstone of successful scientific
inquiry” (Delli Carpini and Keeter 1993, 1203). Some prominent examples of latent
constructs in public opinion research are policy mood, political knowledge, racial re-
sentment, consumer confidence, political activism, and trust in government. In each
instance the available data on surveys are merely noisy indicators of the theoretical
quantities that scholars are interested in measuring. Thus, multiple indicators are neces-
sary to construct a holistic measure of the latent quantity (Jackman 2008). For example,
imagine that scholars wanted to measure religiosity (e.g., McAndrew and Voas 2011;
Margolis 2018). It is self-​evident that self-​reports of church attendance on a survey are
merely noisy indicators of respondents’ underlying religiosity. Moreover, they capture
only one aspect of religiosity. Scholars could construct a more holistic measure of citi-
zens’ underlying religiosity by averaging across multiple indicators of religiosity, such as
church attendance, membership in religious organizations, belief in God, donations to a
church, and so forth.
There are a number of reasons to believe that survey items are often best viewed
as noisy indicators of underlying latent attitudes (see Jackman 2008 for a review).1
One plausible view is that individual survey questions have measurement error due
to vague or confusing question wording (Achen 1975). Another view is that survey
respondents sample from a set of mentally accessible considerations when they
provide their responses to individual questions (Zaller and Feldman 1992). If a re-
spondent answers the same survey question multiple times, he or she would provide
slightly different responses each time even though the underlying latent trait is stable.
Measurement error on surveys could also be driven by the conditions of the inter-
view (mode, location, time of day, etc.), the respondents’ level of attentiveness on the
survey (Berinsky, Margolis, and Sances 2014), or characteristics of the interviewer
(e.g., attentiveness, race, ethnicity, gender, education level) (e.g., Anderson, Silver,
and Abramson 1988).
Overall, this perspective suggests that the usage of multiple indicators almost always
reduces measurement error and improves estimates of the underlying latent construct
(Ansolabehere, Rodden, and Snyder 2008). As more indicators become available, the
measurement of the latent construct of interest will generally become more accurate.
In addition, recent work shows how survey designers can use computerized adaptive
testing (CAT) to further improve measurement accuracy and precision (Montgomery
and Cutler 2013).

Examples of Latent Public Opinion Constructs

There are a number of prominent examples of latent constructs in public opinion re-
search. One is policy liberalism or mood. Surveys typically include many questions about
respondents’ preferences on individual policies. They might include questions about uni-
versal healthcare, abortion, welfare, tax cuts, and environmental policy. One approach
is to analyze these questions separately (e.g., Lax and Phillips 2009a; Broockman 2016).
However, in practice survey respondents’ views on these individual questions are gen-
erally highly correlated with one another. If respondents have liberal views on universal
healthcare, they probably also have liberal views on other policy issues. This is because
responses on individual policy questions largely stem from respondents’ underlying
ideological attitudes. Thus, their views on many policy questions can be mapped onto
a one-​or two-​dimensional policy liberalism scale (Ansolabehere, Rodden, and Snyder
2008; Treier and Hillygus 2009; Bafumi and Herron 2010; Tausanovitch and Warshaw
2013).2 Moreover, when individuals’ responses are averaged across many issue questions,
their latent policy liberalism tends to be very stable over time (Ansolabehere, Rodden,
and Snyder 2008).
Another prominent latent construct is political knowledge. A  variety of theories
suggest that variation in political knowledge influences political behavior. Like other la-
tent concepts, knowledge is not a concept that can be directly measured based on a single
survey question (Delli Carpini and Keeter 1993). At best, individual survey questions
capture a subset of citizens’ knowledge about politics. Instead, political knowledge is
thought to be an agglomeration of citizens’ knowledge of many aspects of the political
process. Indeed, researchers have found that one (Delli Carpini and Keeter 1993) or two
(Barabas et al. 2014) latent dimensions capture the bulk of the variation in citizens’ polit-
ical knowledge.
Racial prejudice and resentment are core concepts in the field of political behavior.
Indeed, racial resentment has been shown to influence a variety of political attitudes and
actions. But there is no way to capture racial prejudice or resentment through a single
survey question. Instead, researchers typically ask respondents many questions that
serve as indicators of prejudice. Then all of these questions are aggregated to produce a
summary measure of prejudice (e.g., Kinder and Sanders 1996; Tarman and Sears 2005;
Carmines, Sniderman, and Easter 2011).
One of the most important metrics of the health of the U.S. economy is consumer con-
fidence. The University of Michigan has used public opinion surveys to track consumer
confidence since the late 1940s (Mueller 1963; Ludvigson 2004). Consumer confidence
is measured using an index of multiple survey questions that all tap into consumers’
underlying, latent views about the economy. This index has been used in a huge liter-
ature in economics, finance, and political economy (e.g., De Boef and Kellstedt 2004;
Ludvigson 2004; Lemmon and Portniaguina 2006).

Measuring Latent Opinion at the Individual Level

Scholars have used a variety of models to measure latent variables at the individual level.
The objective of each of these models is to measure a continuous latent variable using
responses to a set of survey questions that are assumed to be a function of that latent var-
iable. In this section I discuss the four most common measurement techniques: additive
scales, factor analysis, item response models, and mixed models for combinations of
continuous, ordinal, and binary data.

Additive Models
The simplest way to measure latent opinion is to just take the average of the responses to
survey items that are thought to represent a particular latent variable (e.g., Abramowitz
and Saunders 1998). For instance, imagine that a survey has four questions that tap into
respondents’ political knowledge, including a question about the number of justices on
the Supreme Court, one that asks respondents to name the current vice president, one
that asks the percentage required for Congress to override a presidential veto, and one
that asks the length of a president’s term. One way to measure political knowledge is to
simply add up the number of correct answers to these four questions.
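As a minimal illustration, suppose a hypothetical data frame knowledge codes each of these four items as 1 (correct) or 0 (incorrect); the additive scale is then simply a row sum:

    # knowledge: data frame with one 0/1 column per item
    items <- c("justices", "vice_president", "veto_override", "term_length")
    knowledge$additive_score <- rowSums(knowledge[, items])

    # optionally rescale to the proportion of items answered correctly
    knowledge$prop_correct <- knowledge$additive_score / length(items)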
In some cases, this simple approach may work well. However, additive scales have
several major weaknesses vis-​à-​vis the more complex approaches discussed below. First,
they treat all survey items identically and assume that every item contributes equally
to the underlying latent dimension (Treier and Hillygus 2009). Second, it is difficult to
determine the appropriate dimensionality of the latent scale using additive models. In
the case of political knowledge, for example, Barabas et al. (2014) actually identify sev-
eral theoretically important dimensions. Third, it is necessary to determine the correct
polarity of each question in advance (e.g., which response is the “correct” or “liberal”
answer). This is often infeasible for larger sets of questions or for complicated latent
variables. Fourth, additive models are ill-​suited to multi-​chotomous or continuous re-
sponse data. Finally, additive models do not enable the characterization of measurement
error or uncertainty.

Factor Analysis
Factor analysis is the most common latent variable model used in applied research
(Jackman 2008). It has been used in a large number of studies to estimate the public’s la-
tent policy liberalism (e.g., Ansolabehere, Rodden, and Snyder 2006, 2008; Carsey and
Harden 2010), political knowledge (e.g., Delli Carpini and Keeter, 1993) or racial preju-
dice (e.g., Tarman and Sears 2005). Factor analysis is based on the observed relationship
between individual items on a survey. For instance, imagine a Bayesian model of citi-
zens’ policy liberalism, with the single latent factor, θi. For each individual i, we observe
J continuous survey questions, denoted yi = (y1i, . . ., yji, . . ., yJi). We can model yi as a
function of citizens’ policy liberalism (θi) and item-​specific factor loadings λ = (λ1, . . .,
λj, . . ., λJ),

$y_i \sim N_J(\lambda \theta_i, \Psi)$,  (1)

where $N_J$ indicates a J-dimensional multivariate normal distribution and Ψ is a J × J covariance matrix (Quinn 2004).
Factor analysis models have a number of advantages over simple additive scales. They
enable each survey item to differentially contribute to the latent construct. They also
enable the construction of complex multidimensional scales. Finally, they enable the
model to determine the polarity of each item. Factor analysis models can be run in the
statistical program R using the psych or MCMCpack packages. They can also be easily
estimated in other software packages such as Stata.
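As a rough sketch, a one-factor model of policy liberalism might be fit with the psych package as follows (assuming a hypothetical data frame policy_items of continuous policy questions; the factoring method and other defaults may need adjustment for a given application):

    library(psych)

    # policy_items: data frame of (quasi-)continuous policy questions
    fit <- fa(policy_items, nfactors = 1, fm = "ml")   # maximum-likelihood factoring

    fit$loadings                   # item-specific factor loadings (the lambdas)
    liberalism <- fit$scores[, 1]  # estimated latent policy liberalism for each respondent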

Item Response Models for Dichotomous and Ordinal Data
Factor-​analytic models assume that the observed indicators are continuous. Thus, con-
ventional factor analysis can produce biased estimates of latent variables with binary
indicators (Kaplan 2004). For binary variables, therefore, we need a different measure-
ment model. The most common class of measurement models for binary survey items
comes from item response theory (IRT) (see Johnson and Albert 2006). These models
are also well-​suited for Bayesian inference, which makes it possible to characterize the
uncertainty in the latent scale. In addition, Bayesian IRT models can easily deal with
missing data and survey items where respondents answer “Don’t know.”
The conventional two-​parameter IRT model introduced to political science by
Clinton, Jackman, and Rivers (2004) characterizes each policy response Y as a
function of subject i’s latent ideology (θi), the difficulty (αj) and discrimination (βj) of
item j, where

$\Pr[y_{ij} = 1] = \Phi(\beta_j \theta_i - \alpha_j)$,  (2)

where Φ is the standard normal cumulative distribution function (CDF) (Jackman 2009, 455; Fox 2010, 10). βj is referred to as the "discrimination" parameter be-
cause it captures the degree to which the latent trait affects the probability of a yes
answer. If βj is 0, then question j tells us nothing about the latent variable being
measured. We would expect βj to be close to 0 if we ask a completely irrelevant
question, such as one about the respondent’s favorite color. The “cut point” is the
value of αj/​βj at which the probabilities of answering yes or no to a question are
fifty-​f ifty.
Scholars can run Bayesian IRT models using off-​the-​shelf software such as
MCMCpack (Martin et al. 2011) or the ideal function in the R package pscl (Jackman
2012). They can also run fast approximations of some types of IRT models using the R
package emIRT (Imai, Lo, and Olmsted 2016).3 For more complicated IRT models, they
can use fully Bayesian software such as Jags or Stan.
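For instance, a basic version of the two-parameter model in equation (2) might be estimated with MCMCpack roughly as follows (a sketch assuming a hypothetical 0/1 response matrix responses with respondents in rows and items in columns; identification constraints, priors, and MCMC settings deserve careful attention in real applications):

    library(MCMCpack)

    # responses: matrix of 0/1 answers (NA = missing),
    # rows = respondents, columns = survey items
    fit <- MCMCirt1d(responses,
                     burnin = 1000, mcmc = 20000,
                     store.item = TRUE)   # also keep draws of alpha_j and beta_j
    # in practice, add theta.constraints to pin down the direction of the scale

    summary(fit)   # posterior summaries of theta_i and the item parameters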

Models for Mixtures of Continuous, Ordinal, and Binary Data
Factor analytic models are best for continuous data, while IRT models are best for bi-
nary and ordinal data (Treier and Hillygus 2009). To measure latent variables that are
characterized by a variety of different types of indicators (continuous, ordinal, binary),
it is necessary to use a model appropriate for mixed measurement responses (Quinn
2004). This model characterizes a latent variable using a mixture of link models that
are tailored to the data. The R package MCMCpack implements a Bayesian mixed data
factor analysis model that can be used with survey data (Martin et al. 2011). It is also pos-
sible to develop more complicated models for mixed data using fully Bayesian software
such as Jags or Stan.
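As a rough sketch, such a model might be fit with MCMCpack's mixed-data factor analysis function (this assumes the function's one-sided formula interface and a hypothetical data frame dat whose columns mix continuous and ordered indicators of the same construct; consult the package documentation for the exact arguments):

    library(MCMCpack)

    # dat: data frame mixing continuous and ordered-factor indicators
    # of a single latent construct (e.g., trust in government)
    fit <- MCMCmixfactanal(~item1 + item2 + item3 + item4,
                           factors = 1, data = dat,
                           burnin = 1000, mcmc = 10000)

    summary(fit)   # posterior summaries of the loadings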

Evaluating the Success of a Latent Variable Model


The quality of the inferences about a latent variable are usually assessed with reference
to two key concepts: validity and reliability (Jackman 2008). The concept of validity taps
the idea that a latent variable model should generate unbiased measures of the concept
that “it is supposed to measure” (Bollen 1989, 184). The concept of reliability taps into the
amount of measurement error in a given set of estimates.
Adcock and Collier (2001) suggest a useful framework for evaluating the validity of a
measurement model. First, they suggest that models should be evaluated for their con-
tent validity. Are the indicators of the latent variable operationalizing the full substan-
tive content of the latent construct? To assess this, they suggest examining whether “key
elements are omitted from the indicator,” as well as whether “inappropriate elements are
included in the indicator” (538). For example, indicators for respondents’ latent opinion
about climate change should be substantively related to climate change rather than some
other policy area. Moreover, they should include all relevant substantive areas related to
citizens’ views on climate change.
Next, Adcock and Collier (2001) suggest that models should be evaluated for their
convergent validity. Are the estimates of a latent variable closely related to other meas-
ures known to be valid measures of the latent construct? For example, estimates of
respondents’ policy liberalism should be highly correlated with their symbolic ideology.
Third, they suggest that models should be evaluated for their construct validity. Do
the estimates of a latent variable correspond to theoretically related concepts? This form
of validation is particularly useful when there is a well-​understood causal relationship
between two related concepts. For example, estimates of policy liberalism should be
closely related to respondents’ voting behavior and partisan identification.
The concept of reliability assesses the amount of measurement error in a set of
estimates. A measurement would be unreliable if it contained large amounts of random
error (Adcock and Collier 2001). The reliability of a measure is crucial for determining
its usefulness for applied research. Indeed, measurement error in latent variables used as
regression predictors leads to severely biased estimates in substantive analyses (Jackman
2008; Treier and Jackman 2008).
Depending on the data sources available, there are a number of ways to assess the re-
liability of a measurement. One of the most popular approaches is to use “test-​retest” re-
liability (Jackman 2008). Under the assumption that the latent variable does not change,
the correlation between the measure of the latent variable in two time periods is an es-
timate of the reliability of the measures. Ansolabehere, Rodden, and Snyder (2008) use
this approach to assess the stability of the mass public’s policy liberalism across panel
waves of the American National Election Study (ANES). They find that measures of
individuals’ policy liberalism in one wave are strongly correlated with a measure of their
latent policy liberalism two or four years later. Another approach for assessing reliability
is to examine inter-​item reliability based on the average level of correlation among the
survey items used to generate a latent construct, normalized by the number of items.
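Both diagnostics are straightforward to compute once the estimates and items are in hand. A minimal sketch (with hypothetical objects theta_wave1 and theta_wave2 holding the same respondents' estimates in two panel waves, and a data frame items holding the indicators used to build the scale) is:

    # Test-retest reliability: correlate the latent estimates across panel waves
    cor(theta_wave1, theta_wave2, use = "pairwise.complete.obs")

    # Inter-item reliability: Cronbach's alpha, a summary of the average
    # inter-item correlation adjusted for the number of items
    library(psych)
    alpha(items)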
Jackman (2008) points out that there is often a “bias-​variance trade-​off ” in latent vari-
able estimation. Increasing the number of indicators used in a latent variable model may
increase the reliability of the resulting estimates at the cost of less content validity. For ex-
ample, imagine that a researcher wanted to measure the public’s latent views about abor-
tion policy. Given the low-​dimensional structure of the mass public’s policy liberalism,
the researcher would probably be able to increase the reliability of her measure by in-
cluding survey items about other issue areas in her measurement model. However, this
approach would violate Adcock and Collier (2001)’s dictum that indicators for a partic-
ular latent construct should be substantively related to the construct being measured
rather than to some other policy area.

Individual-Level Applications

Latent public opinion constructs have been used for a wide variety of substantive
applications in political science. In this section I briefly discuss two of these applications.

Polarization
It is widely agreed that the latent ideology of members of Congress and other elites
have grown increasingly polarized in recent decades (Poole and Rosenthal 2007). Are
the changes in elite polarization caused by increasing polarization at the mass level
(Barber and McCarty 2015)? To address this question we need holistic measures of the
individual-​level policy liberalism of the American public at a variety of points of time.
Hill and Tausanovitch (2015) do this using data from the ANES. They find little increase
in the polarization of the mass public’s policy liberalism between 1956 and 2012. Their
results strongly suggest that elite polarization is not caused by changes in mass polariza-
tion (see Barber and McCarty 2015 for more on this debate).
Outside of the United States there has been less work on the structure of the mass
public’s preferences. One recent exception is China, where several papers have examined
the mass public’s policy preferences along one or more dimensions (e.g., Lu, Chu, and
Shen 2016; Pan and Xu 2018). For example, Pan and Xu (2018) identify a single, dominant
ideological dimension to public opinion in China. They find that individuals expressing
preferences associated with political liberalism, favoring constitutional democracy and
individual liberty, are also more likely to express preferences associated with economic
liberalism, such as endorsement of market-​oriented policies, and preferences for social
liberalism, such as the value of sexual freedom. Notably, they also find little evidence of
polarization in the Chinese public’s policy preferences.

Political Knowledge
The causes and consequences of variation in citizens' political knowledge are core
questions in the literature on political behavior (e.g., Mondak 2001). A large literature uses
scaled measures of latent political knowledge in the American context. For example,
many studies examine the consequences of variation in political knowledge for political
accountability and representation. Jessee (2009) and Shor and Rogowski (2018) find that
higher knowledge individuals are more likely to hold legislators accountable for their
roll-​call positions. Bartels (1996) finds that variation in political knowledge has impor-
tant consequences for the outcomes of elections.
There is a smaller literature that focuses on the causes and consequences of variation
in political knowledge outside of the United States. For example, Pereira (2015) measures
cross-​national variation in political knowledge in Latin America based on a Bayesian
item response model that explicitly accounts for differences in the questions across
countries. Using surveys from Latin America and the Caribbean, he demonstrates that
contextual factors such as level of democracy, investments in telecommunications, eth-
nolinguistic diversity, and type of electoral system have substantial effects on knowledge.

Measuring Latent Opinion at the Group Level

While many research questions require individual-​level estimates of latent opinion,
a number of other research questions focus on the effect of variation in group-​level
opinion on salient political outcomes. For example, scholars often seek to characterize
changes in the policy mood of the electorate (e.g., Stimson 1991; Erikson, MacKuen, and
Stimson 2002; Bartle, Dellepiane-​Avellaneda, and Stimson 2011). Another important
question in American politics is the dyadic link between constituents’ policy views and
the roll-​call votes of their legislators (Miller and Stokes 1963). To evaluate dyadic repre-
sentation, scholars need measures of the public’s average policy preferences in each state
or legislative district. Moreover, a variety of studies have gone even further and sought
to examine whether some groups are represented better than others. Do legislators skew
their roll-​call votes toward the views of co-​partisans (Kastellec et al. 2015; Hill 2015)? Are
legislators more responsive to voters than nonvoters (Griffin and Newman 2005)? Do
the wealthy get better representation than the poor (Bartels 2009; Gilens 2012; Erikson
2015)? To address these sorts of questions, scholars need accurate measures of the av-
erage latent preferences for each group.

Disaggregation
The simplest way to estimate group-​level opinion is to measure latent opinion at the in-
dividual level and then take the mean in each group. For example, Carsey and Harden
(2010) use a factor analytic model to measure the public’s policy liberalism in the United
States in 2010. Then they measure state-​level opinion by taking the average opinion in
each state. Lax and Phillips (2009b) call this approach “disaggregation.” The primary
advantage of disaggregation is that scholars can estimate latent opinion with a set of
individual-​level survey questions, an appropriate individual-​level measurement model
(e.g., a factor analytic or IRT model), and the respondent's place of residence (e.g.,
Erikson, Wright, and McIver 1993; Brace et al. 2002). Thus, it is very straightforward
for applied researchers to generate estimates of public opinion in each geographic
unit. However, there are rarely enough respondents to generate precise estimates of
the preferences of people in small geographic areas using simple disaggregation. Most
surveys have only a handful of respondents in each state and even fewer in particular
legislative districts or cities.
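In code, disaggregation amounts to a grouped mean. A minimal sketch (assuming a hypothetical data frame respondents with an individual-level latent estimate theta and a state identifier) is:

    # respondents: individual-level latent estimates plus a geographic identifier
    state_means <- aggregate(theta ~ state, data = respondents, FUN = mean)

    # the respondent count per state shows where simple disaggregation
    # will be too noisy to be useful
    table(respondents$state)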

Smoothing Opinion Using Multilevel Regression and Post-​stratification (MRP)
A more nuanced approach is to combine individual-​level estimates of latent opinion
with a measurement model that smooths opinion across geographic space (e.g.,
Tausanovitch and Warshaw 2013). Indeed, even very large sample surveys can con-
tain small or even empty samples for many geographic units. In such cases, opinion
estimates for subnational units can be improved through the use of multilevel regres-
sion and post-​stratification (MRP) (Park, Gelman, and Bafumi 2004). The idea behind
MRP is to model respondents’ opinion hierarchically based on demographic and ge-
ographic predictors, partially pooling respondents in different geographic areas to an
extent determined by the data. The smoothed estimates of opinion in each geographic-​
demographic cell (e.g., Hispanic women with a high school education in Georgia) are
then weighted to match the cells’ proportion in the population, yielding estimates of
average opinion in each area. These weights are generally built using post-​stratification-​
based population targets. But they sometimes include more complicated weighting
designs (Ghitza and Gelman 2013). Subnational opinion estimates derived from MRP
models have been shown to be more accurate than ones based on alternative methods,
even with survey samples of only a few thousand people (Park, Gelman, and Bafumi
2004; Lax and Phillips 2009b; Warshaw and Rodden 2012; but see Buttice and Highton
2013 for a cautionary note).
Scholars can build state-​level MRP models in R using the mrp (Malecki et al. 2014) or
dgo (Dunham, Caughey, and Warshaw 2016) packages. They can program customized
MRP models using the glmer function in the lme4 package.4 More complicated MRP
models can be built using fully Bayesian software such as Jags or Stan.
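A stripped-down customized model might look like the following sketch (assuming hypothetical data frames survey, with a binary opinion item and respondent demographics, and census_cells, with the population count n of every demographic-state cell; a real application would add more predictors, state-level covariates, and convergence checks):

    library(lme4)

    # Step 1: multilevel regression of the binary opinion item on demographic
    # and geographic groups, partially pooling across states
    fit <- glmer(support ~ (1 | race) + (1 | educ) + (1 | age_group) + (1 | state),
                 data = survey, family = binomial)

    # Step 2: predicted opinion for every demographic-state cell
    census_cells$p_hat <- predict(fit, newdata = census_cells,
                                  type = "response", allow.new.levels = TRUE)

    # Step 3: post-stratify: population-weighted average of the cell predictions
    mrp_estimates <- sapply(split(census_cells, census_cells$state),
                            function(cell) weighted.mean(cell$p_hat, cell$n))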

Hierarchical Group-​Level IRT Model


Most public opinion surveys only contain a handful of questions about any particular
latent construct. For example, most surveys only contain a few questions about policy.
Moreover, they might only contain one question about other latent constructs such as
trust in government or political activism. The sparseness of questions in most surveys
largely precludes the use of respondent-​level dimension-​reduction techniques on the
vast majority of available public opinion data. To overcome this problem, scholars have
developed a variety of measurement models that are estimated at the level of groups
rather than individuals (Stimson 1991; Lewis 2001; McGann 2014). This enables scholars
to measure latent constructs using data from surveys that only ask one or two questions
about the construct of interest, which would be impossible with models that are
estimated at the individual level. For example, Caughey and Warshaw (2015) develop a
group-​level IRT model that estimates latent group opinion as a function of demographic
and geographic characteristics, smoothing the hierarchical parameters over time via a
dynamic linear model. They reparameterize equation (2) as

$p_{ij} = \Phi[(\theta_i - \kappa_j)/\sigma_j]$,  (3)

where κj = αj/βj and Φ is the standard normal CDF (Fox 2010, 11). In this formulation, the item threshold κj represents
the ability level at which a respondent has a 50% probability of answering question j
correctly.5 The dispersion σj, which is the inverse of the discrimination βj, represents the
magnitude of the measurement error for item j. Given the normal ogive IRT model and
normally distributed group abilities, the probability that a randomly sampled member
of group g correctly answers item j is

$p_{gj} = \Phi\big[(\bar{\theta}_g - \kappa_j)/\sqrt{\sigma_\theta^2 + \sigma_j^2}\big]$,  (4)

where $\bar{\theta}_g$ is the mean of the θi in group g, σθ is the within-​group standard deviation of abilities, and κj and σj are the threshold and dispersion of item j (Mislevy 1983, 278).
Rather than modeling the individual responses yij, as in a typical IRT model,
Caughey and Warshaw (2015) instead model $s_{gj} = \sum_{i}^{n_{gj}} y_{i[g]j}$, the total number of correct
answers to question j out of the ngj responses of individuals in group g (e.g., Ghitza
and Gelman 2013). Assuming that each respondent answers one question and each
response is independent conditional on θi, κj, and σj, the number of correct answers
to item j in each group, sgj, is distributed binomial (ngj, pgj), where ngj is the number
of nonmissing responses. The model in Caughey and Warshaw (2015) then smooths
the estimates of each group using a hierarchical model that models group means as a
function of each group’s demographic and geographic characteristics (Park, Gelman,
and Bafumi 2004).
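The data preparation this implies is a simple tabulation of group-level success counts. A minimal sketch (assuming a hypothetical long-format data frame responses with columns group, item, and a 0/1 answer y) is:

    # responses: one row per respondent-item response, with columns
    # group (e.g., a demographic-by-state cell), item, and y (1 = yes/correct)
    agg <- aggregate(y ~ group + item, data = responses,
                     FUN = function(v) c(s = sum(v), n = length(v)))

    # s_gj (successes) and n_gj (nonmissing responses) then enter the binomial
    # likelihood of the group-level IRT model (e.g., via the dgo package)
    s_gj <- agg$y[, "s"]
    n_gj <- agg$y[, "n"]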
This group-​level IRT model enables the usage of data from hundreds of individual
surveys, which may only contain one or two policy questions. Similarly to the MRP
models discussed above, the group-​level estimates from this model can be weighted
to generate estimates for geographic units. This approach enables scholars to measure
policy liberalism and other latent variables across geographic space and over time in a
unified framework. Scholars can run group-​level IRT models using the R package dgo
(Dunham, Caughey, and Warshaw 2016).

Group-Level Applications

Latent public opinion constructs that are measured at the group level have been used for
a wide variety of substantive applications in political science. In this section I briefly dis-
cuss three of these applications.

Describing Variation in Ideology Across Time and Space


One of the most basic tasks of public opinion research is to describe variation in the
mass public’s views across time or geographic space. To this end, a large body of work
in the American politics literature has focused on longitudinal variation in latent policy
liberalism at the national level. For example, Stimson (1991) measures variation in the
public’s policy mood at the national level in the United States over the past fifty years.
Likewise, Bartle, Dellepiane-​ Avellaneda, and Stimson (2011) and McGann (2014)
measure policy mood in England from 1950 to 2004; Stimson, Thiébaut, and Tiberj
(2012) measure policy mood in France; and Munzert and Bauer (2013) measure changes
in the public’s policy preferences in Germany.
Another large body of work focuses on measuring variation in latent policy liber-
alism across geography. For example, Carsey and Harden (2010) use an IRT model
to measure variation in the public’s policy liberalism across the American states.
However, their approach generates unstable estimates below the state level. To address
this problem, Tausanovitch and Warshaw (2013) combine an IRT and MRP model
to generate cross-​sectional estimates of the public’s policy liberalism in every state,
legislative district, and city in the country during the period 2000–​2012. More re-
cent work in the American politics literature has sought to measure variation in the
public’s policy liberalism across both geographic space and time on a common scale.
Enns and Koch (2013) measure state-​level variation in policy mood between 1956 and
2010, while Caughey and Warshaw (2017) measure variation in policy liberalism in
the American states between 1936 and 2014. Both studies produce estimates in every
state-​year during these periods.
There is also a growing literature that examines variation in latent opinion cross-​
nationally. Caughey, O’Grady, and Warshaw (2015) use a Bayesian group-​level IRT
model to develop measures of policy liberalism in Europe. They find that countries
within Europe have become more polarized over time, and that patterns of ideology
are starkly different across economic and cultural issues. Sumaktoyo (2015) meas-
ures religious conservatism levels in twenty-​six Islamic countries. He finds that
Afghanistan and Pakistan, along with other Arab countries, are the most conserva-
tive Islamic countries. In contrast, Turkey is relatively moderate. The only Muslim-​
majority countries that are less religiously conservative than Turkey are post-​Soviet
countries.
Representation in the United States


One of the foundations of representative democracy is the assumption that citizens’
preferences should correspond with, and inform, elected officials’ behavior. This form
of representation is typically called dyadic representation (Miller and Stokes 1963;
Weissberg 1978; Converse and Pierce 1986). Most of the literature in American politics
on dyadic representation focuses on the association between the latent policy liberalism
of constituents and the roll-​call behavior of legislators. These studies generally find that
legislators’ roll-​call positions are correlated with the general ideological preferences
of their districts (e.g., Clinton 2006). However, there is little evidence that candidates’
positions converge on the median voter (Ansolabehere, Snyder, and Stewart 2001; Lee,
Moretti, and Butler 2004).
If legislators’ positions are not converging on the median voter, perhaps they are
responding to the positions of other subconstituencies in each district, such as primary
voters or other activists. Of course, this question is impossible to examine without good
estimates of each subconstituency’s opinion in every legislative district. As a result, a va-
riety of recent studies have used variants of the measurement models discussed above to
examine the link between the policy liberalism of primary voters (Bafumi and Herron
2010; Hill 2015), donors (Barber 2016), and other subconstituencies and the roll-​call be-
havior of legislators.
A growing body of work in American politics is moving beyond the study of dyadic
representation in Congress to examine the links between public opinion and political
outcomes at the state and local levels. Erikson, Wright, and McIver (1993) and many
subsequent studies have examined representation at the state level. More recently,
Tausanovitch and Warshaw (2014) extend the study of representation to the municipal
level, where they find a strong link between public opinion and city policy outputs.

Racial Prejudice
Section 5 of the Voting Rights Act (VRA; 1965) targeted states that were purported to
have high levels of racial prejudice. To evaluate the validity of the VRA’s coverage for-
mula, it would be useful to have a measure of the level of racial prejudice in every state.
To this end, Elmendorf and Spencer (2014) use an individual-​level IRT model to scale
the racial prejudice levels of approximately fifty thousand respondents to two large
surveys in 2008. Then they use MRP to estimate the average level of racial prejudice in
every state and county in the country. They find the highest levels of racial prejudice in
southern states such as Mississippi and South Carolina. However, they also find high
levels of racial prejudice in several other states, such as Wyoming, Pennsylvania, and
Ohio. Their findings provide policymakers with information about contemporary levels
of racial prejudice in the United States that could be useful for future revisions to the
VRA and other federal laws protecting minorities.
Substantive Frontiers

Public opinion work utilizing latent variables is likely to pursue a variety of exciting,
substantive directions in coming years. In this section I  focus on three types of re-
search that investigate the consequences of citizens’ latent policy liberalism for political
outcomes. First, scholars are likely to focus more attention on spatial voting and elec-
toral accountability. Second, the availability of new techniques for measuring changes in
latent opinion over time will focus more attention on the dynamic responsiveness of
elected officials and public policies to changes in the public’s views. Third, there is likely
to be more focus on representation and dyadic responsiveness in comparative politics.

Spatial Voting
The theory of spatial or proximity voting (Black 1948; Downs 1957; Enelow and
Hinich 1984) is one of the central ideas in scholarship on voting and elections. The
spatial voting theory’s most important prediction is that the ideological positions of
candidates and parties should influence voters’ decisions at the ballot box. This electoral
connection helps ensure that legislators are responsive to the views of their constituents
(Mayhew 1974).
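As a stylized illustration of the proximity logic (the ideal points below are invented for the example, not estimates from the studies discussed in this section), the base R sketch predicts each voter's choice from his or her distance to two candidates placed on a common scale.

    # Hypothetical ideal points for five voters and two candidates on a common scale.
    voter_ideal <- c(-1.2, -0.3, 0.1, 0.8, 1.5)
    candidates  <- c(D = -0.7, R = 0.9)

    # Proximity prediction: each voter supports the ideologically closer candidate.
    dist_D <- abs(voter_ideal - candidates["D"])
    dist_R <- abs(voter_ideal - candidates["R"])
    predicted_vote <- ifelse(dist_D < dist_R, "D", "R")
    predicted_vote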
In recent years a number of prominent papers in the American politics literature have
examined whether citizens vote for the most spatially proximate congressional candi-
date (e.g., Jessee 2009; Joesten and Stone 2014; Shor and Rogowski 2018; Simas 2013).
These studies all proceed by estimating the policy preferences of citizens and legislators
on a common scale. This enables them to examine whether citizens vote for the most
spatially proximate candidate. However, it is important to note that there are three major
limitations of this literature. First, Lewis and Tausanovitch (2013) and Jessee (2016)
show that joint scaling models rely on strong assumptions that undermine their plau-
sibility. These studies suggest that scholars should exercise caution in using estimates
that jointly scale legislators and the mass public into the same latent space. Second,
Tausanovitch and Warshaw (2017) show that existing measures of candidates’ ideology
only improve marginally on the widely available heuristic of party identification. As a
result, they conclude that these measures fall short when it comes to testing theories of
representation and spatial voting in Congress. Third, there is little attention to causal
identification in the literature on spatial voting. Most studies in this literature use cross-​
sectional regressions that do not clearly differentiate spatial proximity between voters
and candidates from other factors that may influence voters’ decisions at the ballot box.
Future studies on spatial voting in congressional elections are likely to use new
advances in measurement and causal inference to overcome these limitations. There
are likely to continue to be rapid advances in our ability to measure the ideology of po-
litical candidates. Moreover, Jessee (2016) points the way toward several promising
approaches to improve the plausibility of models that jointly scale the policy liberalism
of candidates and the mass public into the same space.
There is also a growing amount of work on spatial voting in a comparative perspective.
For example, Saiegh (2015) jointly scales voters, parties, and politicians from different
Latin American countries on a common ideological space. This study’s findings indicate
that ideology is a significant determinant of vote choice in Latin America. However, it is
important to note that many of the challenges discussed above in the American context
also face scholars of spatial voting in comparative politics.

Dynamic Representation in the United States


A limitation of virtually all of the existing studies on representation is that they use
cross-​sectional research designs. This makes it impossible to examine policy change,
which is both theoretically limiting and problematic for strong causal inference since
the temporal order of the variables cannot be established (Lowery, Gray, and Hager
1989; Ringquist and Garand 1999). Indeed, most existing studies cannot rule out reverse
causation. For example, cross-​sectional studies of dyadic representation in Congress
could be confounded if legislators’ actions are causing changes in district-​level public
opinion (Lenz 2013; Grose, Malhotra, and Van Houweling 2015). To address these
concerns, the next generation of studies in this area is likely to focus on whether changes
in public opinion lead to changes in political outcomes (e.g., Page and Shapiro 1983;
Erikson, MacKuen, and Stimson 2002; Caughey and Warshaw 2017).

Representation in Comparative Politics


Compared to the United States, there has been much less attention to the study of mass-​
elite linkages in other advanced democracies (Powell 2004, 283–​284). One of the pri-
mary barriers to research on representation in comparative politics has been the lack
of good measures of constituency preferences. However, the availability of new models
to scale latent opinion and of new methods to smooth the estimates of opinion across
geography and over time has the potential to facilitate a new generation of research on
representation in comparative politics (e.g., Lupu and Warner Forthcoming).
Hanretty, Lauderdale, and Vivyan (2016) examine the dyadic association between
members of the British parliament and their constituencies. They use an IRT model to
estimate the British public’s policy liberalism on economic issues and an MRP model to
estimate the preferences of each constituency. They find a strong association between
constituency opinion and members’ behavior on a variety of left-​right issues.
The next generation of work on representation in comparative politics is likely to
focus on whether public policies are responsive to public opinion and what institutional
conditions facilitate responsiveness. Do changes in levels of government spending re-
flect dynamics in the mass public’s policy liberalism on economic issues (Soroka and
Wlezien 2005)? Are the immigration policies of European countries responsive to the
policy preferences of their citizens on immigration issues? Do countries’ decisions
about war and peace reflect the latent preferences of citizens for retribution (Stein 2015)?
Do changes in religious conservatism affect democratic stability or the onset of civil war
(Sumaktoyo 2015)?

Methodological Frontiers

There are also a variety of important methodological frontiers in research on latent
constructs in public opinion. An important one is the question of how to properly assess
the appropriate number of dimensions required to summarize public opinion. Indeed,
there is little agreement in the literature about how to assess the dimensionality of public
opinion data. Another important frontier is the development of better computational
methods to work with large public opinion data sets. Computational challenges are one
of the main barriers facing scholars who wish to develop complicated latent variable
models for large public opinion data sets. A third frontier is the continued develop-
ment of better statistical methods to summarize latent opinion at the subnational level.
Finally, there has recently been an explosion of work that examines public opinion using
non-​survey-​based data. This work is likely to continue to grow in the years to come.

Assessing Dimensionality
The question of whether a particular latent construct is best modeled with one or mul-
tiple dimensions is not easily resolved. For example, a variety of studies find that the
main dimension of latent policy liberalism or ideology in the United States is dominated
by economic policy items (e.g., Ansolabehere, Rodden, and Snyder 2006). However,
there is a vigorous, ongoing debate about whether social issues map to the main dimen-
sion or constitute a second dimension of latent policy liberalism. Some studies find that
social issues constitute a second dimension of latent policy liberalism (Ansolabehere,
Rodden, and Snyder 2006; Treier and Hillygus 2009), while others find that social
issues map to the main dimension of policy liberalism (Jessee 2009; Tausanovitch and
Warshaw 2013), at least in the modern era. One of the challenges in this literature has
been that there is little agreement about how to assess the dimensionality of public
opinion data. Another challenge is that existing computational approaches are often ill-​
suited to estimating multidimensional models.
Future studies should seek to rigorously examine the appropriate number of
dimensions required to summarize public opinion. At a theoretical level, scholars should
offer clear criteria for assessing the appropriate number of dimensions. At an empirical
level, scholars should examine whether the dimensionality of the mass public’s policy
liberalism, as well as other latent constructs, varies across geography or over time. For
example, it is possible that the public’s policy liberalism was multidimensional during
the mid-​twentieth century but has gradually collapsed to a single dimension along sim-
ilar lines to the increasingly one-​dimensional roll-​call voting in Congress.
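One simple diagnostic, offered here only as an illustrative heuristic rather than a resolution of this debate, is to inspect the eigenvalues of the inter-item correlation matrix. The base R sketch below applies that check to simulated responses generated from a single latent dimension; all of the simulation settings are arbitrary.

    # Simulate binary responses to ten policy items driven by one latent trait.
    set.seed(1)
    n <- 1000; k <- 10
    theta <- rnorm(n)                        # latent policy liberalism
    discrimination <- runif(k, 0.8, 2)
    difficulty <- rnorm(k)
    probs <- plogis(outer(theta, discrimination) -
                    matrix(difficulty, n, k, byrow = TRUE))
    responses <- (probs > matrix(runif(n * k), n, k)) * 1

    # A crude first check on dimensionality: a single dominant eigenvalue is
    # consistent with one underlying dimension; several large eigenvalues
    # suggest that more dimensions may be needed.
    round(eigen(cor(responses))$values, 2)

Polychoric correlations, parallel analysis, and formal model comparison are all more defensible choices for real survey items; the point here is only to show the basic logic of an empirical dimensionality check.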

Computational Challenges
Computational challenges are one of the main barriers facing scholars who wish to de-
velop complicated latent variable models for large public opinion data sets. Standard
Bayesian Markov chain Monte Carlo (MCMC) algorithms can be quite slow when
applied to large data sets. As a result, researchers are often unable to estimate their
models using all the data and are forced to make various shortcuts and compromises
(Imai, Lo, and Olmsted 2015). Since a massive data set implies a large number of
parameters under these models, the convergence of MCMC algorithms also becomes
difficult to assess.
Fortunately there is a large body of ongoing work seeking to address the computational
challenges in large-​scale latent variable models. Andrew Gelman and his collaborators
have recently developed the software package Stan to perform fully Bayesian infer-
ence (Gelman, Lee, and Guo 2015).6 While Stan is an improvement on earlier MCMC
algorithms, it is still relatively slow with large data sets. An alternative approach is to uti-
lize expectation-​maximization (EM) algorithms that approximately maximize the pos-
terior distribution under various ideal point models (Imai, Lo, and Olmsted 2015). The
main advantage of EM algorithms is that they can dramatically reduce computational
time. They can estimate an extremely large number of ideal points on a laptop within a
few hours. However, they generally do not produce accurate estimates of uncertainty,
which can reduce their usefulness for many empirical applications (Jackman 2008).7

Measuring Subnational Latent Opinion


Which groups are better represented? Are the rich better represented than the poor
(Erikson 2015)? Do voters receive better representation than nonvoters (Griffin and
Newman 2005)? Are whites better represented than racial minorities? To answer
questions such as these, we need to develop accurate estimates of the latent opinion of
demographic subgroups within individual states and other geographic units.
Most existing smoothing models are ill-​suited to examine questions such as these, be-
cause they assume that differences in the opinions of various demographic groups, such
as blacks and whites, are constant across geography.8 To address these complications,
new smoothing models should incorporate more complicated interactions between
demographics and geography (e.g., Leemann and Wasserfallen 2016). For example, they
might allow the relationship between income and latent opinion to vary across geog-
raphy, using racial diversity as a hierarchical predictor for this relationship (Hersh and
Nall 2015). In the best example of recent work in this area, Ghitza and Gelman (2013)
build an MRP model with a complicated set of interactions that enables them to model
the voting behavior of different income and racial groups in each state. They find that
swings in turnout between the 2004 and 2008 presidential elections were primarily con-
fined to African Americans and young minorities.
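The sketch below shows how such a specification might be expressed with lme4's glmer, using simulated data and hypothetical variable names (income, diversity, state). It is meant to convey the general strategy of letting a demographic slope vary across geography, not to reproduce the Ghitza and Gelman (2013) model.

    library(lme4)

    # Simulated individual-level data (hypothetical): a binary opinion y, scaled
    # respondent income, a state identifier, and a state-level diversity measure.
    set.seed(2)
    n_states <- 20; n <- 2000
    state_id <- sample(seq_len(n_states), n, replace = TRUE)
    diversity_by_state <- runif(n_states)
    income <- rnorm(n)
    diversity <- diversity_by_state[state_id]
    p <- plogis(-0.2 + 0.5 * income - 0.6 * income * diversity)
    dat <- data.frame(y = rbinom(n, 1, p), income, diversity,
                      state = factor(state_id))

    # Varying-intercept, varying-slope model: the income-opinion slope differs
    # across states, and diversity enters as a state-level predictor of that
    # slope through the cross-level interaction.
    fit <- glmer(y ~ income * diversity + (1 + income | state),
                 data = dat, family = binomial)
    summary(fit)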
Scholars should be aware, however, that there is a trade-​off between bias and error
when they are developing more complicated smoothing models. More complicated
models will inevitably reduce bias in estimates of subgroups’ opinion. But more compli-
cated models will generally have less shrinkage across geography than simpler models,
which is likely to lead to greater error in the estimates for any particular group. Indeed,
Lax and Phillips (2013) find that more complicated interactions between demographic
categories often lead to substantially less accurate estimates of mean opinion in each
geographic unit.

Beyond Surveys
In recent years, there has been an explosion of work that examines public opinion
using non-​survey-​based data. For example, Bonica (2013, 2014) scales millions of
campaign contributions to measure the latent campaign finance preferences of
millions of Americans. Bond and Messing (2015) demonstrate that social media
data represent a useful resource for testing models of legislative and individual-​
level political behavior and attitudes. They develop a model to estimate the ideology
of politicians and their supporters using social media data on individual citizens’
endorsements of political figures. Their measure places on the same scale politicians
and more than six million citizens who are active in social media. Similarly, Barberá
(2015) develops a model to measure the political ideology of Twitter users based on
the assumption that their ideology can be inferred by examining which political
actors each user is following. He applies this method to estimate ideal points for a
large sample of both elite and mass public Twitter users in the United States and five
European countries.
While these new methods are very promising, scholars still need to carefully
define the target population of interest. For example, Bond and Messing’s (2015)
estimates of the ideology of Facebook users are not necessarily representative of
the United States as a whole since not everyone uses Facebook. Another limita-
tion of these sources of data is that they are generally only available for recent time
periods. Thus, they are unsuitable for extending our knowledge of public opinion
back in time using dynamic measurement models. Finally, it is often unclear what
theoretical construct these new models are capturing. For example, are campaign
finance data capturing donors’ ideology, partisanship, or some other latent con-
struct? To evaluate this question, scholars could compare a given set of individuals’
campaign finance preferences with their Twitter ideal points, with their Facebook
ideal points, or with policy liberalism from survey data (see, e.g., Hill and Huber
Forthcoming).
Conclusion

This is an exciting time to be doing research that utilizes latent constructs in public
opinion. The development of new and improved methods for summarizing latent
constructs in public opinion has led to a wide variety of substantive advances, in-
cluding work on polarization, representation, political knowledge, and racial re-
sentment. The next generation of work in American politics is likely to focus on
areas such as assessing changes in mass polarization over time at the subnational
level, dynamic representation at the state and local levels, and spatial voting
in elections. There is also a growing body of work in comparative politics that
utilizes latent constructs in public opinion to examine important questions such
as the causes and consequences of political knowledge, dyadic representation in
Westminster democracies, and the effect of changes in religious conservatism on
democratic stability.

Data and Example Code

Public Opinion Data Sources


• Roper Center for Public Opinion Research, http://www.ropercenter.uconn.edu
• ICPSR, http://www.icpsr.umich.edu/icpsrweb/landing.jsp
• Odum Institute Archive Dataverse, http://arc.irss.unc.edu/dvn/dv/odvn
• National Annenberg Election Survey, http://www.annenbergpublicpolicycenter.org/political-communication/naes/
• American National Election Studies, http://www.electionstudies.org
• Cooperative Congressional Election Study, http://projects.iq.harvard.edu/cces/home

Models for Measuring Latent Opinion at the Individual Level

• Bayesian factor-analytic and IRT models can be run using off-the-shelf software
such as MCMCpack (Martin et al. 2011) or the ideal function in the R package pscl
(Jackman 2012); a brief simulated example follows this list.
• A variety of EM IRT models can be run using the R package emIRT (Imai, Lo, and
Olmsted 2015).
• For more complicated IRT models, researchers can use fully Bayesian software
such as Bugs, Jags, or Stan.
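As a minimal illustration of the first bullet above, the sketch below simulates binary item responses and fits a one-dimensional Bayesian IRT model with MCMCpack's MCMCirt1d. The simulation settings are arbitrary, the call is a sketch rather than a recommended specification, and readers should consult the package manual for the current arguments and identification options.

    library(MCMCpack)

    # Simulate binary responses for 500 respondents to 12 items (2PL-style model).
    set.seed(4)
    n <- 500; k <- 12
    theta <- rnorm(n)                        # latent trait, e.g., policy liberalism
    difficulty <- rnorm(k)
    discrimination <- runif(k, 0.5, 2)
    eta <- outer(theta, discrimination) - matrix(difficulty, n, k, byrow = TRUE)
    y <- (plogis(eta) > matrix(runif(n * k), n, k)) * 1
    rownames(y) <- paste0("resp", seq_len(n))
    colnames(y) <- paste0("item", seq_len(k))

    # Fit a one-dimensional Bayesian IRT model. Without constraints the latent
    # scale is identified only up to reflection, which is acceptable for a sketch.
    fit <- MCMCirt1d(y, burnin = 1000, mcmc = 5000)
    head(summary(fit)$statistics)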
Model for Measuring Latent Opinion at the Group Level


• Multilevel Regression and Poststratification (MRP) models can be run using the R
package mrp (Malecki et al. 2014); a hand-rolled sketch of the two MRP steps follows
this list.
• Group-​level MRP and IRT models can be run using the R package dgo (Dunham,
Caughey, and Warshaw 2016).
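As referenced in the MRP bullet above, the sketch below hand-rolls the two MRP steps with lme4 rather than a dedicated package: it fits a varying-intercept model and then poststratifies the cell predictions. The data are simulated and the census counts are stand-ins.

    library(lme4)

    # Simulated survey (hypothetical): binary policy support, with respondents
    # nested in states and education groups.
    set.seed(3)
    n <- 3000
    state_names <- paste0("S", 1:10)
    state_effect <- rnorm(10, sd = 0.5)
    svy <- data.frame(
      state = factor(sample(state_names, n, replace = TRUE), levels = state_names),
      educ  = factor(sample(c("hs", "some_college", "college"), n, replace = TRUE))
    )
    svy$y <- rbinom(n, 1, plogis(-0.3 + 0.4 * (svy$educ == "college") +
                                 state_effect[as.integer(svy$state)]))

    # Step 1: a multilevel (partial-pooling) model of opinion.
    fit <- glmer(y ~ (1 | state) + (1 | educ), data = svy, family = binomial)

    # Step 2: poststratify. Predict every state-by-education cell and weight the
    # cell predictions by (stand-in) census counts to form state-level estimates.
    cells <- expand.grid(state = levels(svy$state), educ = levels(svy$educ),
                         stringsAsFactors = TRUE)
    cells$pred  <- predict(fit, newdata = cells, type = "response")
    cells$count <- sample(1000:5000, nrow(cells), replace = TRUE)  # stand-in counts
    state_est <- tapply(cells$pred * cells$count, cells$state, sum) /
                 tapply(cells$count, cells$state, sum)
    round(state_est, 2)

In applied work the poststratification frame would come from the census, and the model would include many more demographic and geographic predictors, typically with unit-level covariates as hierarchical predictors.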

Notes
1. For a more general overview of the sources of measurement error on surveys, see Biemer
et al. (2011).
2. Some studies call this latent construct “mood” (Stimson 1991), others call it “ideology”
(Hill and Tausanovitch 2015), and others call it a measure of citizens’ “ideal points” (Bafumi
and Herron 2010), while still others call it “policy preferences” (Treier and Hillygus 2009;
Tausanovitch and Warshaw 2013) or “policy liberalism” (Caughey and Warshaw 2015). In
the balance of this chapter I generally call this latent construct “policy liberalism” to distin-
guish it from symbolic ideology or other related concepts.
3. See below for more discussion of the advantages and disadvantages of emIRT.
4. See Kastellec, Lax, and Phillips (2010) for a primer about estimating MRP models in R.
5. In terms of a spatial model, κj is the cutpoint, or point of indifference between two choices.
6. Stan uses the no-​U-​turn sampler (Hoffman and Gelman 2014), an adaptive variant of
Hamiltonian Monte Carlo, which itself is a generalization of the familiar Metropolis al-
gorithm. It performs multiple steps per iteration to move efficiently through the posterior
distribution.
7. See Hill and Tausanovitch (2015) for an example of where inaccurate characterization of un-
certainty from a latent variable model would change the conclusions of an important sub-
stantive analysis.
8. Several studies have shown that differences in the opinion of various demographic groups
are far from constant. For instance, Gelman et al. (2009) and Hersh and Nall (2015) show
that income is more correlated with opinion in poorer, racially diverse areas. In richer areas
with less diversity, there is little link between income and opinion.

References
Abramowitz, A. I., and K. L. Saunders. 1998. “Ideological Realignment in the US Electorate.”
Journal of Politics 60 (3): 634–​652.
Achen, C. H. 1975. “Mass Political Attitudes and the Survey Response.” American Political
Science Review 69 (4): 1218–​1231.
Adcock, R., and D. Collier. 2001. “Measurement Validity: A Shared Standard for Qualitative
and Quantitative Research.” American Political Science Review 95 (3): 529–​546.
Anderson, B. A., B. D. Silver, and P. R. Abramson. 1988. “The Effects of the Race of the
Interviewer on Race-​Related Attitudes of Black Respondents in SRC/​CPS National Election
Studies.” Public Opinion Quarterly 52 (3): 289–​324.
Ansolabehere, S., J. M. Snyder Jr., and C. Stewart III. 2001. “Candidate Positioning in U.S.
House Elections.” American Journal of Political Science 45 (1): 136–​159.
Ansolabehere, S., J. Rodden, and J. M Snyder. 2006. “Purple America.” Journal of Economic
Perspectives 20 (2): 97–​118.
Ansolabehere, S., J. Rodden, and J. M. Snyder Jr. 2008. “The Strength of Issues: Using Multiple
Measures to Gauge Preference Stability, Ideological Constraint, and Issue Voting.” American
Political Science Review 102 (2): 215–​232.
Bafumi, J., and M. C. Herron. 2010. “Leapfrog Representation and Extremism:  A Study of
American Voters and Their Members in Congress.” American Political Science Review 104
(3): 519–​542.
Barabas, J., J. Jerit, W. Pollock, and C. Rainey. 2014. “The Question (s) of Political Knowledge.”
American Political Science Review 108 (4): 840–​855.
Barber, M. J. 2016. “Representing the Preferences of Donors, Partisans, and Voters in the US
Senate.” Public Opinion Quarterly 80 (S1): 225–​249.
Barber, M., and N. McCarty. 2015. “Causes and Consequences of Polarization.” In Solutions to
Polarization in America, edited by Nathaniel Persily, 15–​58. Cambridge University Press.
Barberá, P. 2015. “Birds of the Same Feather Tweet Together: Bayesian Ideal Point Estimation
Using Twitter Data.” Political Analysis 23 (1): 76–​91.
Bartle, J., S. Dellepiane-​Avellaneda, and J. Stimson. 2011. “The Moving Centre:  Preferences
for Government Activity in Britain, 1950–​2005.” British Journal of Political Science 41
(2): 259–​285.
Bartels, L. M. 1996. “Uninformed Votes:  Information Effects in Presidential Elections.”
American Journal of Political Science 40 (1): 194–​230.
Bartels, L. M. 2009. “Economic Inequality and Political Representation.” In The
Unsustainable American State, eds. Lawrence Jacobs and Desmond King, 167–​196. Oxford
University Press.
Berinsky, A. J., M. F. Margolis, and M. W. Sances. 2014. “Separating the Shirkers from the
Workers? Making Sure Respondents Pay Attention on Self-​ Administered Surveys.”
American Journal of Political Science 58 (3): 739–​753.
Biemer, P. P., R. M. Groves, L. E. Lyberg, N. A. Mathiowetz, and S. Sudman. 2011. Measurement
Errors in Surveys, vol. 173. New York: John Wiley & Sons.
Black, D. 1948. “On the Rationale of Group Decision-​Making.” Journal of Political Economy 56
(1): 23–​34.
Bollen, K. A. 1989. Structural Equations with Latent Variables. Series in Probability and
Mathematical Statistics. New York: John Wiley and Sons.
Bond, R., and S. Messing. 2015. “Quantifying Social Media’s Political Space: Estimating
Ideology from Publicly Revealed Preferences on Facebook.” American Political Science
Review 109 (1): 62–​78.
Bonica, A. 2013. “Ideology and Interests in the Political Marketplace.” American Journal of
Political Science 57 (2): 294–​311.
Bonica, A. 2014. “Mapping the Ideological Marketplace.” American Journal of Political Science
58 (2): 367–​386.
Brace, P., K. Sims-​Butler, K. Arceneaux, and M. Johnson. 2002. “Public Opinion in the
American States:  New Perspectives using National Survey Data.” American Journal of
Political Science 46 (1): 173–​189.
Broockman, D. E. 2016. “Approaches to Studying Policy Representation.” Legislative Studies
Quarterly 41 (1): 181–​215.
Buttice, M. K., and B. Highton. 2013. “How Does Multilevel Regression and Poststratification
Perform with Conventional National Surveys?” Political Analysis 21 (4): 449–​467.
Carmines, E. G., P. M. Sniderman, and B. C. Easter. 2011. “On the Meaning, Measurement, and
Implications of Racial Resentment.” Annals of the American Academy of Political and Social
Science 634 (1): 98–​116.
Carsey, T. M., and J. J. Harden. 2010. “New Measures of Partisanship, Ideology, and Policy
Mood in the American States.” State Politics & Policy Quarterly 10 (2): 136–​156.
Caughey, D., T. O’Grady, and C. Warshaw. 2015. “Ideology in the European Mass Public: A
Dynamic Perspective.” Paper presented at the 2015 ECPR General Conference in Montreal,
Canada.
Caughey, D., and C. Warshaw. 2015. “Dynamic Estimation of Latent Public Opinion Using a
Hierarchical Group-​Level IRT Model.” Political Analysis 23 (2): 197–​211.
Caughey, D., and C. Warshaw. 2017. “Policy Preferences and Policy Change: Dynamic Responsiveness in the American States, 1936–2014.” American Political Science Review, 1–18.
Clinton, J. D. 2006. “Representation in Congress:  Constituents and Roll Calls in the 106th
House.” Journal of Politics 68 (2): 397–​409.
Clinton, J., S. Jackman, and D. Rivers. 2004. “The Statistical Analysis of Roll Call Data.”
American Political Science Review 98 (2): 355–​370.
Converse, P. E., and R. Pierce. 1986. Political Representation in France. Cambridge, MA: Harvard
University Press.
De Boef, S., and P. M. Kellstedt. 2004. “The Political (and Economic) Origins of Consumer
Confidence.” American Journal of Political Science 48 (4): 633–​649.
Delli Carpini, M. X., and S. Keeter. 1993. “Measuring Political Knowledge: Putting First Things
First.” American Journal of Political Science 37 (4): 1179–​1206.
Downs, A. 1957. An Economic Theory of Democracy. New York: Harper and Row.
Dunham, J., D. Caughey, and C. Warshaw. 2016. “dgo: Dynamic Estimation of Group-​level
Opinion.” R package version 0.2.3. https://​jamesdunham.github.io/​dgo/​.
Elmendorf, C. S., and D. M. Spencer. 2014. “The Geography of Racial Stereotyping: Evidence
and Implications for VRA Preclearance After Shelby County.” California Law Review
102: 1123–​1180.
Enelow, J. M., and M. J. Hinich. 1984. The Spatial Theory of Voting:  An Introduction.
Cambridge: Cambridge University Press.
Enns, P. K., and J. Koch. 2013. “Public Opinion in the U.S. States: 1956 to 2010.” State Politics and
Policy Quarterly 13 (3): 349–​372.
Erikson, R. S. 2015. “Income Inequality and Policy Responsiveness.” Annual Review of Political
Science 18: 11–​29.
Erikson, R. S., G. C. Wright, and J. P. McIver. 1993. Statehouse Democracy: Public Opinion and
Policy in the American States. New York: Cambridge University Press.
Erikson, R. S., M. B. MacKuen, and J. A. Stimson. 2002. The Macro Polity. New York: Cambridge
University Press.
Fox, J.-​P. 2010. Bayesian Item Response Modeling: Theory and Applications. Springer. e-​book.
Gelman, A., B. Shor, D. Park, and J. Cortina. 2009. Red State, Blue State, Rich State,
Poor State:  Why Americans Vote the Way They Do. Princeton, NJ:  Princeton
University Press.
Gelman, A., D. Lee, and J. Guo. 2015. “Stan:  A probabilistic Programming Language for
Bayesian Inference and Optimization.” Journal of Educational and Behavioral Statistics 40
(5): 530–​543.
Ghitza, Y., and A. Gelman. 2013. “Deep Interactions with MRP: Election Turnout and Voting
Patterns among Small Electoral Subgroups.” American Journal of Political Science 57
(3): 762–​776.
Gilens, M. 2012. Affluence and Influence: Economic Inequality and Political Power in America.
Princeton, NJ: Princeton University Press.
Griffin, J. D, and B. Newman. 2005. “Are Voters Better Represented?” Journal of Politics 67
(4): 1206–​1227.
Grose, C. R., N. Malhotra, and R. P. Van Houweling. 2015. “Explaining Explanations: How
Legislators Explain Their Policy Positions and How Citizens React.” American Journal of
Political Science 59 (3): 724–​743.
Hanretty, C., B. E. Lauderdale, and N. Vivyan. 2016. “Dyadic Representation in a Westminster
System.” Legislative Studies Quarterly. In Press.
Hersh, E. D., and C. Nall. 2015. “The Primacy of Race in the Geography of Income-​Based
Voting: New Evidence from Public Voting Records.” American Journal of Political Science 60
(2): 289–​303.
Hill, S J. 2015. “Institution of Nomination and the Policy Ideology of Primary Electorates.”
Quarterly Journal of Political Science 10 (4): 461–​487.
Hill, S., and G. Huber. Forthcoming. “Representativeness and Motivations of Contemporary
Contributors to Political Campaigns:  Results from Merged Survey and Administrative
Records.” Political Behavior.
Hill, S. J, and C. Tausanovitch. 2015. “A Disconnect in Representation? Comparison of Trends
in Congressional and Public Polarization.” Journal of Politics 77 (4): 1058–​1075.
Hoffman, M. D., and A. Gelman. 2014. “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research 15 (1): 1593–1623.
Imai, K., J. Lo, and J. Olmsted. 2016. “Fast Estimation of Ideal Points with Massive Data.”
American Political Science Review 110 (4): 631–656.
Jackman, S. 2008. “Measurement.” In The Oxford Handbook of Political Methodology, edited
by Janet M. Box-​Steffensmeier, Henry E. Brady, and David Collier, 119–​151. Oxford: Oxford
University Press.
Jackman, S. 2009. Bayesian Analysis for the Social Sciences. Hoboken, NJ: John Wiley and Sons.
Jackman, S. 2012. “pscl:  Classes and Methods for R Developed in the Political Science
Computational Laboratory, Stanford University.” Department of Political Science, Stanford
University. R package version 1.04.4.
Jessee, S. A. 2009. “Spatial Voting in the 2004 Presidential Election.” American Political Science
Review 103 (1): 59–​81.
Jessee, S. 2016. “(How) Can We Estimate the Ideology of Citizens and Political Elites on the
Same Scale?” American Journal of Political Science 60 (4): 1108–​1124.
Joesten, D. A., and W. J. Stone. 2014. “Reassessing Proximity Voting:  Expertise, Party, and
Choice in Congressional Elections.” Journal of Politics 76 (3): 740–​753.
Johnson, V. E., and J. H. Albert. 2006. Ordinal Data Modeling. New York, NY: Springer Science
& Business Media.
Kaplan, D. 2004. The Sage Handbook of Quantitative Methodology for the Social Sciences.
Thousand Oaks, CA: Sage Publications Inc.
Kastellec, J. P., J. R. Lax, M. Malecki, and J. H. Phillips. 2015. “Polarizing the Electoral
Connection: Partisan Representation in Supreme Court Confirmation Politics.” Journal of
Politics 77 (3): 787–​804.
Kastellec, J. P., J. R. Lax, and J. Phillips. 2010. “Estimating State Public Opinion with Multi-​level
Regression and Poststratification Using R.” Unpublished manuscript.
Kinder, D. R., and L. M. Sanders. 1996. Divided by Color: Racial Politics and Democratic Ideals.
Chicago: University of Chicago Press.
Lax, J. R., and J. H. Phillips. 2009a. “Gay Rights in the States:  Public Opinion and Policy
Responsiveness.” American Political Science Review 103 (3): 367–​386.
Lax, J. R., and J. H. Phillips. 2009b. “How Should We Estimate Public Opinion in the States?”
American Journal of Political Science 53 (1): 107–​121.
Lax, J. R, and J. H. Phillips. 2013. “How Should We Estimate Sub-​national Opinion using MRP?
Preliminary Findings and Recommendations.” Working paper.
Lee, D. S., E. Moretti, and M. J. Butler. 2004. “Do Voters Affect or Elect Policies? Evidence from
the U. S. House.” Quarterly Journal of Economics 119 (3): 807–​859.
Leemann, L., and F. Wasserfallen. 2016. “Extending the Use and Prediction Precision of Subnational Public Opinion Estimation.” American Journal of Political Science. In Press.
Lemmon, M., and E. Portniaguina. 2006. “Consumer Confidence and Asset Prices:  Some
Empirical Evidence.” Review of Financial Studies 19 (4): 1499–​1529.
Lenz, G. S. 2013. Follow the Leader? How Voters Respond to Politicians’ Policies and Performance.
Chicago: University of Chicago Press.
Lewis, J. B. 2001. “Estimating Voter Preference Distributions from Individual-​Level Voting
Data.” Political Analysis 9 (3): 275–​297.
Lewis, J. B., and C. Tausanovitch. 2013. “Has Joint Scaling Solved the Achen Objection to Miller
and Stokes.” Paper presented at Political Representation: Fifty Years after Miller & Stokes,
Center for the Study of Democratic Institutions, Vanderbilt University, Nashville, TN,
March 1–​2.
Lowery, D., V. Gray, and G. Hager. 1989. “Public Opinion and Policy Change in the American
States.” American Politics Research 17 (1): 3–​31.
Lu, Y., Y. Chu, and F. Shen. 2016. “Mass Media, New Technology, and Ideology: An Analysis of Political Trends in China.” Global Media and China. In Press.
Ludvigson, S. C. 2004. “Consumer Confidence and Consumer Spending.” Journal of Economic
Perspectives 18 (2): 29–​50.
Lupu, N., and Z. Warner. Forthcoming. “Mass–​Elite Congruence and Representation in
Argentina.” In Malaise in Representation in Latin American Countries:  Chile, Argentina,
Uruguay, edited by Alfredo Joignant, Mauricio Morales, and Claudio Fuentes.
New York: Palgrave Macmillan.
Malecki, M., J. Lax, A. Gelman, and W. Wang. 2014. “mrp:  Multilevel Regression and
Poststratification.” R package version 1.0-​1. https://​github.com/​malecki/​mrp.
Margolis, M. F. 2018. From Politics to the Pews: How Partisanship and the Political Environment
Shape Religious Identity. University of Chicago Press.
Martin, A. D., K. M. Quinn, and J. H. Park. 2011. “MCMCpack: Markov Chain Monte Carlo in R.”
Journal of Statistical Software 42 (9): 1–​21.
Mayhew, D. 1974. The Electoral Connection. New Haven, CT: Yale University Press.
McAndrew, S., and D. Voas. 2011. “Measuring Religiosity using Surveys.” Survey Question
Bank: Topic Overview 4 (2): 1–​15.
McGann, A. J. 2014. “Estimating the Political Center from Aggregate Data: An Item Response
Theory Alternative to the Stimson Dyad Ratios Algorithm.” Political Analysis 22 (1): 115–​129.
Miller, W. E., and D. E. Stokes. 1963. “Constituency Influence in Congress.” American Political
Science Review 57 (1): 45–​56.
Mislevy, R. J. 1983. “Item Response Models for Grouped Data.” Journal of Educational Statistics
8 (4): 271–​288.
Mondak, J. J. 2001. “Developing Valid Knowledge Scales.” American Journal of Political Science
45 (1): 224–​238.
Montgomery, J. M., and J. Cutler. 2013. “Computerized Adaptive Testing for Public Opinion
Surveys.” Political Analysis 21 (2): 172–​192.
Mueller, E. 1963. “Ten Years of Consumer Attitude Surveys: Their Forecasting Record.” Journal
of the American Statistical Association 58 (304): 899–​917.
Munzert, S., and P. C. Bauer. 2013. “Political depolarization in German public opinion, 1980–​
2010.” Political Science Research and Methods 1 (1): 67–​89.
Page, B. I., and R. Y. Shapiro. 1983. “Effects of Public Opinion on Policy.” American Political
Science Review 77 (1): 175–​190.
Pan, J., and Y. Xu. 2018. “China’s Ideological Spectrum.” Journal of Politics 80 (1): 254–273.
Park, D. K., A. Gelman, and J. Bafumi. 2004. “Bayesian Multilevel Estimation with
Poststratification:  State-​ Level Estimates from National Polls.” Political Analysis 12
(4): 375–​385.
Pereira, F. B. 2015. “Measuring Political Knowledge Across Countries.” Paper presented at the
2015 Midwest Political Science Association conference.
Poole, K. T., and H. Rosenthal. 2007. Ideology & Congress. New Brunswick, NJ: Transaction
Publishers.
Powell, G. B. 2004. “Political Representation in Comparative Politics.” Annual Review of
Political Science 7: 273–​296.
Quinn, K. M. 2004. “Bayesian Factor Analysis for Mixed Ordinal and Continuous Responses.”
Political Analysis 12 (4): 338–​353.
Ringquist, E. J., and J. C. Garand. 1999. “Policy Change in the American States.” In American
State and Local Politics: Directions for the 21st Century, edited by Ronald E. Weber, and Paul
Brace, 268–​299. New York: Chatham House/​Seven Bridges Press.
Saiegh, S. M. 2015. “Using Joint Scaling Methods to Study Ideology and
Representation: Evidence from Latin America.” Political Analysis 23 (3): 363–​384.
Shor, B., and J. C. Rogowski. 2018. “Ideology and the US congressional vote.” Political Science
Research and Methods 6 (2): 323–341.
Simas, E. N. 2013. “Proximity Voting in the 2010 US House Elections.” Electoral Studies 32
(4): 708–​7 17.
Soroka, S. N., and C. Wlezien. 2005. “Opinion–​ Policy Dynamics:  Public Preferences
and Public Expenditure in the United Kingdom.” British Journal of Political Science 35
(4): 665–​689.
Stein, R. 2015. “War and Revenge: Explaining Conflict Initiation by Democracies.” American
Political Science Review 109 (3): 556–​573.
Stimson, J. A. 1991. Public Opinion in America:  Moods, Cycles, and Swings. Boulder,
CO: Westview.
Stimson, J. A., C. Thiébaut, and V. Tiberj. 2012. “The Evolution of Policy Attitudes in France.”
European Union Politics 13 (2): 293–​316.
Sumaktoyo, N. G. 2015. “Islamic Conservatism and Support for Religious Freedom.” Working
paper presented at the 2016 Southern Political Science Association Conference.
Tarman, C., and D. O. Sears. 2005. “The Conceptualization and Measurement of Symbolic
Racism.” Journal of Politics 67 (3): 731–​761.
Tausanovitch, C., and C. Warshaw. 2013. “Measuring Constituent Policy Preferences in Congress, State Legislatures and Cities.” Journal of Politics 75 (2): 330–342.
Tausanovitch, C., and C. Warshaw. 2014. “Representation in Municipal Government.”
American Political Science Review 108 (3): 605–​641.
Tausanovitch, C., and C. Warshaw. 2017. “Estimating Candidates’ Political Orientation in a
Polarized Congress.” Political Analysis 25(2): 167–187.
Treier, S., and D. S. Hillygus. 2009. “The Nature of Political Ideology in the Contemporary
Electorate.” Public Opinion Quarterly 73 (4): 679–​703.
Treier, S., and S. Jackman. 2008. “Democracy as a Latent Variable.” American Journal of Political
Science 52 (1): 201–​217.
Warshaw, C., and J. Rodden. 2012. “How Should We Measure District-​Level Public Opinion on
Individual Issues?” Journal of Politics 74 (1): 203–​219.
Weissberg, R. 1978. “Collective vs. Dyadic Representation in Congress.” American Political
Science Review 72 (2): 535–​547.
Zaller, J., and S. Feldman. 1992. “A Simple Theory of the Survey Response: Answering Questions
Versus Revealing Preferences.” American Journal of Political Science 36 (3): 579–​616.
Chapter 17

Measuring Group Consciousness
Actions Speak Louder Than Words

Kim Proctor

Introduction

Group consciousness is an important concept in explaining a variety of political factors,
ranging from conceptions of group identity (Smith 2004), to adherence to group norms
(Huddy 2001), to political participation (Gurin, Miller, and Gurin 1980; Miller, Gurin,
Gurin, and Malanchuk 1981; Shingles 1981; Stokes 2003; Sanchez 2006a, 2006b), to par-
tisanship (Highton and Kam 2011; Wallace et al. 2009; Kidd et al. 2007; Welch and Foster
1992; Abramowitz and Saunders 2006), to public opinion (Gurin 1985; Sanchez 2006a;
Conover 1984, 1988; Conover and Feldman 1984; Conover and Sapiro 1993). Given the
large body of evidence demonstrating the power of group consciousness in explaining
political outcomes, one would expect a multitude of well-​tested and statistically valid
measures of group consciousness to be available to researchers. This is not the case,
however, as we lack both theoretical guidance on how to measure group consciousness
and empirical consensus surrounding its operationalization. In short, political scientists
spend a great deal of time discussing group consciousness and how it should be de-
fined, but almost no time examining how it should be measured. This chapter attempts
to bridge this gap between conceptualization and measurement by using item response
theory (IRT) to demonstrate how group consciousness should be quantified for ana-
lytical purposes. Using IRT to measure group consciousness is a major advancement
for political science, as it has stronger theoretical measurement principles and a greater
capacity to solve measurement problems than conventional measurement methods do
(Lord 1980; Hambleton, Swaminathan, and Rogers 1991; Embretson and Reise 2000,
2013; Baker and Kim 2004; van der Linden and Hambleton 1997).
Through IRT, this analysis also speaks to a larger issue in political science, which involves
the proliferation of measurement strategies that are not empirically based. Although I focus
specifically on group consciousness, this methodology could, and should, extend to most
concepts relating to political behavior, such as political knowledge (Carpini and Keeter
1993; Mondak 2001; Jerit, Barabas, and Bolsen 2006; Abrajano 2015), political participa-
tion (Gillion 2009; Harris and Gillion 2012), legislative significance and accomplishment
(Clinton and Lapinski 2006), and tolerance of ethnic minorities (Weldon 2006), which all
have the potential to capture dozens of different, yet related, ideas. Similar to group con-
sciousness, although these constructs may appear relatively conceptually straightforward,
empirical evidence suggests that they are potentially quite difficult to accurately measure.
This is especially problematic because our current measurement strategies for quantifying
these concepts are murky at best and nonexistent at worst. This not only leads to diverging
results and conclusions, but also inhibits scholars of political behavior from forming con-
sensus measures that could validate theoretical results. Consequently, without methodo-
logically validated measures of our constructs, it is impossible to determine if our empirical
results are accurate or are simply the result of inappropriate measurement strategies; differ-
ential item functioning (DIF), which occurs when a survey contains items that are biased
for various subpopulations; or a combination of both factors.
To examine the measurement of group consciousness, I rely on the Pew Research
Center’s “Survey of LGBT Americans” (2013). This survey provides data on the increas-
ingly important, yet consistently understudied, lesbian, gay, bisexual, and transgender
(LGBT) community. The diversity of this sample is particularly important, as it contains
a wide variety of sexual orientations, racial and ethnic minorities, age groups, income
groups, and education categories, which allows this analysis to test for the impact of
subgroup membership on measuring group consciousness. Further, it provides the first
examination of group consciousness outside the racial and ethnic context by including
the politically important and undertheorized LGBT community.

What Is Group Consciousness?

The concept of group consciousness combines in-​group politicized identity with a set
of ideas about a group’s relative status and strategies for improving it (Jackman and
Jackman 1973; Gurin, Miller, and Gurin 1980; Miller, Gurin, and Gurin 1981; Chong
and Rogers 2005; McClain et al. 2009). It is thought to structure the value and meaning
of group identity for minority communities (Smith 2004) and is often conceived of as
multidimensional, including components such as self-​identification, a sense of dis-
satisfaction with the status of the group, identity importance, and identity attachment
(Gurin, Miller, and Gurin 1980; Miller, Gurin, and Gurin 1981; Ashmore, Deaux, and
McLaughlin-​Volpe 2004; Chong and Rogers 2005). Scholars argue that political con-
sciousness is a driving force in the political behavior of minorities by providing group
members with both a “need to act” and a “will to act” (Gamson 1968, 48). To summa-
rize, group consciousness is generally defined as a multidimensional and complex con-
cept relating to a person’s political awareness of his or her group label (Stryker 1980;
Tajfel 1981, 1982; Turner et  al. 1987; Ashmore, Deaux, and McLaughlin-​Volpe 2004).
Because operationalizations shift across fields and range from interpersonal processes
to aggregate-​level products of political action (Brubaker and Cooper 2000), this anal-
ysis focuses on the four distinct conceptual factors that are most relevant:  (1) self-​
categorization, (2) evaluation, (3) importance, and (4) attachment (Ashmore, Deaux,
and McLaughlin-​Volpe 2004).

Self-​Categorization
Self-​categorization refers to the first step in developing group consciousness, as
it represents identification as a member of a particular social group (Deaux 1996;
Ashmore, Deaux, and McLaughlin-​Volpe 2004). It is the precondition for all other
dimensions of group consciousness, because one cannot express pride or importance
in an identity that one does not self-​identify with (Phinney 1991). Research consist-
ently demonstrates the power of self-​categorization, with even arbitrary group labels
eliciting powerful in-​group favoritism among group members (Brewer 1979; Diehl
1989; Tajfel 1982). In this analysis, self-​categorization captures the degree to which
LGBT persons think of themselves as gay and the extent to which they locate their
identities within the gay community. Outwardly labeling oneself as gay is a funda-
mental part of this process, often referred to as “coming out.” When an LGBT person
comes out, he or she explicitly signals to the outside world that he or she categorizes his
or her identity in terms of his or her gayness and that public recognition of this iden-
tity is important. Consequently, as persons increasingly outwardly label themselves as
LGBT, they indicate a heightened level of self-​categorization, signaling higher levels of
group consciousness.
All participants in Pew’s 2013 “Survey of LGBT Americans” self-identify as LGBT,
because this was a prerequisite for participation in the survey.1 However, the survey
also contains a question related to “being out,” or the extent to which a respondent pub-
licly self-​identifies with the LGBT label. Table 17.1 summarizes the self-​categorization

Table 17.1 Self-Categorization in “A Survey of LGBT Americans”

All in all, thinking about the important people in your life, how many are aware that you
are [lesbian, gay, or bisexual]?

                           N      %      Mean   SD
None of them               64     5.6    3.3    0.9
Only a few of them         185    16.1
Some of them               246    21.4
All or most of them        654    56.9
Total                      1,149
item, including a description of the question and response rates for each category. It
demonstrates that the LGBT community reports varying levels of self-​categorization,
with a majority (57%) of respondents reporting that they are out to all or most of the im-
portant people in their lives, and about one in five reporting that they remain “out” to
only some of them (21%) or only a few of them (16%). A minority of respondents (6%)
reported that none of the most important people in their lives are aware of their LGBT
identity.

Evaluation
Following self-​categorization as a group member, one of the first processes an LGBT
person undergoes is evaluation of the group. Evaluation refers to the positive or negative
attachments that a person has toward his or her group identity (Eagly and Chaiken 1993;
Ashmore, Deaux, and McLaughlin-​Volpe 2004). It has two distinct subcomponents,
public evaluation and private evaluation. Public evaluation captures how favorably
the broader population regards the individual’s social group, while private evaluation
captures how favorably the individual regards his or her social group (Crocker et al.
1994; Luhtanen and Crocker 1992; Sellers et al. 1997; Heere and James 2007). In many
cases, there may be a difference between public and private evaluation. For example, an
individual may report pride in having an LGBT identity, yet recognize the discrimina-
tion and societal disapproval that accompany that label.
Public evaluation and private evaluation are theorized to operate along two dis-
tinct dimensions in relation to group consciousness (Crocker et  al. 1994). Negative
public evaluation, which signals that respondents perceive a large amount of discrim-
ination and societal disapproval, is consistently found to indicate heightened levels
of group consciousness (Miller, Gurin, and Gurin 1981; Stokes 2003; Masuoka 2006).
This implies that as perceptions of society’s attitudes toward the group grow more neg-
ative, the group is indicating higher levels of political consciousness. Private evaluation
displays the inverse of this relationship, with positive personal evaluations signaling
higher levels of group consciousness (Abrams and Brown 1989; Trapnell and Campbell
1999). Group members should evaluate their group more positively as their levels of
consciousness rise.
Table 17.2 displays the items that measure public and private evaluation.
Regarding public evaluation, table 17.2 indicates that the majority of respondents
(55%) reported that gays and lesbians face a lot of discrimination in American so-
ciety, although many respondents reported that there was only some discrimina-
tion (38%). The data for private evaluations demonstrates an even higher degree
of variance, with respondents largely divided between reporting neutral attitudes
(57%) or positive attitudes (38%). Therefore, similar to the self-​categorization item,
the evaluation items display a great deal of variance regarding self-​reported group
consciousness.
Table 17.2 Public and Private Evaluation in “A Survey of LGBT Americans”

How much discrimination is there against gays and lesbians in our society today?

                           N      %      Mean   SD
None at all                18     1.6    3.5    0.7
Only a little              66     5.7
Some                       434    37.7
A lot                      632    55.0
Total                      1,150

Thinking about your sexual orientation, do you think of it as mainly something
positive in your life today, mainly something negative in your life today, or it doesn’t
make much of a difference either way?

                                               N      %      Mean   SD
Mainly something negative                      67     5.8    2.3    0.6
Doesn’t make much of a difference either way   659    57.4
Mainly something positive                      422    36.8
Total                                          1,148

Importance
In addition to self-​identifying with a group label and making value judgments regarding
the favorability of that label, the importance of the identity to an individual also captures
his or her level of group consciousness. Importance represents the degree of significance
an individual attaches to his or her group label and overall self-​concept of his or her
group membership as meaningful (Ashmore, Deaux, and McLaughlin-​Volpe 2004).
A fundamental component of identity importance is the concept of psychological cen-
trality (Stryker and Serpe 1994), which captures the extent to which a social category is
essential to an individual’s sense of self (Stryker and Serpe 1994; McCall and Simmons
1978; Rosenberg 1979). When persons report that their group label is important to
their overall sense of identity, they acknowledge the importance and centrality of that
label, indicating that it is a fundamental component of their identity. As the identity
becomes more central to respondents, it indicates higher levels of group consciousness.
Table 17.3 demonstrates the centrality of gay identity in the lives of LGBT Americans,
with the community displaying a large degree of variability. Many respondents report
that the identity is very or extremely important (37%), signaling high levels of group
Table 17.3 Importance in “A Survey of LGBT Americans”

How important, if at all, is being [lesbian, gay, or bisexual] to your overall
identity? Would you say it is . . .

                           N      %      Mean   SD
Not at all important       142    12.4   3.0    1.2
Not too important          263    22.9
Somewhat important         323    28.1
Very important             284    24.7
Extremely important        138    12.0
Total                      1,150

consciousness, while many others report that it is not too or not at all important (35%),
signaling low levels of group consciousness.

Attachment
In addition to the centrality of a group identity, attachment, or the sense of closeness
a person feels toward the larger group based on that identity, is also a distinct and
important component of group consciousness (Ashmore, Deaux, and McLaughlin-​
Volpe 2004). Attachment reflects an individual’s affective involvement while also
capturing the close relationships group members form with other members of the
group (Heere and James 2007). An important component of attachment is inter-
dependence, or the interconnection of the individual to the broader social group,
indicating a merging of the self and the larger community (Mael and Tetrick 1992; Tyler
and Blader 2001). Therefore, when persons report higher levels of interdependence, or
a heightened sense of shared identity with other group members, they are indicating
higher levels of group consciousness. Table 17.4 displays the items related to inter-
dependence, which capture the attitudes of LGBT subgroups toward other commu-
nity members. Participants reported their sense of shared identity for all outgroups,
entailing that a lesbian respondent would only describe her feelings of shared identity
regarding gay men and bisexuals. The average score across all outgroups was rounded
to create a single measure of attachment for each respondent. The results demonstrate
that one-​quarter of respondents (25%) feel that they share a lot of common concerns
with other LGBT persons, and a majority (52%) report that they share some concerns.
A considerably smaller portion of respondents reported sharing only a little (18%) or
nothing at all (4%).
Table 17.4 Attachment in “A Survey of LGBT Americans”

As a [lesbian, gay man, bisexual], how much do you feel you share
common concerns and identity with [lesbians, gay men, bisexuals]?

                           N      %      Mean   SD
Not at all                 50     4.4    3.0    0.8
Only a little              206    17.9
Some                       601    52.4
A lot                      291    25.4
Total                      1,148
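A minimal sketch of the kind of computation described above, assuming a small data frame with one column of shared-identity ratings per outgroup; the column names and values are hypothetical rather than the Pew codebook's.

    # Hypothetical shared-identity ratings (1 = not at all ... 4 = a lot) toward
    # each outgroup; NA marks the respondent's own group, which is not rated.
    ratings <- data.frame(gay_men   = c(NA, 4, 3),
                          lesbians  = c(3, NA, 2),
                          bisexuals = c(4, 2, NA))

    # Attachment score: the rounded average across the outgroups each respondent
    # actually rated.
    attachment <- round(rowMeans(ratings, na.rm = TRUE))
    attachment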

How Should We Measure Group Consciousness?

Although it is not empirically established, scholars often assume that group con-
sciousness is multidimensional, with each subcomponent representing a dis-
tinct dimension. Therefore, the number of variables used ranges widely across
studies. Some reports “use multiple measures to capture the full range of the
multidimensional concept of group consciousness” (Sanchez 2006b, 428; 2008)
and treat these concepts as distinct and independent variables. Other studies
use the subcomponents of group consciousness to create indices, which are pre-
dominantly constructed by adding values across group consciousness variables
(Masuoka 2006; Henderson-​King and Stewart 1994; Jamal 2005; Duncan 1999). Both
approaches are particularly problematic, because constructs should not be mapped to a
specific number of dimensions without examining the underlying structure of the data
(Gerbing and Anderson 1988). Essentially, scholars should not assume multidimen-
sionality (i.e., multiple independent measures) or unidimensionality (i.e., one additive
index); dimensionality must be assessed and empirically validated before measuring
group consciousness.
To date, none of the published articles examining group consciousness measure
the concept based on strong measurement models. For example, only classical test
theory has been used to examine the measurement of group consciousness (Sanchez
and Vargas 2016), and this technique has only been used sparingly. This is problematic,
as classical test theory models assume that measurement precision is constant across
the entire trait range (Fraley, Waller, and Brennan 2000), implying that each measure
will equally capture high, moderate, and low levels of group consciousness. This is in-
correct, however, as most scales tend to accurately capture only one end of a scale. To
demonstrate, many scales of group consciousness may adequately capture persons with
high levels of group consciousness, but may mischaracterize levels of group conscious-
ness across the rest of the distribution. When these scales are utilized, they will only
accurately explain outcomes for the group they capture and will have poor explanatory
value for other groups. Without examining measurement precision, it is impossible to
determine if researchers are forming correct or incorrect conclusions, because there
is a high probability that the results will only apply to certain levels of the latent trait.
Classical test theory is also strongly dependent on the number of scale items and the
sample in use (Embretson 1996; Yen 1986; Fraley, Waller, and Brennan 2000; Hambleton,
Swaminathan, and Rogers 1991).
Classical test theory also fails to account for DIF, which allows us to determine if sub-
group differences are reliable and valid, meaning that they reflect actual differences
between groups, or if they are a function of the survey items (Zumbo 1999). Because
classical test theory assumes that all group differences are the result of “real” varia-
tion, this method fails to account for the fact that many items often “work differently”
or are biased for or against particular subgroups (Embretson and Reise 2000, 249;
Swaminathan and Rogers 1990; Zumbo 1999; Osterlind and Eveson 2009; Holland and
Wainer 2012). Therefore, the differences we observe may not be actual differences at all,
but rather a function of the survey’s measurement bias (Abrajano 2015). This is particu-
larly problematic for group consciousness, because subgroup differences have been an
important component of the literature for decades. For example, important subgroup
differences have been identified relating to socioeconomic status (Masuoka 2006; Jamal
2005; Duncan 1999; Sanchez 2006b), panethnic identity (Jamal 2005; Masuoka 2006;
Sanchez 2006a, 2006b, 2008), sex (Jamal 2005), and age (Jamal 2005; Sanchez 2006b,
2008), among other factors.
Item response theory offers several methodological advantages that allow us to
address these limitations. It refers to models intended to characterize the relationship
between an individual’s responses and the underlying latent trait of interest (van der
Linden and Hambleton 1997; Fraley, Waller, and Brennan 2000; Baker 2001; Embretson
1996; Embretson and Reise 2000). In IRT, theta (θ) represents a latent trait, such as
group consciousness. A significant difference between IRT and classical test theory is
that, unlike classical test theory, IRT uses a search process to determine the latent trait,
rather than a simple computation, such as an additive index (Embretson and Reise
2000). Accordingly, IRT scores group consciousness by finding the level of θ that gives
the maximum likelihood. This trait is quantitative in nature, typically has a mean of zero
and a standard deviation of one, and characterizes θ in terms of the probability of item
endorsement (Fraley, Waller, and Brennan 2000).
The IRT models have two primary assumptions:  (1) the item characteristic curve
(ICC) must be monotonically increasing, and (2) the data are locally independent (Lord
1980; Reise, Widaman, and Pugh 1993; Embretson and Reise 2000). The ICC is a non-
linear regression line that shows the probability of reporting a response category relative
to θ (Fraley, Waller, and Brennan 2000). The ICCs must be monotonically increasing,
meaning that the probability of endorsing an item must increase as levels of θ increase
(Fraley, Waller, and Brennan 2000). Although many different monotonically increasing
functions can be utilized, logistic functions and normal ogive functions are the most
prevalent (Embretson and Reise 2000). The shape of the ICC will vary across items
based on difficulty and discrimination. Difficulty refers to the probability of success-
fully endorsing an item; items that many people endorse are less difficult, while items
that fewer people endorse are more difficult. An ideal instrument contains items that
span a wide range of item difficulties. Discrimination relates to the slope of the ICC and
demonstrates how well an item discriminates between categories of θ. Items with high
levels of discrimination will more accurately distinguish between persons with sim-
ilar levels of θ around the difficulty value. Local independence relates to the relationship
between the IRT model and the data (Embretson and Reise 2000). This assumption
requires that, after we condition on θ, a respondent’s probability of endorsing an item
is independent of the probability of endorsing other items. This assumption is also re-
lated to unidimensionality, which requires that all of the concepts map onto a single
underlying trait.
Given the empirical properties and advantages of IRT, I argue that analyses focusing
on latent constructs, such as group consciousness, should rely on IRT models to measure
θ. Using IRT, I establish each respondent’s level of group consciousness along a quantita-
tive, methodologically based scale.

Data

Pew Research Center’s “Survey of LGBT Americans” (2013) is based on a survey of the
LGBT population conducted April 11–​29, 2013. It includes a nationally representative
sample of 1,197 self-​identified lesbian, gay, bisexual, and transgender adults eighteen years
of age or older. Given the limited sample size of the transgender population, with only 43
respondents, this subgroup is not included in this methodological analysis, because this
sample is inadequate for hypothesis testing due to its limited power (Green 1991; Wilson
Van Voorhis and Morgan 2007). The final sample contained 1,154 LGB persons.
The GfK Group administered the survey using KnowledgePanel, a nationally rep-
resentative online research panel, as considerable research on sensitive issues, such
as sexual orientation and gender identity, demonstrates that online survey admin-
istration is the most likely mode for eliciting honest answers from respondents (Pew
Research Center 2013; Kreuter, Presser, and Tourangeau 2008). KnowledgePanel
recruits participants using probability-​sampling methods and includes persons both
with and without Internet access, those with landlines and cell phones, those with
only cell phones, and persons without a phone. From a sample of 3,645 self-​identified
LGBT panelists, one person per household was recruited into the study, constituting
a sample of 1,924 panelists. From this eligible sample, 62% completed the survey. They
were offered a $10 incentive to complete the process, which increased to $20 toward
the end of the field period to reduce the nonresponse rate. Table 17.5 demonstrates the
Table 17.5 Sexual Orientation in "A Survey of LGBT Americans"

                        N        %
Lesbian                277     24.0
Gay                    398     34.5
Bisexual Female        349     30.2
Bisexual Male          129     11.2
Total                1,153

distribution of lesbians, gay males, and bisexuals in the sample. Gay males represent the
largest group (35%), followed by bisexual females (30%), lesbians (24%), and bisexual
males (11%).

Methods

There are four steps in executing an IRT model:  (1) testing model assumptions,
(2) estimating the parameters, (3) assessing model fit, and (4) examining differential
item functioning. The principal aspects of testing model assumptions are to establish
both unidimensionality and monotonicity (Galecki, Sherman, and Prenoveau 2016).
Exploratory factor analysis with principal components analysis was used to examine the
dimensionality of the data. Table 17.6 shows the results, which indicate that, rather than being the multidimensional construct it is hypothesized to be and regularly operationalized as, group consciousness is unidimensional within this data set.
Unidimensionality is established using eigenvalues and the proportion of variance
explained. The Kaiser criterion (Kaiser 1970) recommends retaining only those factors
with eigenvalues greater than 1. In this analysis, only one factor demonstrated an eigen-
value greater than 1, indicating a unidimensional model. Further, if a group of items is
unidimensional, one factor should explain 20% or more of the total variance for all items
(Reckase 1979; Reeve et al. 2007; Slocum-​Gori and Zumbo 2011). For this model, the
first factor exceeded this criterion by explaining 40.44% of the total variance, with no
other factors exceeding the 20% threshold. Based on these results, the data satisfy the
unidimensionality requirement.
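As a rough illustration of this check, the sketch below computes eigenvalues and the share of variance explained from an item correlation matrix and applies the Kaiser criterion. It is not the chapter's own code; the data frame and column names are hypothetical, and a full analysis would use dedicated factor-analysis routines rather than this bare-bones decomposition.

```python
import numpy as np
import pandas as pd

def unidimensionality_check(items: pd.DataFrame):
    """Eigenvalues and % of variance explained from the item correlation matrix.

    A single factor with an eigenvalue > 1 (Kaiser criterion) that explains at
    least roughly 20% of the total variance is treated as evidence of
    unidimensionality, mirroring the criteria described in the text.
    """
    corr = items.corr().to_numpy()                 # item correlation matrix
    eigvals = np.linalg.eigvalsh(corr)[::-1]       # eigenvalues, descending
    var_explained = 100 * eigvals / eigvals.sum()  # % of total variance
    n_retained = int((eigvals > 1).sum())          # factors passing the Kaiser criterion
    summary = pd.DataFrame({"eigenvalue": eigvals,
                            "variance_explained_pct": var_explained})
    return summary, n_retained

# Hypothetical usage, with `survey` holding the five candidate items:
# summary, n_factors = unidimensionality_check(
#     survey[["self_categorization", "private_evaluation", "public_evaluation",
#             "importance", "attachment"]])
# print(summary.round(2), "\nFactors with eigenvalue > 1:", n_factors)
```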
Mokken scale analysis (MSA; Mokken, 1971, 1997) was used to test the monotonicity
assumption. It examines patterns of responses and validates if these patterns are mono-
tonically increasing, which is required for developing an IRT model. For items to meet
the monotonicity assumption, the Loevinger’s H coefficient, which measures scalability,
should exceed 0.30 (Loevinger et al. 1953; van Schuur 2003; Hardouin 2013; Hemker,
Table 17.6 Unidimensionality and Group Consciousness

             Eigenvalue    Difference    Variance Explained (%)
Factor 1        2.02          1.07              40.44
Factor 2        0.95          0.18              19.00
Factor 3        0.77          0.08              15.31
Factor 4        0.69          0.12              13.79
Factor 5        0.57           .                11.45
N = 1,134
χ2(10) = 615.00, Prob > χ2 = 0.000

Sijtsma, and Molenaar 1995). This MSA indicated that two items, public evaluation and
attachment, violated the monotonicity assumption, demonstrating that neither variable
should be retained in the IRT model.2 Table 17.7 shows that self-categorization, private evaluation, and importance all exceeded the required threshold of 0.30, therefore satisfying the monotonicity assumption and signifying that these three items are appropriate
for measuring group consciousness using IRT.
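For readers who want to see the mechanics behind the threshold, the following sketch computes Loevinger's H for dichotomous (0/1) items using the Guttman-error formulation. The chapter's analysis used Stata's MSP module on the original item codings; this simplified binary version, with a hypothetical response matrix `X`, is only meant to illustrate what the 0.30 criterion is applied to.

```python
import numpy as np

def loevinger_h(X: np.ndarray):
    """Item-level and scale-level Loevinger's H for dichotomous (0/1) items.

    H = 1 - (observed Guttman errors) / (errors expected under independence).
    A Guttman error is endorsing the harder item of a pair while rejecting the
    easier one; values above roughly 0.30 are usually read as satisfying the
    Mokken scalability requirement.
    """
    n, k = X.shape
    p = X.mean(axis=0)                    # endorsement proportion per item
    F = np.zeros((k, k))                  # observed Guttman errors per item pair
    E = np.zeros((k, k))                  # expected errors under independence
    for i in range(k):
        for j in range(k):
            if i == j:
                continue
            easy, hard = (i, j) if p[i] >= p[j] else (j, i)
            F[i, j] = np.sum((X[:, hard] == 1) & (X[:, easy] == 0))
            E[i, j] = n * p[hard] * (1 - p[easy])
    item_h = 1 - F.sum(axis=1) / E.sum(axis=1)
    scale_h = 1 - F.sum() / E.sum()
    return item_h, scale_h

# Hypothetical usage with a 0/1 matrix of respondents x items:
# item_h, scale_h = loevinger_h(X)
```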
Although the variables demonstrated unidimensionality and monotonicity, visual
inspection of the ICCs indicated potential problems with IRT estimation (Koster et al.
2009; Murray et al. 2014; Stochl, Jones, and Croudace 2012). Following an iterative pro-
cess of examining unidimensionality, monotonicity, and model data fit, the variables
were recoded to develop the most optimal model. This model was one with the strongest
support for unidimensionality and monotonicity and the best model fit as measured
by the test information function (TIF), residual analysis, global model fit, and the
Akaike information criterion (AIC) and Bayesian information criterion (BIC) statistics
(Zampetakis et al. 2015). To recode the data, I combined categories within items with the
poorest model fit, while leaving categories with adequate model fit intact until the op-
timal fit was achieved. After numerous iterations and subsequent analysis of model fit,
each item was recoded into a dichotomous measure that captured whether or not a re-
spondent endorsed an item by reporting that he or she had LGBT group consciousness
in that area.3 Table 17.8 summarizes the recoded measures:
With these three items, I used a two-​parameter logistic model (2PL; Thissen and
Steinberg 1986; van der Linden and Hambleton 1997; Embretson and Reise 2000)
to estimate the IRT parameters; 2PL models are IRT models for binary dependent
variables, which is appropriate because each of the three recoded group consciousness
items is binary. The 2PL model allows discrimination to vary across items, indicating
that the model does not assume that each item is equally indicative of a respondent’s
standing on θ. Equation 1 (the 2PL model) shows the probability that a respondent
with a given level of group consciousness (θ) will endorse item i (Embretson and Reise
2000, 70):
Table 17.7 Monotonicity and Group Consciousness

                           N        Loevinger's H Coefficient
Self-Categorization      1,134               0.46
Private Evaluation       1,134               0.41
Importance               1,134               0.46
Scale                    1,134               0.44

Table 17.8 Recoded Group Consciousness Variables

          Self-Categorization          Private Evaluation           Importance
          Not Endorsed   Endorsed      Not Endorsed   Endorsed      Not Endorsed   Endorsed
N              435          684             726          422             728          422
%             43.1         56.9            63.2         36.8            63.3         36.7
Total        1,149                         1,148                        1,150

p(X_{is} = 1 \mid \theta_s, \beta_i, \alpha_i) = \frac{\exp[\alpha_i(\theta_s - \beta_i)]}{1 + \exp[\alpha_i(\theta_s - \beta_i)]}    (1)

The logit of equation 1, θs − βi, is the difference of trait level and item difficulty. The αi
represents the item discrimination parameter. The discrimination parameter, which is
also referred to as the slope, indicates how well an item differentiates between response
categories. Items with higher discrimination are generally superior measures, because
they discriminate between response categories more accurately. The slope parameter is
calculated at the location of item difficulty. The item difficulty parameter (β) represents the trait level at which there is a 50% probability of endorsing an item.
Higher difficulty values represent items that are more difficult, indicating that fewer
people are likely to endorse that item (Embretson and Reise 2000; Koch 1983; Reise,
Widaman, and Pugh 1993). Using this information about the 2PL, table 17.9 displays the
model results.
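A minimal sketch of equation (1) and of the likelihood-based scoring of θ described earlier is given below. The discrimination and difficulty values plugged in are the estimates reported in table 17.9; in practice the item parameters themselves are estimated jointly (for example, by marginal maximum likelihood in standard IRT software), so this grid search over θ with the parameters held fixed is illustrative only.

```python
import numpy as np

def p_2pl(theta, alpha, beta):
    """Equation (1): probability of endorsing item i under the 2PL model."""
    return 1.0 / (1.0 + np.exp(-alpha * (theta - beta)))

def score_theta(responses, alpha, beta, grid=np.linspace(-3, 3, 601)):
    """Grid-search maximum-likelihood estimate of theta for one respondent.

    `responses` is a 0/1 vector over the three recoded items; `alpha` and
    `beta` hold the discrimination and difficulty parameters. All-zero or
    all-one response patterns push the estimate to the edge of the grid.
    """
    p = p_2pl(grid[:, None], alpha, beta)                    # grid points x items
    loglik = (responses * np.log(p) +
              (1 - responses) * np.log(1 - p)).sum(axis=1)   # log-likelihood per grid point
    return grid[np.argmax(loglik)]

# Estimates reported in table 17.9 (self-categorization, private evaluation, importance):
alpha = np.array([1.29, 1.84, 1.95])
beta = np.array([-0.28, 0.46, 0.46])

# A respondent who endorses only the self-categorization item:
# print(score_theta(np.array([1, 0, 0]), alpha, beta))
```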
The IRT model demonstrates that all three items have similar levels of discrimination,
indicating that they fairly evenly differentiate between response categories. The impor-
tance item is the most discriminating, with an α of 1.95, while the self-​categorization
item is the least discriminating, with an α of 1.29. Overall, all three items performed
relatively well at discriminating between response categories. The difficulty of the items
has a somewhat greater range, which is preferred, as well-​developed survey instruments
contain a number of items that range in difficulty. For this set of items, identity impor-
tance and private evaluation were the most difficult items to endorse, with higher βs.
Table 17.9 IRT Model of Group Consciousness

                              Estimate       SE
Self-Categorization
    Discrimination             1.29***      0.15
    Difficulty                −0.28**       0.06
Private Evaluation
    Discrimination             1.84***      0.27
    Difficulty                 0.46***      0.06
Importance
    Discrimination             1.95***      0.30
    Difficulty                 0.46***      0.06
N                              1,153

** p < 0.05; *** p < 0.001.

Conversely, self-​categorization was an easier item for respondents to endorse, with a
substantially lower β (−0.28). In general, these items tended to skew toward being mod-
erate to easy for respondents to endorse.
Another advantage of IRT over classical test theory is that the method is able to
demonstrate measurement precision across levels of group consciousness. Figure 17.1
displays this information, referred to as the TIF. Precision is highest where the TIF curve is highest (Zampetakis et al. 2015), which is particularly valuable because it
shows where the scale is most accurate. For this group consciousness scale, the results
are most precise at moderate levels of group consciousness and least precise for the
lowest and highest levels of group consciousness. This means that when modeling
group consciousness using these data, one can expect the greatest explanatory power
for those with a moderate amount of group consciousness. This offers a significant
advantage over classical test theory which, as stated above, cannot quantify precision
across scales.
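The TIF in figure 17.1 can be reproduced, at least approximately, from the table 17.9 estimates: for a 2PL item, information at θ is α²P(θ)[1 − P(θ)], and the TIF is the sum across items. The short sketch below, which reuses the parameter values given above, is an illustration rather than the chapter's plotting code.

```python
import numpy as np

def test_information(theta_grid, alpha, beta):
    """Test information function (TIF) for a 2PL model.

    Item information is alpha_i**2 * P_i * (1 - P_i); the TIF is the sum over
    items, and the scale is most precise where the TIF peaks.
    """
    p = 1.0 / (1.0 + np.exp(-alpha * (theta_grid[:, None] - beta)))  # grid x items
    return (alpha**2 * p * (1 - p)).sum(axis=1)

# theta_grid = np.linspace(-3, 3, 121)
# tif = test_information(theta_grid,
#                        np.array([1.29, 1.84, 1.95]),   # table 17.9 discriminations
#                        np.array([-0.28, 0.46, 0.46]))  # table 17.9 difficulties
# print("Most precise near theta =", round(float(theta_grid[tif.argmax()]), 2))
```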
Two methods are used to assess the model fit for an IRT model. The first method
examines the relationship between the observed and expected data by examining the
model residuals (Hambleton and Murray 1983; Ludlow 1986; Stark 2001). To demon-
strate adequate model fit, the expected data should fall within the 95% confidence in-
terval of the observed data. Large residuals, or discrepancies between the observed and
expected, indicate potential problems with the model (Embretson and Reise 2000).
Figure 17.2 displays the relationship between the observed and expected data and
indicates that the model fits the data well. In these figures, the black line with the error
bars represents the observed data, while the gray line represents the expected data. For
all categories of each of the three items, the majority of the observed data’s 95% confi-
dence interval overlapped the expected results.
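A simple way to approximate this check, sketched below under the assumption that scored θ values and 0/1 item responses are available as arrays, is to bin respondents by θ and compare the observed endorsement rate in each bin (with a normal-approximation 95% confidence interval) against the 2PL-expected probability at the bin's average θ. This is a stand-in for the residual analysis shown in figure 17.2, not the procedure used to produce it.

```python
import numpy as np

def observed_vs_expected(theta_hat, item, alpha_i, beta_i, n_bins=8):
    """Observed vs. model-expected endorsement rates across theta bins.

    Returns one row per bin: average theta, observed proportion endorsing the
    item, an approximate 95% confidence interval, and the 2PL-expected
    probability. Expected values falling outside the interval flag misfit.
    """
    edges = np.quantile(theta_hat, np.linspace(0, 1, n_bins + 1))
    bin_idx = np.digitize(theta_hat, edges[1:-1])          # bin index 0 .. n_bins-1
    rows = []
    for b in range(n_bins):
        mask = bin_idx == b
        n = int(mask.sum())
        if n == 0:
            continue
        obs = float(item[mask].mean())
        se = np.sqrt(max(obs * (1 - obs), 1e-12) / n)      # binomial SE, normal approximation
        mid = float(theta_hat[mask].mean())
        expected = 1.0 / (1.0 + np.exp(-alpha_i * (mid - beta_i)))
        rows.append({"theta": mid, "observed": obs,
                     "ci_low": obs - 1.96 * se, "ci_high": obs + 1.96 * se,
                     "expected": expected})
    return rows
```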
[Figure omitted: the test information function (sum of item information function values, 0 to 7) plotted against group consciousness (θ), from −3.0 to 3.0.]
Figure 17.1  Test information function for group consciousness.

[Figure omitted: three panels (private evaluation, identity importance, and self-categorization) plotting the probability of a positive response, 0.0 to 1.0, against group consciousness (θ), from −3.0 to 3.0.]
Figure 17.2  Model fit for group consciousness.

The second method for evaluating model fit involves examining the χ2/​df statistic,
which formalizes the analysis of residuals (Embretson and Reise 2000). This statistic
examines the global fit of the model and assumes an asymptotic χ2 distribution (Orlando
and Thissen 2000; Zampetakis et al. 2015). Table 17.10 displays the chi-​square results for
the two-​parameter logistic model. This table shows information on singlets, which are
residuals for single items; doublets, which are residuals for pairs of items; and triplets,
which are residuals for three items in a cross-​validation sample (Liu et al. 2011).
Table 17.10 Frequencies of the Adjusted Chi-Square to df Ratios for GRM Model

             <1    1<2    2<3    3<4    4<5    5<7    >7     Mean     SD
Singlets      0     0      0      1      0      2      0      5.19   1.88
Doublets      0     0      0      0      0      0      3     61.04   6.08
Triplets      0     0      0      0      0      0      1

The results in table 17.10 suggest that the model has moderate to poor fit, as the ma-
jority of chi-​square statistics are significant for singlets, doublets, and triplets. These
results should be interpreted with caution, however, as the chi-​square statistic is par-
ticularly sensitive to sample size and tends to imply model misfit even in moderately
sized samples (Zampetakis et  al. 2015). Evidence indicates that nearly any departure
from the model will result in a significant detection of misfit (Bentler and Bonnet 1980),
especially if the data are not normally distributed (McIntosh 2007). Consequently,
this model likely fits the data better than the chi-​square statistic implies. For example,
Sinharay and Haberman (2014) analyzed a series of chi-​square fit statistics in relation
to IRT models and failed to find any models that fit the data, with severe misfit in nearly
all large samples. Therefore, given the visual fit displayed in figure 17.2, I argue that the
model adequately captures the data and that the resulting group consciousness scale is
robust even in the event of violations of the IRT model.
The final step in capturing group consciousness is examining DIF. As detailed
above, DIF occurs when there is an interaction between levels of group conscious-
ness and group membership. When DIF is not present, respondents with the same
level of group consciousness will have the same score on the latent trait; when DIF is
present, a respondent’s level of group consciousness will be conditioned by his or her
group membership, distorting the results. Therefore, two respondents
may have the same level of group consciousness, but score differently on the scale
based on their subgroup, rather than their level of θ. Two forms of DIF may be pre-
sent in the sample, uniform DIF and nonuniform DIF (Zumbo 1999; Holland and
Wainer 2012; Swaminathan and Rogers 1990). Uniform DIF occurs when group mem-
bership and group consciousness interact, but that interaction is consistent across all
levels of the latent trait. Nonuniform DIF occurs when that interaction varies across
levels of the latent trait, with different effects at low, moderate, or high levels of group
consciousness.
I used DIFdetect to identify and adjust for DIF-​affected items (Crane et al. 2006).
This method utilizes an ordinal logistic regression model for DIF detection and extends
previous DIF detection analyses (Mantel and Haenszel 1959; Swaminathan and Rogers
1990; Zumbo 1999). DIFdetect is an iterative process for estimating group consciousness
that begins with detecting which items demonstrate DIF. When items do not demon-
strate DIF, IRT parameters are estimated for the entire sample. When items demonstrate
Table 17.11 Differential Item Functioning in "A Survey of LGBT Americans"

Type of Significant DIF at p < 0.05

                                 Self-Categorization    Private Evaluation    Importance
Lesbians                         Uniform                Uniform               None
Female Bisexuals                 Uniform                Uniform               None
Male Bisexuals                   Uniform                Nonuniform            Uniform
Racial and Ethnic Minorities     Nonuniform             Uniform               Uniform
Bachelor's Degree                Uniform                None                  None
Over 45 Years of Age             None                   None                  None

DIF, IRT parameters are estimated separately for the separate groups. This produces a
DIF-​adjusted estimate that can be used in subsequent analyses without bias. For the it-
erative process, the DIF-​adjusted estimate of the latent trait is used to test additional
grouping categories for DIF. This process of adjusting for DIF is repeated until all rel-
evant items have been analyzed and adjusted for, as necessary (Zampetakis et al. 2015).
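The logic of the regression-based DIF test just described can be sketched as follows for a dichotomous item, in the spirit of Zumbo (1999): nest a model with θ only inside models that add a group indicator and then a θ × group interaction, reading a significant group effect as uniform DIF and a significant interaction as nonuniform DIF. DIFdetect itself is a Stata routine built around ordinal logistic regression; the Python/statsmodels version below, with hypothetical variable names, is only an illustration of the test.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def dif_test(item, theta, group):
    """Likelihood-ratio tests for uniform and nonuniform DIF on a binary item.

    Model 1: item ~ theta
    Model 2: item ~ theta + group                 (group main effect -> uniform DIF)
    Model 3: item ~ theta + group + theta x group (interaction -> nonuniform DIF)
    """
    X1 = sm.add_constant(np.column_stack([theta]))
    X2 = sm.add_constant(np.column_stack([theta, group]))
    X3 = sm.add_constant(np.column_stack([theta, group, theta * group]))
    ll = [sm.Logit(item, X).fit(disp=0).llf for X in (X1, X2, X3)]
    return {"uniform_p": chi2.sf(2 * (ll[1] - ll[0]), df=1),
            "nonuniform_p": chi2.sf(2 * (ll[2] - ll[1]), df=1)}

# Hypothetical usage: does the self-categorization item work differently for
# lesbian respondents relative to the gay-male reference group?
# result = dif_test(item=self_cat, theta=theta_hat, group=is_lesbian)
```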
Table 17.11 shows that for nearly every demographic category, both uniform and
nonuniform DIF was present, as the probability of DIF was consistently significant.
Sexual orientation, race and ethnicity, and education all contributed to differential item
functioning within this sample, while age did not. Each subgroup was compared to a
reference population. For example, lesbians and bisexuals were compared to gay men,
racial and ethnic minorities were compared to whites, those with bachelor’s degrees
were compared to those without degrees, and the over age forty-​five population was
compared to the under age forty-​five population. Across each DIF analysis except age,
group membership was significant for at least one item within the scale, indicating that a
DIF-​adjusted measure of group consciousness must be used.
This is a particularly important finding, because it casts doubt on previous analyses of
subgroup differences in levels of group consciousness. To date, we have attributed group
differences to actual differences that exist between demographic groups. If these differences
are the result of survey bias, however, we may be drawing the wrong conclusions about
levels of group consciousness. Using DIF-​adjusted results, it is possible that differences
among demographic groups may disappear in subsequent tests. Therefore, to verify that
we form accurate conclusions about group consciousness, it is essential to use DIF anal-
ysis in constructing our measures of latent traits.

Results

Using DIF estimates that were adjusted for lesbian sexual orientation, bisexual female
sexual orientation, and education, I produced an unbiased and empirically grounded
Table 17.12 Summary Statistics of Group Consciousness

Mean      SD      Min      Max      N
0.000     0.75    −0.94    1.20     1,153

measure of group consciousness. Adjustments for racial and ethnic minority status and
bisexual male orientation did not contribute to an improvement in the estimation of
θ. Therefore, although DIF was present, I did not adjust group consciousness for these
groups, because the adjustments did not improve the model. This likely indicates that, while
significant, the DIF results for these groups were not substantively important and are
unlikely to impact subsequent modeling. For all other groups, however, DIF funda-
mentally structured the results, demonstrating that these differences are likely to im-
pact future tests. In addition, it is possible that the inability to improve the estimation
of θ for bisexual males and racial and ethnic minorities is a function of their relatively
small sample size, and that meaningful DIF could be found in future analyses that rely
on larger and more diverse samples.
Table 17.12 displays the summary statistics of the group consciousness measure,
showing that IRT generated an interval measure of group consciousness with a mean
of 0 and a standard deviation of 0.8. The latent trait was predicted using an empirical
Bayes estimator that combines prior information about θ with the likelihood of the observed responses to obtain the conditional posterior distribution of θ (Skrondal and Rabe-Hesketh 2004,
2009). The resulting measure of group consciousness ranges from −0.9 to 1.2, with lower
values representing lower levels of group consciousness and higher values representing
higher levels of group consciousness. Overall, the summary statistics demonstrate that
this measure of group consciousness has favorable statistical properties for subsequent
testing.
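The empirical Bayes idea can be illustrated numerically: combine a standard-normal prior on θ with the 2PL likelihood of a respondent's answer pattern and take the posterior mean. The sketch below is a grid-based approximation of that logic, using the table 17.9 parameters, and is not the exact estimator or software used to produce table 17.12.

```python
import numpy as np

def eap_score(responses, alpha, beta, grid=np.linspace(-4, 4, 161)):
    """Expected a posteriori (empirical Bayes) estimate of theta.

    Multiplies a standard-normal prior by the 2PL likelihood of the observed
    0/1 response pattern over a grid of theta values and returns the
    posterior mean.
    """
    prior = np.exp(-0.5 * grid**2)                             # N(0, 1) kernel (unnormalized)
    p = 1.0 / (1.0 + np.exp(-alpha * (grid[:, None] - beta)))  # grid x items
    lik = np.prod(np.where(responses == 1, p, 1 - p), axis=1)  # likelihood of the pattern
    posterior = prior * lik
    posterior /= posterior.sum()
    return float(np.sum(grid * posterior))

# alpha = np.array([1.29, 1.84, 1.95]); beta = np.array([-0.28, 0.46, 0.46])
# print(round(eap_score(np.array([1, 1, 0]), alpha, beta), 2))
```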

Discussion

The results presented in this analysis cast doubt on group consciousness research
that fails to use strong measurement models. To date, dozens of research articles
examine group consciousness, yet contain little to no discussion of the most ap-
propriate measurement strategies for capturing the concept. This is a serious lim-
itation in the current body of group consciousness research, as it leads to three
primary limitations that the methodology proposed in this analysis addresses: (1)
our measures of group consciousness may have face validity, but lack construct va-
lidity; (2) many measures of group consciousness probably contain survey bias that
distorts our interpretation of subgroup differences; and (3) we are measuring group
consciousness incorrectly when we use a series of distinct, independent variables or
additive measures.
Beginning with an examination of validity, the most commonly used group con-
sciousness measures have not been examined from a measurement standpoint. This
means that although they theoretically align with our understanding of group con-
sciousness, this relationship has not been empirically established. In this analysis, at
least two of the measures that were expected to map to group consciousness, public eval-
uation and attachment, failed to demonstrate a relationship with the latent trait. If de-
tailed examination of these items had not been performed, they could have erroneously
been included in the final group consciousness measure. This would have likely led to
model distortions and the incorrect presentation of results. Essentially, any conclusions
we drew from a measure of group consciousness that included these items would have
been wrong, as they fundamentally mismeasured the construct. Therefore, because
most preceding articles have not used methodologically valid measures of group con-
sciousness, we cannot be certain that our conclusions about the nature of group con-
sciousness are reliable or valid.
Item bias further distorts these results and has a high probability of misdirecting
our conclusions. Currently, many research articles point to significant and mean-
ingful subgroup differences regarding levels of group consciousness (Masuoka 2006;
Jamal 2005; Duncan 1999; Sanchez 2006a, 2006b, 2008). However, none of these arti-
cles examine whether the survey itself is driving these differences through differential
item functioning. Given that five subgroups within this examination demonstrated
DIF—​lesbians, bisexual females, bisexual males, racial and ethnic minorities, and the
college-​educated population—​it is very likely that our current understanding of sub-
group differences may be the result of survey bias. Moving forward, analyses that seek to
explain the formation of group consciousness and control for subgroups must include
an analysis of DIF. Without doing so, the field may be making false deductions about the
relationship between demographic categories and group consciousness.
Finally, this research calls into question the many measures of group consciousness
that are currently employed. Most scholars analyzing group consciousness utilize either
additive measures that simply add together a series of dependent variables, or treat all
the subcomponents of group consciousness as distinct and operationalize each vari-
able as a separate independent variable. Both approaches are incorrect. The first creates
measures that are directly contingent on the number of items on the scale, which may
or may not be related. The second treats variables as multidimensional when they are
probably unidimensional. As this analysis demonstrates, the method that most accu-
rately estimates group consciousness must rely on IRT. This is particularly important
given that IRT produces results with favorable properties for statistical testing. Given
that examining group differences can be misleading if the incorrect level of measure-
ment is used (Maxwell and Delaney 1985), many of our current results regarding group
consciousness may be misspecified.
Together, these results have broad implications for scholars of political behavior, be-
cause they provide strong support for the argument that IRT must be more thoroughly
incorporated into our empirical analyses. Although we dedicate a great deal of time to
discussing theoretical factors and implications, we rarely devote the same amount of
attention to measurement strategies. Consequently, we use measures that are theoret-
ically grounded, yet rarely empirically grounded. As this analysis demonstrates, that
limitation is highly likely to lead us to false conclusions based on inappropriate meas-
urement. This is particularly probable because our concepts tend to be relatively ab-
stract, amorphous, and difficult to define.
Moving forward, scholars should incorporate IRT as a solution to these measure-
ment problems. It allows us to develop empirically based measures for capturing latent
constructs with favorable statistical properties for subsequent analysis. It builds on our
theoretical knowledge by relying on theoretical justifications for initial item selection,
while subsequently empirically testing the validity of those assumptions. Through a pro-
cess of examining dimensionality, monotonicity, DIF, and model data fit, IRT allows us
to produce empirically valid and reliable operationalizations. A general guideline would
encourage scholars of political behavior to always begin with IRT, even when analyzing
concepts that seem relatively straightforward, such as political knowledge or political
participation, as evidence demonstrates that these latent variables are rarely as uncom-
plicated as they seem. Consequently, all analyses that utilize latent constructs should
consider incorporating IRT as their measurement strategy.

Conclusion

Using IRT, this analysis makes a series of important contributions that challenge the
conventional measurement strategies of scholars analyzing group consciousness.
It begins by demonstrating that group consciousness is not multidimensional from
a measurement standpoint, as all theoretical subcomponents mapped onto a single
construct in this sample. Although we may discuss the construct as multidimensional,
it is best operationalized using a single construct. In addition, many concepts that
are traditionally grouped into group consciousness measures, such as public evalu-
ation and attachment, failed to meet model assumptions and did not properly align
with group consciousness. Therefore, some of the subcomponents we use to clarify
the definition of group consciousness may not be particularly meaningful and should
potentially be excluded from usage in future analyses. Further, even when the correct
number of dimensions is used and the items are correctly specified, group conscious-
ness measures are highly likely to suffer from differential item functioning. As this
analysis shows, nearly all major subgroups demonstrated a degree of survey bias,
implying that the conclusions formed about the relationship between these subgroups
and group consciousness will be biased unless we use DIF-​adjusted results. In total,
these results call into question our current understanding of group consciousness,
as almost all articles examining group consciousness lack appropriate measurement
methodologies. Using IRT, we can overcome these limitations by establishing statis-
tically valid measures of group consciousness that allow us to reexamine our prior
conclusions.
Notes
1. Survey weights were not used in this analysis.
2. Public evaluation and attachment were recoded using a variety of methods and retested
to analyze if using a different measurement strategy would satisfy the monotonicity
requirements. No method of recoding the items was able to achieve a sufficient Loevinger’s
H coefficient to establish monotonicity. Further, visual inspection of the item characteristic
curves validated the MSA, with both ICCs demonstrating significant violations of the mon-
otonicity assumption (Koster et al. 2009; Murray et al. 2014; Stochl et al. 2012).
3. Unidimensionality was re-​established for the three-​item scale after analyzing monotonicity.
The remaining items satisfied the unidimensionality requirement, with only one factor
having an eigenvalue greater than 1, and the first factor explaining 56.65% of the variance.
Therefore, this subset of items also met the unidimensionality condition. Monotonicity was
also re-​established for the three-​item scale after recoding the variables following the logic
described below. The remaining items satisfied the monotonicity requirement, indicating
that item recoding did not violate model assumptions.

References
Abrajano, M. 2015. “Reexamining the ‘Racial gap’ in Political Knowledge.” Journal of Politics 77
(1): 44–​54.
Abramowitz, A. I., and K. L. Saunders. 2006. “Exploring the Bases of Partisanship in
the American Electorate:  Social Identity vs. Ideology.” Political Research Quarterly 59
(2): 175–​187.
Abrams, D., and R. Brown. 1989. “Self-​Consciousness and Social Identity: Self-​Regulation as a
Group Member.” Social Psychology Quarterly 52 (4): 311–​318.
Ashmore, R. D., K. Deaux, and T. McLaughlin-​Volpe. 2004. “An Organizing Framework for
Collective Identity:  Articulation and Significance of Multidimensionality.” Psychological
Bulletin 130 (1): 80–​113.
Baker, F. B. 2001. The Basics of Item Response Theory. New  York:  ERIC Clearinghouse on
Assessment and Evaluation.
Baker, F. B., and S. Kim. 2004. Item Response Theory: Parameter Estimation Techniques. 2nd ed.
New York: CRC Press.
Bentler, P. M., and D. G. Bonnet. 1980. “Significance Tests and Goodness of Fit in the Analysis
of Covariance Structures.” Psychological Bulletin 88 (3): 588–​606.
Brewer, M. B. 1979. “In-​Group Bias in the Minimal Intergroup Situation:  A Cognitive-​
Motivational Analysis.” Psychological Bulletin 86 (2): 307–​324.
Brubaker, R., and F. Cooper. 2000. “Beyond ‘Identity’.” Theory and Society 29 (1): 1–​47.
Carpini, M. X. D., and S. Keeter. 1993. “Measuring Political Knowledge: Putting First Things
First.” American Journal of Political Science 37 (4): 1179–​1206.
Chong, D., and R. Rogers. 2005. “Racial Solidarity and Political Participation.” Political
Behavior 27 (4): 347–​374.
Clinton, J. D., and J. S. Lapinski. 2006. “Measuring Legislative Accomplishment, 1877–​1994.”
American Journal of Political Science 50 (1): 232–​249.
Conover, P. J. 1984. “The Influence of Group Identifications on Political Perception and
Evaluation.” Journal of Politics 46 (3): 760–​785.
Conover, P. J. 1988. “The Role of Social Groups in Political Thinking.” British Journal of Political
Science 18 (1): 51–​76.
Conover, P. J., and S. Feldman. 1984. “How People Organize the Political World: A Schematic
Model.” American Journal of Political Science 28 (1): 95–​126.
Conover, P. J., and V. Sapiro. 1993. “Gender, Feminist Consciousness, and War.” American
Journal of Political Science 37 (4): 1079–​1099.
Crane, P. K., L. E. Gibbons, L. Jolley, and G. van Belle. 2006. “Differential Item Functioning
Analysis with Ordinal Logistic Regression Techniques: DIFdetect and difwithpar.” Medical
Care 44 (11, supp. 3): S115–​S123.
Crocker, J., R. Luhtanen, B. Blaine, and S. Broadnax. 1994. “Collective Self-​Esteem and
Psychological Well-​Being among White, Black, and Asian College Students.” Personality and
Social Psychology Bulletin 20 (5): 503–​513.
Deaux, K. 1996. “Social Identification.” In Psychology: Handbook of Basic Principles, edited by E.
T. Higgins, and A. W. Kruglanski, 227–​238. New York: Guilford Press.
Diehl, M. 1989. “Justice and Discrimination between Minimal Groups: The Limits of Equity.”
British Journal of Social Psychology 28 (3): 227–​238.
Duncan, L. E. 1999. “Motivation for Collective Action:  Group Consciousness as Mediator
of Personality, Life Experiences, and Women’s Rights Activism.” Political Psychology 20
(3): 611–​635.
Eagly, A. H., and S. Chaiken. 1993. The Psychology of Attitudes. Fort Worth, TX: Harcourt Brace
Jovanovich College Publishers.
Embretson, S. E. 1996. “The New Rules of Measurement.” Psychological Assessment 8 (4): 341–​349.
Embretson, S. E., and S. P. Reise. 2000. Item Response Theory for Psychologists. Mahwah,
NJ: Lawrence Erlbaum Associates.
Embretson, S. E., and S. P. Reise. 2013. Item Response Theory. Psychology Press.
Fraley, R. C., N. G. Waller, and K. A. Brennan. 2000. “An Item Response Theory Analysis of
Self-​Report Measures of Adult Attachment.” Journal of Personality and Social Psychology 78
(2): 350–​365.
Galecki, J. M., M. F. Sherman, and J. M. Prenoveau. 2016. “Item Analysis of the Leeds
Dependence Questionnaire in Community Treatment Centers.” Psychological Assessment 28
(9): 1061–​1073.
Gamson, W. A. 1968. Power and Discontent. Homewood, IL: Dorsey Press.
Gerbing, D. W., and J. C. Anderson. 1988. “An Updated Paradigm for Scale Development
Incorporating Unidimensionality and Its Assessment.” Journal of Marketing Research 25
(2): 186–​192.
Gillion, D. Q. 2009. “Re-​defining Political Participation through Item Response Theory.” Paper
presented at APSA 2009 Meeting, Toronto.
Green, S. B. 1991. “How Many Subjects Does It Take to Do a Regression Analysis?” Multivariate
Behavioral Research 26 (3): 499–​510.
Gurin, P., A. H. Miller, and G. Gurin. 1980. “Stratum Identification and Consciousness.” Social
Psychology Quarterly 43 (1): 30–​47.
Gurin, P. 1985. "Women's Gender Consciousness." Public Opinion Quarterly 49 (2): 143–163.
Hambleton, R., H. Swaminathan, and H. J. Rogers. 1991. Fundamentals of Item Response Theory.
Newbury Park, CA: Sage.
Hambleton, R. K., and L. Murray. 1983. “Some Goodness of Fit Investigations for Item Response
Models." In Applications of Item Response Theory, edited by R. K. Hambleton. Vancouver, BC: Educational Research Institute of British Columbia.
Hardouin, J. 2013. MSP: Stata Module to Perform the Mokken Scale Procedure. https://​ideas.
repec.org/​c/​boc/​bocode/​s439402.html
Harris, F., and D. Q. Gillion. 2012. “Expanding the Possibilities: Reconceptualizing Political
Participation as a Toolbox.” In The Oxford Handbook of American Elections and Political
Behavior, edited by J. E. Leighley, 144–​161. New York: Oxford University Press.
Heere, B., and J. D. James 2007. “Stepping Outside the Lines: Developing a Multi-​dimensional
Team Identity Scale Based on Social Identity Theory.” Sport Management Review 10
(1): 65–​91.
Hemker, B. T., K. Sijtsma, and I. W. Molenaar. 1995. “Selection of Unidimensional Scales from a
Multidimensional Item Bank in the Polytomous Mokken IRT Model." Applied Psychological
Measurement 19 (4): 337–​352.
Henderson-​King, D. H., and A. J. Stewart. 1994. “Women or Feminists? Assessing Women’s
Group Consciousness.” Sex Roles 31 (9): 505–​516.
Highton, B., and C. D. Kam. 2011. “The Long-​Term Dynamics of Partisanship and Issue
Orientations.” Journal of Politics 73 (1): 202–​215.
Holland, P. W., and H. Wainer, eds. 2012. Differential Item Functioning. New York: Routledge.
Huddy, L. 2001. “From Social to Political Identity: A Critical Examination of Social Identity
Theory.” Political Psychology 22 (1): 127–​156.
Jackman, M. R., and R. W. Jackman. 1973. “An Interpretation of the Relation between Objective
and Subjective Social Status.” American Sociological Review 38 (5): 569–​582.
Jamal, A. 2005. “The Political Participation and Engagement of Muslim Americans: Mosque
Involvement and Group Consciousness.” American Politics Research 33 (4): 521–​544.
Jerit, J., J. Barabas, and T. Bolsen. 2006. “Citizens, Knowledge, and the Information
Environment.” American Journal of Political Science 50 (2): 266–​282.
Kaiser, H. F. 1970. “A Second Generation Little Jiffy.” Psychometrika 35 (4): 401–​415.
Kidd, Q., H. Diggs, M. Farooq, and M. Murray. 2007. “Black Voters, Black Candidates, and
Social Issues: Does Party Identification Matter?” Social Science Quarterly 88 (1): 165–​176.
Koch, W. R. 1983. “Likert Scaling Using the Graded Response Latent Trait Model.” Applied
Psychological Measurement 7 (1): 15–​32.
Koster, M., M. E. Timmerman, H. Nakken, S. J. Pijl, and E. J. van Houten. 2009.
“Evaluating Social Participation of Pupils with Special Needs in Regular Primary
Schools: Examination of a Teacher Questionnaire.” European Journal of Psychological
Assessment 25 (4): 213–​222.
Kreuter, F., S. Presser, and R. Tourangeau. 2008. “Social Desirability Bias in CATI, IVR, and
Web Surveys: The Effects of Mode and Question Sensitivity.” Public Opinion Quarterly 72
(5): 847–​865.
Liu, L., F. Drasgow, R. Reshetar, and Y. R. Kim. 2011. “Item Response Theory (IRT) Analysis of
Item Sets.” Paper presented at the Northeastern Educational Research Association (NERA)
Annual Conference, Rocky Hill, CT.
Loevinger, J., G. C. Gleser, and P. H. DuBois. 1953. “Maximizing the Discriminating Power of a
Multiple-​Score Test.” Psychometrika 18 (4): 309–​317.
Lord, F. M. 1980. Applications of Item Response Theory to Practical Testing Problems. Hillside,
NJ: Erlbaum.
Ludlow, L. H. 1986. “Graphical Analysis of Item Response Theory Residuals.” Applied
Psychological Measurement 10 (3): 217–​229.
Luhtanen, R., and J. Crocker. 1992. “A Collective Self-​Esteem Scale: Self-​Evaluation of One’s
Social Identity.” Personality and Social Psychology Bulletin 18 (3): 735–​754.
Mael, F. A., and L. E. Tetrick. 1992. “Identifying Organizational Identification.” Educational and
Psychological Measurement 52 (4): 813–​824.
Mantel, N., and W. Haenszel. 1959. “Statistical Aspects of the Analysis of Data from
Retrospective Studies.” Journal of the National Cancer Institute 22 (4): 719–​748.
Masuoka, N. 2006. “Together They Become One:  Examining the Predictors of Panethnic
Group Consciousness among Asian Americans and Latinos.” Social Science Quarterly 87
(5): 993–​1011.
Maxwell, S. E., and H. D. Delaney. 1985. “Measurement and Statistics:  An Examination of
Construct Validity.” Psychological Bulletin 97 (1): 85–​93.
McCall, G. J., and J. L. Simmons. 1978. Identities and Interactions: An Examination of Human
Associations in Everyday Life. New York: Free Press.
McClain, P. D., J. D. Johnson Carew, E. Walton Jr., and C. S. Watts. 2009. “Group Membership,
Group Identity, and Group Consciousness:  Measures of Racial Identity in American
Politics?” Annual Review of Political Science 12: 471–​485.
McIntosh, C. N. 2007. “Rethinking Fit Assessment in Structural Equation Modelling:  A
Commentary and Elaboration on Barrett.” Personality and Individual Differences 42
(5): 859–​867.
Miller, A. H., P. Gurin, G. Gurin, and O. Malanchuk. 1981. “Group Consciousness and Political
Participation.” American Journal of Political Science 25 (3): 494–​511.
Mokken, R. J. 1971. A Theory and Procedure of Scale Analysis. Berlin: De Gruyter.
Mokken, R. J. 1997. “Nonparametric Models for Dichotomous Responses.” In Handbook of
Modern Item Response Theory, edited by W. J. van der Linden and R. K. Hambleton, 351–​367.
New York: Springer.
Mondak, J. J. 2001. “Developing Valid Knowledge Scales.” American Journal of Political Science
45 (1): 224–​238.
Murray, A. L., K. McKenzie, K. R. Murray, and M. Richelieu. 2014. “Mokken Scales for
Testing Both Pre-​and Postintervention: An Analysis of the Clinical Outcomes in Routine
Evaluation—​Outcome Measure (CORE–​OM) Before and After Counseling.” Psychological
Assessment 26 (4): 1196.
Orlando, M., and D. Thissen. 2000. “Likelihood-​Based Item-​Fit Indices for Dichotomous Item
Response Theory Models.” Applied Psychological Measurement 24 (1): 50–​64.
Osterlind, S. J., and H. T. Eveson. 2009. Differential Item Functioning. 2nd ed. New York: Sage.
Pew Research Center. 2013. “A Survey of LGBT Americans: Attitudes, Experiences, and Values
in Changing Times.” Pew Research Center. http://​www.pewsocialtrends.org/​2013/​06/​13/​
a-​survey-​of-​lgbt-​americans/​
Phinney, J. S. 1991. “Ethnic Identity and Self-​Esteem:  A Review and Integration.” Hispanic
Journal of Behavioral Sciences 13: 193–​208.
Reckase, M. D. 1979. “Unifactor Latent Trait Models Applied to Multifactor Tests: Results and
Implications.” Journal of Educational and Behavioral Statistics 4 (3): 207–​230.
Reeve, B. B., R. D. Hays, J. B. Bjorner, K. F. Cook, P. K. Crane, J. A. Teresi, et  al. 2007.
“Psychometric Evaluation and Calibration of Health-​ Related Quality of Life Item
Banks:  Plans for the Patient-​ Reported Outcomes Measurement Information System
(PROMIS).” Medical Care 45 (5): S22–​S31.
Reise, S. P., K. F. Widaman, and R. H. Pugh. 1993. “Confirmatory Factor Analysis and Item
Response Theory: Two Approaches for Exploring Measurement Invariance.” Psychological
Bulletin 114 (3): 552–​566.
Rosenberg, M. 1979. Conceiving the Self. New York: Basic Books.
Sanchez, G. R. 2006a. “The Role of Group Consciousness in Latino Public Opinion.” Political
Research Quarterly 59 (3): 435–​446.
Sanchez, G. R. 2006b. “The Role of Group Consciousness in Political Participation among
Latinos in the United States.” American Politics Research 34 (4): 427–​450.
Sanchez, G. R. 2008. “Latino Group Consciousness and Perceptions of Commonality with
African Americans.” Social Science Quarterly 89 (2): 428–​444.
Sanchez, G. R., and E. D. Vargas. 2016. “Taking a Closer Look at Group Identity: The Link
between Theory and Measurement of Group Consciousness and Linked Fate.” Political
Research Quarterly 69 (1): 160–​174.
Sellers, R. M., S. A.  J. Rowley, T. M. Chavous, J. N. Shelton, and M. A. Smith. 1997.
“Multidimensional Inventory of Black Identity: A Preliminary Investigation of Reliability
and Construct Validity.” Journal of Personality and Social Psychology 73 (4): 805–​815.
Shingles, R. 1981. “Black Consciousness and Political Participation:  The Missing Link.”
American Political Science Review 75 (1): 76–​91.
Sinharay, S., and S. J. Haberman. 2014. “How Often Is the Misfit of Item Response Theory
Models Practically Significant?” Educational Measurement: Issues and Practice 33 (1): 23–​35.
Skrondal, A., and S. Rabe-​Hesketh. 2004. Generalized Latent Variable Modeling: Multilevel,
Longitudinal, and Structural Equation Models. Boca Raton, FL: CRC Press.
Skrondal, A., and S. Rabe-​Hesketh. 2009. “Prediction in Multilevel Generalized Linear
Models.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 172 (3): 659–​687.
Slocum-​Gori, S. L., and B. D. Zumbo. 2011. “Assessing the Unidimensionality of Psychological
Scales:  Using Multiple Criteria from Factor Analysis.” Social Indicators Research 102
(3): 443–​461.
Smith, R. M. 2004. “Identities, Interests, and the Future of Political Science.” Perspectives on
Politics 2 (2): 301–​312.
Stark, S. 2001. MODFIT:  A Computer Program for Model-​ Data Fit. Urbana-​
Champaign: University of Illinois.
Stochl, J., P. B. Jones, and T. J. Croudace. 2012. “Mokken Scale Analysis of Mental Health and
Well-​Being Questionnaire Item Responses: A Non-​parametric IRT Method in Empirical
Research for Applied Health Researchers.” BMC Medical Research Methodology 12 (1): 1.
Stokes, A. K. 2003. “Latino Group Consciousness and Political Participation.” American Politics
Research 31 (4): 361–​378.
Stryker, S. 1980. Symbolic Interactionism a Social Structural Version. Menlo Park, CA: Benjamin
Cummings.
Stryker, S., and R. T. Serpe. 1994. “Identity Salience and Psychological Centrality: Equivalent,
Overlapping, or Complementary Concepts?” Social Psychology Quarterly 57 (1): 16–​35.
Swaminathan, H., and H. J. Rogers. 1990. “Detecting Differential Item Functioning Using
Logistic Regression Procedures.” Journal of Educational Measurement 27 (4): 361–​370.
Tajfel, H. 1981. Human Groups and Social Categories: Studies in Social Psychology. Cambridge,
MA: Cambridge University Press.
Tajfel, H. 1982. “Social Psychology of Intergroup Relations.” Annual Review of Psychology
33: 1–​39.
Thissen, D., and L. Steinberg. 1986. “A Taxonomy of Item Response Models.” Psychometrika 51
(4): 567–​577.
Trapnell, P. D., and J. D. Campbell. 1999. “Private Self-​Consciousness and the Five-​Factor
Model of Personality: Distinguishing Rumination from Reflection.” Journal of Personality
and Social Psychology 76 (2): 284–​304.
Turner, J. C., M. A. Hogg, P. J. Oakes, S. D. Reicher, and M. S. Wetherell. 1987. Rediscovering the
Social Group: A Theory of Self-​Categorization. New York: Basil Blackwell.
Tyler, T. R., and S. L. Blader. 2001. “Identity and Cooperative Behavior in Groups.” Group
Processes and Intergroup Relations 4 (3): 207–​226.
van der Linden, W. J., and R. K. Hambleton, eds. 1997. Handbook of Modern Item Response
Theory. New York: Springer.
van Schuur, W. H. 2003. “Mokken Scale Analysis: Between the Guttmann Scale and Parametric
Item Response Theory.” Political Analysis 11 (2): 139–​163.
Wallace, D. S., A. Abduk-​Khaliq, M. Czuchry, and T. L. Sia. 2009. “African Americans’ Political
Attitudes, Party Affiliation, and Voting Behavior.” Journal of African American Studies 13
(2): 139–​146.
Welch, S., and L. S. Foster. 1992. "The Impact of Economic Conditions on the Voting Behavior
of Blacks.” The Western Political Quarterly 45 (1): 221–​236.
Weldon, S. A. 2006. “The Institutional Context of Tolerance for Ethnic Minorities:  A
Comparative, Multilevel Analysis of Western Europe.” American Journal of Political Science
50 (2): 331–​349.
Wilson Van Voorhis, C. R., and B. L. Morgan. 2007. “Understanding Power and Rules of Thumb
for Determining Sample Sizes.” Tutorials in Quantitative Methods for Psychology 3 (2): 43–​50.
Yen, W. M. 1986. “The Choice of Scale for Educational Measurement: An IRT Perspective.”
Journal of Educational Measurement 23 (4): 299–​325.
Zampetakis, L. A., M. Lerakis, K. Kafetsios, and V. Moustakis. 2015. “Using Item Response
Theory to Investigate the Structure of Anticipated Affect:  Do Self-​Reports about Future
Affective Reactions Conform to Typical or Maximal Models?” Frontiers in Psychology
September (6): 1–​8.
Zumbo, B. D. 1999. A Handbook on the Theory and Methods of Differential Item Functioning
(DIF):  Logistic Regression Modeling as a Unitary Framework for Binary and Likert-​type
(Ordinal) Item Scores. Ottawa, ON:  Directorate of Human Resources Research and
Evaluation, Department of National Defense.
Chapter 18

Cross-National Surveys and the Comparative Study of Electoral Systems

When Country/Elections Become Cases

Jeffrey A. Karp and Jack Vowles

Introduction

The origins of public opinion polls and election studies have been well covered in a rel-
atively extensive literature (Burdick and Brodbeck 1959; Converse 1987; Herbst 1993).
Less attention has been paid, however, to the development of political polling and survey
research across national boundaries (for brief accounts see Smith 2010a; Kittilson 2007;
Heath, Fisher, and Smith 2005). By this we do not mean the simple expansion of polls
and surveys into more and more countries, but rather the construction of polling and
survey instruments specifically designed to be fielded in more than one country for
purposes of direct comparison. Here we focus on the development of such instruments
for the purposes of comparative analysis in political science, in the context of more gen­
eral developments in survey and polling research. As an example, we take the case of the
Comparative Study of Electoral Systems (CSES), an international collaboration active
since 1996.
Cross-​national comparison can draw increasing attention to the importance of the
institutional and cultural contexts that shape public opinion and political behavior, as
well as the underlying variables that may shape and perhaps account for those contex-
tual differences. Since the 1990s such polls and surveys have expanded both in their
numbers and their extension, and they arguably now form one of the most important
frontiers in the development of survey research in political science.
The CSES stands out because, in cross-​national comparative research, countries—​
and indeed, for political scientists, the elections within them—​become cases of equal
significance to the individual respondents within each national component. In most
cross-​national surveys, timing is relatively random, depending on when finance is
secured and the demands of fieldwork. Cross-​national election surveys, however, are
conducted after elections. The election, rather than simply country x at time t, becomes
a case. Because of its theoretical focus on institutional differences between countries,
the CSES also stands out because it both provides data and explicitly encourages anal-
ysis of macro country-​level differences and cross-​level interactions between micro and
macro variables. Finally, its individual-​level data are immediately released to the public,
at no charge and with no embargo or delay, benefiting CSES collaborators.

The International Proliferation of Surveys and Polls

Before political polls and surveys could become cross-​national, it was necessary for
them to proliferate. Polling on political issues based on random probability sampling
originated in the United States in the 1930s, pioneered by George Gallup and Elmo
Roper (Cantril and Strunk 1951; Converse 1987). Political polling began in France in
1939, inspired by Gallup, and in Great Britain in the 1940s, when Gallup launched a sub-
sidiary there, with similar questions being asked in both countries. Survey institutes
were set up throughout West Germany during the Allied occupation as part of a strategy
to reduce the persisting influence of the Nazi regime on public opinion. By the 1950s
political polling had spread to many other democracies, and polls sponsored by media
organizations began to be reported regularly.
Academic election studies followed in the wake of the political pollsters. The United
States led the way, and indeed early election studies in the United States provided both
the methodological and theoretical inspiration for the extension of those studies else-
where and the eventual development of cross-​national studies. The first academic elec-
tion studies, known as the “Columbia studies,” can be traced back to the work of Paul
Lazarfeld and his colleagues, who conducted what can still be considered a sophisti-
cated survey to examine campaign effects. The findings were published in The People's
Choice (Lazarfeld et al. 1948), known for introducing the theory of “the two step flow of
communications,” which assumes that public opinion is influenced by elites. While the
initial motivation was to examine media effects and opinion change, the data revealed
remarkable opinion stability. This led to a second study, which was conducted in Elmira,
New York, during the 1948 election, which developed the sociological model that became the theoretical focus of Voting (Berelson et al. 1954).
The origins of the American National Election Studies (ANES), based at the
University of Michigan and also, in recent cycles, at Stanford University, can be traced
back to a survey from 1948. The survey, which was not primarily concerned with the
election, was designed to examine foreign policy attitudes. Truman’s surprise victory
in 1948 is considered to be one of the greatest upsets in American history. Virtually
all of the major polling organizations, including Gallup, had predicted that Thomas
Dewey, the Republican governor of New York, would easily defeat Truman. Given
the unexpected outcome, the decision was taken to interview the same respondents
again after the election to gain more knowledge about some of the perplexities of the
presidential vote.
The success of the Michigan Survey Research Center in producing a survey estimate
that essentially matched the electoral outcome helped to establish the University of
Michigan as a center for electoral research (Miller 1994). As a newly trained political sci-
entist and the assistant director of the Michigan Survey Research Center, Warren Miller
helped to design the 1952 national study, which was largely based on his PhD disserta-
tion and provided the framework for further studies that would become known as the
Michigan Election Studies.1 He recruited two graduate students to work on the project, Donald Stokes and Philip Converse, who together would form the core team.
The early studies were primarily designed to examine the effects of partisanship, issues,
and personalities on voting behavior. The 1952 study surveyed 1,899 respondents and
included 293 variables. These data, along with data from the 1956 election, formed the
basis for The American Voter, a seminal study of voting behavior that provided a theo-
retical framework that has had a major influence on electoral research not only in the
United States, but also abroad (Campbell et al. 1960).
Outside the United States, the first election studies began to appear in the 1950s and
1960s in various European countries, including Britain (1964), France (1958), Germany
(1949), Denmark (1959), Norway (1957), Sweden (1956), and the Netherlands (1967).
(Website links to most of these long-​standing studies are provided in an appendix to
this chapter.) They developed as a result of the exchange of various individuals who were
part of teams based in the United States or Europe. For example, the first British Election
Study (BES) was conducted by David Butler and Donald Stokes in 1964, the latter a coauthor of The American Voter. The Michigan school heavily influenced the
development of election studies in other countries, which has led to a similarity in both
theoretical and methodological features. Germany is said to have been influenced by
both the Columbia and Michigan schools, and the funnel of causality approach from the
Michigan model can be found in every German election study since the 1960s (Kaase
and Klingemann 1994). Other coauthors of The American Voter were also instrumental in
helping to initiate election studies in Europe. For example, Philip Converse collaborated
on the earliest election studies in France and is said to have had a hand in the first
Norwegian Election Study in 1965. Converse was also the principal investigator of the
first Canadian Election Study, likewise conducted in 1965. Of the coauthors of The American Voter, Warren Miller was viewed as one of the most active on the European front,
having spent lengthy visits in the Scandinavian countries, Britain, the Netherlands, and
West Germany (Thomassen 1994). The Swedish election study of 1954 was also heavily
influenced by the Columbia studies, closely resembling Lazarsfeld's Erie County study
of 1940, although later studies were more heavily inspired by the Michigan model
(Holmberg 1994).

The Development of Cross-National Polls and Surveys

Polls and election surveys proliferated, and the scene was set for comparative research
on political matters using these methods. The first large-​scale, cross-​national survey was
a 1948 Time magazine survey on freedom (Roper 1948; Smith 2014), followed by a now
little-​cited nine-​country study, “How Nations See Each Other” (Buchanan and Cantril
1953). But the most influential comparative study based on survey research in political
science was The Civic Culture (Almond and Verba 1963), which introduced and devel-
oped concepts that continue to shape contemporary studies of democracy. Surveys were
conducted in five countries: the United States, Britain, West Germany, Mexico, and Italy,
in 1959 and 1960. The theme was to investigate the consolidation of democracy and, in
particular, the political culture that might sustain it. The case selection was deliberate
and well-​conceived: the United States and Britain represented stable, long-​established
democracies; West Germany and Italy represented postauthoritarian regimes in which
democracy was becoming established; and Mexico represented a less-developed
country with what we would now describe as a partial democracy or hybrid regime.
With only five country cases, and given the much less powerful statistical resources
of the time, the cross-​national comparison was qualitative and descriptive, and the data
analysis was almost entirely made up of cross-tabulations. Engaging with their data through a rich mixture of normative theory and psychology, the researchers developed a typology of political cultures and identified the combination that they considered would best
support democracy. While The Civic Culture was subject to much criticism at the time,
some of which the authors later conceded was justified (Almond and Verba 1980), the
book remains a landmark of research in comparative political science. It was followed
up by a study on political participation and equality in seven nations (Verba, Nie, and
Kim 1978), and, not long afterward, by a five-​nation study of unconventional political
participation (Barnes, Kaase, et al. 1979).
However, none of these were election studies, as most of their fieldwork took place
between elections. Nor were they institutionalized, repeated, or longitudinal. With the
advance of economic and political integration in Europe, a source of funding for more continuous comparative research emerged in the form of the institutions
of the European Union. A five-​country “Attitudes to Europe” (1962) study paved the way.
In the context of the intensification of European economic integration, the European
Commission established the Eurobarometer in 1973. The Eurobarometer conducts
two surveys per year in each European Union member country, with a target of one
thousand interviews per country. The original mission was to observe public attitudes
toward the most important current events connected directly or indirectly with the de-
velopment of the European Union and the unification of Europe (Aldrin 2011).
By the turn of the twenty-​first century a number of comparative social science survey
projects had been established. Table 18.1 provides a list, their foundation date, and links
to further information.2 The first fully global collaboration in international survey re-
search was the World Values Survey (WVS), established in 1981 in tandem with the
European Values Survey (World Values Survey 2015). While the initial set of countries
tended to come from the developed world, the reach of the WVS has expanded to in-
clude countries with a wide range of cultures and stages of development. The WVS
follows a theme first investigated in The Civic Culture: the extent to which moderniza-
tion and economic development may be transforming values and cultures around the
world, particularly as a result of generational replacement (Inglehart 1997). Research
based on these data has produced major contributions to the literature and some
challenging and controversial findings on political development and political culture
(e.g., Welzel 2013).
The WVS has mounted seven waves, all covering three-​year periods, with roughly
two-​year gaps between these periods. The WVS established a model that has since been
applied in later cross-​national collaborations. The program itself maintains a central
infrastructure that organizes the formulation of questionnaire content for each wave,
collects the data, and makes them available, but the funding of surveys within the re-
spective countries is generally the responsibility of country collaborators, although
the WVS has sometimes provided financial assistance. This means that country cov-
erage is uneven, some countries having continuous representation, while others have
participated on a more episodic basis. This poses some problems that are shared with
some other cross-​national survey projects, discussed below.

Table 18.1 Major Cross-​National Survey Programs, 1973–​2015

Eurobarometer 1973 http://ec.europa.eu/public_opinion/index_en.htm
European Election Study 1979 http://eeshomepage.net/
World Values 1981 http://​www.worldvaluessurvey.org/​WVSContents.jsp
ISSP 1984 http://​www.gesis.org/​en/​issp/​issp-​home/​
CNEP 1990 http://​www.cnep.ics.ul.pt/​
Latino Barometer 1995 http://​www.latinobarometro.org/​latContents.jsp
CSES 1996 http://​www.cses.org
Afro-​Barometer 1999 http://​www.afrobarometer.org
Asian Barometer 2000 http://​www.asianbarometer.org
AsiaBarometer 2003 https://​www.asiabarometer.org/​
Pew Global Attitudes 2001 http://​www.pewglobal.org/​about/​
European Social Survey 2002 http://​www.europeansocialsurvey.org/​
Arab Barometer 2005 http://​www.arabbarometer.org/​
Gallup World Poll 2005 http://​www.gallup.com/​services/​170945/​world-​poll.aspx
The next international social survey to be established was the International Social
Survey Programme, in 1984. Its mission is to run annual surveys on “topics important
for the social sciences” (ISSP 2015). Each year has a theme, and the themes are repeated
after a period of intervening years. For example, there have been three studies of national
identity, begun in 1995 and repeated in 2001 and 2013. The ISSP began with four member
countries and had expanded to forty-​eight countries by 2013. Its central infrastructure is
quite limited, and it again relies on country-​collaborator funding for its surveys (Skjak
2010; Haller, Jowell, and Smith 2012). Unlike the WVS, which usually shapes the entire
questionnaire to be fielded in each country, the ISSP develops a module of questions
that are included within a broader national social survey.
In 1995 the Eurobarometer was joined by the Latino-​Barometer, covering countries
in Latin America; in 2000 by the Asian Barometer; and in 2005 by the Arab Barometer,
forming a loose network, the Global Barometer program (Global Barometer Surveys
2015). Another AsiaBarometer program, based in Japan, began in 2003. In 2002
there was a further European initiative, the European Social Survey (ESS). While the
Eurobarometer’s key themes tend to have a policy-​relevant focus in accord with the
concerns of its funder, the European Commission, the ESS is driven primarily by ac-
ademic researchers. The ESS has a strong methodological focus, one of its aims being
“to achieve and spread higher standards of rigor in cross-​national research in the so-
cial sciences, including for example, questionnaire design and pre-​testing, sampling,
data collection, reduction of bias and the reliability of questions” (ESS 2015). Relatively
speaking, the ESS has generous funding and therefore has considerable resources to put
into the pursuit of methodological excellence (Fitzgerald and Jowell 2010). In addition, because it operates within a regional framework, the ESS, like other similarly focused programs, faces fewer problems of cross-cultural variation than global studies do.
Comparative polling by commercial polling organizations and others outside the universities has been extensive, given that many such firms are themselves cross-national, either directly linked or affiliated.3 But these data tend to remain unreleased at the individual
level, appearing in reports or confidential documents released to clients commissioning
such research. A major exception is the Pew Global Attitudes Survey, which since 2002
has conducted annual surveys around the world “on a broad array of subjects ranging
from people’s assessments of their own lives to their views about the current state of
the world and important issues of the day.” In 2014 Pew reported having collected data
from sixty-​three countries, although in any one year the number has varied from only
fifteen to just under fifty (Pew Research Center 2014). The most recent entry to the field
and currently the most comprehensive has been the Gallup World Poll. It collects data
from over 160 countries, addressing many questions of interest to political science, such
as confidence in institutions and levels of human development. Its data are available on
a subscription basis, although some may be more easily accessible to academics (Gallup
2015; Kittilson 2007, 880; Tortora, Srinivasan, and Esipova 2010).
These various programs of comparative survey research have much in common, both
in their strengths and weaknesses. In terms of methodology, there are various well-​
understood challenges (Harkness 2008; Smith 2010b, 2014, 285–​286; Stegmueller 2011).
One challenge particularly relevant to political science is the timing of fieldwork. Because interest in politics waxes and wanes over the election cycle, and recall error increases over time, any variables associated with elections, or even with political participation in general, may be affected.
With fieldwork timed post-​election, defining elections as cases allows researchers
to more rigorously address new questions about how context influences behavior. One
early within-​country example is Markus (1988), who merged eight presidential election
studies to examine how national economic conditions influence voting behavior. An
early exercise in systematically comparing findings from national election studies was
Franklin (1992). The similarity of many of these election studies in theory and meth-
odology, not to mention the frequent use of similar or at least comparable instruments,
offered opportunities that were not generally foreseen for comparative research
(Thomassen 1994). This replication of surveys across countries had begun to make it
possible to investigate how institutional and cultural contexts affect electoral behavior.
Among political scientists principally interested in elections, attempts to take ad-
vantage of the common heritage of election studies and to exploit the opportunities for
comparative research began in the late 1980s. The first attempt to conduct cross-​national
election research was that of the Comparative National Elections Project (CNEP). Its
theme has been “the processes of intermediation through which citizens receive in-
formation about policies, parties, candidates, and politics in general during the course
of election campaigns, thus reviving the long neglected research perspective of the
‘Columbia School’ established by Paul Lazarsfeld and his colleagues in the 1940s and
1950s.” As of 2015 it included twenty-​five election studies collected in twenty countries
and had led to a significant list of publications (CNEP 2015). However, its focus tends
to remain largely on individual-​level factors, with less attention paid to differences be-
tween countries and elections themselves.

Comparative Study of Electoral Systems (CSES)

Background and Development


At the same time, a wider group of electoral researchers was forming the International
Committee for Research into Elections and Representative Democracy (ICORE), which
served as the precursor to the CSES. Like the ISSP, the CSES relies on national teams
of researchers to both fund and administer a common ten- to fifteen-minute module
of questions.4 This instrument is put into the field after a general election, along with
additional demographic, administrative, and other behavioral and attitudinal variables
that are usually part of a wider election study. The CSES began in 1996 and has grown
into a project that, early in 2015, included data from 146 elections in over fifty countries
and was accessible to all wishing to use it.5 In combination with the increased number
of democratic countries during this period, the CSES has been instrumental in increasing the number of countries running election studies. The CSES was developed to
address three questions: how social, political, and economic institutional contexts shape
beliefs and behaviors, affecting the nature and quality of democratic choice; the nature
of political and social cleavages and alignments; and how citizens evaluate democratic
institutions and practices (Grosse and Appleton 2009).
To date, four modules have been in the field, each focusing on a different theme. Table
18.2 provides a brief summary. Much more detail is of course available on the CSES web-
site.6 Modules are current for five years. In most countries, the CSES module is run in
a single election during that period, but some CSES collaborators have repeated the
same module in more than one election. While much of the CSES module does change
from one time to the next, a few core questions are becoming increasingly valuable for
time series analysis. Because many collaborators regard their commitment to the CSES
as including the module once only, in jurisdictions where more than one election is
held over the period of the module, there are sometimes gaps in the time series. Other
collaborators run the same module twice in those circumstances, a practice that should
be encouraged.
As noted, like the WVS and ISSP, the CSES is based on a national collaboration model,
rather than on a centralized one (Curtice 2007). Consequently, it is difficult to impose
rigorous methodological consistency across the various country studies. Many country
studies are established election studies, with their own instruments, time series, and
standards to maintain. Inclusion in the CSES requires a random probability national
sample that can, however, include a properly administered quota sample with substitu-
tion. Some contributed studies have been rejected for failing to meet those standards.
Quality control is a high priority. Collaborators are required to submit a detailed design

Table 18.2 CSES Modules and Themes, 1996–​2016


Module 1 (1996–2001): System performance: constitutional and institutional effects on democratic performance; the social underpinnings of party systems; attitudes to parties, political institutions, and the democratic process.
Module 2 (2002–2006): Accountability and representation: do elections make governments accountable, are citizens' views represented? Political participation and turnout; institutions and contexts in new democracies.
Module 3 (2006–2011): Political choices, contestation and inclusiveness: policy questions about electoral system design. In established democracies: how satisfaction varies with choices, how and why new parties are formed. In new democracies: electoral system design and political stability.
Module 4 (2011–2016): Distributional politics and social protection; campaign mobilization, new and old forms; a new approach to political knowledge.
report that is available to users (data from which are deployed in the analysis below).
Central coordination is split between the University of Michigan’s Survey Research
Center and the Leibniz Institute for the Social Sciences (GESIS), where the data sets are
cleaned and tested (Howell and Jusko 2009; Howell 2010). Users are provided with ex-
tensive documentation, which includes any information that might be relevant for the
inclusion or possible exclusion of a country/​election study on methodological grounds.

Case Selection
Of course the inclusion of country/​election cases is far from a random process, de-
pendent as it is on the willingness of country-​based researchers to participate and to
secure funding for an election study in the first place. While most countries included
maintain a continuous presence, some drop in and out as funding or collaborator avail-
ability permits. The nonrandom nature of country case selection in the CSES is the first
challenge we address here, one that is common to most, if not quite all, other similar re-
search programs.
Bormann and Golder (2013) collected data on all legislative and presidential elections
up to 2011 that had been held in democratic regimes. This provides a baseline population of elections against which the CSES cases fielded during the same period can be compared (thus excluding more recent country/elections).7 From its inception in
1996 through 2011, the CSES module was fielded in 116 democratic elections in forty-​six
countries.8 In thirty-​one countries the CSES module had been run at least twice, and in
nine countries the CSES module had been run in at least four elections. This, however,
is only a small fraction of the overall number of elections that were held in democratic
regimes during that period. While the CSES includes one of the largest cross-​national
surveys to date, the CSES sample consists of just 16% of all general parliamentary/​legis-
lative and presidential elections held between 1996 and 2011.
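The coverage figures reported here are straightforward to reproduce once a baseline list of democratic elections and a list of CSES-fielded elections are in hand. The following minimal sketch is illustrative only: the file names and column names are hypothetical placeholders, not the authors' actual data, and exact results depend on how elections are matched.

# Sketch: compute CSES coverage of democratic elections, overall and by region.
# Assumes two hypothetical CSV files with the columns used below.
import pandas as pd

# Baseline population of democratic elections, 1996-2011 (Bormann and Golder style):
# columns: country, election_date, region
elections = pd.read_csv("democratic_elections_1996_2011.csv")

# Elections in which a CSES module was fielded: columns: country, election_date
cses = pd.read_csv("cses_elections.csv").assign(in_cses=1)

# Match on country and election date (assumes consistent date formats)
merged = elections.merge(cses, on=["country", "election_date"], how="left")
merged["in_cses"] = merged["in_cses"].fillna(0).astype(int)

print(f"Overall coverage: {merged['in_cses'].mean():.1%}")   # roughly 16% in the text
print(merged.groupby("region")["in_cses"].agg(elections="size", share="mean"))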
As Table 18.3 shows, the coverage rate of the CSES is best in the West, which includes
Western Europe, the United States, Canada, Australia, and New Zealand, and in the
small number of democratic elections held in the Middle East and Northern Africa.
No election held under a democratic regime, as defined by Bormann and Golder, was covered in sub-Saharan Africa (n = 96) or the Pacific Islands (n = 57), and Latin America and Asia are also underrepresented.
Perhaps more important, the CSES appears not to be very representative with respect to electoral systems, which were at least initially a primary focus for the project (see below). Elections held in majoritarian systems account for only 7% of the sample,
although majoritarian elections formed 23% of all possible cases (sourced from IDEA
2015). Less than 10% of the CSES cases include presidential elections, compared to 31%
of the potential cases. To further examine this, we constructed a simple model of case
selection in which the dependent variable represents whether a survey was conducted
after the election that included the CSES module. The results are reported in Table
18.4. Some 30% of the variance in case selection can be explained by the electoral
Table 18.3 Representation of Elections by Region in the CSES (1996–​2011)


Region Elections Percent in CSES
1. Sub-Saharan Africa 96 0
2. Asia 81 15
3. West (incl. US, Canada, Australia, New Zealand) 165 35
4. Eastern Europe/​post-​Soviet states 130 18
5. Pacific Islands/​Oceania 57 0
6. Middle East/​North Africa 9 44
7. Latin America/​Caribbean 180 11
Total 718 16

Sources: CSES, Modules 1–​3; IDEA 2015.

Table 18.4 CSES Case Selection (Logit Coefficients)


Coef S.E.

Majoritarian system −1.82** 0.41
Mixed electoral system −0.22 0.33
Established democracy 1.61** 0.24
Presidential election −1.10** 0.30
Log of population in millions 0.45** 0.08
Constant −2.50** 0.25
Nagelkerke R2 0.30
N 675

**p < .01; *p < .05.
Sources: CSES, Modules 1–3; IDEA 2015.

system, democratic development, and the size of the country's population. Established democracies are much more likely to be included than newer democracies, and larger countries more likely than smaller ones, while presidential elections and majoritarian systems are underrepresented.9 There appear to be no significant differences in the selection of mixed electoral systems compared to proportional representation systems (the omitted category).
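For readers who want to reproduce this kind of case-selection analysis, the following sketch shows how such a logit could be estimated in Python with statsmodels. It is a minimal illustration under assumed column names (in_cses, majoritarian, mixed_system, established_democracy, presidential, population_millions), not the authors' code, and it will not recover the exact coefficients in Table 18.4 without the original data.

# Sketch: logit model of whether an election was covered by a CSES module.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("election_population.csv")      # one row per democratic election
df["log_pop_millions"] = np.log(df["population_millions"])

model = smf.logit(
    "in_cses ~ majoritarian + mixed_system + established_democracy"
    " + presidential + log_pop_millions",
    data=df,
).fit()
print(model.summary())                           # coefficients and standard errors

# statsmodels reports McFadden's pseudo-R2; Nagelkerke's R2 (as in Table 18.4)
# can be computed from the fitted and null log-likelihoods.
n = model.nobs
cox_snell = 1 - np.exp((2 / n) * (model.llnull - model.llf))
nagelkerke = cox_snell / (1 - np.exp((2 / n) * model.llnull))
print(f"Nagelkerke R2: {nagelkerke:.2f}")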
Another possible case selection would confine attention to countries that are members
of the Organisation for Economic Co-​operation and Development (OECD), signif-
icant both for the size of their populations and their economies, and often the refer-
ence point for much comparative research because of the higher quality and range of
data available from them. From this standpoint, up to mid-​2015 every single country
currently in the OECD has featured in the CSES, except for Luxembourg. However,
some countries have contributed data for every single election since 1996 (e.g., Poland,
Switzerland, and France for all presidential elections), while others have contributed but
one (Italy, Estonia, Slovakia). Overall, the OECD country response rate between 1996
and 2015 was just under 60%, estimated after the second release of Module 4 in March
2015. However, that should climb significantly when Module 4 data submission and re-
lease are complete.
It is important not to make too much of apparent “bias” in the CSES. So long as there
is sufficient variation in the macro-​level variables of interest across the country cases,
inferences can be drawn from properly specified models. However, researchers ought
to pay more attention to case selection issues. As noted previously, small countries—​
particularly very small countries—​are less likely to appear in the CSES, and indeed in
cross-​national survey samples in general. Inferences about OECD countries are unlikely
to be greatly affected by the absence of Luxembourg, for example (although it is one
of the world’s richest countries). However, about half of the world’s countries and ter-
ritories have populations of fewer than five million people, and a quarter have fewer
than half a million. Such countries tend to collect and report less information about
themselves. Much cross-​national comparative research is likely to have a large country
bias.10 But of course the majority of the world’s population is found in the larger coun-
tries. Yet cross-​national comparative researchers do not weight their data by population
size, because virtually all inferences would then be driven by the largest countries. The
whole point of cross-national comparative research is to use countries as cases, on the assumption that their particular characteristics are the variables in question; on that assumption, cross-national researchers should probably weight their country-cases equally. Most do not, although the CSES does provide an appropriate weight to do so. In multivariate analysis, of course, weights matter
less: most of the relevant parameters will be captured by the control variables and by
other features of model specification.
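To make the logic of equal country-case weighting concrete, the following sketch constructs weights under which every election contributes the same total weight to a pooled analysis. It only illustrates the principle: the CSES distributes its own weight variables, and the election_id column name here is a hypothetical placeholder.

# Sketch: weights that equalize each election's contribution to a pooled file.
import pandas as pd

def equal_case_weights(df: pd.DataFrame, case_col: str = "election_id") -> pd.Series:
    """Give every case (election) the same total weight, rescaled to mean 1."""
    case_sizes = df.groupby(case_col)[case_col].transform("size")
    weights = 1.0 / case_sizes                   # each case sums to exactly 1
    return weights * len(df) / weights.sum()     # rescale so the mean weight is 1

# Usage (hypothetical file):
# pooled = pd.read_csv("cses_pooled.csv")
# pooled["eqweight"] = equal_case_weights(pooled)
# pooled.groupby("election_id")["eqweight"].sum()   # identical totals per election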

The Multilevel Data Structure


If case selection continues to be a challenge, at least advances in statistical mod-
eling techniques give analysts more scope to address some of the problems and some
assurance of greater rigor in comparative analysis. As in similar international studies,
new strategies of analysis have come to the fore in recent years. Since The Civic Culture,
methodological standards have risen, and the capacities of statistical techniques and
computer hardware and software have increased to match them. No longer is it suffi-
cient to simply compare frequencies and cross-​tabulations between countries.
The CSES has led the way in combining individual-​level data and country-​level data,
opening up new possibilities, but at the cost of increasing complexity. When pooling
cross-​national comparative survey data, one must also take account of their multiple
levels, and in particular the nesting of individuals within countries. As noted previously,
analysis is also possible over time, adding a further dimension. Thus models are needed
to provide for random intercepts for each country (or country-​year/​election) and, quite
frequently, random slopes, on the assumption that the effects of the variables in question
will not be the same across time and space (Gelman and Hill 2007, 235–342).
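A minimal sketch of what such a specification looks like in practice is given below, using a linear mixed model with respondents nested in country-elections. The variable names are hypothetical placeholders rather than CSES field names, and a binary outcome would instead call for a multilevel logit estimated with other tools; the point is only to show the random-intercept and random-slope structure, including a cross-level interaction between a micro and a macro variable.

# Sketch: two-level models with respondents nested in elections.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("cses_pooled.csv")              # hypothetical pooled file

# Random intercept per election; fixed effects for an individual-level variable,
# a macro-level institutional dummy, and their cross-level interaction.
ri_model = smf.mixedlm(
    "satisfaction_democracy ~ education * majoritarian",
    data=df,
    groups=df["election_id"],
).fit()
print(ri_model.summary())

# Adding a random slope, so the education effect varies across elections.
rs_model = smf.mixedlm(
    "satisfaction_democracy ~ education * majoritarian",
    data=df,
    groups=df["election_id"],
    re_formula="~education",
).fit()
print(rs_model.summary())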
While multilevel models can address these questions, with a data set the size of the CSES more complex specifications (those with more than two levels, or with random slopes) may not always converge, can take a long time to run, and require more advanced methodological skills to interpret. There may be systematic, culturally derived differences
between countries in terms of response patterns, some leaning to the extremes, others
closer to the middle, which sophisticated methods can be used to address (Stegmueller
2011). When analyzing smaller subsets of units, standard errors may become biased using
standard frequentist methods, requiring a Bayesian approach. Indeed, given the non-
random selection of country cases, an argument can be made that Bayesian approaches
should be used more generally (Western and Jackman 1994; Stegmueller 2013). Other
techniques of multilevel analysis have also been proposed and implemented, such as the
“two-​step” method (Jusko and Shively 2005), but most published work using the CSES,
at least, tends to employ multilevel, random-​intercept models.
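As a contrast with a single pooled multilevel model, the following sketch illustrates the logic of the two-step strategy: estimate the same individual-level model election by election, then model the resulting coefficients with macro-level variables. File, variable, and column names are hypothetical placeholders, and weighting the second stage by first-stage precision is one common choice rather than a fixed rule.

# Sketch: a "two-step" analysis in the spirit of Jusko and Shively (2005).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("cses_pooled.csv")              # individual-level pooled file
macro = pd.read_csv("macro_variables.csv")       # one row per election_id

# Step 1: within-election regressions, keeping the coefficient of interest.
rows = []
for election_id, grp in df.groupby("election_id"):
    if len(grp) < 200:                           # skip very small samples
        continue
    fit = smf.ols("satisfaction_democracy ~ education + age", data=grp).fit()
    rows.append({"election_id": election_id,
                 "b_education": fit.params["education"],
                 "se_education": fit.bse["education"]})
step1 = pd.DataFrame(rows).merge(macro, on="election_id")

# Step 2: explain variation in the education effect with a macro-level variable,
# weighting each case by the precision of its first-stage estimate.
step2 = smf.wls("b_education ~ majoritarian", data=step1,
                weights=1.0 / step1["se_education"] ** 2).fit()
print(step2.summary())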

Question Design: Translation, Institutions, and Context


Like other international surveys, the CSES must address other significant problems: the
translation of its instruments into numerous languages, and indeed, the broader concern
that even with the most accurate translation, some questions and concepts will simply
not mean the same thing in a different context. The questionnaire is first produced in
English, but it is developed within the framework of the CSES Planning Committee, whose membership has always included native speakers of a broad range of languages. Difficulties of translation therefore enter the question design process very early.
Collaborators who administer the questionnaire in languages other than English pro-
duce their own translations, recording details of the translation process, including notes
about questions and concepts that are difficult to translate. Following current standards
of cross-​national survey design, these are recorded in the design report for each country
submitted by collaborators and made available to users in the documentation associated
with the CSES data sets (Survey Research Center 2010).
One of the more contentious debates within the CSES has been on how best to es-
timate respondents’ political knowledge. In an ideal world, one would design a bat-
tery of questions to be asked in all countries that would allow us to compare levels
of political knowledge cross-​nationally. Yet institutional and cultural differences are
such that the search for such a common battery is akin to that for the Holy Grail of
Christian mythology. Nonetheless, some do argue for a more consistent design of po-
litical knowledge questions across countries (e.g., Milner 2002). In Modules 1–​3, the
objective was simply to estimate the distribution of political knowledge within each
country, on a similar scale. Collaborators were asked to choose three questions, to one of which two-thirds of respondents were expected to provide the correct answer,
to the second of which half were expected to do so, and to the third, only one-​third.
This was intended to produce a scale with a similar mean and standard deviation in
each country that would provide an estimate of relative levels of knowledge within
each country. However, the substantive content of the questions was left entirely to
collaborators, increasing uncertainty about their value and robustness. As it turned
out, further standardization of the scale within countries was usually necessary, as not
all collaborators could accurately calibrate their questions to the requested distribu-
tion. Analysis of the questions over the first two modules found significant measure-
ment problems (Elff 2009).
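The within-country standardization described here is simple to implement once the three items are coded as correct or incorrect. The following sketch is a generic illustration with hypothetical item and column names, not the CSES processing code.

# Sketch: build a knowledge scale and standardize it within each election.
import pandas as pd

df = pd.read_csv("cses_pooled.csv")                      # hypothetical pooled file
items = ["know_item1", "know_item2", "know_item3"]       # coded 1 = correct, 0 = not

df["knowledge_raw"] = df[items].sum(axis=1)

def zscore(s: pd.Series) -> pd.Series:
    return (s - s.mean()) / s.std(ddof=0)

# Standardize separately within each country/election so distributions are comparable.
df["knowledge_std"] = df.groupby("election_id")["knowledge_raw"].transform(zscore)

# Sanity check: each election now has mean ~0 and standard deviation ~1.
print(df.groupby("election_id")["knowledge_std"].agg(["mean", "std"]).head())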
For Module 4, four standard questions were developed: which party had come in
second in the election in question; the name of the minister of finance or equivalent;
the most recent unemployment figure; and the name of the Secretary-​General of the
United Nations (CSES 2011, 18–​20). The first three questions, in particular, were in-
tended to capture the extent to which respondents could grasp who was, or who was
not, in government, and the extent to which they might be aware of that government’s
economic performance. The latter question in particular was calibrated to the broader
substantive content of Module 4. Because of different institutional frameworks and
other contextual differences, different levels of knowledge of these questions are ex-
pected across countries. Assuming sufficient variation, standardized scales will
be produced for each country. Country variation in these responses to the same
instruments could be of interest in certain areas of research, addressing the question
of institutional and other contextual differences that might account for such varia-
tions, as well as their implications.
Another vexed matter of question design debated within the CSES has been the use
of the standard left-​right scale as a basis for estimating the dimensionality of the party
system and where individuals situate themselves within it. The question, on an eleven-​
point scale from 0 (“Most Left”) to 10 (“Most Right”), asks respondents to place both
parties and themselves on that scale. Some country collaborators argue that Left-​Right
means little or nothing in their countries, and they do have the option of including an
alternative dimension that they think is more meaningful.
A more fundamental and related problem is the limited space available for the
module. In this case, the CSES faces a greater problem than other cross-​national surveys
that can command most, if not all, of the questionnaire space for their comparative
questions. Because the CSES is usually incorporated within a broader election study
questionnaire, there is much greater competition for space. This sometimes means that
collaborators will drop a question or questions from the module or demographics. It
also means that multiple instruments to better estimate an underlying variable or di-
mension are usually excluded; one question alone must suffice. Innovative advances
in survey research, such as vignettes or experiments, have yet to be implemented. The
strategy has been to keep instruments and the batteries within them as simple, short,
and straightforward as possible.
Fieldwork, Mode, and Response Rates


As noted, a feature common in cross-​ national surveys is the need for country
collaborators to obtain their own funding. Limited funding often constrains the options available for fieldwork. The optimal method recommended
by the CSES Planning Committee is face-​to-​face (FTF) interviews with a sample of
respondents selected from a national probability design. These surveys have long been
considered to be the “gold standard” because of their ability to achieve longer interviews
with high response rates. Respondents are much more likely to cooperate if they are
approached in person, as opposed to receiving a self-​completion questionnaire in the
mail or a call on the telephone or email message. This is confirmed in Table 18.5, which
shows that within the CSES, FTF surveys have an average response rate of 57%, which is
higher than the average response rate achieved through other methods.
FTF surveys are very costly, and costs are reaching a point at which they may soon be unsustainable in some countries. For example, the 2012 American National Election Study
(ANES) that contains CSES Module 4 was estimated to cost $4.2 million to complete
two thousand FTF interviews of seventy minutes in length (both pre and post), or $2,100
per respondent. The Economic and Social Research Council’s (ESRC) call for the 2015
British Election Study (BES) was for a maximum of £1.25 million, most of which will be
devoted to the core FTF probability sample, which traditionally consists of about three
thousand completed FTF interviews (Karp and Luhiste 2016).
As Table 18.5 shows, FTF interviews are the dominant mode in the majority of studies
within the CSES, if only because the costs of such interviews remain lower in many
countries than in Britain or the United States. However, 20% of the election studies were

Table 18.5 Election Study Designs and Response Rates in the CSES


Mode Response Rate n

Face to face 57.2 75
Telephone 45.1 21
Mail 45.4 8

Module 1 (1996–2001) 60.0 23
Module 2 (2001–2006) 52.1 31
Module 3 (2006–2011) 53.5 42
Module 4 (2011–​2016) 45.4 8

No incentive 54.6 76
Token 48.1 7
Payment (i.e., lottery) 52.7 16

Source: Compiled from Design Reports, Comparative Study of Electoral Systems (2011).
conducted by telephone. Telephone surveys tend to suffer from declining response rates
as well as diminished coverage of households by landlines and the increased use of mo-
bile phones. Estimates from the National Health Interview Survey (NHIS) conducted
in the second half of 2013 indicate that two in every five households (39.1%) had only wireless telephones (Blumberg and Luke 2014). The high level
of mobile-​only households in the United States is not unique. Estimates from Europe
indicate that the number of households with only mobile phones increased dramati-
cally in the 2000s. As of 2009, three-quarters of the Finnish population lived in mobile-only households. The rate of mobile-only coverage varies substantially across Europe. By
2009 a majority in Slovakia, Latvia, Lithuania, and the Czech Republic had only mobile
phones, although Europeans in other democracies were not so quick to abandon their
landlines (Mohorko, de Leeuw, and Hox 2013). These differences pose new challenges
for survey researchers that are not just restricted to reaching respondents but include
interviewing them in different contexts (Lynn and Kaminska 2012).
Variation in survey practices and standards across countries raises the question of
whether observed differences are real (Heath, Fisher, and Smith 2005). Countries
with low response rates are likely to underrepresent potential participants, for example those with lower levels of education, leading to a biased sample that may not be corrected by weighting or by applying controls.11 There is also considerable inconsistency in
the collaborators’ calculations and reporting of response rates themselves, of which the
CSES is well aware.
As Table 18.5 shows, response rates vary not only across mode but also across time.
However, on the surface at least the response rates for telephone interviews from other
countries in the CSES do not appear to be substantially lower than for FTF surveys.
Australia and New Zealand rely almost entirely on the “mail-​back” method, mailing
questionnaires to respondents randomly sampled from the electoral register, thus ex-
cluding those who are not registered from their samples (although these numbers are
usually less than 10%).12 Both countries have robust mail delivery systems. While mail
surveys have the advantage of low costs, they may not be viable where postal systems are
less reliable.
As a result of these and other differences, response rates differ between country
studies and for the most part are declining over time within countries, a feature common
to most survey research and polling. This is also evident in Table 18.5. Yet the differences
in response rates across modes are not as high as one might have expected.13
Research shows that providing respondents with different mode options can in
some circumstances reduce response rates (Griffin, Fisher, and Morgan 2001), but in
others enhance them (Dillman, Smyth, and Christian 2009). Many researchers seek
to encourage respondents to use the Internet to reduce survey costs. They may present
their sample with a first option of web only, but later offer mail-​back as an option for
nonrespondents. This tends to reduce response rates, as web surveys tend to have low
response rates (Manfreda et al. 2008; Shinn, Baker, and Briers 2007). When given un-
constrained choice between mail-​back and web interface from the beginning, by far the
majority of respondents choose hard copy (Bensky, Link, and Shuttles 2010). Offering
an additional web option can encourage procrastination and thus nonresponse in some
cases (Medway and Fulton 2012; Millar and Dillman 2011). However, simultaneous
mode offering can enhance response rates if one—​the mail-​back—​is seen as the pri-
mary mode and the other—​the web—​is offered less prominently (Newsome et al. 2013).
As this is the case with the Australian and New Zealand election studies, our expectation
is that their web option as a supplementary add-​on to mail-​back should marginally en-
hance their response rates.
Debate continues within the CSES about whether or not to accept data that are not
based on a random probability sample. In 2005 and 2010 the BES included the CSES
module on a nonprobability Internet-​based sample; both times it was rejected by the
CSES Planning Committee. While it may be the case that online nonprobability samples
drawn from repeatedly contacted panels can match patterns of party choice and much
of what lies behind such choices (Sanders et al. 2007), the objectives of the CSES range
far beyond simple party choice. In an online panel, perceptions of the accountability
and representativeness of government and political leaders, satisfaction with democ-
racy, and age-related patterns of turnout may be subject to more bias than in random probability samples using traditional methods, even when the latter's response rates are low (Karp and Luhiste 2016).

Conclusions

The development of comparative cross-national survey research programs in social and political science has transformed the field of comparative politics. One can now
talk of “comparative political behavior” as a significant subfield of political science,
in a way that was not so credible twenty years ago. Over this period, a paucity of data
has turned into, if anything, an oversupply, albeit with notable deficits in coverage. Yet significant challenges remain. Inattention to nonrandom country case selection
issues, problems of comparability of question design, variations in country sampling,
and questionnaire modes and response rates expose researchers to risks of making
incorrect inferences. But these challenges can be addressed. The CSES provides de-
tailed reports that can be used to identify potential problems. Researchers should subject their cases to scrutiny and, as a last resort, even discard those for which doubts are serious enough to affect findings on the particular research question being addressed.
We must also acknowledge that declining response rates, increasing survey costs,
and declining social science research budgets all combine to make the future of cross-​
national survey research programs uncertain, despite recent progress.14 Nonetheless,
election study participation in the CSES increased through Modules 1 to 3 and is likely
to do so again in Module 4. The number of publications using the CSES has also been on an
upward track. Increasingly sophisticated methods are being developed to compensate
for some of the methodological challenges posed by the national collaboration model.
Notes
1. In 1977 the Michigan Election Studies was changed to the National Election Studies, where
control over content and design was vested in a board of overseers appointed by the prin-
cipal investigator in consultation with the National Science Foundation (Miller 1994). In
2005 the National Election Studies became known as the American National Election
Studies.
2. A more comprehensive list including several regional studies can be found in Kittilson
(2007, 867–​887).
3. Aside from the two commercial firms noted here, Gfk NOP, Harris Interactive, IPSOS,
Synovote/​Agis, and TNS have also been active in cross-​national polling (Smith 2010a).
4. We thank Dave Howell for his very helpful comments on an earlier draft of this chapter,
but of course take full responsibility ourselves for what follows.
5. Some of the cases include elections that are not sovereign nations, such as Hong Kong.
6. Major studies have emerged and are emerging from the CSES: Norris (2004); Klingemann
(2009); Dalton and Anderson (2010); Dalton, Farrell, and McAllister (2011); Thomassen
(2014); Vowles and Xezonakis (2016). A short analysis of studies published up until 2009
can be found in Vowles (2009).
7. Bormann and Golder (2013) define democratic regimes as requiring the election of a
chief executive and legislature, more than one party competing in elections, and an al-
ternation in power under identical rules. For this reason, South Africa does not qualify,
because it has not experienced an alternation in power since the end of apartheid.
South Africa ran the CSES third module in 2009, the only African country so far to
participate.
8. Studies in which the CSES has been run under regimes that were not full democracies are
not included in this figure.
9. However, India, the world’s largest democracy, is yet to be included in the CSES, despite
efforts by successive planning committees to encourage its participation.
10. One reason for the underrepresentation of majoritarian countries in the CSES is that
many of these are small Caribbean or Pacific Island democracies that were former British
colonies.
11. The CSES asks its collaborators to provide a comparison of the educational profile of their
sample with that of the population and provides the opportunity for collaborators to in-
clude demographic and political weights to correct for biases related to sampling error and
nonresponse bias.
12. Australia and New Zealand also offered respondents the choice of completing the survey
online, but surprisingly few took up this option.
13. Paying respondents per interview or providing token incentives do not apparently
contribute to higher response rates in the CSES, but given the broad thrust of the
survey methodology literature indicating that these methods are effective, this is al-
most certainly a result of endogeneity. Payments and incentives are likely applied in
cases where nonresponse problems are strongest, not where response rates are still
relatively high.
14. For example, at the CSES plenary meeting in Berlin in 2014 that elected the planning
committee for Module 5, reports from many election study teams repeated a similar
theme that funding remained uncertain and continuation in the field could not be
guaranteed.
References
Aldrin, P. 2011. “The Eurobarometer and the Making of European Opinion.” In Perceptions of
Europe: A Comparative Sociology of European Attitudes, edited by D. Gaxie, N. Hube, and J.
Rowell, 17–​34, Colchester, UK: ECPR Press.
Almond, G., and S. Verba. 1963. The Civic Culture: Political Attitudes and Democracy in Five
Nations. Princeton, NJ: Princeton University Press.
Almond, G., and S. Verba, eds. 1980. The Civic Culture Revisited. Boston: Little, Brown.
“Attitudes to Europe.” 1962. http://​www.worldsocialscience.org/​documents/​attitudes-​towards-​
europe-​1962.pdf.
Barnes, S., and M. Kaase, et al. 1979. Political Action. London: Sage.
Bensky, E. N., M. Link, and C. Shuttles. 2010. “Does the Timing of Offering Multiple Modes
of Return Hurt the Response Rate?” Survey Practice 3 (5). http://​www.surveypractice.org/​
index.php/​SurveyPractice/​article/​view/​146/​html.
Berelson, B. R., P. F. Lazarsfeld, and W. N. McPhee. 1954. Voting: A Study of Opinion Formation in a Presidential Campaign. Chicago: University of Chicago Press.
Blumberg, S. J., and J. V. Luke. 2014. "Wireless Substitution: Early Release of Estimates from the National Health Interview Survey, July–December 2013." Centers for Disease Control and Prevention. http://www.cdc.gov/nchs/data/nhis/earlyrelease/wireless201407.pdf.
Bormann, N., and M. Golder. 2013. “Democratic Electoral Systems around the World, 1946–​
2011.” Electoral Studies 32: 360–​369.
Buchanan, W., and H. Cantril.1953. How Nations See Each Other: A Study in Public Opinion.
Urbana: University of Illinois Press.
Burdick, E., and A. J. Brodbeck. 1959. American Voting Behavior. Glencoe, IL: The Free Press.
Campbell, A., P. Converse, W. Miller, and D. Stokes. 1960. The American Voter.
Chicago: University of Chicago Press.
Cantril, H., and M. Strunk. 1951. Public Opinion 1935–1946. Princeton, NJ: Princeton University Press.
Comparative National Elections Project (CNEP). 2015. http://​www.cnep.ics.ul.pt/​index1.asp.
Comparative Study of Electoral Systems (CSES). 2011. "The Comparative Study of Electoral Systems (CSES) Module 4 Theoretical Statement." http://www.cses.org/plancom/module4/CSES_Module4_TheoreticalStatement.pdf.
Converse, J. M. 1987. Survey Research in the United States: Roots and Emergence 1890–​1960.
Oakland: University of California Press.
Curtice, J. 2007. “Comparative Opinion Surveys.” In The Oxford Handbook of Political Behavior,
edited by R. Dalton and H. Klingemann, 897–​909. New York: Oxford University Press.
Dalton, R., and C. Anderson, eds. 2010. Citizens, Context, and Choice: How Context Shapes
Citizens’ Electoral Choices. Oxford: Oxford University Press.
Dalton, R., D. Farrell, and I. McAllister. 2011. Political Parties and Democratic Linkage: How
Parties Organise Democracy. Oxford: Oxford University Press.
Dillman, D. A., J. D. Smyth, and L. M. Christian. 2009. Internet, Mail, and Mixed-​Mode
Surveys: The Tailored Design Method. 3rd ed. Hoboken, NJ: John Wiley & Sons.
Elff, M. 2009. “Political Knowledge in Comparative Perspective:  The Problem of Cross-​
National Equivalence of Measurement.” Paper presented at the MPSA 2009 Annual National
Conference, April 2–​5, 2009, Palmer House Hilton, Chicago. http://​www.martin-​elff.net/​
uploads/​Elff-​PolKnowledgeEquivMeasMPSA2009.pdf.
European Social Survey (ESS). 2015. “About the European Social Survey European Research
Infrastructure.” http://​www.europeansocialsurvey.org/​about/​.
Fitzgerald, R., and R. Jowell. 2010. “Measurement Equivalence in Cross-​National Surveys: The
European Social Survey (ESS) from Design to Implementation and Beyond.” In Survey
Methods in Multinational, Multiregional, and Multicultural Contexts, edited by J. A.
Harkness, M. Braun, B. Edwards, T. P. Johnson, L. Lyberg, P. P. Mohler, B. Pennell, and T.
W.Smith, 485–​496. Hoboken, NJ: John Wiley & Sons.
Franklin, M. N. 1992. “The Decline of Cleavage Politics.” In Electoral Change:  Responses to
Evolving Social and Attitudinal Structures in Western Countries, edited by M. N. Franklin, T.
T. Mackie, and H. Valen, 383–​405. Cambridge, UK: Cambridge University Press.
Gallup. 2015. "What the Whole World Is Thinking." http://www.gallup.com/services/170945/world-poll.aspx.
Gelman, A., and J. Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge, UK: Cambridge University Press.
Global Barometer. 2015. “Background.” http://​www.globalbarometer.net/​page/​background.
Griffin, D., D. Fisher, and M. Morgan. 2001. “Testing an Internet Response Option for the
American Community Survey.” Paper presented at the American Association for Public
Opinion Research, New Orleans, May.
Grosse, A., and A. Appleton. 2009. “ ‘Big Social Science’ in Comparative Politics: The History of
the Comparative Study of Electoral Systems.” In The Comparative Study of Electoral Systems,
edited by H.-​D. Klingemann. Oxford: Oxford University Press.
Haller, M., R. Jowell, and T. W. Smith. 2012. The International Social Survey Programme 1984–2009: Charting the Globe. London: Routledge.
Harkness, J. A. 2008. “Comparative Social Research: Goals and Challenges.” In International
Handbook of Survey Methodology, edited by E. D. de Leeuw, J. J. Hox, and D. Dillman.
New York: Taylor and Francis.
Heath, A., S. Fisher, and S. Smith. 2005. “The Globalisation of Public Opinion Research.”
Annual Review of Political Science 8: 295–​333.
Holmberg, S. 1994. “Election Studies the Swedish Way.” European Journal of Political Research
25 (3): 309–​322.
Howell, D. A., and K. L. Jusko. 2009. "Methodological Challenges: Research Opportunities
and Questions for the Future.” In The Comparative Study of Electoral Systems, edited by H.-​
D. Klingemann. Oxford: Oxford University Press.
Howell, D., 2010. “Enhancing Quality and Comparability in the Comparative Study of Electoral
Systems.” In Survey Methods in Multinational, Multiregional, and Multicultural Contexts,
edited by J. A. Harkness, M. Braun, B. Edwards, T. P. Johnson, L. Lyberg, P. P. Mohler, B.
Pennell, and T. W. Smith, 525–​534. Hoboken, NJ: John Wiley & Sons.
Herbst, S. 1993. Numbered Voices:  How Opinion Polling Has Shaped American Politics.
Chicago: University of Chicago Press.
Institute for Democracy and Electoral Assistance (IDEA). 2015. Unified Database. http://​www.
idea.int/​uid/​.
Inglehart, R. 1997. Modernization and Post-​Modernization: Cultural, Economic, and Political
Change in 43 Societies. Princeton, NJ: Princeton University Press.
International Social Survey Programme (ISSP). 2015. “International Social Survey
Programme: General Information.” http://​www.issp.org/​.
Jusko, K. L., and W. P. Shively. 2005. “Applying a Two-​Step Strategy to the Analysis of Cross-​
National Public Opinion Data.” Political Analysis 13 (4): 327–​344.
Kaase, M., and H. Klingemann. 1994. “Electoral Research in the Federal Republic of Germany.”
European Journal of Political Research 25 (3): 343–​366.
Karp, J. A., and M. Luhiste. 2016. "Explaining Political Engagement with Online Panels: Comparing the British and American Election Studies." Public Opinion Quarterly 80 (3): 666–693.
Kittilson, M. C. 2007. "Research Resources in Comparative Political Behavior." In The
Oxford Handbook of Political Behavior, edited by R. Dalton and H. Klingemann, 865–​895.
New York: Oxford University Press.
Klingemann, H.-​D., ed. 2009. The Comparative Study of Electoral Systems. Oxford:  Oxford
University Press.
Lazarsfeld, P. F., B. Berelson, and H. Gaudet. 1948. The People's Choice: How the Voter Makes Up
His Mind in a Presidential Campaign. New York: Columbia University Press.
Manfreda, K. L., M. Bosnjak, J. Berzelak, I. Haas, and V. Vehovar. 2008. "Web Surveys versus
Other Survey Modes.” International Journal of Market Research 50 (1): 79–​104.
Markus, G. 1988. “The Impact of Personal and National Economic Conditions on the
Presidential Vote: A Pooled Cross-​Sectional Analysis.” American Journal of Political Science
32 (1): 137–​154.
Medway, R., and J. Fulton. 2012. “When More Gets You Less: A Meta-​analysis of the Effect
of Concurrent Web Options on Mail Survey Response Rates.” Public Opinion Quarterly
76: 733–​746.
Millar, M. M., and D. A. Dillman. 2011. “Improving Response to Web and Mixed-​Mode
Surveys.” Public Opinion Quarterly 75 (2): 249–​269.
Miller, W. E. 1994. “An Organizational History of the Intellectual Origins of the American
National Election Studies.” European Journal of Political Research 25 (3): 247–​265.
Milner, H. 2002. Civic Literacy: How Informed Citizens Make Democracy Work. Lebanon, NH: University Press of New England.
Mohorko, A., E. de Leeuw, and J. Hox. 2013. “Coverage Bias in European Telephone
Surveys: Developments of Landline and Mobile Phone Coverage across Countries and over
Time.” Survey Methods: Insights from the Field. http://​surveyinsights.org/​?p=828.
Newsome, J., K. Levin, P. Langetieg, M. Vigil, and M. Sebastiani. 2013. “Multi-​Mode Survey
Administration:  Does Offering Multiple Modes at Once Depress Response Rates?”
Paper presented at American Association for Public Opinion Research (AAPOR)
68th Annual Conference. www.websm.org/db/12/16579/WebSurveyBibliography/MultiMode_Survey_Administration_Does_Offering_Multiple_Modes_at_Once_Depress_Response_Rates/.
Norris, P. 2004. Electoral Engineering: Voting Rules and Political Behavior. New York: Cambridge
University Press.
Pew Research Center. 2014. “Global Trends and Attitudes: Survey Reports 2014.” http://​www.
pewglobal.org/​category/​publications/​survey-​reports/​2014/​.
Roper, E. 1948. Where Stands Freedom: A Report on the Findings of an International Survey of
Public Opinion. New York: Time Magazine.
Sanders, D., H. Clarke, M. Stewart, and P. Whitely. 2007. “Does Mode Matter For Modelling
Political Choice? Evidence from the 2005 British Election Study.” Political Analysis 15
(3): 257–​285.
Shinn, G., M. Baker, and G. Briers. 2007. “Response Patterns: Effect of Day of Receipt of an
E-​mailed Survey Instrument on Response Rate, Response Time, and Response Quality.”
Journal of Extension 45 (2). http://​www.joe.org/​joe/​2007april/​rb4.php.
408    Jeffrey A. Karp and Jack Vowles

Skjak, K. K. 2010. “The International Social Survey Programme:  Annual Cross-​National


Surveys Since 1985.” In Survey Methods in Multinational, Multiregional, and Multicultural
Contexts, edited by J. A. Harkness, M. Braun, B. Edwards, T. P. Johnson, L. Lyberg, P. P.
Mohler, B. Pennell, and T. W. Smith, 497–​506. Hoboken, NJ: John Wiley & Sons.
Smith, T. W. 2010a. “The Globalisation of Survey Research.” In Survey Methods in Multinational,
Multiregional, and Multicultural Contexts, edited by J. A. Harkness, M. Braun, B. Edwards, T.
P. Johnson, L. Lyberg, P. P. Mohler, B. Pennell, and T. W. Smith, 477–​484. Hoboken, NJ: John
Wiley & Sons.
Smith, T.W. 2010b. “Surveying Across Nations and Cultures.” In Handbook of Survey Research,
2nd ed., edited by P. V. Marsden and J. D. Wright. Bingley: Emerald Group Publishing.
Smith, T. W. 2014. “Cross-​National Public Opinion Research.” In The Concise Encyclopedia
of Comparative Sociology, edited by M. Sasaki, J. Goldstone, E. Zimmermann, and S.
Sanderson, 281–​289. Leiden: Brill.
Stegmueller, D. 2011. “Apples and Oranges? The Problem of Equivalence in Comparative
Research.” Political Analysis 19: 471–​487.
Stegmueller, D. 2013. “How Many Countries for Multilevel Modeling? A  Comparison of
Frequentist and Bayesian Approaches.” American Journal of Political Science 57 (3): 748–​761.
Survey Research Center. 2010. Guidelines for Best Practice in Cross-​Cultural Surveys. Ann
Arbor: Survey Research Center, Institute for Social Research, University of Michigan. http://​
www.ccsg.isr.umich.edu/​.
Thomassen, J. 1994. “An Intellectual History of Election Studies.” European Journal of Political
Research 25 (3): 239–​245.
Thomassen, J. 2014. Elections and Democracy:  Representation and Accountability.
Oxford: Oxford University Press.
Tortora, R. D., R. Srinivasan, and N. Esipova. 2010. “The Gallup World Poll.” In [Harkness,
J.A., 2008. ‘Comparative Social Research: Goals and Challenges’ in] International Handbook
of Survey Methodology, edited by E. D. de Leeuw, J. J. Hox, and D. Dillman, 535–​544.
New York: Taylor and Francis.
Verba, S., N. Nie, and J. O. Kim. 1978. Participation and Political Equality:  A Seven-​Nation
Study. Chicago: Chicago University Press.
Vowles, J. 2009. “The CSES:  Achievements and Future Options.” Paper presented at the
CSES Plenary Conference, Toronto, September 2009. http://​www.cses.org/​plancom/​
2009Toronto/​CSES_​2009Toronto_​TaskForce.pdf.
Vowles, J., and G. Xezonakis, eds. 2016. Globalization and Domestic Politics:  Parties, Public
Opinion, and Elections. Oxford: Oxford University Press.
Welzel, C. 2013. Freedom Rising:  Human Empowerment and the Quest for Emancipation.
New York: Cambridge University Press.
Western, B., and S. Jackman. 1994. “Bayesian Inference for Comparative Research.” American
Political Science Review 88 (2): 412–​423.
World Values Survey. 2015. “World Values Survey:  Who We Are.” http://​www.
worldvaluessurvey.org/​WVSContents.jsp.
Cross-National Surveys & Comparative Study of Electoral Systems    409

Appendix: Selected List of National Election Study Websites

The National Election Study (United States): http://www.electionstudies.org/
The British Election Study: http://www.britishelectionstudy.com/
The Swedish National Election Studies: http://valforskning.pol.gu.se/english
The French National Election Study: http://www.cevipof.fr/fr/eef2017/fnes/
The Danish National Election Study: http://www.valgprojektet.dk/default.asp
The Dutch Parliamentary Election Studies: http://www.dpes.nl/en/
German Federal Election Studies: http://www.gesis.org/en/elections-home/german-federal-elections/
Chapter 19

Graphical Visualization of Polling Results

Susanna Makela, Yajuan Si, and Andrew Gelman

Introduction

Graphics are an integral part of modern statistics and political science. Gelman and
Unwin (2013) propose several goals for statistical graphics, divided into “discovery”
goals and “communication” goals. Discovery goals for graphics include giving an
overview of the content of a data set, a sense of its scale and complexity, and explora-
tion for any unexpected aspects. Communication goals are useful for both a general
audience and specialists. Compared to tables, graphs allow many more comparisons
to be visible at once and thus can make even complex statistical reasoning more ac-
cessible to a general audience. In addition, graphs can help statisticians better eval-
uate their assumptions and interpret their inferences, and they help social scientists
to better extract and evaluate the substantive claims and conclusions of models.
Polling is expensive, and falling response rates necessitate the most effective use of
available data. Modeling allows us to obtain better estimates, especially for small cells
defined by demographic groups of interest, by borrowing strength across available data.
New polling methods using nonprobability samples also require statistical modeling for
generalizability; see, for example, Wang et al. (2015).
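To make the poststratification step concrete, the sketch below (in Python; the cell structure, numbers, and data frames are hypothetical illustrations, not the inputs of any study discussed here) weights model-based cell estimates by census cell counts to produce population- and subgroup-level estimates.

import pandas as pd

# Hypothetical model-based estimates of Obama support for each demographic
# cell, together with (also hypothetical) census counts for the same cells.
cell_estimates = pd.DataFrame({
    "sex": ["m", "m", "f", "f"],
    "age": ["18-29", "30+", "18-29", "30+"],
    "y_hat": [0.58, 0.47, 0.63, 0.52],
})
census_counts = pd.DataFrame({
    "sex": ["m", "m", "f", "f"],
    "age": ["18-29", "30+", "18-29", "30+"],
    "N": [1000, 3400, 1100, 3700],
})
cells = cell_estimates.merge(census_counts, on=["sex", "age"])

# Poststratified population estimate: population-weighted average of the cells.
overall = (cells["y_hat"] * cells["N"]).sum() / cells["N"].sum()

# The same weighting within any margin of interest, e.g., by age group.
by_age = cells.groupby("age").apply(
    lambda d: (d["y_hat"] * d["N"]).sum() / d["N"].sum()
)
print(overall)
print(by_age)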
Graphs can and should be used in each step of the modeling process, from exploring
raw data to presenting and explaining final model results; in this chapter, we describe
their use in each of these steps and illustrate with examples that arise from several pre-
viously published works, which we now briefly summarize. We encourage the reader to
refer to these publications for greater detail on the data and models behind the graphics
shown here.
Gelman et  al. (2007) use multilevel modeling to explain the apparent paradox of
poor voters favoring Democrats and rich voters favoring Republicans, while poor states
overall tend to support Republican candidates and rich states support Democratic ones.
Gelman et al. (2016) seek to understand large swings in election polls, arguing that re-
ported swings are often likely due to sampling bias rather than true changes in vote in-
tention. Ghitza and Gelman (2013) use multilevel regression and post-​stratification to
estimate election turnout and voting patterns among subsets of the population defined
by multiple demographic and geographic characteristics. Ghitza and Gelman (2014)
develop a generational model of presidential voting, finding that political events in
voters’ teenage and young adult lives are important in shaping their long-​term partisan
preferences. With response rates to traditional polls rapidly declining, Wang et al. (2015)
demonstrate the potential of a highly nonrepresentative data set of presidential vote in-
tention, collected via the Xbox gaming platform, in obtaining accurate election forecasts
via multilevel modeling and post-​stratification. Finally, Makela et  al. (2014) demon-
strate how statistical graphics can be used to better understand the survey weights that
come with many surveys that have complex sampling designs.

Exploring Raw Data

Large polls and complex public opinion surveys have a great deal of structure
and patterns that can be difficult to summarize concisely. Tables of numbers and
percentages quickly become unwieldy and unreadable, and comparisons between
groups and quantities of interest are much more difficult to make with tables than
with graphs. When we are exploring a raw data set, graphics help give a clearer un-
derstanding of its characteristics by illuminating the qualitative content, allowing us
to check assumptions (e.g., whether outcomes between particular subgroups conform
to subject matter knowledge), confirm expected results, and find distinct patterns
(Gelman and Unwin 2013).
For example, the left panel of figure 19.1, from Ghitza and Gelman (2014), plots the
relationship between age and Republican vote share in 2008 among non-​Hispanic
whites, which is complex and nonmonotonic. This plot uses only the raw data (with
lowess curves for clarity), not model estimates. While subject matter knowledge may
lead us to assume that Republican vote share is lower among younger people than older
people, this graph complicates that assumption and forces us to consider alternative
explanations.
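A minimal sketch of this kind of raw-data exploration is shown below (Python; the file name and the columns age and rep_vote are hypothetical stand-ins for a respondent-level poll file). It overlays a lowess curve on raw means by year of age, roughly in the spirit of the left panel of figure 19.1.

import matplotlib.pyplot as plt
import pandas as pd
from statsmodels.nonparametric.smoothers_lowess import lowess

# One row per respondent, with 'age' and a 0/1 Republican vote indicator.
df = pd.read_csv("poll_2008_whites.csv")  # hypothetical file

raw = df.groupby("age")["rep_vote"].mean()            # raw share by year of age
smooth = lowess(df["rep_vote"], df["age"], frac=0.3)  # lowess fit to the raw data

fig, ax = plt.subplots()
ax.scatter(raw.index, raw.values, s=10, color="gray", label="raw means by age")
ax.plot(smooth[:, 0], smooth[:, 1], color="black", label="lowess")
ax.set_xlabel("Age")
ax.set_ylabel("Republican vote share")
ax.legend()
plt.show()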
Confronted with this new pattern, the authors construct corresponding curves for
the 2000–​2012 elections (figure 19.1, center panel). Nonmonotonic patterns are apparent
in each election year, but there is no clear trend across elections, and the peaks and
valleys in different election years do not line up by age. Graphing period trends in the
left and center panels of figure 19.1 revealed an unexpected pattern, but did not help us
understand it.
Perhaps graphing generational or cohort trends—​that is, changing the x-​axis from
age to birth year—​may further illustrate the situation. These trends are graphed in the
right panel of figure 19.1, and indeed, the peaks and valleys are nearly perfectly aligned,
providing strong evidence for generational trends in presidential voting. As Ghitza
and Gelman (2014) note, “this relationship remains clear and strong over the course of
12 years, measured across multiple surveys conducted by different organizations, and
unaltered by any complicated statistical model. This appears to be no statistical artifact.”
These three simple plots clearly illustrate a striking pattern that is the foundation of their
entire paper.
Graphics can also help us understand the design and construction of polls and
surveys, particularly with the rise of nontraditional polling methods. Wang et al. (2015)
generate election forecasts using data collected through the Xbox gaming system in the
forty-​five days before the 2012 U.S. presidential election. Their panel data set consists
of over 750,000 interviews with more than 345,000 unique respondents. However, the
sample is clearly nonrepresentative and is biased most severely with respect to age and
sex; this bias is shown in figure 19.2, which compares the demographic composition of the Xbox sample with that of the 2012 electorate as estimated from national exit polls. Similarly, figure
19.3 plots daily estimates of two-​party support for Barack Obama across the forty-​five
days before the 2012 election for the Xbox data compared to averages from traditional
polls, clearly displaying how time trends in the Xbox data compare to time trends in a
representative sample.
Many polls and public opinion surveys have complex sampling schemes and come
with weights that correct for known differences between the sample and population.
Here again, graphics are useful in understanding survey weights and their relationship
to the data, as demonstrated by Makela et al. (2014). Figure 19.4 plots binned survey
weights against the design variables used to calculate the weights. Such figures can be
helpful when deciding how to incorporate sampling weights in a model—​whether they
should be included directly or indirectly through the design variables. Furthermore,
it is useful to know how survey weights are related to outcomes of interest, as shown
in figure 19.5. Here we see that the proportion of children who are overweight or have
asthma varies weakly with the survey weights, while household income varies much
more strongly, indicating that not accounting for survey weights in a model of house-
hold income could result in biased estimates. Finally, since large weights can lead to
highly variable estimators, understanding the relationship between weights and sample
[Figure 19.1: three panels ("Non-Monotonic Age Curve in 2008," "Non-Monotonicity in Other Elections," "Lining up by Birth Year") plotting Republican vote share against age or birth year for the 2000-2012 elections; see caption below.]
Figure 19.1  Raw data and LOESS curves, indicating the relationship between age and presidential voting preferences among non-​Hispanic
white voters for the 2000–​2012 elections. (L) The relationship is clearly nonmonotonic and quite peculiar in 2008; instead of a linear or even
quadratic relationship, the curve changes directions multiple times. (C) Nonmonotonicity is a feature of the other elections as well, though no
clear pattern is apparent from this graph alone. (R) The true relationship emerges when the curves are lined up by birth year instead of age. The
peaks and valleys occur in almost identical locations, strongly suggesting a generational trend. (For the interpretation of the references to color in
this figure legend, the reader is referred to the web version of this chapter.)
[Credit line: Ghitza and Gelman (2014)]
[Figure 19.2: grouped bar panels for Sex, Race, Age, Education, State, Party ID, Ideology, and 2008 Vote, comparing the composition of the Xbox sample with the 2012 exit poll; see caption below.]

Figure  19.2  Comparison of the demographic, partisan, and 2008 vote distributions in the Xbox data set and the 2012 electorate (as measured by
adjusted exit polls). As one might expect, the sex and age distributions exhibit considerable differences.
[Credit line: Wang et al. (2015)]
[Figure 19.3: daily two-party Obama support, September 24 to November 5, comparing raw Xbox estimates with the Pollster.com average; see caption below.]

Figure 19.3  Daily (unadjusted) Xbox estimates of the two-​party Obama support during the forty-​five days leading up to the 2012 presiden-
tial election, which suggest a landslide victory for Romney. The blue line indicates a consensus average of traditional polls (the daily aggregated
polling results from Pollster.com), the horizontal dashed line at 52% indicates the actual two-​party vote share obtained by Barack Obama, and
the vertical dotted lines give the dates of the three presidential debates. (For the interpretation of the references to color in this figure legend, the
reader is referred to the web version of this chapter.)
[Credit line: Wang et al. (2015)]
[Figure 19.4: four panels plotting proportions against binned survey weights (log scale), with lowess fits; see caption below.]
Figure 19.4  (a) Currently Married. (b) Education. (c) Race/Ethnicity. (d) Age. The proportion of respondents at each level of the given variable vs. binned baseline survey weights (log scale),
plotted for four discrete ranking variables in the Fragile Families study. The binned averages are
smoothed by lowess curves. Sample size is high, so a large number of bins (as indicated by the tick
marks on the x-​axes) are used. A few of the tick marks are labeled to indicate the log weights in
some of the bins; the total range of the weights is large, varying by a factor of approximately exp
(8.5) or 5000. HS = high school.
[Credit line: Makela et al. (2014)]

size is important. Figure 19.6 shows binned weights plotted against sample size to illus-
trate that although the vast majority of observations have weights with small magni-
tude, there are a small number of observations with large weights that can lead to noisy
estimates.
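The plots in figures 19.4-19.6 follow a simple recipe that the sketch below illustrates (Python; the file and the columns wt and overweight are hypothetical): bin the log survey weights, then plot a binned outcome summary and the per-bin sample size.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("survey_with_weights.csv")  # hypothetical file
df["log_wt"] = np.log(df["wt"])
df["wt_bin"] = pd.cut(df["log_wt"], bins=40)  # many bins when the sample is large

binned = df.groupby("wt_bin", observed=True).agg(
    prop=("overweight", "mean"),   # outcome summary within each weight bin
    n=("overweight", "size"),      # sample size within each weight bin
    mid=("log_wt", "mean"),        # bin midpoint for plotting
)

fig, axes = plt.subplots(1, 2, figsize=(9, 3.5))
axes[0].plot(binned["mid"], binned["prop"], "o", color="black")
axes[0].set_xlabel("Binned weights (log)")
axes[0].set_ylabel("Proportion overweight")

# As in figure 19.6: a few large-weight bins hold very few observations,
# which is what makes weighted estimators noisy.
axes[1].bar(range(len(binned)), binned["n"], color="gray")
axes[1].set_xlabel("Weight bin")
axes[1].set_ylabel("Sample size")
plt.tight_layout()
plt.show()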
[Figure 19.5: four panels plotting sample proportions (overweight, asthma, welfare receipt) and household income against binned survey weights (log scale); see caption below.]

Figure 19.5  Sample proportions of (a) children who are overweight, (b) children with asthma,
(c) families receiving welfare benefits, and (d) annual household income, all plotted vs. binned
survey weights.
[Credit line: Makela et al. (2014)]

Model Building

When working with large data sets, graphs are instrumental in iteratively building
models of increasing complexity. Figure 19.7, from Ghitza and Gelman (2013), illustrates
one way of comparing raw data to estimates from a simple model and an incrementally
more complex model.
The left panel plots raw 2008 vote share for John McCain by state and income for non-​
Hispanic whites. We can immediately see that there is much variation in McCain vote
[Figure 19.6: sample size by weight bin, for (a) all bins and (b) bins with sample size under 100; see caption below.]

Figure 19.6  Sample sizes by weight bin for baseline weights in the Fragile Families study for
(a) all weight bins, (b) weight bins with sample size less than 100.
[Credit line: Makela et al. (2014)]
[Figure 19.7: three panels of McCain vote share by income ("Raw Values," "Income coefficient consistent across states," "Income coefficient varying by state"), with Mississippi, Ohio, and Connecticut highlighted; see caption below.]
Figure 19.7  The evolution of a simple model of vote choice in the 2008 election for state/​income subgroups, non-​hispanic whites only. The first
panel shows the raw data; the middle panel is a hierarchical model in which state coefficients vary, but the (linear) income coefficient is held con-
stant across states; the right panel allows the income coefficient to vary by state. Adding complexity to the model reveals weaknesses in inferences
drawn from simpler versions of the model. Three states—​Mississippi (the poorest state), Ohio (a middle-​income state), and Connecticut (the
richest state)—​are highlighted to show important trends.
[Credit line: Ghitza and Gelman (2013)]
share across states, as we would expect. However, these raw estimates are quite noisy,
and a clear structure is difficult to discern, even with a sample size exceeding fifteen
thousand (Ghitza and Gelman 2013).
The middle panel depicts estimated McCain vote share plotted against income from
a model in which the effect of income is restricted to be the same across states. As in the
raw data, there is wide variation in the estimates of McCain vote share across states. The
right panel plots estimates from a model in which the effect of income is allowed to vary
by state. The inferences drawn from the model in the middle panel now seem simplistic
when compared to estimates from the right panel.
Increasing the complexity of the model by allowing the effect of income to vary by
state gives a more complete picture of voter behavior and adds an important new dimen-
sion to the story told by the middle panel, namely that the effect of individual income
on McCain vote share depends on state-​level income. Importantly, simply comparing
predicted probabilities or tables of model coefficients would have made this conclusion
difficult to come by, while the appropriate graphs make it nearly impossible to miss.
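The comparison in figure 19.7 can be sketched as follows (Python; the file and the columns mccain, income, and state are hypothetical). For brevity the sketch uses ordinary logistic regression with state fixed effects rather than the multilevel model of Ghitza and Gelman (2013), which would partially pool the state-specific income slopes.

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("pew_2008_whites.csv")  # hypothetical respondent-level file

# Income coefficient held constant across states (cf. middle panel of figure 19.7).
m_constant = smf.logit("mccain ~ income + C(state)", data=df).fit()

# Income coefficient allowed to vary by state (cf. right panel of figure 19.7).
m_varying = smf.logit("mccain ~ income * C(state)", data=df).fit()

# Predicted McCain support on a state-by-income grid from each model; plotting
# the two sets of curves side by side reproduces the comparison in the figure.
grid = df[["state"]].drop_duplicates().merge(
    pd.DataFrame({"income": sorted(df["income"].unique())}), how="cross"
)
grid["p_constant"] = m_constant.predict(grid)
grid["p_varying"] = m_varying.predict(grid)
print(grid.head())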
A similar story is told by the set of graphs in figures 19.8–​19.10, originally published in
Gelman et al. (2007). Figures 19.8 and 19.9 are analogous to the middle and right panels
of figure 19.7, respectively; estimates of support for George W. Bush in figure 19.8 are
from a model in which the effect of individual income is the same across states, while
those in figure 19.9 are from a model allowing the effect to vary by state. The size of the
hollow circles represents the proportion of households in each income category relative
to the national average, while the solid circles represent the average state income.

[Figure 19.8: probability of voting Republican by individual income in 2000 and 2004, constant-slope model, with Mississippi, Ohio, and Connecticut highlighted; see caption below.]

Figure 19.8  Probability of supporting Bush as a function of income category, for a rich state
(Connecticut), a middle-​income state (Ohio), and a poor state (Mississippi), from a multilevel
logistic regression model fit to Annenberg poll data from 2000 to 2004. The open circles show
the relative proportion (as compared to national averages) of households in each income cate-
gory in each of the three states, and the solid circles show the average income level and estimated
average support for Bush for each state. Within each state, richer people are more likely to vote
Republican, but the states with higher income give more support to the Democrats.
[Credit line: Gelman et al. (2007)]
[Figure 19.9: probability of voting Republican by individual income in 2000 and 2004, varying-slope model, with Mississippi, Ohio, and Connecticut highlighted; see caption below.]

Figure 19.9  Probability of supporting Bush as a function of income category, for a rich state
(Connecticut), a middle-​income state (Ohio), and a poor state (Mississippi), from a multilevel
logistic regression model with varying intercepts and slopes fit to Annenberg poll data from
2000 to 2004. The open circles show the relative proportion (as compared to national averages)
of households in each income category in each of the three states, and the solid circles show the
average income level and estimated average support for Bush for each state. Income is a very
strong predictor of vote preference in Mississippi, is a weaker predictor in Ohio, and only weakly
predicts vote choice at all in Connecticut. See figure 5 in Gelman et al. (2007) for estimated slopes
in all fifty states, and compare to figure 8 (figure 3 in Gelman et al. 2007), in which the state slopes
are constrained to be equal.
[Credit line: Gelman et al. (2007)]

The full story is shown in figure 19.10, which plots the probability of voting Republican
against individual income for the six presidential elections between 1984 and 2004. This
graph allows us to examine how the effect of individual income changes not only across
states, but across elections as well.
Graphs dividing model estimates into small multiples are also instructive in under-
standing the structure captured by a model. One good example of this is figure 19.11,
from Ghitza and Gelman (2013), which plots the 2008 two-​party McCain vote share
against income for all voters and non-​Hispanic whites by state as estimated from pooled
Pew surveys and a multilevel model. For most states, the relationship between income
and McCain vote share is similar for all voters and non-​Hispanic whites, but there are
several states—​Louisiana, South Carolina, Mississippi, and Maryland among them—​
in which the pattern for non-​Hispanic whites deviates notably from all voters, partic-
ularly for lower income quintiles. These plots emphasize the importance of accounting
for interactions among income, state, and ethnicity, not just between income and state,
when modeling McCain vote share.
Often a more complex model leads to a new story that is more consistent with the
data. Figure 19.12, from Gelman et  al. (2016), shows estimates of two-​party Obama
support over time for one model that adjusts only for demographics and another that
[Figure 19.10: probability of voting Republican by individual income for each election from 1984 to 2004, with Mississippi, Ohio, and Connecticut highlighted; see caption below.]
Figure 19.10  Results for a varying-intercept, varying-slope, multilevel logistic regression, using exit poll data from 1984 to 2004. The curves show the probability of supporting Bush as a
function of income category, within states that are poor, middle-​income, and rich.
[Credit line: Gelman et al. (2007)]

adjusts for both demographics and partisanship. Under the first model, Obama support
fluctuates sharply in the forty-​five days preceding Election Day, but adjusting for par-
tisanship in addition to demographics greatly reduces this variation. Gelman et  al.
(2016) interpret results from the latter model as “suggesting that most of the apparent
changes in support during this period were artifacts of partisan nonresponse.” In this
case, graphing estimates from the two models in the same figure makes clear that the more complex model, which adjusts for partisanship in addition to demographics, paints a qualitatively different picture of Obama support prior to the 2012 election than the simpler, demographics-only model.
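A sketch of this kind of overlay is below (Python; both input files and their columns date, est, lo, and hi are hypothetical daily poststratified estimates with interval bounds). The point is only that placing both adjusted series, with their uncertainty bands, on one set of axes makes the comparison in figure 19.12 immediate.

import pandas as pd
import matplotlib.pyplot as plt

demo = pd.read_csv("adjusted_demographics.csv", parse_dates=["date"])              # hypothetical
demo_party = pd.read_csv("adjusted_demo_partisanship.csv", parse_dates=["date"])   # hypothetical

fig, ax = plt.subplots()
for frame, color, label in [
    (demo, "lightgray", "adjusted for demographics only"),
    (demo_party, "dimgray", "adjusted for demographics and partisanship"),
]:
    ax.fill_between(frame["date"], frame["lo"], frame["hi"], color=color, alpha=0.5)
    ax.plot(frame["date"], frame["est"], color=color, label=label)

ax.axhline(0.52, linestyle="--", color="black")  # Obama's actual two-party share
ax.set_ylabel("Two-party Obama support")
ax.legend()
plt.show()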
Another example of graphs illustrating the different stories two models can tell
is figure 19.13, also from Gelman et al. (2016). Here, the authors plot changes in two-​
party Obama support before and after the first presidential debate across various
[Figure 19.11: small multiples of 2008 McCain vote share against income (poor to rich) for each state, all voters versus non-Hispanic whites; see caption below.]
Figure 19.11  All voters, shown in black, and non-​Hispanic whites, in gray. Dots are weighted
averages from pooled June–​November Pew surveys; error bars show + /​−1 s.e. bounds. Curves are
estimated using multilevel models and have a s.e. of about 3% at each point. States are ordered in
decreasing order of McCain vote (Alaska, Hawaii, and Washington, DC, excluded).
[Credit line: Ghitza and Gelman (2013)]

subpopulations for the demographics-only and demographics plus partisanship models described above. The conclusions about the effects of the debate on support for Mitt
Romney on these subpopulations differ between the two models.

Understanding the Results

Interpreting coefficients from even relatively simple models can be difficult. Adding
interactions, nonlinear terms, and hierarchical structure to the model makes such
[Figure 19.12: daily estimates of two-party Obama support, September 24 to November 5, under the two post-stratification models; see caption below.]

Figure  19.12  Obama share of the two-​party vote preference (with 95% confidence bands),
estimated from the Xbox panel under two different post-​stratification models:  the dark line
shows results after adjusting for both demographics and partisanship, and the light line adjusts
only for demographics. The surveys adjusted for partisanship show less than half the variation
of the surveys adjusted for demographics alone, suggesting that most of the apparent changes in
support during this period were artifacts of partisan nonresponse.
[Credit line: Gelman et al. (2016)]

interpretations even more challenging. Furthermore, in multilevel models, coefficients are modeled in batches, and we may be interested in the extent of partial pooling in
the coefficient estimates, which is difficult to communicate via tables. Graphs can make
regression results from even highly complex models easier to understand, summarize,
and interpret.
One example of using graphs to understand model results comes from Ghitza and
Gelman (2013). In describing models of election turnout and voting patterns, the
authors note that “we knew a priori that our estimates for Obama’s vote share among
African American groups needed to be high, over 90%, but we could not know what re-
gression coefficient was plausible, as the coefficient could change drastically depending
on functional form.” In contrast, graphing the actual estimated Obama support for var-
ious demographic subgroups would immediately reveal whether the model captures
this known aspect of the data and how the estimates behave as these subgroups are made
finer and finer.
Figure 19.14, from Ghitza and Gelman (2013), confirms that African Americans’
predicted two-​party McCain vote share (darkest gray circles) is low. In addition, we see
that adding more demographics reveals the heterogeneity within subgroups, but the
overall estimates remain relatively stable. Figure 19.14 also exemplifies how graphs can
[Figure 19.13: dot plot of changes in two-party Obama support around the first debate, by sex, race, age, education, state type, party ID, ideology, and 2008 vote, under the two post-stratification models; see caption below.]

Figure 19.13  Estimated swings in two-​party Obama support between the day before and four
days after the first presidential debate under two different post-​stratification models, separated by
subpopulation. The vertical lines represent the overall average movement under each model. The
horizontal lines correspond to 95% confidence intervals.
[Credit line: Gelman et al. (2016)]
[Figure 19.14: three panels (State × Ethnicity; State × Ethnicity × Income; State × Ethnicity × Income × Age) plotting 2008 turnout against McCain vote for population subgroups; see caption below.]

Figure  19.14  Turnout and vote choice for population subgroups, presidential election 2008. Size = Subgroup population size 2007; Color by eth-
nicity: White = White, Black = Black, Red = Hispanic, Green = Other. Each bubble represents one demographic subgroup per state, with size and color
indicating population size and ethnicity. As additional demographics are added, heterogeneity within subgroups is revealed by the dispersion of the
bubbles, while estimates remain reasonable. (For the interpretation of the references to color in this figure legend, the reader is referred to the web
version of this chapter.)
[Credit line: Ghitza and Gelman (2013)]
encode additional information in the color and size of plotting symbols (see the web
version of this chapter for color figures).
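A minimal version of this kind of encoding is sketched below (Python; the file and the columns turnout, mccain, pop, and ethnicity are hypothetical subgroup-level estimates): position carries the two estimates, dot area carries subgroup size, and color carries ethnicity.

import pandas as pd
import matplotlib.pyplot as plt

groups = pd.read_csv("state_ethnicity_cells.csv")  # hypothetical subgroup-level file

colors = {"white": "lightgray", "black": "black", "hispanic": "red", "other": "green"}

fig, ax = plt.subplots()
ax.scatter(
    groups["mccain"], groups["turnout"],
    s=groups["pop"] / groups["pop"].max() * 400,   # bubble area proportional to size
    c=groups["ethnicity"].map(colors),             # color keyed to ethnicity
    alpha=0.6, edgecolors="none",
)
ax.set_xlabel("McCain vote 2008")
ax.set_ylabel("Turnout 2008")
plt.show()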
Similarly, figure 19.15 uses color and a grid of maps by age and income to display the
heterogeneity in vote swing from 2004 to 2008 among non-​Hispanic whites (see the web
version of this chapter for color figures). While whites overall shifted toward Obama
by 3.3%, poorer and older white voters in the South and Appalachia actually supported
McCain in 2008 more than they did Bush in 2004 (Ghitza and Gelman 2013). This het-
erogeneity would be nearly impossible to determine from regression coefficients alone,
and the use of color and repeated multiple graphs makes the variation by age, income,
and geography immediately clear to the reader.
Regression coefficients from a complex model are summarized particularly clearly
in figures 19.16 and 19.17, from Ghitza and Gelman (2014). Full details of the model
are given on pages 6–​7 of that paper, but briefly, the model predicts the proportion of

18–29 30–44 45–64 65+


+25%
$0–20k
$20–40k

0%
$40–75k
$75–150k
$150k+

–25%

Figure 19.15  State-​by-​state shift toward McCain (red) or Obama (blue) among white voters,
broken down by income and age. Red = McCain better than Bush; Blue = McCain worse than
Bush. Only groups with > 1% of state voters shown. Although almost every state moved toward
Obama in aggregate, there are substantial demographic groups that moved toward McCain all
over the map, specifically among older whites. (For the interpretation of the references to color in
this figure legend, the reader is referred to the web version of this chapter.)
[Credit line: Ghitza and Gelman (2013)]
[Figure 19.16: two panels, "The Formative Years" (age-specific weights with 50% and 95% intervals) and "Formative Years Not As Important for Minorities" (posterior densities of the race/region interaction); see caption below.]

Figure 19.16  Estimates for the generational aspects of the model. (L) The rough age range of
fourteen to twenty-​four is found to be of paramount importance in the formation of long-​term
presidential voting preferences. Political events at a very young age have very little impact, and
after the age of twenty-​four, the age weights decrease, staying at a small steady magnitude from
about the age of forty-​five onward. (R) These age weights, and the political socialization process
implied by them, are substantially more important for non-​Hispanic whites than for minorities
as whole.
[Credit line: Ghitza and Gelman (2014)]

Republican presidential support by the birth year cohort, election year, and race/​region
group (non-​southern white, southern white, and minority) to which a given survey re-
spondent belongs. Specifically, Republican vote share is modeled as the sum of a gener-
ational effect—​the importance of age in forming long-​term presidential voting patterns
and how this importance varies by race/​region—​and a period effect that captures
election-​to-​election changes by race/​region and the importance of these changes for
different age groups.
Figure 19.16 summarizes the generational effects, which consist of an age-​specific
weight for ages one to seventy and an interaction term that allows the importance of
these weights to vary by race/​region group for each birth year and election year. While
the actual numerical values of the age weights are difficult to understand, we can imme-
diately see in the left panel that events occurring roughly between the ages of fourteen
and twenty-​four have the largest impact on future vote preference. The interaction terms
are summarized in the right-​hand panel of figure 19.16, which displays their posterior
distribution for each race/​region group. Interaction terms are often difficult to interpret
directly, and this graph allows us to ignore their exact numerical values and focus on
understanding their substantive meaning, while also clearly displaying the uncertainty
in their posterior estimates. The age weights are more important for whites, with the
[Figure 19.17: two panels, "Period Effects by Race/Region" (election-year coefficients, 1960-2010) and "Are Period Effects Stronger During Formative Years?" (posterior densities of the age-18 to age-70 effect size ratio); see caption below.]

Figure 19.17  Estimates for the election-​to-​election period effects in the model. (L) Minorities
are consistently more likely to vote for Democratic presidents, and southern whites have steadily
trended pro-​Republican over the past fifty years. (R) Period effects are roughly similar between
young and old voters among minorities and in the South; evidence is inconclusive for non-​
southern whites.
[Credit line: Ghitza and Gelman (2014)]

means of the estimates (denoted by the vertical lines) more than twice as high for whites
as for nonwhites. Ghitza and Gelman (2014) point out that these interaction terms were
not restricted a priori to be positive by the model, but as we can see from the graph, each
distribution is centered well away from zero; this is another feature of the estimates that
would be difficult to discern without a graphical summary.
The period effects are displayed in figure 19.17. These effects consist of an election-​
specific term that captures the effect of that election year for the three race/​region
groups, as well as an interaction term that allows this effect to be potentially stronger
in some age groups than in others. Recall that Republican vote share is modeled as the
sum of a generational and period effect, so a negative value for the election year effect
indicates lower Republican vote share. Thus, the election year effects plotted in the left-​
hand panel of figure 19.17 show that nonwhites have been consistently more likely to vote
for Democratic candidates over the past fifty years, while southern whites have tended
to vote more Republican, particularly in the four most recent elections. These results
are consistent with subject matter knowledge, so we can be confident that the model is
capturing expected patterns in the data.
The parameters governing the relative importance of the election year effects for
different age groups are more difficult to summarize. One way to understand them is to
calculate the ratio of the election year effect at ages eighteen and seventy, respectively,
the (approximate) peak and trough of the age-​weight curve in figure 19.16. The right-​
hand panel of figure 19.17 plots the distribution of this ratio for the race/​region groups.
For southern whites and minorities, there does not seem to be much of a differential age
effect, and while there is possibly a larger effect for young ages among non-​southern
whites, the spread of the distribution is too wide to be conclusive. Again, these regres-
sion coefficients would have been difficult to interpret from a table, but we can easily
understand them by graphing a clever transformation of the estimates that summarizes
a relevant feature of the model. Furthermore, as in the right-​hand panel of figure 19.16,
plotting the entire posterior distribution instead of a point estimate makes it easier to
understand the extent of uncertainty in the estimated coefficients.
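The general recipe is simple, as the sketch below shows (Python; the posterior draws here are simulated placeholders, not draws from the model in Ghitza and Gelman 2014): transform the draws into the quantity of interest, the age-18 to age-70 effect ratio, and plot its density.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Placeholder posterior draws of the election-year effect at ages 18 and 70.
effect_age18 = rng.normal(1.6, 0.25, size=4000)
effect_age70 = rng.normal(1.2, 0.20, size=4000)

ratio = effect_age18 / effect_age70  # quantity of interest, computed draw by draw

grid = np.linspace(ratio.min(), ratio.max(), 200)
density = gaussian_kde(ratio)(grid)

fig, ax = plt.subplots()
ax.plot(grid, density, color="black")
ax.axvline(1.0, linestyle=":", color="gray")  # ratio of 1: no differential age effect
ax.set_xlabel("Effect size ratio (age 18 / age 70)")
ax.set_ylabel("Posterior density")
plt.show()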

Model Checking

Model checking is the process of understanding how well and to what extent the model
fits the data and where it could be improved. We first consider simple comparisons of
model predictions to known outcomes or gold standard data, as in figure 19.18, from
Wang et  al. (2015), which used a nonrepresentative data set collected via the Xbox
gaming platform to generate election forecasts for the 2012 presidential election and
applied multilevel regression and post-​stratification to adjust the Xbox estimates. The
2012 exit polls are used as the benchmark or gold standard for evaluating the accuracy of
the model-​based forecasts.
Figures 19.18 and 19.19 show the discrepancies between two-​party Obama vote share
for various demographic subgroups obtained from the Xbox estimates and from exit
polls. For simple one-​dimensional demographic groups such as sex and age, model
estimates and benchmark values can be directly plotted on the same graph, as in figure
19.18. However, as we further subdivide the population by considering two-​dimensional
demographic groups such as female moderates, white liberals, and so forth, directly
plotting the two sets of estimates would render the plot difficult to read. Instead, plotting
the differences and ordering them by magnitude allows us to easily see which subgroups’
voting behavior is best captured by the model, as shown in the left panel of figure 19.19.
Here the authors have selected the 30 largest two-​dimensional demographic subgroups
for visual clarity. We can see the same comparison for all 149 two-​dimensional dem-
ographic subgroups in the right panel of figure 19.19. Encoding the relative size of the
subgroup in the size of the dot allows an additional layer of information to be easily
incorporated into the graph, making it clear to the reader that, as would be expected,
the Xbox estimates are poorest for the smallest demographic subgroups and best for the
largest ones.
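Both displays in figure 19.19 can be sketched as follows (Python; the file and the columns name, model_est, exit_poll_est, and share are hypothetical subgroup-level summaries, with the estimates expressed as proportions).

import pandas as pd
import matplotlib.pyplot as plt

sub = pd.read_csv("subgroup_estimates.csv")  # hypothetical file
sub["diff"] = (sub["model_est"] - sub["exit_poll_est"]) * 100  # percentage points
sub = sub.sort_values("diff")

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Ordered dot plot of discrepancies (cf. left panel of figure 19.19).
axes[0].scatter(sub["diff"], range(len(sub)), color="black", s=15)
axes[0].axvline(0, color="gray", linewidth=0.8)
axes[0].set_yticks(range(len(sub)))
axes[0].set_yticklabels(sub["name"], fontsize=7)
axes[0].set_xlabel("Model minus exit poll (percentage points)")

# Model vs. benchmark, dot area proportional to subgroup size (cf. right panel).
axes[1].scatter(sub["exit_poll_est"], sub["model_est"],
                s=sub["share"] * 2000, alpha=0.5, color="gray")
axes[1].plot([0, 1], [0, 1], color="black", linewidth=0.8)  # 45-degree reference line
axes[1].set_xlabel("Exit poll estimate")
axes[1].set_ylabel("Model estimate")
plt.tight_layout()
plt.show()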
Another way to check the fit of the model is to consider the posterior predictive distri-
bution for a quantity of interest. In cases where benchmark data are unavailable, we can
draw samples from this distribution and calculate a test statistic to compare to the ac-
tual data. In the case of Wang et al. (2015), in which benchmark data are available, figure
19.20 plots the predicted distribution of electoral votes for Obama. The dashed and light
gray vertical lines represent, respectively, the actual number of electoral votes Obama
[Figure 19.18: panels for Sex, Race, Age, Education, Party ID, Ideology, and 2008 Vote, comparing Xbox estimates of two-party Obama vote share with the 2012 exit poll; see caption below.]

Figure 19.18  Comparison of the two-​party Obama vote share for various demographic subgroups, as estimated from the 2012 national exit poll and
from the Xbox data on the day before the election.
[Credit line: Wang et al. (2015)]
[Figure 19.19: (a) ordered dot plot of differences between Xbox and exit poll estimates for the thirty largest two-dimensional subgroups; (b) Xbox estimates against actual two-party Obama vote share for all subgroups; see caption below.]

Figure 19.19  Left panel: Differences between the Xbox MRP-​adjusted estimates and the exit
poll estimates for the thirty largest two-​dimensional demographic subgroups, ordered by the
differences. Positive values indicate that the Xbox estimate is larger than the corresponding exit
poll estimate. Among these thirty subgroups, the median and mean absolute differences are 1.9
and 2.2  percentage points, respectively. Right panel:  Two-​party Obama support, as estimated
from the 2012 national exit poll and from the Xbox data on the day before the election, for various
two-​way interaction demographic subgroups (e.g., sixty-​five-​plus-​year-​old women). The sizes of
the dots are proportional to the population sizes of the corresponding subgroups.
[Credit line: Wang et al. (2015)]
[Figure 19.20: histogram of the projected distribution of electoral votes for Obama; see caption below.]

Figure 19.20  Projected distribution of electoral votes for Obama one day before the election.
The light vertical line represents 269, the minimum number of electoral votes that Obama needed
for a tie. The vertical dashed line indicates 332, the actual number of electoral votes captured by
Obama. The estimated likelihood of Obama winning the electoral vote is 88%.
[Credit line: Wang et al. (2015)]

captured (332) and the minimum number needed to tie (269). As most of the mass of
this distribution is to the right of the minimum number needed to tie, we can see that
the model estimates a high probability of an Obama victory (the estimated likelihood is
in fact 88%). However, we also see that the distribution is quite variable, and the authors
note that “extreme outcomes seem to have unrealistically high likelihoods of occurring.”
Graphs like figure 19.20 are useful in revealing such possibly unexpected aspects of the
model and prompting further investigation into which features of the data are not fully
captured or are misrepresented by the model, leading to another iteration in the cycle of
data exploration and model building.
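A simplified stand-in for this kind of predictive display is sketched below (Python; the file and the columns p_obama and ev are hypothetical state-level win probabilities and electoral votes). The posterior predictive draws in Wang et al. (2015) come from their fitted model, which induces correlation across states; simulating states independently, as here, only illustrates the general recipe of turning simulations into a histogram.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

states = pd.read_csv("state_forecasts.csv")  # hypothetical: p_obama, ev per state
rng = np.random.default_rng(1)
n_sims = 10_000

# Simulate each state's outcome and add up Obama's electoral votes per simulation.
wins = rng.random((n_sims, len(states))) < states["p_obama"].to_numpy()
ev_totals = wins.astype(int) @ states["ev"].to_numpy()

fig, ax = plt.subplots()
ax.hist(ev_totals, bins=range(150, 451, 5), density=True, color="gray")
ax.axvline(269, color="lightgray")              # minimum needed to tie
ax.axvline(332, color="black", linestyle="--")  # actual Obama total
ax.set_xlabel("Electoral votes for Obama")
ax.set_ylabel("Proportion of simulations")
print("Simulated P(Obama wins):", (ev_totals >= 270).mean())
plt.show()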
In addition to understanding the implications and meanings of regression
coefficients, we also want to know how well the model fits the data overall. Figure 19.21,
from Ghitza and Gelman (2014), plots R2, an elementary measure of the percent of var-
iance in the outcome explained by a model, for their full model of vote choice, as well as
a simpler model that includes only period/​group effects. The importance of this graph
is that it displays R2 not only for the data as a whole, but also for the three race/​region
groups separately. Comparing the two models on the basis of the data as a whole may
lead us to conclude that the simpler model is preferable, but the breakdown by race/​
region reveals that the advantage of the more complicated model is its superior perfor-
mance in explaining variance in vote choice among non-​southern whites.
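The computation behind figure 19.21 is straightforward, as this sketch shows (Python; the file and the columns obs, fit_full, fit_simple, and group are hypothetical macro-level observed and fitted vote shares): compute R-squared on the full data and then again within each race/region group.

import pandas as pd

def r_squared(obs, fitted):
    """Share of the variance in obs explained by the fitted values."""
    resid = ((obs - fitted) ** 2).sum()
    total = ((obs - obs.mean()) ** 2).sum()
    return 1 - resid / total

df = pd.read_csv("macro_vote_fits.csv")  # hypothetical file

rows = []
for label, d in [("Overall", df)] + list(df.groupby("group")):
    rows.append({
        "group": label,
        "R2, full model": r_squared(d["obs"], d["fit_full"]),
        "R2, period/group model": r_squared(d["obs"], d["fit_simple"]),
    })
print(pd.DataFrame(rows))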

Presenting Results

Finally, graphs are essential in presenting and explaining the results of a poll or statis-
tical model. (Note: we recommend readers refer to the online version of this section, as
[Figure 19.21: "How Well Does the Model Explain Macro-Level Vote Choice?": percent of variance explained by the period/group effects model and the full model, overall and for non-southern whites, southern whites, and minorities; see caption below.]
Figure 19.21  Percent of variation explained by the model for all voters and various race/​region
groups. The model accounts for 92% of the macro-​level variance in voting trends over the past
half century. That said, much simpler models, incorporating only period/​group effects, would
also account for much of the variance. The real substantive power of the model is how it improves
model fit within race/​region groups, particularly among non-​southern whites.
[Credit line: Ghitza and Gelman (2014)]

interpretation of the figures described here relies heavily on color.) A prime example is
given in figure 19.22, from Ghitza and Gelman (2014). The top panel shows the Gallup
Presidential Approval series, the main covariate used to model presidential vote choice.
The series is color-​coded to highlight pro-​Republican (red; approval above 50% for
Republican presidents) and pro-​Democratic (blue; approval above 50% for Democratic
presidents) years, with line thickness proportional to the age weights corresponding to
white members of the cohort born in 1941. The darkness of the color reflects the mag-
nitude of the approval, with approval levels closer to 50% shown in shades of gray. The
bottom panel plots the cumulative generational effects—​that is, the overall voting
tendencies of the cohort at each age—​excluding period effects so as to display general
trends independent of the effects of any particular election.
The top and bottom panels work in concert to tell the story of presidential voting
for this cohort. Despite high approval ratings for President Franklin Roosevelt and
during the first half of the Harry Truman presidency, the members of this cohort were
too young to be significantly affected by the popularity of these Democratic leaders.
This lack of effect can be seen in the low age weights in those years (the thickness of
the approval series) and the nearly zero values of the cumulative generation effect in
the bottom panel. The most important years in terms of political socialization for this
[Figure 19.22 (Birth Year = 1941): Gallup presidential approval series, Roosevelt through Obama, with line thickness keyed to the cohort's age weights, above the cohort's cumulative generation effect by age; see caption below.]

Figure  19.22 Presidential Approval time series, and the cumulative generational effects
of that series, for Eisenhower Republicans, born in 1941. The series is drawn to emphasize this
generation’s peak years of socialization, according to the age weights found by the model. Dark
blue indicates strongly pro-​Democratic years, dark red for pro-​Republican, and shades of gray in
between. This generation missed most of the FDR years and were socialized through ten straight
pro-​Republican years, spanning the end of the Truman presidency and eight years of the popular
Republican president Eisenhower. Their partisan voting tendencies were somewhat stabilized
back toward the neutral gray line by the pro-​Democratic 1960s, and they reached a rough equilib-
rium by the end of the Nixon presidency. (For the interpretation of the references to color in this
figure legend, the reader is referred to the web version of this chapter.)
[Credit line: Ghitza and Gelman (2014)]

cohort occurred during the presidency of Dwight Eisenhower, where the age weights
for this cohort are at their largest. Eisenhower was a popular Republican president, re-
flected in the dark red of the approval series, and the 1941 birth cohort became steadily
more pro-​Republican over the course of his presidency.
The effects of subsequent presidents are described in more detail in Ghitza and
Gelman (2014), and we pause here to summarize the many pieces of information incor-
porated in this graph. First, it displays the presidential approval series, with color to dis-
tinguish between pro-​Republican and pro-​Democratic years within a presidency; the
measure of pro-Republican approval (equivalently, Democratic disapproval) is the main covariate used to model vote choice. Second, the graph incorporates the age weights,
a substantively important aspect of the model, in the presidential approval series by
making the width of the series proportional to these weights. Third, the bottom panel
displays an easily interpretable summary of the model results in terms of generational
effects. Finally, the juxtaposition of the two panels so that presidential administrations
align with the age of the cohort neatly ties together the relationship between presiden-
tial approval and generational voting trends captured by the model. In short, this graph
is useful because it shows the correspondence between the key covariates (presidential
approval and age) and the outcome in a single figure and enhances the narrative that
qualitatively ties the model together.
Figure 19.23, from Ghitza and Gelman (2014), plots cumulative generational trends
for all white voters born between 1855 and 1994. The trends for each generation are

[Figure 19.23: "The Changing White Electorate as a Function of Presidential Approval": cumulative Republican voting preferences of each white generation, 1940-2010, together with the population-weighted average of all white voters; see caption below.]

Figure  19.23  Cumulative preferences of each generation, shown along with the weighted
summation of the full white electorate. The generations are now more loosely defined, to allow
the entire electorate to be plotted at once, with the width of each curve indicating the proportion
of the white electorate that each generation reflects at any given time. The model—​in this graph
reflecting only the Approval time series and the age weights—​can explain quite a bit about the
voting tendencies of the white electorate over time.
[Credit line: Ghitza and Gelman (2014)]
shown in a solid line, with surrounding colored bands whose width is proportional to
each generation’s contribution to the total electorate in a given year. This plot allows us
to easily visualize and understand the behavior of each generation over time and is an
invaluable complement to the narrative given in the text of Ghitza and Gelman (2014).

Discussion

We have described the use of graphics in each step of the modeling process, from
exploring raw data to presenting final results. Graphical displays of data and inferences
help us take advantage of all the information available in a poll or data set, often
conducted at considerable expense. We seek to communicate more information more
directly, to general audiences, to specialists, and to ourselves.
We conclude with some best practices for creating graphs.
Before starting, consider two questions that will guide the rest of the graph-​making
process: Who is your audience, and what are your goals? The same data may be graphed
in different ways depending on whether the audience is, for example, policymakers to
whom you want to communicate a single clear point relevant to a policy decision, or
researchers with whom you want to stimulate a discussion about an academic question.
The graphs may or may not look different depending on who the audience is, but the
place to start is understanding with whom you are communicating and about what.
Next, remember that all graphs are comparisons. What is the comparison that
you want your audience to make when they look at your graph? The answer to
this question will help determine the most high-​level aspects of the graph, such
as whether you make a scatterplot or line graph or dot plot, but also finer details
like axis limits, color scales, and plotting symbols. The graph should display the
comparisons that are important and relevant to the story you are telling, not the ones
that are easiest to make. As an example, consider figure 19.2. Here the most impor-
tant comparisons are within each panel between Xbox and exit poll distributions,
which determines the overall structure of the graph: separate panels for each var-
iable, levels of variables on the x-​axis, and percentages on the y-​axis. However,
comparisons across panels are also interesting, so the y-​axes of the panels are on
the same scale. The overall structure of the graph facilitates the main within-​panel
comparisons, but also allows for cross-​panel comparisons to be made with minimal
cognitive effort.
Finally, we have some small suggestions for making your graphs cleaner and thus,
we hope, more readable, allowing your audience to focus on the data rather than being
distracted by their presentation. First, use axis labels judiciously and sparingly: enough
to give a clear idea of scale, but not so many that they distract from the overall graph.
Second, make use of every available dimension. For example, if you are plotting a cat-
egorical variable on one axis, order the categories by a relevant quantity rather than
alphabetically. Third, don’t expect to fit everything on one graph. Sometimes several
graphs, each clearly showing a specific comparison of interest, can convey a message
better than one graph that tries to do too much.
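The second suggestion costs only a line of code, as in this hypothetical sketch (Python; the file and the columns state and obama_share are illustrative): sort the categories by the plotted quantity before drawing the dot plot.

import pandas as pd
import matplotlib.pyplot as plt

support = pd.read_csv("state_support.csv")  # hypothetical file

# Order states by estimated support rather than alphabetically, so the main
# comparison (which states are high and which are low) is immediate.
support = support.sort_values("obama_share")

fig, ax = plt.subplots(figsize=(5, 8))
ax.scatter(support["obama_share"], range(len(support)), color="black", s=12)
ax.set_yticks(range(len(support)))
ax.set_yticklabels(support["state"], fontsize=7)
ax.set_xlabel("Estimated Obama two-party share")
plt.show()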

Acknowledgments
We thank the National Science Foundation for partial support of this work.

References
Gelman, A., S. Goel, D. Rivers, and D. Rothschild. 2016. “The Mythical Swing Voter.” Quarterly
Journal of Political Science 11 (1): 103–​130.
Gelman, A., and A. Unwin. 2013. “Infovis and Statistical Graphics: Different Goals, Different Looks.” Journal of Computational and Graphical Statistics 22 (1): 2–28.
Gelman, A. E., B. Shor, J. Bafumi, and D. K. Park. 2007. “Rich State, Poor State, Red State,
Blue State: What’s the Matter with Connecticut?” Quarterly Journal of Political Science 2
(4): 345–​367.
Ghitza, Y., and A. Gelman. 2013. “Deep Interactions with MRP: Election Turnout and Voting
Patterns among Small Electoral Subgroups.” American Journal of Political Science 57
(3): 762–​776.
Ghitza, Y., and A. Gelman. 2014. “The Great Society, Reagan’s Revolution, and Generations of
Presidential Voting.” Unpublished manuscript.
Makela, S., Y. Si, and A. Gelman. 2014. “Statistical Graphics for Survey Weights.” Revista
Colombiana de Estadística 37 (2): 285–295.
Wang, W., D. Rothschild, S. Goel, and A. Gelman. 2015. “Forecasting Elections with Non-​
representative Polls.” International Journal of Forecasting 31 (3): 980–​991.

Chapter 20

Graphical Displays for Public Opinion Research

Saundra K. Schneider and William G. Jacoby

Introduction

A graphical display can be an excellent tool for presenting quantitative information in a
succinct and easily comprehensible form. However, as Kastellec and Leoni (2007) point
out, graphs are used very infrequently in the political science research literature. On the
one hand, this is a very typical situation; Cleveland (1984b) pointed out some years ago
that graphical displays are far less common than tabular presentations of data and ana-
lytic results in most scientific fields. On the other hand, the fact that the same situation
exists in political science is somewhat ironic, since the individual who has done more
than anyone else to popularize the use of graphical displays as a strategy for representing
quantitative information—​Edward R. Tufte—​began his career as a political scientist.
There is some reason to expect that political scientists’ use of graphs will increase in
the near future. All modern statistical software packages and computing environments
contain routines for producing sophisticated graphical displays with relatively little
effort. At the same time, there is a growing literature in this general field covering such
topics as general theories of statistical graphs (Wilkinson 2005; Young, Friendly, and
Valero-​Mora 2006), the use of particular software systems (e.g., Murrell 2006; Mitchell
2008; Sarkar 2008; Wickham 2009), strategies for employing graphs with large and
complex data sets (Unwin, Theus, and Hofmann 2006; Cook and Swayne 2007), and
narratives detailing the ways that particular graphs either contributed to or hindered
scientific progress in a number of substantive fields (Wainer 2000, 2005, 2009).
Furthermore, Tufte’s works (1997, 2001, 2006) and the previously mentioned article by
Kastellec and Leoni (2007) provide strong advocacy for the incorporation of graphs as
an integral component of empirical investigations.
If graphs are to be a useful tool for research, then it is critical that the displays be put
together in ways that convey their information in an effective manner. But many of the
graphs that have appeared in the political science literature do not optimize the presentation
of their material. This is problematic because a poorly constructed graph can
hinder information retrieval within its audience; when that occurs, graphs are certainly
no better, and maybe even worse, than a tabular display of the same information.
The purpose of this chapter is to take some modest steps toward promoting the
effective use of graphical displays in political science journal articles and research
monographs. We provide specific advice and guidelines about

• determining when a graph would be useful for communicating quantitative
information,
• features to consider in selecting a graph for displaying data or analytic results, and
• features and details associated with specific types of graphs that help to maximize
the information they convey to their audience.

Our overall objective is to encourage political scientists to use graphs in an effective
manner, making them useful tools for conveying information about the data and
analyses that comprise the central components of our empirical research efforts.

Why Use Graphs?

Quantitative information can be presented numerically, in a tabular display.
Alternatively, it can be presented in pictorial form, as a graphical display. Both of these
display strategies can be used to convey the same information. However, graphs often
have several advantages over tabular displays of numeric information, especially when
the immediate objective is to understand any systematic structure that exists across the
units of analysis. First, well-​constructed graphical displays downplay the details of a
data set (i.e., the specific values associated with particular observations) and focus our
attention instead on its interesting features, such as distributional shape, central ten-
dency, dispersion, and unusual observations (if any are present). Second, graphs ef-
fectively bypass some of the implicit (but important) assumptions that underlie the
interpretation of sample statistics by showing all of the data, rather than just providing
numerical summaries. And third, graphs encourage interaction between the researcher
and the data, because they highlight interesting and unusual features that lead to closer
inspection and (often) new insights.
Given such advantages, it is reasonable to ask whether graphs are always better than
tables for presenting quantitative information. The answer to that question is “no.” There
definitely are some situations in which tabular displays are more effective than graph-
ical displays. For example, consider Table 20.1, which shows the percentage of the pop-
ular vote received by the candidates in the 2000 U.S. presidential election, along with
the raw numbers of voters for each one. Figure 20.1 shows the same information in
graphical form, as a bar chart. In this case, the table provides all of the information in

Table 20.1 Candidate Vote Percentages in the 2000 Presidential Election

Presidential Candidate    Percentage of Popular Vote    Popular Vote (in Millions)
Bush                               47.87                         50.46
Gore                               48.39                         51.00
Nader                               2.74                          2.88
Other                               1.01                          1.06
Total                             100.00                        105.40

Source: U.S. Federal Election Commission. 2001. www.fec.gov/pubrec/2000presgeresults.htm.

[Figure: bar chart plotting presidential candidate (Bush, Gore, Nader, Other) against percentage of the 2000 popular vote.]
Figure 20.1  Candidate Vote Percentages in the 2000 Presidential Election.
Source: U.S. Federal Election Commission. 2001. www.fec.gov/pubrec/2000presgeresults.htm.

a form that is easily comprehensible and amenable to drawing meaningful, substantive
conclusions. The table shows that Gore received 540,000 (or about 0.52%) more votes
than Bush, and Nader received just under three million votes (or about 2.74% of the
total), which is about one-​sixteenth of the votes that were cast for either of the two major
party candidates.
The immediate access to the numbers in Table 20.1 enables very precise statements
and comparisons. That is not the case with the graphical evidence in Figure 20.1. Here
we can see easily that Bush and Gore received far more votes than Nader or anyone else,
but it takes quite a bit of effort to be more precise than that. Perhaps even more trou-
bling, a casual glance at the bar chart might miss what is probably the most important
element of this information: that Gore won the popular vote. The lengths of the bars
representing votes for Bush and Gore are almost identical; a close look is required to
confirm that the bar for Gore is, in fact, longer than that for Bush. Of course very careful
inspection of the figure would avoid problematic conclusions. But the fact remains that
it probably takes more effort than simply reading the numbers in the table.
A different situation exists with respect to the data in Table 20.2. This table shows the
percentage of the electorate within each of the American states who identified them-
selves as Democrats in 2007. The data values for the states are estimated from national-​
level data by aggregating across a number of public opinion surveys and applying
multilevel regression with post-​stratification (Enns and Koch 2013a). The same infor-
mation is presented graphically by the histogram in Figure 20.2. This data set is not par-
ticularly large, in absolute terms—​it contains only fifty observations. Nevertheless, it
is impossible to gain much insight regarding the structure of these data by looking at
the numeric values alone. The states are listed in alphabetical order, which arrays the
observations in a manner that is probably irrelevant for the quantitative information.
And even if the observations were ordered according to some substantively relevant cri-
terion (e.g., from smallest to largest), the sheer number of data values makes it difficult
(and probably impossible) for an observer to process the information contained in the
table in any meaningful way.
In contrast, Figure 20.2 immediately reveals several interesting features of this data
set. First, it shows that the distribution of Democratic identifiers within the states is uni-
modal and nearly symmetrical. The highest bar in the histogram covers the interval on
the horizontal axis from about 29% to about 32%. So on average, just under one-​third of
a state’s electorate identifies as Democrats. The bars of the histogram range from 23% to
41% (approximately), but more than half of the states fall into the interval from about
28% to 35%. There are no “gaps” between bars in the histogram, suggesting that there are
no outliers in the data. Thus, a quick look at the histogram provides information about
the data’s shape, center, spread, and absence of unusual observations—​in other words, a
relatively complete description of the variable’s distribution. Here, the graphical display
is probably much more informative than the table of data values.
Table 20.3 shows still another situation, using some data from the 1992 CPS National
Election Study. Here we have a cross-​tabulation showing the percentages of survey
respondents within each of eight age groups that identified themselves as Democrats,

Table 20.2 Percentage of Democratic Identifiers in State Electorates, 2007

States            Percent Democratic
Alabama 33.76
Alaska 27.99
Arizona 29.52
Arkansas 31.27
California 34.95
Colorado 29.67
Connecticut 33.72
Delaware 36.70
Florida 33.52
Georgia 36.77
Hawaii 39.43
Idaho 25.21
Illinois 35.56
Indiana 29.94
Iowa 29.63
Kansas 27.91
Kentucky 29.06
Louisiana 35.62
Maine 29.88
Maryland 40.47
Massachusetts 33.18
Michigan 33.20
Minnesota 29.79
Mississippi 38.34
Missouri 31.54
Montana 27.65
Nebraska 26.64
Nevada 32.46
New Hampshire 28.84
New Jersey 35.09
New Mexico 32.30
New York 37.18
North Carolina 34.56
North Dakota 27.03
Ohio 31.69
Oklahoma 28.50
Oregon 30.22
Pennsylvania 32.20
Rhode Island 33.48
South Carolina 35.39
South Dakota 27.48
Tennessee 31.82
Texas 30.59
Utah 24.41
Vermont 31.92
Virginia 34.84
Washington 31.16
West Virginia 28.44
Wisconsin 30.22
Wyoming 24.41

Source: Enns and Koch (2013b).

[Figure: histogram of the percent of Democratic identifiers in 2007 state electorates, with percent of total on the vertical axis.]
Figure 20.2  Percentage of Democratic Identifiers in State Electorates, 2007.
Source: Enns and Koch (2013b).

Table 20.3 Party Identification by Age Groups within the American Electorate, 1992

                       Party Identification
Age Group    Democrats    Independents    Republicans
18–24          27.18         49.74           23.08
25–34          32.62         40.29           27.09
35–44          38.65         34.15           27.20
45–54          36.74         36.46           26.80
55–64          45.00         26.25           28.75
65–74          45.91         29.57           24.51
75–84          46.29         21.14           32.57
85–94          52.38          9.52           38.10

Note: Table entries are row percentages.
Source: CPS 1992 National Election Study.

independents, or Republicans, respectively. In this case, we are probably not interested
in the actual percentages. Rather, we would like to know whether there are any inter-
esting patterns in the distribution of partisanship across age groups.
It might be possible to answer the preceding question through careful study of the
percentages in Table 20.3. But the answer is immediately obvious if we draw a picture of
the information, as in Figure 20.3. Clearly the percentage of self-​professed independents
decreases sharply as we move from younger to older age groups. Conversely, the per-
centage of partisans (especially Democrats) increases as we move in the same direction.
Furthermore, Democrats outnumber Republicans within every age group, but the size
of the gap becomes larger among older citizens.
The preceding examples suggest three general guidelines for determining when to use
tables and when to use graphs for presenting quantitative information:

1. If there is a relatively small amount of data and the specific numeric values are im-
portant, then tables are probably better than graphs.
2. If there is a large number of data values, then graphs are likely to provide more
useful information than tabular displays.
3. If the researcher is more interested in systematic patterns within the data than in
particular numeric values, then graphs are probably more useful than tables.

These three ideas really should be regarded as suggestions rather than hard and fast
rules. For one thing, it is not clear what constitutes “a small amount of data” or “a large
number of data values.” Also, it is important to keep in mind that any display strategy

[Figure: chart plotting party identification (percentage, vertical axis) against age group (18−24 through 85−94, horizontal axis), with separate series for Democrats, independents, and Republicans.]
Figure 20.3  Party Identification by Age Groups within the American Electorate, 1992.
Source: CPS 1992 National Election Study.

involves trade-​offs in the information that can be drawn easily from the display. So, for
example, a histogram shows the distribution well, but also makes it impossible to asso-
ciate data values with particular observations. Because of such considerations, the use
of graphs versus tables must be evaluated on a case-​by-​case basis, with the immediate
objective of the display always kept in mind (i.e., what information is the display in-
tended to convey to readers?). Nevertheless, we still believe that graphical displays have
a number of advantageous features compared to tables, and that they should probably be
used more widely in the political science literature.

The Importance of Visual Perception

In trying to determine what makes a good graph, the relevant criterion is not the aes-
thetics of the display. Instead, it is the degree to which the graph encourages accurate
interpretation of the information that it contains. But how can this be achieved in any
particular graphical display? An answer to this question requires at least a brief consid-
eration of how human beings process graphical information.
When statistical graphics are used for research purposes there are two interacting
components. On the one hand, graphical displays encode quantitative information as
geometric constructions rendered on a display medium. On the other hand, human
perception and cognition must be employed to decode this information and understand
its substantive implications relative to the research context within which it appears. This
process often works very well precisely because the human visual processing system
provides a very effective means for understanding complicated information. But the
interactive nature of the process is critically important:  The elements of the graph-
ical display must encode the quantitative information in a way that facilitates accurate
decoding on the part of the consumer. Therefore, it is useful to consider how people pro-
cess and interpret graphical information.
There are a variety of different scholarly perspectives on human graphical percep-
tion (e.g., Bertin 1983; Spence and Lewandowsky 1990). William S. Cleveland (1993a)
provides a theory that is particularly relevant to the construction and use of statistical
graphs. He argues that there are three components involved in interpreting graphical
displays of quantitative information: First, detection is the basic ability to see the data,
relative to the background elements of the display. This involves careful consideration of
the geometric objects that are used to depict the quantitative information. Tufte (2001)
would also say that it is important to maximize the data-​to-​ink ratio in the graph to max-
imize the prominence of the relevant information, rather than the external trappings of
the display (axis labels, grid lines, etc.).
Second, assembly is the recognition of patterned regularities across the discrete
elements in the graphical display. This involves directing the observer’s eye toward the
structure underlying the data (e.g., the shape of a univariate distribution or the relation-
ship between two variables) and away from the individual units that comprise the data
set. The tricky parts of this process are to avoid overlooking important features in the
data and to keep from imposing patterns that are not really there.
Third, estimation is the ability to make accurate judgments about quantities or
magnitudes using the visual elements of the graphical display. It has long been known
that there are systematic distortions in the ways that people process visual information.
So it is important that a graph employ geometric devices that tend to produce accurate
estimates of the quantitative information they represent. Cleveland’s research shows that
objects plotted against linear scales tend to be interpreted very accurately. Judgments
about slopes and angles are somewhat less accurate, and judgments about areas or sizes
of objects are even less so. Finally, differences in shading or color gradations produce the
least accurate estimates of quantitative differences.
Of course we should construct graphical displays that optimize all three aspects of
graphical perception. While it is easy to give this advice, it is often difficult to carry it
out in practice. The problem is that compromises are often necessary, leading to graphs
that emphasize some facets of the data more directly than others. It is impossible to pro-
vide any general rules to guide the researcher through the process of selecting the “best”
graph for any given research context. But there is a rule of thumb that seems to be appro-
priate: always try several different kinds of displays (or variants of a single display type)
for any given data set. Doing so often reveals features of the data that would be missed if
the analyst constructed a single graph and left it at that.

The Purpose of a Graphical Display

To determine whether a particular graphical display is a “good” graph or not, it is nec-
essary to consider the purpose of the display. For example, Jacoby (1997) distinguishes
between analytic graphics and presentational graphics (also see Unwin 2008). The
former are created as part of the data analysis process; analytic graphics are intended
to reveal interesting and salient aspects of the data to the researcher. As Tukey said,
visual depiction of the data “forces us to notice what we never expected to see” (1977,
vi). Presentational graphs assume that the important features of the data are already
known to the researcher. Instead, they create visual depictions of these features for other
audiences. Kosslyn states that “a good graph forces readers to see the information the
designer wanted to convey” (1994, 271).
Graphical displays that are created for articles in professional journals probably fall
somewhere in between “pure” presentational and analytical graphs. On the one hand,
the author certainly wants the relevant readers (i.e., the journal editor and reviewers)
to interpret the information in the way that he or she intends—​and the graph should
be constructed in a manner that encourages that. On the other hand, the norms and
ethics of the scientific community require strict adherence to principles of accuracy in
reporting data and study results; therefore, the elements of the graph should not do an-
ything that would encourage misleading interpretations. It is hoped the article’s readers
will be able to look at the graph and see what the author saw during the analysis, thereby
understanding how the conclusions were reached. For that reason, graphical displays in
published work should generally be fairly close to analytical graphs, perhaps with a few
more “bells and whistles” provided to help readers understand what they are seeing.

Some General Guidelines

There is, of course, enormous variety in the kinds of graphical displays that are available.
Accordingly, it is almost impossible to provide hard and fast rules for their construction
and use. We will state this disclaimer at the outset: there are caveats and exceptions to
every guideline that we provide below. Nevertheless, some principles can be applied to
most applications of particular graphical displays. Beyond those, we begin by consid-
ering two broad guidelines that pertain to all displays, regardless of the particular type
of graph they contain.

Avoid Overly Complicated Displays


The first recommendation is to avoid putting too much information into a single graph.
Doing so usually produces overly complicated displays that inhibit effective and effi-
cient visual perception and information processing. For example, Figure 20.4 presents
information about the ways that state-​level public opinion covaries with other charac-
teristics of state political systems. This graph encodes values from five variables. The
horizontal and vertical axes represent the partisanship and ideology of state electorates,
respectively (larger values indicate more Democratic or liberal populations). So each
plotting symbol is located at a position that summarizes public opinion within that state.

[Figure: glyph scatterplot locating each state by state electorate partisanship (horizontal axis) and state electorate ideology (vertical axis).]
Figure 20.4  State Political Characteristics in 1992.*
*Size of plotting symbol is proportional to policy priorities (larger circles indicate more spending on collective goods, rather than particularized benefits). Length of line segment is proportional to interest group strength in state. Orientation of line segment corresponds to size of state government (angles in clockwise direction from 12:00 to 6:00 correspond to larger numbers of state employees per capita). State glyphs are located according to state electorate partisanship and ideology (larger values indicate more Democratic/liberal electorates).
Sources: State public opinion data are obtained from Gerald Wright's website, http://mypage.iu.edu/wright1/. Interest group data are from Gray and Lowery (1996). Policy priorities and state employee data are from Jacoby and Schneider (2001).

The plotting symbols themselves are “glyphs” in which each component corresponds
to a different variable. The diameter of each circle is proportional to the state’s policy
priorities (Jacoby and Schneider 2009), with larger sizes indicating that a state spends
more money on collective goods than on particularized benefits. The length of the
line segment in each glyph is related to interest group strength within the state; longer
segments correspond to stronger interest group communities (Gray and Lowery 1996).
Finally, the orientation of each line segment—​in a clockwise direction, starting at the
“12:00 position” and ending at the “6:00 position”—​corresponds to the size of the state’s
government (in thousands of employees per capita).
Figure 20.4 certainly contains a great deal of information. But it is not very easy to
decode it and reach substantive conclusions. The rather extreme juxtaposition of many
data values within the plotting region (caused by both the complexity of the plotting
symbol and the overplotting due to states having similar values on the partisanship
and ideology variables) means that the reader must exert a great deal of perceptual and
cognitive effort to isolating the geometric elements that correspond to the variable of
interest (say, the length of the line segments for interest group strength) and then rec-
ognize patterns that exist across the elements (the segments tend to be longer near the
bottom of the plotting region). In terms of Cleveland’s graphical perception theory,
the display in Figure 20.4 is problematic for detection and assembly. There may also be
problems of visual estimation, since the plotting symbols rely on geometric devices that
are not processed very accurately (i.e., the areas of the circles, the lengths of nonaligned
segments, and the angles of the segments).
To address the problematic elements of Figure 20.4, we need to understand the
author’s objective in presenting the display. Here the goal presumably is to show how the
characteristics of state governments are affected by the attitudinal orientations of state
electorates. That information is probably presented more effectively in three separate
graphs, as in Figure 20.5. Once again, the axes correspond to state electorate partisan-
ship and ideology, while the diameters of the plotted circles are proportional to policy
priorities, interest group strength, and sizes of state governments, respectively.
Figure 20.5 takes up more physical space than Figure 20.4, and it uses three panels
rather than just one to encode the data values. But it facilitates visual processing—​not
the efficiency of information storage—​which is the relevant criterion for designing the
display. The “bubble plots” in the three panels of Figure 20.5 make it much easier to un-
derstand the variations in governmental characteristics than the complicated plotting
symbols employed in Figure 20.4.1 As a heuristic guideline, we believe it is useful to
think of a graph as a visual analogue to a paragraph of text: it should be used to present
one major idea.
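
The chapter contains no code, but the same design choice is easy to implement. The sketch below (Python/matplotlib, with randomly generated stand-in values rather than the actual state measures) draws three bubble plots on shared axes, encoding one additional variable per panel through symbol size instead of packing all five variables into a single glyph.

# Sketch (hypothetical data): three bubble plots sharing the same axes,
# one panel per additional variable, instead of a single multi-part glyph.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
partisanship = rng.uniform(-0.4, 0.2, 50)      # stand-ins for the state-level measures
ideology = rng.uniform(-0.3, 0.1, 50)
extras = {"policy priorities": rng.uniform(0.2, 1.0, 50),
          "interest group strength": rng.uniform(0.2, 1.0, 50),
          "size of state government": rng.uniform(0.2, 1.0, 50)}

fig, axes = plt.subplots(1, 3, figsize=(10, 3.5), sharex=True, sharey=True)
for ax, (label, z) in zip(axes, extras.items()):
    ax.scatter(partisanship, ideology, s=300 * z,            # marker area encodes z
               facecolors="none", edgecolors="black")
    ax.set_title("Symbol size: " + label, fontsize=9)
    ax.set_xlabel("State electorate partisanship")
axes[0].set_ylabel("State electorate ideology")
fig.tight_layout()
plt.show()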

Show the Full Scale Rectangle


It is both long-​standing practice within the research community and the default in
most graphing software to provide labels only for the left and bottom axes for any given

[Figure: three bubble plots, each locating states by state electorate partisanship (horizontal axis) and state electorate ideology (vertical axis). Panel A: plot symbol size is proportional to state policy priorities; Panel B: plot symbol size is proportional to interest group strength in the state; Panel C: plot symbol size is proportional to size of state government.]
Figure 20.5  Bubble Plots Showing State Political Characteristics, Relative to State Electorate Partisanship and Ideology in 1992.
Sources: State public opinion data are obtained from Gerald Wright's website, http://mypage.iu.edu/wright1/. Interest group data are from Gray and Lowery (1996). Policy priorities and state employee data are from Jacoby and Schneider (2001).

display. Extending this idea, some graphs omit the right and top axes entirely. Doing so
is consistent with Tufte’s advice to “maximize the data-​to-​ink ratio.” In other words, the
scales for the quantitative elements already appear in the left and bottom axes; it would
be redundant to repeat this information on the other two axes. While we generally agree
with the principle of maximizing the data-​to-​ink ratio, we disagree with its application
to eliminate two axes in a bivariate graph. Instead, our second recommendation is to
show all four axes—​that is, the “scale rectangle”—​in a graph. Any redundancy costs are
far outweighed by the advantages of doing so.
For example, consider Figure 20.6, which shows two versions of a scatterplot. The first
graph (Figure 20.6A) only shows two coordinate axes. Note how difficult it is to discern
visually the boundaries of the plotting region. The data points depicting observations
with relatively large values on the two variables seem to “float in space.”
But more important than the preceding aesthetic problem is that the omission of the
right and top axes in Figure 20.6A inhibits visual perception. Specifically, estimation
of quantitative variability is optimized when plotting elements are arrayed against a
common scale. Since the data points in the upper-​right portion of the plotting region
are quite far away from the scale axes, it is more difficult to judge the differences be-
tween these observations than it is with points in the lower-​left portion of the graph (i.e.,
observations with relatively small values on the two variables).
Figure 20.6B alleviates the visual perception problem by showing all four sides of the
scale rectangle. Now the data points in the upper-​right portion of the plotting region
are still relatively close to axes that facilitate more accurate judgments about differences
across observations. Note that it is not necessary to provide labels for the top and right-​
hand axes. The tick marks correspond to those that are shown with labels on the bottom
and left-​hand axes, respectively; therefore, the specific quantitative information can still
be retrieved very easily. The tick marks alone should provide sufficient visual cues to fa-
cilitate accurate estimation of differences in point locations.
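
In matplotlib, for example, the full scale rectangle is easy to obtain because all four spines are drawn by default; one only needs to request tick marks on the top and right axes while suppressing their labels. The following sketch uses simulated values rather than the Enns and Koch data.

# Sketch (hypothetical data): show tick marks on all four axes of the scale
# rectangle, with numeric labels only on the bottom and left.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
pct_liberal = rng.uniform(15, 30, 50)
pct_democratic = 10 + pct_liberal + rng.normal(0, 2, 50)

fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(pct_liberal, pct_democratic, facecolors="none", edgecolors="black")
ax.tick_params(top=True, right=True,              # tick marks on all four sides...
               labeltop=False, labelright=False)  # ...labels only on two of them
ax.set_xlabel("Percent liberal in state electorate")
ax.set_ylabel("Percent Democratic in state electorate")
fig.tight_layout()
plt.show()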

Univariate Graphical Displays

Univariate graphs are typically used to illustrate a single variable’s distribution. For ex-
ample, a bar chart might show the number or percentage of observations that fall within
each category of a discrete variable. Similarly, a histogram shows the density of data at
each location within the range of a continuous variable. There are many different kinds
of univariate graphical displays. Here, however, we focus only on the few types that tend
to appear with any frequency in political science journal articles: pie charts, bar charts,
dot plots, and histograms.2

Pie Charts
Pie charts are a well-​known graphical strategy for showing a small number of numeric
values that sum to some meaningful whole—​for example, the number or percentage of
observations from some sample that fall within each of a set of categories. Our advice
about pie charts is simple: avoid using them in manuscripts that are intended for publi-
cation as journal articles.

[Figure: Panel A shows a scatterplot with only two coordinate axes; Panel B shows the same scatterplot with the full scale rectangle. Both plot percent liberal in the 2007 state electorate (horizontal axis) against percent Democratic in the 2007 state electorate (vertical axis).]
Figure 20.6  Two Versions of a Scatterplot Showing the Percentage of Democratic Identifiers versus the Percentage of Liberals in State Electorates (Data from 2007).
Source: Enns and Koch (2013b).

The problem is that pie charts rely on geometric representations of quantitative values
that are not amenable to accurate visual judgments. The numbers associated with each
category of the variable represented in a pie chart are shown by the differing sizes of the
wedges that are cut in the pie. However, people generally are not very good at judging
differences either in areas or in angular separations. Therefore, it is unlikely that readers
will be able to work back from the relative sizes of the pie wedges to the numeric values
they are intended to represent.
There also are at least two practical issues that limit the utility of pie charts in scien-
tific publications. The first is relatively minor. A pie chart can only represent a small
number of categories; otherwise, the wedges become too small and visual detection of
the numeric information is compromised. And as suggested previously, a small number
of values often can be conveyed without resorting to a graphical display at all.
The second practical problem stems from the fact that pie charts often use different-​
colored wedges to help readers distinguish between the discrete categories of the vari-
able being plotted. But social science journals generally do not use color in their figures.
Therefore, the wedges are displayed as varying shades of gray, which are perceived even
less accurately than the sizes of the wedges.
Pie charts may well be useful as presentational graphics, where the objective often is to
highlight the basic existence of different-​sized categories. But there are other graphical
displays that work more effectively to facilitate accurate judgments about differences in
quantitative values. Therefore, we believe that pie charts are best left out of publications
that are intended for a scientific audience.

Bar Charts
Bar charts encode labeled numeric values as the end points of bars that are located rel-
ative to a scale axis. Stated very loosely, the longer the bar, the larger the numeric value
associated with that label. Bar charts are more broadly useful than pie charts because
the numbers plotted in the display do not need to sum to a meaningful value (e.g.,
they do not need to be percentages that sum to 100). They also may be better than pie
charts because they encourage more accurate visual perception. Cleveland’s research
(1993a) shows that judgments about different objects arrayed along a common scale
are usually carried out very accurately. This is precisely how the bar chart presents its
information.
For example, Figure 20.7 shows a bar chart with some information about state public
opinion in 2007. Specifically, the figure shows regional differences in the policy mood of
state electorates. It uses the regional mean values of a variable that was originally devised
by Stimson (1999) and adapted for the American states by Enns and Koch (2013a). The
specific values of this variable are arbitrary, but differences between the scores assigned
to different states correspond to variability in the general policy orientations of state
electorates. Larger values indicate more liberal state public opinion, and smaller values
indicate more conservative electorates.

[Figure: horizontal bar chart of mean state policy mood by region (Midwest, Northeast, South, West); the horizontal axis begins at 36 rather than zero.]
Figure 20.7  Bar Chart Showing Mean Policy Mood within Each State, by Region.
Source: Enns and Koch (2013b).

It is extremely important to be attentive to the details in a bar chart (of course, that is
good advice for all graphs). Here, for example, the horizontal orientation of the display
(i.e., the bars run from the left side of the plotting region to the right, rather than verti-
cally) makes it easier to read the textual labels associated with each bar. Also, notice that
the bars are separated from each other by small intervals along the vertical axis; that
emphasizes the discrete nature of the variable being displayed. There are explicit axes
drawn at both the top and the bottom of the plotting region, and they both contain iden-
tical tick marks. This enhances visual estimation of differences in the ends of the bars; it
also facilitates table look-​up, or estimating approximate numeric values from the geo-
metric elements of the display.
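
Those details translate directly into code. The sketch below (Python/matplotlib, with illustrative regional means rather than the values plotted in Figure 20.7) draws a horizontal bar chart, leaves gaps between the bars, and repeats the tick marks on the top axis.

# Sketch (illustrative values only): a horizontal bar chart with readable labels,
# gaps between the bars, and tick marks on both the bottom and top axes.
import matplotlib.pyplot as plt
import numpy as np

regions = ["Midwest", "Northeast", "South", "West"]
mean_mood = [44.0, 46.0, 37.0, 45.0]              # made-up regional means

fig, ax = plt.subplots(figsize=(5, 3))
y = np.arange(len(regions))
ax.barh(y, mean_mood, height=0.5, color="0.6")    # height < 1 leaves gaps between bars
ax.set_yticks(y)
ax.set_yticklabels(regions)
ax.tick_params(top=True, labeltop=False)          # repeat tick marks on the top axis
ax.set_xlabel("Mean state policy mood")
fig.tight_layout()
plt.show()
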
Although they are superior to pie charts, bar charts also have a potentially serious
weakness. If the bars represent anything other than frequencies or percentages (in
which case we usually would consider the graphic to be a histogram), then the origin
of the bars (i.e., the numerical value represented at the base of each bar) is arbitrary. The
placement of the origin affects the relative sizes of the bars in the chart. Readers may
focus on the differences in the lengths or areas of the bars, rather than on the differences
in the numeric values located at the bars’ end points. This is problematic, because only
the latter encode meaningful information; with an arbitrary origin, the sizes of the bars
will also be arbitrary.
For example, a cursory inspection of Figure 20.7 could easily lead a reader to con-
clude that northeastern and western states have much more liberal electorates than do
southern states. After all, the bars for the former two regions are almost six times longer
than that for the latter region. However, such an interpretation would be incorrect, since
the bars originate from the completely arbitrary value of 36. In fact, the mean values only
run from 37.03 to 46.17, while the original variable ranges from 31.75 to 60.05. And since
state policy mood is an interval-​level summary index, any such magnitude comparisons
of the regional means are probably inappropriate.
To show how the position of the bar origin affects visual perception of a bar chart,
consider the first panel of Figure 20.8. This bar chart shows exactly the same infor-
mation as Figure 20.7. Only one detail of the graphical display has been changed; the
origin on the horizontal axis has been set to 0 rather than 36. Notice that the contrast
between the terminal locations of the bars does not seem nearly as pronounced here
as it did in the previous display. While we still see that the mean policy mood score is
lower in southern states than in other regions, it is now clear that the differences are not
that great.
This problem occurs whenever a bar chart is used to show values of an interval-​level
variable, since the zero point at this level of measurement is always arbitrary, by defini-
tion. If a bar chart is used in a research manuscript, then it is essential that the author
provide sufficient explanation and guidance for readers so the chances of any misinter-
pretation are minimized. Or one could use an alternative graphical display that avoids
this problem entirely (like the dot plot, described below).
Other problems arise when a bar chart is presented in a pseudo three-​dimensional
format. The second panel of Figure 20.8 uses this display strategy for the regional
differences in mean policy mood. Apparently the purpose of such a display is to suggest
that the graph depicts a physical structure. Intuitively, that alone seems to distort the in-
herently abstract nature of a variable’s distribution. In addition, the drawn-​in elements
that create the “third dimension” definitely conform to Tufte’s definition of “chart-​junk”
(2001): visual elements added to a display that serve no purpose in conveying the quan-
titative information that the graph is intended to represent.
But there is another, more serious problem with a three-​dimensional bar chart.
The oblique viewing angle used to create the illusion of depth and perspective
makes it more difficult to assess visually the relative heights of the bars associ-
ated with different categories or labels. Doing so involves comparisons of the bar
heights along a nonaligned scale, and this is a task that is usually carried out less
accurately than comparisons along a common scale (such as is used in a standard
bar chart). For all of these reasons, three-​dimensional bar charts should be avoided
in scientific publications. The visual enhancements that they provide do not com-
pensate for the problems they introduce into the visual representation of quantita-
tive information.

[Figure: Panel A shows a bar chart with the bar origin set to 0; Panel B shows a three-dimensional bar chart. Both plot region (Midwest, Northeast, South, West) against mean state policy mood.]
Figure 20.8  Variations on the Bar Chart Showing Mean State Policy Mood, by Region.
Source: Enns and Koch (2013b).

Dot Plots
In its most basic form, a dot plot is a two-​dimensional array in which one axis (usu-
ally the vertical) contains textual labels, and the other axis (usually the horizontal)
represents the scale for the variable under consideration. The data values are plotted
as points located at the appropriate horizontal position within each row. For ex-
ample, Figure 20.9 shows a dot plot of the regional means for the policy mood of state
electorates—​in other words, the same information that was shown in Figures 20.7 and
20.8. In the present context, the farther the point is located toward the right, the more
liberal the state electorates within that region, and vice versa.

[Figure: dot plot of mean state policy mood (horizontal axis) by region (Midwest, Northeast, South, West).]
Figure 20.9  Dot Plot of Mean 2007 State Policy Mood, by Region.
Source: Enns and Koch (2013b).

Once again, some of the seemingly minor details of the dot plot contribute di-
rectly to its effectiveness as a graphical data display. As with the earlier bar chart, the
horizontal orientation makes it easy to read the labels. The horizontal dashed lines
facilitate table look-​up (i.e., visually connecting the data values to the proper cate-
gory labels), but the lighter color of the lines helps ensure that the plotting symbols
representing the quantitative values are the most prominent elements within the
plotting region.
The dot plot is a particularly useful graphical display because it avoids most of the
problems encountered with pie charts and bar charts. Visual perception of the infor-
mation in a dot plot involves comparing the relative positions of the plotting symbols
along a common scale, a processing task that is carried out more accurately than
the angular and area judgments required for pie charts. Notice, too, that the relative
differences in the horizontal positions of the plotted points in Figure 20.9 are identical
to the differences in the endpoints of the bars in Figure 20.7. Here, however, the hori-
zontal dashed lines for each region extend all the way from the left vertical axis to the
right vertical axis. In so doing, they provide no visual cues that encourage misleading
comparisons analogous to those based on the sizes of the bars in a bar chart. Instead,
the varying point locations along the respective lines facilitate judgments about the
differences between the plotted values, which are completely appropriate for interval-​
level data values like these.
On the other hand, there are situations in which magnitude and ratio comparisons
are appropriate, and the dot plot can be adapted to take this into account. Figure 20.10
shows the percentage of the electorate that identified themselves as Republicans within
each of the American states in 2007. Here the labels on the vertical axis in the dot plot
are sorted according to the data values; this makes it easier to perceive differences
among the states. Once again, the dashed lines facilitate table look-​up. But now they
only extend from the zero point on the horizontal axis out to each observation’s plotting
symbol, thereby making the length of each line segment proportional to the data value.
For example, the line for Wyoming is about two times longer than that for Rhode Island,
and this does correspond to the magnitude difference in the percentage of Republican
identifiers in the two states. The dot plot in Figure 20.10 illustrates a general principle
that holds for all data graphics: the visual elements of the display should be set up in a
manner that encourages visual comparisons that are appropriate for the nature of the
data shown in the graph.
Figure 20.10 also shows another, more practical, advantage that dot plots have over
pie charts and bar charts: They can be used to display a much larger number of dis-
tinct values than either of the latter two kinds of displays. Dot plots have a number of
strong features (e.g., if they show raw data, as in Figure 20.10, it is easy to extract infor-
mation about the distribution from the geometric structure in the display), and they can
be adapted to a variety of situations (e.g., including visual representations of sampling
error, making comparisons across subgroups). Overall, we believe any information that
could be depicted in a pie chart or bar chart can actually be displayed more effectively in
a dot plot (Cleveland 1984a; Jacoby 2006).

[Figure: dot plot of the fifty states, ordered by value, against the percent of Republican identifiers in each state electorate (horizontal axis, beginning at zero), with dashed guide lines extending from zero to each point.]
Figure 20.10  Dot Plot Showing Percent of State Electorates Identifying Themselves as Republicans in 2007.
Source: Enns and Koch (2013b).
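
A dot plot of this kind takes only a few lines to build by hand. The sketch below (Python/matplotlib, with randomly generated placeholder states and percentages rather than the Enns and Koch values) sorts the labels by the data and draws dashed guide lines from zero, so that the length of each line is proportional to the plotted value.

# Sketch (hypothetical data): a dot plot sorted by value, with dashed guide lines
# drawn from zero so line length is proportional to each data value.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)
labels = ["State " + str(i + 1) for i in range(15)]        # placeholder state names
pct_rep = rng.uniform(20, 45, len(labels))                 # placeholder percentages

order = np.argsort(pct_rep)                                # order categories by the data
values = pct_rep[order]
names = [labels[i] for i in order]
y = np.arange(len(names))

fig, ax = plt.subplots(figsize=(5, 5))
ax.hlines(y, 0, values, colors="0.7", linestyles="dashed", linewidth=0.8)
ax.plot(values, y, "o", color="black")
ax.set_yticks(y)
ax.set_yticklabels(names)
ax.set_xlim(0, values.max() * 1.1)                         # origin at zero for ratio comparisons
ax.set_xlabel("Percent Republican identifiers in state electorate")
fig.tight_layout()
plt.show()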

Histograms
A histogram is a graphical display that is conceptually different from, but superficially
similar to, a bar chart. Strictly speaking, a histogram shows the probability distribution
for a random variable. It is a two-​dimensional graph in which the scale along one axis
(usually the horizontal) corresponds to the range of a variable (say, X). The data den-
sity at any point within this range, say xi, is represented by the vertical height of a point
plotted at horizontal position xi. In principle, if X is a continuous variable, then the
plotted points would extend across the entire range of the data, producing a smooth
curve. The total area under the curve would be 1.0, making the area under the curve
between any two horizontal positions (say, x1 and x2) equal to the probability that a ran-
domly selected xi falls within the interval from x1 to x2.
In reality, some adjustments are usually made to the theoretical conception of a his-
togram to take into account the features of “real” empirical data. If, as is usually the
case, there are relatively few observations available at each distinct xi, then X’s range
is divided into a set of adjacent, mutually exclusive, and exhaustive intervals, usually
called “bins.” A rectangle is drawn for each bin, with the width spanning the entire
bin and the height proportional to the relative frequency of observations within that
bin. At the same time, the scale plotted on the vertical axis is usually changed from
densities to percentages. The latter adjustment has no effect on the shape of the histo-
gram, but it does enable the reader to extract more useful information from the graph
(i.e., the percentages of observations within intervals of X values) than would be the
case with the densities.
Of course it is the presence of the rectangles and the percentages on the vertical axis
that makes the histogram look like a bar chart. But once again, attention to the details
makes it easy to distinguish these types of displays. The horizontal axis of the histogram
corresponds to X’s range rather than a set of category labels. Note that the boundaries
of the bins (i.e., the vertical edges of the rectangles in the plotting region) do not neces-
sarily correspond to the locations of the tick marks on the horizontal axis. Notice also
that adjacent rectangles in the histogram touch each other; there is no gap between
them as was the case in the bar chart.
A histogram does not have the arbitrary origin problem that arose with the bar
chart, since the rectangle for each bin is necessarily anchored at zero (i.e., the position
that corresponds to a bin with no observations contained inside its boundaries). But
there is a different problem, because the bin origin (i.e., the X value corresponding to
the lower limit of the first bin) and the bin width (i.e., the size of the interval of values
contained within each bin) are both defined by the researcher; they are not implied by
the data themselves. Moving the origin or modifying the bin width affects the way that
observations are sorted into the respective bins. In so doing, these actions can have a
profound impact on the appearance of the overall histogram, possibly leading to
different substantive interpretations about a variable’s distribution.
As a tangible illustration of the problems that can occur, consider Figure 20.11. The
first panel shows a histogram of state policy mood scores in 2007, with bin widths of 4
units and a bin origin of 30. Here, the distribution of scores appears to be unimodal and
nearly symmetrical, although the upper tail is definitely a bit heavier than the lower tail.
Despite the latter asymmetry, Figure 20.11A seems to depict a reasonably “well-​behaved”
distribution of data values.

[Figure: three histograms of state policy mood in 2007, with percent of total on the vertical axis. Panel A: bin width of 4 and bin origin at 30; Panel B: bin width of 4 and bin origin at 28; Panel C: bin width of 2 and bin origin at 30.]
Figure 20.11  Three Versions of a Histogram Showing the Distribution of State Policy Mood in 2007.
Source: Enns and Koch (2013b).

The second panel of Figure 20.11 shows exactly the same data. Here, however, the
bin origin has been shifted 2 units to the left, to 28. The bin width remains fixed at 4
units. This is completely legitimate, since the minimum score that occurs in the data
set is 31.75. But notice how the histogram now looks very different from the earlier
version. Here the asymmetry of the distribution is much more pronounced; far more
observations fall in the upper half of the variable’s range (i.e., greater than about
40) than in the lower half. The third panel of Figure 20.11 returns the bin origin to 30,
but reduces the bin width to 2 units. Here the asymmetry does not seem to be as pro-
nounced, but the distribution appears to be multimodal. The troubling point is that
the differences between Figures 20.11A, 20.11B, and 20.11C have nothing whatsoever to
do with the data. Instead, they occur entirely because of a seemingly minor (indeed,
trivial) change in a small detail of the graphical display.
The effects of bin definitions are well known to statisticians and researchers in the
field of data graphics (e.g., Scott 1992; Cook and Swayne 2007). However, most data
analysts probably do not think about this when they are preparing a manuscript for
submission to a journal. Modern software packages use various algorithms to set the
default bin origins and widths; often users simply accept what appears in the output.
This is where the interactive nature of statistical graphics becomes particularly impor-
tant. It is always useful to spend some time “tinkering” with the details of a histogram,
just to make sure that no important features of the data are overlooked before the
variable’s distribution is “exposed to the world” as a graphical display in a manuscript
or journal article.
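
Most plotting libraries let the analyst supply the bin edges directly, which makes this kind of tinkering straightforward. The sketch below (Python/matplotlib, with simulated scores standing in for the fifty state values) draws the same variable three times using different bin widths and origins, in the spirit of Figure 20.11.

# Sketch (simulated data): the same variable histogrammed under three different
# bin definitions, since both bin origin and bin width are analyst choices.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(4)
mood = np.clip(rng.normal(42, 5, 50), 30.5, 57.5)   # stand-in for the 50 state scores

bin_specs = [("Width 4, origin 30", np.arange(30, 62, 4)),
             ("Width 4, origin 28", np.arange(28, 64, 4)),
             ("Width 2, origin 30", np.arange(30, 60, 2))]

fig, axes = plt.subplots(1, 3, figsize=(10, 3), sharex=True)
weights = np.full(len(mood), 100.0 / len(mood))     # convert counts to percentages
for ax, (label, edges) in zip(axes, bin_specs):
    ax.hist(mood, bins=edges, weights=weights, edgecolor="black")
    ax.set_title(label, fontsize=9)
    ax.set_xlabel("State policy mood")
axes[0].set_ylabel("Percent of total")
fig.tight_layout()
plt.show()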

Bivariate Graphical Displays

Bivariate graphs plot data relative to coordinate systems defined by two substantively in-
terpretable axes. Unlike the unidimensional case, in which there are several qualitatively
different types of displays, bivariate graphs are almost always based on the general idea
of using one axis to show a scale corresponding to the range of values on a single vari-
able and the other axis to do the same for a second variable. According to long-​standing
tradition, the variable whose range is depicted on the horizontal axis is generically la-
beled X, while that on the vertical axis is called Y. Each observation in the data set is
represented by a plotting symbol, which is located at a position determined by its values
on the X and Y variables.
Again, virtually all bivariate graphs are based on these simple ideas. But there is
still an enormous amount of latitude left in the details (e.g., the choice of plotting
symbol; labels for points, axes, and tick marks; use of color and shading). The choices
that an author makes regarding these seemingly minor elements of the graph can
have a profound impact on the effectiveness of the display for conveying information
to readers.

Two Broad Categories of Bivariate Graphs


Bivariate graphs can be divided into two broad classes of displays (Greenacre 2007).
First, maps are used to display similarities and differences across specific objects.
Of course this category includes the physical maps with which we are all familiar. In
that case, different positions within the display correspond to different geographic
locations of objects such as cities, roads, and landmarks. But maps can also display how
objects differ from each other across each of two variables. Maps also are sometimes
used to display results from data analyses, such as multidimensional scaling, principal
components, factor analysis, or correspondence analysis. Regardless of the exact ap-
plication, the general idea of a map is that the reader can differentiate the substantive
identities of the objects that are plotted. Because of this, maps often contain relatively
few data points.
Second, scatterplots (and their relatives) are used to display structure across
observations within a data set. Stated a bit differently, scatterplots are commonly em-
ployed to examine the relationship between two variables. One variable (say, Y) is re-
lated to another (say, X) if the conditional distribution of Y varies systematically across
the range of X values. Accordingly, the general objective of a scatterplot is to allow the
reader to discern the predominant shape of the data “cloud” rather than the separate
identities of individual points within the display.
The distinction between maps and scatterplots is not entirely clear-cut. For example,
a researcher might be interested in determining whether the differences among the
objects displayed in a map conform to some recognizable pattern (e.g., do objects that
are believed to be different from each other really appear at widely separated positions
within the plotting region?). Or an analyst may want to identify some of the specific
points in a scatterplot (e.g., outliers that could affect the calculated values of sum-
mary statistics). But even though maps and scatterplots share most of their features,
there are some potentially important differences in the details of these two displays,
discussed below.

Pay Attention to the Details!


A well-​constructed graph should contain just enough information to facilitate accurate
retrieval of the information it contains. Anything less than this provides readers with
an incomplete representation of the author’s argument. Anything more constitutes ex-
traneous and unnecessary content that is potentially distracting to readers. Seemingly
small details and pictorial elements can have a profound impact on the degree to which
any particular graphical display achieves this overall objective. Drawing heavily on
Cleveland’s (1994) work, we can suggest several general principles for constructing
effective bivariate graphs:
Make the background of the plotting region transparent. That is, it should be the am-
bient color of the display medium (e.g., white for paper), rather than shaded or shown
in a contrasting color. Shading serves no useful purpose in the graph, since the scale rec-
tangle already delineates the boundaries of the plotting region. Furthermore, shading
may be detrimental to visual perception, since it makes it more difficult to see the
plotting symbols.
Use relatively few tick marks on the axis scales, and make sure they point out-
ward. The tick marks should be used to give viewers of the graph a general sense
of the range of data values associated with each of the variables depicted in the dis-
play. A small number of labeled points on each scale is sufficient for this purpose.
The ticks should not point into the plotting region because they could collide with
data points and therefore impair visual perception of the information contained in
the graph.
Do not use grid lines within the plotting region. In the past, grid lines within a graph
were used (along with detailed tick marks on the axis scales) to facilitate accurate
visual retrieval of specific data values from the plotted points. This is simply unneces-
sary in modern data analysis, since the information is stored more accurately and easily
retrieved from a numeric database. Again, bivariate graphical displays (regardless of
whether they are maps or scatterplots) are used for examining differences and structure
across objects; they are not particularly useful for discerning specific quantitative data
values. And just like inward-​pointing tick marks, grid lines may impair visual percep-
tion of the data points.3
In a scatterplot, it is often useful to superimpose a smooth curve over the point
cloud to provide a visual summary of the relationship between the variables. The gen­
eral idea behind such a “scatterplot smoother” is to summarize the central tendency
of the conditional distribution of Y across the range of X values. The main concern
when fitting a smooth curve is to make sure that it really does represent accurately the
predominant structure within the bivariate data. For example, many analysts simply
fit an Ordinary Least Squares (OLS) line to the data to show the linear trend. But it
is often worthwhile to look for nonlinear relationships, using data transformations
(e.g., Atkinson 1985), polynomial functions (e.g., Narula 1979), or nonparametric
smoothers (e.g., Cleveland 1993b). When nonlinearity actually exists within the data,
the latter not only will provide a more accurate depiction of the underlying structure;
they may also reveal details of the bivariate data that are interesting and important
from a substantive perspective.
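As an illustration of this point, here is a minimal Python sketch (matplotlib and numpy on simulated data; the variable names and the data-generating function are invented for the example) that overlays an OLS line and a quadratic polynomial fit on the same point cloud, so a reader can see how the more flexible fit picks up curvature that a straight line misses.

```python
# A minimal sketch, not from the chapter: an OLS line versus a quadratic fit.
# The data are simulated; "x" and "y" are hypothetical survey-style measures.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 40, 150)
y = 0.50 + 0.0040 * x - 0.00006 * x**2 + rng.normal(0, 0.01, x.size)  # true relation levels off

fig, ax = plt.subplots()
ax.scatter(x, y, facecolors="none", edgecolors="black")   # open circles for the point cloud

grid = np.linspace(x.min(), x.max(), 200)
linear = np.polyfit(x, y, deg=1)       # ordinary least squares line
quadratic = np.polyfit(x, y, deg=2)    # simple polynomial alternative
ax.plot(grid, np.polyval(linear, grid), linestyle="--", color="black", label="OLS line")
ax.plot(grid, np.polyval(quadratic, grid), linestyle="-", color="black", label="Quadratic fit")

ax.set_xlabel("Hypothetical X variable")
ax.set_ylabel("Hypothetical Y variable")
ax.legend(frameon=False)
plt.show()
```

A nonparametric smoother (for example, the lowess routine distributed with the statsmodels package) could be substituted for the polynomial with only minor changes.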
Make sure that the plotting symbols representing the data are visually prominent
within the display. This general point actually involves several distinct considerations:

• Make sure that the scale rectangle is large enough that it leaves some extra white
space on all sides of the most extreme data points. Stated differently, the “data rec-
tangle” should be smaller than the scale rectangle in order to avoid collisions be-
tween the plotted points and the axes of the display.
• The plotting symbols should be large enough to guarantee that they are easily visible
within the graph. Even though the numeric values associated with an observation
in the data set do define a single location within the plotting region, the pictorial
representation of a single point (i.e., a period, or “.”) is too small for effective use;
from the perspective of graphical perception theory, this would impair basic detec-
tion of the data.
• The plotting symbols should be resistant to overplotting effects. Observations with
similar data values will be located close together within the plotting region. If the
pictorial symbols used to represent them are large enough to be visually promi-
nent, then they will overlap. When this occurs, it is important that the viewer of the
graph still be able to discern the existence of the separate observations. This is dif-
ficult to do with filled plotting symbols (e.g., a solid black square or circle), which
tend to form a blob or indistinct mass when overplotting occurs. Instead, an open
circle is a good general-​purpose plotting symbol that works well in most bivariate
graphs.

Another detail in the construction of a scatterplot is not really graphical in nature: the
axis labels should be readily interpretable in substantive terms. It is never a good idea to
display the acronyms or abbreviated variable names that typically are used in software
command files. The latter will not be clear to anyone other than the person who wrote
the code—​and even for that person, short variable labels can be very confusing.
All of the preceding ideas may seem to be perfectly obvious and little more than
common sense. However, these principles are violated routinely in many of the graph-
ical displays that actually appear within the political science research literature. The two
panels of Figure 20.12 demonstrate the impact that these seemingly small details have
on the quality of a graphical display. Both panels show scatterplots of the same bivariate
data. But the first plot (Figure 20.12A) violates all of the preceding principles (i.e., shaded
plotting region, many inward-​pointing tick marks, grid lines, tiny plotting symbols that
extend out to the scale rectangle, an OLS line fitted to the data, and short variable names
used as axis labels), while the second plot (Figure 20.12B) conforms to them. Remember
that the relevant judgmental criterion is not aesthetic quality, but rather the ability to
discern systematic structure within the data. By that standard, Figure 20.12B clearly is
better than Figure 20.12A.
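To make these details concrete, here is a hedged matplotlib sketch, on simulated data with hypothetical variable names, that applies the principles listed above: an ambient white background, no grid lines, a few outward-pointing ticks, open-circle plotting symbols, extra margin between the data and the scale rectangle, and substantive axis labels.

```python
# A minimal sketch (simulated data, hypothetical variable names) applying the
# construction details discussed above.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

rng = np.random.default_rng(1)
pct_republican = rng.uniform(16, 42, 50)                          # hypothetical X values
policy_mood = 55 - 0.45 * pct_republican + rng.normal(0, 3, 50)   # hypothetical Y values

fig, ax = plt.subplots(facecolor="white")
ax.set_facecolor("white")                                  # ambient (unshaded) background
ax.grid(False)                                             # no grid lines
ax.scatter(pct_republican, policy_mood,
           facecolors="none", edgecolors="black", s=40)    # visually prominent open circles

ax.margins(0.08)                                           # white space between data and axes
ax.tick_params(direction="out")                            # outward-pointing ticks
ax.xaxis.set_major_locator(MaxNLocator(4))                 # relatively few labeled ticks
ax.yaxis.set_major_locator(MaxNLocator(4))

ax.set_xlabel("Percent Republican identifiers (hypothetical)")   # substantive labels, not acronyms
ax.set_ylabel("State policy mood (hypothetical)")
plt.show()
```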

Jittering for Discrete Data
In the social sciences, we often encounter discrete variables, wherein the number of dis-
tinct data values is relatively small compared to the range of the data and the number of
observations. When we try to include such variables in a scatterplot, severe overplotting
can occur (i.e., many separate observations fall at a single common location), impairing
visual detection and assembly of the data.
One strategy for dealing with discrete data in a graphical display is to introduce a
small amount of random variability into the data values as they are plotted in a graphical
display (Chambers et al. 1983). This breaks up the locations of the individual plotting
symbols so that it is possible to discern the separate observations. The overall size of this
random fluctuation is very small, so there is no danger of mistaking the "noise" for
the real, substantively important, variability across the actual data values.

Figure 20.12A  Cumulative Impact of Small Details: Bad Scatterplot. Source: Enns and Koch (2013b).
Figure 20.12B  A Better Version of the Preceding Scatterplot. Source: Enns and Koch (2013b).
This process is called “jittering.” Figure 20.13 shows how jittering can facilitate the
graphical display of discrete data. Both panels show scatterplots of the relationship be-
tween two discrete variables, each of which has seven distinct values. Figure 20.13A
shows a plot of the original, unenhanced data. All we can see is a rectangular grid that
apparently contains 44 points. The actual data set contains 434 observations, but this
would not be apparent to a viewer. Figure 20.13B shows a jittered version of the same
scatterplot. Now it is clear that there are many more than 44 observations in the data,
and the variations in the ink density across the separate “clusters” of jittered data points
show that there is a positive relationship between the two variables.
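A minimal sketch of the jittering idea, assuming two simulated seven-category variables rather than the ANES items themselves, is shown below; the jitter() helper is a hypothetical convenience function, not part of any plotting library.

```python
# A small sketch of jittering on simulated discrete data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
party_id = rng.integers(0, 7, 434)                              # discrete 0-6 scale, hypothetical
ideology = np.clip(party_id + rng.integers(-2, 3, 434), 0, 6)   # correlated discrete scale

def jitter(values, amount=0.15, rng=rng):
    """Return values plus a small amount of uniform random noise."""
    return values + rng.uniform(-amount, amount, size=len(values))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(party_id, ideology, facecolors="none", edgecolors="black")
ax1.set_title("Raw discrete values")
ax2.scatter(jitter(party_id), jitter(ideology), facecolors="none", edgecolors="black")
ax2.set_title("Jittered values")
for ax in (ax1, ax2):
    ax.set_xlabel("Party identification (0-6)")
    ax.set_ylabel("Liberal-conservative ideology (0-6)")
plt.tight_layout()
plt.show()
```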

Labeling Points
Providing descriptive labels for the individual data points in a two-​dimensional graph
may seem like a simple way to enhance a display. However, it is important to consider
carefully whether labels really do add useful information to the graph. Even if they do,
there are some potentially tricky considerations involved in using them (Kuhfeld 1986;
Noma 1987).
Generally speaking, point labels should be used in data maps but avoided in
scatterplots. Remember that a data map emphasizes similarities and differences among
specific objects. Therefore, it is usually necessary to identify which objects are depicted
by specific points within the graph. With a static display (i.e., one rendered on a perma-
nent display medium like a journal page), the only way to accomplish this is to include
point labels in the data region. These labels should be (1) large enough to be legible to
readers, (2) positioned so they do not collide with other data points, and (3) as short as
possible to avoid taking up space within the plotting region. Achieving these objectives
is often impossible with the default label settings in graphing software.
For example, Figure 20.14A shows a data map depicting a multidimensional scaling
solution for the American electorate’s perceptions of presidential candidates and other
political figures from the 2004 election. Here the labels are unnecessarily long (i.e.,
they include each candidate’s full first and last names), and each label is placed to the
left of its point. Notice that many of these labels overlap other data points and labels,
making it difficult to understand easily the relative positions of the various candidates.
Figure 20.14B shows the same data map with the point labels shortened and moved to
better locations. Here it is clear which labels are associated with which points, and the
positions of all the data points are now clearly visible.
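The sketch below illustrates this style of label placement with matplotlib's annotate(); the coordinates and short labels are invented for the example and are not the multidimensional scaling solution shown in Figure 20.14.

```python
# A hedged sketch of label placement in a small data map; the point coordinates
# and names are hypothetical.
import matplotlib.pyplot as plt

points = {
    "Kerry": (0.9, 0.1), "Edwards": (0.8, -0.2), "Dem. Pty.": (1.0, -0.35),
    "G. Bush": (-0.9, 0.1), "Cheney": (-0.8, -0.3), "Rep. Pty.": (-1.0, -0.25),
    "Nader": (0.1, 1.6),
}

fig, ax = plt.subplots()
for name, (x, y) in points.items():
    ax.scatter(x, y, facecolors="none", edgecolors="black")
    # Short labels, offset a few points from the symbol so they do not overplot it.
    ax.annotate(name, (x, y), xytext=(4, 4), textcoords="offset points", fontsize=9)

ax.set_xlabel("Dimension 1")
ax.set_ylabel("Dimension 2")
ax.margins(0.15)               # room so labels near the edge stay inside the axes
plt.show()
```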
Point labels generally should be left out of scatterplots, because space is usually
tight around the plotted points within the data rectangle of the graph. Therefore, the
labels will inevitably overlap each other and render their content illegible to readers.
Furthermore, they will also overwrite the data points, inhibiting visual detection of the
graphical information. Fortunately point labels are usually unnecessary in scatterplots,
since the objective is to convey the underlying structure of the bivariate data (i.e., the
shape and orientation of the point cloud) rather than the identities of the individual data
points.4

Figure 20.13  Two Versions of a Scatterplot between Two Highly Discrete Variables: Party
Identification and Liberal-Conservative Ideology. (A) Data points plotted at actual variable
values; (B) data points jittered to break up plotting locations. Source: 2004 CPS American
National Election Study.

Figure 20.14  Two Versions of a Data Map Obtained from a Multidimensional Scaling Analysis
of Candidate Perceptions in the 2004 American Electorate. (A) Long labels, each located to the
left of its point; (B) shorter labels and varying label positions. Source: 2004 CPS American
National Election Study.

Plotting Multiple Subsets of Data in a Single Display


In some situations the analyst may want to show several subsets of data separately within
a single display to illustrate variability across the different groups of observations. This
can be accomplished by using multiple plotting symbols to encode the values of the cate-
gorical variable that differentiates the subsets. However, it is important to select symbols
that can be distinguished easily in a relatively casual visual inspection of the display.
And of course a key must be included with the graph to explain which symbols are asso-
ciated with which categories.
Cleveland’s (1993a, 1994) work on visual detection of differing textures shows that the
following set of symbols is very effective for plotting several categories within a single
data set:

o + < s w

The preceding symbols can be discerned very easily, even if there is a great deal of
overplotting across the different categories.5 Note, too, that these symbols are most
effective if used in the order that they are listed here. If there are only two categories, the
open circle and plus sign should be used; with three categories, the “less than” symbol
should be the next one added, and so forth. Figure 20.15 is an example in which the first
four symbols are used to show regional variation in the political characteristics of the
American states.
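A brief sketch of this approach, using simulated state-level values and hypothetical region assignments, appears below; it cycles through the recommended symbols in the order given above and adds the required key via the legend.

```python
# A short sketch with simulated data; the region assignments and values are invented.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
regions = ["Midwest", "Northeast", "South", "West"]
symbols = ["o", "+", "<", "s"]          # the recommended symbols, in the recommended order

fig, ax = plt.subplots()
for region, symbol in zip(regions, symbols):
    x = rng.uniform(18, 42, 12)                          # hypothetical percent Republican
    y = 52 - 0.4 * x + rng.normal(0, 3, 12)              # hypothetical policy mood
    ax.plot(x, y, marker=symbol, linestyle="none",
            markerfacecolor="none", markeredgecolor="black", label=region)

ax.set_xlabel("Percent Republican identifiers (hypothetical)")
ax.set_ylabel("State policy mood (hypothetical)")
ax.legend(frameon=False, title="Region")                 # key linking symbols to categories
plt.show()
```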
Multiple-​line plots, used to show variability in trends across subgroups, involve a
slightly different consideration. Here the common practice of superimposing different
plotting symbols over the separate lines should be avoided. It is too easy for an ob-
server to make mistakes in associating symbols with the proper lines, especially when
the trends are relatively similar across the subgroups. For example, Figure 20.16A uses
five curves with circles, two types of triangles, diamonds, and x’s superimposed to show
trends over time in public opinion toward government spending in different policy
areas. Notice that it takes some effort to differentiate the symbols. The symbols asso-
ciated with a given line sometimes touch the other lines, thereby facilitating errors in
visual perception.
Instead, different styles of lines should be used for the respective categories of
the grouping variable. Great care must be taken to ensure that the lines have highly
contrasting styles in order to facilitate accurate visual decoding of the different trends.
Figure 20.16B displays a better version of the multiple-​line plot for temporal trends in
public opinion about federal spending. The specific line styles used for particular curves
have been chosen deliberately. For example, the two dashed line styles are adjacent to
each other, so it is relatively easy to see the different lengths of the dashes in each one.
The solid line style is used for the curve that intersects two other curves, precisely be-
cause there is less chance to mistake it for a different style. The line style that combines
dots and dashes is used for the lowermost curve, because it could be mistaken for
a dashed line; locating it as far as possible from the other dashed lines decreases the
chance that this will occur.

Figure 20.15  Using Different Plotting Symbols to Represent Subgroups within the Data: 2007
State Policy Mood versus Percent Republican Identifiers within State in 2007, by Region.
Source: Enns and Koch (2013b).
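The following sketch shows the line-style strategy on invented trend values (not the actual ANES spending series): each series gets a highly contrasting line style, and no symbols are superimposed on the curves.

```python
# A short sketch of contrasting line styles for several hypothetical trends.
import numpy as np
import matplotlib.pyplot as plt

years = np.arange(1972, 2000, 4)
rng = np.random.default_rng(4)
series = {                                   # hypothetical percentages over time
    "Education":      70 + rng.normal(0, 2, years.size),
    "Healthcare":     65 + rng.normal(0, 2, years.size),
    "Civil rights":   50 + rng.normal(0, 2, years.size),
    "Urban problems": 45 + rng.normal(0, 2, years.size),
    "Welfare":        25 + rng.normal(0, 2, years.size),
}
styles = ["-", "--", "-.", (0, (1, 1)), (0, (5, 5))]   # highly contrasting line styles

fig, ax = plt.subplots()
for (label, values), style in zip(series.items(), styles):
    ax.plot(years, values, linestyle=style, color="black", label=label)

ax.set_xlabel("Year")
ax.set_ylabel("Public support for policy spending (%)")
ax.legend(frameon=False)
plt.show()
```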

Aspect Ratio
The aspect ratio of a graph is defined as the physical height of the scale rectangle divided
by the width. Many researchers regard aspect ratio as a relatively unimportant detail.
They leave the selection of the aspect ratio for a graph up to the defaults in the software
used to create it or to the typesetters who prepare the final version of the graph for pub-
lication in a journal. However, aspect ratio can have a strong impact on the information
that is drawn from a graphical display; therefore, it is often useful to give it more detailed
consideration (Cleveland 1993b).
Figure 20.16A  Using Symbols to Differentiate Separate Data Sequences in a Line Plot: Public
Support for Policy Spending, 1972 through 1996. Source: American National Election Studies
Cumulative Data File (1948–2012).

Figure 20.16B  Using Line Styles to Differentiate Separate Data Sequences in a Line Plot: Public
Support for Policy Spending, 1972 through 1996. Source: American National Election Studies
Cumulative Data File (1948–2012).
Aspect ratio is particularly critical in data maps in which the distances (or sometimes
the angles) between the plotted points convey the relevant information. In this case, the
scale units must be directly comparable (and usually identical) in physical units across
the axes of the display. That is, if (say) one inch corresponds to ten units in the horizontal
direction, then one inch should also correspond to ten units in the vertical direction.
Otherwise, the distances between the points in the plotting region will be incorrect.
Figure 20.17 is an illustration of this problem, using an easily recognized geographic
map showing the relative positions of ten cities in the United States. In fact, this map was
produced by performing a multidimensional scaling analysis on the driving distances
between the cities. The first panel (Figure 20.17A) shows a graph with an aspect ratio
of 0.50; the height of the scale rectangle is one-half the width. But the scale units in the
vertical direction are also one-half the physical size of the same scale units in the hori-
zontal direction, so the positions of the cities are stretched out too far along the hori-
zontal, east-west orientation. The second panel (Figure 20.17B) also shows a graph with
an aspect ratio of 0.50, but the physical distances associated with the scale units are iden-
tical on the two axes. So the cities are located in their proper relative positions.

Figure 20.17  The Effect of Aspect Ratio on Relative Point Locations in a Map Showing Driving
Distances between Ten U.S. Cities. (A) Axis scales are incorrect for the aspect ratio; (B) axis
scales are adjusted properly for the aspect ratio.
In this simple example, we recognize the problem very easily because most of us are
familiar with the map of the United States. That typically will not be the case with a data
map, where the configuration of points is probably not known prior to the analysis. So it
is incumbent upon the analyst to make sure the scale units and the aspect ratio conform
properly to each other.
The specific aspect ratio is probably less critical in a scatterplot, in which the meas-
urement units often differ across the two axes of the scale rectangle. As a general rule
of thumb, we suggest using an aspect ratio of 1.0, rather than the smaller values (often
0.6 or 0.75) that seem to be the default in some software systems. This “compresses”
the plotted points together along the horizontal direction. That makes it a bit easier to
make visual comparisons of the conditional Y distributions across the X values. In other
words, this facilitates assessment of the relationship between the two variables included
in the scatterplot.
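The sketch below illustrates both situations on made-up data: for the map panel, set_aspect('equal') keeps one data unit the same physical length on both axes; for the scatterplot panel, set_box_aspect(1) produces a square scale rectangle (this call assumes matplotlib 3.3 or later).

```python
# A hedged sketch of the two aspect-ratio situations discussed above, on invented data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
fig, (ax_map, ax_scatter) = plt.subplots(1, 2, figsize=(9, 4))

# Data map: distances between points carry the information, so keep scale units equal.
mx, my = rng.normal(size=10), rng.normal(size=10)
ax_map.scatter(mx, my, facecolors="none", edgecolors="black")
ax_map.set_aspect("equal")            # one data unit spans the same physical length on x and y
ax_map.set_title("Map: equal scale units")

# Scatterplot: measurement units differ across axes, so fix the shape of the box instead.
sx = rng.uniform(0, 40, 50)
sy = 0.48 + 0.002 * sx + rng.normal(0, 0.01, 50)
ax_scatter.scatter(sx, sy, facecolors="none", edgecolors="black")
ax_scatter.set_box_aspect(1)          # square scale rectangle (aspect ratio of 1.0)
ax_scatter.set_title("Scatterplot: box aspect of 1")

plt.tight_layout()
plt.show()
```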
Figure 20.18 presents an example showing the effect of aspect ratio in a scatterplot.
The first panel (Figure 20.18A) uses a relatively small aspect ratio (0.5), producing a
graph that is wider than it is tall. Here we can see that there is a positive relationship be-
tween state public opinion and state policy, since the conditional Y distribution shifts
upward (i.e., the plotted points tend to fall at higher locations) as we move from left to
right within the plotting region. The second panel (Figure 20.18B) shows the same scat-
terplot, but the aspect ratio has been increased to 1.0. Here it is easier to see a feature that
was not readily apparent in Figure 20.18A: the relationship between the two variables
is nonlinear. In the left-hand side of the scatterplot (say, below X values of about 20),
larger values of one variable tend to be associated with larger values of the other. But on
the right-hand side of the plotting region, we can see that differences in X values do not
correspond to systematic differences in the central tendencies of the Y values; from an X
value of 20 through the maximum X, the average Y value hovers between 0.52 and 0.53.
In terms of Cleveland's visual perception theory, the larger aspect ratio facilitates visual
assembly of the systematic structure underlying these bivariate data.

Figure 20.18A  Scatterplot with a Very Small Aspect Ratio (0.5): State Policy Priorities versus
State Electorate Ideology in 1986. Sources: State public opinion data are obtained from Gerald
Wright's website, http://mypage.iu.edu/wright1/. Policy priorities and state employee data are
from Jacoby and Schneider (2009).

Figure 20.18B  Scatterplot with an Aspect Ratio of 1.0: State Policy Priorities versus State
Electorate Ideology in 1986. Sources: State public opinion data are obtained from Gerald
Wright's website, http://mypage.iu.edu/wright1/. Policy priorities and state employee data are
from Jacoby and Schneider (2009).

Conclusions

Let us conclude with some general considerations to keep in mind while developing
graphical displays for inclusion in research manuscripts that will be submitted for publica-
tion. First, it is important to think carefully about the information that a graph is intended
to convey and to choose the type of display that is most effective for that purpose. This
involves not only determining the general class of graph (e.g., dot plot versus bar chart for
labeled data values), but also the tiniest details within the display that is eventually selected
for use. Elements like the orientation of textual labels, line styles, and plotting symbols can
make a huge difference in the ability to communicate information in an accurate manner.
To some authors, these considerations may seem like trivial minutiae. But as a leading sta-
tistical graphics scholar emphasizes, “The devil is in the details” (Wilkinson 2005, xi).
Second, make sure that the types of displays used in a manuscript are likely to be fa-
miliar to its intended audience. The statistical graphics literature is replete with graph-
ical displays that overcome some of the limitations in well-​known types of graphs. For
example, the bin definitions that can be so problematic with histograms simply do not
occur in univariate quantile plots. But the latter are virtually unknown within the polit-
ical science community, and even worse, could be mistaken by a casual reader for bivar-
iate scatterplots. Similarly, many specialized displays show particular kinds of data. For
example, an “R-​F Spreadplot” (Cleveland 1993b) shows the quantiles of the fitted values
and residuals from a statistical model. While this is extremely important information,
the nature of the display itself would almost certainly have to be explained in great detail
within a manuscript intended for a political science constituency. Doing so would prob-
ably distract readers and dilute the substantive arguments that the author is making.
Thus, it is probably best to stick with well-​known types of graphs, but to make sure that
they are constructed and presented in ways that avoid potential pitfalls.
Third, recognize that creating graphs is an inherently iterative process. Modern software
makes it very easy to modify just about any element in a graphical display. Seemingly minor
changes to the details of a graph can often produce major improvements in the degree to
which observers can extract accurate and useful information from the display. So the first
graph of a data set should never be the only graph of that data set! An analyst certainly should
never settle for the default choices made in the software used to produce the graph.
Finally, it is incumbent upon the author to make sure that a graphical display really
does contribute something to the argument that he or she is making. Journal editors
are typically under great pressure to encourage short manuscripts, due to publisher-​
imposed page budgets. Therefore, a graph that simply provides a pictorial representation
of numerical information that is already presented in tabular form really adds nothing
to a paper. Instead, the author should use a graph only when it reveals something that
cannot readily be discerned otherwise. When this is the case, graphical displays truly are
unparalleled in their ability to communicate quantitative and potentially complex infor-
mation in ways that can be interpreted easily by readers.

Notes
1. In fact, the bubble plots in Figure 20.5 exemplify the types of compromises that often have
to be made in graphical displays of data. Within each panel the relevant political character-
istic (i.e., policy priority, interest group strength, or government size) is encoded in the size
of the plotting symbol. But visual judgments about areas of geometric shapes are relatively
inaccurate (Cleveland 1993a). They are also biased in that people tend to underestimate
the sizes of large shapes relative to small shapes (e.g., Lodge 1981). In order to correct for
this bias, the values of the respective political characteristics are made proportional to the
diameters of the circles. Since the area of a circle is proportional to the square of its diameter,
the resultant power relationship between the size of the plotting symbol and the value of the
variable will help to compensate for the bias in visual perception.
2. So-​called univariate graphs can, in fact, contain information about more than one variable.
For example, a bar chart might be used to display summary statistics for a dependent vari-
able across the values of a discrete independent variable. In this case the bar chart actually
depicts bivariate data. As another example, a dot plot could be used to plot the sizes of the
coefficients associated with particular independent variables in a regression model. In that
case, information pertaining to several variables would be shown together in a single dis-
play. While examples like these might raise questions about the utility or accuracy of the
“univariate” label, they pose no particular challenges to the principles underlying the effec-
tiveness of these kinds of displays. Hence we need not worry about them any further in this
chapter.
3. Multipanel graphical displays sometimes include reference lines within their panels to
facilitate visual comparisons across panels. Some graphical displays include baselines
for judging variations in magnitude and direction within the data (e.g., a horizontal
dashed line within a residual-​versus-​predicted plot after a regression analysis). In both of
these situations the reference lines and baselines serve a well-​defined purpose: They en-
hance visual perception and decoding of the information contained within the display.
Conceptually, they are different from grid lines included in a single-​panel display to merely
mark off regular intervals in a variable’s range of values.
4. One important exception to this rule involves outliers, or observations with unusual var-
iable values, relative to the rest of the data. When outliers exist, it usually is important to
determine where they occur in the data set. This is facilitated by labeling the relevant points
with some identifier. But by their very nature, outlying observations occur at positions
within the plotting region that are separated from the main data point cloud. Hence labels
for these observations generally do not cause serious problems for extracting information
from the graphical display.
5. Different colored plotting symbols are also a very effective way to show several subgroups in
a graphical display. Again, however, most professional journals discourage the use of color
in articles.

References
Atkinson, A. C. 1985. Plots, Transformations, and Regression:  An Introduction to Graphical
Methods of Diagnostic Regression Analysis. Oxford: Oxford University Press.
Bertin, J. 1983. Semiology of Graphics. English translation by William Berg and Howard Wainer.
Madison: University of Wisconsin Press.
Chambers, J. M., W. S. Cleveland, B. Kleiner, and P. W. Tukey. 1983. Graphical Methods for Data
Analysis. Pacific Grove, CA: Wadsworth and Brooks/​Cole.
Cleveland, W. S. 1984a. “Graphical Methods for Data Presentation:  Full Scale Breaks, Dot
Charts, and Multibased Logging.” American Statistician 38: 270–​280.
Cleveland, W. S. 1984b. “Graphs in Scientific Publications.” American Statistician 38: 270–​280.
Cleveland, W. S. 1993a. “A Model for Studying Display Methods of Statistical Graphics (with
Discussion).” Journal of Computational and Graphical Statistics 3: 323–​364.
Cleveland, W. S. 1993b. Visualizing Data. Summit, NJ: Hobart Press.
Cleveland, W. S. 1994. The Elements of Graphing Data. Rev. ed. Summit, NJ: Hobart Press.
Cook, D., and D. F. Swayne. 2007. Interactive and Dynamic Graphics for Data Analysis with R
and Ggobi. New York: Springer.
Enns, P. K., and J. Koch. 2013a. “Public Opinion in the U.S. States: 1956 to 2010.” State Politics
and Policy Quarterly 13 (3): 349–​372.
Enns, P. K., and J. Koch. 2013b. “Replication Data for: Public Opinion in the U.S. States: 1956 to
2010.” Harvard Dataverse, V1. http://​hdl.handle.net/​1902.1/​21655.
Gray, V., and D. Lowery. 1996. The Population Ecology of Interest Representation. Ann
Arbor: University of Michigan Press.
Greenacre, M. 2007. Correspondence Analysis in Practice. 2nd ed. Boca Raton, FL: Chapman
and Hall/​CRC.
Jacoby, W. G. 1997. Statistical Graphics for Univariate and Bivariate Data. Thousand Oaks,
CA: Sage.
Jacoby, W. G. 2006. “The Dot Plot:  A Graphical Display for Labeled Quantitative Values.”
Political Methodologist 14 (1): 6–​14.
Jacoby, W. G., and S. K. Schneider. 2001. “Variability in State Policy Priorities: An Empirical
Analysis.” Journal of Politics 63: 544–​568.
Jacoby, W. G., and S. K. Schneider. 2009. “A New Measure of Policy Spending Priorities in the
American States.” Political Analysis 17: 1–​24.
Kastellec, J. P., and E. L. Leoni. 2007. “Using Graphs Instead of Tables in Political Science.”
Perspectives on Politics 5: 755–​771.
Kosslyn, S. M. 1994. Elements of Graph Design. New York: Freeman.
Kuhfeld, W. F. 1986. “Metric and Nonmetric Plotting Models.” Psychometrika 51: 155–​161.
Lodge, M. 1981. Magnitude Scaling:  Quantitative Measurement of Opinions. Beverly Hills,
CA: Sage.
Mitchell, M. N. 2008. A Visual Guide to Stata Graphics. 2nd ed. College Station, TX: Stata Press.
Murrell, P. 2006. R Graphics. Boca Raton, FL: Chapman and Hall/​CRC.
Narula, S. C. 1979. “Orthogonal Polynomial Regression.” International Statistical Review
47: 31–​36.
Noma, E. 1987. “A Heuristic Method for Label Placement in Scatterplots.” Psychometrika
52: 463–​468.
Sarkar, D. 2008. Lattice: Multivariate Data Visualization with R. New York: Springer.
Scott, D. W. 1992. Multivariate Density Estimation:  Theory, Practice, and Visualization.
New York: Wiley.
Spence, I., and J. Lewandowsky. 1990. “Graphical Perception.” In Modern Methods of Data
Analysis, edited by John Fox and J. Scott Long, 13–​57. Newbury Park, CA: Sage.
Stimson, J. 1999. Public Opinion in America: Moods, Cycles, and Swings. New York:
Westview Press.
Tufte, E. R. 1997. Visual Explanations: Images and Quantities, Evidence and Narrative. Cheshire,
CT: Graphics Press.
Tufte, E. R. 2001. The Visual Display of Quantitative Information. 2nd ed. Cheshire,
CT: Graphics Press.
Tufte, E. R. 2006. Beautiful Evidence. Cheshire, CT: Graphics Press.
Tukey, J. W. 1977. Exploratory Data Analysis. Reading, MA: Addison-​Wesley.
Unwin, A. 2008. "Good Graphics?" In Handbook of Data Visualization, edited by C.-h. Chen,
W. K. Härdle, and A. Unwin, 57–78. Berlin: Springer-Verlag.
Unwin, A., M. Theus, and H. Hofmann. 2006. Graphics of Large Datasets: Visualizing a Million.
New York: Springer.
Wainer, H. 2000. Visual Revelations:  Graphical Tales of Fate and Deception from Napoleon
Bonaparte to Ross Perot. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum.
Wainer, H. 2005. Graphic Discovery:  A Trout in the Milk and Other Visual Adventures.
Princeton, NJ: Princeton University Press.
Wainer, H. 2009. Picturing the Uncertain World:  How to Understand, Communicate, and
Control Uncertainty through Graphical Display. Princeton, NJ: Princeton University Press.
Wickham, H. 2009. ggplot2: Elegant Graphics for Data Analysis. New York: Springer.
Wilkinson, L. 2005. The Grammar of Graphics. 2nd ed. New York: Springer.
Young, F. W., M. Friendly, and P. M. Valero-Mora. 2006. Visual Statistics: Seeing Data with
Dynamic Interactive Graphics. Hoboken, NJ: Wiley-​Interscience.
Part IV

NEW FRONTIERS
Chapter 21

Survey Experiments


Managing the Methodological Costs and Benefits

Yanna Krupnikov and Blake Findley

Introduction

Over the last two decades there has been an increase in the use of experimental re-
search in political science (Druckman et al. 2011). Under the broad umbrella term
“experimental research” are a variety of methodological approaches, and one that
has emerged as increasingly important is the survey experiment (Barabas and Jerit
2010; Druckman et al. 2006). Often defined as experimental intervention within an
opinion survey (Druckman et al. 2011, 17), survey experiments offer scholars the op-
portunity to have the “best of both worlds.”1 On the one hand the experimental com-
ponent allows scholars to randomly assign participants to treatments, which helps
the investigation of causal relationships. On the other hand, the survey component
allows scholars to incorporate these experimental interventions into national, repre-
sentative surveys (Lavine 2002; Mutz 2011). As a result, survey experiments carry the
possibility of retaining the control of an experiment without giving up the generaliz-
ability of a survey.
In their earliest form, survey experiments were implemented as split-​ballot studies
in which participants were assigned to multiple versions of printed questionnaires,
identical in all but one way. In an early example of a split-​ballot survey, outlined in
Gilens (2002), Elmo Roper assigned participants to answer one of the two following
questions: (1) “Should the U.S. do more than it is now to help England and France?”
or (2) “Should the U.S. do more than it is now to help England and France in their fight
against Hitler?” Roper’s results, reported in Cantril and Wilks (1940), showed that
the change in question wording had an effect on participants’ opinions, with 13% of
participants replying “yes” to the former and 22% replying “yes” to the latter question
(Cantril and Wilks 1940; Gilens 2002). This early study hinted at the power of the survey
experiment: seemingly minor changes in question wording produced substantial shifts
in public opinion. Building on this foundation, the modern survey experiment turned to
analyzing the very nature of public opinion and preference formation (Lavine 2002).
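As a back-of-the-envelope illustration of why such a wording difference is striking, the sketch below compares two proportions with a hand-computed z statistic; the group sizes are hypothetical, since the text above does not report Roper's actual sample sizes.

```python
# A minimal illustration with hypothetical group sizes: a two-sample test of
# proportions for the 13% versus 22% "yes" responses, computed by hand.
from math import sqrt

n1, n2 = 500, 500          # hypothetical respondents per ballot version
p1, p2 = 0.13, 0.22        # share answering "yes" to each wording

pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p2 - p1) / se
print(f"difference = {p2 - p1:.2f}, z = {z:.2f}")   # roughly z = 3.7 with these hypothetical n's
```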
Survey experiments became more central to public opinion research with the devel-
opment of computer assisted telephone interviewing (CATI) (Sniderman 2011). Rather
than relying on preprinted split ballots, CATI offered survey researchers a tremendous
amount of flexibility. For example, CATI incorporated question sequencing, the ability
to adjust which questions survey participants receive based on their answers to prior
questions. In a pivotal moment for survey experiments, Paul Sniderman’s research used
CATI to add a randomizer to surveys, which allowed survey practitioners to randomly
assign participants to different conditions (Sniderman 2011; Piazza, Sniderman, and
Tetlock 1989).2 While the randomizer followed the basic logic of the split-​ballot form
described above, the procedure was now far more effortless. Moreover, randomization
by computer was more likely to avoid human error.
Over time survey experiments have become even more accessible. In recent years,
projects like Sniderman’s Multi-​Investigator Study, which served as a foundation for
Time-​sharing Experiments for the Social Sciences (TESS), have been developed spe-
cifically to fund the use of survey experiments. Created in 2001 by Arthur Lupia and
Diana Mutz, TESS is a cross-​disciplinary program that allows scholars to submit survey
experiment proposals, and proposals that are accepted are fielded on probability-​based
samples. In the six years after its inception (between 2001 and 2007) TESS allocated
millions of dollars to over two hundred projects and more than one hundred researchers
(Nock and Guterbock 2010; Mutz 2011).
This increased reliance on survey experiments has resulted in tremendous advances
in the study of public opinion and political behavior (Barabas and Jerit 2010). Scholars
have used survey experiments, for example, to analyze the effects of priming and
framing (see Chong and Druckman 2007 for an overview). Focusing on the under-
lying determinants of public opinion formation, scholars have considered how a variety
of attitudes—​for example, attitudes toward political parties or attitudes toward out-​
groups—​affect the way people arrive at their eventual opinions (Bullock 2011; Brader,
Valentino, and Suhay 2008). Beyond public opinion research, survey experiments
have become an increasingly common approach in studies of individual political be-
havior (Keeter et al. 2002; Brooks and Geer 2007) and responses to political commu-
nication (Searles 2010). Moreover, survey experiments have also become pivotal in the
study of measurement and general experimental methodology (Berinsky, Huber, and
Lenz 2012).
The increasing popularity of survey experiments is in part due to their myriad
benefits. As Gaines, Kuklinski, and Quirk (2007) note, “the survey experiment is easy to
implement and avoids many problems associated with cross-​sectional and panel survey
data. It clearly distinguishes cause and effect. When used with representative samples,
therefore, survey experiments can provide firmly grounded inferences about real-​world
political attitudes and behavior” (2). Yet like any other method, a survey experiment
is not without its costs. Indeed, as Gaines, Kuklinski, and Quirk (2007) note, survey
experiments are not a “panacea”; an experimental design is not automatically improved
by virtue of placement in a survey context.3 Just the opposite; the survey experiment
may actually introduce confounds that other experimental contexts control (Gaines,
Kuklinski, and Quirk 2007).
In this chapter we consider the push and pull of the benefits and costs of survey
experiments by focusing on two key components: the participants and the measures.
First, we consider how survey experiments fit into broader arguments about exper-
imental design and validity. Next, we use our discussion of samples and measures to
examine the intersection between scholars’ goals and methodological constraints. We
conclude by considering the extent to which survey experiments can deliver on the
promise of a controlled study within a generalizable setting.

Differentiating Survey Experiments

A key component of experiments as a methodological approach, as applied across
a variety of social science disciplines, is the random assignment of participants to
interventions (Mutz 2011).4 At its most basic level, this random assignment is either
to a treatment group, the group that receives some type of experimental stimulus or
manipulation, or to a control group, the group that does not receive any type of ex-
perimental treatment (Gilens 2002; Nock and Guterbock 2010). The treatment can
take on a variety of forms, ranging from small changes in question wording and struc-
ture (Schwarz et al. 1991) to more substantial changes that may even alter the mode
of experimental administration (Clifford and Jerit 2014). The experimental goal is to
compare groups that are identical in all ways except the random assignment to a par-
ticular treatment. In doing so, scholars aim to isolate the causal relationship between
the intervention and some particular outcome of interest (Barabas and Jerit 2010;
Druckman et al. 2011).
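A bare-bones simulation of this logic, with invented respondents and an assumed treatment effect, is sketched below: respondents are randomly assigned to treatment or control, and the difference in mean outcomes recovers the effect because assignment is independent of everything else about the respondents.

```python
# A bare-bones sketch of random assignment within a simulated survey.
import numpy as np

rng = np.random.default_rng(6)
n = 1000                                        # hypothetical survey sample size
treated = rng.integers(0, 2, n)                 # 1 = treatment group, 0 = control group

baseline = rng.normal(50, 10, n)                # unobserved baseline opinion (0-100 scale)
true_effect = 5.0                               # hypothetical effect of the treatment
outcome = baseline + true_effect * treated + rng.normal(0, 5, n)

estimate = outcome[treated == 1].mean() - outcome[treated == 0].mean()
print(f"estimated treatment effect: {estimate:.2f}")  # close to 5 because assignment is random
```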
The random assignment, and subsequent exposure to the experimental intervention,
can happen in a variety of contexts (McDermott 2002). Typically, scholars distinguish
among laboratory experiments, field experiments, and survey experiments (Druckman
et  al. 2011).5 Laboratory experiments are conducted in controlled environments, in
which nearly every part of a participant’s experience is (to the extent possible) created
by the researcher. In these types of experiments, a researcher can control factors such as
the particular types of participants who are together in a room during a given experi-
mental round (e.g., Klar 2014), what each individual participant knows about the other
participants (e.g., Ahn, Huckfeldt, and Ryan 2014), the very furniture that surrounds the
participants as they take the study (e.g., Iyengar and Kinder 1987), and even the route a
participant takes to the exit upon completing a study (e.g. Levine 2015).
Laboratory experiments offer scholars the most control, but they also often create
environments that are in many ways artificial (Jerit, Barabas, and Clifford 2013; Morton
and Williams 2010). Because participating in a study in a laboratory takes a person out of
his or her day-​to-​day life, a participant may pay more attention to information provided
by the researcher (Jerit, Barabas, and Clifford 2013) and/​or think more thoroughly when
reporting responses to questions and answer in ways specifically designed to please a
researcher (Iyengar 2011). The possibility that the laboratory leads to behavioral changes
may limit the generalizability of laboratory studies; an experimental finding may repre-
sent how a person behaves in a carefully controlled setting, but may not be indicative of
behavior outside the laboratory in the “real world.”
Field experiments are studies conducted in what Druckman et al. (2011) term a “nat-
urally occurring setting” (17). In this type of experimental approach participants are
still randomly assigned to experimental groups, but they are often “unaware that they
are participating in a study” (Gerber 2011, 116). The goal of field experiments is to re-
tain the benefits of random assignment and overcome the artificiality of the lab setting
by presenting participants with treatments in their “real-​world” contexts and without
taking them out of their daily routines (Teele 2014). People, for example, may be ran-
domly assigned to receive different types of “Get Out the Vote” messages (Gerber and
Green 2000), different direct-​mail solicitation donations (Levine 2015), or different
text messages reminding them to vote (Dale and Strauss 2009). Because people are un-
aware that they are receiving these types of messages as part of an experimental study,
they have little incentive to read the messages more carefully or behave in ways that
please the researcher. Moreover, the outcome measures in these studies are often behav-
ioral: rather than measuring responses to treatments with questions and assigned tasks,
in field experiments scholars often track patterns of outcomes (e.g., turnout, donations)
that would correspond to exposure to certain treatments in the field.6
While field experiments help scholars overcome the limitations of the laboratory
setting, field studies are not without their own limits. In certain cases, scholars may be in-
terested not only in the causal connection between a particular experimental treatment
and a behavioral outcome, but also in the mechanisms underlying that connection.
Specifically, the research question may not only ask whether a treatment causes an out-
come, but why that causal connection exists. In these cases it is not enough to observe
that a treatment caused a particular outcome; the goal is to investigate whether the
treatment affected the outcome by serving as the first push in a hypothesized chain of
events.
Levine (2015), for example, argues that people are less responsive to donation
solicitations that mention economic hardships because these solicitations make them
feel poor, and feeling poor leaves a person hesitant to make financial donations. In one
of his studies Levine (2015) uses a field experiment to demonstrate that direct-​mail
solicitations that bring up economic issues lead to fewer donations. Since in his field
experiment solicitations are randomly assigned, Levine (2015) shows that solicitations
that mention economic hardships caused donation rates to decline. There are a number
of reasons, however, why this causal connection may exist. Solicitations that mention
hardships may lead to a lower likelihood of donation because, as Levine hypothesizes,
they may make people feel poor, but they may also lower likelihood of donation because
they put people in an unhappy mood, or because they lead people to question their trust
in the government, or because mentioning economic hardships can make people feel
anxious. If Levine (2015) was only interested in the causal effect of solicitations on do-
nation behavior, distinguishing between these possible causal mechanisms would be ir-
relevant to his research goals, and the field experiment would be a sufficient test. Since
Levine (2015) is interested in a particular theoretical chain of events, he turns to a survey
experiment to identify why solicitations that mention hardships make people less likely
to donate money.7
Survey experiments offer scholars the promise of integrating the control and focus
on mechanisms that is often present in laboratory experiments while retaining some
generalizability (Barabas and Jerit 2010). Broadly defined, survey experiments involve a
random assignment to groups within an opinion survey (Druckman et al. 2011; Morton
and Williams 2010), or the “deliberate manipulation” of various components and parts
of a survey (Gaines, Kuklinski, and Quirk 2007, 3). Given this definition, the mode of
the survey is irrelevant; the survey may take place over the telephone, in person, or over
the Internet (Druckman et al. 2011; Morton and Williams 2010). A typical survey exper-
iment may proceed in ways that are similar to a laboratory experiment: a participant
is randomly assigned to an experimental group (e.g., treatment or control) and subse-
quently answers a series of questions designed to measure his or her response to a par-
ticular treatment (Morton and Williams 2010). Indeed, as Morton and Williams (2010)
note, there are some survey experiments that could take place within a laboratory and
some laboratory experiments that could reasonably be fielded as survey experiments.
Differentiating a survey experiment from a pure laboratory experiment, then, is
the context in which the random assignment and measurement occurs. As Gaines,
Kuklinski, and Quirk (2007) explain, a survey experiment is the “deliberate manip-
ulation of the form or placement of items in a survey instrument, for the purposes of
inferring how public opinion works in the real world” (4, emphasis added). Relative to
a laboratory study, two factors bring the survey experiment closer to the “real world.”
First, since surveys ask people to answer questions within their natural environments
(i.e., people are unlikely to be asked to go to a laboratory to participate in a survey),
the artificiality of the context is somewhat diminished relative to a pure lab setting.8
Second, while laboratory experiments are limited to participants who live or work near
the laboratory location, survey experiments offer scholars the ability to conduct studies
on broader samples, including samples that are representative of the population being
studied.
In sum, the benefit of the survey experiment approach is that it can retain large
components of the internal validity of a laboratory experiment. Scholars have enough
control to ensure that all participants are exposed to the experimental treatment. This
is something that is often difficult to do in a field experiment, where participants en-
counter the treatment as part of their day-​to-​day lives and can at times avoid or ig-
nore the treatment (Jerit, Barabas, and Clifford 2013).9 Moreover, scholars can also
measure outcomes and mechanisms immediately post-​treatment with items deliber-
ately designed to evaluate the effects of a particular experimental stimulus. On the other
hand, the survey experiment offers more external validity: by taking the experiment out-
side the laboratory, scholars can argue that the obtained results have a higher likelihood
of generalizing beyond the particular participants in a given study.
The possibility of retaining high levels of internal validity while increasing the
external validity of experimental studies has made survey experiments an increas-
ingly important methodological approach (Barabas and Jerit 2010; Lavine 2002;
Mutz 2011). Nonetheless, it would be shortsighted to assume that simply plucking
an experiment from a laboratory environment and embedding it wholesale within
a national survey will immediately allow one to capture the full benefits of the
survey experiment approach. Rather, the benefits of survey experiments depend
on the way experimental components fit within a survey setting. Broadening the
base of participants, while potentially useful, may not always increase the gener-
alizability of a study. Similarly, while certain experimental measures are valid in a
laboratory, the same measures can produce confounds in a survey setting. In short,
like most methodological approaches, survey experiments have both costs and
benefits.
In the next several sections we examine both the costs and benefits of survey
experiments by considering the limitations of the survey experiment approach. We
begin by discussing participant recruitment for survey experiments. Here we discuss
how the representativeness of the sample affects generalizability and examine how the
rise of “national panels” can affect survey experiments. Next, we examine measurement
strategies in survey experiments. In this section we focus on the potential limitation of
survey experiments for examining participatory outcomes. We focus on participants
and measures because we see these two components as pivotal to arguments about
the general usefulness of survey experiments. The extent to which scholars can make
broader inferences when relying on survey experiments—​as compared to laboratory
experiments—​depends on who participates in these studies and the types of tasks these
participants are asked to do.

Survey Experiment
and Participant Limitations

A key benefit of survey experiments is that they provide the ability to reach people
from a more diverse geographic area. Since there is no laboratory that a participant
must visit, a scholar can recruit participants who represent broader populations and
subpopulations. Moreover, the ability to participate from one’s own home and on one’s
own time makes participating in survey experiments less costly than participating in
laboratory experiments. In turn, recruitment for survey experiments may yield higher
rates of participation: people will be more likely to participate in survey experiments be-
cause it is an easier process (Mutz 2011).10
The ability to recruit more participants and more diverse participants is, of course,
beneficial. If nothing else, higher numbers of participants can increase the experimental
power of the study to observe differences by experimental treatment (Maxwell and
Delaney 2004). The diversity of the sample, however, is a different proposition. Key to
realizing the promise of generalizability of survey experiments is considering how we
understand the idea of “sample diversity.” In particular, the question lies in whether we
consider the diversity of a sample in relative or absolute terms. Has a survey experi-
ment provided us with greater generalizability when we recruit a sample that is more di-
verse relative to one that we could have recruited for a laboratory study? Or can a survey
experiment only deliver on the promise of greater generalizability when we recruit a
sample that is diverse in ways that are representative of the population to which we are
trying to generalize?
The tension between relative and absolute views of sample diversity leads to a
larger question. Recruiting a representative sample is too costly for many scholars.
Consequently, if we believe that the benefits of survey experiments depend on the absolute
representativeness of the sample, how willing are we to modify the sampling and re-
cruitment process to make recruiting a representative sample more accessible? Below,
we take on each of these possible limitations to the inferences we can draw from survey
experiments.

Absolute Versus Relative Sample Diversity


Certainly fielding a survey experiment on a representative sample of a desired popula-
tion is of tremendous benefit. As Mutz (2011) writes, “critics over the years have often
questioned the extent to which the usual subjects in social science experiments re-
semble broader, more diverse populations . . . population-​based survey experiments
offer a powerful means for researchers to respond to such critiques” (Mutz 2011, 24).
The key to Mutz’s argument, however, is the idea that scholars are able to recruit a repre-
sentative sample of some population. As Mutz notes, not all experiments have the goal
of generalizing toward some population. Those that do aim to generalize their results—​
what Mutz calls “population-​based survey experiments”—​can benefit from a sample
that is representative of the “target population of interest” (2011, 3). In Mutz’s approach
the representativeness of a sample is defined as the “use of sampling methods to produce
a collection of experimental subjects that is representative of the target population of in-
terest of a particular theory” (2011, 2).
While Mutz notes that larger sample sizes are always more beneficial (particularly if
a scholar is interested in moderating effects), key to her approach is the extent to which
the sample is representative of “groups to which we expect the theories to generalize”
(2011, 145). This approach to survey experiments relies on an absolute view of diversity.
In this view, sample diversity is not important because a scholar has managed to recruit
a large convenience sample that is more diverse in comparison to a smaller sample of un-
dergraduate students, but sample diversity is important so long as it reflects the pivotal
population in a scholar's research question. Only once a sample is representative of a
target population can the results of a survey experiment be generalizable.11
Other scholars, however, have explained the diversity of the sample as a more
relative idea. These types of explanations begin with the baseline that a laboratory
study conducted on undergraduate students offers the lowest sample diversity.
From this standpoint, any sample that offers more diversity relative to this base-
line underscores the benefit of going outside the laboratory and relying on a survey
experiment approach. This relative approach has been particularly apparent in re-
search that evaluates the use of Amazon’s Mechanical Turk (MTurk) as a means of
recruiting survey experiment participants.12 In a foundational paper on the costs and
benefits of MTurk samples in survey experiments, for example, Berinsky, Huber, and
Lenz (2012) compare MTurk to a variety of other samples. Key to their argument for
MTurk usefulness is the idea that “demographic characteristics of domestic MTurk
users are more representative and diverse than the corresponding student and con-
venience samples used in experimental political science” (352). Similarly, Paolacci
and Chandler (2014) demonstrate that MTurk is comparable or more diverse than
other sample types in the social sciences. In sum, while scholars note that MTurk
is “by no means representative of the broader population,” it still offers a sample
that is more diverse than what could be obtained otherwise (Arceneaux 2012, 274).
Moreover, recent research suggests that MTurk samples can be used to replicate a va-
riety of findings obtained with survey experiments fielded on representative samples
(Mullinix et al. 2015).
Are the external validity and generalizability benefits of survey experiments realized
when we rely on samples that are not representative, but relatively better than samples
that could have been used in a laboratory setting? The answer to this question depends
on why one believes that laboratory samples limit generalizability. If laboratory studies
lack generalizability because they create a setting that is inherently artificial by placing
individuals in a laboratory, or if we believe that laboratory studies lack generaliza-
bility because they rely on undergraduate students (Sears 1986; Kam, Wilking, and
Zechmeister 2007), then the relative standard is a useful one to apply when considering
the benefits of relying on survey experiments. Applying this standard means that even
a convenience sample (such as MTurk) can offer more generalizability than a labora-
tory study with students. If, however, we argue that laboratory studies lack generaliza-
bility because our results can only generalize when appropriate sampling methods are
used, then only Mutz’s (2011) absolute standard can realize the full benefit of survey
experiments. If we retain this standard, then attempts to demonstrate that survey
experiments with convenience samples like MTurk produce results that are relatively
similar to results obtained with representative samples (Mullinix et al. 2015) are unlikely
to be persuasive.
Ultimately, unifying the relative and absolute definitions is the assumption that
increasing the representativeness of the experimental sample is generally beneficial.13
If survey experiments give scholars the opportunity to conduct their studies on broader
and more diverse populations, it is beneficial for scholars to take these opportunities.
Moreover, if publication patterns are suggestive, then scholar preferences seem to
lean toward more representative samples in survey experiments (Kam, Wilking, and
Zechmeister 2007). In the next section we turn toward the constraints on recruiting
these more representative samples.

Sample Recruitment: National Participant Panels


Recruiting a group of people to participate in a laboratory study can be a difficult and
time-​consuming process. Although survey experiments—​especially those conducted
over the Internet—​can initially seem much simpler, the costs of the recruitment may
actually be significantly higher in survey experiments. While a laboratory experiment
attempts to “coax” people who live or attend class nearby to go to a laboratory and take
a study (Mutz 2011), recruiting a national sample to participate in a survey experiment
requires a clear identification of a population, a sampling procedure, contact infor-
mation, and time to carry out the actual study. If one is interested in a representative
random sample, this process becomes even more complex and costly (see, for example,
the American National Election Study’s [ANES] sampling process). Moreover, the idea
that a scholar would begin sampling a population “from scratch” every single time the
scholar ran a fifteen-​to twenty-​minute survey experiment suggests an almost insur-
mountable difficulty to the process.
Given these costs, scholars fielding survey experiments have increasingly turned to
survey companies that maintain national panels of participants. These companies sim-
plify the recruitment process. Panels are typically comprised of hundreds of thousands
of people who at some point reported having some interest in participating in surveys.
When a scholar wants to field a survey experiment, that scholar can contract with a
company that maintains this type of panel. The company will then randomly sample the
panel to produce a sample for the scholar and invite the selected panel members to take
a study. Once the study is complete, a panel member receives some sort of payment for
his or her efforts.
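The vendor's sampling step itself is computationally simple. As a minimal sketch (in Python, with a hypothetical panel frame, field names, and invitation count that are ours, not any particular company's), the draw from the panel described above might look like the following.

import random

random.seed(42)
# Hypothetical panel frame; in practice this is the vendor's member database.
panel_frame = [{"panel_id": i, "contact": f"member{i}@example.com"} for i in range(100_000)]

def draw_invitations(frame, n_invites):
    """Simple random sample of panel members, without replacement, to invite to a study."""
    return random.sample(frame, n_invites)

invited = draw_invitations(panel_frame, 2_500)
print(len(invited), "panel members invited")
# The vendor then tracks who completes the study and credits each completer
# with the promised incentive (cash, points, and so forth).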
Of course companies vary in how they create panels, how they collect data, and
the incentives they offer. Some companies rely on Web advertising to recruit panel
members; in these cases, participation in the panel is, at least at first, opt-​in and the re-
sult is a nonprobability sample.14 Other companies rely on random sampling to recruit
panel members, offering difficult-to-reach populations incentives to remain in the panel
(Sargis, Skitka, and McKeever 2013).15 Differences in panel construction aside, national
panels offer scholars more accessible means of recruiting national samples for survey
experiments.
The presence of survey companies that offer individual scholars national samples
of participants has been highly beneficial for the growth of survey experiments in the
social sciences (Sargis et al. 2013).16 It is difficult to imagine that survey experiments
would exist with any frequency without the presence of companies maintaining na-
tional panels. Do these panels come at a cost? One possibility is that these online panels,
in some sense, replicate the traditional undergraduate subject pool (Morton and
Williams 2010).
Subject pools are typically used in laboratory settings (Kam, Wilking, and
Zechmeister 2007). They are useful because they are comprised of individuals who are
ready and willing to participate in studies. When subject pool members are students,
participation in studies is often required for course credit. While subject pools can
improve the rate of laboratory participation, laboratory studies are often criticized
precisely for their use of subject pools. Since members of pools have expressed a will-
ingness to participate in studies (or are required to participate by virtue of their course
schedules), they are likely to participate in multiple studies during their tenure in the
subject pool (or be required to participate in multiple studies).
Repeated participation in experimental studies can be problematic. When
participants are in multiple studies, they may “become less naïve than one might hope”
(Mutz 2011). Each round of participation may teach subject pool members about the
experimental process, and this type of learning can—​under certain conditions—​affect
their responses to subsequent experimental treatments (Weber and Cook 1972). If this is
a critique levied at laboratory subject pools, it is one that scholars should consider care-
fully as we come to rely more and more on national panels of subjects to obtain samples
for survey experiments. In particular, as Morton and Williams (2010) note, a student
subject pool “automatically refreshes over time,” which is less likely to happen with a na-
tional panel of participants who earn incentives for participation (323).
Concerns about “professional subjects” in survey experiments are often brought up
in regard to MTurk participants. Indeed, the MTurk platform, where participants earn
money for completing tasks and can complete numerous studies on a daily basis, lends
itself to the creation of such a “professional subject” (Chandler, Mueller, and Paolacci
2014). In this particular context, however, the relative “professionalization” of a partici-
pant can affect the way he or she responds to survey experiment treatments and shift the
size of treatment effects (Chandler et al. 2015).
Initially, survey companies with national panels may seem immune from the “profes-
sional subject” criticism. What differentiates MTurk is the pure opt-​in nature of the plat-
form: not only do people opt-​in to participating in MTurk, but they also opt-​in to the
studies. Survey companies, on the other hand, randomly invite people from the panel
to participate in any given study, diminishing opt-​in patterns and limiting the number
of studies a given panel member can take. On the other hand, when people remain part
of an online panel for several years (even if their presence on the panel is due to random
recruitment), they are bound to participate in multiple studies (Hillygus, Jackson, and
Young 2014). Moreover, there is little to limit simultaneous participation in multiple
survey panels. In short, eliminating the chance to opt-​in to individual studies does not
fully immunize online panels against the types of learning effects that may occur in
MTurk participants.17
Hillygus, Jackson, and Young (2014) report that within the ten largest companies
that maintain national online panels, 1% of panel members accounted for 34% of all
completed studies. Analyzing whether repeated participation matters, they examine
participants recruited via YouGov, a company with a large national panel. In their
sample Hillygus, Jackson, and Young (2014) show that the self-​reported mean of survey
participation over the previous four weeks is 4.54, and 36.5% of their sample reported
being members of three or more online panels at the same time.
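Concentration figures of this kind are straightforward to compute once per-member completion counts are available. The Python sketch below shows the calculation on simulated counts; the distribution is a hypothetical illustration and is not the data reported by Hillygus, Jackson, and Young (2014).

import numpy as np

rng = np.random.default_rng(7)
# Hypothetical completion counts: most members finish few studies, a small
# tail finishes many (modeled here as a gamma-Poisson mixture).
rates = rng.gamma(shape=0.3, scale=6.0, size=50_000)
completions = rng.poisson(rates)

sorted_counts = np.sort(completions)[::-1]               # heaviest participants first
top_one_percent = sorted_counts[: len(sorted_counts) // 100]
share = top_one_percent.sum() / sorted_counts.sum()
print(f"Top 1% of members account for {share:.0%} of completed studies")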
Does this repeated participation matter? The existing literature offers conflicting ev-
idence. Some scholars suggest that repeated participation teaches individuals how to
avoid engaging in more “work” by answering questions in ways that avoid additional
follow-​ups (Nancarrow and Cartwright 2007). Others suggest that these “professional
subjects” are less likely to satisfice when taking part in studies (Chang and Krosnick
2009). Overall, Hillygus, Jackson, and Young (2014) note that existing research suggests
that it is largely unclear if repeated participation in studies changes individual behav­
ior in any particular way. More recent studies, however, suggest that there are some
differences between people who participate repeatedly in panel studies and those whose
participation is less frequent.
Hillygus, Jackson, and Young (2014) show that repeated participants are likely to have
lower levels of political knowledge, interest, and engagement with politics. In a different
study, Adams, Atkeson, and Karp (2015) demonstrate that factors such as age, gender,
income, and education all affected the number of studies members of national panels
completed. Taken together, these results suggest some systematic differences between
panel members who participate frequently and those who do so rarely.
Adams, Atkeson, and Karp (2015) argue that some repeat participants are largely
motivated by extrinsic (e.g., money or points for completing studies) rather than in-
trinsic (e.g., interest in politics) factors. Participants motivated by extrinsic factors, they
argue, are more likely to satisfice and less likely to thoughtfully engage with the survey
and—​extrapolating this point to survey experiments—​may be less likely to engage
with the treatment. Indeed, Adams, Atkeson, and Karp (2015) demonstrate that repeat
participants take surveys more quickly.18 Even more important, they show that repeat
participants become more politically knowledgeable.
In sum, the possibility exists that repeated participation in national panels can
affect individual behavior in studies. Presumably, in a survey experiment repeated
participants should be randomized across experimental groups, which may diminish
concerns. On the other hand, there may be scholars who are interested in conditional
effects—​for example, in the way their treatment may affect participants who are less in-
terested in or knowledgeable about politics. The possibility that these political factors
are correlated with participation patterns could affect the ultimate conclusions drawn
from a given survey experiment.
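Both points in the preceding paragraph—that randomization balances repeat participation across experimental groups, and that conditional effects can nonetheless be entangled with it—can be illustrated with a small simulation. Everything in the Python sketch below (the sample size, the share of frequent participants, the assumed correlation between participation frequency and low political knowledge, and the weaker treatment response among frequent participants) is a hypothetical assumption used only to show the logic.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n = 4000
frequent = rng.binomial(1, 0.25, n)                  # hypothetical "professional" panelists
know_low = rng.binomial(1, 0.30 + 0.30 * frequent)   # knowledge correlated with frequency
treat = rng.binomial(1, 0.5, n)                      # random assignment
# Hypothetical outcome: frequent participants respond less strongly to the treatment.
y = 0.5 * treat - 0.3 * treat * frequent + rng.normal(0, 1, n)
data = pd.DataFrame({"treat": treat, "frequent": frequent, "know_low": know_low, "y": y})

# Randomization balances repeat participation across experimental groups ...
print(data.groupby("treat")["frequent"].mean())
# ... but the low-knowledge subgroup over-represents frequent participants,
# so its conditional effect partly reflects their (weaker) response.
print(data.groupby("know_low")["frequent"].mean())
for flag in (0, 1):
    subgroup = data[data["know_low"] == flag]
    effect = smf.ols("y ~ treat", data=subgroup).fit().params["treat"]
    print("conditional treatment effect, know_low =", flag, ":", round(effect, 2))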
A greater concern is the possibility that the responsiveness of a participant to a given
treatment may be a function of his or her prior experiences with survey experiments.
This issue can be particularly important if a high percentage of participants in a given
study is repeat participants who have a certain sense of familiarity with the experimental
process. In this case, there is a possibility that the results of the study are driven by repeat
participation. This outcome, in turn, limits the generalizability of a survey experiment—​
which may have been the precise reason for relying on a national panel in the first place.
Sample Characteristics and Benefits of Survey Experiments

Any given sample comes with costs and benefits. Indeed, as the research on repeated
participation demonstrates, even reliance on costly national panels does not insu-
late a researcher against potential sample limitations. The goal of discussing these
limitations is to suggest that the survey environment is not, by itself, an
unconditional improvement over the laboratory environment. First, the nature of
the sample matters. If we define a survey experiment based on the sampling process
associated with recruiting participants, then an experiment that evaluates a sample
as relatively better than one obtained in a laboratory may not necessarily be an im-
provement. Second, it is important to be cognizant of the constraints involved in
recruiting subjects and the costs of relying on people who may be professional study
participants.
Underlying these points is the idea that sample diversity and representativeness are
not always the ideal heuristics for evaluating the generalizability of a survey experiment.
Opening the study to national panels, for example, is often a way to gain greater sample
diversity. Doing so, however, may introduce the presence of repeated participants and
may actually make the results less generalizable than a laboratory experiment conducted
in a controlled setting on naïve undergraduates. Moreover, a researcher’s goals may not
always be best served by a sample that is diverse across a variety of characteristics (such
as the sample used in the ANES). A survey experiment on a more representative sample
is not by definition superior to one that is conducted on a more narrow sample. As Mutz
(2011) argues, the diversity of the sample should be defined by the population at the
heart of the research question.
Studies that attempt to identify the effects of partisanship cues may offer stronger and
more generalizable inferences when people who identify as independents are excluded
from the sample (e.g., Druckman, Peterson, and Slothuus 2013). Although excluding
independents limits the political diversity of the study, including independents in a
survey experiment about the power of partisan messages could make the survey less gen-
eralizable because independents are less likely to pay attention to (or even be exposed
to) partisan cues in the “real world.” Similarly, when the goal is to identify gradations in
identity strength among individuals who identify as partisans, a sample that is recruited
from partisan blogs and websites may enhance the generalizability of the inferences
made (Huddy, Mason, and Aaroe 2015). Though less diverse and not representative of
the population, a sample in which every participant has some existing connection to a
party allows for clearer inferences about partisan identity strength than a sample drawn
through a national panel.
Recent advances in sample recruitment have given scholars the ability to recruit rep-
resentative (or at the very least more diverse) samples quickly and with relative ease. The
idea that a better sample is available and obtainable, then, may lead to the use of sample
diversity as a heuristic with which to judge the quality of a survey experiment. In the
abstract, this may be an effective heuristic, and the diversity of the sample may appear
as a signal for the generalizability of the results. In practice, however, the relationship
between sample diversity and generalizability is closely linked to study goals. A survey
experiment is a delicate balance among the research question, the desired scope of
inferences, the sample participants necessary to make the desired inferences, and the
characteristics of the participants recruited. A survey experiment is not immediately
generalizable because it is fielded on a representative sample; similarly, a survey experi-
ment is not immediately limited because it is fielded on a sample that has little variance
across certain characteristics (Kam, Wilking, and Zechmeister 2007; Druckman and
Kam 2011).

Survey Experiments and Limitations in Measurement

Equating survey experiments with the ability to make inferences about the “real
world” suggests that under most conditions survey experiments are likely to be su-
perior to laboratory studies. Following this logic, then, it may be tempting to simply
redesign experiments previously conducted in the laboratory into studies that can be
fielded as survey experiments. This may initially seem like an easy transition. If survey
experiments are viewed simply as laboratory experiments transported into a broader
survey setting and fielded using a (potentially) representative (or at least more diverse)
sample of participants, then one can easily apply the logic used in laboratory studies
when designing measures and treatments for survey experiments. Yet while there
are certain ideas that can guide measurement in both types of experiments, survey
experiments bring conditions that may be less hospitable to measurement and design
techniques that are useful in laboratory settings.
As numerous scholars have suggested, survey experiments have been a particularly
pivotal tool in research on public opinion (Barabas and Jerit 2010; Lavine 2002). Indeed,
in their definition of survey experiments Gaines, Kuklinski, and Quirk (2007) note that
the goal of survey experiments is to investigate some component of public opinion re-
search. Increasingly, however, scholars have turned to survey experiments to analyze
outcomes that move beyond public opinion. Scholars have used survey experiments,
for example, to study willingness to turn out to vote (Brooks and Geer 2007), obtain
different types of information (Brader, Valentino, and Suhay 2008), or take a variety of
political actions (Levine 2015). The application of survey experiments to political par-
ticipation is important and useful. The benefit of the survey experiment is the increased
ability to make generalizable inferences; it stands to reason that scholars are inter-
ested in making generalizable inferences about a variety of topics, and the participa-
tory components of individual orientations toward politics are pivotal to democratic
outcomes.
Moving beyond measures designed to capture components of public opinion,
however, may be more challenging in the survey experiment context. By definition,
survey experiments are conducted within a survey. In turn, surveys with embedded
experiments are conducted in much the same manner as those without embedded
experiments: outside the laboratory, either face-​to-​face with an interviewer, over the tel-
ephone, or over the Internet. In all of these cases, the interviewer comes to the partici-
pant, and the participant engages with the study in the context of his or her day-​to-​day
life. Indeed, this context is what makes survey experiments closer to the “real world”
than the carefully controlled setting of the laboratory. The survey context, however,
means that measures are limited to tasks that a participant can reasonably complete
within a survey environment.
The tasks and measures that are best suited for and most easily implementable within
a survey context are those that ask participants to express their preferences. Expressed
preference measures ask participants how likely they would be to undertake some sort
of action. An expressed preference measure may ask a participant how likely he or she
may be to vote in an upcoming election, whether he or she intends to watch the news
in the next week, or whether he or she would be interested in contacting his or her con-
gressperson (Krupnikov and Levine 2011). These are, of course, reasonable measures. As
Krupnikov and Levine (2011) note, questions that ask people how likely they would be
to take some action are often the best way to capture the potential behavior of a diverse
sample in a survey experiment.
Yet expressed preferences carry with them a limitation:  “the significant disadvan-
tage is that people may not necessarily do what they say" (Kroes and Sheldon 1988, 13).
Expressing a high likelihood of taking some action during a survey experiment is in
many ways virtually costless. Indeed, it is likely for this reason that people tend to
overestimate their willingness to participate in future political events
(Pinkleton, Austin, and Fortman 1998). This tendency to overestimate and overreport
willingness to act can be troublesome for scholars. An increase in an expressed prefer-
ence for action due to some treatment in a survey experiment may mean that this type
of treatment generally increases political participation. Alternatively, such an increase
may mean that this treatment increases people’s willingness to tell an interviewer that
they will take an action, but has null effects on their actual behavior (Krupnikov and
Levine 2011).
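The ambiguity can be seen in a brief simulation. In the hedged Python sketch below, two hypothetical scenarios—one in which the treatment truly changes behavior and one in which it only inflates what people tell an interviewer—both produce a positive treatment effect on expressed willingness; only the (typically unobserved) behavioral outcome distinguishes them. All base rates and effect sizes are invented for illustration.

import numpy as np

rng = np.random.default_rng(3)
n = 5000
treat = rng.binomial(1, 0.5, n)

def simulate(behavior_effect, reporting_effect):
    """Return expressed willingness and actual behavior under hypothetical effects."""
    actual = rng.binomial(1, 0.20 + behavior_effect * treat)       # what people do
    overreport = rng.binomial(1, 0.30 + reporting_effect * treat)  # costless talk
    expressed = np.maximum(actual, overreport)                     # what people say
    return expressed, actual

def gap(x):
    return x[treat == 1].mean() - x[treat == 0].mean()

for label, b_eff, r_eff in [("real behavioral effect", 0.10, 0.00),
                            ("overreporting only", 0.00, 0.10)]:
    expressed, actual = simulate(b_eff, r_eff)
    print(f"{label}: expressed gap = {gap(expressed):.2f}, actual gap = {gap(actual):.2f}")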
An alternative to expressed preference measures is revealed preference measures.
Revealed preference measures give people an opportunity to complete a task within a
research setting. These types of measures create situations in which “respondents ac-
tually experience a cost” (Fowler 2006, 676). While an expressed preference measure
may ask participants how willing they would be to donate funds to a group, a revealed
preference measure may ask participants to place money in an envelope and donate
the money (Levine 2015). While an expressed measure may ask participants how likely
they would be to wear a button displaying support for their political party, a revealed
preference measure may actually track what happens when people are given real po-
litical buttons (Klar and Krupnikov 2016). These types of measures make reporting a
preference for action more costly, and because participation generally carries a cost
(Verba, Schlozman, and Brady 1995), these types of measures can help make inferences
more generalizable.
Revealed preference measures are often difficult to implement in a survey experi-
ment setting (Krupnikov and Levine 2011). While the laboratory setting can easily lend
itself to the types of tasks that ask participants to reveal their preferences (e.g., Johnson
and Ryan 2015), embedding costly tasks into a survey may often prove difficult. Levine
(2015), for example, conducted experimental studies that track how different types of
donation requests influence individual willingness to donate funds. One study was
conducted in the laboratory; another was conducted as a survey experiment. Since
analyzing donations requires that people actually donate money (rather than report a
willingness to donate money), Levine (2015) measures what proportion of an endow-
ment received at the start of the experiment participants are willing to donate following
exposure to various donation requests. This creates a constraint. In his laboratory ex-
periment, Levine (2015) explains, participants were actually given real money that they
opted to either keep or donate. Since this could not occur in a survey experiment, “it
is possible that some subjects were not convinced that they would actually receive the
money they chose not to donate” (Levine 2015, 230). His discussion of accounting for
this constraint highlights that for the purposes of analyzing an outcome that is best
identified with a revealed preference measure, the survey experiment contexts can have
more costs than benefits.
Revealing preferences, however, does not always mean taking a costly action.
A more generalized form of a revealed preference measure may be an individual’s so-
cial interaction. A person may report a willingness to share information when asked
to express a preference but be more or less willing to share that information when an
actual interaction takes place. To this extent, then, studies that depend on social
interactions may also be limited by the survey experiment approach. In Klar’s (2014)
laboratory experiment, for example, participants discuss politics either with members
of their own party or with members of the opposing party. Pivotal to Klar’s argument
is the actual, direct, social interaction that occurs within the group; it is unclear if the
same effect can be achieved outside the laboratory even with the implementation of a
chat room.
Klar’s (2014) study aside, utilizing revealed preference measures in survey experiments
is not an impossible task. The recent growth in Internet surveys means that scholars
can embed measures with more behavioral components when measuring participatory
outcomes (e.g., Brader, Valentino, and Suhay 2008). As Levine (2015) demonstrates, it
is even possible to implement a donation experiment with an endowment in the survey
experiment context. Yet it is important to be cognizant of the costs and benefits of doing
so. Survey experiments carry with them constraints for scholars who want to move be-
yond opinion measures toward measures of participation. In these types of cases, the
survey experiment may not be consistently and unconditionally superior to a laboratory
study—​even if the survey experiment is performed on a representative sample of a given
population. While moving to a more diverse sample can increase the generalizability of
the study, giving up a revealed preference measure to do so may undermine the validity
of the inferences scholars want to make about individual behavior.

Conclusions

Over the last several decades, survey experiments have proven pivotal to the study of
political behavior. As Lavine (2002) notes, “survey experiments that integrate repre-
sentative samples with the experimental control of questions represent the most val-
uable tool for gaining access to the processes that underlie opinion formation” (242).
More recently, survey experiments have become even more accessible. Especially, if
scholars are willing to rely on nonprobability or national convenience samples in survey
experiments, these studies can be run at lower costs and produce results that may be
more generalizable than those obtained with laboratory studies.
Key to many arguments about the usefulness of survey experiments is the role of the
“sample.” The idea that an experiment was conducted on a sample that is representative—​
or at least more representative than some other possible sample—​often seems to make
experimental results more trustworthy or publishable (Kam, Wilking, and Zechmeister
2007). Yet this focus on sample may blind scholars to the push and pull among research
goals, experimental design, and experimental participants. The control of a laboratory
study, for example, may outweigh the benefits of a national sample for a scholar studying
the effects of interpersonal communication. The inclusion of non-​naïve participants in
a study, for example, may undermine the inference drawn from an experiment on a na-
tional population (Chandler et al. 2015).
More broadly, we suggest that when considering whether a survey experiment is
the best approach, scholars may want to weigh the following considerations. First,
does a full test of the hypothesis require total control over every aspect of a subject’s par-
ticipation? Does an appropriate test of the hypothesis, for example, depend on the re-
searcher being aware of the level of attention a participant pays to the treatment? Would
an experiment—​as a test of a particular hypothesis—​lose conceptual clarity if there is
even a slight amount of variance in the way participants are exposed to the treatments
and subsequently complete post-​treatment tasks? If an adequate test of the hypothesis
requires control over virtually every aspect of subject participation, then the costs of
moving from a laboratory environment to a survey experiment may be too great. If the
experimental design can absorb some decline in control (e.g., participants taking the
study over the Internet are in a variety of different environments when exposed to the
treatment), yet still remain a reasonable test of the hypothesis, then a survey experiment
may be reasonable.
Second, given the particular experimental design, which sample is most likely to pro-
duce generalizable inferences? While representative samples may make the answer to this
question simple, if a scholar is not able to recruit a representative sample, the question of
participants becomes more difficult.19 If the scholar is planning to recruit a convenience
sample for a survey experiment, can it be assumed that the convenience sample is more
diverse precisely on the types of factors that make the undergraduate sample narrow?
Given the design of the experiment, is the increase in the potential diversity of the
sample a greater benefit than the cost of including subjects who are professional study
participants? Although a convenience sample may be more diverse, this diversity may
not necessarily translate to the generalizability of results in each and every experiment.
Moreover, a researcher may actually have a better understanding of the characteristics
and motivations of the laboratory sample, meaning that he or she will be better able
to design studies that account for the particular narrowness of the sample. The possi-
bility that a convenience sample is less narrow on certain characteristics does not nec-
essarily mean that it will produce results that are more generalizable; scholars should
again weigh the costs and benefits.
We are of course far from the first to suggest that survey experiments carry costs.
Barabas and Jerit (2010), for example, raise issues of external validity in survey
experiments. Gaines, Kuklinski, and Quirk (2007) consider a variety of design issues
that can undermine survey experiments. These articles are important because the
possibility of an experiment that retains control but increases the generalizability of
the findings by measuring outcomes in the “real world” holds a tremendous amount
of promise. Indeed, this type of logic would suggest that most experiments could be
improved by a change in context. Our goal in this chapter is not to suggest otherwise,
but rather to offer a more ambivalent perspective. Although survey experiments are
useful and important, it would be shortsighted to argue that moving a study out of a
laboratory provides only benefits and no costs. Certainly in many cases relying on a
survey experiment enhances the study, but there are certain conditions under which
survey experiments may undermine rather than enhance the inferences scholars
can draw.

Notes
1. Scholars have offered other definitions of what makes a particular experimental design
a “survey experiment.” Nock and Guterbock (2010), for example, define a survey exper-
iment as a study that randomly assigns survey components. Under this definition, an ex-
periment that randomly assigns an intervention that is not at all survey based, but that
uses survey-​style questions to measure outcomes either before or after that intervention, is
not necessarily a survey experiment.
2. Sniderman (2011) also credits Merrill Shanks with CATI development.
3. Gaines et al. (2007) use the term “panacea” in the abstract of the article.
4. It is important to note here—​as Mutz (2011) does—​that the term “experiment” is not always
synonymous with random assignment. For example, Mutz notes that Milgram’s (1963)
original experiment on authority does not necessarily rely on random assignment. Mutz
notes that subsequent studies and replications of Milgram’s original result did use random
assignment. Nonetheless, experimental research in the social sciences has often explic-
itly meant the use of random assignment to interventions. Time Sharing Experiments for
the Social Sciences (TESS), a program that funds only survey experiments, for example,
notes that only proposals that have some form of random assignment (either within or
between subject) can be funded (see http://​www.tessexperiments.org/​introduction.
html#proposals, “What Kind of Proposals Are Appropriate?”).
5. One other type is a natural experiment, though in this particular case the intervention is
not created by or at the instruction of a researcher.
6. This is not to argue that all field experiments necessarily rely on behavioral outcomes. In
certain field studies, the intervention is assigned in the field, but treatment outcomes are
measured with follow-​up surveys. See Gerber, Karlan, and Bergan (2009) for an example
of such an approach.
7. Levine (2015) also uses a laboratory experiment to demonstrate mechanisms.
8. Nonetheless, because in a survey experiment people are still aware that they are part of
a research process, survey experiments cannot generalize to real-​world behaviors to the
level of field experiments.
9. Note that the idea that in a field experiment people can avoid or ignore the treatment can
be considered an external validity benefit of a field experiment—​because it means that
people are dealing with the treatment in a way that exemplifies their true behavior (Jerit
et al. 2013).
10. Certainly, scholars can offset the costs of participation in laboratory studies by offering
participants high financial incentives for participation (Morton and Williams 2010).
The use of financial incentives, however, is in itself not without limitations; a re-
searcher may simply lack the financial resources to recruit a large sample of laboratory
participants.
11. These discussions of sample begin from the assumption that a scholar has obtained
and even surpassed the sample size necessary to observe even small group differences.
Assuming that the necessary sample size can be obtained, the question becomes
whether the recruited participants should be sampled in a way that is representative of the
population of interest. As Mutz notes, population-based survey experiments "need not
(and often have not) relied on nationally representative population samples . . . the key is
that convenience samples are abandoned in favor of samples representing the target popu-
lation of interest” (2011, 3).
12. MTurk is a platform on which researchers can post surveys as tasks. People who are reg-
istered as MTurk “workers” can then choose to opt in to the task and complete the survey
for a predetermined payment. MTurk as a recruitment platform highlights the tension
between the absolute and relative definitions of sample diversity. MTurk recruitment is
unlikely to produce a nationally representative sample, but this approach can produce a
sample that is more diverse than a laboratory sample of undergraduates.
13. Although see Kam, Wilking, and Zechmeister (2007) for the argument that decreasing
the representativeness of the sample can be beneficial for certain types of treatments and
studies.
14. One example of such a company is SSI. See Berinsky, Margolis, and Sances (2014) for use of
SSI in political science.
15. One example of such a company is GfK (formerly known as Knowledge Networks). See
Prior (2005) for use of GfK as Knowledge Networks.
16. Sargis et al. (2013) present data about the rise of Internet-​based studies in psychology;
Barabas and Jerit (2010) discuss this point in regard to political science.
17. Notably, the effects may be more extreme in MTurk, where participants can take part
in multiple studies and can discuss these studies in forums (Chandler, Mueller, and
Paolacci 2014). Moreover, MTurk participants know that they earn money based on data
quality, which may mean that they may become more attentive as they professionalize
(Chandler, Mueller, and Paolacci 2014). Participants in national panels, on the other
hand, may actually become less attentive as they professionalize (Hillygus, Jackson, and
Young 2014).
18. Note that Adams et al. (2015) address the possibility that repeat members of panels are
simply becoming more adept at handling the technological aspects of survey participa-
tion, which leads them to complete the study more quickly.
19. The assumption here is that the scholar is unable to recruit a representative sample due
to financial constraints, rather than because there is no defined population. Also, the
assumption is that the scholar made the determination that he or she is unable to recruit a
representative sample prior to the design of the experiment.

References
Adams, A. N., L. R. Atkeson, and J. Karp. 2015. “Data Quality, Professional Respondents
and Discontinuous Survey:  Issues of Engagement, Knowledge and Satisficing.” Paper
presented at the International Methods Colloquium, November 6, 2015. http://​www.
methods-​colloquium.com/​#!Lonna-​Atkeson-​Data-​Quality-​Professional-​Respondents-​
and-​Discontinuous-​Survey-​Issues-​of-​Engagement-​Knowledge-​and-​Satisficing/​clv6/​
563cffe30cf2c322b497870e.
Ahn, T. K., R. Huckfeldt, and J. B. Ryan. 2014. Experts, Activists, and Democratic Politics: Are
Electorates Self-​Educating? New York: Cambridge University Press.
Arceneaux, K. 2012. “Cognitive Biases and the Strength of Political Arguments.” American
Journal of Political Science 56 (2): 271–​285.
Barabas, J., and J. Jerit. 2010. “Are Survey Experiments Externally Valid?” American Political
Science Review 104 (2): 226–​242.
Berinsky, A. J., G. A. Huber, and G. S. Lenz. 2012. “Evaluating Online Labor Markets for
Experimental Research: Amazon.com’s Mechanical Turk.” Political Analysis 20 (3): 351–​368.
Berinsky, A. J., M. F. Margolis, and M. W. Sances. 2014. “Separating the Shirkers from the
Workers? Making Sure Respondents Pay Attention on Self-​ Administered Surveys.”
American Journal of Political Science 58 (3): 739–​753.
Brader, T., N. A. Valentino, and E. Suhay. 2008. “What Triggers Public Opposition to
Immigration? Anxiety, Group Cues, and Immigration Threat.” American Journal of Political
Science 52 (4): 959–​978.
Brooks, D. and J. G. Geer. 2007. “Beyond Negativity: The Effects of Incivility on the Electorate.”
American Journal of Political Science. 51 (1): 1–​16.
Bullock, J. G. 2011. “Elite Influence on Public Opinion in an Informed Electorate.” American
Political Science Review 105 (3): 496–​515.
Cantril, H., and S. S. Wilks. 1940. "Problems and Techniques." Public Opinion Quarterly 4
(2): 330–​338.
Chandler, J., P. Mueller, and G. Paolacci. 2014. “Nonnaïveté Among Amazon Mechanical Turk
Workers:  Consequences and Solutions for Behavioral Researchers.” Behavioral Research
Methods 46 (1): 112–130.
Chandler, J., G. Paolacci, E. Peer, P. Mueller, and K. Ratliff. 2015. “Using Nonnaive Participants
Can Reduce Effect Sizes.” Psychological Science 26 (7): 1131–​1139.
Chang, L., and J. A. Krosnick. 2009. “National Surveys via RDD Telephone Interviewing
Versus the Internet Comparing Sample Representativeness and Response Quality.” Public
Opinion Quarterly 73 (4): 641–​678.
Chong, D., and J. N. Druckman. 2007. “Framing Theory.” Annual Review of Political Science
10: 103–​126.
Clifford, S., and J. Jerit. 2014. “Is There a Cost to Convenience? An Experimental Comparison
of Data Quality in Laboratory and Online Studies” Journal of Experimental Political Science
1 (2): 120–​131.
Dale, A., and A. Strauss. 2009. “Don’t Forget to Vote: Text Message Reminders as a Mobilization
Tool” American Journal of Political Science 53 (4): 787–​804.
Druckman, J. N., D. P. Green, J. H. Kuklinski, and A. Lupia. 2006. “The Growth and
Development of Experimental Research in Political Science.” American Political Science
Review 100 (4): 627–​635.
Druckman, J. N., D. P. Green, J. H. Kuklinski, and A. Lupia. 2011. “Experiments:  An
Introduction in Core Concepts.” In Cambridge Handbook of Experimental Political Science,
edited by J. Druckman, P. Green, J. H. Kuklinksi, and A. Lupia, 15–​26. New York: Oxford
University Press.
Druckman, J. N., and C. D. Kam. 2011. "Students as Experimental Participants: A Defense of the
‘Narrow Database.’ ” In Cambridge Handbook of Experimental Political Science, edited by J.
Druckman, P. Green, J. H. Kuklinksi, and A. Lupia, 41–​57.
Druckman, J. N., E. Peterson, and R. Slothuus. 2013. "How Elite Polarization Affects Public
Opinion Formation.” American Political Science Review 107 (1): 57–​79.
Fowler, J. 2006. “Altruism and Turnout.” Journal of Politics 68 (3): 674–​683.
Gaines, B. J., J. H. Kuklinski, and P. J. Quirk. 2007. “The Logic of the Survey Experiment
Reexamined.” Political Analysis 15 (1): 1–​20.
Gerber, A. 2011. “Field Experiments in Political Science.” In Cambridge Handbook of
Experimental Political Science, edited by J. Druckman, D. P. Green, J. H. Kuklinksi, and A.
Lupia, 115–​140. New York: Oxford University Press.
Gerber, A., and D. P. Green. 2000. “The Effects of Canvassing, Telephone Calls, and Direct Mail
on Voter Turnout: A Field Experiment.” American Political Science Review 94 (3): 653–​663.
Gerber, A., D. Karlan, and D. Bergan. 2009. “Does the Media Matter? A Field Experiment
Measuring the Effect of Newspapers on Voting Behavior and Political Opinions.” American
Economic Journal: Applied Economics 1 (2): 35–52.
Gilens, M. 2002. “An Anatomy of Survey-​ Based Experiments.” In Navigating Public
Opinion: Polls, Policy and the Future of American Democracy, edited by J. Manza, F. Lomax
Cook, B. I. Page, 232–​250. New York: Oxford University Press.
Hillygus, D. S., N. Jackson, and M. Young. 2014. “Professional Respondents in Non-​Probability
Online Panels.” In Online Panel Research: A Data Quality Perspective, edited by M. Callegaro,
R. Baker, J. Bethlehem, A. S. Goritz, J. A. Krosnick, and P. Lavrakas, 219–​237. New York: John
Wiley & Sons.
Huddy, L., L. Mason, and L. Aaroe. 2015. "Expressive Partisanship: Campaign Involvement,
Political Emotion and Partisan Identity.” American Political Science Review 109 (1): 1–​17.
Iyengar, S. 2011. “Laboratory Experiments in Political Science.” In Cambridge Handbook
of Experimental Political Science, edited by J. Druckman, P. Green, J. H. Kuklinksi, and A.
Lupia, 73–​88. New York: Oxford University Press.
Iyengar, S., and D. Kinder. 1987. News That Matters. Chicago: University of Chicago Press.
Jerit, J., J. Barabas, and S. Clifford. 2013. “Comparing Contemporaneous Laboratory and Field
Experiments on Media Effects.” Public Opinion Quarterly 77 (1): 256–​282.
Johnson, D. B., and J. B. Ryan. 2015. “The Interrogation Game: Using Coercion and Rewards to
Elicit Information from Groups.” Journal of Peace Research 52 (November): 822–​837.
Kam, C. D., J. R. Wilking, and E. J. Zechmeister. 2007. “Beyond the ‘Narrow Data Base’: Another
Convenience Sample for Experimental Research.” Political Behavior 29 (4): 415–​440.
Keeter, S., C. Zukin, M. Andolina, and K. Jenkins. 2002. "Improving the Measurement of
Political Participation.” Paper presented at the annual meeting of the Midwest Political
Science Association, Chicago, IL.
Klar, S. 2014. “Partisanship in Social Setting.” American Journal of Political Science 58
(3): 687–​704.
Klar, S., and Y. Krupnikov. 2016. Independent Politics: How American Disdain for Parties Leads
to Political Inaction. New York: Cambridge University Press.
Kroes, E., and R. Sheldon. 1988. "Stated Preferences Methods." Journal of Transport Economics
and Policy 22 (1): 11–​25.
Krupnikov, Y., and A. S. Levine. 2011. "Expressing Versus Revealing Preferences in Experimental
Research.” In The Sourcebook for Political Communication Research: Methods, Measures and
Analytic Techniques, edited by E. Bucy and R. L. Holbert, 149–​164. New York: Routledge.
Lavine, H. 2002. “On-​line Versus Memory Based Process Models of Political Evaluation.”
In Political Psychology, edited by K. Monroe, 225–​247. Mahwah, NJ:  Lawrence Erlbaum
Associates.
Levine, A. S. 2015. American Insecurity: Why Our Economic Fears Lead to Political Inaction.
Princeton, NJ: Princeton University Press.
Maxwell, S. E., and H. Delaney. 2004. Designing Experiments and Analyzing Data: A Model
Comparison Perspective. New York: Taylor and Francis.
McDermott, R. 2002. “Experimental Methods in Political Science.” Annual Review of Political
Science 5: 31–​61.
Milgram, S. 1963. “Behavioral Study of Obedience.” Journal of Abnormal and Social Psychology
67 (4): 371–​378.
Morton, R. B., and K. C. Williams. 2010. Experimental Political Science and the Study of
Causality: From Nature to the Lab. New York: Cambridge University Press.
Mullinix, K. J., T. J. Leeper, J. N. Druckman, and J. Freese. 2015. “The Generalizability of Survey
Experiments.” Journal of Experimental Political Science 2 (2): 109–​138.
Mutz, D. 2011. Population-​Based Survey Experiments. Princeton, NJ: Princeton University Press.
Nancarrow, C., and T. Cartwright. 2007. “Online Access Panels and Tracking Research: The
Conditioning Issue.” International Journal of Market Research 49 (5): 573–​594.
Nock, S. L., and T. M. Guterbock. 2010. “Survey Experiments.” In The Handbook of Survey
Research, 2nd ed., edited by P. W. Marsden and J. D. Wright, 837–​864. Wiley Interscience.
Paolacci, G., and J. Chandler. 2014. “Inside the Turk:  Understanding Mechanical Turk as a
Participant Pool.” Current Directions in Psychological Science 23: 184–​188.
Piazza, T., P. M. Sniderman, and P. Tetlock. 1989. "Analysis of the Dynamics of Political
Reasoning: A General-Purpose Computer-Assisted Methodology." Political Analysis 1
(1): 99–​119.
Pinkleton, B. E., E. W. Austin, and K. K. J. Fortman. 1998. "Relationships of Media Use and
Political Disaffection to Political Efficacy and Voting Behavior.” Journal of Broadcasting and
Electronic Media 42 (1): 34–​49.
Prior, M. 2005. “News vs. Entertainment:  How Increasing Media Choice Widens Gaps in
Political Knowledge and Turnout.” American Journal of Political Science 49 (3): 577–​592.
Sargis, E. G., L. J. Skitka, and W. McKeever. 2013. “The Internet as Psychological Laboratory
Revisited: Practices, Challenges and Solutions.” In The Social Net: Understanding our Online
Behavior, edited by Y. Amichai-​Hamburger, 253–​269. New York: Oxford University Press.
Schwarz, N., B. Knauper, H.-​J. Hippler, E. Noelle-​Neumann, and L. Clark. 1991. “Rating Scales
Numeric Values May Change the Meaning of Scale Labels.” Public Opinion Quarterly 55
(4): 570–​582.
Searles, K. 2010. “Feeling Good and Doing Good for the Environment: The Use of Emotional
Appeals in Pro-​Environmental Public Service Announcements.” Applied Environmental
Education and Communication 9 (3): 173–​184.
Sears, D. O. 1986. “College Sophomores in the Laboratory: Influences of a Narrow Data Base on
Social Psychology’s View of Human Nature.” Journal of Personality and Social Psychology 51
(3): 515–​530.
Sniderman, P. M. 2011. “The Logic and Design of the Survey Experiment.” In Cambridge
Handbook of Experimental Political Science, edited by J. Druckman, P. Green, J. H. Kuklinksi,
and A. Lupia, 102. New York: Oxford University Press.
Teele, D. L. 2014. Introduction to Field Experiments and Their Critics: Essays on the Uses and
Abuses of Experimentation in the Social Sciences, edited by D. Teele, 1–​8. New Haven, CT: Yale
University Press.
Verba, S., K. L. Schlozman, and H. E. Brady. 1995. Voice and Equality: Civic Volunteerism in
American Politics. New York: Cambridge University Press.
Weber, S. J., and T. D. Cook. 1972. “Subject Effects in Laboratory Research: An Examination
of Subject Roles, Demand Characteristics, and Valid Inference.” Psychological Bulletin 77
(4): 273–​295.
Chapter 22

Using Qualitative Methods in a Quantitative Survey Research Agenda

Kinsey Gimbel and Jocelyn Newsome

Introduction

Pollsters and survey researchers are often skeptical of focus groups, interviews, and
other qualitative methods, either shying away from them entirely or using them only
in the very initial discovery stages of a project, then abandoning them once a quantita-
tive survey method is implemented. However, qualitative techniques, from focus groups
to cognitive and in-​depth interviews (IDIs), can improve survey efforts and provide
unique data unobtainable through quantitative methods. While qualitative methods
have limitations, they can serve as a valuable complement to more traditional survey
research methods.
This chapter reviews the ways that qualitative methods can be useful in the frame-
work of a quantitative survey research effort, as well as the boundaries that qualitative
methods need to stay within to meet scientific standards. It then describes how quali-
tative efforts can be used throughout a project, including during initial survey creation,
as a survey is being conducted to collect data that are difficult to address in a survey
format, and as a way to learn more about specific survey findings. Qualitative methods
discussed include cognitive interviews, IDIs, and focus groups. The chapter concludes
with specific guidance on best practices for conducting qualitative research.
What Is Qualitative Research?

Qualitative research is frequently defined as a series of contrasts to quantitative re-
search. Generally, qualitative research employs less structured and more open-​ended
ways of gathering data than do quantitative methods. As a result, qualitative data
tend to be messy and complex. With quantitative data, respondents are usually easily
categorized—​for example, 24% of city residents voted in the last local election, and 76%
did not. A respondent either voted or didn’t vote; there are only two choices.
A qualitative research project exploring why people didn’t vote is less easy to sim-
plify. A series of interviews with nonvoters might find that one respondent did not vote
because she was overwhelmed by the number of candidates for the school board and
simply wasn’t sure how to make an informed choice. One may have intended to vote,
but her car broke down unexpectedly on her way home from work, leaving her with
only enough time to pick up her child from day care before the polls closed. Another
respondent may have failed to vote because he had outstanding parking tickets and was
worried about showing up at his polling place without having paid his fines.
Qualitative research gives us stories. Stories are important. Stories provide insight
into what the numbers actually mean. If quantitative survey data can tell us that 76% of
residents did not vote, qualitative data can explain why they did not vote. A focus group
of nonvoters can reveal whether the failure to vote was a result of lack of interest in city
politics; dissatisfaction with the candidates on the ballot; or perhaps even a strong liking
for all of the candidates, making it impossible to choose. Quantitative survey data—​
there are, of course, other types of quantitative research, but the focus here is on quanti-
tative data gathered through surveys—​is excellent at providing the big picture, capturing
a snapshot of a complex situation. Qualitative data can help us interpret the snapshot, so
that we know exactly what it is that we’re looking at. But because these are stories, they
are not easily reduced to numbers. If quantitative data are all about numbers, qualita-
tive data are all about words. (This is, of course, an oversimplification. Quantitative data
collection depends heavily on the words we use—​think, for example, of push polls—​and
qualitative data can be described in terms of numbers—​half of the respondents in the
focus group reported that they simply didn’t have time to get to the polls while they were
open. But it can be a useful oversimplification.)
Another key difference between quantitative and qualitative research is sampling.
Quantitative survey research employs probability sampling and seeks data that are sta-
tistically generalizable to a larger population. Qualitative data use “purposive” sam-
pling, which involves researchers systematically selecting certain groups or individuals
based on their relevance to the central research question. In the example of a study of
nonvoters, it would be inefficient to sample a general adult population. Furthermore,
we would want to include both individuals who have never voted and those who have
voted in the past. We would want to consider all the different factors that might lead to
different reasons for not voting, so we’d look for diversity among age, gender, population,
education levels, geographic location, and political affiliation. The idea is to get as many
different stories as possible—​not to have a pool that necessarily looks exactly like the
larger population. Because the sampling is purposive, qualitative data are not generaliz-
able to the larger population. This does not mean qualitative sampling is sloppy; it is just
focused on identifying participants who will be able to provide data on the key research
questions.
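For researchers who want to make a purposive design explicit, the recruitment plan can be laid out as a simple grid of target characteristics and quotas. The short Python sketch below builds such a grid for the nonvoter example; the characteristics, categories, and per-cell quota are illustrative assumptions, not a prescription.

from itertools import product

# Illustrative recruitment characteristics for a study of nonvoters.
age_groups = ["18-34", "35-54", "55+"]
voting_history = ["never voted", "has voted before"]
party_leaning = ["Democratic", "Republican", "independent/other"]

recruitment_grid = list(product(age_groups, voting_history, party_leaning))
quota_per_cell = 2   # e.g., aim to interview two people per combination

for cell in recruitment_grid:
    print(cell, "-> recruit", quota_per_cell)
print("total target interviews:", len(recruitment_grid) * quota_per_cell)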
Perhaps because of this limitation, there is a tendency for researchers trained in
quantitative survey research methodology to ignore or downplay qualitative options.
Occasionally, quantitative researchers may use focus groups or interviews during
the exploratory stage of a project, but these are left behind once a survey begins.
Sometimes qualitative methods are only used when (whether due to money, time,
institutional review board [IRB] or other clearance concerns) quantitative data
collection isn’t possible. But a research plan does not force a choice between quan-
titative or qualitative methods; the two can be integrated throughout the research
process. Qualitative methods can complement and enhance more traditional quan-
titative research, allowing researchers to ask questions and collect data that surveys
alone cannot. Qualitative techniques, including focus groups and IDIs, can both sup-
plement survey findings and provide unique data unobtainable through quantitative
methods.

Integrating Qualitative Methods into Survey Research

This section highlights four phases of the survey research process in which qualitative
methods can be used in concert with traditional survey research methods to both im-
prove a survey’s design or methodology and better understand and illustrate survey
findings: during the initial project discovery phase, during survey creation and refine-
ment, concurrent with a survey effort, and after a survey has been completed.

During the Initial Project Discovery Phase


The most common phase in which qualitative methods are used during survey research
is project discovery. Focus groups and interviews are ideal for learning about high-​level
concepts and for gaining a sense of how people think about or approach a topic. This
makes them exceptionally useful at the beginning of a research project, when under-
standing of a topic or issue is still in its initial stages. Many researchers, prior to de-
signing a survey, will hold a series of focus groups or interviews with the intention of
discussing a general topic. Researchers may want to consider using qualitative methods
at the outset of a research project when they are trying to identify three specific things:
1) Topics/​issues related to a research question. Some research projects are born with
a clearly defined research question and well-​defined outcomes. However, most
projects start out with a very general topic or question, and one of the first jobs of
the researchers is to identify the specific questions to be considered. Qualitative
data can provide researchers with guidance on which issues or topics are of high
interest to respondents and which remain unclear.
2) The limits and scope of a particular topic. At the beginning of a project, a researcher’s
instinct is often to load up the data collection instrument with as many questions
about as many aspects of the research topic as possible. But this can backfire, dis-
tracting respondents and losing their attention. Focus groups and interviews can
be used to ask respondents about the whole constellation of issues around a cer-
tain topic, identifying where the respondents lose interest, where the connection
between the topics breaks down, what areas are unlikely to be profitable topics of
research, and what does not belong in an area.
3) Concepts and terminology used by the target population. Doing qualitative research
with participants can also allow researchers to learn what kinds of issues and
language are being used among the actual target audience. This is especially im-
portant when the researchers are from a different demographic or interest group
than respondents. It is critical for researchers to identify any possible gaps in their
understanding of a concept or how respondents think about a particular issue as
early in the process as possible; this can help prevent measurement error or bias by
ensuring that important concepts are included in a research plan in the right way.

During Survey Creation and Refinement


Even if a research plan using a survey methodology has been determined, qualitative
methods can be used in a targeted manner during survey creation to ensure that the
highest quality data will be collected through the survey. Pilot studies are often used in
survey development to test instruments, and those are valuable for assessing methodology and
item or survey nonresponse. However, IDIs and focus groups can allow a researcher to
dig more deeply into the content of the instrument being developed. Using qualitative
methodology at this stage is especially important if a new survey is being developed or
significant new questions or scales are being attempted. Fine-tuning specific questions
or the flow of a survey prior to full fielding helps ensure that the instrument aligns with
the research questions and measurement goals, improving data quality, reducing
measurement error, and producing more persuasive findings.
More specifically, a certain kind of qualitative method called cognitive interviewing
can be extremely valuable during survey creation. Cognitive interviewing focuses on
identifying how respondents understand survey questions and process their responses.
This allows researchers to evaluate specific prospective survey questions, refine question
wording, identify possible response options, and determine ways that respondent
burden can be reduced, either through eliminating questions that aren’t working or
streamlining instructions or skip patterns (Willis 2005). Cognitive interviewing is
discussed in more detail in the next section of this chapter.

Concurrent with a Survey Effort


Sometimes, either during the initial research planning or during survey development,
it becomes clear that a research question cannot be fully answered by one kind of data
or one effort. It may be that there is a particularly sensitive or complicated issue that is
not likely to be completely addressed by a survey. Qualitative methods can also be used
when identified areas of interest are difficult, if not impossible, to address in a survey
format. And in some survey efforts, it becomes clear that, to best communicate with the
client or ultimate survey end users, qualitative data will be needed to bring the quantita-
tive findings to life. In this case, focus groups can be conducted as the survey is going on,
using the opportunity of face time with respondents to collect more detailed, narrative
data on issues that emerge in the survey data. It can be incredibly powerful in a final
report—​especially one for a nontechnical audience—​to not only present quantitative
survey findings, but also illustrate those findings with sound bites or video clips, which
breathe life into numbers.
In addition, when a project timeline is so tight that qualitative testing cannot occur
before a survey launches, conducting focus groups or interviews while a survey is
fielding can still be useful. Depending on the survey methodology, it may be possible to
make adjustments in the field or, at the very least, to make notes on adaptations that may
need to be made in future waves of survey fielding.

After a Survey Has Been Administered


Finally, just because a survey has been fielded and the data collection is complete, that
doesn’t mean it is too late for qualitative methods to be valuable. While survey findings
allow researchers to generalize their findings to the larger population, survey results
do not always paint the full picture for the ultimate audience of the data. Qualitative
methods can be used in three primary ways in the wake of a survey effort:

1) To explain findings. Survey designers work extremely hard to ensure that their
questions are clearly worded and that all respondents will be answering questions
in the same way. However, survey questions still may not be able to capture all of
the nuances of an issue, or a survey may have resulted in a surprising or unex-
pected finding that baffles researchers. This is a key opportunity to conduct focus
groups or interviews centered on those unexpected findings—​you may learn that
respondents think about the subject in a wildly different way than survey designers
expected, or that there is another issue confounding results. Qualitative research
can also offer an opportunity to advance and extend theory; when an existing ex-
planation of a phenomenon is not borne out in the survey results, the stories and
data gathered in qualitative research can help researchers develop new theory.
2) To add depth to the reporting. As mentioned previously, focus group and interview
findings can illustrate and add color to final reports and presentations. No matter
how well done a survey is, if the audience cannot process the findings or does not
see how survey data relate to their practical issues, all the effort of collecting survey
data will be wasted. If using qualitative data to augment survey findings allows
survey data to reach a wider audience or have a greater impact, it is well worth the
effort.
3) To identify next steps based on survey findings. Some surveys are intended to iden-
tify problems or measure satisfaction of customers or constituents, but once that
problem or level of unhappiness has been determined, what next? Survey sponsors
may believe that they know how to respond to a problem, but it may be wiser to
make sure that any steps you take in response to survey findings will truly re-
spond to the problem. Qualitative research is ideal for this step: focus groups and
interviews allow researchers to learn more details about how people feel about spe-
cific problems, or their thoughts on how they might respond to possible solutions.

Selected Qualitative Methods

Just as there are endless ways to design and administer surveys, there are multitudes of
ways to structure and implement qualitative research. This section reviews three pri-
mary qualitative methodologies: focus groups, IDIs, and cognitive interviews. These are
not the only ways to conduct qualitative research, but they may be of the most use to
those who do primarily survey research. Different elements of these methods can be
customized in different ways, depending on what the research questions require, so an
understanding of these methods will allow researchers to use what will be most benefi-
cial to a particular project.

Focus Groups
Focus groups are probably the most well-​known form of qualitative research. This can
work in their favor; most clients, researchers, and potential respondents are familiar
with both the concept and structure of focus groups, so a focus group is a recognizable,
understandable way to collect data and will require little explanation. However, the va-
lidity of focus groups and the findings that emerge are sometimes questioned, so it is im-
portant to know when focus groups are appropriate and what kinds of qualitative data
they are best positioned to obtain (Krueger and Casey 2009).

What is a focus group? At its most basic, it is a small group of people assembled in a
room with a moderator, who leads a discussion about a specific set of topics or issues.
Beyond that, the specific structure can vary widely. Participants can be asked to sample
products, review ad copy or marketing materials, or discuss more abstract concepts or
issues; specific group structure and content will be determined by the research goals
and (if focus groups are complementing a quantitative survey effort) when in the survey
process the groups take place. Groups generally last between one and two hours and
often take place at dedicated focus group facilities, but time frame and location can be
flexible—​all that is really necessary is a quiet room and a table for people to sit around,
and online focus groups may not even require that. Most often groups include between
six and ten participants and are led by moderators who have been specifically trained
to conduct focus groups. As a general rule, it is wise to conduct between three and six
groups in any one location, to ensure that your findings are not due to the composition
of any one group. However, all of these elements can be adjusted based on need and re-
sources, which is one reason that focus groups are such a popular way to collect data.
More details on the specifics of developing and implementing a focus group project are
provided in the best practices section of this chapter.
When should one use a focus group? The primary purpose of focus groups is to learn
how people feel about a subject, issue, experience, or product. A typical survey may
only have the respondents’ attention for a few minutes, and people may be completing a
survey while doing other things; this can be effective when the primary goal is to collect
factual data on how people have behaved in the past or decisions they may have made.
However, when the goal is to learn what people believe about an issue, or what kind of
emotional response people have to something, more than a few minutes of their time
will be needed. In a focus group, respondents are a captive audience for an extended
period of time, and there is a moderator present who can probe into specific questions
until respondents have answered the question and the researchers have acquired the
level of detail they are looking for. A skilled moderator understands body language, voice
intonation, and other cues that may hint at opportunities for follow-​up and additional
probing, something that is not possible with a survey. The group setting also allows
respondents to discuss ideas among themselves and to build on each other’s ideas, often
providing richer information than a single person might provide. It is these factors that
allow focus groups to produce detailed data that go beyond a yes/​no or Likert scale an-
swer, delving into the details of what people are thinking and feeling. Focus groups are
also ideal when time or resources are very limited. A series of focus groups can be put
together in a matter of weeks; all that is needed is a discussion guide, a moderator, a
room to conduct the group, and a small number of participants. And while national
issues might involve traveling to multiple locations to ensure that regional differences
are accounted for, many focus group projects will not involve any travel at all.
However, focus groups are not appropriate for everything, and there are a few key
elements to keep in mind when considering whether to use focus groups to collect data.
First, as discussed previously in the chapter, qualitative data are not intended to be gen-
eralizable to a larger population. It is easy to get caught up in discussions of sample
sizes when planning focus groups: “If we do 10 groups, with 10 people in each group,
we’ll have an n of 100!” While it is important to do more than one group in a study, no
matter how many groups one does, focus group participants are not randomly selected,
and the data will never be able to speak to an entire population. Rather than focusing
on the total number of participants involved in a focus group project, it’s better to focus
on saturation; in other words, when focus groups in a market begin to repeat them-
selves and no new findings are coming out of each group, then enough groups have
been conducted. Another thing to keep in mind when planning focus group projects
is that traditional focus groups are not ideal for generating ideas; for brainstorming or
idea generation, researchers should use a methodology more specifically focused on
facilitation or ideation. Finally, people are not very good at speculating about what they
might do in the future. To maximize what people will be able to discuss in knowledge-
able ways, stick to focus group research plans that center on what people feel about an
issue or product.

In-​depth Individual Interviews


Another classic qualitative methodology is the IDI. While it is possible to do dyads
or triads, in which two or three respondents interact with a moderator/​interviewer,
IDIs are traditionally conducted one on one and, as Marshall and Rossman described
in Designing Qualitative Research (2014), can be considered a “conversation with a
purpose.”
Similar to a focus group of one, IDIs involve a participant having a face-​to-​face dis-
cussion with a moderator, who leads the conversation and ensures that all the relevant
topic areas are addressed. Also as in focus groups, the benefit of an IDI is that the re-
searcher will have the undivided attention of the respondent for an extended period of
time (IDIs can generally be shorter than focus groups—​often forty-​five minutes to an
hour—​since content can be covered much more quickly in a one-​on-​one session). This
allows time to delve deeply into an individual’s experiences and feelings about an issue
or topic.
If IDIs are so similar to focus groups, why would a researcher choose to use an IDI
instead of a focus group, which would allow for more participants? A one-​on-​one data
collection effort may be preferable in a few key situations:

• When the subject matter is extremely sensitive, respondents may not feel com-
fortable discussing the issue in front of other participants. This may be espe-
cially important to consider in small communities, where respondents may know
each other.
• If a respondent’s experience is likely to be very individualized, then an IDI may
be preferable. For example, asking detailed questions about someone’s interaction
with the medical system during a hospital stay might be better accomplished in an
IDI, rather than in a focus group, where it would be difficult for each individual to
tell the details of his or her story. IDIs also allow a moderator to adapt questions and
topics as appropriate for each respondent.
• Similarly, if the goal of the qualitative data collection is to get detailed feedback on a
large amount of information, IDIs may make it easier to go through text or images
and hear the participant’s thoughts on a point-​by-​point basis.
• For some populations, it may not be feasible to gather multiple participants into
focus groups, making IDIs a necessity. This could be the case when there is a very
low incidence of the target population in an area or for “elite” groups such as sur-
geons, who may be difficult to schedule together for focus groups.

Cognitive Interviews
Cognitive interviews are a specific type of IDI that is typically used to test survey
questions with respondents in order to identify potential sources of response error. In
the last few decades, cognitive testing has been increasingly recognized as a best prac-
tice in survey question design. It is used extensively in the design of federal surveys, as a
means of helping to ensure survey instruments collect statistically valid data.1
The practice of cognitive interviews is based on a cognitive psychological model of
the survey response process as four stages (Tourangeau et al. 2000):

• Comprehension: Respondents must first interpret and understand the question.
• Retrieval: Respondents then search their memories for information relevant to an-
swering the question.
• Judgment: Respondents must evaluate that information to see if it’s sufficient to an-
swer the question (or if they can infer the answer from what they do remember).
• Reporting: Respondents must map their internal response to the format required
by the survey (e.g., “Do I agree or strongly agree?”). Respondents may also self-​
censor at this stage, choosing to give an answer they feel is more socially acceptable.

The stages may happen so quickly that the respondents are not conscious of each, and
not every respondent goes through all four stages. Some may take a “shortcut” and
simply process enough to generate a plausible response. This is known as “satisficing”
and may involve simply picking the first or last response they hear (primacy or recency
effect), choosing the first acceptable response (acquiescence), or selecting the same an-
swer for each item (straightlining) (Krosnick 1991).
Despite the limitations of the model, conceiving of the survey response process as a
cognitive process allows researchers to identify potential problems before a survey is
fielded. Understanding a respondent’s thought processes while answering survey items
allows researchers to identify

• instructions that are overlooked, difficult to understand, or missing important
information needed by the respondent;
• unknown terminology or vague wording that needs to be clarified or defined for
respondents;
• questions that ask respondents for information they simply don’t have;
• question wording that is unclear or that is interpreted differently by different
respondents; and
• unclear or incomplete response options.

One of the strengths of cognitive testing is that it can reveal issues with seemingly
straightforward questions.

Cognitive Interviewer:  Have you ever had an alcoholic drink?
Respondent: No.
Cognitive Interviewer:  Tell me more about your answer.
Respondent:  Well, I’ve never really liked liquor. I tried it once or twice, and it made
me sick as can be. So, I just stick with beer.

This exchange reveals two issues with the question: the definition of “alcoholic drink”
and what it means to have “ever” had one. This respondent apparently limits his concep-
tion of alcohol to “liquor,” and so excluded beer from his answer. He also assumed that
this question was asking about frequent or ongoing consumption, and so excluded his
one or two failed attempts at drinking liquor. Based on this finding, researchers may rec-
ommend adding a definition of “alcoholic drink” and adding a threshold to the question
wording, to make it clear what respondents should include when answering. The new
question wording might read, “Have you ever had an alcoholic drink, even just a sip or
taste? By alcoholic drink, we mean . . . .”
Interviewers focus on the respondents’ process of answering survey questions by
observing how they interact with the instrument and by asking follow-​up questions,
known as probes. Interviewers may ask probes as a respondent moves through the in-
strument, known as concurrent probing, or after the respondent has completed the
questionnaire, known as retrospective probing. In addition, sometimes “think-​aloud”
probing is used, in which respondents are asked to verbalize their thoughts—​literally,
thinking aloud—​as they answer the questions. Interviewers simply remind them to “tell
me what you are thinking” if they fall silent.
Probes are often structured, designed ahead of time to focus on areas that researchers
suspect may be problematic. The probe “Tell me more about your answer” is a common
one, since it can reveal both anticipated and unanticipated issues. In addition,
interviewers may ask spontaneous, “emergent” probes in response to what the re-
spondent has reported during the interview. In the example above, the interviewer may
decide to ask, “In your own words, what is an alcoholic drink?” in order to explore the
unforeseen notion that everyone might not define “alcoholic drink” in the same way.
Often, cognitive interviews also incorporate some type of usability testing, which
looks at how a respondent interacts with the design (particularly visual design) of an
instrument. Usability testing may explore whether formatting and design elements
appropriately cue respondents regarding how to navigate the instrument. In addition,
interviewers may use other techniques, such as vignettes, card sorts, or rating tasks.
For instance, if it is difficult to recruit respondents with the desired characteristics,
respondents may be given a hypothetical situation, called a vignette, and asked to an-
swer questions in light of that vignette.
Interviews are typically about an hour long, although they may be as short as thirty
minutes or as long as ninety minutes, depending on the length of the questionnaire.
Cognitive interviews can be conducted in person, by telephone, or via an online plat-
form. The mode is determined by many factors, including the following:

• Mode of the survey. If a survey is paper and pen, it may make the most sense to
conduct the interviews in person, so that the interviewer can observe how the re-
spondent is interacting with the instrument. Conversely, a telephone survey might
be best tested over the phone, since that more closely mimics how the survey will actually be administered.
• Stage of survey development. In earlier stages of survey development, it may be best
to conduct cognitive interviews in person, even if the final mode is a telephone
survey. It is easier for interviewers to build rapport in person, and they are also able
to observe nonverbal cues (such as a confused expression or a flash of annoyance)
and follow up on them.
• Recruiting constraints. In some instances, the challenges of recruiting may necessi-
tate using a particular mode. If the survey is being tested with surgeons who per-
form a rare procedure, it may not be feasible to interview them in person. In that
case, an online platform or telephone interview makes it possible to reach respondents
who are geographically dispersed or to schedule interviews on short notice.
• Costs. A limited budget may necessitate selecting a less expensive mode of
interviewing. Limiting interviews to the local area, or conducting interviews via
telephone, can avoid expensive travel costs.

Ideally, cognitive testing is done iteratively, so that researchers can identify issues, re-
design or reword questions, and then test the revised questions. A first round of testing
might be done in person, with concurrent probing. This round might identify major
issues with unclear instructions, misleading question wording, and missing response
options. Based on the findings from the first round, the researchers will clarify the
instructions, revise the question wording, and add needed response options. A second
round of testing, perhaps done over the phone with retrospective probing, can confirm
that the revisions both addressed the original problems and did not introduce new ones.
While there is no clear evidence about the ideal number of cognitive interviews,
historically cognitive testing has been completed with a relatively small number of
respondents. Ideally, interviews are conducted until no new issues are revealed, a con-
cept referred to as saturation. However, the number of interviews is often constrained
by practical concerns, such as costs, recruiting challenges, or a tight timeline. Even a
handful of interviews can identify issues that would have seriously impacted response
error. As in the case of the example of the beer-​drinker above, even one interview reveals
that the original question wording, if used in a survey, might have produced an artificially
high number of self-reported teetotalers.
For an extensive discussion of the methodology of cognitive interviews, see Willis
(2005), Miller et al. (2014), and Collins (2015).

Best Practices

Research Plan
Before beginning qualitative research, it is important to have a research plan in place
to ensure that the research design allows you to collect the data you need. The more
detailed the research plan, the more likely it is that the research will be successful at
capturing the information you want. There are several steps in developing a comprehen-
sive research plan:

• Clearly articulate your research questions. What do you want to know at the end
of the project? Your research questions will guide your decisions about the other
components of a research plan: what method you use, where and with whom you
conduct your research, how you conduct your analysis. The more clearly your
research questions are stated, the easier it will be to develop a research plan that
answers your questions.
• Select a research method. What method will have the best chance of gathering the
information you need? For instance, if your research question is exploratory—​
What makes someone likely to vote?—​then a focus group is probably the best
approach. If the goal is to ensure that a survey instrument gathers accurate data, then
cognitive interviews are the better choice.
Keep in mind that you may want to incorporate multiple methods in your plan.
A focus group can allow you to explore an unknown topic and give you a sense of
what questions should be asked in a survey. Cognitive testing can then allow you to
refine the questions to ensure they are asking what you think they are asking.
• Determine the number of interviews or groups you want to conduct. The numbers
will depend on many factors, including the diversity of respondents you need to
recruit, the geographic coverage you hope to achieve, any demographic variables
you need to meet, and as always, the constraints of your timeline. Ideally, for both
interviews and focus groups, you will continue until you no longer discover new
things—​you’ve reached saturation. The number of interviews or groups required to
reach saturation will vary based on the diversity of the target population. For focus
groups, a very basic guideline can be to conduct three to six groups in each specific
segment of your target population that you have identified as being of interest to
your research. If you’re interested in how white voters in a population differ from
Hispanic voters, you may want to conduct three to six groups with white voters
and three to six additional groups with Hispanic voters. For cognitive interviews,
the number of interviews should be determined by the complexity of the question-
naire. If your survey looks at voting patterns by party affiliation, you need to have
a sufficient mix of voters and nonvoters across parties to ensure all questions are
adequately tested. Typically, you also want to conduct enough interviews to ensure
that your respondents are demographically diverse, in terms of age, gender, race/​
ethnicity, and education levels.
• Decide where, when, and how long. You will need to decide where to conduct the
research. This will be primarily decided by the type of respondents you need. For
example, if you want to explore how St. Louis area residents view their local police,
you’ll need to conduct the interviews in St. Louis. Alternatively, if you want to en-
sure geographic diversity, you may want to select several sites across the country. If
you anticipate that geographical differences are not a factor, or if you have a limited
budget or timeline, you may decide to conduct the research locally to limit costs
and expedite the process. Scheduling is also an important consideration. For gen­
eral population studies, evenings or weekends are typically better. However, for
some special populations, weekday daytime groups/​interviews may work best. You
also need to consider the length of the interview or group. The length of the group/​
interview should be determined by the material that needs to be covered, but you also
want to consider the burden you’re placing on participants. Typically, individual
interviews are an hour or less, while focus groups are two hours or less.
• Consider cultural and linguistic issues. Depending on your research, you may need
to conduct your groups or interviews in a language other than your own. To do this,
you will need experienced interviewers or moderators who are bilingual—​able to
communicate in the target language for the interviews or groups, as well as able to
clearly communicate in the language of the analysts so that they can report back
findings. If you are conducting research in multiple languages, keep in mind that it
is not necessarily sufficient to have English-​language materials translated into the
target language(s); linguistic and cultural differences must also be taken into account.
An experienced bilingual moderator or interviewer
can assist you in adapting your protocol appropriately.
• Develop an analysis and reporting plan. Before you begin your research, have your
analysis plan in place. Knowing ahead of time how you will conduct the analysis
and how you will report results will ensure that your data collection provides the
information you need in a format that will work best. It’s also important to allow
time in the schedule for this stage. Depending on the scope of the research and
the format of the report, it can take a significant amount of time. Keep in mind
that analysis and reporting is a separate activity from conducting interviews or
moderating groups. While they are frequently conducted by the same individuals,
they do require a separate skill set. Depending on your team, it may make sense to
have different researchers conduct each phase.

Identifying Respondents and Recruiting


A typical quantitative study selects a sample of respondents that will ensure results can be
generalized to the larger population. Qualitative studies, on the other hand, use purposive
samples, in which respondents are selected based on the purpose of the study and knowl­
edge about the characteristics of a population. The first step in any recruiting project is to
identify what kinds of participants are needed. Some recruiting is based strictly on demo-
graphics (“women between the ages of thirty-​nine and fifty-​four”), and some projects may
require people who have certain experience or background (“people who use iPhones”).
A recruit may also have primary and secondary goals: the initial screening criteria may be
whether or not the individual uses an iPhone, but the client would like a mixture of ages
and genders as well. Just as important is to think about what kind of participants will not be
appropriate for the study. For example, a federal client might want federal employees to be
excluded from focus groups. Researchers and clients should think about specific charac-
teristics or situations that could come up in the study and establish during the initial proj­
ect planning stages what should qualify or disqualify someone from the study.
Once a project’s screening criteria have been established, they can then be used to
create a screener, or the standardized questions that potential participants will be asked
to determine if they qualify for the project.2 In general, a good screener will include four
categories of questions (a brief sketch after the list shows how these might translate into
screening logic):

• Past participation/​conflict of interest questions. If there are specific criteria that will
rule out a participant, such as recent participation in a group or employment in a
certain industry, those questions should be asked at the beginning of the screener.
• Demographics. If the recruit is being drawn from an existing database, basic dem-
ographic data may already be available, but these can be confirmed and any addi-
tional, project-​specific items can be asked.
• Project-​specific characteristics. If the recruiting is focused on individuals with spe-
cific behaviors or characteristics, specific questions on those topics need to be in-
cluded. However, be aware that screening questions can alert respondents about
the topic that will be discussed during the group or interview. This is not necessarily
a problem, but researchers should be aware that participants may be influenced by
this foreknowledge. For instance, asking during screening whether a participant
is familiar with a specific candidate may motivate the participant to research that
candidate beforehand.
• General willingness to talk. A painfully shy or reticent participant is not going to be
of much help in a focus group or interview, so if the screener is being done over the
phone, it can be an opportunity to ask an open-​ended question and see whether the
individual is responsive and articulate.
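
To make these four categories concrete, the short sketch below shows one way the screening logic might be expressed in code. It is purely illustrative: the language (Python), the question names, the six-month exclusion window, the iPhone criterion, and the age quota are all assumptions introduced for this example rather than features of any particular study.

# Illustrative only: a minimal screener that walks through the four categories
# of questions described above and returns whether a candidate qualifies.

def screen_participant(answers, group_counts):
    # 1) Past participation / conflict of interest: hard disqualifiers come first.
    if answers["months_since_last_study"] < 6:
        return False, "participated in another study within the last six months"
    if answers["works_in_market_research"]:
        return False, "conflict of interest"

    # 2) Demographics: confirm basics and enforce a simple mix-of-ages quota.
    age_cell = "under_40" if answers["age"] < 40 else "40_plus"
    if group_counts.get(age_cell, 0) >= 5:
        return False, "quota for this age group already filled"

    # 3) Project-specific characteristics: the behavior the study requires.
    if not answers["uses_iphone"]:
        return False, "does not use an iPhone"

    # 4) General willingness to talk: recruiter's judgment from an open-ended question.
    if not answers["recruiter_rated_articulate"]:
        return False, "too reticent to contribute in a group or interview"

    return True, "qualifies"

# Example: one candidate checked against the running recruitment tally.
tally = {"under_40": 2, "40_plus": 5}
candidate = {"months_since_last_study": 12, "works_in_market_research": False,
             "age": 29, "uses_iphone": True, "recruiter_rated_articulate": True}
print(screen_participant(candidate, tally))

In practice the same checks are usually embedded in a recruiting database or administered verbally by trained recruiters; the point is simply that each category maps onto an explicit qualify-or-disqualify rule.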

Once the desired characteristics have been identified, researchers need to decide
whether to manage the recruiting themselves or use an external recruiter. The ben-
efit of internal recruiting is that the researchers can provide very specific recruiter
training and oversight and be very involved in the recruiting process and the selection of
respondents. However, an external recruiter, such as those associated with a focus group
facility, will likely be familiar with the local population and may have insight into how to
recruit hard-​to-​reach populations. The final decision will likely depend on the location
of the planned research, the difficulty of the recruiting, and the available research staff
capacity. Issues of data security and privacy may also determine whether recruiting can
be handled by an organization outside of the research team; ethics in qualitative research
are discussed later in this chapter, but should always be kept in mind in situations, like
recruiting, when information on specific individuals is involved.
Whether recruiting is handled by the researcher or by an external company, the re-
searcher needs to keep several things in mind before and during recruiting:

• Matching the recruitment strategy to the population. Qualitative participants can be
recruited from any number of places: GoogleAds, Facebook, LinkedIn, Craigslist,
existing recruitment databases, special interest groups, customer lists, and so
forth. Researchers should ensure that the outlets used match the population of in-
terest (e.g., online resources may not be appropriate for low-​income or illiterate
populations) and that a variety of potential respondents will be reached.
• Training recruiters. Researchers rely on the recruiters—​the people who will actually
be calling or emailing respondents—​to weed out individuals whom they feel might
be lying about their qualifications or will be difficult or nonresponsive participants.
Recruiters should be trained to ensure that they have a clear understanding of what
is needed for a particular study and what an ideal participant is and, if more than
one recruiter is involved in a project, to ensure consistency across the recruiting
efforts.
• Monitoring/​adjustment of recruiting as needed. Research project staff should re-
quest regular status updates. If recruiters are having difficulty finding participants,
or if there are persistent questions about qualifications, the screener or incen-
tive may need to be adjusted to ensure that enough participants can be recruited.
Researchers and recruiters should be in close enough contact that these kinds of
adjustments can be made before the project is put at risk.
• Following up with participants. Researchers should also be extremely clear with
recruiters about the follow-​up strategy that will be used with participants once
they are recruited. Recruiters should follow up by mail or phone to ensure that
participants will attend. Respondents also need to be given specifics on directions,
parking, and any information they will need to bring with them to the facility.

On a final note, some clients may be concerned about “professional respondents,” or
people who participate in focus groups regularly and may not represent the “average”
respondent. One way to address this is to exclude participants who have recently
participated in another focus group, interview, or market research study. This could be
defined as participants who have participated in another study within the last three or
six months, but this can be adjusted based on the topic of the study (a researcher doing
one-​on-​one web usability testing may not care that someone recently participated in
a political focus group), how difficult the recruit is going to be, or other factors. Using
an external recruiter can also help weed out these respondents, since recruiters should
maintain records in their database of when an individual last participated in a study.
And just because a participant has been in groups or interviews before, that doesn’t
mean he or she will not react in an honest way to questions. A trained recruiter can
also help identify participants who appear to be “professionals” or are responding in
questionable ways.

Incentives
The incentives provided to qualitative research participants are generally significantly
larger than the token incentives of several dollars that might be sent to a survey re-
spondent. Participating in a qualitative study requires significantly more effort from
the respondents than completing a survey. Researchers are often asking participants to
come to a specific location. Even when technology is used for online focus groups that
allow participants to contribute from their homes, most studies require participants to
take part at a specific time and demand more of their time (possibly up to two hours) than
most surveys do. And unlike the upfront incentives often included with a survey invi-
tation, qualitative incentives are generally provided after the interview or focus group
is over; since interviews and focus groups generally do require participants to attend
at a certain time and place, offering the incentive after data collection truly motivates
respondents to show up.
Despite the greater amounts and contingent nature of qualitative incentives, it is
still best to consider qualitative incentives as tokens recognizing the assistance that
participants have provided, rather than as payment for their time. Researchers don’t
want to create an atmosphere in which respondents could feel that they are being paid
for their opinion or, even worse, paid to have a particular opinion. And if a respondent
chooses to leave a focus group or an interview before it is complete, as respondents are
always free to do, the incentive must still be provided; an incentive should not be used to
coerce respondents to stay or to pay them only for a completed job. Rather, researchers
should approach incentives from the position that respondents are offering opinions
and experience to help with a research effort, and that incentives help remind them to
keep their appointment and thank them for the effort required to participate in qualita-
tive research.
The specific incentive amount can vary widely depending on location and popula-
tion. Unlike incentives for surveys, there has not been systematic research into the ideal
amount of an incentive for qualitative research. Instead, researchers typically deter-
mine the incentive amount based on the local market—​focus groups conducted with
teenagers in a rural area may offer incentives of $40, while $200 or more may be needed
to recruit medical doctors in a large city. An advantage of working with a local facility
or recruiters is that they will know the going rate in that location for the population.
Incentive amounts also may be limited by outside factors; for example, some govern-
ment or private organizations will put limits on the levels of incentives that can be pro-
vided. And while it is important to identify the incentive level upfront, incentives can
also be adjusted if recruiting proves to be difficult. As a general rule, the lower the incen-
tive, the more difficult it is to quickly recruit the targeted population.

Developing the Protocol
At the center of every qualitative project is the protocol, sometimes also called the
moderator’s guide, the discussion guide, or the script. This document establishes the
structure of the group or interview, details the specific questions that participants will
be asked, and ultimately determines the kind of data that will be collected. It is also the
guide that all of the parties working on the project—​whether clients, an IRB, or the
moderator—​will be using as they make their decisions and collect data.
When developing a protocol, it is important for researchers to articulate specific re-
search goals. Given the nature of qualitative research, it can be tempting to proceed with
only a general sense of what one wants to know. However, as discussed in the section
on developing a research plan, it is essential that focus group and interview protocols
be developed around very specific research questions. If the research goal is only stated
vaguely as, “We want to know what people think about ostriches,” it is not clear whether
the research should gather opinions on ostriches as an attraction at a zoo, a mascot for
a new sports team, or a meat source. It will be much easier to develop a useful (and
relevant) protocol if the researcher is able to say, “We want to know how people think
ostriches relate to their lives, how they would react to an ostrich in their home, and what
we could do to make them happy to have an ostrich.”
In terms of structure, focus group and IDI protocols generally use an inverted pyr-
amid construction, in which discussion starts with broad topics and gradually narrows
down to focus on key, specific questions. This allows participants to start off talking
about more general subjects that are easy to offer opinions on. Later in the group, after
participants have grown more comfortable talking about their opinions and have had
some time to think about the subject matter, the questions can become more detailed.
Cognitive interview protocols typically follow the structure of the questionnaire that is
being tested.3
Ideally, a moderator or interviewer guide will be developed collaboratively with the
client, the research team, and the moderator or interviewer who will be conducting the
group or interview. Even in cases in which a moderator or interviewer does not have the
same content background as the research team, it is still helpful to involve that person
in protocol development. The moderator/​interviewer can offer expertise on what sorts
of questions or activities will and won’t work and may be able to suggest creative ways to
ask questions.

Conducting the Focus Group or Interview


When at all possible, interviews and focus groups should be conducted by a professional,
trained qualitative researcher. Asking questions off a discussion guide may look simple, but
building rapport with respondents; knowing how to manage a discussion (particularly when
respondents may get off topic); and knowing how to probe respondents to get beyond flip,
surface-​level responses are all skills that require both training and experience. Having this skill
set becomes especially important when dealing with populations that may prove more diffi-
cult to work with (such as children or teens) or when the research addresses sensitive topics.
Most experienced qualitative researchers have backgrounds in psychology or another
social science, but their backgrounds may vary widely. Rather than a degree, an experi-
enced moderator more commonly will have taken specific interviewer or moderator
training and have experience with different populations and different types of qualita-
tive research. For cognitive interviews, interviewers are typically survey methodologists,
who have an in-​depth understanding of how survey questions are constructed and how
they should function. Organizations such as RIVA, the Qualitative Research Consultants
Association (QRCA), and the Joint Program in Survey Methodology (JPSM) offer courses
that can provide essential training in moderator and interviewing techniques.
Although the actual structure of groups and interviews can vary widely, there are four
areas that researchers should consider when conducting qualitative research:  issues
of consent, building rapport, managing the discussion, and conducting effective data
collection.

Consent
Before any qualitative data collection begins, the researcher should explain the purpose
of the research, notify participants if they are being recorded, explain any confidentiality
or privacy issues that may exist, and allow participants to ask questions. This is critical to
ensure that participants are fully informed and feel comfortable with the research they
are about to participate in. While some basic information about the study is generally
provided during the recruitment process, a more detailed review of the study and any
human subject issues generally takes place immediately before an interview or focus
group. This explanation will typically be accompanied by a formal consent form that
participants are asked to review and sign. (Occasionally, when a topic is extremely sen-
sitive, researchers may opt to forego documenting consent, in order to better protect the
identity of the participants.) A fuller discussion of what constitutes informed consent
appears in the ethics section below.

Rapport
Building rapport, which may seem extraneous to the purpose of the research, is abso-
lutely crucial to the success of any qualitative research. The goal is to create an atmosphere
in which respondents feel free to speak up and honestly share their thoughts. Each mod-
erator and interviewer will have his or her own personal style, but good ones listen to
respondents, give them time and space to think about their answers, respect when a
participant does not want to answer a question, and are considerate of respondents’ time
and effort. Possibly the most important element in developing rapport is maintaining
what trainers at the RIVA Institute4 call “unconditional positive regard.” Participants
should never feel judged. It is crucial that moderators and interviewers remain neutral,
no matter how outlandish a participant’s comment may be. They must always keep in
mind that they are not there to teach respondents and should refrain from correcting
a respondent who is misinformed. If, during an interview, a respondent volunteers the
opinion that menthol cigarettes do not contain tobacco, then the interviewer must re-
frain from both expressing surprise (“Really? That’s what you think?”) and correcting
that person (“Actually, menthol cigarettes do contain tobacco.”). Creating rapport and
an unconditionally positive atmosphere is essential to ensuring respondents are willing
to share the data needed for the research.
In interviews, rapport can typically be established in small talk and one-​on-​one
interactions during the discussion. In focus groups, there are typically more structured
ways of building rapport. Moderators often facilitate rapport through icebreakers, or
initial questions or activities intended to get participants talking. Clients or observers
sometimes dislike icebreakers and introductions, believing that they waste time that
could be used to discuss the study subject. However, establishing rapport at the begin-
ning of a group is a simple means of ensuring good-​quality data can be collected later in
the group.

Managing the Group/​Interview
Establishing rapport also makes it possible for the moderator or interviewer to success-
fully manage the group or interview. Keep in mind that a participant has been asked
to talk to a stranger (or in the case of a focus group, a room full of strangers) about a
random subject that may or may not be of interest to the participant. It is quite easy
for the discussion to veer off into unrelated tangents. A moderator or interviewer will
often need to redirect the conversation without stifling it. It is important to keep four
key factors in mind:

• Time. Discussion guides usually cover a lot of material in a short time. While one
of the benefits of a focus group or interview is the ability to delve deeper into topics
when warranted, moderators and interviewers also need to be monitoring how
much time is left, to ensure that the key issues are all addressed.
• Keeping participants’ attention. Moderators and interviewers also need to be able
to judge when the group or the individual respondent is tiring of a topic and needs to
move on to another. Typically, data collection should be broken up into fifteen- or
twenty-​minute pieces focusing on different topics or different activities, so that
participants do not begin to lose interest or run out of things to say.
• Drawing out quiet respondents. Some respondents will inevitably be more reticent
than others. These respondents may need more encouragement to contribute to the
conversation. In interviews, a technique called “living with the silence” can be par-
ticularly effective. If an interviewer simply remains quiet (while conveying through
eye contact and body language that he or she is engaged and eager to hear the
respondent’s contributions), then respondents are more likely to try to fill the si-
lence. In groups, a moderator should keep mental notes on who hasn’t contributed
frequently in a group and try to draw them out. Calling on participants by name
or referring back to something they said earlier and asking them to expand are
common ways to address this.
• Redirecting overbearing respondents. There will also inevitably be some respondents
who overwhelm the conversation. In a focus group, moderators need to be comfort-
able with gently directing attention away from these respondents or simply indicating
that they’d like to hear from someone else. Overbearing respondents can be especially
tricky in one-​on-​one interviews, since there are no other respondents to enter the
conversation. Interviewers will need to be prepared to gently cut off extraneous and
irrelevant conversational tangents. Reminders of time constraints can be one method
of doing this, for example, “Thank you for sharing. I do want to move on to the next
question. We have a great deal to cover and I want to make sure we end on time.”

Asking Questions
Data collection will center around the well-​planned, research-​driven protocol developed
at the beginning of the project. However, one of the key benefits of qualitative research is
the ability to go “off script” when necessary. If a participant says something particularly
interesting or relevant, the moderator or interviewer can follow up on the comment and
ask additional “spontaneous” questions to learn more. While many of the questions to be
asked in qualitative research can be crafted ahead of time and included in the discussion
guide, spontaneous probing relies on the moderator or interviewer. Spontaneous probes
must be nonleading, so rather than asking “And did you like that?,” it would be better to ask
“And how did you feel about that?” to avoid “leading” the respondent to answer in a par-
ticular way. The need for spontaneous probing is another reason that it is ideal to include
the moderator or interviewer in discussions on the overall research goal and the develop-
ment of the protocol. If the moderator/​interviewer has a clear understanding of the ulti-
mate aims of the research and the issues of particular interest to the client, it will be easier
for him or her to know which topics are worth probing on and which are not.

Observers
One of the strengths of qualitative data collection is that it allows for observers. (It
is, of course, important that participants be informed of the presence of observers.)
Observing a focus group or cognitive interview can be a compelling experience for
both clients and the research team. Watching a group react to campaign materials or
an individual answer a potential survey question can provide insights that can be diffi-
cult to gain from a secondhand report. Focus group facilities and cognitive testing labs
typically offer an observation room that allows live observation, through either a one-​
way mirror or cameras. When clients and researchers observe in real time, it also gives
them the opportunity to have the moderator or interviewer ask unplanned, follow-​up
questions based on what they heard during the interview or group. This is generally
done through a “false close” at the end of the interview or group, in which the moderator
or interviewer briefly leaves the room and checks in with the observers to see if there are
any additional questions before the respondents are paid and released.

Analyzing and Reporting Findings


Although qualitative studies typically deal with smaller sample sizes than quantita-
tive studies, it is still important to carefully consider and plan how to manage the data,
conduct analysis, and prepare the report. Because of the nature of the data—​lengthy
descriptions or even verbatim conversations—qualitative data can be “unwieldy.”
Rather than a series of (relatively) tidy responses to a series of questions, there will be
more descriptive data. An entire paragraph may be needed to describe a respondent’s
issue with a particular survey item during cognitive testing, or five pages of a transcript
may capture a convoluted (and disorganized) discussion of a topic of interest.

Data Management and Organization


Because of the nature of qualitative research, the data collected will typically be in the
form of a plethora of words. As a general rule, interviews or focus groups are recorded
using either video or audio recording. Focus group facilities and cognitive testing labs
are typically equipped with video and audio recording capabilities. In other settings, a
digital recorder can be employed by the moderator or interviewer.
Video recording has the advantage of capturing nonverbal expressions and can be
helpful in identifying the conversational dynamics of a focus group, since it may be dif-
ficult to distinguish between different speakers in an audio-​only recording. However, it
is much more difficult to preserve respondent confidentiality in a video, and so in some
cases an audio recording may be preferable.
Since audio recordings can be unwieldy, researchers typically convert them into
written form. This may include a verbatim transcript of the group or interview, or it
may be a collection of notes that capture salient points and relevant quotations from
participants. If a recording is not practical, either for privacy concerns or technical lim-
itations, notes can be taken during the discussion by the moderator, interviewer, or a
note taker.

However the data are captured, there should be clear procedures for labeling
and storing files so that each one can be clearly identified. For instance, the file-
name “Group 2” could be problematic if multiple moderators use it to name their
second group. The filename “Young Adult Males, 05/​06/​15, 4 pm, Memphis, TN”
is much more descriptive. For interviews with individuals, it is important that the
filename (along with the contents of the file) not contain any personally identifiable
information.
Storage is also an important consideration. It is important to balance accessibility for
researchers (particularly when researchers are at multiple locations or organizations)
with security. While cloud storage is easily accessible, it may not provide sufficient se-
curity to ensure protection of the data. It’s also important that the storage be protected
against data loss—​backups can ensure that an unfortunate computer crash doesn’t wipe
out all of the research data.

Analysis
Once the data have been organized, analysis and reporting can begin. Qualitative anal-
ysis can be described as having three general stages:

• Reviewing the data. The analyst begins by reading and re-​reading the data, whether
in the form of notes or transcripts, to get a sense of the scope of the data. The analyst
may want to review audio or video recordings as well.
• Identifying themes and patterns in the data. The analyst then begins the process of
identifying themes and patterns that are apparent in the data. These patterns and
themes can be clustered into broader categories. For example, in a focus group
discussion of physical activity, mentions of the demands of work, child care, and
time spent sitting in traffic may all be clustered together as “time constraints” that
limit activity. This stage requires an open mind. While it can be helpful to have
hypotheses before beginning the research, it is important to not limit this process
to things one expected to see.
• Coding. Coding is a means of systematically classifying the entire data set by the
themes and patterns that have been identified. Each theme or pattern becomes a
“code,” and the data are tagged with the relevant codes. This allows the data to be
sorted by different themes and makes it easier to spot the relative pervasiveness
of a particular theme throughout the data. Coding qualitative data is often struc-
tured around a grounded theory approach, which is a methodology that focuses
on collecting qualitative data first, then identifying key points in the data through
coding, and finally identifying concepts and categories that can offer explanations
(Glaser and Strauss 1967).

It is also possible to code data based on the attributes of the respondent—​this can
be particularly helpful if the researcher suspects that there are gender differences in
views on a particular topic. Coding respondent comments by gender can easily allow
one to see if there are patterns by gender, and what those patterns are. When coding
is completed by multiple researchers, it is also important to consider reliability: different
coders must apply the codes consistently, in the same way. Intercoder reliability can be
calculated using percent agreement or other statistical tests (Lombard et al. 2002).
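To make the calculation concrete, the sketch below computes percent agreement between two coders over the same set of excerpts; the code labels and values are hypothetical, and a chance-corrected statistic such as Cohen's kappa could be reported instead where appropriate.

    # Hypothetical code assignments by two coders for the same ten excerpts.
    coder_a = ["time", "cost", "time", "family", "cost", "time", "family", "time", "cost", "time"]
    coder_b = ["time", "cost", "family", "family", "cost", "time", "family", "time", "time", "time"]

    # Percent agreement: share of excerpts to which both coders assigned the same code.
    agreement = sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)
    print(f"Percent agreement: {agreement:.0%}")  # 80% in this toy example
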

Note that these stages are not necessarily conducted one at a time, in a strictly linear
fashion. While coding, new themes and patterns may become apparent. An insight
gained may require the analyst to re-​read other portions of the data from a new perspec-
tive. Also, coding is not always a necessary step in qualitative analysis, particularly for
smaller qualitative samples with limited data. In those cases, simply reviewing the data
and identifying themes may be sufficient.5
Qualitative analysis, at its simplest, can be conducted using word processing software
or paper and pen. Excel spreadsheets can also be a relatively simple means of sorting
data. However, for more complex projects, qualitative analysis software, such as NVivo
or ATLAS, can be a powerful tool for analysis.6 These software programs allow large
amounts of unstructured text data to be easily coded, annotated, queried, and visualized
by analysts. While they cannot replace the researchers’ analysis of themes and patterns,
they do facilitate data storage and retrieval. When evaluating software packages, it is im-
portant to consider licensing costs, security issues, and the amount of training required
to master the software.

Reporting
Reports should be targeted to the intended audience. Findings reports can range from
formal, detailed technical reports to a single PowerPoint slide; the complexity and
format should be determined by how the findings will ultimately be used. Given the
range, it’s best to establish at the beginning of the project what the report will look like.
Formal reports typically contain six sections:

• Introduction. This section introduces readers to the purpose and scope of the qual-
itative research project. It outlines general research questions and how the qualita-
tive research will be used.
• Methodology. It is important to document the methods used in completing the re-
search. What methods were used? How were data collected? How many groups or
interviews were conducted? Where did they take place?
• Participant selection. This section describes how participants were recruited and
selected. It also contains demographic information about the respondents so that
the readers have an idea of what populations were represented (or, in some cases,
not represented) in the research.
• Data analysis. Although this is often left out of qualitative research reports, ideally
researchers should document how analysis was conducted. This ensures that others
can replicate the analysis and confirm findings.
• Findings. Findings should be presented systematically, whether by theme, item (in the
case of cognitive testing), or group or interview (when each group or interview is no-
ticeably distinct). Ideally, findings should be filled with descriptive detail that gives
readers a clear picture of the complex detail uncovered during the research.
• Recommendations. In some cases, a report may include recommendations.
Cognitive testing reports frequently offer recommendations for revising survey
question wording. Focus group reports may offer suggestions on how to improve
outreach materials based on group feedback. In addition, reports may recommend
“next steps”:  what additional research may need to be completed to adequately
address the research questions.

Qualitative research reports should clearly communicate the limitations of qualitative
research. While it can be helpful to include counts (e.g., five of ten respondents felt this
way), it is important to be clear that these numbers are not generalizable to the popula-
tion at large. Just because 50% of focus group respondents shared a similar viewpoint
does NOT mean that 50% of the general population will also share that viewpoint.

Ethics

Ethical treatment of the participants in qualitative research is an important consideration.
Many organizations have an IRB that reviews studies and ensures that human subjects
are treated appropriately. If your institution has an IRB, then you need to work with it to
ensure you have the proper procedures and approvals in place for your research project.
When working with federal agencies, your project may need to be reviewed by the Office of
Management and Budget (OMB). Even if you do not have an IRB or need OMB approval, it
is key to be familiar with ethical guidelines within the field to ensure you treat participants
appropriately. As the case of the controversy over a Stanford and Dartmouth study that
involved sending a mailer to 100,000 registered voters in Montana demonstrates, even
unintentional missteps in designing research can have serious consequences (“Professors’
Research Project” 2014). That study, which was intended to test whether providing ide-
ological information about candidates in a nonpartisan election would increase voter
turnout, triggered outrage and accusations of scientific misconduct.
There are several issues that should be considered when designing an ethical research
project.

Informed Consent
The most fundamental ethical behavior and the heart of any research project is in-
formed consent: participants should know that they are part of a research process, and
it should be clear what is being studied. Anyone doing research should be aware of and
knowledgeable about informed consent and should ensure that any research project
meets these standards.7
Informed consent involves explicitly providing and explaining to the respondent the
following elements:

• A statement that this study involves research, and how that research will be used.
• A description of what the research entails for the participant, and how long partici-
pation will last.
• A description of any foreseeable risks to the participant, along with any benefits.
• A description of how confidentiality will be maintained.
• Acknowledgment that this study is voluntary, and that there will be no penalty for
refusing to participate or for ending participation during the study.
• Any compensation that may be provided.
• Information on whom to contact with questions about the research and
participants’ rights as research subjects.

The uproar about the recent experiment conducted by Facebook, in which users’ feeds
were manipulated without their knowledge, highlights the importance of informed con-
sent. While IRB procedures were followed in that case, the public outrage at the very im-
pression that consent was not received makes it clear that researchers must pay careful
attention to both the technical requirements and the public understanding of their re-
search (Ross 2014).

Confidentiality
One of the most important considerations to keep in mind is the protection of person-
ally identifiable information. Information about subjects that can be used to identify
them must be carefully protected. This includes not only names and addresses, but also
any information that could realistically be used to identify a participant. For example,
even if you withhold a participant’s name, if you reveal other unique or distinctive
identifiers (e.g., the respondent was one of the few people of a certain ethnicity in Small
Town, USA), then you have essentially revealed the identity of your participant.
For focus groups, it is important to consider confidentiality within the group. While
typically focus groups are recruited so that participants in a group don’t know each other,
depending on the subject or the market size, it is possible that friends, acquaintances,
or even relatives may end up in the same group. Be careful when asking people to re-
veal things that could affect them later. Consider asking about sensitive topics using in-​
group questionnaires, so participants don’t have to reveal this information in front of
the group.
To maintain the confidentiality of personally identifiable information, you will want
to store personal data separately from your research data. Information with personal
data (such as names, addresses) should be stored separately from your research (such as
focus group transcripts or recordings). Respondent IDs can be used as a way to maintain
the link between personal information and the data, without compromising respondent
confidentiality. In addition, you may need to redact personally identifiable information
from transcripts or notes before releasing data. Respondents sometimes volunteer in-
formation that would allow them to be identified: “I am proud to be the longest-​serving
city council member in Smithfield, North Dakota.” You also want to be conscious of the
risks of small cell sizes. If your study involves a unique population (e.g., female fighter
pilots) and you also mention a location, it may be possible for someone to identify your
respondent, breaching confidentiality.
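A minimal sketch of this separation, with invented field names, keeps identifying information and research data in two structures that are linked only through a respondent ID:

    # Personally identifiable information, stored separately under stricter access controls.
    pii = {
        "R001": {"name": "Jane Doe", "email": "jdoe@example.com"},
        "R002": {"name": "John Roe", "email": "jroe@example.com"},
    }

    # Research data carry only the respondent ID, never names or contact details.
    research_data = [
        {"respondent_id": "R001", "group": "Memphis, 4 pm", "transcript": "R001_transcript.txt"},
        {"respondent_id": "R002", "group": "Memphis, 4 pm", "transcript": "R002_transcript.txt"},
    ]

    # The link is re-established only when strictly necessary (e.g., to deliver an incentive).
    def contact_email(respondent_id):
        return pii[respondent_id]["email"]

    print(contact_email("R001"))
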
Videos are obviously much more difficult to de-identify. Even if participants
completed a release form prior to the focus group or interview, ensure videos are stored
securely and are not shared beyond the immediate audience of researchers.

Respect for Participants
At all times, ensure that the research process evidences respect for the participants.
Even if participants are receiving a monetary incentive, they are still offering up their
time, effort, and experience—​qualitative research would not be possible without them.
Respondents should feel that their input is valued, and that the researchers respect their
contributions.
With that in mind, researchers should do the following:

• End groups or interviews on time. Although it can be tempting to keep going until
every question is answered, it’s important to respect the time that participants have
committed. If a respondent has been told that participation will end at 8:00, then it
should end at 8:00. Respondents often have babysitters waiting or buses to catch; it
is important to be respectful of their time.
• Avoid “testing” respondents. A simple way to learn about participants’ awareness and
knowledge of an issue or product is by asking participants if they know something.
Avoid following this with, “Well, actually, the real answer is . . . .” Participants aren’t
there to be quizzed or tested and shouldn’t be made to feel ignorant. As a corollary,
qualitative research is not the appropriate venue to educate participants. While it
is acceptable to offer contact information or resources after a group or interview is
finished, do not try to modify opinions or behavior. Participants in a tobacco study
shouldn’t feel as if the focus group was intended to shame them into quitting. The
purpose of the research is to learn from the participants, not teach them.
• Avoid creating unnecessary tension or emotional upheaval. It is important to ensure
that participants are comfortable enough to share their thoughts and opinions. In
groups, avoid creating tension and setting up any potential hierarchies within the
group, particularly in the icebreaker and warm-​up exercises—​for example, be wary
of starting off by asking a socioeconomically diverse group about their occupations.
Also be aware that some topics may be emotionally upsetting for respondents and
plan accordingly.
Conclusion

Like survey research and, indeed, almost every other field, the face of qualitative research
is rapidly shifting. Social media and online platforms are making it possible to collect
qualitative data in new ways. More and more often, group discussion and individual
interviews can now take place over online bulletin boards or in virtual reality spaces.
Participants from across the country, or even the world, can participate in research to-
gether, without the cost and hassle of researchers traveling to multiple locations. Virtual
spaces and online platforms can also allow participants to interact over days or weeks,
rather than in a single two-​hour block. These technological advancements open up new
possibilities in terms of how participants can be recruited and the kinds of stimulus
that can be provided during data collection. These opportunities, as well as potential
reduced costs of online data collection, will make new options like these attractive to
clients. The range of online tools is too broad and changes too quickly to discuss here,
but researchers should continue to investigate online venues for conducting interviews
and focus groups. In addition, a whole new field of data collection is being pioneered
with eye tracking and other physiological methods, which allow researchers to measure
involuntary responses to stimuli.
However, it is important to remember that the foundation of qualitative research
consists of people talking to other people. Eye tracking, virtual reality spaces, and anal-
ysis software all offer new opportunities and insights for the practice of qualitative re-
search. However, these should be considered layers of information that enrich (but do
not replace) what is at the heart of qualitative research: talking to people to gather the
stories and experiences that quantitative data alone cannot provide.

Notes
1. The Office of Management and Budget (part of the Executive Office of the President)
issues standards and guidelines for federal surveys. These guidelines require testing (either
through cognitive testing, focus groups, or usability testing) of all federal surveys before
they are fielded (https://​www.whitehouse.gov/​sites/​default/​files/​omb/​inforeg/​statpolicy/​
standards_​stat_​surveys.pdf).
2. For an example of a recruiting screener, see Krueger and Casey (2009, app. 4.1).
3. See Krueger and Casey’s chapter “Developing a Questioning Route” for a discussion of
focus group protocol development (Krueger and Casey 2009). For an example of a cogni-
tive interview protocol, see Willis (2005, app. 1).
4. The RIVA Training Institute, founded in 1982 and based in Rockville, Maryland, offers
multiday training programs in moderating focus groups and IDIs, as well as on other aspects
of qualitative research, including reporting, working with teenagers and children, ethnog-
raphy, and usability. More information can be found at http://​www.rivainc.com/​training/​.
5. For an in-​depth discussion of analysis specifically in the context of cognitive interviews, see
Willis (2015).
6. More information about NVivo and Atlas, two of the most commonly used qualitative anal-
ysis software packages, can be found at http://​www.qsrinternational.com/​products_​nvivo.
aspx and http://​atlasti.com/​.
7. For a discussion of ethics generally (and informed consent specifically) in a qualitative re-
search setting, see Collins (2015, sec. 3.9).

References and Other Resources


Burke Institute. Cincinnati, OH. http://​www.burkeinstitute.com/​. A private organization
offering seminars and courses on marketing research, including qualitative research.
Bystedt, J., S. Lynn, and D. Potts. 2003. Moderating to the Max:  A Full-​tilt Guide to
Creative, Insightful Focus Groups and Depth Interviews. Ithaca, NY:  Paramount Market
Publishing, Inc.
Collins, D. 2015. Cognitive Interviewing Practice. Thousand Oaks, CA: Sage.
Glaser, B. G., and A. Strauss. 1967. The Discovery of Grounded Theory: Strategies for Qualitative
Research. Chicago, IL: Aldine.
Henderson, N. 2011. Secrets of a Master Moderator. Bethesda, MD: VISAR Corporation.
Joint Program in Survey Methodology (JPSM), University of Maryland, College Park.
www.jpsm.umd.edu. A graduate degree program teaching state-of-the-art principles and
practices in the design, conduct, analysis, and evaluation of sample surveys. In addition to a
traditional degree program, JPSM offers short courses, open to practitioners.
Krueger, R. A., and M. A. Casey. 2009. Focus Groups: A Practical Guide for Applied Research.
Thousand Oaks, CA: Sage.
Krosnick, J. A. 1991. “Response Strategies for Coping with the Cognitive Demands of Attitude
Measures in Surveys.” Applied Cognitive Psychology 5 (3): 213–​236.
Lombard, M., J. Snyder-Duch, and C. C. Bracken. 2002. "Content Analysis in Mass
Communication." Human Communication Research 28 (4): 587–604.
Marshall, C., and G. B. Rossman. 2014. Designing Qualitative Research. Thousand Oaks,
CA: Sage.
Miller, K., V. Chepp, S. Willson, and J. L. Padilla, eds. 2014. Cognitive Interviewing Methodology.
Hoboken, NJ: John Wiley & Sons.
“Professors’ Research Project Stirs Political Outrage in Montana.” New York Times, October 28,
2014, http://​nyti.ms/​1vbaE3r.
Qualitative Research Consultants Association (QRCA). www.qrca.org. A not-​for-​profit asso-
ciation of consultants involved in the design and implementation of qualitative research.
RIVA Training Institute, Rockville, MD. www.rivainc.com. A private institute offering inten-
sive qualitative research training, including in-​person courses on moderating focus groups
and webinars.
Ross, M. W. 2014. “Do Research Ethics Need Updating for the Digital Age? The Facebook
Emotional Contagion Study Raises New Questions.” Monitor on Psychology 45 (9): 64.
Tourangeau, R., L. J. Rips, and K. Rasinski. 2000. The Psychology of Survey Response.
Cambridge: Cambridge University Press.
Willis, G. B. 2005. Cognitive Interviewing: A Tool for Improving Questionnaire Design. Thousand
Oaks, CA: Sage.
Willis, G. B. 2015. Analysis of the Cognitive Interview in Questionnaire Design. New York: Oxford
University Press.
Chapter 23

Integration of Contextual Data
Opportunities and Challenges

Armando Razo

Introduction

Many theoretical approaches that inform survey analysis point to the importance of
contextual factors. Mass political behavior, for example, is understood to be deter-
mined by individual attributes and a variety of external factors such as families, social
networks, and communities (Cohen and Dawson 1993; Eagles 1995; Agnew 1996). These
latter factors are not routinely measured in the course of a typical poll or survey, and
researchers often seek to add or append them to individual-​level survey data. Sometimes
the notion of context is more implicit but no less important, as is the case in comparative
approaches that emphasize group or country-​level differences that have a systematic im-
pact between collections of individuals (De Vries et al. 2011; Duch and Stevenson 2005;
Gordon and Segura 1997).
While contextual data are important, discussion of their collection and use by survey
researchers runs into two major impediments:  (1) vague or incomplete conceptual
definitions of “context” and (2) lack of methodological guidance to collect and analyze
contextual data. This chapter addresses those impediments with a conceptual frame-
work that clarifies the nature and importance of context in social scientific research. On
the methodological front, statistical approaches are presented to provide a blueprint
for researchers interested in explicit measurements and analysis of contextual data. The
chapter also includes a discussion about potential needs to modify conventional sam-
pling techniques in order to capture relevant contextual variability.
How Does Context Fit into Survey Analysis?

For quantitative research, there are many conceptual and empirical definitions of con-
text. At a very high level of abstraction, context refers to settings or situations that differ
across subpopulations. Digging deeper, there are at least two distinct conceptualizations
in terms of physical and social settings. Physical settings capture the fact that individuals
are often affixed to a particular geographical location such as census tracts, cities, or
counties. Social settings refer to individuals’ social environment.
Although there are multiple operational definitions for geographical context, they
all point to environmental factors with three different manifestations. For one, well-​
delineated political units often translate into—​or at least are assumed to produce—​
environmental or institutional differences across boundaries. In fact, many studies
of political behavior have demonstrated that the actual location of individual voters
matters for their behavior. Countries, and their respective settings, differentially affect
processes of socialization and levels of political information (Gordon and Segura 1997;
De Vries et al. 2011). Within countries, geography determines local aspects of political
competition (Pacheco 2008), availability of local campaign information (Alvarez 1996),
informational or cue environments (Alvarez and Gronke 1996), national economic
conditions (Nadeau and Lewis‐Beck 2001), and the composition of candidate sets or
general electoral conditions available to voters (Alvarez 1997; Atkeson 2003).
A second manifestation recognizes that the impact of geography is not restricted
to contextual effects within self-​contained physical settings, but also across them.
From this angle, physical space plays a critical mediating role for such mechanisms
as diffusion (Shipan and Volden 2008) and spatial interdependence (Agnew 1996;
Huckfeldt 2009; Ward and Gleditsch 2008). Finally, physical settings can interact with
additional cognitive or psychological processes to produce outcomes affected by both
individual and contextual factors. For instance, Berger et al. (2008) have demonstrated
that polling locations, whose physical features we would expect to have a neutral im-
pact on behavior, can nonetheless prime individuals to respond sympathetically to their
surroundings. For instance, when a polling location is a school, voters are more likely to
support school funding initiatives.1
A second approach to context in quantitative studies examines social factors. Here, rel-
evant context includes an individual’s social interactions (Eulau and Rothenberg 1986),
urban neighborhoods (Cohen and Dawson 1993; Huckfeldt and Sprague 1987); personal
networks broadly defined (Zuckerman 2005), exogenous social ties (Sinclair 2012),
and social structures (Rolfe 2012) that impact political behavior.2 In contrast to phys-
ical settings, social context has two distinctive features. Individuals might not be able to
choose their country of origin, but they can choose their friends; that is, social context
is endogenous, partly due to network homophily, which is the tendency for individuals
to associate with similar people (Kadushin 2012; Prell 2012). Another distinctive feature
is that social context can sometimes be an emergent phenomenon of variable scope that
results from a large number of decentralized social interactions (Eulau and Rothenberg
1986).3 In other words, individuals play a major role in constructing their own social
context while also affecting the social context of others.
Clearly these physical and social conceptualizations can operate at the same time,
perhaps inadvertently, which greatly complicates the identification of contextual effects.
For instance, geographical differences might in their own right affect the political envi-
ronment in which individuals operate (e.g., different U.S. state constitutions have vari-
able balanced budget provisions, which might have a differential impact on individual
economic behavior). However, a common physical setting also brings people together,
thus creating social ties. Social ties can further enable mechanisms such as social influ-
ence (Eagles 1995). Beyond the social realm, however, these ties can have an indirect
impact on geography by affecting residential choices, thus creating clustering or seg-
regation patterns that effectively redefine the relevant geography or location of distinct
groups (Daraganova et al. 2012).

Surveys and Contextual Information


In general, most surveys do not collect contextual data in a systematic way. In fact,
Huckfeldt notes that “most surveys produce information on socially independent
variables” (2007, 102). This narrow data collection does not preclude integrating a con-
textual dimension at a later time, but this indirect approach is not always justified. For
example, contextual information can certainly be appended with identifiers that affix
geographical settings to socially independent units. But this approach carries a strong
assumption that all units affixed to a particular geography share the same context, itself
an empirical question. To the extent that context relies on specific relationships between
individuals and their environment, as is the case with social ties that are not anonymous
by definition, this geographical approach overstates the role of context.
Major academic surveys in the United States collect contextual information, but not
in a systematic fashion. For example, since the 1980s the American National Election
Studies (ANES) has included questions about whether respondents discuss politics with
family and friends, but without capturing concrete attributes of those third parties.4
Rolling cross-​section survey designs have intermittently explored informational and
cognitive contexts affecting campaign engagement (Johnston and Brady 2002), while
specific questions about respondents’ perceptions of and direct interactions with po-
litical actors have been recorded as part of the Senate Election Studies of 1988, 1990,
and 1992. Also, the General Social Survey (GSS) has had a few special editions in which
context has played a prominent role, but it generally lacks a clear and explicit interest
in contextual factors, with a few exceptions.5 In 1985 there was a topical module on
social networks that captured data on various structural properties of reported so-
cial networks along with corresponding attributes. In 1986 there was a module on so-
cial support and networks. In 1990 there was a module on intergroup relations. Most
recently, the Cooperative Congressional Election Study (CCES) has at times incorpo-
rated some contextual measures.6 By design, CCES surveys ask general questions that
tap into reported behaviors, attitudes, and opinions with respect to various salient
political issues. Data explicitly denoted as “contextual” include campaign election data
for both the House of Representatives and the Senate starting in 2006.7 Contextual
data can be merged using indicators like ZIP and FIPS codes and congressional district
numbers to cross-​reference individual-​level information with corresponding geo-
graphical properties (Pettigrew and Fraga 2014).
Outside of the United States, three surveys warrant brief mention. First, the
European Social Survey (ESS) has measured attitudes, beliefs, and behavior since 2001.
The main contextual approach in this survey is to propose two contexts that can be
relevant settings for reported individual-​level data: countries and regions. This mul-
tilevel orientation that nests individual observations into aggregate units is a typical
approach that extends to other regions of the world. For example, there are two major
Latin American surveys that emphasize contextual differences across countries in
that region. There, the Latinobarometer, an annual public opinion survey that began
in 1995, routinely asks questions about social classes, participation in social organiza-
tions, community engagement, and relationships (especially perceptions of trust) with
other people.8 Likewise, the Latin American Public Opinion Project (LAPOP) includes
questions on social organizations and community involvement.9 These measures can
serve, for example, to denote whether different national contexts (i.e., contexts) ex-
hibit more or less social cohesion or other aspects of the social-​political environment
surrounding respondents.
We can draw three conclusions from extant efforts. First, it is clear that researchers
think that context matters, and that it deserves special attention. Second, despite the
interest in context, researchers have failed to provide a narrow definition for how it
matters for public opinion and political behavior and how to measure it. Finally, the lack
of standardized contextual variables across surveys, which is coupled with sporadic em-
pirical inquiries, impedes systematic research and cumulative knowledge on contextual
effects.

How to Conceptualize Context

To better measure context, we first need to distinguish contextual descriptors (or data)
from contextual mechanisms. A contextual descriptor is an actual measurement of con-
text. In its simplest form, the descriptor would be a nominal variable with values drawn
from a (finite) contextual set C = {c1, c2, ...}. Given C, we can "contextualize" survey data
by linking individual responses to particular contexts c1, c2, and so forth. In contrast, a
contextual mechanism maps contextualized observations onto different individual
behaviors.10 An example of a contextual mechanism is social influence or socialization.
For example, individuals are members of specific families, and families can have their
own attributes, such as party identification. A desire to please one’s relatives can there-
fore increase the likelihood that a young person will eventually share the family’s party
identification.
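In data terms, a contextual descriptor can be as simple as a mapping from each unit to an element of C; the sketch below, with hypothetical respondent and context labels, contextualizes a handful of survey responses in exactly this sense.

    # A finite contextual set C and a unit-to-context assignment.
    contexts = {"c1": {"type": "urban"}, "c2": {"type": "rural"}}
    assignment = {"r1": "c1", "r2": "c1", "r3": "c2"}   # each respondent maps to one context

    # Survey responses keyed by respondent.
    responses = {"r1": 4, "r2": 2, "r3": 5}

    # "Contextualized" observations pair each response with its contextual descriptor.
    contextualized = [
        {"respondent": r, "y": y, "context": assignment[r], **contexts[assignment[r]]}
        for r, y in responses.items()
    ]
    print(contextualized)
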
Next, it is important to distinguish intrinsic and extrinsic properties associated with
independent survey observations (i.e., our unit of analysis). Intrinsic properties are es-
sential defining features of our unit of analysis. For example, if we study individuals,
their age and height are intrinsic properties of people because we use them to describe
individuals. In contrast, an example of an extrinsic property is an individual’s place of
residence. A place of residence like Chicago or New York can be attached to a person,
but a location is not a part of a person proper. Another way to see this distinction is
that when people move from one place to another, they carry with themselves intrinsic
properties such as age and height, but leave behind extrinsic ones.
I advance here a notion of context as the “surrounding” associated with extrinsic
properties attached to individual units. Surrounding is an adequate and general de-
piction of context, because it effectively captures all possible external conditions that
could affect individuals. To justify this notion, however, we first need to address a crit-
ical question regarding who defines the surrounding. Usually, this notion is defined by
analysts, and sometimes as a matter of convenience (exploiting geographical references
to affix geographical units to particular individuals). But to the extent that we want to
study contextual mechanisms—​not just finding descriptors to contextualize data—​
then we cannot discount an important cognitive basis for the definition of context: that
individuals themselves may play a major role in defining it.
Take, for example, the problem of how to process external information. Ultimately,
processing information is a task in which a combination of messages (frames,
schemas, etc.) interact with an individual’s own cognitive abilities to define the rele-
vant surrounding. From a research perspective, a major problem arises when multiple
individuals identify different surroundings due to perceptional variability—​even when
they appear to face similar environmental conditions from our own external analytical
perspective. For example, we might think that two individuals living in the same neigh-
borhood are subject to the same conditions, but they might actually perceive and ex-
perience that same physical space in very different ways, thus adding latent contextual
variability that we miss with our external measurements.
In reality, it appears that there is an inherent subjectivity associated with the task of
identifying relevant surroundings. The point here is not that context is what we imagine
it to be in some postmodern relativist sense, but rather that individual perceptions and
derived “maps of the external world” differ for individuals. This is a foundational re-
search concern about epistemic contextual effects that social surveys routinely ignore.11
If contextual perceptions do vary across individuals, then polling methods need to
better understand and accommodate these cognitive processes, a consideration that
opens a rich vein of research for contextual surveys. The reason is that this is not just
a substantive problem of political psychology but also a methodological one involving
the validity and measurement of contextual data. These cognitive considerations lie out-
side the scope of this chapter, so I focus instead on a first (practical) step of defining a
surrounding solely in terms of extrinsic properties: as objective measures of entities that
exist separately from individuals. As information on internal mental processes is not
readily available to us, however, this first step necessarily assumes uniform perceptions
of surroundings based on available extrinsic information.

Contextual Possibilities
Thinking of context as surroundings further clarifies that “contextual data” does not de-
scribe either individuals or context in a vacuum. Rather, to contextualize means having
information about contextual relationships between our units and their surroundings.
Contextual data are therefore relational data, so we need a conceptual framework that
classifies relationships.
In the framework advanced here, contextual relationships have three basic forms: (ex-
clusive) groupings, neighborhoods, and social ties. First, groupings are equivalent to
the notion of a contextual partition, by which we require that every individual be part
of a group and that groups be mutually exclusive. Contextual effects in a group con-
text are self-​contained and uniform, by affecting all members of a particular group.
Second, neighborhood requires an underlying measurable space to assess whether
two units are neighbors (i.e., dyadic geographical proximity). Extrinsic properties of
the physical space surrounding an individual might serve to define relevant context,
but this space is not constant. Defined by geographical proximity, the most relevant
aspect of neighborhoods is the proximity of one individual to others, so it’s an inher-
ently local notion (i.e., individuals are the focal points of their own neighborhoods).
Moreover, because an individual can have multiple neighbors, it is possible to observe
overlapping neighborhoods. Hence, a notion of context in terms of neighborhoods does
not guarantee a partition of the original population. The main implication is that poten-
tial contextual affects may spill over from one neighborhood onto another. Finally, we
have social ties that retain the dyadic nature of neighborhoods, but with a more flexible
notion of proximity that does not require a measurable physical space. Individuals can
be related through various social relationships (e.g., friendship, kinship, work teams) or
through the nature of their social interactions (e.g., communication). Contextual effects can have a local
scope if they are restricted to direct contact between an individual and his or her own set
of connections; or a global scope, if the overarching network structure can have an indi-
rect impact on individuals.
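As a rough sketch with hypothetical labels, the three forms of contextual relationships map onto three familiar data structures: a partition for exclusive groupings, a distance-based contiguity rule for neighborhoods, and an edge list for social ties.

    # Exclusive groupings: every unit belongs to exactly one group.
    grouping = {"i1": "county_A", "i2": "county_A", "i3": "county_B"}

    # Neighborhoods: dyadic proximity on a measurable space (here, positions on a line
    # and a distance threshold); note that neighborhoods can overlap.
    positions = {"i1": 0.0, "i2": 0.4, "i3": 3.2}
    neighbors = {
        (a, b)
        for a in positions
        for b in positions
        if a != b and abs(positions[a] - positions[b]) <= 1.0
    }

    # Social ties: dyadic relationships that require no physical space at all.
    ties = [("i1", "i3"), ("i2", "i3")]  # e.g., friendship or communication links
    print(grouping, neighbors, ties)
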

Context and the Scope of Inference in Survey Studies


As noted previously, the incorporation of context into polling necessarily invites spe-
cial consideration of subpopulations. This is the case because the potential existence
of contextual effects effectively means that there are distinctive subsets of the popula-
tion that operate under different circumstances. In this section I show how the first type
of contextual relationships, exclusive groupings, can facilitate a systematic analysis of
clearly demarcated subpopulations.
Recall that we can use a first conceptual approximation of context as a discrete set
of elements C that are populated by our analytical units. This definition captures the
notion that two different units i and j can be contained within two contexts c and c′, re-
spectively. If c equals c′, then we will end up grouping i and j together within the same
context. If c differs from c′, then we will place the units in separate groupings. With
the additional restriction that units can only belong to one context, then this unit-​to-​
context assignment rule generates a contextual partition I′ with a typical element I′j that
collects all i in I that are also in cj.12 To implement a contextual analysis at this point ef-
fectively means that we attempt to replicate results derived from the whole population in
each subpopulation Ij. These results can be either univariate, as in the presentation of the
baseline noncontextual survey analysis with a single global parameter, or multivariate,
if we had previously identified a relationship such as a correlation between two different
properties Y and X of the same population.
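A bare-bones sketch of such a robustness check, using simulated values rather than real survey data, computes a global statistic and then recomputes it within each contextual subpopulation Ij:

    from statistics import mean

    # Hypothetical contextualized observations {yi, cj}.
    data = [
        {"y": 3, "context": "c1"}, {"y": 5, "context": "c1"},
        {"y": 2, "context": "c2"}, {"y": 4, "context": "c2"}, {"y": 6, "context": "c2"},
    ]

    # Global (noncontextual) result.
    print("global mean:", mean(d["y"] for d in data))

    # Replicate the same estimate within each subpopulation defined by context.
    for c in sorted({d["context"] for d in data}):
        subset = [d["y"] for d in data if d["context"] == c]
        print(c, "mean:", mean(subset))
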
A contextual approach can also expand the scope of original populations. It makes
intuitive sense to contextualize I in terms of subpopulations defined by membership
in different contexts, but there is no necessary reason to focus on smaller populations.
Just as we can create a partition I′ from an original population I, we can easily postu-
late a superpopulation Is set that includes the original population I among many other
populations.
Two possible candidates for superpopulation contexts are time and space. If we
add to an analysis based around a single I a (finite) temporal dimension with T time
units, then the corresponding superpopulation is simply the collection of T time-​
indexed populations {I1, I2, ..., IT}. A spatial extension follows a similar logic. If we
have an index of occupied spaces s ∈ {1, 2, ..., S}, and assume that a single space can
only accommodate a single population, then our original population will be uniquely
affixed to some s. The corresponding superpopulation is the set of space-​indexed
populations {I1, I2, ..., Is, ..., IS}.
As was the case with subpopulations, contextual analysis with superpopulations
requires examination of new global results based on Is against results from specific time-​
or space-​indexed populations. Note that contextual analytical tasks are restricted here to
assessing how properties of specific contexts relate to global properties. It is, of course,
possible to compare contexts among themselves, so contextual analysis encompasses a
richer set of tasks, but not all of these perform the robustness checks currently under
consideration.
Whether through subpopulation or superpopulation analysis, contextual analysis has
the potential to examine “general” arguments under a wider variety of contexts. To the
extent that original survey results “survive” the test, we gain greater confidence in those
results. It’s also worth noting that a successful outcome in which original results are ro-
bust to varying contextual conditions ultimately implies that context does not actually
matter. But this knowledge need not be evident a priori, and the actual finding is itself
significant by identifying generalizable results that “travel” across distinct populations.
Context plays an incidental role in the robustness checks described previously, but
this does not mean that context cannot be studied in its own right. We may actually
want to carry contextual information along with our original observations yi to have
an enhanced data set of {yi , cj} in which we explicitly identify the context cj associated
with unit i. Context can assume the theoretical role of an independent or control vari-
able. Coupled with a regression framework, contextual information can be modeled and
examined in more sophisticated ways beyond comparisons of contextual groupings, as
will be shown in the following discussion.

How to Develop Contextual Surveys

Collecting and Sharing Contextual Data


We currently lack a standard battery of contextual questions to guide the design of new
surveys. One reason is that typical approaches to contextual analysis tend to focus on
very specific empirical content like social networks or a particular domain of social in-
quiry. Moreover, it is also the case that these types of questions are not asked regularly in
social surveys, thus preventing the development of common protocols for future survey
designs. To integrate contextual data into a survey, a first major decision is to decide
what contextual data to collect. In the absence of standard measurements, desirable data
collection can be approached in terms of three basic questions: (1) Do we want to cap-
ture specific or general contextual information?; (2) How do we completely capture all
the relevant context?; and (3) Where is context to be found?
First, it helps to consider the extension of contextual variables. The most restrictive
case is exclusive groupings, in which a simple group membership ID serves to link an in-
dividual to a group. A bit less constrained are neighborhoods, which require a notion of
contiguity on top of some fixed physical space. The most flexible case, indeed the most
personal measurement, is idiosyncratic by definition: the social context (such as a group
of friends) for unit i hinges on the identity of i. In general, as we relax requirements for
contextual relationships, more personalized contextual information requires a greater
and potentially more expensive data collection effort.
Second, respondents are subject to varying and heterogeneous contexts, so we need
to assess both our ability and our need to collect complete contextual information.
Ideally, one would want to collect the three types of contextual relationships: exclusive
groupings, neighborhoods, and social networks. But collecting all of these data could be
expensive, so as a pragmatic principle, we can let theory be our guide to focus on a par-
ticular notion of context. However, that principle needs to be qualified, for two reasons.
Collecting contextual data to evaluate a single theory potentially limits the future use of
those data for other purposes. Moreover, the theory must explicitly rule out (i.e., be in-
variant with respect to) other competing contextual mechanisms. In the end, contextual
data collection requirements are not independent from contextual theory building: the
less explicit our theories are with regard to contextual effects, the greater the need to
collect comprehensive contextual data, and vice versa.
Third, a one-​dimensional notion of context points to three potential sources of in-
formation. We can always ask people directly and, for some personalized context like
social networks, that might be strictly necessary. We might also be able to derive contex-
tual information from other collected survey data. For example, if we have information
about group affiliations, we can check for shared affiliations. Finally, context can be de-
rived from cross-​referenced information with external data sources, which is the most
common approach for adding contextual data. For example, we can use available aggre-
gate statistics to define socioeconomic context in terms of information such as average
community income.13
Collecting contextual data also has implications for the way we store and process
such data. Because contextual data are essentially relational and extrinsic, it is neither
recommended nor always possible to store contextual information along with intrinsic
variables within the typical rectangular format of social science data sets.14 A more gen­
eral approach to integrating contextual data is to have in place a relational database
management (RDBM) system to better organize survey data (Harrington 2009). As
illustrated in figure 23.1, these RDBM systems essentially compartmentalize data into
separate tables, each corresponding to a distinct unit of analysis, which can be cross-​
referenced as needed.
In this diagram, each rectangle represents a table with a corresponding list of fields
or variables.15 For example, the “Individuals” table includes a numeric identifier i, a per-
sonal label, two quantitative variables, and a contextual identifier j. The “Contexts” table
includes several Z variables that encode different types of contextual information. Each
of these tables can be separately updated with either more observations or more fields.
Most important, we can relate individuals to particular contexts identified by j. With
this common index, RDBM systems readily enable custom queries to create new rec-
tangular data sets with all or any subset of contextual variables, or simply recreate pre-
vious data sets after entering more cases.

Figure 23.1  Sample database structure to link individuals and contexts.


Note: This MySQL diagram shows a many-​to-​one relationship that reflects how various individuals can be linked to a
particular context using the common identifier j. See http://dev.mysql.com/doc/ for more information.
Cross-​referencing tables to derive analytic data sets can save a lot of time. In particular,
adding new individual cases does not require that we enter all corresponding Z contex-
tual variables, but simply the corresponding j value. This approach can be readily extended
to capture more complicated nested data structures. For example, if our original contexts
are later themselves deemed to be part of broader contexts, all we have to do is add to the
“Contexts” table a new field, say k, that identifies higher-​level contexts in a separate (third)
table. Moreover, custom queries can easily create analytic data sets that link individuals to
the higher-​level context, if we think the latter is more relevant than the original context.
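A lightweight version of this cross-referencing, with hypothetical table and field names and pandas merges standing in for a full MySQL installation, might look as follows:

    import pandas as pd

    # "Individuals" table: intrinsic variables plus the contextual identifier j.
    individuals = pd.DataFrame({
        "i": [1, 2, 3],
        "x1": [2.3, 4.1, 3.3],
        "j": [10, 10, 20],
    })

    # "Contexts" table: extrinsic variables keyed by j (e.g., degree of urbanization).
    contexts = pd.DataFrame({"j": [10, 20], "z1": [0.62, 0.17]})

    # A custom "query" that recreates a rectangular analytic data set on demand.
    analytic = individuals.merge(contexts, on="j", how="left")
    print(analytic)
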
Compartmentalizing contextual data also adds flexibility regarding the actual timing
of contextual data collection. It may be possible to measure context later by adding a
contextual placeholder in a first wave of data collection. For example, geocoding allows
us to link to external contextual information with corresponding coordinates.
Moreover, we can create neighborhoods based on collected samples by feeding our co-
ordinates to a spatial analysis system that generates required distance-​based contiguity
matrices (Bivand et  al. 2008). Similarly, with social media, capturing an individual’s
identity (e.g., Twitter handle) can help us incorporate future tweeting and retweeting
activities. All in all, because context can be added after survey data have been collected,
it is especially critical to develop an adequate technological infrastructure that meets
current and future contextual data collection needs.
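For geocoded records, a distance-based contiguity matrix can be generated after the fact from stored coordinates; the sketch below uses a plain haversine distance and an arbitrary 10 km threshold as a stand-in for a dedicated spatial analysis system.

    from math import radians, sin, cos, asin, sqrt

    def haversine_km(p, q):
        # Great-circle distance in kilometers between two (latitude, longitude) points.
        lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
        h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371 * asin(sqrt(h))

    # Hypothetical respondent coordinates captured (or geocoded) in a first wave.
    coords = {"r1": (35.15, -90.05), "r2": (35.12, -90.00), "r3": (36.16, -86.78)}

    # Distance-based contiguity: respondents within 10 km of each other are neighbors.
    contiguity = {
        a: {b: int(a != b and haversine_km(pa, pb) <= 10.0) for b, pb in coords.items()}
        for a, pa in coords.items()
    }
    print(contiguity)
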
Although integrating contextual data may increase technical requirements vis-​à-​
vis an attribute-​based social survey, two major ethical challenges arise. The first stems
from the fact that integrating contextual data potentially involves more people than
survey respondents. The applicable ethical question here is whether we need permis-
sion from all affected parties, even if they are not in our sample. For example, asking
questions about attributes or behavior of neighbors can be considered an invasion of
privacy for those third parties (who may or may not have a positive relationship with the
responding neighbor).
Clearly, it is not practical for researchers to seek third-​ party permission, and
respondents themselves cannot offer it even as they volunteer related information. To
address this ethical concern, researchers should always incorporate—​as they seek ap-
proval from institutional review boards—​an explicit regard for third parties who can
provide contextual data, along with a feasible action plan to mitigate potential harms.16
A second challenge stems from the need to protect confidentiality and privacy.
Confidentiality refers to the protection of personally identifiable information, which
should not be disclosed without the provider’s permission. Privacy shields respondents
and other affected parties from the public. For example, a snowball sampling scheme
with sequential interviews of named contacts can violate privacy if reported third
parties did not want to be reached. These desirable protections are, of course, not unique
to contextual surveys. In fact, social surveys routinely anonymize personally identifiable
individual attributes. But because contextual data are inherently relational, researchers
should be prepared to implement more comprehensive protection measures to prevent
the identification of contextualized observations.17
Integrating both data collection and ethical concerns, the most important challenge of in-
tegrating contextual data into surveys is the need to balance a risk-​utility trade-​off inherent
in data sharing (Drechsler 2011). Risk refers to our inability to adequately protect data, thus
disclosing the identity of respondents. Utility refers to the ability of other researchers to use
undistorted shared data. The trade-​off and potential distortion come together due to ex-
plicit attempts to address the aforementioned second challenge of data protection.18 Despite
our best efforts, it is well known that anonymized data can be collated with other informa-
tion to reconstruct personalized records, a major concern with the increasing use of online
databases (de Montjoye et al. 2015; El Emam et al. 2011). In response to this problem, various
data transformations have been suggested to further anonymize observations, which in-
clude random value changes or imputations, among many other techniques. The result is a
partially synthetic data set, which acts as a proxy for the original data set.19
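The sketch below conveys the flavor of such transformations without constituting a complete disclosure control plan: a sensitive continuous attribute receives a random value change, and quasi-identifiers are coarsened, yielding a partially synthetic stand-in for the original records.

    import random

    random.seed(42)  # fixed seed for the illustration only

    records = [
        {"id": 1, "age": 34, "income": 52000, "zip": "59715"},
        {"id": 2, "age": 61, "income": 87000, "zip": "59718"},
    ]

    def partially_synthesize(rec):
        out = dict(rec)
        out["income"] = round(rec["income"] * random.uniform(0.9, 1.1))  # random value change
        out["zip"] = rec["zip"][:3] + "XX"                               # coarsen a quasi-identifier
        out["age"] = 5 * round(rec["age"] / 5)                           # five-year age bands
        return out

    synthetic = [partially_synthesize(r) for r in records]
    print(synthetic)
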
For survey researchers, resolving this risk-​utility trade-​off has several practical
implications. For one, it is critical that researchers have an appropriate statistical disclo-
sure control (SDC) plan, which also requires familiarity with SDC methods and required
technology to implement them.20 Although major organizations like the American
Association for Public Opinion Research (AAPOR) have long required protection of
confidential data, the relatively narrow focus of SDC methods on de-​identifying records
with sensitive individual attributes needs to be applied to contextualized observations as
well, in effect, anonymizing both individuals and (unique) contexts.
This last challenge is particularly relevant to current efforts to improve data access and
research transparency in disciplines like political science that make heavy use of survey
analysis.21 In particular, researchers who append contextual information to extant survey
data might have usage and disclosure restrictions that run counter to new professional
standards. Commercially available information can add a very rich contextual dimension
to survey studies, but if access is restricted to paid subscribers, it will be difficult to rep-
licate and extend contextual survey studies. Individual researchers alone cannot readily
solve this problem, but they can take steps to maximize access to other researchers.22
To the extent possible, researchers with the ability to purchase that information should
encourage collaborative arrangements between private companies and universities to
create a supporting infrastructure for data sharing among researchers. A relevant model
comes from Census Research Data Centers (RDCs), which allow restricted access to sen-
sitive data to a select number of researchers at specially designated physical facilities.23

Methodological Considerations
for Contextual Surveys

Contextual Variability
If context is indeed a variable of interest, then we need to think seriously about how
our data collection efforts ensure a desirable degree of contextual variability. Simple
random sampling (SRS) methods are inherently noncontextual because they group all
observations under the same category (in effect, a common context).
To move beyond SRS, we need to remember that the unit of analysis is a key factor
that informs adequate sampling methods. The unit of analysis is the object of interest
for a particular study, typically individuals or households in social surveys. Although we
may obtain multiple variables, these are all anchored or affixed to these individual units,
so it is appropriate to depict sampling schemes as being one-​dimensional (in terms of
unit of analysis). The relevant single dimension refers to an underlying sampling frame
with a comprehensive listing (i.e., distinctive labels or identification numbers) for all
population units i in I.
Integrating contextual data necessarily alters the original unit of analysis, effectively
increasing the dimensionality of sampling frames. For the sake of illustration, let the
sets I and C correspond to discrete enumerations of available individuals and contexts,
respectively. Since contextual information is relational, the unit of analysis is not just
any i in I, but rather the Cartesian product I × C with typical (pair) elements {i,c}.24 This
conceptual formulation of a “contextualized” unit of analysis immediately implies that
a sampling scheme that ignores the contextual dimension may not derive into a prob-
ability sample proper, except in fairly unique cases. Specifically, if we assume or have
reason to believe that available contexts in C are uniformly distributed across the popu-
lation, then we can deduce that all possible {i,c} combinations are equiprobable. But this
is also the very same case in which context does not matter, because it averages out at the
population level. Beyond the special case, SRS survey designs do not actually know the
ex ante probability of {i,c} pairs. Hence, design-​based population surveys that do not
explicitly account for—​but still want to study—​contextual differences lack a complete
probabilistic foundation.
There is a straightforward approach to dealing with nonuniform contextual
distributions, which requires a probability distribution over available contexts. First,
having defined a sample space for relevant contexts, researchers need to either estimate
or calculate the probability of selecting particular contexts. Second, researchers can use
these probabilities to design sampling schemes in a manner that is analogous to stratified
random sampling.25 Basically, instead of strata built around some intrinsic individual
trait (like sex), researchers use their preferred notion of context. For example, if we have
a community type variable with two values (rural, urban), we can construct a sample
constrained to have the relative (estimated or actual) proportion of rural and urban
communities from which we will sample individuals separately. Contextual analysis, in
turn, entails a comparison of relevant parameters in each of these community types.
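A bare-bones version of this contextual stratification appears below; the frame, the rural/urban proportions, and the target sample size are all invented, and in practice the context probabilities would come from census counts or prior estimates.

    import random

    random.seed(1)

    # Sampling frame in which every unit carries a contextual descriptor {i, c}.
    frame = [{"i": n, "context": "rural" if n % 4 == 0 else "urban"} for n in range(1000)]

    # Estimated (or known) context proportions and the target sample size.
    proportions = {"rural": 0.25, "urban": 0.75}
    total_n = 200

    # Sample separately within each contextual stratum.
    sample = []
    for context, share in proportions.items():
        stratum = [u for u in frame if u["context"] == context]
        sample.extend(random.sample(stratum, round(total_n * share)))

    print(len(sample), "units;", sum(u["context"] == "rural" for u in sample), "rural")
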
Theoretical guidance will also be key to determining how one might sample multiple
contexts if these interact. For example, it is likely that some contextual factors like so-
cial networks may operate differently across rural or urban communities. Perhaps in
rural communities, social networks may be smaller, denser, and more homogeneous in
composition (in terms of the attributes of individual participants). In contrast, urban
communities may give rise to larger, sparser, and more heterogeneous networks.26
Although tedious, the stratified contextual sampling scheme advanced here is
implementable in many circumstances. These circumstances are limited by the type of
context that one wants to analyze. In particular, this sampling scheme will only work
when one can partition the set of available contexts into mutually exclusive categories.
This observation means that neighborhoods or social network contexts do not lend
themselves as readily to stratification, except in special cases in which we can define ex-
clusive neighborhoods or isolated network components.

Statistical Inference
Whether or not one engages in purposeful contextual survey design, integrating contex-
tual data invites explicit analysis of contextual effects. From a design-​based perspective,
these effects can be evaluated with a variety of well-​known techniques, including anal-
ysis of variance (ANOVA) and ANCOVA (analysis of covariance) methods. The rest of
this section focuses on model-​based survey analyses with two purposes: (1) to showcase
statistical methods that explicitly model underlying context to estimate its impacts and
(2) to provide a counterpart statistical methodology to each of the three types of context
described previously.27
To motivate the first family of methods known as multilevel models, it helps to think about how one could enhance an individual-level analysis, simply denoted as $y_i = f(x_i)$, with contextual information.28 Earlier approaches that focused on fixed contextual effects include Stipak and Hensler (1982), who posited a regression function with independent individual and contextual effects, $y_i = f(X_i, C_j)$, in effect a type of ANCOVA analysis if the contextual variables were discrete factors, or a simple regression with multilevel variables. In either case, contextual variables are modeled to have an independent impact and can be allowed to interact with individual-level factors. Iversen (1991) addresses contextual effects from a similar approach, positing that contextual effects are best understood in terms of cross-level interaction terms, or $y_i = f(X_i, C_j, X_i \times C_j)$. The multilevel formulation will build on these earlier insights.
Multilevel models, also known as hierarchical linear regressions (HLR) in the case of
an interval-​valued yi, have two distinctive features with respect to the conventional linear
model of independent and identically distributed (IID) observations.29 First, multilevel
models assume a nested data structure that translates into exclusive groupings in which
individual observations belong to one and only one possible grouping. We can let such
groupings represent distinct “contexts.” Second, and most important, there is an explicit
attempt to model such groupings in terms of separate (extrinsic) properties. Letting i and j
represent distinct individuals and contexts, respectively, these two features are implemented
in two different ways with random-​intercepts and random-​coefficients models.
The simpler formulation of random intercepts is based around a reformulated individual (or level-1) equation $y_i = \alpha_{j[i]} + \beta x_i + \epsilon_i$, in which the intercept is modeled to be a function of some context j.30 For example, $y_i$ could be a measure of political knowledge that is affected by income ($x_i$). If the relevant context is a county, we could use a county-level variable such as degree of urbanization ($Z_j$) to distinguish different contexts, thus deriving this contextual (or level-2) equation for the random intercept: $\alpha_{j[i]} = \gamma_{00} + \gamma_{01} Z_j + u_{0j}$. In practice, contextual effects will be manifested through
different j-specific intercepts, but this equation makes it clear that different levels of political knowledge (measured through varying intercepts) could be a function of level-2 factors such as $Z_j$. Because this intercept equation is stochastic, it is also clear that the researcher acknowledges some uncertainty in the specification of context j, which is superior to simply positing a fixed contextual measurement.
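A minimal R sketch of this random-intercepts specification, using the lme4 package, is given below; the data are simulated, and the variable names (knowledge, income, urbanization, county) are hypothetical stand-ins for the example in the text.

```r
# Random-intercepts model: political knowledge as a function of income, with
# county-specific intercepts partly explained by county-level urbanization.
library(lme4)
set.seed(1)

# Simulated survey merged with county-level context (all values hypothetical)
n_counties <- 40
counties <- data.frame(county = factor(1:n_counties),
                       urbanization = runif(n_counties))
survey_df <- data.frame(county = factor(sample(1:n_counties, 2000, replace = TRUE)),
                        income = rnorm(2000))
survey_df <- merge(survey_df, counties, by = "county")
survey_df$knowledge <- 1 + 0.5 * survey_df$income + 0.8 * survey_df$urbanization +
  rnorm(n_counties, sd = 0.3)[as.integer(survey_df$county)] + rnorm(2000, sd = 0.5)

# level-1: knowledge_i = alpha_j[i] + beta * income_i + e_i
# level-2: alpha_j     = gamma_00 + gamma_01 * urbanization_j + u_0j
m_intercepts <- lmer(knowledge ~ income + urbanization + (1 | county),
                     data = survey_df)
summary(m_intercepts)
```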
A second multilevel variation involves the explicit modeling of the slope coefficients. Rather than have slopes be fixed population parameters, we can make these a function of level-2 or contextual factors. In that case, letting the intercept be fixed, we have a level-1 equation $y_i = \alpha + \beta_{j[i]} x_i + \epsilon_i$ and a level-2 equation for β, which now becomes $\beta_{j[i]} = \gamma_{10} + \gamma_{11} Z_{2j} + u_{1j}$, where $Z_{2j}$ could be another county-level property such as a measure of economic development. A more general formulation can include both random intercepts and coefficients. Multilevel models provide a natural approach to incorporating extraneous information in level-2 equations. Substituting the level-2 equations into the level-1 equation yields a comprehensive equation that separates level-1 and level-2 factors as well as their potential interactions.31
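Continuing the simulated survey_df from the previous sketch, the random-coefficients variant and the more general formulation can be written as follows; the county-level development measure is again a hypothetical illustration.

```r
# Random-coefficients variant: the income slope varies by county and is
# modeled as a function of a county-level development measure.
survey_df$development <- runif(nlevels(survey_df$county))[as.integer(survey_df$county)]

# level-1: knowledge_i = alpha + beta_j[i] * income_i + e_i
# level-2: beta_j      = gamma_10 + gamma_11 * development_j + u_1j
m_slopes <- lmer(knowledge ~ income * development + (0 + income | county),
                 data = survey_df)

# A more general formulation with both random intercepts and random slopes;
# the income:urbanization term is the cross-level interaction obtained by
# substituting the level-2 equations into the level-1 equation.
m_full <- lmer(knowledge ~ income * urbanization + (1 + income | county),
               data = survey_df)
summary(m_full)
```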
There are some contextual effects, like diffusion and interdependence, that cannot be modeled with nested data structures (Agnew 1996; Braun and Gilardi 2006). In these situations, there are two main choices for modeling overlapping contexts, starting with spatial regression approaches that exploit the existence of a fixed physical space that serves to locate individual units. The canonical specification is the spatial autoregressive (SAR), or spatial lag, model, $y_i = \alpha + \rho W y + \beta x_i + \epsilon_i$, in which a contiguity matrix W links each observation to the outcomes of its neighbors. For a more general discussion of how a spatial perspective can inform contextual analysis, see Franzese and Hays (2007, 2008).
In SAR models, contexts are endogenously derived (through individual neighborhoods defined around particular i's), and contextual effects are defined in terms of averages of the dependent variable y. First, contexts are endogenous and heterogeneous because the spatial lag varies across individual observations (not all neighborhoods have the same size and composition). Second, despite acknowledging spatial heterogeneity, the quantity of interest is ρ, which serves to measure a global spatial lag effect that takes into account the recursive nature of the SAR model. The main manifestation of contextual effects is to amplify or reduce the impact of neighboring observations.32
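As a rough illustration of the spatial-lag approach, the sketch below simulates respondent locations, builds a row-standardized contiguity matrix from nearest neighbors, and fits a SAR model with the spdep and spatialreg packages; the coordinates, the choice of k, and the simulated effect sizes are all assumptions made for the example.

```r
# Spatial-lag (SAR) sketch: y_i = alpha + rho * (W y)_i + beta * x_i + e_i,
# with W a row-standardized contiguity matrix derived from nearest neighbors.
library(spdep)       # neighbor construction
library(spatialreg)  # lagsarlm()
set.seed(7)

n      <- 400
coords <- cbind(runif(n), runif(n))                   # hypothetical locations
x      <- rnorm(n)

nb <- make.sym.nb(knn2nb(knearneigh(coords, k = 5)))  # symmetric nearest-neighbor ties
lw <- nb2listw(nb, style = "W")                       # row-standardized weights (W)

# Simulate an outcome with a modest global spatial lag effect (rho = 0.4)
W <- listw2mat(lw)
y <- as.vector(solve(diag(n) - 0.4 * W, 1 + 0.5 * x + rnorm(n)))

fit <- lagsarlm(y ~ x, data = data.frame(x = x, y = y), listw = lw)
summary(fit)                                          # rho is the quantity of interest
```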
A wide range of social network techniques can be used to measure and analyze social
contexts and their impact on individual attitudes and behaviors (Borgatti et al. 2009;
Kadushin 2012; Sinclair 2012). These tools allow the study of context—​as measured
through social relations—​in its own right. In fact, current statistical analysis of networks
focuses on models of endogenous networks or endogenous context. Rather than
studying a contextual effect per se, the underlying research question is how to explain social structures or social contexts as whole objects. This approach falls under
the umbrella of exponential random graph models (ERGMs) (Cranmer and Desmarais
2011). This is a more technical approach that requires an advanced understanding of
random networks, which cannot be adequately explained here, but the general idea is
to model complete network structures as a function of both individual attributes and
microstructural features (Harris 2013; Kolaczyk 2009; Lusher et al. 2013).
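The general flavor of an ERGM analysis can be conveyed with the ergm and network packages in R; the toy network and the party attribute below are simulated, and the two model terms (density and attribute homophily) are chosen purely for illustration.

```r
# Minimal ERGM sketch: model tie formation in a small simulated network as a
# function of overall density (edges) and homophily on a node attribute.
library(network)
library(ergm)
set.seed(99)

n     <- 50
party <- sample(c("A", "B"), n, replace = TRUE)   # hypothetical node attribute
adj   <- matrix(rbinom(n * n, 1, 0.05), n, n)     # toy adjacency matrix
adj[lower.tri(adj)] <- t(adj)[lower.tri(adj)]     # symmetrize ties
diag(adj) <- 0

net <- network(adj, directed = FALSE)
net %v% "party" <- party                          # attach the vertex attribute

fit <- ergm(net ~ edges + nodematch("party"))     # density + same-party homophily
summary(fit)
```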
Besides capturing different notions of context, the three statistical approaches
mentioned here also have practical considerations for data collection, as illustrated in
figure 23.2.

Figure 23.2  Different contextual relationships and associated methodologies. The figure summarizes the choices as follows: if all observations share a common context, contextual information is irrelevant; otherwise the relevant question is how subpopulations relate to one another. Mutually exclusive groupings require only a contextual index for subsequent cross-references and call for multilevel analysis; (spatial) neighborhoods require a spatial index and a derived contiguity matrix and call for spatial regression; overlapping social network contexts require unit-specific ties and a derived adjacency matrix and call for ego network analysis or ERGMs.

Multilevel models are the least demanding, requiring collection of only an
index survey variable that can be later cross-​referenced with contextual information
available elsewhere. Spatial regression approaches require as a basic input at least one con-
tiguity matrix W. This matrix can be derived from existing coordinates, but researchers
need to think hard about preferred measures of contiguity.33 Whether captured at the
moment of observation or later, spatial configurations need to be stored outside of the
conventional rectangular array format associated with independent observations.
Finally, social network data are the most demanding, insofar as they require calculations
of multiple adjacency matrices, which can differ across individuals when so-called ego networks differ in size and composition. Additional data structures are required
if we are also trying to measure an overarching social network on the basis of individual
reports. As the size of the sample increases, so will the number of these potentially dis-
tinct matrices, which will need to be stored separately from individual attributes.
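The storage implications summarized in figure 23.2 can be sketched in a few lines of R: individual attributes remain in a rectangular table, while ego networks are held in a separate list of adjacency matrices keyed by respondent id. All names and sizes here are hypothetical.

```r
# Rectangular attribute table plus a separate list of ego-network matrices.
set.seed(3)
respondents <- data.frame(id = 1:5, county = c("A", "A", "B", "B", "C"))

ego_networks <- lapply(respondents$id, function(i) {
  k <- sample(2:4, 1)                        # ego networks differ in size
  m <- matrix(rbinom(k * k, 1, 0.5), k, k)
  m[lower.tri(m)] <- t(m)[lower.tri(m)]      # symmetric ties among alters
  diag(m) <- 0
  m
})
names(ego_networks) <- respondents$id

# Rectangular summaries (size, density) can be merged back for conventional analysis
respondents$net_size <- vapply(ego_networks, nrow, integer(1))
respondents$density  <- vapply(ego_networks,
                               function(m) mean(m[upper.tri(m)]), numeric(1))
respondents
```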

Conclusion

Integrating contextual information offers many opportunities for data collection and
theorizing the impact of context on public opinion and political behavior. Future research
needs to consider how to produce new types of contextual information. First, we can ask
subjects to reflect on context to get a better handle on potential gaps between typical an-
alytical (external) versus internal (subjective) perceptions of context. For example, rather
than inferring that individuals who talk to others are influenced by social context (because
that reported information is correlated with their behavior), we can also ask the extent to
which they think they are influenced by and aware of their surroundings (Sterba 2009).
This line of inquiry opens opportunities to further explore the cognitive foundations of
contextual effects to build on discovered empirical relationships.
Another extension is to ask subjects to compare and contrast different contexts. Can
individuals themselves distinguish multiple contexts—​and most important, do they
think that the various contexts have varying effects? Along those lines, do respondents
perceive a structure that relates those multiple contexts? For example, a rule that says that
neighbors must be friends effectively combines two qualitatively distinct contexts based
on physical space and social relations. Are subjects, in fact, able to assess contexts to tell
the difference? If our guiding theories, or empirical setup, suggest that there will be
community effects, is it the case that when the community spans something like a small
town, our respondents actually know everyone in their town? Do they only know their
neighborhood, and how exactly do they recognize neighborhood boundaries? How in-
tegrated are our respondents in their community? To what extent do they know—​and
are affected by—​aggregate patterns of segregation or integration in their community?
Surveys have not entirely ignored these considerations, but more work needs to be done.
For instance, the Latinobarometer has previously asked related questions about “social
cohesion,” but without an explicit attempt to understand the basis for these perceptions.
These various extensions invite a deeper integration of political psychology into con-
textual analysis, even if the original intent or research question is not evidently psycho-
logical. Indeed, despite the lack of an overarching theoretical framework for context,
there is a recognizable common theme in contextual studies that points to the impor-
tance of personal psychology, such as the processing of information cues (Alvarez
1997; Atkeson 2003).34 Of course, the quantity and mix of these new contextual data
approaches will be contingent on particular projects’ research questions and resources.
Although not strictly a call for new contextual data, it bears noting that contextual data
analysis can also be advanced not just with new data sources, but with new methods.
Experimental survey methods seem particularly apt to explore questions about how con-
current contextual approaches matter. Without a comprehensive examination of all pos-
sible contextual effects, we may still get a lot of mileage from randomizing the contexts to
which subjects are exposed (through our questions). From a methodological perspective,
offering different experimental contexts can also help us assess the robustness of our results.

Notes
1. Although some researchers consider time to be an important contextual factor in and of it-
self—​see Goodin and Tilly (2006, pt. VI)—​I deem time too broad a notion to identify con-
textual differences. Although we can certainly distinguish two time units, t and t+1, that
mere distinction does not imply that time-​indexed situations are qualitatively different.
To make that claim, it is necessary to point to substantive differences, and if that’s the case,
then contexts can be differentiated on the basis of (time-​indexed) notions of context re-
lated to physical, analytical, or social settings.
2. An excellent review of recent work on this topic is available in Heaney and McClurg (2009)
and Huckfeldt (2009).
3. It is also the case that “communities” can be defined inductively as the output of algorithms
that seek to group nodes into distinctive, not necessarily mutually exclusive, subgroupings.
This is a common practice in the study of large, complex networks that has been criticized
by social scientists due to lack of prior and explicit conceptual definitions of community
(see Jackson (2008), ch. 13).
4. Lacking this information, we cannot examine compositional differences in the social
circles of different respondents, limiting contextual analysis to a sharp distinction be-
tween respondents who discuss or don’t discuss politics with others. The attributes of third
parties are also relevant to assess whether homophily drives these discussions, in which
case a respondent might not be getting new information to change individual behavioral
tendencies.
5. See http://​www3.norc.org/​GSS+Website/​.
6. See http://​projects.iq.harvard.edu/​cces.
7. This survey uses a threefold stratification scheme that accounts for the distinctions be-
tween registered and nonregistered, competitive and uncompetitive congressional
districts, and the number of congressional districts across states, resulting in sixteen strata.
As discussed further below, each stratum can be considered a distinct context.
8. See http://www.latinobarometro.org.
9. See http://​www.vanderbilt.edu/​lapop/​.
10. This distinction between descriptors (evidence) and mechanisms (theory) has two
implications for the design of contextual surveys. First, when context is appended after
data collection, researchers can often choose from multiple C sets to contextualize
responses (e.g., linking individuals to a county, a state, or a country). Without prior theo-
retical guidance, it is not always clear which descriptor or combination of descriptors best
measures relevant context. Moreover, as relevance is primarily a theoretical concern, the
design of contextual surveys benefits from consideration of concrete mechanisms in order
to develop a sampling strategy that captures relevant contextual variability. This methodo-
logical concern is revisited in the next section.
11. I use the distinction made by Sterba (2009) regarding a fundamentally different stance
about randomness between design-​based and model-​based survey analysis. In a design-​
based approach, the randomness of sampling error is empirically induced. In contrast,
model-​based approaches posit an underlying data-​generating process as the source of ran-
domness (which Sterba denotes as epistemic randomness).
12. This functional assignment restriction will be relaxed later in the context of spatial or so-
cial contexts, at which point it will also become necessary to discard these partitions as a
general model of context.
13. However, these aggregations are conceptually problematic, because they do not separate
intrinsic from extrinsic properties. Moreover, some contextual measures can be highly
sensitive to imposed contextual boundaries, as is the case with the mean statistic, which
can change drastically if a new community boundary excludes or introduces extreme in-
come values.
14. This is especially the case with neighborhoods and social networks, which have more com-
plex data structures that do not lend themselves to a single table.
15. These fields correspond to column names in typical rectangular data sets.
16. One possibility is to destroy identifiable information as soon as possible after it has
been properly anonymized and stored within a database system, but this option must be
weighed against future needs to expand contextual information. For example, a neighbor
could become a future respondent, in which case preserving true identities is critical.
17. These more stringent requirements parallel those that arise in the context of health-​related
information. In that particular domain, researchers are routinely required to comply with
stringent data management requirements stemming from the federal Health Insurance
Portability and Accountability Act of 1996, also known as HIPAA. As is the case with
the healthcare sector, technology can be a major factor in enabling required protections.
Survey researchers who integrate highly personalized contextual data can emulate
practices from the health sector. An example of required technology is the REDCap tool,
accessible through the Indiana Clinical and Translational Science Institute at https://​
www.indianactsi.org/​redcap, which is a secure system that insulates extremely private
data from the public while also allowing fine-​granularity access control to researchers and
collaborators.
18. Certainly one aspect of data protection has to do with data security, which can be mitigated
with appropriate technology. The most relevant aspect, however, has to do with data that
are (eventually) made publicly available.
19. See Domingo-​Ferrer (2008), Reiter (2003, 2012), and Dreschler (2011) for details. These
steps are actually required for research sponsored by the U.S. Census Bureau (https://​
www.census.gov/​srd/​sdc) and the U.S. federal government (see FCSM 2005).
20. Unfortunately there is no conventional tool to enable these tasks, which further highlights
the need to have a database system in place that can automate some of these transforma-
tional tasks for data sharing. To get a sense of the required steps, researchers can assess
relevant functionality in two freely available tools: (1) the R package sdcMicro (http://​
cran.r-​project.org/​web/​packages/​sdcMicro) and (2)  the Cornell Anonymization Tool
(http://​sourceforge.net/​projects/​anony-​toolkit/​).
21. See http://​www.dartstatement.org for the incorporation of these principles in the
American Political Science Association’s ethics guide. This site also includes the transpar-
ency statement of twenty-​seven journal editors.
22. Indeed, some scholars see new (big) data collection trends that rely on private companies
to be especially problematic for scientific research and call instead for a better public infra-
structure that is open to more participants (Conley et al. 2015).
23. See http://​www.census.gov/​ces/​rdcresearch/​.
24. If we add another contextual set S, then the relevant unit of analysis is a triple {i,c,s}.
Additional contexts increase the dimensionality of these units in a similar manner.
25. There are, of course, complex survey designs that exploit existing exclusive groupings,
such as with multistage cluster sampling schemes (Lumley 2010, ch. 3 and 8). Individual
units are therefore uniquely nested within a multilevel structure that can reveal contex-
tual differences across stages. However, by subsampling at each stage, these approaches
still carry a stringent assumption that units within the same cluster are subject to similar
contexts.
26. At least in terms of network size, this problem can be mitigated with questions that ask for
a fixed number of connections s. However, this approach makes strong distributional
assumptions about the size of underlying (egocentric) network structures, which is artificially
bounded above by our choice of s. Snowball sampling techniques can mitigate this artificial
boundary problem by letting respondents reveal variable network sizes, but there are associ-
ated problems involving the practical ability of recovering complete networks (in which case,
arbitrary stopping rules during data collection add spurious contextual variability).
27. Model-​based survey analysis need not ignore the underlying design (i.e., sampling
weights), but there is also some consensus that incorporating design features in these
statistical models does not affect final results (Lumley 2010, ch. 5). This means that for
practical purposes, researchers interested in analyzing contextual effects can readily apply
these models to existing survey data, provided of course that they can identify or readily
merge “contextual” information. There is some methodological overlap here with related
approaches in social epidemiology that include multilevel and network approaches in
addition to multiple practical approaches to collecting social or contextual data (Oakes
and Kaufman 2006).
28. To facilitate the presentation, I am setting aside the specification of an underlying sto-
chastic component, which is important but not relevant for the current discussion.
29. Gelman and Hill (2007) is a standard reference for multilevel models, also offering com-
puting examples using the R statistical environment. Another approach using the Stata
programming language is Rabe-​Hesketh and Skrondal (2008). For an overview of multi-
level analysis, see Jones (2008) and Steenbergen and Jones (2002).
30. I use the bracketed context[unit] notation advanced by Gelman and Hill (2007).
31. Multilevel models can readily accommodate a hierarchy of contextual effects. For in-
stance, our level 2 variables here could be nested with a higher level 3, and so forth.
32. The presentation here restricts the application of W to neighboring values of y, but there
are more general models like the Spatial Durbin model ($y_i = \alpha + \rho W y + \beta x_i + \gamma W x + \epsilon_i$),
in which neighboring exogenous values can also produce spatial effects (Anselin 1988). In
fact, the γ parameter captures the idea that neighborhood averages of covariates impact yi,
which is similar in orientation to previous approaches that measure context with neigh-
borhood measures that are analogous to individual covariates (e.g., the average neigh-
borhood income, or the percentage of the population with some individual trait, etc.).
However, these earlier analyses have been done using linear regression analysis, without
modeling the underlying spatial dependence (see Stipak and Hensler 1982 for a review).
Beyond spatially lagged covariates, other extensions include the possibility of analyzing
more than one spatial relationship concurrently (Lacombe 2004).
33. Contiguity can be assessed with different criteria. Two common approaches, informed by
legal chess moves, are rook and queen styles. For the former, two units are contiguous
if their encompassing areas share a boundary; the latter includes both boundaries and
corner points. See Bivand et al. (2008) for details.
34. See also Mutz (2007) for an overview of psychological studies of political behavior.

References
Agnew, J. 1996. “Mapping Politics:  How Context Counts in Electoral Geography.” Political
Geography 15 (2): 129–​146.
Alvarez, R. M. 1996. Studying Congressional and Gubernatorial Campaigns. California Institute
of Technology, Division of the Humanities and Social Sciences, Pasadena, CA.
Alvarez, R. M. 1997. Information and Elections. Ann Arbor: University of Michigan Press.


Alvarez, R. M., and P. Gronke. 1996. “Constituents and Legislators: Learning about the Persian
Gulf War Resolution.” Legislative Studies Quarterly 21 (1):105–​27.
Anselin, L. 1988. Spatial Econometrics:  Methods and Models. Dordrecht; Boston:  Kluwer
Academic Publishers.
Atkeson, L. R. 2003. “Not All Cues Are Created Equal:  The Conditional Impact of Female
Candidates on Political Engagement.” Journal of Politics 65 (4): 1040–​1061.
Berger, J., M. Meredith, and S. C. Wheeler. 2008. “Contextual Priming: Where People Vote
Affects How They Vote.” Proceedings of the National Academy of Sciences 105 (26): 8846–​8849.
Bivand, R., E. J. Pebesma, and V. Gómez-​Rubio. 2008. Applied Spatial Data Analysis with R.
New York, London: Springer.
Borgatti, S. P., A. Mehra, D. J. Brass, and G. Labianca. 2009. “Network Analysis in the Social
Sciences.” Science 323 (5916): 892–​895.
Braun, D., and F. Gilardi. 2006. "Taking 'Galton's Problem' Seriously: Towards a Theory of Policy Diffusion." Journal of Theoretical Politics 18 (3): 298–322.
Cohen, C. J., and M. C. Dawson. 1993. “Neighborhood Poverty and African American Politics.”
American Political Science Review 87 (2): 286–​302.
Conley, D. J., L. Aber, H. Brady, S. Cutter, C. Eckel, B. Entwisle, D. Hamilton, S. Hofferth, K.
Hubacek, E. Moran, and J. Scholz. 2015. “Big Data, Big Obstacles.” Chronicle Review. http://​
chronicle.com/​article/​Big-​Data-​Big-​Obstacles/​151421/​, Accessed on 2/​7/​2015.
Cranmer, S. J., and B. A. Desmarais. 2011. “Inferential Network Analysis with Exponential
Random Graph Models.” Political Analysis 19 (1): 66–​86.
Daraganova, G., P. Pattison, J. Koskinen, B. Mitchell, A. Bill, M. Watts, and S. Baum. 2012.
“Networks and Geography: Modelling Community Network Structures as the Outcome of
Both Spatial and Network Processes.” Social Networks 34 (1): 6–​17.
de Montjoye, Y.-​A., L. Radaelli, V. K. Singh, and A. “S.” Pentland. 2015. “Unique in the Shopping
Mall: On the Reidentifiability of Credit Card Metadata.” Science 347: 536–​539.
De Vries, C. E., W. Van der Brug, M. H. van Egmond, and C. Van der Eijk. 2011. “Individual
and Contextual Variation in EU Issue Voting: The Role of Political Information.” Electoral
Studies 30 (1): 16–​28.
Domingo-​Ferrer, J. 2008. “A Survey of Inference Control Methods for Privacy-​Preserving
Data Mining.” In Privacy-​Preserving Data Mining, edited by C. Aggarwal and P. S. Yu.
New York: Springer, 53-​80.
Drechsler, J. 2011. Synthetic Datasets for Statistical Disclosure Control Theory and
Implementation. New York: Springer.
Duch, R. M., and R. Stevenson. 2005. “Context and the Economic Vote: A Multilevel Analysis.”
Political Analysis 13 (4): 387–​409.
Eagles, M. 1995. “Spatial and Contextual Models of Political Behavior:  An Introduction.”
Political Geography 14 (6): 499–​502.
El Emam, K., E. Jonker, L. Arbuckle, and B. Malin. 2011. “A Systematic Review of Re-​
Identification Attacks on Health Data.” PLOS ONE 6 (12):  e28071. doi:10.1371/​journal.
pone.0028071.
Eulau, H., and L. Rothenberg. 1986. “Life Space and Social Networks as Political Contexts.”
Political Behavior 8 (2): 130–​157.
Federal Committee on Statistical Methodology (FCSM). 2005. “Report on Statistical Disclosure
Limitation Methodology.” Statistical Policy Working Paper 22. Washington, DC: Office of
Management and Budget.
Franzese, R. J., and J. C. Hays. 2007. “Spatial Econometric Models of Cross-​ Sectional
Interdependence in Political Science Panel and Time-​Series-​Cross-​Section Data.” Political
Analysis 15 (2): 140–​164.
Franzese, R. J., and J. C. Hays. 2008. "Interdependence in Comparative Politics: Substance, Theory, Empirics, Substance." Comparative Political Studies 41 (4–5): 742–780.
Gelman, A., and J. Hill. 2007. Data Analysis Using Regression and Multilevel/​Hierarchical
Models. Cambridge, UK and New York: Cambridge University Press.
Goodin, R. E., and C. Tilly, eds. 2006. The Oxford Handbook of Contextual Political Analysis. Oxford and New York: Oxford University Press.
Gordon, S. B., and G. M. Segura. 1997. “Cross-​National Variation in the Political Sophistication
of Individuals: Capability or Choice?” Journal of Politics 59 (1): 126–​147.
Harrington, Jan L. 2009. Relational Database Design and Implementation: Clearly Explained.
Cambridge, MA: Morgan Kaufmann.
Harris, J. K. 2013. An Introduction to Exponential Random Graph Modeling. Quantitative
Applications in the Social Sciences Vol. 173: Thousand Oaks, CA: Sage Publications.
Heaney, M. T., and S. D. McClurg. 2009. “Social Networks and American Politics: Introduction
to the Special Issue.” American Politics Research 37 (5): 727–​741.
Huckfeldt, R. 2007. “Information, Persuasion, and Political Communication Networks.” In
Oxford Handbook of Political Behavior, edited by R. J. Dalton and H.-​D. Klingemann. Oxford
Handbooks Online. doi:10.1093/​oxfordhb/​9780199270125.003.0006.
Huckfeldt, R. 2009. “Interdependence, Density Dependence, and Networks in Politics.”
American Politics Research 37 (5): 921–​950.
Huckfeldt, R., and J. Sprague. 1987. “Networks in Context:  The Social Flow of Political
Information.” American Political Science Review 81 (4): 1197–​216.
Iversen, G. R. 1991. Contextual Analysis. Newbury Park, CA: Sage Publications.
Jackson, M. O. 2008. Social and Economic Networks. Princeton, NJ: Princeton University Press.
Johnston, R., and H. E. Brady. 2002. “The Rolling Cross-​Section Design.” Electoral Studies
21: 283–​295.
Jones, B. S. 2008. “Multilevel Analysis.” In The Oxford Handbook of Political Methodology, edited
by J. M. Box-​Steffensmeier, H. E. Brady, and D. Collier, 605-​623. Oxford, New York: Oxford
University Press.
Kadushin, C. 2012. Understanding Social Networks:  Theories, Concepts, and Findings.
New York: Oxford University Press.
Kolaczyk, E. D. 2009. Statistical Analysis of Network Data:  Methods and Models.
New York: Springer Science & Business Media.
Lacombe, D. J. 2004. “Does Econometric Methodology Matter? An Analysis of Public Policy
Using Spatial Econometric Techniques.” Geographical Analysis 36 (2): 105–​118.
Lumley, T. 2010. Complex Surveys:  A Guide to Analysis Using R. Hoboken, NJ:  John Wiley
& Sons.
Lusher, D., J. Koskinen, and G. Robbins, eds. 2013. Exponential Random Graph Models
for Social Networks:  Theories, Methods, and Applications. Cambridge, UK:  Cambridge
University Press.
Mutz, D. C. 2007. “Political Psychology and Choice.” In Oxford Handbook of Political Behavior,
edited by R. J. Dalton and H.-​D. Klingemann. Oxford Handbooks Online. doi:10.1093/​
oxfordhb/​9780199270125.003.0005.
Nadeau, R., and M. S. Lewis‐Beck. 2001. “National Economic Voting in US Presidential
Elections.” Journal of Politics 63 (1): 159–​181.
Oakes, J. M., and J. S. Kaufman. 2006. Methods in Social Epidemiology. Vol. 1. Hoboken,
NJ: John Wiley & Sons.
Pacheco, J. S. 2008. “Political Socialization in Context: The Effect of Political Competition on
Youth Voter Turnout.” Political Behavior 30 (4): 415–​436.
Pettigrew, S., and B. Fraga. 2014. CCES Master Question List, V4. The Institute for Quantitative
Study of Society. http://​hdl.handle.net/​1902.1/​14743.
Prell, C. 2012. Social Network Analysis:  History, Theory & Methodology. Los Angeles,
London: SAGE.
Rabe-​Hesketh, S., and A. Skrondal. 2008. Multilevel and Longitudinal Modeling Using Stata.
College Station, TX: STATA Press.
Reiter, J. P. 2003. “Inference for Partially Synthetic, Public Use Microdata Sets.” Survey
Methodology 29 (2): 181–​188.
Reiter, J. 2012. “Statistical Approaches to Protecting Confidentiality for Microdata and Their
Effects on the Quality of Statistical Inferences.” Public Opinion Quarterly 76 (1): 163–​181.
Rolfe, M. 2012. Voter Turnout: A Social Theory of Political Participation. New York: Cambridge
University Press.
Shipan, C. R., and C. Volden. 2008. “The Mechanisms of Policy Diffusion.” American Journal of
Political Science 52 (4): 840–​857.
Sinclair, B. 2012. The Social Citizen:  Peer Networks and Political Behavior. Chicago,
London: University of Chicago Press.
Steenbergen, M. R., and B. S. Jones. 2002. “Modeling Multilevel Data Structures.” American
Journal of Political Science 46 (1): 218–​237.
Sterba, S. K. 2009. “Alternative Model-​Based and Design-​Based Frameworks for Inference
from Samples to Populations:  From Polarization to Integration.” Multivariate Behavioral
Research 44 (6): 711–​740.
Stipak, B., and C. Hensler. 1982. “Statistical Inference in Contextual Analysis.” American
Journal of Political Science 26 (1): 151–​175.
Ward, M. D., and K. S. Gleditsch. 2008. Spatial Regression Models. Los Angeles:  Sage
Publications.
Zuckerman, A. S., ed. 2005. The Social Logic of Politics:  Personal Networks as Contexts for
Political Behavior. Philadelphia: Temple University Press.
Chapter 24

Measuring Public Opinion with Social Media Data

Marko Klašnja, Pablo Barberá, Nicholas Beauchamp, Jonathan Nagler, and Joshua A. Tucker

Social Media and Public Opinion: Opportunities and Challenges

Social media sites such as Facebook and Twitter are playing an increasingly central
role in politics. As Kreiss (2014) shows, the 2012 Barack Obama and Mitt Romney pres-
idential election campaigns relied heavily on social media to appeal to their supporters
and influence the agendas and frames of citizens and journalists. In 2016 the role of
social media accelerated, with Twitter, for example, becoming a central pillar of the
Trump campaign. Social media sites have also been essential for disseminating infor-
mation and organizing during many recent episodes of mass protest, from the pro-​
democracy revolutions during the Arab Spring to Euromaidan to the recent wave of
pro–​civil rights demonstrations in the United States (see, e.g., Tufekci and Wilson 2012;
Tucker et al. 2016). The influence of social media has also become pervasive in tradi-
tional news outlets. Twitter is commonly used as a source of information about breaking
news events, journalists and traditional media often solicit feedback from their viewers
through social media, and political actors can rely on social media rather than press
releases to reach the public. Most fundamentally, for numerous political organiza-
tions and millions of users, social media have become the primary means of acquiring,
sharing, and discussing political information (Kwak et al., 2010; Neuman et al., 2014).
This chapter examines to what extent one can aggregate political messages
published on social networking sites to obtain a measure of public opinion that is
comparable or better than those obtained through surveys. It is well known that
public opinion surveys are facing growing difficulties in reaching and persuading re-
luctant respondents (De Leeuw and De Heer 2002). According to the Pew Research
Center, the typical contact rates dropped from 90% to 62% between 1997 and 2012,
with response rates dropping from about 40% to 9% (Pew Research Center 2012).1
One important reason for these trends is the falling rate of landline phone use,
coupled with the fact that federal regulations prohibit the use of automated dialers
for all unsolicited calls to cell phones (but not landline phones). According to one
estimate, the share of cell-​phone-​only households in the United States has grown by
70% in four years, reaching 44% of all households in 2014.2 While the relationship
between nonresponse rates and nonresponse bias—​which arises when those who
answer are different from those who do not—​is complex (Groves 2006; Groves and
Peytcheva 2008), survey responders tend to be more likely to vote, contact a public
official, or volunteer than are survey nonresponders (e.g., Pew Research Center
2012). The responders’ answers tend to exhibit less measurement error and lower so-
cial desirability bias (Abraham, Helms, and Presser 2009; Tourangeau, Groves, and
Redline 2010). The cell-​phone-​only respondents can differ in political preferences
than those with landline phones; for example, they were significantly more likely to
support Obama in 2008, especially older voters (Mokrzycki, Keeter, and Kennedy,
2009). These trends have raised questions about the reliability and precision of rep-
resentative surveys and have increased the costs of fielding high-​quality polls, at the
same time that funding available for a number of established large-​scale surveys has
been threatened.3
These factors are increasing the incentives for using social media to measure public
opinion. First and foremost, social media provide an opportunity to examine the
opinions of the public without any prompting or framing effects from analysts. Rather
than measure what someone thinks about politics in the artificial environments of a
front porch, dinnertime phone call, or survey web page, we can observe how people
spontaneously speak about politics in the course of their daily lives. And instead of
depending on the analyst’s view of which topics are important at any given time, we can
observe the topics that the public chooses to raise without our prompting.
The second major appeal of social media data is their reach:  over time, across
individuals, cross-​nationally, and within small geographical regions. Due to the fine-​
grained nature of Twitter and Facebook data, for example, it should be possible to
measure changes in opinion on a daily or even hourly basis. Similarly, because hundreds
of millions of people use Twitter and Facebook regularly, the scope of opinion that can
be measured goes far beyond anything we could previously have attempted. And since
social media can be found throughout the world, they provide a convenient platform for
sampling opinion in many countries where it would otherwise be difficult or impossible
for survey researchers to work. In fact, it is likely that the Twitter archive is already the
largest cross-​national time-​series data set of individual public opinion available to the
mass public.4
The third appeal of using social media to measure public opinion is the cost and prac-
ticality. With a little programming and a decent-​sized hard drive, anyone can capture,
for example, every comment made about a presidential debate, in real time and for free.
To the extent that we care about public opinion because we think it helps to hold rulers
more accountable and to make policy more responsive to the mass citizenry, the poten-
tial to dramatically reduce the cost of studying public opinion may be perhaps the most
exciting opportunity afforded by social media.5
Of course while social media have desirable properties that traditional public
opinion surveys cannot match, truly developing tools to effectively harness their poten-
tial involves enormous challenges, discussed in the next section. Each of the strengths
discussed above also constitutes a challenge—​both theoretical and technical—​for meas-
uring opinion in the ways we are used to using traditional surveys. First, identifying
emergent topics and sentiments is hugely challenging, not just computationally but the-
oretically, as we strive to understand machine-​or human-​generated summaries and rec-
oncile them with previous survey measures and research agendas. Second, the breadth
and scale of social media use is counterbalanced by the opacity of its user population,
and the steps needed to reweight this entirely unrepresentative "survey" in order to
measure any population of interest remain difficult and uncertain. Third, the technical
challenges of collecting and aggregating the data are nontrivial, particularly given the
diffident and often opaque cooperation of private social media providers like Twitter
and Facebook.
We believe that many of these challenges involved in using social media data to
study public opinion can be overcome, and that the potential payoff certainly justifies
the effort. But we also believe it is crucial to be upfront about these challenges moving
forward, and therefore one of the goals of this chapter is to lay these out explicitly. In
the second section we discuss in greater detail these three main challenges and how
they have arisen in past social media research. In the third section we discuss some of
the strategies for overcoming many of these challenges, both drawing upon past work
that suggests various successful strategies and suggesting new ones. And in the fourth
section we discuss in greater detail some of the novel uses for social media, ones that
have fewer direct analogs in traditional survey work. We conclude in the fifth section
with a series of recommendations for a research agenda that uses social media for public
opinion work, as well as providing a list describing how social media data were collected
that we suggest scholars and practitioners use when reporting any results based on so-
cial media data, but especially when reporting results claiming to be representative of
public opinion.
We focus here on Twitter data because they are widely used, mainly public, and
relatively easy to collect; for these reasons, these data have been the focus of the ma-
jority of recent social media research. But of course all of these concepts apply more
generally, and the difficulties and solutions we propose here will likely continue well
into a future in which social media platforms that do not exist yet may dominate the
landscape.
Challenges in the Measurement of Public Opinion with Social Media Data

In the study of public opinion, a survey is commonly defined as a systematic method for
gathering information from a sample of individuals for the purposes of constructing
quantitative descriptors of the attributes of the larger population of which the
individuals are members (see, e.g., Groves et al. 2011). This information is commonly
gathered by asking people questions. The three core components of a survey are thus
a standardized questionnaire, a population frame from which individuals are sampled
using a probability sampling method, and a method to aggregate individual responses to
estimate a quantity of interest.
Without any adjustment, treating social media data as a survey fails to meet any of
these three criteria: the opinions expressed by individuals are unprompted and unstruc-
tured, the probability that an individual is included in the sample varies in systematic
but opaque ways, and the collection and aggregation of data into quantities of interest
are problematic due to uncertainties in the data-​generating and -​collection processes. In
this section we describe these difficulties and why they are critical for the measurement
of public opinion with social media data.
When we speak of trying to measure “public opinion” we are primarily concerned
with the traditional notion of who “the public” is: adults in a particular polity (or set
of polities). However, one of the benefits of social media is that there is no such con-
straint on whose opinion is uttered on social media. We are potentially able to measure
subpopulations of interest within a polity, such as ethnic groups, ideological groups,
or speakers of particular languages (Metzger et al. 2016). On the other hand, this also
extends to populations such as children, political activists, persecuted minorities,
and other subpopulations that want, expect, or deserve privacy in their online activi-
ties. Fully tackling the myriad ethical issues entailed in using social media to measure
public opinion would require an entire chapter, but we should be aware that such issues
permeate every stage discussed below. Some of these issues are common to any sort of
collection of publicly available data, including the issue of consent regarding data that
have been made public but may be used in ways not anticipated by the participant, and
the collection of data from minors and others not able to give consent themselves. Other
issues are common to data collection and storage more generally, including data pro-
tection and anonymization, and specific to sharing data, particularly for replication
purposes. Other questions are more specific to social media data, including how to deal
with posts that were deleted by users after the data were collected, the potential privacy
violations inherent in using sophisticated machine-​learning methods to infer demo-
graphic and other characteristics that had not been publicly revealed, and the question
of whether the results of these analyses could put exposed users at risk of political or
other forms of retaliation (Flicker, Haans, and Skinner 2004; Tuunainen, Pitkänen,
and Hovi 2009; Zimmer 2010; Solberg 2010; Bruns et al. 2014). The scope of ethical
considerations in social media studies is rapidly growing, but for our purposes we focus
here on the technical challenges of measuring public opinion.

Identifying Political Opinion


If we seek to fit social media into the framework of existing public opinion measure-
ment, we may consider social media posts as something like unstructured and en-
tirely voluntary responses to external stimuli analogous to public opinion questions. In
this sense, like a survey question, the stimuli set (or affect) the topics, and our job is
to identify these topics and turn the unstructured responses into something like sen-
timent, approval levels, feeling thermometers, or the like. In both traditional surveys
and unstructured social media, we have something like a subject (the question, or a
post’s topic) and a predicate (the numeric response, or a post’s sentiment), and we seek
to turn the raw data of the unstructured social media text and metadata into something
more like the structured survey responses we are familiar with. This analogy is often
latent in research using social media to measure public opinion, but making it more ex-
plicit clarifies a number of issues in putting social media to such a use. This distinction
has also been referred to as the distinction between “designed data” and “organic data”
(Groves 2011). Whereas traditional collections of public opinion data, or data on eco-
nomic behavior based on survey responses, are curated and created by the designer with
intent in mind, many data sets now available are based on data that exist simply because
much human behavior occurs online—​and is recorded.
First, the questions are not directly asked of people; instead, people give their opinions
in response to events and discussions. How do we define what topics to examine and
which tweets are relevant for a given topic? For example, if we want to measure users’
sentiment toward the candidates in the 2016 presidential election, how do we identify
a corpus of relevant tweets? The vast majority of studies focus on tweets mentioning
candidate names, without discussing the possibility of systematic selection bias in de-
termining the search criteria in this way (but see King, Lam, and Roberts 2014). For
example, focusing only on tweets that mention Hillary Clinton or Donald Trump may
miss a number of social media messages that also relate to the 2016 election but do not
mention candidate names (He and Rothschild 2014). If tweets that refer to either can-
didate without using the candidate’s name tend to be either more positive or negative
than tweets that do explicitly mention the candidate’s name, then obviously selecting
tweets based on the use of the name will generate a large amount of selection bias. And
that’s just bias relative to the full corpus of tweets on the candidates, even apart from
bias relative to the population of interest; perhaps only persons with some particular
characteristic use particular terms. If that is the case, and we omit terms used by that
group, we will fail to measure group opinion accurately. Even without generating bias
by collecting based on candidate names, collections based on names may include sub-
stantial noise or miss substantial numbers of tweets. Tweets containing “Hillary” in 2016
may be predominantly about Hillary Clinton, but tweets containing “Trump” or “Cruz”
may not be about either Donald Trump or Ted Cruz, thus adding noise to the corpus.
Filtering on tweets containing “Donald Trump” or “Ted Cruz” may miss many tweets
actually focused on the candidates. In general, the choice of the relevant corpus of tweets
is almost invariably ad hoc, in part because the analyst cannot be omniscient about what
constitutes the set of tweets related to a given topic.
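The sensitivity of the corpus to the search criteria is easy to demonstrate with a toy example in R; the tweets and keyword sets below are invented solely to show how two plausible filters select different, partly noisy subsets.

```r
# Two plausible keyword filters applied to the same (toy) set of tweets.
tweets <- c("Hillary's plan on trade is solid",
            "trump that hand with the ace of spades",   # card-game use of "trump"
            "Can't wait to vote in November",            # election-related, no names
            "Ted Cruz town hall tonight",
            "Clinton and Trump debate recap")

narrow <- grepl("hillary clinton|donald trump", tweets, ignore.case = TRUE)
broad  <- grepl("hillary|clinton|trump|cruz",   tweets, ignore.case = TRUE)

tweets[narrow]   # full-name filter misses most candidate-related tweets
tweets[broad]    # surname filter catches more, but pulls in the card-game "trump"
```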
In addition to defining the topics and content that shape the collected data, meas-
uring the topics in individual tweets, particularly when topics may be rapidly changing
over time or responding to major events, remains both a technical and theoretical
challenge. Are people responding to changing questions on the same topic, are the
topics themselves changing, or do we need a complex hierarchical and temporal struc-
ture of all our content before we can begin to quantify public opinion in a systematic
way? For example, during the presidential debates there was considerably more com-
mentary and discussion among users than at other times, when information-​sharing
tweets (with high frequency of URLs within tweets) were more common (Diaz et al.
2014). Similarly, during politically charged events, such as the Wisconsin labor strikes
of 2011, many social media users seem to have been particularly focused on tweeting
non-​mainstream news and alternative narratives of the protest, unlike during less con-
tentious events (Veenstra et al. 2014). The same occurs during mass protest events, since
regime elites can respond strategically to protest and try to shift the focus of the discus-
sion (Munger 2015; King, Pan, and Roberts 2016). Topics themselves can change, be-
cause the comments on social media may represent a change in public opinion: either
the development of a new issue that was previously not part of political discourse or the
disappearance of an issue from public concern. The set of topics dominating political
discussion in 2000 would be very different than the set of topics dominating political
discussion in 2016. And just as refusal to answer surveys may not be random, but may
vary systematically with the likely response, discussion on any issue on social media
may vary with context. During a period of “good news” for a candidate, we may see more
tweets by the candidate’s supporters, and vice versa. Thus the population, topics, and
sentiments may all be continually shifting in ways that are very challenging to measure.
Even assuming we are able to resolve the subject—​the topics—​what of the pred-
icate:  the sentiment, approval, enthusiasm, and so forth? What exactly is the quan-
tity of interest? Simple counts of mentions of political parties or issues have in some
cases produced meaningful results. For example, Tumasjan et  al. (2010) and Skoric
et al. (2012) showed that mentions of parties on Twitter were correlated with election
results. However, that is often not the case (Metaxas, Mustafaraj, and Gayo-​Avello
2011; Bermingham and Smeaton 2011). In fact, Gayo-​Avello (2011) showed that tweet-​
counting methods perform worse than a random classifier assigning vote intentions
based on the proportion of votes from a subset of users who directly revealed their
election-​day intentions to the researcher. Similarly, Jungherr, Jürgens, and Schoen
(2012) criticize the tweet-​counting method used by Tumasjan et al. (2010) to predict
German elections for focusing only on tweets mentioning the largest parties. They show
that if tweets mentioning a new party—​the Pirate Party—​were counted as well, the
results differed considerably and mispredicted the election outcome, as the Pirate Party
would have been predicted as the clear election winner, whereas in fact it won only 2% of the vote (see
also Jungherr et al. 2016).
One common alternative to counting methods is the use of sentiment analysis, which
aims at measuring not the volume of tweets on a particular topic, but the valence of
their content. This method often relies on existing dictionaries of positive and negative
words, in which the ratio of positive to negative words that co-​occur with a topic on, for
example, a given day, is taken as a measure of the overall public sentiment about that
topic on that day. For example, O’Connor et al. (2010) show that Twitter sentiment over
time in economics-​related tweets is correlated with consumer confidence measures in
the United States. The downside of this approach is that its performance can vary in un-
predictable ways. The approach depends on potentially ad-​hoc dictionaries and often
exhibits low out-​of-​sample accuracy (González-​Bailón and Paltoglou 2015) and even
significant differences in its performance across different applications within a similar
context. For example, Gayo-​Avello (2011) finds that the performance of a lexicon-​based
classifier was considerably more reliable for tweets about Barack Obama than about
John McCain during the 2008 election campaign.
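A stripped-down version of such a lexicon-based classifier can be written in a few lines of R; the word lists here are tiny, hypothetical stand-ins for published sentiment dictionaries, and the scoring rule is only meant to show where dictionary choice enters the measurement.

```r
# Dictionary-based sentiment sketch: score each tweet by the balance of
# positive and negative lexicon hits, scaled to the range -1 to +1.
positive <- c("great", "win", "strong", "hope")
negative <- c("bad", "lose", "weak", "fear")

score_tweet <- function(text) {
  words <- strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+")[[1]]
  pos <- sum(words %in% positive)
  neg <- sum(words %in% negative)
  (pos - neg) / max(pos + neg, 1)          # 0 when no lexicon words are found
}

tweets <- c("Great debate, strong performance!",
            "Weak answers and bad ideas.",
            "Watching the debate tonight.")
sapply(tweets, score_tweet)
```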
Finally, even if we have a good method for measuring topics and, for example,
sentiments, it is not at all clear that what we are measuring is necessarily an honest ex-
pression of opinion. It remains unknown to what degree the (semi-​)public nature of
social media could induce stronger social desirability bias than in the context of tradi-
tional survey responses. On the one hand, given potential social stigma, users may be
even less likely to reveal attitudes on sensitive topics than they are in standard surveys
(Newman et al. 2011; Pavalanathan and De Choudhury 2015), and individuals can con-
trol their content after it is posted, with changes and edits potentially inducing selection
bias in the type of content that remains (Marwick and Boyd 2011). On the other hand,
Twitter in particular does allow users a certain degree of anonymity (though perhaps
less than they think) and thus may allow individuals to express their true preferences
and attitudes more honestly than in many traditional surveys (Joinson 1999; Richman
et al. 1999). However, to our knowledge this potential issue has not been examined sys-
tematically in the context of measuring public opinion on political (particularly sensi-
tive) topics.

Representativeness of Social Media Users


One crucial advantage we lose with social media relative to traditional surveys is the
opportunity to control our sampling frame. Traditional surveys attempt to guarantee
a known probability of any individual in the population being asked a survey question.
Where those surveys fail is in both high and non-​random nonresponse, and non-​
random item nonresponse. With social media, since control of the sampling frame is
lost, we can neither know the likelihood that someone has been asked a “question” nor
know the likelihood of a response. In a traditional survey, to generalize to a target pop-
ulation we have to assume that nonresponses are missing at random, or that they are
missing at random conditioning on measured covariates. In the best of worlds, this is a


strong assumption; it may be that people who choose not to reveal their preferences on
something are systematically different than those who do reveal their preferences. On
social media, where we do not ask the question but depend on the participants to re-
veal their opinions, we might have more trouble. The set of people offering unprompted
opinions on a topic may be more passionate or different in myriad other ways from the
set of people who offer opinions on that topic when explicitly asked. This presumably
makes our missing data problems far worse than those caused by traditional survey
nonresponse.
It would of course be extremely unwise to generalize directly from Twitter behavior
to any of the standard populations of interest in most surveys. A number of studies
have demonstrated that Twitter users are not representative of national populations
(Duggan and Brenner 2015; Mislove et al. 2011; Malik et al. 2015). For example, in the
United States most populous counties are overrepresented, and the user popula-
tion is nonrepresentative in terms of race (Mislove et al. 2011). Comparing geotagged
tweets and census data, Malik et al. (2015) also demonstrate significant biases toward
younger users and users of higher income. Differences in usage rates of social media
platforms across countries are also an obstacle for the comparative study of public
opinion (Mocanu et al. 2013). These differences are also present, although perhaps to a
lesser extent, in the analysis of other social media platforms like Facebook (Duggan and
Brenner 2015).
For the purposes of the study of public opinion, however, it is more important
whether and how representative the politically active Twitter users are relative to the
general population. But here, too, the evidence is consistent with Twitter users being
highly nonrepresentative. For example, women are a majority of Twitter users but a much smaller share of politically active Twitter users (Hampton et al. 2011);
politically active Twitter users are more polarized than the general population (Barberá
and Rivero 2014); and they are typically younger, better educated, more interested in
politics, and ideologically more left wing than the population as a whole (Vaccari et al.
2013). Crucially, nonrepresentativeness may even vary by topic analyzed, as different
issues attract different users to debate them (Diaz et al. 2014).6
Evaluating the representativeness of Twitter users is not straightforward, given that
unlike standard surveys, Twitter does not record precise demographic information.
Instead, most studies try to infer these characteristics. While some approaches have
been quite successful (see, e.g., Al Zamal, Liu, and Ruths 2012a; Barberá and Rivero
2014), these are still approximations. These difficulties can be compounded by the possi-
bility of bots and spammers acting like humans (Nexgate 2013), especially in the context
of autocratic regimes (Sanovich 2015). It becomes much harder to infer how representa-
tive tweets are of any given population if some tweets come from automated computer
programs, not people. And even determining how many people are paying attention to
discussions is problematic, as fake accounts can be used to inflate common metrics of
popularity. For example, one study found at least twenty sellers of followers on eBay, at
an average price of $18 per thousand followers, demonstrating how fake accounts can
rack up followers very easily (Barracuda Labs 2012).7 In addition, there may be impor-
tant deviations from one-​to-​one correspondences between individual users and indi-
vidual accounts, given the existence of duplicate and parody accounts and accounts that
represent institutions, companies, or products, such as the White House, Walmart, or
Coca-​Cola.
Moreover, the demographic composition of users can change over time, particularly in
response to important events, such as presidential debates or primaries. These changes may
be quite unpredictable. For example, during the debates in the 2012 U.S. presidential election, the male overrepresentation among political tweeters dropped
significantly, whereas the geographic distribution of tweets (by region) became consider-
ably less representative (Diaz et al. 2014). In the Spanish context, Barberá and Rivero (2014)
find that important events during the 2011 legislative election, such as party conferences
and the televised debates, increased the inequality on Twitter by increasing the rate of par-
ticipation of the most active and most polarized users. It is important to keep these shifts
in mind, since changes in raw aggregates of public opinion may be due to these shifts in demographic
composition rather than any shifts in actual opinion (see, e.g., Wang et al. 2015).

Aggregating from Individual Responses to Public Opinion
A number of other platform-​specific issues also affect researchers’ ability to aggregate
individual social media messages. At present, access to 100% of tweets is only available
through third-​party companies like Gnip (recently bought by Twitter) at prices often be-
yond what most researchers can afford. Instead, researchers rely on Twitter’s streaming
application programming interface (API), which provides content only in real time, not as historical data. That means most researchers have to anticipate in advance the period
of study they will focus on. Results can change significantly when using different time
windows (Jungherr 2014), which can lead to ad hoc choices of period of coverage and a
non-​negligible likelihood of missing key events.
Most important, Morstatter et  al. (2013) and González-​Bailón et  al. (2014) found
significant differences between the full population of tweets (the so-​called Twitter
“firehose”) and the samples obtained through Twitter’s streaming API, the most pop-
ular source of data used by researchers. In particular, it appears that the rate of coverage (the share of relevant content provided by the streaming API relative to all content) varies considerably over time; that the topics extracted through text analysis from the streaming API data can differ significantly from those extracted from the Firehose data; that users who participate less frequently are more likely to be excluded from the sample; and that top hashtags from the streaming API data can deviate significantly from the full data when focusing on a small number of hashtags.
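To make such coverage checks concrete, the following minimal sketch (in Python, with hypothetical tweet texts) compares the top hashtags in a streamed sample against those in a fuller collection; the function names and data are our own illustration, not code from the studies cited above.

from collections import Counter

def top_hashtags(tweets, k=10):
    """Count hashtags (tokens starting with '#') and return the k most common."""
    counts = Counter(tok.lower() for t in tweets for tok in t.split() if tok.startswith("#"))
    return [tag for tag, _ in counts.most_common(k)]

def coverage_overlap(sample_tweets, full_tweets, k=10):
    """Share of the full data's top-k hashtags that also appear in the sample's top-k."""
    sample_top = set(top_hashtags(sample_tweets, k))
    full_top = set(top_hashtags(full_tweets, k))
    return len(sample_top & full_top) / float(k)

# Hypothetical inputs: tweet texts from a streamed sample and from a fuller archive.
sample = ["#gop debate tonight", "watching the #debate", "#debate #gop"]
full = ["#gop debate tonight", "watching the #debate", "#economy talk", "#debate #economy"]
print(coverage_overlap(sample, full, k=2))  # 0.5 in this toy example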
In those cases in which researchers are interested in aggregating data from social
media to specific geographic units such as a state or congressional district, they face the
problem that only a small proportion of tweets are annotated with exact coordinates
(Leetaru et al. 2013). Geolocated tweets are highly precise but are not a representative
subset of all tweets (Malik et al. 2015). An alternative is to parse the text in the “location”
field of users’ profiles. While this increases the degree of coverage, it is not a perfect solu-
tion either, as Hecht et al. (2011) found that up to a third of Twitter users do not provide
any sort of valid geographic information in this field.
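A minimal sketch of this kind of location-field parsing, assuming a small hand-built lookup of place names (a real application would need a much richer gazetteer and more careful handling of the noisy entries Hecht et al. document), might look as follows.

import re

# Hypothetical, abbreviated lookup; a real gazetteer would cover all states, cities, and variants.
STATE_PATTERNS = {
    "new mexico": "NM", "nm": "NM",
    "california": "CA", "ca": "CA",
    "new york": "NY", "ny": "NY",
}

def parse_location(profile_location):
    """Try to map a free-text profile 'location' field to a U.S. state code; None if no match."""
    if not profile_location:
        return None
    text = profile_location.lower()
    for pattern, state in STATE_PATTERNS.items():
        if re.search(r"\b" + re.escape(pattern) + r"\b", text):
            return state
    return None

print(parse_location("Albuquerque, New Mexico"))  # NM
print(parse_location("justin bieber's heart"))    # None: no valid geographic information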
Finally, one important issue often overlooked in social media studies is that, given
Twitter’s opt-​in nature, tweets often cannot be treated as independent because many
individuals tweet multiple times. It is often the case that a minority of unique individuals
dominates the discussion in terms of tweet and retweet volume, making oversampling
of most active users very likely (Barberá and Rivero 2014; Gruzd and Haythornthwaite
2013; Mustafaraj et al. 2011). For example, in the run-​up to the 2012 presidential election,
70% of tweets came from the top 10% of users, with 40% of the tweets coming from the
top 1% of users (Barberá and Rivero 2014). This problem is exacerbated by practices such
as astroturfing—​coordinated messaging from multiple centrally controlled accounts—​
disguised as spontaneous behavior (Castillo, Mendoza, and Poblete 2011; Morris et al.
2012). Importantly, politically motivated actors use astroturf-​like strategies to influence
the opinions of their candidates during electoral campaigns (Kreiss 2014; Mustafaraj
and Metaxas 2010). The more influential the attempts to characterize the behavior of online users become, the greater the incentive to manipulate such behavior (Lazer et al. 2014).
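The degree of concentration is easy to check in any collection one has gathered; the sketch below, which assumes each tweet record carries a user identifier, computes the share of all tweets contributed by the most active fraction of users. The data are hypothetical.

from collections import Counter

def share_from_top_users(user_ids, top_fraction=0.10):
    """Fraction of all tweets sent by the most active `top_fraction` of unique users."""
    counts = Counter(user_ids)                      # tweets per user
    ranked = sorted(counts.values(), reverse=True)  # activity counts, most active first
    n_top = max(1, int(len(ranked) * top_fraction))
    return sum(ranked[:n_top]) / float(len(user_ids))

# Hypothetical data: one entry per tweet, identifying its sender.
tweets_by_user = ["u1"] * 70 + ["u2"] * 10 + ["u3"] * 10 + ["u4"] * 5 + ["u5"] * 3 + ["u6"] * 1 + ["u7"] * 1
print(share_from_top_users(tweets_by_user, top_fraction=0.10))  # 0.7 for this toy example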

How Should It Be Done? Potential Solutions and Areas for Future Research

We argue that the three concerns about using social media data to measure public
opinion outlined in the previous section—​measuring opinion, assessing representative-
ness, and overcoming technical challenges in aggregation—​are the main challenges to
overcome in this field. Each of these stages has its analog in traditional survey meth-
odology, but each presents unique challenges when using social media. In this section
we describe how previous studies have tried to address these issues and suggest potential solutions that could be implemented in future research.

Better Methods for Identifying Political Opinion


In choosing the corpus of tweets that will be included in the analysis, previous studies
often defined a set of ad hoc search criteria, such as a list of hashtags related to an event
or the names of political actors. This is partially driven by the limitations imposed by
Twitter’s streaming API and researchers’ inability to collect historic data freely. We claim
that it is necessary to establish more systematic criteria to select what set of tweets will be
included in the sample.
One approach that has yielded promising results is the development of automated se-
lection of keywords. He and Rothschild (2014) apply such a method in their study of the
2012 U.S. Senate elections. They started with a corpus collected using candidate names, then iteratively expanded it by identifying the most likely entities related to each candidate. Their final corpus was 3.2 times larger, which gives an indication of the magnitude
of the potential biases associated with simple keyword selection methods. For example,
they find that the aggregate sentiment of tweets mentioning only candidate names is
different from that of the extended corpus after applying their selection method. King,
Lam, and Roberts (2014) also propose a similar method that adds human supervision in
the selection of new keywords to resolve linguistic ambiguities and reduce the propor-
tion of false positives.
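The following sketch conveys the intuition behind such keyword expansion with a single round of co-occurrence counting on hypothetical data; it is a toy illustration, not the algorithm of He and Rothschild (2014) or King, Lam, and Roberts (2014), which add statistical scoring and human review of candidate terms.

from collections import Counter

STOPWORDS = {"the", "on", "a", "of", "and", "to"}

def expand_keywords(tweets, seeds, n_new=5):
    """One round of co-occurrence-based keyword expansion: among tweets matching the seed
    terms, return the non-seed, non-stopword terms that appear most often."""
    seeds = {s.lower() for s in seeds}
    co_counts = Counter()
    for text in tweets:
        tokens = set(text.lower().split())
        if tokens & seeds:                                # tweet matches at least one seed term
            co_counts.update(tokens - seeds - STOPWORDS)  # tally the other terms it uses
    return [term for term, _ in co_counts.most_common(n_new)]

# Hypothetical corpus and seeds for a Senate race between candidates "Warren" and "Brown".
corpus = [
    "warren town hall tonight",
    "warren on the banking bill",
    "brown responds on the banking bill",
    "great game tonight",
]
print(expand_keywords(corpus, ["warren", "brown"], n_new=3))  # e.g., ['banking', 'bill', 'town']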
An alternative solution is to abandon keyword filtering altogether and instead sample
at the user level. As Lin et al. (2013) demonstrate, tracking opinion shifts within a care-
fully selected group of Twitter users can overcome some of the limitations mentioned
above by learning from users' prior behavior to detect their biases and control for them in any analysis.8 These “computational focus groups” can be further improved if they are
combined with surveys of Twitter users that contain questions about sociodemographic
and political variables (Vaccari et al. 2013).
In addition to topics, the other half of assessing opinion is the predicate side, such as
the estimation of sentiment about those topics. One of the most successful examples
of sentiment analysis applied to election prediction, the Voices from the Blogs proj­
ect (Ceron et al. 2014; Ceron, Curini, and Iacus 2015), combines supervised learning
methods with human supervision in the creation of data sets of labeled tweets that
are specific to each example. González-​Bailón and Paltoglou (2015) conducted a sys-
tematic comparison of dictionary and machine-​learning methods, finding similar
results: classifiers trained with a random sample of the data set to be used for prediction
purposes outperformed dictionary methods, which are in many cases no better than
random. One possible refinement of application-​specific methods is the combination of
topic models and sentiment analysis (Fang et al. 2015), which could leverage differences
in words’ usage across different topics to improve the performance of these techniques.
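As a schematic illustration of the difference between the two families of methods, the sketch below contrasts a tiny hand-built dictionary with a supervised classifier trained on a few labeled tweets using scikit-learn; the word lists, labels, and texts are hypothetical, and real applications require far larger, application-specific training sets.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled tweets (1 = positive toward the candidate, 0 = negative).
texts = [
    "great speech by the candidate tonight",
    "the candidate was impressive and clear",
    "terrible answers, a total disaster",
    "what a dishonest and weak performance",
]
labels = [1, 1, 0, 0]

# (1) A tiny dictionary method: count positive minus negative words.
POSITIVE = {"great", "impressive", "clear"}
NEGATIVE = {"terrible", "disaster", "dishonest", "weak"}
def dictionary_score(text):
    tokens = text.lower().split()
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

# (2) A supervised classifier trained on the labeled sample.
clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(texts, labels)

new_tweet = "another disaster of a speech"
print(dictionary_score(new_tweet), clf.predict([new_tweet])[0])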

Increasing Representativeness
The majority of studies using Twitter data, particularly those estimating voting
preferences and predicting election outcomes, do not attempt to address the
nonrepresentativeness of (politically active) Twitter users (the exceptions include Gayo-Avello 2011; Choy et al. 2011, 2012). In fact, many of these studies do not clearly specify
the target population, which in the case of electoral predictions should be the voting
population. The implicit assumption is that the size of the data, the diversity of Twitter
users, and the decentralized nature of social media may compensate for any potential
bias in the sample. Of course, in the cases where it has been studied, the set of Twitter users is not representative of typical target populations such as voters or eligible
voters (see, e.g., Duggan and Brenner 2015).
Significantly more work is needed to examine the plausibility of these assumptions.
On the one hand, for predictive purposes, the skew in the sample may not be problem-
atic if politically active users on Twitter act as opinion leaders who can influence the
behavior of media outlets (Ampofo, Anstead, and O’Loughlin 2011; Farrell and Drezner
2008; Kreiss 2014) or a wider audience (Vaccari et  al. 2013). On the other hand, as
discussed in the previous section, the nonrepresentativeness of these users relative to
the general population may be quite severe, suggesting that the biases may not balance
out unless addressed by reweighting.
One potentially promising method is multilevel regression and post-​stratification
(MRP), particularly because it relies on post-​stratification adjustments to correct for
known differences between the sample and the target population (Little 1993; other po-
tential weighting approaches can be found in AAPOR 2010). Somewhat like traditional
weighting in telephone or online polls, this approach partitions the target population
into cells based on combinations of certain demographic characteristics, estimates
via multilevel modeling the variable of interest in the sample within each cell (e.g., av-
erage presidential approval for white females, ages 18–​29), and then aggregates the cell-​
level estimates up to the population level by weighting each cell by the proportion in
the target population (Park, Gelman, and Bafumi 2004; Lax and Phillips 2009). This
approach has been fruitfully used to generate quite accurate election predictions from
highly nonrepresentative samples, such as XBox users (Wang et al. 2015).
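The post-stratification step can be illustrated with the following sketch, which uses hypothetical sample and census tables and replaces the multilevel model with simple cell means; it is meant only to show how cell-level estimates are weighted up to a population estimate, not to reproduce a full MRP analysis.

import pandas as pd

# Hypothetical inputs. sample: one row per user with an estimated opinion (1 = approve).
# census: population share of each demographic cell (shares sum to 1).
sample = pd.DataFrame({
    "age":     ["18-29", "18-29", "30-64", "30-64", "65+"],
    "gender":  ["f", "m", "f", "m", "f"],
    "approve": [1, 1, 0, 1, 0],
})
census = pd.DataFrame({
    "age":    ["18-29", "18-29", "30-64", "30-64", "65+", "65+"],
    "gender": ["f", "m", "f", "m", "f", "m"],
    "share":  [0.10, 0.10, 0.30, 0.30, 0.12, 0.08],
})

# Cell-level estimates; full MRP would replace this groupby mean with a multilevel
# (partially pooled) model so that sparse cells borrow strength from similar cells.
cell_means = sample.groupby(["age", "gender"], as_index=False)["approve"].mean()

# Post-stratify: weight each cell estimate by its population share.
merged = census.merge(cell_means, on=["age", "gender"], how="left")
merged["approve"] = merged["approve"].fillna(merged["approve"].mean())  # crude fix for empty cells
estimate = (merged["approve"] * merged["share"]).sum() / merged["share"].sum()
print(round(estimate, 3))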
The main challenge with this approach is of course to obtain the detailed sample dem-
ographics needed for post-​stratification. Twitter does not collect or provide data on
demographics. And unlike some other platforms such as Facebook, Twitter metadata
and profile feeds contain limited information to directly classify users. There are two
ways to address this concern: first, consider demographic variables as latent traits to be
estimated, and second, augment Twitter data with other types of data, such as voter reg-
istration records or surveys.
Pennacchiotti and Popescu (2011) and Rao et al. (2010) provide proofs of concept that
demonstrate that coarse categories of age, political orientation, ethnicity, and location
can be estimated by applying a variety of supervised machine-​learning algorithms to
user profiles, tweets, and social networks. Al Zamal, Liu, and Ruths (2012b) demonstrate
that users’ networks (i.e., whom they follow and their followers) can be particularly in-
formative about their age and gender. However, these studies often rely on small con-
venience samples of labeled users, and it is still an open question whether these methods
can scale up to the large samples researchers often work with.
One of the key variables in MRP applications has been party identification (Park,
Gelman, and Bafumi 2004; Lax and Phillips 2009). Thus it is extremely useful to be
able to infer ideological orientation and partisanship, in addition to gender, eth-
nicity, age, and geographic location. There are several promising approaches in this
direction. Barberá (2015) shows that Twitter users’ ideology can be accurately estimated
by observing what political actors they decide to follow. Other studies estimate polit-
ical ideology or partisan identification using different sources of information, such
as the structure of retweet interactions, follower networks, or similarity in word use
with respect to political elites (Boutet, Kim, and Yoneki 2013; Cohen and Ruths 2013;
Conover et al. 2011; Golbeck and Hansen 2011; Wong et al. 2013). One limitation of these
approaches is that ideology, as well as the other demographic variables, often cannot
be estimated for the entire sample of users, or at least with the same degree of accu-
racy, especially if they rely on usage of specific hashtags, which can vary significantly
across users.
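As a deliberately crude illustration of the intuition, and not of Barberá's (2015) Bayesian ideal point model, one could score users by the average of known ideology scores of the political accounts they follow; the elite accounts and scores below are hypothetical placeholders.

# A crude stand-in for network-based ideology estimation: average the known scores
# of the political accounts a user follows. Scores run from -1 (left) to 1 (right).
ELITE_IDEOLOGY = {"@left_leader": -1.0, "@left_pundit": -0.6, "@right_leader": 1.0, "@right_pundit": 0.7}

def crude_ideology(followed_accounts):
    """Mean ideology of the known political accounts a user follows; None if they follow none."""
    scores = [ELITE_IDEOLOGY[a] for a in followed_accounts if a in ELITE_IDEOLOGY]
    return sum(scores) / len(scores) if scores else None

print(crude_ideology(["@left_leader", "@left_pundit", "@sports_team"]))  # -0.8
print(crude_ideology(["@right_leader", "@left_leader"]))                 # 0.0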
An alternative solution to this problem is to augment Twitter data with demographic
information from other sources. For example, Bode and Dalrymple (2014) and Vaccari
et  al. (2013) conducted surveys of Twitter users by sampling and directly contacting
respondents through this platform, achieving relatively high response and completion
rates. By asking respondents to provide their Twitter user names, they were able to learn
key characteristics of a set of Twitter users directly from survey responses provided by
those users. Matching Twitter profiles with voting registration files, publicly available
in the United States, can also provide researchers with additional covariates, such as
party affiliation, gender, and age (see, e.g., Barberá, Jost, et al. 2015). The subset of users
for which this information is available could then be used as a training data set for a
supervised learning classifier that infers these sociodemographic characteristics for all
Twitter users.9 These matching approaches could also be conducted at the zipcode or
county level with census data to control for aggregate-​level income or education levels
(see, e.g., Eichstaedt et al. 2015).

Improving Aggregation
It is perhaps the last step—​aggregating from tweets to a measure of public opinion—​on
which most attention has been placed in previous studies. We now have a good under-
standing of the biases induced by how Twitter samples the data that will be made avail-
able through the API (Morstatter et al. 2013; González-​Bailón et al. 2014), the power-​law
distribution of users’ Twitter activity (Barberá and Rivero 2014; Wu et al. 2011), and the
fact that very few tweets contain enough information to locate their geographic origin
(Leetaru et al. 2013; Compton, Jurgens, and Allen 2014). Researchers need to be aware
of these limitations and address them in their analyses. For example, if the purpose of a
study is to measure public opinion about a topic, then the analysis should add weights
at the user level to control for different levels of participation in the conversation. When
such a solution is not possible, the study should include a discussion of the direction and
magnitude of the potential biases introduced by these limitations.
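A minimal sketch of such user-level weighting, assuming each record is a (user, sentiment) pair, is the following; averaging within users before averaging across them prevents the most active accounts from dominating the aggregate.

from collections import defaultdict

def tweet_level_mean(records):
    """Raw average sentiment over tweets; heavy tweeters dominate."""
    return sum(s for _, s in records) / len(records)

def user_level_mean(records):
    """Average sentiment per user first, then across users, so each user counts once."""
    by_user = defaultdict(list)
    for user, sentiment in records:
        by_user[user].append(sentiment)
    user_means = [sum(v) / len(v) for v in by_user.values()]
    return sum(user_means) / len(user_means)

# Hypothetical (user, sentiment) pairs: one very active negative user, several quieter positive ones.
records = [("u1", -1)] * 8 + [("u2", 1), ("u3", 1), ("u4", 1), ("u5", 1)]
print(tweet_level_mean(records))  # about -0.33, dominated by u1
print(user_level_mean(records))   # 0.6, each user weighted equally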
Finally, regardless of the approaches to aggregation, weighting, or opinion meas-
urement that we choose, an important step in any analysis should be the removal of
spam messages and accounts (or bots), which in some cases can represent a large share
of the data set (King, Pan, and Roberts 2016). One option is to apply simple filters
to remove users who are not active or exhibit suspicious behavior patterns. For ex-
ample, in their study of political communication on Twitter, Barberá, Jost, Nagler,
Tucker, and Bonneau (2015) only considered users who sent tweets related to at least
two different topics, which should filter out spam bots that “hijack” a specific trending
topic or hashtag (Thomas, Grier, and Paxson 2012). Ratkiewicz et  al. (2011) and
Castillo, Mendoza, and Poblete (2011) implemented more sophisticated methods that
rely on supervised learning to find accounts that are intentionally spreading misin-
formation. Their study shows that spam users often leave a distinct footprint, such
as a low number of connections to other users, high retweet count among a limited
set of strongly connected (and likely fake) users, and a string of very similar URLs
(e.g., differing only in mechanically created suffixes). It therefore appears both possible and worthwhile to invest more effort in preprocessing the data by removing the suspect content, or at least in inspecting the sensitivity of the results to
the presence of bot accounts.
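A simple version of such filtering might look like the sketch below, which keeps only users who tweet on at least two topics and whose links are not dominated by a single repeated URL; the thresholds and record format are hypothetical, and production systems use the richer supervised methods cited above.

from collections import defaultdict

def filter_suspected_bots(tweets, min_topics=2, max_url_repeat=0.8):
    """Keep users who tweet on at least `min_topics` distinct topics and whose tweets are not
    dominated by one exactly repeated URL (a crude proxy for the near-duplicate links that
    spam accounts tend to post). Thresholds are illustrative, not established standards."""
    topics_by_user = defaultdict(set)
    urls_by_user = defaultdict(list)
    for t in tweets:  # t: dict with 'user', 'topic', and optional 'url'
        topics_by_user[t["user"]].add(t["topic"])
        if t.get("url"):
            urls_by_user[t["user"]].append(t["url"])
    keep = set()
    for user, topics in topics_by_user.items():
        urls = urls_by_user[user]
        url_repeat = max((urls.count(u) for u in set(urls)), default=0) / len(urls) if urls else 0.0
        if len(topics) >= min_topics and url_repeat <= max_url_repeat:
            keep.add(user)
    return keep

# Hypothetical records: 'bot1' hijacks a single hashtag; 'u1' discusses two topics.
sample = [
    {"user": "bot1", "topic": "#debate", "url": "http://spam.example/x1"},
    {"user": "bot1", "topic": "#debate", "url": "http://spam.example/x2"},
    {"user": "u1", "topic": "#debate", "url": None},
    {"user": "u1", "topic": "#economy", "url": None},
]
print(filter_suspected_bots(sample))  # {'u1'}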

Validation
Once we have specified our data collection and aggregation strategies, our popula-
tion of interest and weighting strategies, and our opinion measurement methods,
it is essential to validate these purported measures against trusted ground truths,
or at least against previously established measures. The success of these approaches
must be examined relative to clear benchmarks, such as previous election results, ex-
isting surveys, public records, and manually labeled data (Metaxas, Mustafaraj, and
Gayo-​Avello 2011; Beauchamp 2016). This validation should be conducted with out-​of-​
sample data, ideally forward in time, and should be measured statistically, by computing the predictive accuracy. Depending on the application, other forms of validity should
be considered, such as convergent construct validity (the extent to which the measure
matches other measures of the same variable) or, in the case of topic-​specific measures,
semantic validity (the extent to which each topic has a coherent meaning).10
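The following sketch illustrates forward-in-time validation in its simplest form, computing the mean absolute error of hypothetical social media estimates against polling benchmarks for periods after a chosen cutoff date; the series and dates are invented for illustration.

def forward_validation(dates, social_estimates, poll_benchmarks, split_date):
    """Compare social-media-based estimates against polling benchmarks only for dates
    after `split_date`, mimicking out-of-sample, forward-in-time validation."""
    pairs = [(s, p) for d, s, p in zip(dates, social_estimates, poll_benchmarks) if d > split_date]
    errors = [abs(s - p) for s, p in pairs]
    return sum(errors) / len(errors)  # mean absolute error on the held-out period

# Hypothetical weekly series of estimated approval (social media) and polls (benchmark).
dates = ["2016-09-01", "2016-09-08", "2016-09-15", "2016-09-22"]
social = [0.44, 0.47, 0.50, 0.46]
polls = [0.45, 0.46, 0.47, 0.47]
print(forward_validation(dates, social, polls, split_date="2016-09-08"))  # 0.02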
Conversely, rather than engaging in demographics-​based weighting and topic/​sen-
timent estimation to predict public opinion, it may also be possible to reverse the val-
idation process and instead train machine-​learning models to sift through thousands
of raw features (such as word counts) to find those that directly correlate with vari-
ations in the quantities of interest (such as past polling measures of vote intention)
(Beauchamp 2016). In this way, one could potentially go directly from word counts
and other metadata (such as retweets, URLs, or network data) to opinion tracking
with no worry about demographics, topics, or sentiments—​although potentially at
the cost of interpretability and generalizability to other regions, times, and political
circumstances.
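In its simplest form, this amounts to regressing a polling series on raw term counts and predicting forward in time, as in the hedged sketch below (with hypothetical texts and poll values, and ridge regression standing in for the richer models used by Beauchamp 2016).

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

# Hypothetical inputs: pooled tweet text and a polling benchmark for each time period.
period_texts = [
    "jobs economy jobs taxes", "economy recession layoffs",
    "jobs hiring growth", "layoffs recession worries taxes",
]
poll_values = [0.52, 0.44, 0.55, 0.42]  # e.g., incumbent vote intention by period

vec = CountVectorizer().fit(period_texts[:3])  # learn the vocabulary on earlier periods only
model = Ridge(alpha=1.0).fit(vec.transform(period_texts[:3]), poll_values[:3])
print(model.predict(vec.transform(period_texts[3:])))  # forward prediction for the last period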
New Directions for Measuring Public Opinion: Going beyond Survey Replication

As we have said, each of the challenges to using social media data to measure public
opinion also reveals how social media can be taken well beyond existing survey
methods.
Weighting and demographics aside, the sheer size of social media data makes it theoretically possible to study subpopulations that could not be studied with traditional survey data, including those defined by demographic, geographic, or even temporal
characteristics (Aragón et al. 2016; Barberá, Wang, et al. 2015). Social media also en-
able us to measure opinion across national borders. While Twitter penetration varies in
different countries (Poblete et al. 2011), as long as we know something about the char-
acteristics of who in a country is on Twitter, we can try to generalize from tweets to a
measure of mass opinion in a country.
Because of their organic nature, social media data are generated continuously, and
thus we can track changes over time at very fine-​grained temporal units (e.g., Golder
and Macy 2011 track changes in mood across the world over the course of a single day). This
means we can aggregate the data by any temporal unit we choose, and simulate designed
data for tracking opinion change over time and for use in traditional time-series anal-
ysis. Moreover, because social media data come with individual identifiers, they also
constitute panel data. We (often) have repeated observations from the same informant.
This high frequency means that social media can reveal public opinion changes over
time about issues that are not polled very frequently by traditional surveys. Reasonably
dense time-​series survey data exist for some issues, such as presidential approval or con-
sumer sentiment, but social media data offer the opportunity to put together dense time
series of public opinion on a host of specific issues that are rarely or infrequently polled.11
And by taking advantage of information identifying characteristics of informants, those
time series could be evaluated for distinct subgroups of populations.
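With timestamps and user identifiers retained, this kind of aggregation is straightforward; the sketch below, using pandas and hypothetical records, builds both a daily opinion series and a user-by-day panel from the same tweet-level data.

import pandas as pd

# Hypothetical tweet-level records with a timestamp, a user id, and an estimated sentiment.
tweets = pd.DataFrame({
    "time": pd.to_datetime(["2016-10-01 09:00", "2016-10-01 21:30",
                            "2016-10-02 08:15", "2016-10-03 19:45"]),
    "user": ["u1", "u2", "u1", "u3"],
    "sentiment": [0.2, -0.4, 0.1, 0.5],
})

# Aggregate to any temporal unit we choose: here, a daily opinion series...
daily = tweets.set_index("time")["sentiment"].resample("D").mean()

# ...and, because user identifiers are retained, a user-by-day panel.
panel = tweets.set_index("time").groupby("user")["sentiment"].resample("D").mean()
print(daily)
print(panel)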
The temporal nature of social media also lets us observe the emergence of public
opinion. This is perhaps social media’s greatest strength; it does not depend on the ana-
lyst to ask a preconceived question. So while at one time almost no survey firm would have thought to invest in asking respondents whether they thought gay marriage or marijuana should be legal, by gathering sufficiently large collections of social media posts in
real time, it should be possible to observe when new issues emerge, and indeed to iden-
tify these newly emerging issues before we even know we should be looking for them.
While the free-​form nature of opinion revelation on social media can be a barrier to
measuring what everyone is thinking about an issue, it may give us a way to measure not
just sentiment, but intensity of sentiment via content and retweeting, as well as richer
measures of sentiment along as many potential dimensions as there are topics.12
Social media also allow us to measure not just mass opinion, but especially that of
political activists and other elites. Legislators, political parties, interest groups, world
leaders, and many other political elites tweet. And while these are public revelations
and not necessarily truthful, we do see what these actors choose to reveal. We are able
to see this contemporaneously with mass opinion revelation. And again, by taking ad-
vantage of the fine-​grained, temporal nature of social media data, we can observe how
elite opinion responds to mass opinion and vice versa (Barberá et  al. 2014; Franco,
Grimmer, and Lee 2016) and how both groups respond to exogenous events. While of
course we need to be sensitive to the fact that elites know that what they are posting
on Twitter or Facebook is intended for public consumption and thus may not reflect
genuine “opinion” in the sense that we are trying to measure mass opinion, the social
media record of elite expression may nevertheless prove extremely valuable for studying
both elite communication strategy generally and changes in the issues that elites are
emphasizing. Thus, while we may not know whether a particular politician genuinely
believes gun control laws need to be changed, social media can easily help us measure
whether that politician is emphasizing gun control more at time t than at time t-​1.
Social media also come with a natural set of contextual data about revealed
opinion: the social network of the individual informant (Larson et al. 2016). Measuring
this with survey questions is notoriously difficult because people have not proven ca-
pable of stating the size of their networks, much less providing information necessary
for contacting network members. Yet social media provide us not only with the size of
the social networks of informants, but also with a means to measure the opinions of network members. Traditional surveys have tried to measure network ties primarily by depending on self-response, which has proven to be unreliable.
Social media data also provide the potential to link online information directly to
opinion data. Many social media users reveal enough information about themselves to
make it possible to link them to public records such as voter files. Social media data
can also be directly supplemented with survey data obtained by contacting social media
users directly through Twitter, via “replies” (as in Vaccari et al. 2013) or promoted tweets
targeted to a specific list of users. However, it remains challenging to construct matched
samples using these more direct methods that will be sufficiently large and representa-
tive or that can be reweighted to ensure representativeness.
Finally, social media significantly democratize the study of public opinion.
Researchers can potentially address novel research questions without having to field
their own surveys, which are often much more costly. Access to large volumes of so-
cial media is free and immediate, unlike many existing surveys that may be embargoed
or restricted. Moreover, this accessibility extends well beyond scholars of public
opinion: Anyone—​from campaigns selling political candidates or consumer goods, to
regimes trying to understand public wants—​can access social media data and see what
people are saying about any given issue with minimal expense. Even in a world in which
the numbers of surveys and surveyed questions are proliferating, social media poten-
tially offer a spectrum of topics and a temporal and geographic density that cannot be
matched by existing survey methods.
A Research Agenda for Public Opinion and Social Media

Making better use of social media for measuring public opinion requires making prog-
ress on multiple fronts. Perhaps the issue that remains the most theoretically challenging
is the measurement of topic and sentiment: the “question” and “response.” The unstruc-
tured text is what is most unusual about social media (it is not an answer to a question specified by the researcher but rather free-form writing), and attempts to transform it into the traditional lingua franca of opinion research (answers to specific questions) remain an open problem. We may even eventually discover that this approach is ob-
solete, as we move entirely beyond the constraints imposed by traditional surveys into
the naturally high-​dimensional world of free-​form text. The tremendous opportunity
presented by social media data makes the payoff for solving such problems worth the
investment. Social media data can give us measures of public opinion at a geographic and temporal scale, and across a breadth of subjects, vastly beyond anything we can measure by other means.
There are of course major mistakes that can be made when analyzing social media
data. One advantage we have pointed out about social media is that they democratize
measuring public opinion: anyone can do it. That means anyone can do it badly. And
the endemic lack of ground truths can make it difficult to know when a measure
is a bad measure. Reputable scholars or organizations reporting measures based on tra-
ditional polls have adopted standardized practices to increase the transparency of their
reporting (items such as sample size and response rates). We thus conclude with some
obvious standards in reporting that social media-based measures of opinion should ad-
here to in order to at least guarantee a minimum amount of transparency and allow
readers, or users of the measures created, to better evaluate the measures. We describe
and list standards with respect to analyzing data from Twitter, but these can be easily
applied to other sources of social media with appropriate modifications. The points
apply generally across organic public opinion data.
First, researchers need to be clear about the technical means of gathering data. Data
gathered in real time through a rate-​limited means could be incomplete, or differ in un-
predictable ways from data purchased after the fact from an archive or collected through the Firehose. Second, researchers need to very clearly report the limitations placed on the sampling frame. Data on social media can be gathered based on user identification or on content of text, and can be further filtered based on other metadata of users or individual tweets (such as language or time of day).
Third, researchers need to explain whether data were gathered based on the content of the text, the sender of the text, or some contextual information (such as time or place of tweet).
Fourth, researchers need to very precisely describe the criteria for inclusion in their
sample, and how those criteria were arrived at. If a sample is based on keywords,
researchers need to describe how the keywords were selected. If someone claims to be
measuring opinion about gun control, they could state: “We collected all tweets about gun control.” But this claim could not be evaluated unless the full set of keywords
used to gather tweets is provided. One could collect all tweets containing the expression
“gun control,” but that would omit many tweets using assorted relevant hashtags and
phrases. Or, if an analyst were to try to measure public opinion about Barack Obama with all tweets containing “Barack Obama,” the analyst would miss any tweets that are about Barack Obama but do not use his name. If the collection thereby omitted all tweets that contain “Barack Hussein Obama” rather than the exact phrase “Barack Obama,” this could obviously cause significant measurement error. Thus precise description of what can be included in the
corpus of text is important.
If the corpus is collected with a set of keywords, the analyst should explain how the set
was generated. Did the analyst assume omniscience and create the list of keywords, or was it generated using some algorithm proceeding from a set of core terms and finding co-occurring terms? And no matter how the keywords were chosen, or how the topic was
collected, the analyst should describe any testing done to confirm that the corpus was in
fact about the chosen topic.
Researchers also need to note whether filters or constraints were imposed on the
collection that would exclude some tweets based on language or geography. If we are
only measuring opinions expressed in English, that is important. Similarly, if we filter
out all tweets that cannot be identified as being from a particular geographic region,
that is important. And researchers need to clearly explain how any such constraints were
implemented. If a language filter was used, precisely what was the filter? If a geographic
constraint was imposed, what was it? Were only geocoded tweets considered? Or were
metadata about the source of the tweet considered, and if so, how?
If tweets are aggregated by topic, the analyst must explain the aggregation method used and how tweets were assigned to topics, whether by topic modeling or by human assignment. If by topic modeling, the analyst should provide in-
formation on how robust the results are to variations in the number of topics selected.
Information about the criteria for tweets being assigned to topics (such as top terms
from linear discriminant analysis) is essential. And the analyst should indicate whether
the topics were validated against human judgments.
If a collection of tweets claiming to measure public opinion excludes tweets by some
individuals, that information is crucial and needs to be provided. Such exclusions could
be based on the individual's frequency of social media use, or on whether he or she is part of a set of individuals following particular political (or nonpolitical) actors. Such ex-
clusion could also be based on the individual’s characteristics, such as use of language,
demographic characteristics, or political characteristics. And as such characteristics are
often estimated or inferred from metadata, the analyst must be precise and transparent
about how characteristics for individuals are inferred. Precise data on who is eligible to
be included in the data set are essential for any attempt to draw a population inference.
If a measure of sentiment is given, the analyst must carefully explain how senti-
ment was calculated. If a dictionary was used, the analyst should explain how robust
the measure was to variations in the dictionary—​or across dictionaries. And the analyst
should explain whether sentiment measures were validated in any way against human
judgments.
Following is a set of guidelines:

1. Describe the source of data for the corpus.
(a) Were data taken from the Twitter Firehose, or from one of the rate-limited
sources?
(b) Were data retrieved using the REST API or the streaming API?
2. Describe whether the criteria for inclusion in the corpus were based on the text,
the sender of the text, or contextual information (such as the place or time of the
tweet).
3. For corpora collected based on keywords or regular expressions in the text, de-
scribe the criteria for inclusion.
(a) What were the criteria by which the keywords or regular expressions were
selected?
(b) Were keywords or regular expressions chosen based on
• the analyst’s expertise or prior beliefs?
• an extant document?
• an algorithm used to generate keywords based on an initial seeding and
further processing of the text?
(c) Describe any testing done to confirm that tweets gathered were relevant for
the intended topic for which the keywords, or regular expressions, were used.
4. For corpora collected using topics as the criterion for inclusion:
(a) Describe how individual documents were determined to be relevant to the
chosen topic (i.e., what were the technical requirements for inclusion in the
corpus?).
(b) Describe any testing done to estimate, or determine exactly, the number of
documents in the corpora that were germane to their assigned topic(s).
5. If the content of tweets was aggregated by topic, after some selection criteria into
the corpus, how were topics generated?
(a) Were topics hand-​coded?
(b) Was some form of automated topic generation method used?
(c) How robust are the topics to variations in the sample or the number of topics
selected?
(d) Was inclusion of text into topics validated against human judgments, and if
so, how?
6. Describe any limitations placed on the sampling frame that could limit whose
opinions could be in the data. State any limitations of the sampling frame based on
metadata provided by informants, either directly provided in metadata or inferred
from metadata. If characteristics were inferred, explain the procedure for infer-
ence. This would include:
(a) Exclusions based on language of the informant,
(b) Exclusions based on geography of the informant, and
(c) Exclusions based on gender, age, or other demographic characteristics of the
informant.
7. For corpora in which selection was based on the sender, how were the senders
chosen?
8. Describe any constraints (or filters) imposed on the collection that would exclude
some tweets from being included based on characteristics of the tweet, such as
constraints on geography or language.
(a) Describe whether any geographic constraints were based on geocoding or on
an algorithm used to infer geography from metadata.
(b) Describe how language was determined if language constraints were
imposed.
9. If a sentiment measure, or a related measure, was applied, describe how the
measure was calculated.
(a) If a dictionary was used, describe how robust the measure was to variations
in the dictionary or across dictionaries.
(b) Were the sentiment measures validated against human judgments?
(c) What information, other than the text of the tweet, such as linked content
or images, characteristics of the sender, or context, was used to determine
sentiment?
10. Describe the aggregation method used to generate the quantity of interest.
(a) Describe precisely the temporal units used, including reference to time zone.
(b) Describe how retweets are treated.

For anyone interested in studying public opinion, it would be foolish to ignore the in-
formation about public opinion revealed by social media data. However, it would also
be foolish to treat measurement of social media data in the same manner one treats a
well-​designed survey yielding something approximating a random sample of a popu-
lation of interest. We have listed many of the reasons that this is not a viable strategy.
Either one accepts that one has a nonrepresentative opt-​in sample, which may or may
not be a useful sample for some goal other than measuring mass public opinion, or one
attempts to weight the sample. We think continued work on studying public opinion via
social media is a fruitful endeavor. And we urge scholars and practitioners to both work
on improving our ability to measure mass public opinion via social media and to follow
solid guidelines for reporting results obtained via social media.

Notes
1. While the response rates for the “gold standard” surveys such as the General Social
Survey, the American National Election Study, and the National Household Education
Survey are higher, they too have been falling off markedly (Brick and Williams 2013;
Hillygus 2011).
2. See, for example, http://www.businesswire.com/news/home/20150402005790/en#.VR2B1JOPoyS.
3. The newspaper industry, a major source of public opinion polls, shrank 43% from 2000
to 2012 (see http://​www.stateofthemedia.org/​2012/​overview-​4/​). The declining public
support to higher education due to the financial crisis of 2008–​2009 led to the closing
of some university-​based survey research centers (Keeter 2012), and there has been
increasing political pressure to defund such initiatives as the American Community
Survey and the Economic Census. Overall interest in polls, however, has only grown, with
the total number of active pollsters (with at least ten polls per campaign) having risen since
2000: the count has increased from approximately ten to twenty pollsters per presidential campaign over the last two decades and from approximately five to ten pollsters
for midterm elections (based on our analysis of data from http://​projects.fivethirtyeight.
com/​pollster-​ratings/​).
4. The Twitter archive is of course dwarfed by the Facebook archive, but this is not yet avail-
able to the public. And to be clear, by “available” we mean available for purchase; collecting
relatively large amounts of Twitter data is free in real time, but it is not free to retrieve
tweets with a broad backward-​looking search.
5. It also raises all sorts of new questions for social scientists, who will find themselves in the
future wanting to work with huge private companies, such as Facebook or Twitter, much
in the way that natural scientists have had to learn how to work with big pharma. Although
this discussion is beyond the scope of this article, this too will likely pose all sorts of new
challenges for researchers, the likes of which we have previously rarely encountered.
6. Note that all of the studies cited here are country specific; we cannot really make these
claims about the global set of Twitter users.
7. Such concerns could be particularly pernicious if politicians are buying bots precisely for
the purpose of manipulating measures of public opinion. Although we do not yet have ev-
idence of this occurring, it does not seem to be a large leap to imagine politicians moving
from simply buying followers to buying accounts that will deliver positive sentiment about
themselves (or negative sentiment about opponents) in an attempt to manipulate reports
in the media about online popularity.
8. See below for a discussion of working with a randomly chosen set of users.
9. As an additional challenge, social media users and their demographic distributions are
presumably constantly evolving, so these models will have to be frequently updated to
keep up with this rapidly shifting landscape.
10. See Quinn et al. (2010) for a more extensive discussion of different types of validity.
11. For example, such topics might include intermittently polled issues in the United States,
like gun control or immigration; government approval measures in less well-​polled
nations; public opinion about specific foreign or domestic policies (e.g., Syria or the
Affordable Care Act) or factual questions (e.g., climate change or genetically modified
organisms); and more local issues, such as opinion on the policies or services in specific
cities.
12. In addition to issues with representativeness, the public nature of social media means
that these sentiments are presumably also affected by social desirability bias. It may be
that in these more polarized times, mean sentiment will remain representative even
as both sides are driven to extremes by social pressures, but it will nevertheless be im-
portant to measure and correct for these effects using existing polling measures as
ground-​truth tests.
References
AAPOR. 2010. “AAPOR Report on Online Panels.” http://​poq.oxfordjournals.org/​content/​
early/​2010/​10/​19/​poq.nfq048.full.pdf?ijkey=0w3WetMtGItMuXs&keytype=ref.
Abraham, K. G., S. Helms, and S. Presser. 2009. “How Social Processes Distort
Measurement: The Impact of Survey Nonresponse on Estimates of Volunteer Work in the
United States.” American Journal of Sociology 114 (4): 1129–​1165.
Al Zamal, F., W. Liu, and D. Ruths. 2012a. “Homophily and Latent Attribute Inference: Inferring
Latent Attributes of Twitter Users from Neighbors.” In Proceedings of the Sixth International
AAAI Conference on Weblogs and Social Media, 387–​ 390, AAAI Press, Palo Alto,
California.
Ampofo, L., N. Anstead, and B. O’Loughlin. 2011. “Trust, Confidence, and Credibility: Citizen
Responses on Twitter to Opinion Polls During the 2010 UK general Election.” Information,
Communication & Society 14 (6): 850–​871.
Aragón, P., Y. Volkovich, D. Laniado, and A. Kaltenbrunner. 2016. “When a Movement
Becomes a Party: Computational Assessment of New Forms of Political Organization in
Social Media.” Proceedings of the Tenth International AAAI Conference on Weblogs and
Social Media, 12–​21, AAAI Press, Palo Alto, California.
Barberá, P. 2015. “Birds of the Same Feather Tweet Together: Bayesian Ideal Point Estimation
Using Twitter Data.” Political Analysis 23 (1): 76–​91.
Barberá, P., R. Bonneau, P. Egan, J. T. Jost, J. Nagler, and J. Tucker. 2014. “Leaders or Followers?
Measuring Political Responsiveness in the US Congress using Social Media Data.” Paper
presented at the 110th American Political Science Association Annual Meeting.
Barberá, P., J. T. Jost, J. Nagler, J. Tucker, and R. Bonneau. 2015. “Tweeting from Left to Right: Is
Online Political Communication More Than an Echo Chamber?” Psychological Science 26
(10): 1531–​1542.
Barberá, P., and G. Rivero. 2014. “Understanding the Political Representativeness of Twitter
Users.” Social Science Computer Review 33 (6): 712–​729.
Barberá, P., N. Wang, R. Bonneau, J. T. Jost, J. Nagler, J. A. Tucker, and S. González-Bailón. 2015. “The
Critical Periphery in the Growth of Social Protests.” PloS one 10 (11): e0143611.
Barracuda Labs. 2012. “The Twitter Underground Economy: A Blooming Business.” Internet
security blog. https://​www.barracuda.com/​blogs/​labsblog?bid=2989.
Beauchamp, Nicholas. 2016. “Predicting and Interpolating State-​level Polls using Twitter
Textual Data.” American Journal of Political Science 61 (2): 490–​503.
Bermingham, A., and A. F. Smeaton. 2011. “On Using Twitter to Monitor Political Sentiment
and Predict Election Results.” In Sentiment Analysis: Where AI Meets Psychology (SAAIP)
Workshop at the International Joint Conference for Natural Language Processing, http://​
doras.dcu.ie/​16670/​.
Bode, L., and K. E. Dalrymple. 2014. “Politics in 140 Characters or Less:  Campaign
Communication, Network Interaction, and Political Participation on Twitter.” Journal of
Political Marketing 15(4): 311–​332.
Boutet, A., H. Kim, and E. Yoneki. 2013. “What’s in Twitter:  I Know What Parties Are
Popular and Who You Are Supporting Now!” Social Network Analysis and Mining 3
(4): 1379–​1391.
Brick, J. M., and D. Williams. 2013. “Explaining Rising Nonresponse Rates in Cross-​
sectional Surveys.” Annals of the American Academy of Political and Social Science 645
(1): 36–​59.
Bruns, A., K. Weller, M. Zimmer, and N. J. Proferes. 2014. “A Topology of Twitter
Research: Disciplines, Methods, and Ethics.” Aslib Journal of Information Management 66
(3): 250–​261.
Castillo, C., M. Mendoza, and B. Poblete. 2011. “Information Credibility on Twitter.” In
Proceedings of the 20th International Conference on World Wide Web, Association for
Computing Machinery, New York, NY, 675–​684.
Ceron, A., L. Curini, and S. M. Iacus. 2015. “Using Sentiment Analysis to Monitor Electoral
Campaigns Method Matters—​Evidence from the United States and Italy.” Social Science
Computer Review 33 (1): 3–​20.
Ceron, A., L. Curini, S. M. Iacus, and G. Porro. 2014. “Every Tweet Counts? How Sentiment
Analysis of Social Media Can Improve Our Knowledge of Citizens’ Political Preferences
with an Application to Italy and France.” New Media & Society 16 (2): 340–​358.
Choy, M., M. Cheong, M. N. Laik, and K. P. Shung. 2012. “US Presidential Election 2012
Prediction using Census Corrected Twitter Model.” https://​arxiv.org/​abs/​1211.0938.
Choy, M., M. L.  F. Cheong, M. N. Laik, and K. P. Shung. 2011. “A Sentiment Analysis of
Singapore Presidential Election 2011 Using Twitter Data with Census Correction.” https://
arxiv.org/​abs/​1108.5520.
Cohen, R., and D. Ruths. 2013. “Classifying Political Orientation on Twitter: It’s Not Easy!”
Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media,
91–​99, AAAI Press, Palo Alto, California.
Compton, R., D. Jurgens, and D. Allen. 2014. “Geotagging One Hundred Million Twitter
Accounts with Total Variation Minimization.” In 2014 IEEE International Conference on
Big Data (Big Data), 393–​401, http://​ieeexplore.ieee.org/​abstract/​document/​7004256/​
?reload=true.
Conover, M. D., B. Gonçalves, J. Ratkiewicz, A. Flammini, and F. Menczer. 2011. “Predicting
the Political Alignment of Twitter Users.” In Privacy, Security, Risk and Trust (PASSAT), and
2011 IEEE Third International Conference on Social Computing (SocialCom), 192–​199, http://​
ieeexplore.ieee.org/​document/​6113114/​.
De Leeuw, E., and W. De Heer. 2002. “Trends in Household Survey Nonresponse:  A
Longitudinal and International Comparison.” In Survey Nonresponse, edited by R. M.
Groves, D. A. Dillman, J. L. Eltinge, and R. J.  A. Little, 41–​54. New  York:  John Wiley
& Sons.
Diaz, F., M. Gamon, J. Hofman, E. Kiciman, and D. Rothschild. 2014. “Online and Social Media
Data as a Flawed Continuous Panel Survey.” Working Paper, Microsoft Research.
Duggan, M., and J. Brenner. 2015. The Demographics of Social Media Users, 2014. Pew Research
Center’s Internet & American Life Project, vol. 14. Washington, DC: Pew Research Center.
Eichstaedt, J. C., H. A. Schwartz, M. L. Kern, G. Park, D. R. Labarthe, R. M. Merchant, . . . M.
Sap. 2015. “Psychological Language on Twitter Predicts County-​ level Heart Disease
Mortality.” Psychological Science 26 (2): 159–​169.
Fang, A., I. Ounis, P. Habel, and C. Macdonald. 2015. “Topic-​centric Classification of Twitter
User’s Political Orientation.” In Proceedings of the 38th International ACM SIGIR
Conference on Research and Development in Information Retrieval, Association for
Computing Machinery, New York, NY, 791–​794.
Farrell, H., and D. W. Drezner. 2008. “The Power and Politics of Blogs.” Public Choice 134
(1–​2): 15–​30.
Flicker, S., D. Haans, and H. Skinner. 2004. “Ethical Dilemmas in Research on Internet
Communities.” Qualitative Health Research 14 (1): 124–​134.
Franco, A., J. Grimmer, and M. Lee. 2016. “Changing the Subject to Build an Audience: How
Elected Officials Affect Constituent Communication.” Unpublished Manuscript.
Gayo-​ Avello, D. 2011. “Don’t Turn Social Media into Another ‘Literary Digest’ Poll.”
Communications of the ACM 54 (10): 121–​128.
Golbeck, J., and D. Hansen. 2011. “Computing Political Preference among Twitter Followers.” In
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Association
for Computing Machinery, New York, NY, 1105–​1108.
Golder, S. A., and M. W. Macy. 2011. “Diurnal and Seasonal Mood Vary with Work, Sleep, and
Daylength across Diverse Cultures.” Science 333 (6051): 1878–​1881.
González-​ Bailón, S., and G. Paltoglou. 2015. “Signals of Public Opinion in Online
Communication: A Comparison of Methods and Data Sources.” Annals of the American
Academy of Political and Social Science 659 (1): 95–​107.
González-​Bailón, S., N. Wang, A. Rivero, J. Borge-​Holthoefer, and Y. Moreno. 2014. “Assessing
the Bias in Samples of Large Online Networks.” Social Networks 38: 16–​27.
Groves, R. M. 2006. “Nonresponse Rates and Nonresponse Bias in Household Surveys.” Public
Opinion Quarterly 70 (5): 646–​675.
Groves, R. 2011. “ ‘Designed Data’ and ‘Organic Data’.” http://​directorsblog.blogs.census.gov/​
2011/​05/​31/​designed-​data-​and-​organic-​data/​.
Groves, R. M., and E. Peytcheva. 2008. “The Impact of Nonresponse Rates on Nonresponse
Bias: a Meta-​analysis.” Public Opinion Quarterly 72 (2): 167–​189.
Groves, R. M., F. J. Fowler, Jr., M. P. Couper, J. M. Lepkowski, E. Singer, and R. Tourangeau.
2011. Survey Methodology. New York: John Wiley & Sons.
Gruzd, A., and C. Haythornthwaite. 2013. “Enabling Community Through Social Media.”
Journal of Medical Internet Research 15 (10), https://​www.ncbi.nlm.nih.gov/​pmc/​articles/​
PMC3842435/​.
Hampton, K., L. Sessions Goulet, L. Rainie, and K. Purcell. 2011. “Social Networking Sites and
Our Lives.” Pew Internet & American Life Project Report, http://​www.pewinternet.org/​
2011/​06/​16/​social-​networking-​sites-​and-​our-​lives/​.
He, R., and D. Rothschild. 2014. “Who Are People Talking about on Twitter?” Working Paper,
Microsoft Research.
Hecht, B., L. Hong, B. Suh, and E. H. Chi. 2011. “Tweets from Justin Bieber’s Heart:  The
Dynamics of the Location Field in User Profiles.” In Proceedings of the SIGCHI Conference
on Human Factors in Computing Systems, Association for Computing Machinery, New York,
NY, 237–​246.
Hillygus, D. S. 2011. “The Practice of Survey Research:  Changes and Challenges.” In New
Directions in Public Opinion, edited by A. Berinsky. Routledge Press, New York, NY.
Joinson, A. 1999. “Social Desirability, Anonymity, and Internet-​ based Questionnaires.”
Behavior Research Methods, Instruments, & Computers 31 (3): 433–​438.
Jungherr, A. 2014. “Twitter in Politics: A Comprehensive Literature Review.” https://​papers.
ssrn.com/​sol3/​Papers.cfm?abstract_​id=2402443.
Jungherr, A., P. Jürgens, and H. Schoen. 2012. “Why the Pirate Party Won the German Election
of 2009 or the Trouble with Predictions:  A Response to Tumasjan, A., Sprenger, T.  O.,
Sander, P. G., & Welpe, I. M. ‘Predicting Elections with Twitter: What 140 Characters Reveal
about Political Sentiment’.” Social Science Computer Review 30 (2): 229–​234.
Jungherr, A., H. Schoen, O. Posegga, and P. Jürgens. 2016. “Digital Trace Data in the Study of
Public Opinion: An Indicator of Attention Toward Politics Rather Than Political Support.”
Social Science Computer Review 35 (3): 336–​356.
Keeter, S. 2012. “Presidential Address: Survey Research, Its New Frontiers, and Democracy.”
Public Opinion Quarterly 76 (3): 600–​608.
King, G., J. Pan, and M. E. Roberts. 2016. “How the Chinese Government Fabricates
Social Media Posts for Strategic Distraction, Not Engaged Argument.” Unpublished
Manuscript.
King, G., P. Lam, and M. Roberts. 2014. “Computer-​Assisted Keyword and Document
Set Discovery from Unstructured Text.” https://​gking.harvard.edu/​publications/​
computer-​assisted-​keyword-​and-​document-​set-​discovery-​fromunstructured-​text.
Kreiss, D. 2014. “Seizing the Moment: The Presidential Campaigns’ Use of Twitter During the
2012 Electoral Cycle.” New Media & Society 18 (8): 1473–​1490.
Kwak, H., C. Lee, H. Park, and S. Moon. 2010. “What Is Twitter, a Social Network or a News
Media?” In Proceedings of the 19th International Conference on World Wide Web, Association
for Computing Machinery, New York, NY, 591–​600.
Larson, J., J. Nagler, J. Ronen, and J. A Tucker. 2016. “Social Networks and Protest
Participation: Evidence from 93 Million Twitter Users.” SSRN, https://​papers.ssrn.com/​sol3/​
papers.cfm?abstract_​id=2796391.
Lax, J. R., and J. H. Phillips. 2009. “How Should We Estimate Public Opinion in the States?”
American Journal of Political Science 53 (1): 107–​121.
Lazer, D., R. Kennedy, G. King, and A. Vespignani. 2014. “The Parable of Google Flu: Traps in
Big Data Analysis.” Science 343 (March 14): 1203–​1205.
Leetaru, K., S. Wang, G. Cao, A. Padmanabhan, and E. Shook. 2013. “Mapping
the Global Twitter Heartbeat:  The Geography of Twitter.” First Monday 18 (5),
http://firstmonday.org/article/view/4366/3654.
Lin, Y.-​R., D. Margolin, B. Keegan, and D. Lazer. 2013. “Voices of Victory: A Computational
Focus Group Framework for Tracking Opinion Shift in Real Time.” In Proceedings of the
22nd International Conference on World Wide Web, Association for Computing Machinery,
New York, NY, 737–​748.
Little, R. J.  A. 1993. “Post-​Stratification:  A Modeler’s Perspective.” Journal of the American
Statistical Association 88 (423): 1001–​1012.
Malik, M. M., H. Lamba, C. Nakos, and J. Pfeffer. 2015. “Population Bias in Geotagged Tweets.”
In Ninth International AAAI Conference on Weblogs and Social Media, 18–​27, AAAI Press,
Palo Alto, California.
Marwick, A. E., and D. Boyd. 2011. “I Tweet Honestly, I Tweet Passionately: Twitter Users,
Context collapse, and the Imagined Audience.” New Media & Society 13 (1): 114–​133.
Metaxas, P. T., E. Mustafaraj, and D. Gayo-​Avello. 2011. “How (Not) to Predict Elections.” In
Privacy, Security, Risk and Trust (PASSAT), and 2011 IEEE Third International Conference on
Social Computing (SocialCom), Institute of Electrical and Electronics Engineers, Piscataway,
NJ, 165–​171.
Metzger, M., R. Bonneau, J. Nagler, and J. A. Tucker. 2016. “Tweeting Identity? Ukranian,
Russian, and #Euromaidan.” Journal of Comparative Economics 44 (1): 16–​50.
Mislove, A., S. Lehmann, Y.-​Y. Ahn, J.-​P. Onnela, and J. N. Rosenquist. 2011. “Understanding
the Demographics of Twitter Users.” ICWSM 11 (5).
Mocanu, D., A. Baronchelli, N. Perra, B. Gonçalves, Q. Zhang, and A. Vespignani. 2013. “The
Twitter of Babel: Mapping World Languages Through Microblogging Platforms.” PloS One
8 (4): e61981.
Mokrzycki, M., S. Keeter, and C. Kennedy. 2009. “Cell-​phone-​only Voters in the 2008 Exit Poll
and Implications for Future Noncoverage Bias.” Public Opinion Quarterly 73 (5): 845–​865.
Morris, M. R., S. Counts, A. Roseway, A. Hoff, and J. Schwarz. 2012. “Tweeting Is Believing?
Understanding Microblog Credibility Perceptions.” In Proceedings of the ACM 2012
Conference on Computer Supported Cooperative Work, Association for Computing
Machinery, New York, NY, 441–​450.
Morstatter, F., J. Pfeffer, H. Liu, and K. M. Carley. 2013. “Is the Sample Good Enough?
Comparing Data from Twitter’s Streaming API with Twitter’s Firehose.” In ICWSM.
Munger, K. 2015. “Elites Tweet to Get Feet Off the Streets: Measuring Elite Reaction to Protest
Using Social Media.” Working paper, New York University.
Mustafaraj, E., and P. Metaxas. 2010. “From Obscurity to Prominence in Minutes: Political
Speech and Real-​Time Search.” Paper presented at WebSci10: Extending the Frontiers of
Society On-​Line, April 26–​27, Raleigh, NC.
Mustafaraj, E., S. Finn, C. Whitlock, and P. T. Metaxas. 2011. “Vocal Minority Versus Silent
Majority: Discovering the Opionions of the Long Tail.” In Privacy, Security, Risk and Trust
(PASSAT), and 2011 IEEE Third International Conference on Social Computing (SocialCom),
Institute of Electrical and Electronics Engineers, Piscataway, NJ, 103–​110.
Neuman, W. R., L. Guggenheim, S. M. Jang, and S. Y. Bae. 2014. “The Dynamics of Public
Attention:  Agenda-​ Setting Theory Meets Big Data.” Journal of Communication 64
(2): 193–​214.
Newman, M. W., D. Lauterbach, S. A. Munson, P. Resnick, and M. E. Morris. 2011. “It’s Not
That I  Don’t Have Problems, I’m Just Not Putting Them on Facebook:  Challenges and
Opportunities in Using Online Social Networks for Health.” In Proceedings of the ACM
2011 Conference on Computer-​Supported Cooperative Work, Association for Computing
Machinery, New York, NY, 341–​350.
Nexgate. 2013. “2013 State of Social Media Spam.” Nexgate Report. http://​go.nexgate.com/​
nexgate-​social-​media-​spam-​research-​report.
O’Connor, B., R. Balasubramanyan, B. R. Routledge, and N. A. Smith. 2010. “From Tweets to
Polls: Linking Text Sentiment to Public Opinion Time Series.” ICWSM 11: 122–​129.
Park, D. K., A. Gelman, and J. Bafumi. 2004. “Bayesian Multilevel Estimation with
Poststratification:  State-​Level Estimates from National Polls.” Political Analysis 12
(4): 375–​385.
Pavalanathan, U., and M. De Choudhury. 2015. Identity Management and Mental Health
Discourse in Social Media.” In Proceedings of the 24th International Conference on World
Wide Web Companion, Association for Computing Machinery, New York, NY, 315–​321.
Pennacchiotti, M., and A.-​M. Popescu. 2011. “A Machine Learning Approach to Twitter User
Classification.” ICWSM 11: 281–​288.
Pew Research Center. 2012. “Assessing the Representativeness of Public Opinion Surveys.”
http://​w ww.people-​press.org/​2012/​05/​15/​assessing-​t he-​representativeness-​of-​public-​
opinion-​surveys/​.
Poblete, B., R. Garcia, M. Mendoza, and A. Jaimes. 2011. “Do All Birds Tweet the Same?
Characterizing Twitter Around the World.” In Proceedings of the 20th ACM International
Conference on Information and Knowledge Management, Association for Computing
Machinery, New York, NY, 1025–​1030.
Quinn, K. M., B. L. Monroe, M. Colaresi, M. H. Crespin, and D. R. Radev. 2010. “How to
Analyze Political Attention with Minimal Assumptions and Costs.” American Journal of
Political Science 54 (1): 209–​228.
Measuring Public Opinion with Social Media Data    581

Rao, D., D. Yarowsky, A. Shreevats, and M. Gupta. 2010. “Classifying Latent User
Attributes in Twitter.” In Proceedings of the 2nd International Workshop on Search and
Mining User-​G enerated Contents, Association for Computing Machinery, New York,
NY, 37–​4 4.
Ratkiewicz, J., M. Conover, M. Meiss, B. Gonçalves, A. Flammini, and F. Menczer. 2011.
“Detecting and Tracking Political Abuse in Social Media.” In ICWSM. 297–​304.
Richman, W. L., S. Kiesler, S. Weisband, and F. Drasgow. 1999. “A Meta-​analytic Study of
Social Desirability Distortion in Computer-​ administered Questionnaires, Traditional
Questionnaires, and Interviews.” Journal of Applied Psychology 84 (5): 754.
Sanovich, S. 2015. “Government Response Online:  New Classification with Application to
Russia.” Unpublished Manuscript, New York University.
Skoric, M., N. Poor, P. Achananuparp, E.-​P. Lim, and J. Jiang. 2012. “Tweets and Votes:  A
Study of the 2011 Singapore General Election.” In System Science (HICSS), 2012 45th Hawaii
International Conference on Systems Science. (HICSS-​45 2012). Institute of Electrical and
Electronics Engineers, Piscataway, NJ, 2583–​2591.
Solberg, L. B. 2010. “Data Mining on Facebook:  A Free Space for Researchers or an IRB
Nightmare?” Journal of Law, Technology and Policy 2: 311–​343.
Thomas, K., C. Grier, and V. Paxson. 2012. “Adapting Social Spam Infrastructure for Political
Censorship. In Proceedings of the 5th USENIX Conference on Large-​Scale Exploits and
Emergent Threats. USENIX Association, Berkeley, CA, 13–​13.
Tourangeau, R., R. M. Groves, and C. D. Redline. 2010. “Sensitive Topics and Reluctant
Respondents: Demonstrating a Link Between Nonresponse Bias and Measurement Error.”
Public Opinion Quarterly 74 (3): 413–​432.
Tucker, J. A., J. Nagler, M. M. Metzger, P. Barberá, D. Penfold-​Brown, and R. Bonneau. 2016.
“Big Data, Social Media, and Protest: Foundations for a Research Agenda.” In Computational
Social Science:  Discovery and Prediction, edited by R. M. Alvarez. Cambridge University
Press, New York, NY, 199–​224.
Tufekci, Z., and C. Wilson. 2012. “Social Media and the Decision to Participate in
Political Protest:  Observations from Tahrir Square.” Journal of Communication 62
(2): 363–​3 79.
Tumasjan, A., T. O. Sprenger, P. G. Sandner, and I. M Welpe. 2010. “Predicting Elections with
Twitter: What 140 Characters Reveal about Political Sentiment.” ICWSM 10: 178–​185.
Tuunainen, V. K., O. Pitkänen, and M. Hovi. 2009. “Users’ Awareness of Privacy on Online
Social Networking Sites—​Case Facebook.” In Proceedings of the 22nd Bled eConference,
eEnablement:  Facilitating an Open, Effective and Representative eSociety, Association for
Information Systems, Atlanta, GA, 42–​58.
Vaccari, C., A. Valeriani, P. Barberá, R. Bonneau, J. T. Jost, J. Nagler, and J. Tucker. 2013. “Social
Media and Political Communication:  A Survey of Twitter Users during the 2013 Italian
General Election.” Rivista Italiana di Scienza Politica 43 (3): 381–​410.
Veenstra, A., N. Iyer, N. Bansal, M. Hossain, and J. Park, 2014. “#Forward! Twitter as Citizen
Journalism in the Wisconsin Labor Protests.” Paper presented at the Annual Meeting of the
Association for Education in Journalism and Mass Communication, St. Louis, MO.
Wang, W., D. Rothschild, S. Goel, and A. Gelman. 2015. “Forecasting Elections with Non-​
Representative Polls.” International Journal of Forecasting, 31 (3): 980–​991.
Wong, F., M. Fai, C. W. Tan, S. Sen, and M. Chiang. 2013. “Quantifying Political Leaning from
Tweets and Retweets.” In Proceedings of the Seventh International AAAI Conference on
Weblogs and Social Media, AAAI Press, Palo Alto, California, 640–​649.
582    Marko Klašnja et al.

Wu, S., J. M. Hofman, W. A. Mason, and D. J. Watts. 2011. “Who Says What to Whom on
Twitter.” In Proceedings of the 20th international Conference on World Wide Web. Association
for Computing Machinery, New York, NY, 705–​7 14.
Zimmer, M. 2010. “ ‘But the Data Is Already Public’: On the Ethics of Research in Facebook.”
Ethics and Information Technology 12 (4): 313–​325.
Chapter 25

Expert Surveys as a Measurement Tool

Challenges and New Frontiers

Cherie D. Maestas

Introduction

Expert surveys are a valuable tool of measurement, because experts have specialized
knowledge that, when tapped, permits researchers to explore topics that might other-
wise be impossible to study in a systematic fashion. Consider, for example, the challenge
of studying the factors that enhance election quality across many countries when there
is no uniform global standard for reporting election conduct. Without a strategy for
collecting systematic data that are valid and reliable across countries, such a study would
prove impossible. The Electoral Integrity Project sought to address this type of problem
by surveying country experts about their views on forty-​nine specific indicators of elec-
tion quality in eleven categories (Norris, Martinez i Coma, and Gromping 2015). By
surveying more than 1,400 election experts about 127 national elections in 107 countries,
the researchers were able to create a set of standardized scores that could be compared
systematically across countries, thereby opening up new avenues for testing theories
about the factors that influence electoral integrity.1 In a similar vein, a new and ambi-
tious project, Varieties of Democracy (V-​Dem), seeks to enhance scholarship on de-
mocracy through creating globally comparable measures of seven core principles of
democracy represented by nineteen subcomponents and hundreds of indicators, a
number of which are measured using surveys of country experts (Coppedge et al. 2015).2
Perhaps one of the longest running examples of the use of expert surveys as a tool in
political science is the measurement of the ideological placement of European parties
on a left-​right scale.3 Numerous rounds of expert surveys have been conducted since
the 1980s to gauge both the general position of parties as well as their positions on spe-
cific issues (e.g., Bakker et al. 2012; Benoit and Laver 2006; Castles and Mair 1984; Huber
and Inglehart 1995; Rohrschneider and Whitefield 2009). One multi-​year study, the
Chapel Hill Expert Surveys (CHES), places party positions on policy issues and ideolog-
ical scales in twenty-​eight European Union (EU) countries, which permits both cross-​
sectional and over-​time comparisons (Bakker et al. 2012; Hooghe et al. 2010).
The term “expert,” in this case, refers to academic scholars with specialized knowl­
edge of one or more countries who can synthesize multiple sources of information when
locating individual parties on a policy or ideological scale (Hooghe et al. 2010). The use
of expert surveys offers an advantage over behavioral measures such as party-​member
roll call votes or documentary sources such as party manifestos, because such sources
may reflect strategic behavior of large parties and be sparse or nonexistent for small
parties. One advantage to using expert surveys rather than document-​based evidence
is that researchers can measure party positioning at any point in time rather than only at
release dates of specific documents (Bakker et al. 2012; Hooghe et al. 2010). For example,
the timing of document releases such as party platforms varies across countries, so
measures based on official documents often measure the same concept but at different
points in time for different countries.
Control over the timing and speed of the survey is an important feature of ex-
pert surveys. Scholars can tailor their data collection to fit temporal contexts rele-
vant for testing specific theories. In American politics, for example, many theories
about elections and candidate behavior rely on assumptions about incumbents’
prospects of winning reelection, prospects that must be measured well ahead of the
start of the campaign season and before challengers have emerged. No such measures
existed prior to the Candidate Emergence Study, which tapped the opinions of polit-
ical experts nested in U.S. House of Representatives districts prior to candidate filing
deadlines (Stone et al. 2010; Maestas, Buttice, and Stone 2014).4 In addition to meas-
uring incumbent prospects of winning, experts were also able to provide estimates of
the various strategic, personal, and performance qualities of incumbents. This permits
measurement of valence and policy positioning as well as forecasts of chances. Expert-​
based measures of prospects and valence are a valuable addition to the study of po-
litical competition and election outcomes, permitting researchers to test hypotheses
that had previously been untestable (Adams et al. 2011; Stone and Simas 2010; Stone
et al. 2010; Buttice and Stone 2012).
Other examples of the use of expert surveys include measuring democratic account-
ability (Kitschelt and Kselman 2013), democratic states’ foreign policy positions toward
Iran (Wagner and Onderco 2014), the positions of key political actors on the EU consti-
tution (Dorussen, Lenz, and Blavoukos 2005), and the ideological leanings of legislative
and bureaucratic institutions (Saiegh 2009; Clinton and Lewis 2008). Beyond political
science, researchers use experts to assess classroom interactions in education (Meyer,
Cash, and Mashburn 2011), gauge risk and uncertainty related to civil infrastructure
(Cooke and Goossens 2004), estimate species population in biology (Martin et al. 2012),
and create indexes of societal stressors (McCann 1998).
These examples highlight the many possible applications of expert surveys in research
designs that require data collection for difficult-​to-​measure phenomena. However,
only a few of these studies offer generalized guidelines for how to design, validate, or
report on measures based on expert surveys (but see Martinez i Coma and Van Ham
2015; Maestas, Buttice, and Stone 2014). This chapter adds to those studies by providing
an overview of the considerations that are important at various stages of expert-​based
measurement projects: study design, expert selection, the elicitation of opinions, and
the aggregation of expert observations into unit-​scores.

Design and Reporting When Expert Surveys Are Used as Tools of Measurement

Throughout this chapter, I use the terms target or target measure to refer to the theoret-
ical concept of interest to be measured by experts. I use the term target-​units to distin-
guish the units of analysis for the target measure (e.g., countries, institutions, processes,
actors, or events) from the units of analysis for the expert surveys (individual experts).
The terms experts, raters, and observers are used interchangeably to refer to individuals
providing descriptive or forecast information about target-​units. Expertise is defined by
the context of the target measure and units under study. Experts might be academics,
practitioners, political elite, managers, or any other individuals with specialized expe-
rience or knowledge. They may also be created by training individuals to provide first-
hand information about a target of interest, for example, election observers (Alvarez,
Atkeson and Hall 2013; Atkeson et al. 2014; Atkeson et al. 2015) or classroom interaction
observers (Meyer, Cash, and Mashburn 2011).

Design Considerations in Mapping Experts onto Target-Units
In some studies a single expert observation might serve as the only measure of the
target-​units of interest, but in most studies researchers combine multiple expert
observations of target-​units into a single score per unit to create the target measure.
Figure 25.1 illustrates several different designs that map experts to target-​units, each
with advantages and disadvantages in the types of errors likely to contaminate the target
measure.
Diagram A represents a case in which one expert provides information about one
target-​unit, and each unit has only a single expert evaluation. Few studies in political
science rely on only a single rater per unit for all units in the study; however, some rely
on a single rater per unit for a subset of units. This usually occurs when researchers
can only identify one person with relevant expertise for a unit, or only one respondent replies from among a small pool surveyed (see, e.g., Bailer 2004; Dorussen, Lenz, and Blavoukos 2005).

Figure 25.1  Designs for Expert Surveys. Panel A: a single expert rates a single target-unit; Panel B: multiple experts rate a single target-unit; Panel C: multiple experts rate multiple target-units.

In single-​rater designs, the expert survey responses are the target-​unit measures, and
the errors in the target-​unit scores reflect individual errors associated with survey re-
sponse. Expert opinions, like those of any survey respondent, are prone to both systematic and
random error. Decades of research in multiple fields of study find that individuals are
subject to cognitive and judgment biases when forming opinions (see Kahneman 2011;
Kunda 1990; Lodge and Taber 2013; Tetlock 2005). Experts might have incomplete infor-
mation about targets, leading them to guess incorrectly; they might interpret questions
differently; they might rely on heuristics to simplify complex information; or they might
adopt biased views based on their political perspectives (Budge 2000; Curini 2010;
Maestas, Buttice, and Stone 2014; Martinez i Coma and Van Ham 2015; Powell 1989;
Steenbergen and Marks 2007).
Because target-​unit measures from single-​rater designs are especially vulnerable to
the biases of individual raters, multiple-​rater designs, in which the errors of one rater
can offset errors from another, are considered substantially stronger (Boyer and Verma
2000; Maestas, Buttice, and Stone 2014; Philips 1981). The forecasting community has
long been aware that combining multiple forecasts into a “consensus forecast” allows
the judgment errors of individuals to offset one another, thereby improving the quality
of forecasts (Clemen 1989; Winkler and Clemen 2004). As McNeese (1992, 704–​705)
points out, this result stems from the properties of numbers: the mean square error of
a group mean is lower than the mean square error of any individual forecast from the
group members.
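This error-canceling property is easy to demonstrate with a short simulation. The sketch below is purely illustrative rather than a reanalysis of any study cited here: the number of target-units, the pool size, and the error variance are all hypothetical, and experts are assumed to observe a true score with independent, mean-zero error.

```python
import numpy as np

rng = np.random.default_rng(42)

n_units, n_raters = 200, 10                       # hypothetical units and experts per unit
true_scores = rng.uniform(0, 10, size=n_units)    # unobserved "true" target-unit values

# Each expert observes the truth plus independent, mean-zero noise.
ratings = true_scores[:, None] + rng.normal(0, 2.0, size=(n_units, n_raters))

mse_single = np.mean((ratings - true_scores[:, None]) ** 2)      # typical error of one rater
mse_pooled = np.mean((ratings.mean(axis=1) - true_scores) ** 2)  # error of the consensus mean

print(f"MSE of individual ratings: {mse_single:.2f}")
print(f"MSE of the pooled mean:    {mse_pooled:.2f}")   # roughly 1/n_raters as large
```

With independent errors, the mean squared error of the pooled score shrinks roughly in proportion to the number of raters, which is the intuition behind McNeese's point.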
Further, aggregating also reduces spurious correlations between two or more target
measures drawn from the same survey respondent that stem from common method var-
iance (CMV). Such CMV biases arise when individual respondents rate multiple survey
items about the target-​unit similarly high or similarly low due to exogenous factors like
mood, personal perceptual biases, context, or the like, producing spurious correlations
among survey responses drawn from the same expert (Podsakoff et al. 2003). The po-
tential for spurious correlation in studies that rely on a single expert is a considerable
threat to causal inference.5 For example, testing a hypothesis about the effects of candi-
date characteristics (i.e., quality) on election prospects using data from a single expert
might produce results that are due to the CMV if the expert is biased toward using only
the upper end of the survey scales (positivity bias) for the survey measures of candi-
date quality and prospects. In contrast, the same test using measures aggregated from
observations of multiple experts who vary in their partisan leanings is less subject to
spurious correlations due to CMV, because individual-​level survey biases are offset
through aggregation. Of course, the degree to which CMV is reduced depends on the
number and independence of raters. If multiple raters share similar perceptual or con-
textual biases, the errors will reinforce rather than cancel each other. Diagrams B and C
in Figure 25.1 highlight different types of multiple-​rater designs that might help mitigate
problems associated with CMV.
Diagram B in Figure 25.1 shows a “nested-​experts” design in which each unit is rated
by multiple experts, but experts’ ratings of units do not overlap. An example of a pure
form of this design can be found in the Candidate Emergence Study or the UC-​Davis
Congressional Election Study, in which experts residing in U.S. House districts pro-
vided information only about the characteristics of incumbents and challengers in their
districts (Maestas, Buttice, and Stone 2014; Stone et al. 2010). The strength of this type of
design is the ability to use “consensus” observations of each unit to comprise the target-​
unit measure, which helps to reduce the impact of individual-​level biases and increase
the reliability of the overall target-​unit measure. However, multiple raters per target-​
unit cannot guarantee an absence of systematic or random bias in the target measure.
Systematic bias must be identified and corrected to increase the validity of the target-​
unit measures. Although we are accustomed to thinking of random error as inconse-
quential because it affects the variance and not the mean, in practice, random errors
in very small pools of raters can produce target-​unit measures with invalid orderings
among target-​unit cases.
Martinez i Coma and Van Ham (2015, 306) highlight three specific areas that affect
the validity of expert-​based measures that potentially afflict this and other multi-​
rater designs: the nature of the concept being evaluated, heterogeneity among expert
evaluators, and the context in which the evaluation is made. Some concepts such as
corruption or democracy are inherently complex and thus open up room for experts
to insert their own interpretations when answering survey questions. Even for clearly
defined concepts like party placement, scholars have expressed concern that individual
raters use different underlying assumptions to judge parties (Budge 2000; Ray 2006).
Expert heterogeneity might matter in other important ways as well. Expert raters drawn
from the political sphere have been shown to exhibit in-​group bias in judging polit-
ical targets of interest (Stone et al. 2010). Finally, the context in which the evaluation is
made might alter the “yardstick” of measurement used by raters in judging the target of
interest. For example, different cultures have different norms or perceptions of concepts
like corruption or ideology, which leads to systematic differences in how experts apply similar scales (see the section below on the use of anchoring vignettes).
One weakness of the nested-​experts design lies in the fact that experts rating one
unit might interpret the survey questions differently than experts rating another unit.
This type of bias happens when experts A and B are embedded in a different contex-
tual environment from experts C and D and that context influences their perception
of the meaning of the scale. For example, in party placement, scholars have questioned
whether the scales used to place parties can really be considered comparable across
country contexts (Bakker et  al. 2014; McDonald, Mendes, and Kim 2007). Further,
some speculate that scale use might be tied to the number or diversity of parties rated on
the scale, something that varies across countries (Albright and Mair 2011). Systematic
errors that occur among observers within units and differ across units will not “cancel
out” with aggregation; hence they undermine the validity of the target-​unit measure
by calling into question both the cardinal value of the target-​unit scale and the ordinal
placement of units on that scale. The potential for this type of error highlights the im-
portance of paying careful attention to the design of survey questions, a topic addressed
in subsequent sections of this chapter.
The third design (diagram C in Figure 25.1) uses multiple raters to rate multiple
and overlapping targets. One example of this type of design is the Clinton and Lewis
(2008) study of agency ideology, which surveyed twenty-​six experts, asking them
to place eighty-​two federal agencies on the same scale. Such a measure is essential to
testing theories about inter-​institutional relations that require knowledge of the rela-
tive ideological placement of bureaucratic agencies by their bargaining partners, such as
legislatures or executives.
This type of design helps to reduce scaling problems such as the application of
context-​dependent “yardsticks,” since one rater applies the same yardstick to all units.
However, this type of design is still subject to errors that arise from individual hetero-
geneity in knowledge, scale application, and judgment. In a multiple-​rater single-​target
design, individual-​level biases only influence a single unit, but in a multi-​rater, multi-​
target design, individual-​level biases contaminate measurement across multiple target-​
units. In these cases, it is especially important to draw measures from larger pools of
raters, whose errors are likely to be offsetting (i.e., experts with diverse and independent
perspectives on the target).
Another weakness of this design is that asking experts to rate many different target-​
units may tax the limits of their expertise and lead to greater random error in individual
scores of target-​units. Thus, data for some target-​units may be more reliable than for
others. In the Clinton and Lewis study (2008), not all experts were familiar with all
eighty-​two agencies, so the number of raters per agency ranged from a low of four for
the Trade and Development Agency to twenty-​six for the Department of Defense. To
address this type of problem, designs that solicit information about multiple target-​units
sometimes include a “don’t know” option and encourage raters to only offer opinions for
those units with which they are most familiar (see Bakker et al. 2014).
Some studies use a mixed design, in which some raters provide ratings for multiple
targets so target-​units have some raters that overlap, but the raters who overlap differ
across target-​units. The V-​Dem survey specifically asks experts to code additional coun-
tries to provide “bridging” and “lateral” scores to enhance cross-​country comparability
of the data. Although experts are recruited for their expertise regarding a particular
country, they are also asked to provide coding for other countries over the full time pe-
riod of 1900–​2012 (called bridge coding) or for a single year (called lateral coding). Such
coding forces experts to compare across countries and provides data that can be used in
measurement models to help correct for cross-​country biases in scaling (Coppedge et al.
2015, 17).
Sometimes the use of a mixed design is unintentional and tied to the availability or
response patterns of experts. For example, Dorussen, Lenz, and Blavoukos (2005)
sought to identify multiple experts per country to report on country-​actor support for
the EU constitution, but in some cases they could only identify a single expert, so their
implemented design is a blend of diagrams A and B. In other cases, scholars uninten-
tionally end up with incomplete and variable mappings between targets and units due
to differences in item or unit response rates, something that is important to disclose.
Notably, many studies do not describe the design that the researchers intended to use
or the degree of overlap in respondent pools for different target-​units. Disclosing the
design intent and realized outcome is essential to assessing the nature of errors and the
quality of the resulting target measures.

Design Transparency and Reporting Guidelines


The design of expert surveys and the nature of target-​unit response patterns have
implications for the characteristics of measurement error within and across units, but
it is often not apparent from published descriptions which design was intended by
the researchers, who is included or excluded from the pool of experts, or how expert
responses and units are related. This type of information is essential for the assessment
of the quality of the measures; thus transparency in all aspects of the design is an essen-
tial part of creating a high quality, expert-​based study. Moreover, social sciences, and
political science in particular, have increasingly placed strong emphasis on the trans-
parency and reporting of procedures, including providing replication code and data
(Lupia and Elman 2014). Reporting standards are well-​established for public opinion
polls, but no such analog exists for expert surveys. At a minimum, studies that utilize
experts should provide readers with sufficient detail to replicate both the design and the
survey. When such information is too lengthy to include in published journal articles,
researchers should provide this information as online appendices or codebooks.
Many of the American Association for Public Opinion Research (AAPOR) “best
practices” for reporting public opinion surveys also apply to reporting the characteris-
tics of expert surveys, albeit with some variation to account for the differences inherent
in using experts as a measurement tool. For example, AAPOR recommends providing
“a definition of the universe of the population under study,” “a description of the sample
design,” a “description of sample selection procedures,” “a description of the mode of
data collection,” and “full accounting of the final outcome for all sample cases.”6 Unlike
public opinion surveys, expert surveys are rarely intended to serve as representative
samples of a well-​defined population or used to make inferences back to said population.
Instead, researchers attempt to define the universe of experts and make judgments about
the degree and type of expertise necessary to be considered part of the pool. However,
these differences do not negate the importance of explaining how experts were defined
and selected for the study or whether selection criteria varied across target-​units. Yet
many studies report scant details about the criteria for inclusion or exclusion or whether
these criteria vary by target-​unit. Kitschelt and Kselman (2013) report in a footnote that
they surveyed “more than 1400 political scientists, political sociologists, and political
journalists from 88 countries” but give no details about the characteristics or size of ex-
pert pools by country or the response rates per country from the different categories of
experts. Only a few studies, such as Ray (1999), provide specific details of the sources for
the list of experts for each target-​unit and the procedures for supplementing the orig-
inal list in units that had too few potential experts. Such information is vital to assessing
whether differences in expert respondents across target-​units create measurement error
for some units in the target measure and should be reported as a matter of course in
expert-​based surveys.
Published studies also vary in how much detail they provide about target-​unit re-
sponse rates, and this information is essential for assessing the quality of the target
measures built from expert observations. Variability in item and survey response rates
across target-​units is not surprising, but it should be reported because the number of
experts rating each target and the response rates may be tied to systematic factors that
correlate with the target measure of interest. Others who incorporate these measures
into their research may need to consider excluding cases built on only a small number of
raters, but cannot do so without information on the number of raters per item for each
target-​unit.
Some studies set a threshold for a minimum number of raters and thus eliminate
some target-​units that lack sufficient responses (e.g., Huber and Inglehart 1995; Ray
1999), while others opt to rely on small pools or even single respondents to maximize the
number of units included in the target measure (e.g., Dorussen, Lenz, and Blavoukos
2005). Variation in responses and response rates across units can be considerable and is generally related to the size, visibility, or salience of the target-unit being rated.
The 2002 and 2006 CHES had rater pools as small as four in Latvia (2002) and as large as
eighteen in the United Kingdom (2002). The Electoral Integrity Project’s Perception of
Electoral Integrity (PEI) survey had an average of eleven raters per country-​election but
ranged from a low of two (Mauritania in 2013) to a high of thirty-​six (Pakistan in 2013).
The Electoral Integrity Project also provides data on the number of experts solicited per
country (average thirty-​nine) and the response rates by country-​election (average 29%),
but response rates varied substantially across countries, from 6% (Mauritania) to 58%
(Czech Republic in 2012).
It is worth noting that there are special considerations for reporting on response rates
and response totals when expert respondents are also political elites. If a pool of elite
respondents is especially small per target, reporting details of responses at the target-​
unit level could jeopardize respondent confidentiality. For example, in the Candidate
Emergence Study, expert-​based measures are built from small groups of identifiable
political elites nested in a random sample of U.S. House districts; therefore, even the
names of the districts could not be revealed without potentially revealing respondents’
identities (Maisel and Stone 1998). In such circumstances, researchers can still provide
information on the patterns of responses within units without revealing the identity of
the target-​unit. Further, researchers who use restricted-​access expert data and plan to
submit their work to journals must be prepared to address confidentiality issues that
arise related to the replication and posting of data.7

Assessing and Reporting Uncertainty


Since some degree of error is inevitable in expert-​based measures, it is important that
researchers report measures of uncertainty about their target-​unit scores. Indeed, one
criticism of existing measures of democracy that are created from observations of mul-
tiple raters is that they rarely, if ever, report inter-​rater reliabilities for each country
(Coppedge et al. 2011, 251). The simplest approach is to report one of several possible
measures of expert agreement, such as percent agreement among experts, variance,
standard deviations, or confidence intervals for each unit of the target measure. Norris,
Martinez i Coma, and Gromping (2015, 36), for example, report confidence intervals
around the PEI index for each country-​election alongside the number of responses and
response rate for each unit. As a result, those using the index can judge the quality of
the expert-​derived indicator on an election-​by-​election basis. In addition, the project
makes available full individual-​level expert data sets so researchers can choose to calcu-
late other measures of uncertainty.8
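To make this kind of unit-by-unit reporting concrete, the sketch below computes the number of raters, mean, standard deviation, and a normal-approximation 95% confidence interval for each target-unit from an individual-level data set. The column names and values are hypothetical and are not drawn from the PEI data.

```python
import numpy as np
import pandas as pd

# Hypothetical individual-level expert data: one row per expert rating of a target-unit.
df = pd.DataFrame({
    "unit":  ["A", "A", "A", "B", "B", "C", "C", "C", "C"],
    "score": [6.0, 7.0, 5.5, 3.0, 4.5, 8.0, 7.5, 9.0, 8.5],
})

report = df.groupby("unit")["score"].agg(n_raters="count", mean="mean", sd="std")
report["se"] = report["sd"] / np.sqrt(report["n_raters"])
report["ci_low"] = report["mean"] - 1.96 * report["se"]    # normal approximation; very
report["ci_high"] = report["mean"] + 1.96 * report["se"]   # small pools may warrant a t interval

print(report.round(2))
```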
Inter-​rater agreement scores are a useful way to summarize whether evaluators have
similar perceptions of a common target (Dorussen, Lenz, and Blavoukos 2005; Maestas,
Buttice, and Stone 2014). They are calculated at the level of the target-​unit, and sum-
mary statistics for all target-​units are often used as a measure of reliability of the target
measure. These types of reliability measures are very common in communications re-
search when multiple coders create measures of the content of text or video data, but
they are also useful in any study in which multiple experts are used to measure the
same latent attribute of a target-​unit. Inter-​rater agreement scores are rooted in the
assumption that if all raters measure the true characteristics of a target without error,
their evaluations should be identical.
The goal of evaluating inter-​rater agreement, then, is to “evaluate whether a coding
instrument serving as common instructions to different observers of the same set of phe-
nomenon, yields the same data within a tolerable level of error” (Hayes and Krippendorff
2007, 78). Krippendorff's Alpha (Kalpha) is a popular measure, because it can be used to
assess agreement across a wide range of situations, including variable numbers of raters
per target, and is flexible with respect to the scaling of items being assessed (Hayes and
Krippendorff 2007). Another measure, created by Steenbergen (2001) and utilized in
Steenbergen and Marks (2007), employs a “scalewise similarity coefficient” to summarize
pairwise similarities across experts rating parties on a left-​right scale. Like other meas-
ures of inter-​rater reliability, this measure scores low when experts diverge and may in-
dicate that experts are not measuring the same trait (Steenbergen and Marks 2007, 356).
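In practice, Kalpha for an experts-by-units matrix can be computed with existing software. The sketch below assumes the third-party krippendorff Python package is installed (keyword names may differ across versions); the ratings are hypothetical, with np.nan marking units an expert did not rate.

```python
import numpy as np
import krippendorff   # third-party package, assumed installed: pip install krippendorff

# Rows are experts, columns are target-units; np.nan marks missing ratings.
reliability_data = np.array([
    [6.0, 3.0, 9.0, np.nan],
    [7.0, 2.0, 8.0, 5.0],
    [6.0, 3.0, np.nan, 4.0],
])

alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="interval")
print(f"Krippendorff's alpha: {alpha:.2f}")
```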
One measure of inter-rater reliability ($r_{wg}$) that is frequently used in organizational
research, psychological research, educational research, nursing, and other fields that
use multiple raters to assess targets (see Bliese, Halverson, and Schriesheim 2002; Burke
2002; Lindell, Brandt, and Whitney 1999; Lindell 2001; Meade and Eby 2007) is calcu-
lated as follows:

$$r_{wg} = 1 - \left( \frac{s_d^2}{s_{null}^2} \right)$$

where $s_d^2$ is the within-unit variance around the mean of an item or an average variance around the mean of a set of items and $s_{null}^2$ is the expected variance, under the assumption that respondents answered by randomly selecting points from the scale (i.e., all response is random error).9 This measure has an upper bound of 1, perfect agreement, because if raters are identical, $s_d^2 = 0$ and $r_{wg} = 1$. Lindell (2001) notes that when calculating $r_{wg}$ for an index, the appropriate $s_d^2$ is the variance of the index rather than the average variance of the index items, because the former will always be smaller than the latter.
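A minimal implementation of this index for a single item might look like the following sketch. It assumes a discrete A-point response scale and uses the uniform-distribution null variance, (A² − 1)/12, as one common choice for $s_{null}^2$; the example ratings are hypothetical.

```python
import numpy as np

def rwg_single_item(ratings, n_scale_points):
    """Within-unit agreement index r_wg for one item rated by several experts.

    ratings: ratings of one target-unit on a 1..A scale.
    n_scale_points: A, the number of response options.
    """
    s2_observed = np.var(ratings, ddof=1)            # within-unit variance, s_d^2
    s2_null = (n_scale_points ** 2 - 1) / 12.0       # variance of a uniform (random) response
    return 1.0 - (s2_observed / s2_null)

# Five hypothetical experts rate one target-unit on a 7-point scale.
print(round(rwg_single_item([5, 6, 5, 6, 5], n_scale_points=7), 2))  # close to 1: high agreement
```

Observed variance larger than the null variance yields a negative value, which applied work typically resets to zero.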
An alternative approach to reporting inter-​rater agreement scores is to report uncer-
tainty around target-​unit point estimates for the concept. The simplest version of this
is reporting confidence intervals around a mean. However, some scholars use sophisti-
cated latent variable models suitable for multi-​rater data, such as Bayesian Item Response
Theory models. These models can be used to estimate a target-​unit’s placement along
a latent scale and produce a measure of uncertainty about the placement of each unit
(see, e.g., Clinton and Lewis 2008; Coppedge et al. 2015; Jackman 2004). These models
are especially powerful for assessing the quality of individual raters and their biases in
applying scales across units, and they permit direct assessment of contextual effects on
raters. As such, they provide evidence to help explain why some subsets of target-​units
are more or less reliable.
Regardless of the specific measure used, it is crucial that researchers report uncer-
tainty about the target-​unit scores at the level of the target-​unit and not at the level of
the target measure. Whether represented by posterior densities, inter-​rater reliabilities,
standard deviation, or variance measures, they all vary across units and correlate with
factors associated with the units and with individual-​level respondent errors. High reli-
ability in some units and low reliability in others is indicative of context and rater effects.
Low reliability across all units often reflects vagueness in the conceptualization of the
measurement instrument that introduces high variability in response from experts.
In addition to reporting target-unit uncertainties, researchers should report reliabilities at the level of the target measures. One approach is simply to create summary statistics from the unit-level inter-rater reliabilities, but doing so fails to exploit the information that can be obtained from the variance of the target measure across target-units.
A better approach is to use a pooled “generalizability coefficient” that compares the var-
iance across target-​units (called a universe score) with the pooled observable variation
in the aggregates and individuals within each aggregate (Jones and Norrander 1996;
O’Brien 1990). The benefit of this measure compared to inter-​rater agreement scores
is that it speaks to the likelihood that the target measure distinguishes among target-​
units by leveraging the variance between units relative to the variance within target-​
units.10 Target measures that have greater variance across units and smaller variance
within units are judged more reliable. This type of measure is especially useful when
aggregating survey data to a higher unit, such as creating mean public opinion in a state
(Jones and Norrander 1996), but also works well in expert survey designs. The general-
izability coefficient ranges from 0 to 1, and target measures with high variation between units and low variation within units score closer to 1.
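One common ANOVA-based version of this coefficient for unit means is (MSB − MSW)/MSB, where MSB and MSW are the between-unit and within-unit mean squares. The sketch below implements that version for a balanced design with hypothetical ratings; unbalanced designs and other variants of the coefficient require adjustments not shown here.

```python
import numpy as np

def generalizability_coefficient(groups):
    """ANOVA-based reliability of target-unit means, (MSB - MSW) / MSB.

    groups: one array of expert ratings per target-unit, all of equal length
            (a balanced design is assumed for simplicity).
    """
    k = len(groups[0])                                    # raters per unit
    unit_means = np.array([np.mean(g) for g in groups])
    grand_mean = np.mean(np.concatenate(groups))

    ms_between = k * np.sum((unit_means - grand_mean) ** 2) / (len(groups) - 1)
    ms_within = np.mean([np.var(g, ddof=1) for g in groups])   # pooled within-unit variance
    return (ms_between - ms_within) / ms_between

# Three hypothetical target-units, each rated by four experts.
ratings = [np.array([6, 7, 6, 7]), np.array([2, 3, 2, 2]), np.array([9, 8, 9, 9])]
print(round(generalizability_coefficient(ratings), 2))   # near 1: units are well separated
```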
To summarize, researchers, at a minimum, should report the following aspects of the
expert design:

1. the nature of the design, including the set of units intended to be included, the
number of raters per unit, and whether raters are nested or overlapping;
2. the definition of expert and the source of information for identifying experts, in-
cluding strategies for supplementing the defined pool of experts with other types
of respondents;
3. the procedures for recruitment of experts and the survey mode used to elicit the
information, noting if these vary in systematic ways across target-​units;
4. the survey instrument and question wording;
5. the disposition of survey responses, including the total number surveyed, the total
number of respondents to the survey, including at the level of the target-​unit, the
overall response rates, and the target-​unit response rates; and
6. reliability scores or other measures of uncertainty at the level of both the target-​
unit and the target measure.

Which Is Better, More Expertise or More Experts?

Whether the goal is forecasting or observation, combining expert assessments produces better outcomes than using a single expert or rater (Boyer and Verma
1989; McNeese 1992; Philips 1981), but how many is enough? Enlarging a group of
“experts” often comes at the price of changing the boundaries of the definition of “exper-
tise.” Generally speaking, those most expert—​well-​positioned practitioners such as po-
litical elites, heads of agencies, or top managers in business—​are quite difficult to reach
and reticent to give opinions that might be traced back to them. Alternative sources,
such as staff members, journalists, or academics with specialized knowledge, might be
more numerous and easier to reach, but their level of expertise pertaining to the target
of interest may be less direct. Academics have also proven to be a good source of data on
a number of topics, including democratic accountability (Kitschelt and Kselman 2013),
electoral integrity (Norris, Frank, and Martinez i Coma 2013), and party placement (e.g.,
Bakker et al. 2012; Hooghe et al. 2010), among others.
Is it better to have a larger, less expert pool of raters, or a smaller, more expert pool?
Unquestionably, a higher number of equally skilled experts per target would improve
the reliability and validity of target-​unit measures, but pools of experts and research re-
sources to reach them are constrained; thus it is important to think about the trade-​off
from increasing a small rater pool by each additional, but perhaps less expert, rater.
One study addressed this question through simulating target measures from different
pools of raters while varying the size and expertise of the rater pool (Maestas, Buttice,
and Stone 2014, 359–​360). The researchers compared the validity and reliability of an
expert-​based measure of U.S. House incumbent ideology against a well-​accepted cri-
terion variable for incumbent ideology, DW-​NOMINATE scores. Using respondents
from the Cooperative Congressional Election Survey (CCES) who have varying degrees
of political knowledge, they created target measures from rater pools that systemat-
ically varied in the number of raters from two to thirty per district in each pool and
selected pools in one of two ways: randomly selecting among only respondents with
demonstrated political expertise in the U.S. House member’s district or randomly
selecting from all respondents in the district.
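The logic of that kind of simulation can be sketched in a few lines. The code below is not the CCES-based design itself; the criterion values, error variance, and pool sizes are hypothetical. It simply shows how the validity of an aggregated measure (its correlation with a criterion) changes as the rater pool grows.

```python
import numpy as np

rng = np.random.default_rng(7)
n_units = 435                                  # e.g., a set of districts (hypothetical setup)
criterion = rng.normal(0, 1, size=n_units)     # stand-in for a criterion such as DW-NOMINATE

def simulated_validity(pool_size, noise_sd=1.5):
    """Correlation between the criterion and unit means built from noisy raters."""
    ratings = criterion[:, None] + rng.normal(0, noise_sd, size=(n_units, pool_size))
    return np.corrcoef(ratings.mean(axis=1), criterion)[0, 1]

for pool_size in (2, 5, 10, 20, 30):
    print(pool_size, round(simulated_validity(pool_size), 3))
# Validity rises steeply for small pools and then flattens: the marginal gain from each
# additional rater becomes small once the pool passes roughly ten raters.
```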
Two findings from this study are instructive. First, holding constant the level of ex-
pertise of the pool, the marginal gains from additional raters drop considerably once
the size of the rater pool surpasses ten raters. Second, the gains in target measure va-
lidity and reliability from selecting on expertise are greatest for very small raters pools
(< 5) and become negligible when rater pools approach fifteen to twenty raters (Maestas,
Buttice, and Stone 2014, 360). Crucially, this result depends on the difficulty of the rating
task facing experts. Their findings revealed that gains from adding one additional rater
declined more rapidly for rater pools assessing “typical” incumbents (i.e., those for
whom the DW-​NOMINATE ideology score fell close to their party’s median) than for
rater pools assessing atypical incumbents (i.e., those whose DW-NOMINATE score fell far from their party median).11
These findings suggest that there are gains from adding raters, but the gain is much
greater if one is moving from a rater pool of say four to five rather than ten to eleven.
However, the findings also suggest that scholars should stretch their budgets to increase
the number of raters per unit when the concept or target that experts are rating is more
complex or atypical. Further, these findings suggest that researchers would benefit from
allocating additional resources to encouraging responses from solicited experts to boost
response rates in units where the number of available raters is small (Maestas, Buttice,
and Stone 2014).
The finding that larger pools of raters, even if less expert, produce target measures
with greater validity and reliability is echoed in a number of studies that pit expert
predictions (from either single experts or small groups of experts) against predictions
from other sources that draw from large pools of respondents, such as gambling
markets, crowd-​sourcing, or aggregated public opinion polls (e.g., Andersson, Edman,
and Ekman 2005; Gaissmaier and Marewski 2011; Graefe 2014; Green and Armstrong
2007 Sjoberg 2009). Green and Armstrong (2006), comparing forecasts of novices and
experts, found that the experts only slightly outperformed the accuracy of the novices,
and neither group did much better than would be expected by chance. In a direct
comparison of survey responses from the public and three pools of experts (political
scientists, journalists, and editors), Sjoberg (2009) found that the median forecasts from
the public outperformed the median for the expert group, even though the average error
in individual forecasts was greater in the public. These and similar studies highlight the
tension between expertise and the “wisdom of crowds” logic (see Surowiecki 2004).
Aggregating a large, diverse “crowd” of opinions, even if members of the crowd possess
incomplete knowledge, can produce superior forecasts to a single individual, regardless
of how expert he or she might be (Surowiecki 2004). Ironically, despite evidence to the
contrary, people are biased toward preferring a single expert to averages of large crowds
(Larrick and Soll 2006).
To summarize, evidence from a number of studies across several fields suggests that
researchers benefit by easing the boundaries that define expertise in order to widen the
pool of raters. The marginal gains are greatest when supplementing pools with fewer
than ten raters. Marginal gains are also greater when the concept being rated is complex.
Holding constant the size of the rater pool, greater expertise produces better quality
measures, but small pools of experts perform worse than larger pools with more diverse
expertise. Moreover, the crowd from which opinions are solicited must have at least par-
tial knowledge of the construct of interest. Surveys of experts can often be conducted
more quickly and with less cost than large public opinion surveys. This, combined with experts' store of specialized knowledge, suggests that seeking expertise is valuable. However, it points to the im-
portance of seeking observations from more than just a few experts per target.

Suggestions for Reducing Response Biases When Eliciting Expert Opinions

Errors in survey responses are unavoidable, but they can be minimized. Researchers
might consider several strategies when constructing instruments to help decrease both
systematic and random errors at the individual level, which in turn helps to improve the
validity and reliability of target measures.
Consider Cognitive Interviewing to Improve Clarity of Questions
When writing survey questions for expert surveys, conceptual clarity is essential
to recovering high-​quality, comparable responses. This can be challenging, since
researchers typically turn to expert surveys to measure concepts, which are by defi-
nition difficult to measure. In some cases, the theoretical concepts of interest are al-
ready well-​defined and understood in a uniform way, so developing questions to tap
these theoretical concepts is straightforward. The Electoral Integrity Project survey
questions are based on items that represent “agreed-​upon international conventions
and global norms that apply universally to all countries worldwide and cover each
stage of the election cycle” (Norris, Frank, and Martinez i Coma 2013, 128). Such has
not been the case in developing measures of corruption from expert data, where the
concept of corruption can be defined in a number of different ways. With no univer-
sally accepted definition, expert respondents have a wide berth in interpreting the
meaning of survey questions, leading to questions about the validity of the resulting
measures (Heywood and Rose 2014).
A key first step in obtaining comparable measures is to ensure that the relevant
dimensions of a concept are clearly specified to raters. To avoid ambiguity when
surveying experts about party positions, for example, researchers might specify
whether experts should provide the “formal position of the party” or the position of
“party leaders” (Whitefield et al. 2007). This type of specificity is important to en-
sure comparability of answers across experts; however, it is not always clear during
the questionnaire design phase how or whether questions may be interpreted
in different ways by different experts. Pretesting is an essential step in reducing
errors that arise from ambiguity in questions and must be undertaken in multiple
target-​units.
One approach to explore how experts interpret survey questions is to use cog-
nitive interviewing, a set of procedures to probe for and understand errors
respondents make when answering survey questions (Beatty and Willis 2007). The
procedures are usually performed as part of a pretest of an instrument, in which
interviewers ask subjects open-​ended questions about their understanding of the
question and the reasons for answering the way they did. Although this technique
was initially developed to be administered verbally following a survey, cognitive
interviewing has been successfully applied in online survey settings (Behr et  al.
2014). This type of pretest probing can be particularly useful in cross-​national
contexts, in which surveys are translated into different languages and question
wording may elicit different cultural referent points (Lee 2014). To get the most
from a pretest utilizing cognitive interviewing, researchers should seek to iden-
tify the variation in the expert pool most likely to create divergent responses to
identical survey items and make sure that members from relevant subgroups are
included in the pretest.
Use Anchoring Vignettes When Possible


A second area in which survey design can reduce error is through the use of anchoring
vignettes to reducing the errors associated with differential item functioning (DIF)
(Hopkins and King 2010; King et al. 2004; King and Wand 2007; Wand 2013). Anchoring
vignettes are short, concrete examples of a concept of interest (e.g., ideological
placement) that are included in the survey to assess how different individuals apply a
scale to the same example. This information can be used to construct a common scale
that is comparable across individuals in the analysis (King et al. 2004). This approach
has been applied to a number of substantive issues in cross-​national survey research and
has undergone a number of refinements over the past decade (see Hopkins and King
2010; King and Wand 2007).
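As an illustration, the simplest nonparametric version of this correction, described in King et al. (2004), recodes each respondent's substantive rating relative to that respondent's own placement of the vignettes. The sketch below uses hypothetical scale values and ignores refinements for tied or inconsistently ordered vignette ratings (see King and Wand 2007) as well as the parametric modeling alternative.

```python
def vignette_recode(target_rating, vignette_ratings):
    """Recode a rating relative to the respondent's own vignette placements.

    vignette_ratings must be given in the intended low-to-high order; the result
    ranges from 1 (below the lowest vignette) to 2 * J + 1 (above the highest).
    """
    score = 1
    for v in vignette_ratings:
        if target_rating > v:
            score += 2        # strictly above this vignette
        elif target_rating == v:
            score += 1        # tied with this vignette
            break
        else:
            break             # below this vignette
    return score

# A hypothetical expert places a party at 6 and three vignettes at 3, 5, and 8 on the
# same 11-point scale; the recoded value is comparable across experts who use the
# scale differently.
print(vignette_recode(6, [3, 5, 8]))   # -> 5: between the second and third vignettes
```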
One example of the successful use of anchoring vignettes can be found in the CHES,
in which they were used to improve the quality of expert measures of European parties
on an economic left-​right scale (Bakker et al. 2014).12 In the CHES, experts were first
asked to place the general position, economic position, and social position of parties
on a left-​right scale. After providing positioning information, they were given three
anchoring vignettes—​concrete examples of hypothetical parties—​and asked to place
them on the same eleven-​point scale used in the earlier evaluations (Bakker et al. 2014).
Since all respondents rated the same three vignettes, the responses could be used as
bridging information in the generalized “black box” scaling techniques developed by
Poole (1998). The country-​party level measures that result from this approach permit
cross-​national comparisons of parties on the left-​right scale and also offer a measure of
level of uncertainty about the placements (Bakker et al. 2014).
Vignettes offer an option to improve the quality of expert surveys, but like all
techniques, they involve trade-​offs that must be carefully weighed. Issues researchers
must consider include determining the number of vignettes to add to a survey, the con-
tent of vignettes, and the ordering of vignettes relative to the observations of interest. In
terms of survey ordering, experimental research by Hopkins and King (2010) strongly
suggests placing vignettes prior to the observations of interest so that the vignettes serve
as a prime and reduce DIF. Respondents, after answering several vignette questions, are
more likely to use a common conceptualization of the scale when reporting their own
attitudes. While this has not been tried in expert surveys, the approach seems promising.
Responses to the vignettes provide a point of comparison for the observations of in-
terest; more vignettes provide greater precision by increasing the common scale points
of comparison (King et al. 2004). However, there are costs to adding vignettes in terms
of survey time and respondent attentiveness. To address this, vignettes might be given
to only a subset of respondents or given during a pretest to reduce such costs (see King
et al. 2004). Development of content and evaluating vignettes for discriminatory power
is also essential to ensure that the corrections employed are, in fact, correcting response
category DIF and not introducing other forms of error into the process (see King and
Wand 2007).13
Include Survey Measures to Evaluate Expertise


Finally, in addition to carefully refining measures and including anchoring vignettes,
researchers can also build into the surveys mechanisms for assessing the quality of
responses received from experts. Some researchers advocate asking experts to express
their level of certainty about their assessments, then incorporating expert certainty
into the aggregation procedures to produce the target-​unit scores (e.g., Coppedge et al.
2014; Van Bruggen, Lilien, and Kacker 2002). Van Bruggen and colleagues compared
certainty-​weighted averages of target-​unit scores to unweighted averages on a measure
for which they had factual data to validate the measures and found that the certainty-​
weighted averages were more accurate. Coppedge and colleagues (2014) suggest
incorporating certainty assessments into measurement models used to produce the
point estimates for democracy items.
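The weighting step itself is simple to implement. The sketch below compares an unweighted mean with a certainty-weighted mean for a single target-unit, mapping a short certainty scale to numeric weights; the ratings, certainty values, and weighting scheme are all hypothetical.

```python
import numpy as np

# Hypothetical ratings of one target-unit and each expert's self-reported certainty
# (3 = "very", 2 = "somewhat", 1 = "not at all" certain), used directly as weights.
ratings = np.array([4.0, 5.0, 7.0, 6.0])
certainty = np.array([3, 3, 1, 2])

unweighted = ratings.mean()
weighted = np.average(ratings, weights=certainty)   # more-certain experts count for more

print(f"Unweighted mean:         {unweighted:.2f}")
print(f"Certainty-weighted mean: {weighted:.2f}")
```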
While the use of certainty measures has some advantages, there are potential
problems with this approach. First, including certainty measures for each question can
lengthen the survey considerably, thereby taxing respondents and increasing errors
that result from respondent fatigue. With that in mind, a short scale certainty question,
on which respondents are asked if they are “very,” “somewhat,” or “not at all” certain
of their response, might be preferable to a longer scale and has worked well in public
opinion surveys (Alvarez 1996; Alvarez and Franklin 1994). Survey length is not the only
problem, however. Self-​reports of certainty may reflect individual-​level characteristics
unrelated to knowledge or expertise and thus may introduce unexpected biases into the
aggregation process. For example, women are more likely to express uncertainty than
men when placing incumbents on an ideological scale (Alvarez and Franklin 1994), and
“experts” are prone to greater overconfidence than novices (Tetlock 2005).
An alternative approach is to include questions designed to evaluate the expertise of
raters: a set of questions about target-​units that can be validated against factual data or
other well-​established criterion variables. In essence, including such measures permits
researchers to “grade” the knowledge the rater has about the target of interest. The un-
derlying assumption is that raters who perform poorly at rating target-​units on known
quantities are unlikely to perform well at rating them on less obvious qualities. Stone and
colleagues (2010) used this approach in constructing and validating the target measures
for U.S. House incumbents’ prospects of winning and valence. They assigned lower ag-
gregation weights to raters who exhibited little knowledge of their district incumbent’s
ideological position (compared to DW-​NOMINATE). In comparing weighted and
unweighted target measures, they found that raters who did well at reporting their
incumbents’ ideology tended to hold more similar views of the valence characteristics
of the incumbent. However, it is worth noting that in a more controlled comparison,
Maestas, Buttice, and Stone (2014, 368) found that expertise weighting yields gains only
for the smallest pools of experts. The greater value of adding expertise questions is to
demonstrate that experts accurately rate target-​units on concepts for which a criterion
variable exists. By doing so, researchers can make a more convincing case that the same
raters are likely to perform well at rating target-​units on concepts that lack a measure to
establish criterion validity.

Combining the Wisdom of
Experts: What Works Best?

Minimizing error at the target-​unit level involves a two-​step process: first mitigating
error when eliciting information from experts and second mitigating error when com-
bining expert observations into target-​unit measures (Maestas, Buttice, and Stone 2014).
How information from raters is combined into a single measure per target-​unit varies
considerably across study designs. Some take simple unweighted means of observers per
target (Norris et al. 2015), others advocate dropping outlying raters before aggregating to
the mean (Ray 1999; Wagner and Onderco 2014), still others weight raters by expertise
(Stone et al. 2010; Van Bruggen, Lilien, and Kacker 2002), and some use sophisticated
multi-​rater latent variable models (Clinton and Lewis 2008; Jackman 2004).
There are essentially two schools of thought about how best to combine data for mul-
tiple raters into a single measure: the “mean aggregation” approach and the “measure-
ment model” approach. The conclusions from the former arise mostly from research
into forecast accuracy, where researchers can compare the forecasts that arise from
different aggregation strategies against realized outcomes. Numerous studies show that
computing an unweighted mean of all forecasts produces a consensus forecast that typ-
ically performs as well as or sometimes better than consensus forecasts produced by
more complicated aggregation schemes (Clemen 1989; Genre et al. 2013; Graefe et al.
2014; Smith and Wallis 2009). Although complicated weighting algorithms or measure-
ment models can, in some circumstances, produce improvements over equally weighted
combined forecasts, the potential for improvement comes with a risk of introducing ad-
ditional error through the weighting scheme (Graefe et al. 2014; Jose and Winkler 2008).
Genre and colleagues (2013) compared unweighted means to a number of different
combination strategies, including principal components, performance-based weights,
and Bayesian shrinkage models, and concluded that the more complicated strategies
offered only modest improvements over unweighted aggregation to the mean, and
that no single alternative approach consistently beat unweighted means over a range of
variables. The finding across many studies that alternative procedures offer little or no
improvement has created a consensus among forecast researchers that simple is best,
particularly if combining larger pools of forecasters whose errors serve to offset one an-
other (Graefe et al. 2014).
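The logic is easy to see in a small simulation. The sketch below is illustrative Python code with invented numbers, not a reanalysis of any study cited here: each of twenty forecasters gets a persistent bias plus idiosyncratic noise, and individual accuracy is compared to the accuracy of the unweighted consensus.

import numpy as np

rng = np.random.default_rng(42)
truth = 52.0                                    # true vote share being forecast
n_forecasters, n_contests = 20, 1000

bias = rng.normal(0, 1.0, size=(n_forecasters, 1))             # persistent forecaster bias
noise = rng.normal(0, 3.0, size=(n_forecasters, n_contests))   # contest-specific error
forecasts = truth + bias + noise

rmse_individual = np.sqrt(((forecasts - truth) ** 2).mean(axis=1))
consensus = forecasts.mean(axis=0)                              # unweighted mean forecast
rmse_consensus = np.sqrt(((consensus - truth) ** 2).mean())

print(f"median individual RMSE: {np.median(rmse_individual):.2f}")
print(f"unweighted consensus RMSE: {rmse_consensus:.2f}")

In runs of this sketch the consensus error is far smaller than the typical individual error; what remains is mostly error the forecasters share, which reweighting the same forecasts cannot remove.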
It is more challenging to assess the performance of different aggregation schemes for
studies in which experts are rating latent traits, because there is no criterion variable
for comparison. When differences arise between measures of complex concepts such
as “democracy” or “candidate quality,” it is difficult to determine which of the measures
is most accurate, expert assessments or alternative proxies for the concept. As a result,
scholars who use experts as measuring tools sometimes select aggregation strategies
with an eye toward eliminating or down-​weighting raters who seem atypical. Ray
(1999), for example, drops outlying raters in each unit when constructing measures of
party positions on European integration. The logic behind dropping outliers is that their
scores are more likely to be fraught with individual-​level measurement error.
Van Bruggen, Lilien, and Kacker (2002) highlight several approaches to minimizing
errors when aggregating to the mean by incorporating information about the experts
and their ratings, all of which, they argue, are both computationally simple and sim-
ilar in effectiveness to more complicated and costly strategies such as Bayesian estima-
tion. They employ “accuracy weights,” where accuracy is measured either by (1) distance
from the group mean or (2) respondents’ self-​reported confidence in their ratings. The
problem with the first strategy, of course, is that the group outliers are already incorpo-
rated into the group mean, which forms the reference point for judging evaluator accu-
racy. This is particularly problematic when working with small pools of raters, where a
single error-​prone evaluator can make a tremendous difference in the group mean, thus
biasing the weight measure as well as the aggregate measure. The second approach is a better
choice, provided that certainty questions for each item can be included in the survey
and the biases associated with self-​reports of certainty are addressed in weighting. Stone
and colleagues (2010) offer a third option, which involves weighting experts by their
performance at scoring target-​units on dimensions that can be easily validated against
external information, although as discussed above, the gains from this type of expertise
weighting are most significant when working with very small pools of raters.
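To make the weighting options concrete, the short Python sketch below aggregates expert ratings to target-unit scores three ways: an unweighted mean, a certainty-weighted mean, and an expertise-weighted mean based on each rater's error on a verifiable validation item. The data, the variable names, and the simple inverse-error weight are invented for illustration; they are not a published procedure.

import numpy as np
import pandas as pd

ratings = pd.DataFrame({
    "target":    ["A", "A", "A", "B", "B", "B"],
    "rating":    [4, 5, 2, 3, 3, 5],               # expert rating of the latent trait
    "certainty": [3, 2, 1, 3, 3, 2],               # self-reported certainty (1-3)
    "val_error": [0.5, 1.0, 3.0, 0.2, 0.8, 2.5],   # error on a verifiable item
})
ratings["expertise_w"] = 1.0 / (1.0 + ratings["val_error"])   # illustrative weight

def combine(df):
    return pd.Series({
        "unweighted": df["rating"].mean(),
        "certainty_weighted": np.average(df["rating"], weights=df["certainty"]),
        "expertise_weighted": np.average(df["rating"], weights=df["expertise_w"]),
    })

print(ratings.groupby("target")[["rating", "certainty", "expertise_w"]].apply(combine))

For target A, the third rater's outlying rating is down-weighted under both weighting schemes, pulling the combined score toward the better-informed raters.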
The model-​based approach is a different strategy for addressing systematic errors
that arise at the level of both the rater and the unit. In this approach, target-​unit score
estimates are produced through latent variable models, and the model selected varies
from researcher to researcher. Some use Aldrich-​McKelvey (1977) scaling procedures
to correct for differences in item scaling across expert respondents (Bakker et al. 2014;
Saiegh 2009), while others turn to Bayesian Item Response Theory models (Clinton and
Lewis 2008; Coppedge et al. 2015; Jackman 2004).
A particularly clear explanation of a multi-​rater latent variable model appears in
Jackman’s (2004) article, in which he estimates graduate program applicant quality
based on multiple raters on a graduate admissions committee. This article highlights the
different types of errors that typically crop up in any type of expert rater data. The raw
data show evidence that committee members apply the quality rating differently from
one another, and that systematic biases likely contaminate their ratings of applicants.
Further, not all committee members reviewed all files, so the mapping of experts to
target-​units contains some overlap for each target-​unit, but the overlap is incomplete,
making simpler approaches for extracting latent scores inapplicable.
Jackman’s concern is not only to produce estimates of graduate applicant quality,
but to provide an estimate of uncertainty about the latent trait and permit meaningful
comparisons across applicants while taking uncertainty into account. To do so, he
utilizes a Bayesian item response model derived from education, but he extends it to
apply to a multi-​rater setting. The model utilizes data about the applicants (their scores,
fields of study, gender, etc.) and data about the committee members (which files they
read, their applicant ratings) to estimate via Markov chain Monte Carlo methods a large
number of unknown parameters, including the latent applicant trait “quality” (posterior
mean), which, crucially, is purged of systematic rater bias.
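A heavily simplified version of the idea can be written in a few lines with a probabilistic programming library. The sketch below uses PyMC with invented ratings and a plain normal model rather than Jackman's item response specification; it estimates a latent quality for each applicant while giving every rater a systematic bias that is estimated, and thereby removed, along the way.

import numpy as np
import pymc as pm

# Invented example: 3 raters score an overlapping subset of 4 applicants.
rater = np.array([0, 0, 1, 1, 2, 2, 0, 2])
applicant = np.array([0, 1, 1, 2, 0, 3, 3, 2])
score = np.array([7.0, 5.5, 6.0, 8.0, 6.5, 4.0, 4.5, 7.5])

with pm.Model():
    quality = pm.Normal("quality", mu=0.0, sigma=1.0, shape=4)       # latent trait
    rater_bias = pm.Normal("rater_bias", mu=0.0, sigma=1.0, shape=3)
    grand_mean = pm.Normal("grand_mean", mu=score.mean(), sigma=5.0)
    sigma = pm.HalfNormal("sigma", sigma=2.0)                        # rating noise
    mu = grand_mean + quality[applicant] + rater_bias[rater]
    pm.Normal("obs", mu=mu, sigma=sigma, observed=score)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=1)

# Posterior means rank applicants on quality, net of each rater's bias.
print(idata.posterior["quality"].mean(dim=("chain", "draw")).values)

The posterior draws also supply the uncertainty estimates that make pairwise comparisons of applicants meaningful, the feature Jackman emphasizes.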
The strength in this type of model lies in the flexibility to produce estimates for a wide
range of quantities of interest, including those that permit direct comparison of target-​
units to one another (i.e., rankings with uncertainties) and estimates of systematic
biases and differential reliability of individual experts, in this case committee members.
Clinton and Lewis (2008) provide an example of this type of a model applied to ratings
of bureaucratic agencies. Coppedge and colleagues (2015) draw on this technique in
estimating latent traits of democracies. In future iterations, they plan to extend their
model to incorporate a wider range of information about the characteristics of raters,
target-​units, and temporal dynamics.
Regardless of which model is chosen, researchers should strive for transparency,
which means that they should provide full details about the aggregation procedures
selected to combine expert observations into target-​unit scores. Included in this is the
disclosure of any criteria for excluding raters from the aggregation procedure and any
mathematical formula for weighting individuals when aggregating to the target-​unit
level. As discussed previously, full reports should also include an estimate of uncertainty
about the score at the level of the target-​unit.

Summary and Suggestions

This chapter provided an overview of a number of studies from different fields that draw
on experts as a tool of measurement. Taken together, they offer exciting possibilities
for exploring factors that shape the quality of governance through improved measures
of democracy; election quality; and the qualities and positions of candidates, parties,
legislators, executives, and bureaucratic agencies. Expert surveys hold promise of
expanding our ability to measure a wide range of theoretical constructs that are im-
portant but difficult to observe through observational data or document sources. They
provide scholars with flexibility in the timing and frequency of data collection. This flexibility reduces the errors that arise when measurement occurs at a point in time far removed from the temporal domain of the construct. The methodological toolkit for creating
and evaluating measures built from expert survey responses is evolving rapidly, and the
core goal of those advancing this field is to identify strategies to minimize error at both
the individual and target-​unit levels of analysis to enhance the validity and reliability of
target measures.
The various sections in this chapter have highlighted four specific areas researchers
should think about carefully as they design projects that draw on experts. The first step is
to define the domain of the theoretical concepts of interest and identify the target-​units
appropriate to study. Once target-​units are defined, researchers must identify pools
of individuals to serve as expert raters and consider how they map onto target-​units.
Single-​rater-​per-​unit designs should be avoided, because multi-​rater designs produce
measures that are both more valid and more reliable. When possible, researchers should take advantage of the additional information gained from fully crossed or partially crossed
designs, in which at least some raters provide overlapping ratings for at least some units.
Crossed designs offer the greatest leverage to recover information about differential ap-
plication of scales across raters.
Central to the design task is defining who qualifies as an “expert.” It is important to
define expertise broadly and not set the bar overly high; larger rater pools are better,
even if the average expertise is lower. It is also important to identify pools of raters with
diverse perspectives rather than to draw from pools of individuals likely to hold sim-
ilar biases or who draw from common information sources. Aggregate error is reduced
when a wider range of individuals with differing perspectives and stores of information
contribute to the aggregate.
Although survey response error is unavoidable, it can be minimized by paying careful
attention to the construction of the survey instrument. Pretesting is a must and ideally
should include open-​ended response opportunities to learn how raters with different
characteristics or in different contexts perceive the meaning of questions. To help re-
duce the variance in how raters apply scales, researchers might include anchoring
vignettes and place them toward the front of the instrument to encourage subjects to
apply the scale in a similar fashion across the entire survey. Finally, it is important to en-
courage raters to feel comfortable opting out of rating a target-​unit on one or more items
or taking the option of saying “don’t know” to reduce the errors that arise from guessing
when they lack knowledge.
The quality of data from raters will certainly vary based on factors such as their at-
tentiveness to the rating task, their level of knowledge about target-​units, and their
understanding of the meaning of the questions or scale. Researchers can benefit from
including survey items to help them evaluate the quality of expert responses, such as
measures of respondent certainty about their answers or questions that ask raters to
score items that have verifiable referents. This information can be incorporated into
the design of aggregation weights or included in measurement models to help re-
duce error in the construction of target-​unit scores. Perhaps most important is that
researchers should strive to provide full and transparent information about how they
arrived at the estimates of target-​unit scores and their uncertainty about the scores.
At a minimum, measures based on expert surveys should report the number of raters
per unit along with a clear description of the procedures used to combine their data.
This description should include any procedures used to preprocess the data before
combining them, including things like purging bias from individual observations,
imputing data, excluding items or observations from the aggregation set, or anything
else that transforms the raw data prior to aggregation. It should also include full math-
ematical specification of the aggregation process, including details of weighting indi-
vidual experts. Ideally, the replication data sets posted would include the raw expert
data as well as the aggregated measures. In some cases, confidentiality makes this im-
possible; in those cases, full transparency in describing procedures associated with
creating the target-​unit measures is especially important. Finally, researchers should
report measures of inter-​rater reliability or other measures of uncertainty for the
aggregated unit scores. By providing full transparency, researchers empower others
with an interest in using their measures to assess their quality. In addition, transpar-
ency in reporting provides a roadmap to others developing studies that draw on ex-
pert surveys as a measurement tool.

Notes
1. Data sets and the details of the study can be found at Electoral Integrity Project, https://​
sites.google.com/​site/​electoralintegrityproject4/​home. Martinez i Coma and Van Ham
(2015) provide validity analysis for the 2012–​2013 survey responses.
2. The Varieties of Democracy Project, https://​v-​dem.net/​en/​.
3. See Albright and Mair (2011) for a concise review of the history of using expert surveys for
ideological placement of European parties. See also Bakker, Jolly, Polk, and Poole (2014),
Benoit and Laver (2006), Marks et al. (2007), and Ray (2007) for discussions of the validity of
this approach relative to other measurement strategies.
4. Information about the Candidate Emergence Study can be found at http://​ces.iga.ucdavis.
edu/​. The expert survey approach was extended and refined in the UCD-​Congressional
Election Study in 2010. Data and general information about this project are available at
http://​electionstudy.ucdavis.edu/​.
5. Aggregating cannot offset biases that arise from question wording or other factors that
create a similar CMV bias across all respondents. See Podsakoff et al. (2003) for a full
discussion of the problem of CMV biases and potential research design and statistical
solutions.
6. See AAPOR’s list of Best Practices at http://​www.aapor.org/​AAPORKentico/​Standards-​
Ethics/​Best-​Practices.aspx#best12.
7. A number of journals in social sciences have adopted standards that require cited data in
published works to be posted in a public repository, which potentially creates challenges
for researchers using confidential expert assessment data. Authors using data with unu-
sual access restrictions due to confidentiality or other reasons must notify editors at the
time of submission of their limits. For a discussion of research data access and transpar-
ency issues in political science, see Lupia and Elman (2014) and related articles in the same
issue of PS: Political Science and Politics.
8. The PEI data can be found at https://​dataverse.harvard.edu/​dataverse/​PEI.
9. If the assumed null is a uniform distribution, this value is (c² − 1)/12, the variance of a discrete uniform distribution with c categories. In a case where a survey question has a seven-point scale, such as a seven-point Likert scale, s²null is 4. There are other possible assumptions about the shape of the null variance, but this is an easy starting place.
10. The pooled measure designed for use in studies in which individual raters assess a single
trait (survey question) for a single unit is calculated as follows:

	Eρ̂² = [MS(a) − MS(r : a)] / MS(a)
In this case, MS(a) is the variance of the mean across all districts and can be estimated
from the “between” mean square error from an ANOVA. The MS(r : a) captures the vari-
ance of individual responses around the means within districts, estimated as the “within”
mean sum of squares (Jones and Norrander 1996, 301–​302). The measure is more com-
plicated when raters observe more than one target unit, but O’Brien (1990) also derives
the variance components necessary to calculate generalizability coefficients for measures
created when two or more raters assess all units of a target (fully crossed designs). In this
case, the variance components are drawn from a two-​way ANOVA. However, he notes
that the coefficient does not work well for calculating reliability in situations in which two or
more raters assess multiple units but not all units (1990, 480–​490). In such cases, a better
alternative might be to calculate summary statistics from the inter-​rater agreement scores
such as Steenbergen’s (2001) scale reliability score or the rwg statistic.
11. The study used data from the Cooperative Congressional Election Study Common
Content survey. We compared the reliability and validity of ratings of expertise screened
respondents for pools of raters from two to thirty raters. We also compared the expertise-​
screened rater pools to randomly selected rater pools of identical size. See Maestas,
Buttice, and Stone (2014) for full details of the study.
12. Details of the methodology of the CHES can be found in Hooghe et al. (2010) and Bakker,
Jolly, Polk, and Poole (2014).
13. A  full review of the methodology of anchoring vignettes is beyond the scope of this
chapter, but Gary King has developed extensive Web resources for scholars at http://​gking.
harvard.edu/​vign.

References
Adams, J., S. Merrill III, E. N. Simas, and W. J. Stone. 2011. “When Candidates Value Good
Character: A Spatial Model with Applications to Congressional Elections.” Journal of Politics
73 (1): 17–​30.
Albright, J. J., and P. Mair. 2011. “Does the Number of Parties to Place Affect the Placement of
Parties? Results from an Expert Survey Experiment.” Electoral Studies 30 (4): 858–​864.
Aldrich, J. H., and R. McKelvey. 1977. "A Method of Scaling with Applications to the 1968 and
1972 Presidential Elections." American Political Science Review 71 (1): 111–130.
Alvarez, R. M. 1996. Information and Elections. Ann Arbor: University of Michigan Press.
Alvarez, R. M., L. R. Atkeson, and T. E. Hall. 2013. Evaluating Elections: Tools for Improvement.
New York: Cambridge University Press.
Alvarez, R. M., and C. H. Franklin. 1994. “Uncertainty and Political Perceptions.” Journal of
Politics 56 (3): 671–​688.
Andersson, P., J. Edman, and M. Ekman. 2005. “Predicting the World Cup 2002: Performance
and Confidence of Experts and Non-​Experts.” International Journal of Forecasting 21
(3): 565–​576.
Atkeson, L. R., A. N. Adams, C. Stewart, and J. Hellewege. 2015. “The 2014 Bernalillo County
Election Administration Report.” Typescript, University of New Mexico. https://​polisci.
unm.edu/​common/​documents/​2014-​b ernalillo-​county-​nm-​election-​administration-​
report.pdf.
Atkeson, L. R., Y. Kerevel, R. M. Alvarez, and T. E. Hall. 2014. “Who Asks for Voter
Identification?” Journal of Politics 76 (4): 944–​957.
Bailer, S. 2004. “Bargaining Success in the European Union: The Impact of Exogenous and
Endogenous Power Resources." European Union Politics 5 (1): 99–123.
Bakker, R., C. de Vries, E. Edwards, L. Hooghe, S. Jolly, G. Marks, . . . M. A. Vachudova. 2012.
“Measuring Party Positions in Europe: The Chapel Hill Expert Survey Trend File, 1999–​
2010.” Party Politics 21 (1): 143–​152.
Bakker, R., S. Jolly, J. Polk, and K. Poole. 2014. “The European Common Space: Extending the
Use of Anchoring Vignettes.” Journal of Politics 76 (4): 1089–​1101.
Beatty, P. C., and G. B. Willis. 2007. "Research Synthesis: The Practice of Cognitive
Interviewing.” Public Opinion Quarterly 71 (2): 287–​311.
Behr, D., M. Braun, L. Kaczmirek, and W. Bandilla. 2014. “Item Comparability in Cross
National Surveys: Results from Asking Probing Questions in Cross-​national Web Surveys
about Attitudes Towards Civil Disobedience.” Qual Quant 48: 127–​148.
Benoit, K., and M. Laver. 2006. Party Policy in Modern Democracies. London: Routledge.
Bliese, P. D., R. R. Halverson, and C. A. Schriesheim. 2002. “Benchmarking Multilevel Methods
in Leadership: The Articles, the Model, and the Dataset.” Leadership Quarterly 13 (1): 3–​14.
Boyer, K. K., and R. Verma. 2000. “Multiple Raters in Survey-​Based Operations Management
Research: A Review and Tutorial.” Production and Operations Management 9 (2): 128–​140.
Budge, I. 2000. “Expert Judgments of Party Policy Positions: Uses and Limitations in Political
Research.” European Journal of Political Research 37 (1): 103–​113.
Burke, M. J., and W. P. Dunlap. 2002. “Estimating Interrater Agreement with the Average
Deviation Index: A User’s Guide.” Organizational Research Methods 5 (2): 159–​172.
Buttice, M. K., and W. J. Stone. 2012. “Candidates Matter: Policy and Quality Differences in
Congressional Elections.” Journal of Politics 74 (3): 870–​887.
Castles, F. G., and P. Mair. 1984. “Left-​Right Political Scales, Some ‘Experts’ Judgments.”
European Journal of Political Research 12 (1): 73–​88.
Clemen, R. T. 1989. “Combining Forecasts:  A Review and Annotated Bibliography.”
International Journal of Forecasting 5 (4): 559–​583.
Clinton, J. D., and D. E. Lewis. 2008. “Expert Opinion, Agency Characteristics, and Agency
Preferences.” Political Analysis 16 (1): 3–​20.
Cooke, R. M., and L. H. J. Goossens. 2004. “Expert Judgment Elicitation for Risk Assessments
of Critical Infrastructures.” Journal of Risk Research 7 (6): 643–​656.
Coppedge, M., and J. Gerring, with D. Altman, M. Bernhard, S. Fish, A. Hicken, . . . J. Teorell.
2011. “Conceptualizing and Measuring Democracy:  A New Approach.” Perspectives on
Politics 9 (2): 247–​267.
Coppedge, M., J. Gerring, S. I. Lindberg, D. Pemstein, S.-​ E. Skaaning, J. Teorell,  .  .  .
B. Zimmerman. 2015. “Varieties of Democracy Methodology v4.” Varieties of Democracy
Project, Project Documentation Paper Series. https://​v-​dem.net/​en/​reference/​version-​4-​
mar-​2015/​.
Curini, L. 2010. “Experts’ Political Preferences and Their Impact on Ideological Bias.” Party
Politics 16 (3): 299–​321.
Dorussen, H., H. Lenz, and S. Blavoukos. 2005. “Assessing the Reliability and Validity of Expert
Interviews.” European Union Politics 6 (3): 315–​337.
Gaissmaier, W., and J. N. Marewski. 2011. “Forecasting Elections with Mere Recognition from
Small Lousy Samples: A Comparison of Collective Recognition, Wisdom of Crowds, and
Representative Polls.” Judgment and Decision Making 6 (1): 73–​88.
Genre, V., G. Kenny, A. Meyler, and A. Timmerman. 2013. “Combining Expert Forecasts: Can
Anything Beat the Simple Average?” International Journal of Forecasting 29: 108–​121.
Graefe, A. 2014. “Accuracy of Vote Expectation Surveys in Forecasting Elections.” Special issue,
Public Opinion Quarterly 78: 204–​232.
Graefe, A., J. S. Armstrong, R. J. Jones, and A. G. Cuzán. 2014. “Combining Forecasts:  An
Application to Elections.” International Journal of Forecasting 30: 43–​54.
Green, K. C., and J. S. Armstrong. 2007. “The Ombudsman: Value of Expertise for Forecasting
Decisions in Conflicts.” Interfaces 37 (3): 287–​299.
Hayes, A. F., and K. Krippendorff. 2007. “Answering the Call for a Standard Reliability Measure
for Coding Data.” Communication Methods and Measures 1 (1): 77–​89.
Heywood, P. M., and J. Rose. 2014. “ ‘Close but No Cigar’: The Measure of Corruption.” Journal
of Public Policy 34 (3): 507–​529.
Hooghe, L., R. Bakker, A. Brigevich, C. De Vries, E. Edwards, G. Marks, . . . M. Vachudova.
2010. “Reliability and Validity of the 2002 and 2006 Chapel Hill Expert Surveys on Party
Positioning.” European Journal of Political Research 49: 687–​703.
Hopkins, D. J. and G. King. 2010. “Improving Anchoring Vignettes:  Designing Surveys to
Correct Interpersonal Incomparability.” Public Opinion Quarterly 74 (2): 201–​222.
Huber, J., and R. Inglehart. 1995. “Expert Interpretations of Party Space and Party Location in
42 Societies.” Party Politics 1 (1): 73–​111.
Jackman, S. 2004. “What Do We Learn from Graduate Admissions Committees? A Multiple
Rater Latent Variables Model with Incomplete Discrete and Continuous Indicators." Political
Analysis 12: 400–424.
Jones, B. S., and B. Norrander. 1996. “The Reliability of Aggregated Public Opinion Measures.”
American Journal of Political Science 40 (2): 295–​309.
Jose, V. R. R., and R. L. Winkler 2008. “Simple Robust Averages of Forecasts: Some Empirical
Results.” International Journal of Forecasting 24: 163–​169.
Kahneman, D. 2011. Thinking Fast and Slow. New York: Farrar, Straus and Giroux.
King, G., C. J. L. Murray, J. A. Salomon, and A. Tandon. 2004. “Enhancing the Validity and
Cross-​Cultural Comparability of Measurement in Survey Research.” American Political
Science Review 98 (1): 191–​207.
King, G., and J. Wand. 2007. “Comparing Incomparable Survey Responses: Evaluating and
Selecting Anchoring Vignettes.” Political Analysis 15 (1): 46–​66.
Kitschelt, H., and D. M. Kselman. 2013. “Economic Development, Democratic Experience,
and Political Parties' Linkage Strategies." Comparative Political Studies 46 (11): 1453–1484.
Kunda, Z. 1990. “The Case for Motivated Reasoning.” Psychological Bulletin 108 (3): 480–​498.
Larrick, R. P., and J. B. Soll. 2006. “Intuitions about Combining Opinions: Misappreciation of
the Averaging Principle.” Management Science 52 (1): 111–​127.
Lee, J. 2014. “Conducting Cognitive Interviews in Cross-​National Settings.” Assessment 21
(2): 227–​240.
Lindell, M. K. 2001. “Assessing and Testing Interrater Agreement on a Single Target Using
Multi-​Item Rating Scales.” Applied Psychological Measurement 25 (1): 89–​99.
Lindell, M. K., C. J. Brandt, and D. J. Whitney. 1999. “A Revised Index of Interrater Agreement
for Multi-​Item Ratings of a Single Target.” Applied Psychological Measurement 23 (2): 127–​135.
Lodge, M., and C. S. Taber. 2013. The Rationalizing Voter. New  York:  Cambridge
University Press.
Lupia, A., and C. Elman. 2014. “Openness in Political Science:  Data Access and Research
Transparency.” PS: Political Science and Politics 47 (1): 19–​42.
Maestas, C. D., M. K. Buttice, and W. J. Stone. 2014. “Extracting Wisdom from Experts and
Small Crowds: Strategies for Improving Informant-​Based Measures of Political Concepts.”
Political Analysis 22 (3): 354–​373.
Maisel, L. S., and W. J. Stone. 1998. “The Politics of Government Funded Research:  Notes
from the Experience of the Candidate Emergence Study.” PS-​Political Science & Politics 31
(4): 811–​817.
Marks, G., L. Hooghe, M. Steenbergen, and R. Bakker. 2007. "Crossvalidating Data on Party-
positioning on European Integration." Electoral Studies 26 (1): 22–38.
Martin, T. G., M. A. Bergman, F. Fidler, P. M. Kuhnert, S. Low-​Choy, M. McBride, and K.
Mengersen. 2012. “Eliciting Expert Knowledge in Conservation Science.” Conservation
Biology 26 (1): 29–​38.
Martinez i Coma, F., and C. Van Ham. 2015. “Can Experts Judge Elections? Testing the Validity
of Expert Judgments for Measuring Election Integrity.” European Journal of Political
Research 54 (2): 305–​325.
McCann, S. J. H. 1998. “The Extended American Social, Economic, and Political Threat Index
(1788–​1992).” Journal of Psychology 132 (4): 435–​449.
McDonald, M. D., S. M. Mendes, and M. Kim. 2007. “Cross-​Temporal and Cross-​National
Comparisons of Party Left-​Right Positions.” Electoral Studies 26 (1): 62–​75.
McNees, S. K. 1992. "The Uses and Abuses of 'Consensus' Forecasts." Journal of Forecasting 11
(8): 703–​7 10.
Meade, A. W., and L. T. Eby 2007. “Using Indices of Group Agreement in Multilevel Construct
Validation.” Organizational Research Methods 10 (1): 75–​96.
Meyer, J. P., A. E. Cash, and A. Mashburn. 2011. “Occasions and the Reliability of Classroom
Observations:  Alternative Conceptualizations and Methods of Analysis.” Educational
Assessment 16 (4): 227–​243.
Norris, P., R. W. Frank, and F. Martinez i Coma. 2013. “Assessing the Quality of Elections.”
Journal of Democracy 24 (4): 124–​135.
Norris, P., F. Martinez i Coma, and M. Gromping. 2015. “The Year in Elections, 2014.” The HKS
Faculty Research Working Paper Series. https://​research.hks.harvard.edu/​publications/​
workingpapers/​Index.aspx.
O’Brien, R. M. 1990. “Estimating the Reliability of Aggregate-​Level Variables Based on
Individual-​level Characteristics.” Sociological Methods and Research 18: 473–​504.
Philips, L. 1981. “Assessing Measurement Error in Key Informant Reports: A Methodological
Note on Organizational Analysis in Marketing.” Journal of Marketing Research 18
(4): 395–​415.
Podsakoff, P. M., S. B. Mackenzie, J.-​Y. Lee, and N. P. Podsakoff. 2003. “Common Method Biases
in Behavioral research: A Critical Review of the Literature and Recommended Remedies.”
Journal of Applied Psychology 88 (5): 879–​903.
Poole, K. T. 1998. “Recovering a Basic Space from a Set of Issue Scales.” American Journal of
Political Science 42 (3): 954–​993.
Powell, L. 1989. “Analyzing Misinformation:  Perceptions of Congressional Candidates’
Ideologies.” American Journal of Political Science 33 (1): 272–​293.
Ray, L. 1999. “Measuring Party Positions on European Integration:  Results from an Expert
Survey.” European Journal of Political Research 36 (2): 283–​306.
Ray, L. 2007. “Validity of Measured Party Positions on European Integration: Assumptions,
Approaches, and a Comparison of Alternative Measures.” Electoral Studies 26: 11–​22.
Rohrschneider, R., and S. Whitefield. 2009. “Understanding Cleavages in Party Systems: Issue
Position and Issue Salience in 13 Post-​Communist Democracies.” Comparative Political
Studies 42 (2): 280–​313.
Saiegh, S. M. 2009. “Recovering a Basic Space from Elite Surveys:  Evidence from Latin
America.” Legislative Studies Quarterly 34 (1): 117–​145.
Sjoberg, L. 2009. “Are All Crowds Equally Wise? A Comparison of Political Elections Forecasts
by Experts and the Public.” Journal of Forecasting 28 (1): 1–​18.
Smith, J., and K. F. Wallis. 2009. “A Simple Explanation of the Forecast Combination Puzzle.”
Oxford Bulletin of Economics and Statistics 71 (3): 331–355.
Steenbergen, M. 2001. “Item Similarity in Scale Analysis.” Political Analysis 8 (3): 261–​283.
Steenbergen, M., and G. Marks. 2007. “Evaluating Expert Judgments.” European Journal of
Political Research 46: 347–​366.
Stone, W. J., S. A. Fulton, C. D. Maestas, and L. S. Maisel. 2010. “Incumbency
Reconsidered: Prospects, Strategic Retirement, and Incumbent Quality in the U.S. House
Elections.” Journal of Politics 72 (1): 178–​190.
Stone, W. J., and E. N. Simas. 2010. “Candidate Valence and Ideological Positions in U.S. House
Elections.” American Journal of Political Science 54 (2): 371–​388.
Surowiecki, J. 2004. The Wisdom of Crowds. New York: Random House.
Tetlock, P. E. 2005. Expert Political Judgment: How Good Is It? How Can We Know? Princeton,
NJ: Princeton University Press.
Van Bruggen, G. H., G. L. Lilien, and M. Kacker. 2002. “Informants in Organizational
Marketing Research:  Why Use Multiple Informants and How to Aggregate Responses.”
Journal of Marketing Research 39 (4): 469–​478.
Wagner, W., and M. Onderco. 2014. “Accommodation or Confrontation? Explaining
Differences in Policies Toward Iran.” International Studies Quarterly 58 (4): 717–​728.
Wand, J. 2013. “Credible Comparisons Using Interpersonally Incomparable Data:
Nonparametric Scales with Anchoring Vignettes.” American Journal of Political Science 57
(1): 249–​262.
Whitefield, S., M. A. Vachudova, M. R. Steenbergen, R. Rohrschneider, G. Marks, M. P.
Loveless, and L. Hooghe. 2007. “Do Expert Surveys Produce Consistent Estimates of Party
Stances on European Integration? Comparing Expert Surveys in the Difficult Case of
Central and Eastern Europe.” Electoral Studies 26 (1): 50–​61.
Winkler, R. L., and R. T. Clemen. 2004. “Multiple Experts vs. Multiple Methods: Combining
Correlation Assessments.” Decision Analysis 1 (3): 167–​176.
Chapter 26

The Rise of Poll Aggregation
and Election Forecasting

Natalie Jackson

Introduction

The public face of polls and elections fundamentally changed in 2008 when a statis-
tician named Nate Silver, who had been busy forecasting the performance of Major
League Baseball players up to that point, created a mostly poll-​based forecast of the pres-
idential election on a blog he called FiveThirtyEight. The attempt to forecast elections
was not in itself a new idea; academic political scientists had been producing electoral
forecasts for quite some time. But how Silver presented it to the public generated tons of
attention from nonacademic audiences and changed how elections and polls are cov-
ered in the media.
In only a few years, election forecasts became highly valued and a necessary compo-
nent of election coverage for some media outlets. By 2012 academics were publicizing
their forecasts on blogs and websites, the New York Times was hosting FiveThirtyEight,
and the Huffington Post Pollster added a forecast to its repertoire. In 2014 five media
outlets produced forecasts for the Senate elections that calculated the probability that
Republicans would gain the Senate majority.
Silver’s popularity may have seemed to come out of nowhere, but the groundwork for
his success had been laid by other developments. Advances in polling methodology and
technology had resulted in a large increase in the number of election polls, especially
since 2000, leaving media and the public wondering how to sort through all the infor-
mation from various polls which often showed different results. In response, websites
providing polling averages, or “aggregations,” began to pop up, most prominently
Pollster (now part of the Huffington Post) and RealClearPolitics, providing a single esti-
mate for an electoral contest. These aggregations show where the electoral contest stands
at the current moment in time, but before long Silver and others added complex statis-
tical techniques to poll aggregations to generate forecasts of the later electoral outcomes.
This chapter begins by tracking the developments in polling technology that allowed
more and more polls to be conducted, how aggregators attempted to condense all the
polls for consumers and media, and the development of forecasting techniques that
used poll aggregation to create election forecasts. Following that is a technical discus-
sion of how poll aggregation and forecasting are done and the statistical challenges
specific to each. The last section focuses on how consumers can evaluate aggregations
and forecasts and how analysts can build and communicate better about their models.
The forecasts are typically fairly advanced statistical models, and it is difficult both for
analysts to communicate about them and for consumers to know what to look for in de-
ciding how much faith to put in a forecast, or even which forecast to trust.

Polling Developments, Aggregation,
and Forecasting

Election polling methods were developed throughout the 1930s and 1940s, and some of
the first efforts were complete failures (Gallup 1951). The process of obtaining a repre-
sentative sample of voters for any contest was complicated and expensive and required
face-​to-​face interviewing, in which a trained interviewer had to go to randomly selected
households and administer the survey. Eventually household telephones became ubiq-
uitous enough that instead of face-​to-​face interviews, pollsters could have interviewers
call house phones and do an interview without leaving the office.
In the 1960s and 1970s telephone interviewing became the standard. By the 1980s
nearly 100% of American households had at least one phone line, which was great news
for pollsters, since face-​to-​face surveys were very reliable but also very expensive and
time-​consuming. Getting a good sample of phone numbers was simple, and the race
to improve calling efficiency and data collection speed was on. In the 1990s computer-​
assisted telephone interviewing became the norm, as software developers created sys-
tems that would dial phone numbers and record data from an interview.
Computer technology continued to improve the efficiency of polling operations.
Autodialers, which automatically place calls, and predictive dialers, programs that auto-
matically dial a phone number but only connect the call to a live interviewer if a person
answers the call, greatly increased efficiency and reduced costs by eliminating the need
for interviewers to sit in silence listening to phones ring endlessly. Some pollsters went
even further: To circumvent the costs of employing interviewers to conduct the polls,
automated voice technology was adapted for polling purposes. A recording would read
respondents the questions, and they would answer using buttons on a touch-tone phone.
Polling was getting easier and cheaper, which meant more companies wanted to get
in the game, and more individuals, campaigns, and organizations wanted—​and could
afford—​polling data. The number of national-​level presidential election trial-​heat polls
skyrocketed by as much as 900% between 1984 and 2000 and continued to expand in the
2000s, when Internet polls came on the scene (Hillygus 2011). The industry suddenly
had an entirely new mode of interviewing people that was fast and cheap, making it de-
sirable despite its considerable coverage issues. But telephone surveys were facing cov-
erage issues of their own as cell phone use expanded and response rates declined.
As the volume of polling, number of pollsters, and methods of conducting polling
grew, poll watchers faced a problem: When several polls all have different estimates,
which set of numbers is right? How could a consumer even find all of the available polls?
Poll aggregation and averaging offered some answers.

Aggregation
The Internet not only offered a new mode of collecting information, it offered a place
to store and display that information. Early in the 2000s a handful of websites emerged
dedicated to collecting available polling data and attempting to help consumers make
sense of those data. RealClearPolitics came online in 2000, and Pollster (originally
Mystery Pollster, and now part of the Huffington Post) began collecting polls in 2004.
The goals of these sites were simple: create a one-​stop shop for information about
pre-​election polling and the polls themselves, and provide a simple explanation of what
the polls say about the electoral contest. The key difference was that RealClearPolitics
tended to focus its analysis on the campaigns and political side, whereas Pollster focused
its analysis on the polling and methodology used to collect the data. Both sites eventu-
ally began to produce poll averages as parsimonious indicators of what was happening
in the race. These averages offered a single set of numbers that took multiple polls into
account, a simpler prospect than consumers trying to figure out pollster ratings (Silver
2016) or measures of poll accuracy (Martin, Traugott, and Kennedy 2005). Averages pro-
vided an easier way for the general public and media to look at the polls—​particularly
since media resources tightened during the same period in which polling expanded,
leaving newsrooms with less expertise and leaning more on polls to frame election cov-
erage (Rosenstiel 2005).
The sites took different approaches to averaging. RealClearPolitics used a simple
average of the last five polls for that particular electoral contest in its database to gen-
erate overall estimates. Pollster created charts using all of the polls, plotted by the dates
they were conducted, and then used a regression technique to estimate a poll average—​
illustrated as a line on the chart—​to show what the estimated average was over time. The
statistical implications of each averaging technique are discussed in the next section.
Although these methods differed significantly in execution, the end result was a way to
look at many different polls and make sense of the information.
In recent years more websites have begun collecting and aggregating polls.
FiveThirtyEight and the Daily Kos track polls, although mostly for use in their
forecasting models, and other sites often emerge during election seasons. The methods
for aggregation have become more complex (although the simple ones described above
are still used), to calculate confidence intervals and the probability of a candidate
winning. But the first polling averages paved the way for poll-​based forecasts by devel-
oping the concept of pooling all of the polls into a “model” of sorts that would use the
power of all the available polling information to estimate where public opinion stands.

Forecasting
The inevitable question that emerged from poll aggregation was this:  If all of the
polls put together say x, then what does that say about the future election outcome?
Academic political scientists had been using various data sources—​some including
polls, some not—​for many years to forecast electoral outcomes by the time Nate Silver’s
forecast attracted public interest in 2008 (Lewis-​Beck and Stegmaier 2014). Some of
the academic forecasts were cited in blogs and news sources, but most had remained
confined to the meeting rooms of the American Political Science Association and its
smaller affiliates. Scholars were doing tremendous work, but the forecasts were typically
static—​the forecast was calculated once. Silver not only packaged his forecast well on his
blog, but made it a continuously updating forecast that practically demanded constant
attention from political junkies.
Silver benefited from great timing. There was no shortage of polling data by 2008, and
new polls were released at least every week, sometimes every day close to Election Day.
Combined with the information political scientists had honed about the “fundamentals”
that affect presidential election outcomes, such as economic indicators and the sitting
president’s approval ratings, there was a lot of quantitative information about the elec-
tion to utilize in a forecast.
It probably helped that the 2008 presidential contest commanded more attention
than other recent elections, in part due to Democratic nominee Barack Obama’s pop-
ularity with young and minority voters. People were willing to log onto a website to see
if the projections supported the enthusiastic “hope and change” campaign. The model
correctly projected Obama’s win and predicted the election’s outcome in forty-​nine of
fifty states (Clifford 2008). The forecast was viewed as a huge success.
As with any successful venture that generates a lot of Internet traffic, people began
to look for ways to replicate that success. The New  York Times took over hosting
FiveThirtyEight, and Silver became a full-​time writer and forecaster. By the 2012 elec-
tion, two other independent blogs, run by Princeton University professor Sam Wang
and Emory University professor Drew Linzer, were running forecasts, and Stanford
University political scientist Simon Jackman produced a forecast for Pollster, by then
part of the Huffington Post. Most of these forecasts were very accurate; Silver, Jackman,
and Linzer correctly predicted all fifty states. Wang only missed Florida, which had a
razor-​thin margin and couldn’t be called on election night (Jackman 2012c).
In 2014 the expansion continued, and more media outlets got into the forecasting
game for the midterm Senate elections. The big national question this time was
whether the Republicans would take the majority in the Senate away from the
Democrats. Five separate national media outlets had their own forecasts. The
Washington Post got into the game via an academic blog called The Monkey Cage,
led by George Washington University professor John Sides; FiveThirtyEight became
its own data journalism website under the auspices of ESPN and ABC; and Linzer
worked with the Daily Kos. The New York Times and the Huffington Post (Pollster)
hired data scientists to work on their forecasts. Wang ran his forecast again at the
Princeton Election Consortium.
Most of these forecasts depend heavily on polls, and there are usually more polls for
higher offices. Combined with a desire to appeal to as broad an audience as possible,
that means there is a bias toward forecasting races at the national level. Forecasting
presidential elections is the most obvious choice for getting attention and using lots
of polls, but Senate elections can also garner attention. In 2014 there might have been
little national appeal in an individual Senate race, but the possibility of an overall
majority swing attracted national interest. So forecasters projected individual races
to get to the bigger forecast: how likely it was that Republicans would take over the
Senate. Not all Senate contests were heavily polled, but the races identified as close
and likely to affect the national majority had plenty of polls with which to construct
a forecast.
By contrast, fewer outlets covered the 2014 gubernatorial contests, for which there
was less national attention, than covered the Senate races. The Huffington Post, the Daily
Kos, and FiveThirtyEight produced forecasts for the thirty-​six individual gubernato-
rial elections, and only the Daily Kos released its gubernatorial projections at the same
time as its Senate projections. Both Huffington Post and FiveThirtyEight debuted their
gubernatorial forecasts closer to Election Day than their Senate forecasts. In addition
to less expected attention from a national audience, there were fewer polls in many of
the gubernatorial races than in the Senate races. Since the forecasting methods depend
mostly on polls (and some depend completely on polls), this meant that the gubernato-
rial outcomes were more difficult to forecast.
For the same reasons, elections for the House of Representatives are difficult to fore-
cast. The polls problem is much larger here; House races are rarely polled often enough
to generate a forecast for any individual races. The best option for getting an idea of
where House contests stand is to use questions from national polls that ask generically
whether respondents plan to vote for the Republican or the Democratic candidate in
their local congressional election, although this does not produce enough information
to determine any single district’s status. The result is that House elections get little focus
in forecasting, but Senate and presidential elections are popular.
In the future, more media outlets and more academics could get into the public
forecasting game. There might not be much utility for consumers in increasing the
number of forecasts, a point discussed in the section on controversies, but the tech-
nology needed to do the statistics required for these forecasts is becoming easier and
easier to access even as the methods of aggregating and forecasting have become more
complex.

The Statistics of Aggregation
and Forecasting

Using polls to produce aggregated estimates and forecasts is a complex task, because no
two polls are alike. In theory, by pooling the polls the sample size is effectively increased
and uncertainty about the estimates decreased, but polls cannot be simply pooled to-
gether, because most pollsters do not release the raw data sets when they release the poll
numbers. Even if a pollster does deposit the raw data into an archive, typically the data
are not immediately available for aggregators and forecasters producing in-​the-​moment
estimates. Without raw data available, aggregators and forecasters have to work with the
aggregated numbers the pollsters do release—​usually the “toplines” that show what pro-
portion of the sample answered the question a certain way. In a horse-​race pre-​election
poll, aggregators work with the poll’s estimate of support for each candidate. Forecasters
typically rely on poll toplines when they incorporate polling data as well.
Instead of working with individual-​level data, as the pollsters do in their raw data,
aggregators and forecasters work with poll-​level data. This distinction has substantial
implications for working with the data. Treating each poll as a unit of analysis means
there are far fewer units to analyze and restricts the type of statistical analysis that
can be done. Aggregators tend to use simpler methods that frequently resemble (or
are) simple averages of the poll estimates. Forecasters typically use advanced models
that include information about the polls and produce estimates of uncertainty in the
aggregated polls and the entire forecast model. The next sections discuss common
statistical methods for aggregating polls and forecasting election results based on
polling data.

Aggregation
The simplest way to aggregate polls is to average the estimates of recent polls.
RealClearPolitics calculates poll averages by reporting the arithmetic mean of the most
recent four to eight polls. If a lot of polls have been done on a contest within the last few
weeks, the time range covered by the average will be shorter (and if several polls were
conducted on the same dates, they are all usually included in the average), whereas if
only a few polls were done over several months, the average will cover a much longer
period of time.
For example, if President Barack Obama’s approval rating in the last five polls was
42, 45, 45, 43, and 42, the polling average would be (42 + 45 + 45 + 43 + 42)/​5, or 43.4.
Sometimes called “rolling” or “moving” averages, these numbers are updated each
time a new poll is released, creating a series of averages over time that can be plotted
on a chart. Multiple averages can be plotted on the same chart to show how multiple
candidates or different answer options compare. Figure 26.1 shows the RealClearPolitics
chart of moving averages for national polls on the 2016 Republican presidential primary
races, with each line representing a different candidate (RealClearPolitics 2015). Since a
new number is calculated each time a poll is released, the estimates move abruptly from
one number to the next in straight lines along the time series.

Figure 26.1  RealClearPolitics Polling Averages, National 2016 Republican Primary.
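In code, a rolling average of this kind amounts to a few lines. The Python sketch below uses the hypothetical approval numbers from the example above, plus two invented earlier polls, and recomputes the mean of the most recent five polls each time a new poll arrives.

polls = [44, 43, 42, 45, 45, 43, 42]   # hypothetical polls, oldest to newest

def rolling_average(values, window=5):
    averages = []
    for i in range(len(values)):
        recent = values[max(0, i - window + 1): i + 1]   # last `window` polls so far
        averages.append(round(sum(recent) / len(recent), 1))
    return averages

print(rolling_average(polls))   # final value: (42 + 45 + 45 + 43 + 42) / 5 = 43.4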
The Huffington Post Pollster aggregations use locally weighted scatterplot smoothing
(LOESS) to produce poll estimates. In this method, the chart comes first. Poll estimates
are plotted on a chart in which the x-​axis is the date the poll was conducted and the y-​
axis is the proportion of poll respondents answering with a specific option or candidate.
Once all the polls are plotted over time, LOESS plots a smooth line representing the best
estimate of candidate support (or whatever is being measured) over time, based on the
data nearest the point in time the line is estimating. Multiple estimates or candidates
can be shown on the same chart, but the LOESS line is calculated for each candidate
or option individually. Figure 26.2 shows the HuffPost Pollster chart using this method
for the national polls on the 2016 Republican presidential primary races (Huffpost
Pollster 2015).
Figure 26.2  Huffington Post Pollster Polling Averages, National 2016 Republican Primary.

The math behind a LOESS line is more complex than a simple average. The LOESS is
a nonparametric technique, meaning it does not assume that the data will follow a spe-
cific distribution; for example, typical regression analysis assumes that the data being
estimated are normally distributed. Nonparametric regression relaxes that assumption
so that the data can be fit as they are without a distributional assumption (Gibbons
1993). The locally weighted part means that the estimate produced by the regression is
weighted to the values of the data points closest to it. That means the value of the LOESS
line at a certain point is more representative of the points immediately surrounding it
than it is of the data farther away.
The number of polls prior to May 1 that LOESS will use to estimate the proportion for
May 1 depends on a user-​defined “span,” which in this case would be a number of days,
because the x-​axis is based on dates. The certainty of the LOESS estimate for any given
day depends on the number of polls within that date span. For example, the LOESS line
estimate for May 1, 2015, in figure 26.2 is forced to reflect the result of a regression on the
polling data closest to May 1 for that candidate. If there were many polls conducted just
before May 1, the LOESS calculation would be more reliable and certain than it would
be if there were not many polls conducted around May 1. If May 1 is the end of the time
series, meaning that is the last estimate, obviously only polls before that date can be used
in the estimate. As time moves on, however, the span will include polls both before and after that date.
However, even with the different methods of averaging, there are not many differences
between figures 26.1 and 26.2, illustrating the fact that the methods often produce sim-
ilar results. The advantage of LOESS is that it uses the weighting to treat more recent poll
results as more important and can therefore pick up trends faster when a candidate’s
support is moving up or down than a simple average that treats the last n polls equally.
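The sketch below shows how such a smoother can be fit with the lowess function in the Python statsmodels library. The poll data are invented, and this is an illustration of the general technique rather than the Pollster production code, whose span and weighting choices differ.

import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
days = np.sort(rng.integers(0, 200, size=60))             # field dates of 60 polls
support = 45 + 0.02 * days + rng.normal(0, 2, size=60)    # poll estimates of support (%)

# `frac` plays the role of the span: the share of nearby polls that influence
# the locally weighted fit at each date.
trend = lowess(support, days, frac=0.3)

print(trend[-5:])   # column 0: date, column 1: smoothed estimate of support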
Once there is a critical mass of polls for two-​or three-​candidate electoral contests,
Pollster moves to an even more advanced model, based on a Bayesian Kalman filter,
to combine the polls and plot the averages on the chart (Jackman 2012b). The model
calculates an average for each day based on the polls available prior to that date and a
user-​defined number of simulations (typically a large number, 100,000 or more) in what
are called Markov chains. Markov chains use data to simulate what the outcome might
be as more data come in, as well as the probability that the current “state” of the outcome
might change. The Markov chains require starting values, or “initial” points, to begin, as
well as information telling the simulations how to work. There are many different shapes
data can take, called “distributions,” and it is necessary to specify what shape the data
in simulations are taking. For the poll averaging model, the initial points are randomly
selected by the computer along normal distributions.
The model uses these parameters to begin running the simulations with the Markov
chain Monte Carlo (MCMC) method to calculate the point estimate for each candidate
on each date of the time series—​typically from the date of the first poll until the current
date. The model incorporates the polls that were available for each day, pulling in the
relevant polls as it continues toward the current date, at which time all of the polls are
being considered. More recent polls are more influential in the average than older polls.
When it is done with the MCMC simulations, the model generates point estimates for
each candidate on each date, as well as estimates of how certain that outcome is, plus
estimates of undecided proportions and the margin between the candidates (Jackman
2005; 2012b). The information is plotted onto a chart, with a line summarizing the daily
poll averages and shaded bands representing the range within which the poll estimate
landed in 95% of the simulations. If these bands overlap, it illustrates that a leading can-
didate might not actually be ahead. The more polls there are to average, the smaller the
shaded error bands are, since more information leads to more certainty about the av-
erage. Figure 26.3 shows what this looked like for national polls asking respondents
which party’s candidate they intended to vote for in the 2014 House of Representatives
elections (Huffpost Pollster 2014).
The lines and shaded error bands illustrate the advantages of aggregating polls using
this Kalman filter model. When there are fewer polls, as there were in May through
September 2013, the error bands are wider and overlap despite a general consensus among the polls conducted in those months that Republicans were ahead. When
there were more polls in the fall of 2014 leading up to the election, the error bands got
very small, since more polls clustering together reduces the uncertainty of the average
estimates.
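A drastically simplified version of this kind of model-based averaging can be sketched with a basic Kalman filter: latent support is treated as a slow random walk, and each poll updates the estimate in proportion to its precision. The polls, the prior, and the drift value below are invented, and the actual Pollster model, estimated with MCMC, is considerably richer.

```python
# A toy Kalman-filter poll average: true support follows a random walk, and each poll is a
# noisy observation of it. A sketch of the general idea only, not the HuffPost Pollster model.
import math

# Hypothetical polls: (day, reported share, sample size).
polls = [(2, 46.0, 800), (6, 44.5, 1000), (9, 47.0, 600),
         (15, 45.0, 1200), (21, 48.0, 900), (27, 47.5, 1500)]

mean, var = 45.0, 25.0        # prior belief about support on day 0 (mean, variance)
drift_var_per_day = 0.05      # how much true support is allowed to move each day
day = 0

for poll_day, share, n in polls:
    # Project forward: uncertainty grows while no polls arrive.
    var += drift_var_per_day * (poll_day - day)
    day = poll_day

    # Sampling variance of the poll, in percentage points squared.
    p = share / 100.0
    poll_var = (p * (1 - p) / n) * 100**2

    # Kalman update: weight the poll by its precision relative to the current estimate.
    gain = var / (var + poll_var)
    mean = mean + gain * (share - mean)
    var = (1 - gain) * var

    print(f"day {day:2d}: estimate {mean:4.1f}%  (95% band about ±{1.96 * math.sqrt(var):.1f})")
```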

Forecasting
Forecasting requires substantially more complex methods, which demand more statistical expertise and computing power. This section takes a very broad look at the techniques used to produce forecasts, primarily using examples from Senate and presidential election forecasts.
[Figure 26.3. Huffington Post Pollster Polling Averages, 2014 House Race National Party Preference. Model-based daily averages with shaded 95% uncertainty bands for the national House vote from early 2013 through the 2014 election; final estimates Republican 46.1%, Democrat 43.8%, plus undecided.]
Although Senate elections are a series of state-level contests and the presidential race
is usually regarded as a national-​level election, the forecasting techniques are similar: (1)
estimate the contest at the state level, then (2) combine state-​level results to determine the
national-​level forecast. Presidential forecasts need the first part at the state level because
the Electoral College system allocates votes by state, essentially making the national elec-
tion a series of state-​level elections, which are then combined to generate the probability
of someone winning at the national level. Senate forecasts start as state-​level elections,
but become national by putting all the state-​level estimates together to estimate a proba-
bility of whether Republicans or Democrats will hold the majority after the election.
It is very difficult to summarize the methods behind forecasts: each model is different, the detail of the methodological explanations published about the models varies substantially, and not all of the code used to organize the data and generate predictions
is posted publicly. However, there are some basic parts of forecasting models that can be
generally summarized: Bayesian models versus non-​Bayesian models, models that in-
clude election “fundamentals” versus polls-​only models, options for adjusting the data,
and simulating outcomes.

Bayesian vs. Non-Bayesian Modeling

The fundamental difference between a Bayesian forecast model and a non-​Bayesian
forecast model is the ability to incorporate “priors,” or prior information, into the model.
The theory behind a Bayesian model is that the modeler knows certain things about the
question—​in this case, we know a few things about the election coming up—​but there
are also data available to update the preexisting knowledge. Both the prior knowledge
(or beliefs) and data are used to generate the “posterior,” or what is known about the
question after it is modeled (Gill 2007). Unlike non-​Bayesian, or “frequentist,” methods,
the posterior is a distribution of possible values rather than a single estimate. The mean
of that distribution serves as the point estimate, and the distribution itself provides in-
formation about the uncertainty of that estimate.
In the case of an election, prior information could be what has happened in past
elections; what others think will happen in the election; or “fundamentals” such as ec-
onomic indicators, incumbency, or approval ratings. The 2014 Huffington Post model
created priors based on ratings produced by the Cook Political Report, quantified by
analyzing the proportion of times that the Cook Report predictions were correct in the
past (Jackson 2014). The 2014 New York Times model and Linzer’s 2012 model use prior
estimates from fundamentals, quantified by putting the various measures into a regres-
sion model (Cox and Katz 2014; Linzer 2012).
The priors are combined with the polling data in a time series model to produce the
posterior estimates for each electoral contest. The mean of the posterior distribution
is typically used as the estimate, with the rest of the distribution serving as the cred-
ible intervals for the estimate. When the question is how likely one candidate is to de-
feat another candidate, the posterior is calculated for each candidate and for the margin
between the candidates. If the posterior distributions for the candidates overlap, or
the distribution for the margin between the candidates crosses 0, there is a chance the
candidates are tied. The probability of one candidate leading another, or one candidate
winning, is calculated out of these posterior distributions and the likelihood that they
overlap (Jackman 2012a; Jackson 2014).
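The core prior-plus-data logic can be illustrated with a single normal-normal update: a fundamentals-style prior on a candidate's margin is combined with a polling average, and the probability of leading is read off the resulting posterior. The numbers below are invented, and real forecast models embed this logic in much richer time-series structures.

```python
# A minimal Bayesian update of a candidate's margin (invented numbers, illustrative only).
from math import sqrt
from statistics import NormalDist

prior_mean, prior_sd = 2.0, 4.0   # prior on the margin, e.g., derived from "fundamentals"
poll_mean, poll_se = 4.5, 2.0     # polling average of the margin and its standard error

# Precision-weighted (conjugate normal-normal) posterior.
prior_prec, poll_prec = 1 / prior_sd**2, 1 / poll_se**2
post_var = 1 / (prior_prec + poll_prec)
post_mean = post_var * (prior_prec * prior_mean + poll_prec * poll_mean)
post_sd = sqrt(post_var)

# Probability the candidate is actually ahead = P(margin > 0) under the posterior.
p_ahead = 1 - NormalDist(post_mean, post_sd).cdf(0)
print(f"posterior margin {post_mean:.1f} ± {post_sd:.1f}; P(ahead) = {p_ahead:.2f}")
```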
Non-​Bayesian modeling eliminates the priors and works with traditional regression
techniques. The basic procedure is similar, however. Polling averages are calculated using
some form of time series model or LOESS procedure, similar to the Pollster aggregation
techniques just described (Silver 2008a). If fundamentals are used, they are combined
or modeled to get a fundamentals estimate. Then the polls and fundamentals are put to-
gether to generate a single outcome (Silver 2014a). These outcomes can still be expressed
in probabilistic terms using the confidence intervals and standard errors that the models
produce, so that both the Bayesian and frequentist models are reported similarly. It is
only in reading the details of each model that the difference in techniques becomes clear.
There have not been enough models yet to say whether Bayesian or frequentist
models do a better job of forecasting the election. In theory, the Bayesian setup, which
makes use of prior information, seems better suited to forecasting an election in which
prior beliefs and information abound, but in practice in 2014, the differences between
the model types were minute (Sides 2014b).

Fundamentals vs. Polls-​Only


The other major methodological difference between forecast models is whether the
model includes “fundamentals” about the election or is polls-​only. Polls-​only models
are exactly what they sound like: the only data used to predict the outcome are polls.
Models that include fundamentals pull in a wide variety of nonpolling data to help pre-
dict the outcome. In 2014 the New York Times, FiveThirtyEight, and Washington Post
models all incorporated various fundamentals about the election (Cox and Katz 2014;
Sides 2014c; Silver 2014a).
Fundamentals are generally anything besides horse-​race polls that contains in-
formation about how the election might turn out. Each of the 2014 models that used
fundamentals incorporated some combination of indicators about how Americans were
feeling about the president, the political parties, and the candidates, as well as economic
and financial indicators. The New York Times model used polling data on presidential ap-
proval and the generic congressional ballot question (the proportion of voters planning
to vote for the Republican or Democratic congressional candidate), FiveThirtyEight
used the congressional ballot question and congressional approval ratings, and the
Washington Post used presidential approval ratings and economic performance (change
in the gross domestic product). The Washington Post and FiveThirtyEight incorpo-
rated measures of the partisan makeup of the district or state, incumbents’ previous win
margins, and measures of political experience for each candidate. FiveThirtyEight went
even further, adding fundraising information and ideology scores for the candidates.
This is not an exhaustive list of the possible fundamentals that could be included in
a forecast model, or even of all the nuances of the 2014 models that used fundamentals.
Election fundamentals are subjective, and anything that relates to the partisan makeup
of the electorate, the mood of the electorate, or any aspect of the candidates themselves
could be considered a “fundamental.”
How these fundamentals are used in the models depends on what they measure.
Senate majority and presidential forecasting models generally have two stages: in the
first, outcomes for individual contests within each state are estimated, and in the second, those outcomes are aggregated to produce the probability that a party will win the Senate majority or that a candidate will win enough electoral votes to become president. In practice the stages are not completely separate, since results correlate across states, especially in the presidential election, but thinking of the process in two stages makes it easier to see how the fundamentals are used. In the Senate
models, individual candidate characteristics such as incumbency and fundraising, and
state-​level estimates of partisanship or previous election results, will factor into the first
stage. National measures would be included in the second stage to calculate the overall
chances of the party getting a majority in the Senate. In a presidential model, the candi-
date information moves to the second stage, since the candidates are the same nation-
ally (unless of course there is state-​level fundraising information that could be used in
the first stage), and the national information is used to help determine the candidate’s
chances of getting at least 270 electoral votes.
As with Bayesian versus non-​Bayesian models, though, there seems to be little
difference in the models’ ultimate performance between the polls-​only models and the
ones that include fundamentals. Models that include fundamentals do have an advan-
tage over polls-​only models in the months preceding the election, however. Polls are
known to be ineffective at predicting election outcomes more than a couple of months
prior to the election (Erikson and Wlezien 2012). Fundamentals provide more information about the electorate and the general election atmosphere, and because they do not change frequently, they act as a stabilizing force in the model when early polls are not necessarily indicative of what will happen. Figure 26.3 illustrates
how unpredictable early polls can be; during the fifteen-​month span prior to the elec-
tion starting in September 2013, the lead changed from Republican to Democrat, or vice
versa, no fewer than eight times. Fundamentals had steadily indicated that Republican
candidates would get the majority of votes for the House of Representatives for most of
that time (Sides 2014a).
Most of the models for the 2014 Senate forecasts debuted on their media websites in
the spring and summer prior to the election, and presidential forecasts have followed the
same pattern. Interest in the forecasts grows as the election gets closer, but releasing the
forecasts several months before the election is wise; Pew research data show that over
the last several presidential election cycles, between 25% and 40% of Americans said
they were paying close attention to the election nine months out (Jackson 2015b).
Generally, forecasts that rely heavily on fundamentals several months prior to the election will slowly lean more and more on the polls, assuming that as the election
gets closer, poll respondents pay more attention to the electoral atmosphere, and the
fundamentals are absorbed into polling preferences (Cox and Katz 2014; Sides 2014c).
Once the election is only a few weeks away, even models that use fundamentals are
leaning primarily on polling data, meaning that there are few differences between the
polls-​only models and the fundamentals and polls models by the end of the cycle (Sides
2014b).
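One simple way to represent that gradual hand-off is to weight the fundamentals estimate and the poll average by the time remaining before the election. The sketch below is a generic illustration of the idea rather than any outlet's actual rule; the 150-day window and the margins are assumptions.

```python
# Blending a fundamentals estimate with a poll average, leaning more on polls as
# Election Day nears. Illustrative only; not any published model's actual weighting.
def blended_margin(fundamentals_margin, poll_margin, days_until_election,
                   window_days=150):
    """Weight shifts linearly from fundamentals to polls over the final window of days."""
    poll_weight = max(0.0, min(1.0, 1 - days_until_election / window_days))
    return poll_weight * poll_margin + (1 - poll_weight) * fundamentals_margin

# Hypothetical race: fundamentals imply +1 for the incumbent party, polls show +4.
for days_out in (200, 120, 60, 14, 1):
    estimate = blended_margin(1.0, 4.0, days_out)
    print(f"{days_out:3d} days out: blended margin {estimate:+.1f}")
```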

Other Options for Adjusting Data


Some models take other steps to adjust the polling data beyond simply modeling what
the polls say. The primary reason for this is that not all polls are created equal, and a
significant question forecasters face is which polls to include and how to account for
differences between polling methods and populations—​some polls report registered
voter populations and some report likely voters. Undecided proportions in polls are an-
other issue that requires attention.
Most forecasts incorporate all or almost all of the available polls, but FiveThirtyEight
makes a notable adjustment for pollster quality based on its internally calculated poll-
ster ratings (Silver 2014a, 2016). Pollsters are ranked based on how accurate
they have been in the past and their transparency. The 2014 Huffington Post model
also adjusted estimates by pollster quality, measured by how pollsters had performed
in the 2012 model (Blumenthal and Jackson 2014). Other poll adjustments addressed
differences between likely voter polls and registered voter polls: both FiveThirtyEight
and the New York Times tweaked registered voter polls by shifting them in the expected
direction of likely voter polls—​toward Republicans (Cox and Katz 2014).
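The logic of down-weighting lower-rated pollsters can be illustrated with a simple weighted average; the polls and quality weights below are invented and are not FiveThirtyEight's or the Huffington Post's actual ratings or adjustments.

```python
# Quality-weighted poll average (illustrative weights, not any outlet's actual ratings).
polls = [
    # (pollster, candidate share, quality weight between 0 and 1)
    ("Pollster A", 47.0, 1.0),   # strong track record and transparency
    ("Pollster B", 44.0, 0.6),
    ("Pollster C", 51.0, 0.3),   # weak track record, counted less
]

simple_average = sum(share for _, share, _ in polls) / len(polls)
weighted_average = (sum(share * weight for _, share, weight in polls) /
                    sum(weight for _, _, weight in polls))

print(f"simple average:   {simple_average:.1f}%")
print(f"weighted average: {weighted_average:.1f}%")
```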
Undecided poll respondents are a problem for forecasters; poll results inevitably re-
port some proportion of the sample that was undecided about their vote choice, but on
Election Day there are no undecideds. The Daily Kos forecast completely removed the
undecided proportions from the calculations and recalculated the proportions for each
candidate to equal 100% in all of the polls (Daily Kos 2014). The Huffington Post forecast
added more uncertainty in the estimates based on the proportion of undecideds in the
polling averages (Blumenthal and Jackson 2014).
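The Daily Kos-style treatment amounts to renormalizing the decided shares so that they sum to 100%; the sketch below shows that arithmetic with made-up numbers.

```python
# Dropping undecideds and renormalizing the decided shares to 100% (illustrative numbers).
poll = {"Candidate A": 46.0, "Candidate B": 42.0, "Undecided": 12.0}

decided_total = sum(share for choice, share in poll.items() if choice != "Undecided")
renormalized = {choice: round(100 * share / decided_total, 1)
                for choice, share in poll.items() if choice != "Undecided"}

print(renormalized)  # {'Candidate A': 52.3, 'Candidate B': 47.7}
```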
Finally, because many polls are conducted at the state level, simply forecasting based
on each state’s polls creates an assumption that every election is completely unconnected
to the election happening in other states. For Senate elections, this is not a completely un-
reasonable assumption, since candidates and issues vary across states. But for presidential
elections, it is safe to assume that the election in one state is closely related to the election in
the next state. Allowing the states’ polling results to correlate, particularly when there are
few polls in a state, will alleviate this problem (Jackman 2012b; Silver 2014a).
Beyond the polls, some forecasts tweaked the overall results to account for the pos-
sibility of unknown events. The principle here is that there is always a chance of an un-
known event shaking up the election right before it happens, when there is not time for
the polls to react, or that the polls could simply be wrong. FiveThirtyEight’s forecasts
typically include some random noise to lower certainty of outcomes, and the Huffington
Post’s 2012 and 2014 forecasts included this as well (Blumenthal and Jackson 2014;
Jackman 2012b; Silver 2014a).

Final Estimates
Most models use Monte Carlo simulations to estimate the final probabilities of a presi-
dential candidate winning an election across the various states, or as in the 2014 models,
to estimate the likelihood that a party will maintain or take over control of the Senate.
For example, a Monte Carlo simulation would pick a random number between 0 and
100 for each state, then compare that number to the probability of the Republican can-
didate winning in that state. If the number is lower than the probability of the candidate
winning in that state, it counts as a Republican win; if it is higher, it is a Democratic win.
If a Republican in a given state has a 35% chance of winning according to the model, a
random number from 0 to 35 would count as a Republican win, but a number from 36 to
100 would be a Democratic win.
The process is repeated for every state, counting the number of Republican-​won states
or Senate seats. In a presidential election forecast, winning a state is converted to the
number of electoral votes the winner would receive for that state; in a Senate forecast the
election in each state is counted as one seat. The process is repeated many times to sim-
ulate many different random elections—​often a million or more—​and the proportion
of times a presidential candidate gets more than 270 electoral votes or a party has 51 or
more seats in the Senate is the final probability for the outcome of the contests (Jackman
2012a; Jackson 2014).
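The sketch below runs this kind of Monte Carlo simulation for a handful of hypothetical states; the win probabilities, electoral vote counts, and winning threshold are invented, and unlike real forecast models it draws each state independently rather than allowing correlated errors across states.

```python
# Monte Carlo simulation of an Electoral College-style outcome from state-level win
# probabilities. States are drawn independently here; real models allow correlated errors.
import random

states = {
    # state: (probability the Republican wins, electoral votes) -- invented numbers
    "State A": (0.35, 29), "State B": (0.55, 18), "State C": (0.80, 38),
    "State D": (0.20, 20), "State E": (0.50, 15), "State F": (0.65, 16),
}
VOTES_TO_WIN = 70      # stand-in for the real 270 threshold in this toy map
SIMULATIONS = 100_000

republican_wins = 0
for _ in range(SIMULATIONS):
    electoral_votes = sum(votes for p_win, votes in states.values()
                          if random.random() < p_win)
    if electoral_votes >= VOTES_TO_WIN:
        republican_wins += 1

print(f"P(Republican reaches {VOTES_TO_WIN} electoral votes) = "
      f"{republican_wins / SIMULATIONS:.3f}")
```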

Assessing Forecast Models
Despite leaning heavily on polls for data input, forecast models typically do not focus
on estimating candidates’ vote shares as their primary output. Instead, they focus on the
probability of a candidate winning the contest. Many forecasts do not even report the
point estimates for candidates’ vote shares at all. The point of the forecast is not to accu-
rately represent the polling data—​a key difference from aggregations, where that is the
primary goal—​but to predict how likely someone, or a party, is to win. The probabilities
do provide some information about how a vote is likely to come out, though: If there are
two candidates, a probability close to .5 means a very close vote, and as the probabilities
approach 0 or 1, the vote is more likely to be one-​sided.
The Brier score, the most commonly reported metric for assessing the 2014 Senate models, does take the probabilities into account, but primarily focuses on whether the
forecast got the winner right (Bialik 2014; Katz 2014; Sides 2014b). If candidate A wins,
the scoring takes the forecast’s probability of candidate A winning, say .68, subtracts it
from 1—​the actual probability of the candidate winning now that the result is known—​
and squares the difference. So 1 minus .68 equals .32, and .32 squared equals .1024. If
candidate B had won, and the forecast said candidate A had a .68 probability of winning,
the calculation would be 1 minus the forecast’s probability of candidate B winning—​
so 1 minus .32—​squared, which equals .4624. To get a total score for an entire forecast,
the Brier score for each individual state-​level race is calculated, then all the scores are
added together. Higher numbers mean the forecast had more error, so a lower Brier
score means the forecast did better by that metric. The Brier scores for the 2014 Senate
forecasts were very close together, clustered between .02 and .045 (Sides 2014b), since
most had identified the same most likely outcomes.
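The arithmetic in the worked example above is easy to reproduce; the short sketch below restates it and then combines hypothetical race-level scores into an overall score for a forecast.

```python
# Reproducing the Brier score arithmetic from the worked example above.
def brier(prob_forecast_gave_the_winner):
    """Squared difference between the forecast probability for the eventual winner and 1."""
    return (1 - prob_forecast_gave_the_winner) ** 2

print(f"{brier(0.68):.4f}")   # candidate A won; forecast gave A a .68 chance -> 0.1024
print(f"{brier(0.32):.4f}")   # candidate B won; forecast gave B only .32     -> 0.4624

# Race-level scores are then combined across contests to score a whole forecast
# (lower is better). The probabilities below are hypothetical.
winner_probs = [0.90, 0.75, 0.68, 0.55]
print(f"{sum(brier(p) for p in winner_probs):.4f}")
```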

Challenges in Aggregation and Forecasting

Explaining models, “priors,” and Markov chains to a nontechnical audience is no small


feat, but from the beginning Silver and other forecasters have prided themselves on
transparency about their methods (Silver 2008b). Some of the forecasters active in 2014
posted their code and data in public online repositories so that others could replicate
the models. Most provided detailed descriptions of their methods so that those with the
requisite statistics background could understand how the forecasts worked. However,
for those without a statistics background, the details of a complex forecasting model are
mostly incomprehensible. The nuances of statistical modeling, and especially the uncer-
tainty associated with statistics, can easily get lost in the race to say “Obama will win in
2012” or “Republicans will take over the Senate in 2014.” How to effectively communi-
cate about the statistics and uncertainty of the models to a lay public has become a con-
siderable challenge in election forecasting. The first part of this section identifies a few
ways audiences can identify reliable aggregations and forecasts, but the bulk focuses on
how aggregators and forecasters can communicate with their audiences and tackle the
challenges inherent in this type of work.
The Audience’s Perspective


There are several things forecast readers can look for, even if they lack the statistical
training to know the technical aspects of what the forecast is showing. These are the basic
elements that any aggregator or forecaster should disclose to readers (Jackson 2015a).

Data Source
The source of the data going into poll aggregations and poll-​based forecasts should not
be a mystery. If sites or developers have their own database of polls, that should be pub-
licly available. If they are using someone else’s database, that should be discussed. If the
forecast includes “fundamentals,” the source of that information should be explicitly
discussed.

Data Collection
Data collection processes can be fairly boring to read about, even if they are simple.
However, it is important that some information is available about how the data were
collected so readers know what is included and what could be excluded. The primary
question for poll aggregation and forecasting is whether any polls were excluded; polls
might be excluded if there are questions about their reliability or credibility. Any time
the polls included in the forecast change, the estimates themselves are subject to change.

Describing the Statistics
Regardless of whether the audience is expected to understand the mechanics of a LOESS
line, Kalman filter model, or Monte Carlo simulations, in the interest of transparency, the
procedures should be explained. As mentioned previously, many forecasters promote trans-
parency in their methods and at least give detailed descriptions of their methods. A lack of
transparency does not necessarily mean the forecast or aggregation should be ignored, but
the best practices for any scientific field encourage transparency and the ability to replicate
findings. Everyone should have access to the information on how models were built.

The “Smell” Test
Results from poll aggregation and forecasting should pass a common-​sense test. Given
the polling numbers and information going into the model, does the outcome make
sense? The question is not whether the outcome aligns with the audience’s beliefs or
preferences, but rather whether it makes sense for a polling average to show a candidate
at 45% support when the last five polls estimated that candidate’s support at 43, 48, 46,
44, and 42%. If the outcome changes a lot from day to day, or the result doesn’t look any-
thing like what the polls say, readers should know to be cautious.

Discussion of Uncertainty
It is extremely rare that a single analysis would show a definitive conclusion without
any room for question. Virtually every time someone analyzes data or calculates any
type of statistics using data, there is some flaw or shortcoming, and there is always un-
certainty about any conclusion when statistics are involved. These things should be ac-
knowledged, particularly in the case of probabilistic forecasts. The challenge in getting
an audience to understand uncertainty in probability-​based forecasts is that people tend
to want certain outcomes, and a forecast that says there is an 80% chance Republicans
will take over the Senate is not certain. If, as would be expected 20% of the time, the
Republicans did not take over the Senate, the audience is likely to think the forecast was
incorrect or failed to predict the outcome. That is not necessarily true; what is true is that
the less likely event happened rather than the favored event.

The Analyst’s Perspective


Communicating Uncertainty
Since the analyst has the job of communicating accurately about uncertainty in polls
and forecasts, the discussion becomes considerably more complex from that person’s
perspective. Explaining the uncertainty of probability-​based forecasting to the general
public is a task that has flummoxed scientists, and particularly weather scientists, for
many years. Social scientists moving their work into a more public domain are seeing
the difficulties firsthand. It seems no matter how many times a political pollster, aggre-
gator, or forecaster reminds the public that polls have margins of error and forecasts are
based on uncertain probabilities, the media and the public want to read the numbers as
completely certain, and they then castigate the analysts if the outcome is different from
their expectations—​or even if some other pollster or analyst says something different.
Despite these misunderstandings, which may seem impossible to overcome,
aggregators and forecasters still have the responsibility to communicate as clearly and
effectively as possible about their estimates. The margin of error is probably the most
misunderstood concept that aggregators have to deal with, and the process of averaging
or smoothing multiple polls makes it even more complex. Forecasters mostly deal with
misunderstandings about probability and precisely what it means to say a given event
has a certain probability of occurring.
Margin of error is difficult because it is often used as the catch-​all for polling error,
assumed to stand for all possible error in polls, but it is actually only one specific type
of error. It is a measure of only the error produced by interviewing a random sample
rather than the entire population whose opinion one wants to know. Other types of
error—​if the entire population was not available to be sampled, if the measures were
not quite right, if there are systematic differences between the people who answered and
the people who did not answer the survey, or if there were mistakes in weighting the
data or identifying likely voters—​are completely unaccounted for by the margin of error.
If a poll does not use a random sample, as Internet-​based panel surveys do not, some
question whether the margin of error is a valid measure of uncertainty at all (Blumenthal
and Jackson 2015).
Despite the controversy, most pollsters provide a margin of error with their polls, but
these apply only to those specific polls. When aggregators begin putting polls together
in order to estimate a poll average, the margins of error for the individual polls become
largely meaningless, yet there is still uncertainty in the aggregated estimate. Pooling polls does not eliminate uncertainty, although in theory it should reduce it.
The aggregated estimates at RealClearPolitics do not report any measure of uncertainty,
because they use a simple average of the last few polls. The simplest way to discuss un-
certainty would be to calculate, report, and explain the standard deviation of each poll
from the average.
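As an illustration of that suggestion, the few lines below compute a simple average of the five poll results from the "smell test" example earlier in this section, along with the standard deviation of those polls around the average.

```python
# A simple poll average plus the spread of the individual polls around it.
from statistics import mean, pstdev

recent_polls = [43, 48, 46, 44, 42]   # a candidate's share in the last five polls

average = mean(recent_polls)
spread = pstdev(recent_polls)         # standard deviation of the polls around the average

print(f"average support: {average:.1f}% (individual polls vary by about ±{spread:.1f} points)")
```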
The Huffington Post Pollster charts that rely on LOESS techniques do not show
measures of certainty, but their Kalman filter model–​based averages do illustrate the
uncertainty of the estimate, as described in the technical section of this chapter. The
model-​based averages pool the polls in a way that incorporates sample sizes for each
poll and its respective margin of error, so polls with larger samples and more certainty
have more influence over the average. The average itself, then, has error from the in-
dividual polls and the simulation process. However, users looking at the charts would
not necessarily know that; it is not indicated anywhere, and the explanation of how the
model-​based averaging works is buried in the archives.
Forecasters, on the other hand, do not necessarily need their audience to understand
the intricacies of polls and margins of error, unless the forecast is completely poll-​
based, but they do need to effectively communicate what probabilities mean. All of the
major media forecasts for 2014 measured the outcome in terms of the probability that
the Republicans would take over the majority in the Senate. Some, like the New York
Times forecast, included qualitative terms with the numbers indicating how strong the
chances were of a Republican takeover. Many simply reported the probabilities in per-
centage format and left the audience to determine what a 65% chance of a Republican
takeover meant. There was a lot of Internet ink spilled explaining that a 60% chance
of winning is not substantially different from a 50% chance of winning (Gelman 2014;
Silver 2014b).
Most forecasters did explain the uncertainty of the forecasts, often in great detail.
However, these discussions of uncertainty were typically buried in long discussions of
the methods used to generate the estimates—​which most people will not read all the
way through—​and the message was easily lost. For example, the methods explana-
tion for FiveThirtyEight’s 2014 Senate model was around ten thousand words long and
required a commitment of about an hour to read (Silver 2014a). The Huffington Post's
2014 forecast model explanation was shorter, around twenty-​five hundred words, but
still required more time than most casual news consumers are likely to spend (Jackson
2014). The (probably large) portion of the audience who went directly to the forecast
pages, ignoring the methods explanations, saw numbers that declared how likely the
Republicans were to take over the Senate without any explanation of what an 80% like-
lihood actually means. Presenting the numbers with the appropriate explanation of
uncertainty, without requiring the audience to spend an hour reading model details, is
something public forecasters need to work on in the future.
There is a big opportunity to educate the public about statistics and probability; po-
litical aggregation and forecasting is a huge connection between the public and political
science that happens every two years, and in a bigger way every four years. The diffi-
cult part of the task is figuring out how to do that in a clear and concise way, and then getting audiences to read the explanations.

Single Polls vs. Aggregation


There is some tension between pollsters and aggregators. Aggregation can seem to
render individual polls moot by promoting an average instead of any single pollster's
estimates. Pollsters rely on their branding to attract business, and aggregation removes
that branding and replaces it with an average of several brands. The pollster is still
recognized as part of the average, but the average can seem to diminish the importance
of the individual poll. In addition, the claim that aggregation results in a more precise
estimate of where public opinion stands than individual polls can seem like an attack on
the accuracy of individual polls. Each of these criticisms deserves attention.
The obvious answer to the notion that aggregation renders individual polls moot is
that aggregation could not exist without individual polls. Aggregation is only possible
when there are multiple pollsters measuring the same question; without the pollsters,
aggregators have no job. Beyond the obvious, though, individual polls have an advan-
tage over aggregation in showing actual change in opinion over time. Aggregation will,
of course, show change in estimates over time, but the change could be due to which polls fell within the past-five-polls average, or which were most recent and therefore
weighted most heavily. When aggregations combine polls with sometimes very different
methods, it becomes difficult to say which changes in the estimates are due to actual
opinion change and which are due to polling method differences.
With individual polls, change in opinion over time is easier to detect. Two polls
produced by the same pollster with the same methods that differ only in when they
were conducted will offer a much clearer idea of how opinion has changed. If it is not
a panel—​that is, the samples in the two polls are different—​some of the difference in
numbers between the polls could be sampling error, but sampling error has known
estimates and can be accounted for with simple statistical testing. If opinion in the
second poll has changed from the first poll, and that change is statistically significant
after accounting for sampling error, there has probably been an actual opinion shift in
the population. The equivalent tests for aggregated estimates would be much more diffi-
cult, meaning that it is less clear that opinion has actually changed. Aggregation will pick
up opinion change patterns over the long term, but individual polls conducted by the
same pollster are much better for identifying opinion change as it is happening.
The second criticism, that claiming aggregation yields a more precise estimate of opinion than individual polls amounts to an attack on the accuracy of individual polls, is understandable but a bit misguided. Aggregators are not claiming that individual polls are inaccurate by com-
bining the polls; rather, they are leveraging large amounts of information to improve
statistical precision. It is statistical fact that a single poll with a fixed sample size of ap-
proximately one thousand respondents has a 95% confidence interval margin of error
around 3.4 percentage points (without including any design effect in the margin of error). The margin of error cannot be reduced in this poll once it is completed, meaning that in 95% of all possible samples of this size, the estimate will fall within 3.4 percentage points of the value that would be obtained by interviewing the entire population. In the other 5% of cases, the estimate would miss by more than that (Pew Research Center 2016).
The only way to improve precision is to increase the sample size. Aggregators are able
to effectively increase the sample size by combining several polls, thus decreasing the
margin of error. Other sources of error can be introduced in the process, as discussed in
the technical section, so the error of the aggregated estimates might not look exactly like
a margin of error for a sample size equal to all of the polls’ combined sample sizes, but
the statistical fact is that more information increases the precision of the estimates. This
is not an attack on pollsters’ accuracy or a comment on the methods they use to get indi-
vidual estimates. As just noted, aggregators are completely dependent on the pollsters to
produce data that can then be aggregated, but aggregators are trying to leverage that in-
formation to provide an easy-​to-​comprehend summary of opinion. In doing that, com-
bining polls does statistically increase the precision of estimates.
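The precision gain from pooling can be seen with the textbook margin-of-error formula. The sketch below compares one hypothetical poll with a pooled sample five times as large; it ignores design effects and the other sources of error discussed above, so it overstates how clean the gain is in practice.

```python
# Margin of error (95% confidence, worst-case p = 0.5) for one poll versus a pooled sample.
from math import sqrt

def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of the 95% confidence interval, in percentage points."""
    return 100 * z * sqrt(p * (1 - p) / n)

single_poll_n = 800
pooled_n = 5 * single_poll_n   # five polls of the same size combined

print(f"one poll of {single_poll_n}: ±{margin_of_error(single_poll_n):.1f} points")
print(f"pooled n of {pooled_n}: ±{margin_of_error(pooled_n):.1f} points")
```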

Polls as Forecast Tools: Expecting Too Much?


Pollsters are fond of noting that polls are “snapshots” of what opinion looks like at the
time the poll was conducted, particularly when electoral results do not quite match
what polls said the week (or weeks) prior to the election. They are correct to make that
assertion. When polls ask respondents for their opinions, what they get is the opinion
that comes to mind at the moment that person is answering the question. Lots of
different factors go into how a respondent will answer a question, but the single biggest
determinant seems to be what is at the top of a respondent’s mind when the pollster calls
(or emails, or knocks on the door) (Tourangeau, Rips, and Rasinski 2000; Zaller 1992).
We do know that polls are highly predictive of electoral outcomes when the polls are
conducted within a few weeks of the election, but the farther out from the election the
polls were conducted, the lower the correlation is between outcome and poll estimate
(Erikson and Wlezien 2012).
Polls inherently measure public opinion in the past, since they can only estimate
opinion at the time the questions are asked; by the time they are released, the data are
at least a few days old. Forecasts are attempting to do exactly the opposite: estimate vote
choices in the future. The measurement goals of polls and forecasts are fundamentally
at odds, and forecasts make it seem that polls should be predictive of the ultimate out-
come in order to have any value. It is critical to keep the goals of polling and forecasting
separate, even though polls are almost always a primary data source for forecasts. Polls
alone are not forecasts; expecting polls to always predict electoral outcomes is expecting
too much.
Polls are, however, appropriate data to use for forecasting. By the end of an election
cycle, poll estimates correlate very highly with election outcomes, according to Erikson and
Wlezien (2012). Most forecasts that begin more than a few months before the election do
not actually rely solely on the polls. These models start with the “fundamentals” of the
election—​economic factors and presidential approval ratings are the most commonly
used fundamentals—​and blend the polls in as another source of information. As
the election gets closer, forecasters assume that people answering the polls begin
paying more attention to the campaigns and political environment, so that the poll
numbers align with what the fundamentals would expect. That reduces the need to
include fundamentals, so the forecasts can rely more and more heavily on the polls
as the election gets closer. As discussed in the technical section, the New York Times,
Washington Post, and FiveThirtyEight models have used this type of setup, gradually
leaning more on polls for their forecasts (Cox and Katz 2014; Sides 2014c; Silver 2014a).
Models that rely only on polls, such as the 2014 models produced by HuffPost Pollster
and the Daily Kos, are probably more like poll aggregation if they are calculated more
than a few months before the election. However, in 2014 these models were released
later in the cycle (Daily Kos 2014; Jackson 2014). By the time these models debuted in
late summer, there were only minor differences between their estimates and the hybrid
model estimates that included fundamentals. Since these models were released later and
made adjustments to account for the uncertainty of relying solely on polls, the expec-
tation that the polls could produce accurate forecasts was appropriate. But more than a
few months prior to the election, polls-​only forecasts are probably demanding too much
from polls, which are only meant to measure opinion at the time they are conducted.

The Future of Poll Aggregation and Forecasting

Aggregation is likely to continue as long as there are plenty of polls to aggregate. There
is not much competition for audiences, since only two websites produce aggregated
estimates and charts of these estimates over time. The idea of doing a “poll of polls”
to provide one estimate of where opinion stands remains useful, unless the volume of
polling slows down drastically in the future.
Forecasts could have a shakier future. In 2014 the concept of aggregation was ex-
tended to election forecasts: Vox, a new Internet media source, did not produce its own
election forecast, but instead aggregated the other forecasts into one meta-​forecast (Vox
2014). At the point when forecasts are being aggregated, a logical question is whether
having so many forecasts is worthwhile, especially if the forecasts mostly say the same thing, as they did in 2014. Political scientists will likely continue to forecast elections for
journals and academic purposes, but media-​produced forecasts that need to appeal to a
broader audience could face problems.
Presumably there is a finite audience that these media forecasts can appeal to because
of their complexity, so too many forecasts would divide the audience and make it less
worthwhile for outlets to spend resources on them. Polls can continue to proliferate,
since there are different sources of sponsorship and different audiences—​campaigns,
parties, and a multitude of news sources will continue to need polls—​but it is less clear
that forecasts are necessary to those groups. It is possible that more forecasts would
flood the market and the “bubble” of election forecasting popularity could burst, partic-
ularly if forecasts are not as accurate as they were in 2008, 2012, and 2014.
While U.S. election polls and forecasts have performed fairly well since
2008, forecasts are only as good as the polls that go into them. Even forecasts that
use fundamentals end up relying very heavily on the polls, so if the polls are wrong,
forecasts will be wrong. Polls and forecasts performed very poorly in the 2015 United
Kingdom Parliamentary and Israeli Knesset elections. In Israel, the exit polls and
fourteen polls taken the week before the election pointed to a different outcome
than what emerged once the ballots were counted. Late shifts in opinion could have
accounted for the poll discrepancies, but there is no consensus on what caused
the exit poll problems (Nardelli 2015; Blumenthal, Edwards-​Levy, and Velencia
2015a). To the surprise of the polling community, much the same thing happened two
months later when polls and forecasts missed the landslide Conservative Party vic-
tory (Blumenthal, Edwards-​Levy, and Velencia 2015b). As the tallies were coming in
and it was clear the polls were wrong, Nate Silver wrote that “the world may have a
polling problem” (Silver 2015). If the world does have a polling problem, then it also has
a forecasting problem.
The futures of aggregation and forecasting are similar in one crucial way: they will
always depend on the availability of quality polling data. If poll data quality wanes, ag-
gregation cannot fix it, and forecasts will be wrong when the polls are wrong, no matter
how complex or cautious the forecasting model. Without quality data, we are not able to
measure opinion accurately enough to predict anything. The future of aggregation and
forecasting is completely dependent on the future of polling.

References
Bialik, C. 2014. “Some Do’s and Don’ts for Evaluating Senate Forecasts.” FiveThirtyEight,
November 4.  http://​fivethirtyeight.com/​datalab/​some-​dos-​and-​donts-​for-​evaluating-​
senate-​forecasts/​.
Blumenthal, M., A. Edwards-​Levy, and J. Velencia. 2015a. “Huffpollster: Where Israel’s Polls
Missed.” Huffington Post, March 18. http://​www.huffingtonpost.com/​2015/​03/​18/​israel-​
election-​polls_​n_​6893084.html.
Blumenthal, M., A. Edwards-​Levy, and J. Velencia. 2015b. “Huffpollster: Why the Polls Missed
the Mark on the UK Elections.” Huffington Post, May 13. http://​www.huffingtonpost.com/​
2015/​05/​13/​huffpollster_​1_​n_​7274030.html.
Blumenthal, M., and N. Jackson. 2014. “Huffpost Pollster Refines Senate Poll Tracking Model
before 2014 Elections.” Huffington Post, August 29. http://​www.huffingtonpost.com/​2014/​
08/​29/​senate-​polls-​2014_​n_​5731552.html.
Blumenthal, M., and N. Jackson. 2015. “The Margin of Error Is More Controversial Than You
Think.” Huffington Post, February 3.  http://​www.huffingtonpost.com/​2015/​02/​03/​margin-​
of-​error-​debate_​n_​6565788.html.
Clifford, S. 2008. “Finding Fame with a Prescient Call for Obama.” New York Times, November
9. http://​www.nytimes.com/​2008/​11/​10/​business/​media/​10silver.html?pagewanted=all.
Cox, A., and J. Katz. 2014. “Meet Leo, Our Senate Model.” New  York Times. http://​www.
nytimes.com/​newsgraphics/​2014/​senate-​model/​methodology.html.
Daily Kos. 2014. “Election Outlook:  How It Works.” http://​www.dailykos.com/​election-​
outlook/​how-​it-​works.
Erikson, R. S., and C. Wlezien. 2012. The Timeline of Presidential Elections. Chicago: University
of Chicago Press.
Gallup, G. 1951. “The Gallup Poll and the 1950 Election.” Public Opinion Quarterly 15 (1): 16–​22.
Gelman, A. 2014. “Republicans Have a 54 Percent Chance of Taking the Senate.” Washington
Post, January 29. http://​www.washingtonpost.com/​blogs/​monkey-​cage/​wp/​2014/​01/​29/​
republicans-​have-​a-​54-​percent-​chance-​of-​taking-​the-​senate/​.
Gibbons, J. D. 1993. Nonparametric Statistics: An Introduction. London: Sage Publications.
Gill, J. 2007. Bayesian Methods. 2nd ed. London: Chapman Hall/​CRC.
Hillygus, D. S. 2011. “The Evolution of Election Polling in the United States.” Public Opinion
Quarterly 75 (5): 962–​981.
HuffPost Pollster. 2014. “Poll Chart:  2014 National House Race.” http://​elections.
huffingtonpost.com/​pollster/​2014-​national-​house-​race.
HuffPost Pollster. 2015. “Poll Chart:  2016 National Republican Primary.” http://​elections.
huffingtonpost.com/​pollster/​2016-​national-​gop-​primary#!showpoints=no&estimate=cus
tom.
Jackman, S. 2005. “Pooling the Polls over an Election Campaign.” Australian Journal of Political
Science 40 (4): 499–​517.
Jackman, S. 2012a. “Converting a Poll Average to a Forecast.” Huffington Post, October 30.
http://​www.huffingtonpost.com/​simon-​jackman/​converting-​a-​poll-​average_​b_​2044222.
html.
Jackman, S. 2012b. “Model-​Based Poll Averaging:  How Do We Do It?” Huffington Post,
September 14. http://​www.huffingtonpost.com/​simon-​jackman/​modelbased-​poll-​
averaging_​b_​1883525.html.
Jackman, S. 2012c. “Pollster Predictive Performance, 51 out of 51.” Huffington Post, November
7.  http://​www.huffingtonpost.com/​simon-​jackman/​pollster-​predictive-​perfo_​b_​2087862.
html.
Jackson, N. 2014. “How Huffpost Forecasts Senate Elections: The Technical Details.” Huffington
Post, September 9.  http://​www.huffingtonpost.com/​2014/​09/​09/​2014-​senate-​elections_​n_​
5755074.html.
Jackson, N. 2015a. “6 Simple Questions Everyone Can (and Should) Ask about Data.”
Huffington Post, May 27. http://​www.huffingtonpost.com/​2015/​05/​27/​simple-​questions-​
about-​data_​n_​7453668.html.
Jackson, N. 2015b. “Don’t Care about the 2016 Election Yet? You’re Part of the 74 Percent.”
Huffington Post, May 13. http://​www.huffingtonpost.com/​2015/​05/​13/​2016-​election-​
attention_​n_​7277006.html.
Katz, J. 2014. “What the Forecasts Got Right, and Wrong.” New  York Times, November
5.  http://​www.nytimes.com/​2014/​11/​06/​upshot/​what-​the-​forecasts-​got-​right-​and-​wrong.
html?abt=0002&abg=0.
Lewis-Beck, M. S., and M. Stegmaier. 2014. “US Presidential Election Forecasting—Introduction.” PS: Political Science & Politics 47 (2): 284–288.
Linzer, D. 2012. “Votamatic: How It Works.” http://​votamatic.org/​how-​it-​works/​.
Martin, E. A., M. W. Traugott, and C. Kennedy. 2005. “A Review and Proposal for a New
Measure of Poll Accuracy.” Public Opinion Quarterly 69 (3): 342–​369.
Nardelli, A. 2015. “Israel Election: Why Were the Exit Polls Wrong?” The Guardian, March 18.
http://www.theguardian.com/world/datablog/2015/mar/18/israel-election-why-were-the-exit-polls-wrong.
Pew Research Center. 2016. “Why Probability Sampling.” http://www.people-press.org/methodology/sampling/why-probability-sampling/.
RealClearPolitics. 2015. “2016 Republican Presidential Nomination.” http://www.realclearpolitics.com/epolls/2016/president/us/2016_republican_presidential_nomination-3823.html#polls.
Rosenstiel, T. 2005. “Political Polling and the New Media Culture: A Case of More Being Less.”
Public Opinion Quarterly 69 (5): 698–715.
Sides, J. 2014a. “The 2014 Midterm Election Fundamentals (in 4 Graphs).” Washington Post,
November 3.  http://​www.washingtonpost.com/​blogs/​monkey-​cage/​wp/​2014/​11/​03/​the-​
2014-​midterm-​election-​fundamentals-​in-​4-​graphs/​.
Sides, J. 2014b. “Election Lab on Track to Forecast 35 of 36 Senate Races Correctly.”Washington
Post, November 5.  http://​www.washingtonpost.com/​blogs/​monkey-​cage/​wp/​2014/​11/​05/​
election-​lab-​on-​track-​to-​forecast-​35-​of-​36-​senate-​races-​correctly/​.
Sides, J. 2014c. “How Election Lab Works.” Washington Post, May 5.  http://​www.
washingtonpost.com/​news/​politics/​wp/​2014/​05/​05/​how-​election-​lab-​works/​.
Silver, N. 2008a. “We Know More Than We Think (Big Change 2).” FiveThirtyEight, June 15.
http://​fivethirtyeight.com/​features/​we-​know-​more-​than-​we-​think-​big-​change-​2/​.
Silver, N. 2008b. “Frequently Asked Questions.” FiveThirtyEight, August 7.  http://​
fivethirtyeight.com/​features/​frequently-​asked-​questions-​last-​revised/​.
Silver, N. 2014a. “How the FiveThirtyEight Senate Forecast Model Works.” FiveThirtyEight,
September 17. http://​fivethirtyeight.com/​features/​how-​the-​fivethirtyeight-​senate-​forecast-​
model-​works/​.
Silver, N. 2014b. “FiveThirtyEight Senate Forecast:  Toss-​up or Tilt GOP?” FiveThirtyEight,
June 8.  http://​fivethirtyeight.com/​features/​fivethirtyeight-​senate-​forecast-​toss-​up-​or-​tilt-​
gop/​.
Silver, N. 2015. “The World May Have a Polling Problem.” FiveThirtyEight, May 7.  http://​
fivethirtyeight.com/​liveblogs/​uk-​general-​election-​2015/​?#livepress-​update-​12918846.
Silver, N. 2016. “The State of The Polls.” FiveThirtyEight’s Pollster Ratings, http://​fivethirtyeight.
com/​interactives/​pollster-​ratings/​.
Tourangeau, R., L. J. Rips, and K. Rasinski. 2000. The Psychology of Survey Response.
New York: Cambridge University Press.
Vox. 2014. “The Battle for Congress.” November. http://​www.vox.com/​a/​election-​2014-​forecast.
Zaller, J. 1992. The Nature and Origins of Mass Opinion. New York: Cambridge University Press.
Index

Tables and figures are indicated by an italic t and f following the page/​paragraph number.

A. C. Nielsen Company, 103 sampling design, 58, 80, 94n8, 491,


Achen, C. H., 32 535, 549n4
Adams, A. N., 5, 43, 493, 501n18 satisficing in, 68–​69, 69t
Adcock, R., 343, 344 social desirability effects, 67
additive models, 340–​41 weighting in, 301
Adkihari, P., 6, 174n7 The American Panel Survey, 4t, 28
Afrobarometer, 4t, 7, 221, 222t, 225f, 245, 392t AmericasBarometer, 215
age-​weight curves graphs, 428–​29f, 429–​30, AmeriSpeaks, 43n1
434–​35, 435f, 442–​45, 445t, 446f analysis, presentation, 7–​8
aggregation. see data aggregation anchoring vignettes, 235, 588, 597, 603n13
agree-​disagree scales,  116–​19 Andrews, F., 121–​22
Ahern, K., 35 Android Data Gathering System
Aitchison, J., 289, 290 (ADGYS),  212–​17
Algeria, 222t, 223, 224, 225f, 240n3 ANES Time Series Study, 301
Al Zamal, F., 566 Ansolabehere, S., 5, 89, 343
American Association of Public Opinion Arab Barometer
Research (AAPOR), 1, 79, 278–​79, 543, applications of, 7, 221, 222t
589–​90,  603n6 data quality assessment, 224, 225f
American Community Survey (ACS), 36, 59 described, 4t, 392t, 393
American Muslims, 183, 192–​95, 200, 201nn1–​2 topics included in, 226, 227–​28t, 240n2
American National Election Study (ANES) website, 245
background, 389–​90, 404n1 Arab world surveys. see MENA surveys
casual inference, 300–​301 Armstrong, J. S., 595
contextual cues in, 65, 97–​98 AsiaBarometer, 392t, 393
costs of, 81, 84 Asian Americans, 183–​84, 189, 195–​97
described, 4t, 28, 30, 90 Asian Barometer, 4t, 392t, 393
design effect in, 88 aspect ratio, 472–​76, 474–​76f
face-​to-​face surveys, 58, 80, 81, 89–​91, Associated Press, 150
300–​301 Atkeson, L. R., 5, 43, 493
intention stability, 40 Australia, 402–​3, 404n12
mass polarization, 344
mode studies, 89 Bafumi, J., 329
nonresponse rates, 80–​81 Bahrain, 222t
panel attrition, 35–​38 Barabas, J., 499
reliability of, 32, 343 Barberá, P., 354, 563, 567, 568
response rates as standard, 574n1 bar charts, 454–​56, 455f, 457f
634   Index

Bartels, L. M., 35, 345 overview, 463


Battaglia, M. P., 38 plotting symbols, 449–​50, 465–​66, 467f,
Bayesian Item Response Theory models, 592 471–​72, 472f, 478n5
Bayesian vs. non-​Bayesian modeling, 618–​19 scatterplots (see scatterplots)
Bayes’s law, 293 subsequent presidents effects, 435–​36, 435f
Benstead, L. J., 7, 236, 241n4 tables vs., selection of, 445–​46
Berkman, M., 330–​31 variance (R2) plotting, 433, 434f
Berry, J. A., 6 visual perception importance, 446–​48
best practices Blavoukos, S., 589
bivariate graphs, 464–​65, 477, 478n3 Blaydes, L., 236
expert surveys, 589–​91, 603nn6–​7 Blumberg, S. J., 56
graphs, 437–​38, 448–​52, 449f, 451f, 477 Bode, L., 567
for qualitative research, 513, 516, 531n1 Bond, R., 354
question wording, 115–​16, 116t Bonica, A., 354
univariate graphs, 448–​52, 449f, 451f, 477 Bonneau, R., 568
bias Bormann, N., 396, 404n7
acquiescence, 40 Boroditsky, L., 255
bias-​variance trade-​off,  343–​44 Brace, P., 7, 327
CMV biases, 586–​87, 603n5 Bradburn, N. M., 115, 116t
correction in CSES, 398, 404n11 Brady, H., 198
in expert surveys, 586–​88, 603n5 Brier scores, 623
intergroup conflict and, in MENA British Election Study (BES), 4t, 390, 401, 409
surveys,  236–​37 Bruce, A., 164
item, in group consciousness, 380 Bryant, L. A., 6
margin of error, 625–​26 bubble plots, 450, 478n1
nonresponse (see nonresponse bias) Burden, B. C., 58
positivity, in expert surveys, 587 Burstein, P., 317, 318
respondent vulnerability and, in MENA Butler, D., 390
surveys,  235–​36 Buttice, M. K., 598
response biases reduction, 595–​99, 602
seam,  39–​41 Campbell, D. T., 15, 121–​22
selection, in social media surveys, 559–​60 Canadian Election Study, 390
social desirability (see social Canadian Survey of Labour and Income
desirability bias) Dynamics, 39
in subnational public opinion, 327 Candidate Emergence Study, 4t, 584, 587,
time-​in-​sample, 37–​39, 44nn7–​10, 591, 603n4
492, 501n18 Caughey, D., 347, 348
binomial probability mass function, 276 Center for Strategic Studies, 246
bivariate graphs central limit theorem, 80
aspect ratio, 472–​76, 474–​76f Chandler, J., 490
best practices, 464–​65, 477, 478n3 Chapel Hill Expert Surveys (CHES), 584, 590,
categories of, 464 597, 604n12
jittering, 466–​68, 469f Chen, M. K., 260
labeling points, 468–​7 1, 470f, 478n4 Chen, Q., 36
line plots, 471–​72, 472f, 478n5 Ching, P. L. Y. H., 38
maps, 464, 474–​75, 474f Chouhoud, Y., 6
multiple subset/​single display, 471–​72, Citizen Participation Study, 198
472f, 478n5 Ciudadanía surveys, 212, 218n4
Index   635

Cleveland, W. S., 447, 450, 471, 476 computer assisted telephone interviewing


Clinton, J. D., 342, 588, 601 (CATI), 484
CMV biases, 586–​87, 603n5 conditioned reporting, 38–​39
cognitive aspects of survey methodology confidentiality
(CASM),  16–​17 context surveys, 542–​43, 550nn16–​19
cognitive interviewing expert surveys, 591, 603, 603n7
in expert surveys, 596 informed consent, 522, 528–​29
in MENA surveys, 234–​35, 235t qualitative research, 529–​30
in qualitative research, 508–​9, 512–​17, 531n1 conflict-​induced displacement surveys,
total survey error and, 16–​17 164–​72, 169–​70t, 174nn7–​10, 175n11
cognitive psychological model, survey Conrad, F. G., 40
process, 513 construct validity, 343
Cohen, J. E., 321 content validity, 343
Collier, D., 343, 344 context in social research
Collins, C., 39 cognitive foundations of, 547–​48
Columbia studies, 389 community, networks, 538, 542,
Comparative National Elections Project 549n3, 550n26
(CNEP), 392t, 394 concepts, definitions, 534–​35, 548n1, 549n3
Comparative Study of Electoral Systems confidentiality, privacy, 542–​43,
age-​weight curves, 442–​45, 445t, 446f 550nn16–​19
background, 389–​91, 394–​96, 404nn4–​6 contiguity, 542, 551n33
bias correction, 398, 404n11 data collection, management, 540–​43, 541f,
case selection, 396–​98, 397t, 404nn7–​10 549n13, 550n22, 550nn16–​20, 551n27
democratic regimes defined, 396, 404n7 descriptors vs. mechanisms, 536–​38, 549n10
described, 4t,  388–​89 ethical issues, 542–​43, 550nn21–​23
development of, 394–​96, 404nn4–​6 expert surveys, 587–​88
face-​to-​face surveys, 401–​2,  401t functional assignments, 539, 549n12
fieldwork, 394, 401–​3, 401t, 404nn12–​13 hypothesis testing, 106
funding of, 403, 404n14 language/​opinion relationships,  256–​57
incentivization effects, 404n13 multilevel models, 551n31
mode, 401–​3, 401t, 404nn12–​13 multiple contexts, 548
modules, themes, 395t neighborhood effect, 99
multilevel data structure, 398–​99 opinion formation, 98–​100
nonprobability samples, 404 random intercepts modeling, 545–​46
nonresponse bias, 402, 404n11 relationships, 538
online option, 402–​3, 404n12 respondent characteristics, 535, 549n4
party system dimensionality, 400 risk-​utility trade-​off, 542–​43,  550n18
political knowledge distribution, 399–​400 samples, balanced spatial allocation of,
question wording, 399–​400 106–​10,  108f
response rates, 401–​3, 401t, 404nn12–​13 samples, proportional allocation of,
sampling error, 402, 404n11 103–​6,  105f
statistical inference in, 290–​92, 291t, sampling designs, 97–​98, 101–​2, 110n1, 543–​
297nn10–​12 45, 549n7, 550nn24–​26
telephone surveys, 401t, 402 sampling error randomness, 549n11
websites, list, 409 slope coefficients modeling, 546
computer assisted personal interviews snowball sampling, 542, 550n26
(CAPIs), 3, 54, 223. see also developing socialization,  98–​100
countries/​CAPI systems social media surveys, 570
context in social research (cont.) data aggregation


spatial distribution, 100–​103, 102f, 104–​5f, Bayesian vs. non-​Bayesian
110nn2–​3 modeling,  618–​19
statistical inference, 545–​47, 547f, expert surveys, 587, 588, 599–​600, 603n5
551nn27–​33 fundamentals vs. poll-​only,  619–​21
stratified sampling, 544–​45, 550n25 social media surveys, 563–​64, 567–​68
subpopulations, superpopulations, statistical inference, 614–​17, 615–​16f, 618f
538–​40,  549n12 subnational public opinion, 325–​28,
surroundings, properties of, 536–​38, 549n10 345–​46
surveys and, 535–​36, 549n4 data collection
unit dimensionality, 544, 550n24 context surveys, 540–​43, 541f, 549n13,
variability, 543–​45, 550nn24–​26 550n22, 550nn16–​20, 551n27
convergent validity, 343 exit polls, 145–​46
Converse, P., 390 Internet surveys, 90
Cook Political Report, 619 overview, 5, 6
Cooperative Congressional Election total survey error, 18–​19 (see also total
Study (CCES) survey error)
applications of, 28 data visualization. see graphs
described, 4t Debels, A., 36
expert raters, 594 Deng, Y., 36
question wording, 536 density sampling, 186–​87
TSE approach, 86–​89, 87t, 93, 94n6 designated market areas (DMAs), 103–​4,
Coppedge, M., 598, 601 104–​5f, 110n3
Cornell Anonymization Tool, 550n20 designs
costs of ANES, 88
of ANES, 81, 84 data collection (see data collection)
automated voice technology in, 610 of exit polls, 143–​47, 149
developing countries/​CAPI systems, 211–​14 expert surveys, 589–​91, 601–​2, 603nn6–​7
exit polls, 150 hard to reach populations, 174n9
face-​to-​face surveys, 84, 91 Internet surveys, 58
hard to reach populations, 160–​61 language/​opinion relationships, 252–​53,
Internet surveys, 78–​85, 83–​84t 258–​59,  262
low-​incidence populations, 189–​90, longitudinal (panel) surveys, 29–​30,
199–​200 41–​43,  44n11
mail surveys, 13, 22–​23 mixed design, 586f, 589
mixed mode surveys, 63, 64t, 70, 515 mixed mode surveys, 54–​59, 56f (see also
Couper, M. P., 42 mixed mode surveys)
cross-​national polling. see Comparative Study multiple rater, 586f, 588, 600–​601
of Electoral Systems Nepal Forced Migration Survey, 166–​67,
CSES. see Comparative Study of Electoral 174nn8–​9
Systems nested-​experts, 586f,  587–​88
Current Population Survey (CPS), 36, 39, 40 overview,  2–​3
question wording (see question wording)
Daily Kos, 611, 613, 622, 629 sampling (see sampling designs)
Dalrymple, K. E., 567 single-​rater, 585–​87,  586f
Danish National Election Study, 409 subnational public opinion, 319
Danziger, S., 257 target-​units mapping, 585–​89, 586f
developing countries/​CAPI systems Erikson, R. S., 326, 349


Android Data Gathering System estimation, inference. see hypothesis testing;
(ADGYS),  212–​17 statistical inference
benefits of, 211–​14 ethical issues
coding error, 208–​10 context surveys, 542–​43, 550nn21–​23
costs,  211–​14 MENA surveys, 238–​40, 239t
data sets, 209, 218nn3–​4 qualitative research, 528–​30
error in, 207–​8 social media surveys, 558–​59
fraud, 210 Eurobarometer, 4t, 391–​92, 392t
GPS coordinates, 215–​17, 237, 239t European Community Household Panel
interview time, 216 Survey, 39
overview, 7 European Election Studies, 4t, 392t
PAPI surveys, 208 European Social Survey, 4t, 119–​20, 119–​20f,
paradata, 212, 215–​16, 218n5 126, 127f, 392t, 393, 536
partial question time, 216 exit polls
photographs, 216–​17, 239t absentee voters, 147–​49
questionnaire application error, 208 coding systems, 145
sample error, 210–​11 costs, 150
survey administration, 215 data collection, 145–​46
video/​voice clips, 216 design of, 143–​47, 149
DIFdetect,  377–​78 error in, 146, 152
Dijkstra, W., 116t estimates,  147–​48
Dillman, D. A., 59, 116t in-​person early vote, 149–​50
Dorussen, H., 589 interviewers, 144–​45, 152
dot plots, 458–​59, 458f, 460f, 478n2 methodology,  143–​47
Druckman, J. N., 486 models,  147–​48
DuGoff, E. H., 302 multivariate estimates, 151
Dutch Parliamentary Election Studies, 4t, 409 online panels, 150
precinct-​level data,  150–​51
Early Childhood Longitudinal Study, National predictive value of, 142, 147–​48, 630
Center for Education Statistics, 35 public voter file-​based, 150–​51
Edelman, M., 142, 149 questionnaires, 145, 151
Edison Research, 148, 151, 153n3 response reliability, 142
Egypt, 222t, 225f, 229, 240n3 roles of, 142–​43
election forecasting. see poll aggregation, sampling,  143–​45
forecasting state polls, 144
election polling generally technology in, 151–​52
challenges in, 2, 13–​14 by telephone, 147–​50
cross-​national, development of, vote count comparison, 146–​47
391–​94,  392t experiments. see survey experiments
data sets, readily accessible, 3, 4t expert surveys
disclosures,  278–​79 advantages of, 584, 601
forecasting,  612–​13 anchoring vignettes, 588, 597, 603n13
misses, causes, effects of, 1–​2 applications of, 583–​84
Electoral Integrity Project, 583, 590, 603n1 bias in, 586–​88, 603n5
encoding specificity principle, 256 certainty measures, 598, 602–​3
Enns, P. K., 348, 454 CMV biases, 586–​87, 603n5
expert surveys (cont.) in-​depth individual interviews, 512–​13,


coding designs, 589 521–​24,  531n4
cognitive interviewing, 596 language/​opinion relationships, 259, 262
confidentiality, 591, 603, 603n7 MENA, 241n14
context in, 587–​88 mixed mode designs, 53, 59, 70
data aggregation, 587, 588, 599–​600, 603n5 open-​ended responses, 65
designs, 589–​91, 601–​2, 603nn6–​7 PAPI, errors in, 208
DW-​NOMINATE scores, 594, 598 satisficing,  68–​69
generalizability coefficient, 593, 603n10 social desirability bias, 67
hypothesis testing, 587 survey experiments, 496
inter-​rater agreement, 591–​92, 603n9 survey mode transitions, 79
item response theory models, 592, TSE approach to, 13, 79–​81
600–​601 factor analysis, 341
measurement error reduction, 599–​601 Fausey, C. M., 256
mixed design, 586f, 589 Findley, B., 8
multiple rater design, 586f, 588, 600–​601 Fink, A., 116t
nested-​experts design, 586f,  587–​88 Fiske, D., 121–​22
null variance, 592, 603n9 FiveThirtyEight.com, 1, 609–​13, 620, 621, 629
pooled measures, 593, 603n10 Fleiss equation, 284
positivity bias in, 587 focus groups, 510–​12, 521–​24, 531n4
reliability, validity, 587–​88, 590, 593–​95, Folz, D. H., 116t
598–​99,  603n11 forecasting. see poll aggregation, forecasting
response biases reduction, 595–​99, 602 Fowler, F. J., 116t
response rates, 590–​91 Frankel, L. L., 35
sampling designs, 589–​90, 593–​95, French National Election Study, 4t, 409
602, 603n11 Fricker, S. S., 40
single-​rater designs, 585–​87, 586f FTF. see face-​to-​face surveys
standards, best practices, 589–​91, 603nn6–​7
target-​unit point estimates, 592 Gaines, B. J., 484–​85, 487, 499
target-​units mapping design, 585–​89, 586f Gallup, G., 389
terminology, 585 Gallup American Muslim study, 194
timing, speed control, 584 Gallup Poll Daily, 194
uncertainty measures, 591–​93, 603nn8–​10 Gallup Presidential Approval series,
variance, 590, 592, 602, 603n10 434–​36,  435f
exponential random graph models, 546 Gallup World Poll, 4t, 392t, 393
Garcia, J. A., 252
Facebook, 556, 562, 575n4 Gayo-​Avello, D., 560, 561
face-​to-​face surveys Gelman, A., 8, 77, 301–​2, 329, 353–​54, 356n8,
ANES, 58, 80, 81, 89–​91, 300–​301 410, 411, 412, 417, 421, 424, 433–​36, 551n29
CAPI systems as quality control, 7 General Social Survey (GSS), 28, 97, 535, 574n1
costs, 84, 91 generational/​cohort trends, graphing, 412,
cross-​national polling,  401–​2 413f, 421–​23, 423f, 427–​29, 428–​29f, 436–​
CSES, 401–​2, 401t 37, 436f, 442, 443–​44t, 444f
in developing countries, 211 Gengler, J. J., 235
don’t know responses, 54 Genre, V., 599
hard to reach populations, 155–​56, 158 Genso Initiatives Web surveys, 212, 218n4
history of, 55, 79, 610 German Federal Election Studies, 4t
GfK Knowledge Networks, 28, 30, 38, 43n1, 58, period effects, 429, 429f
76, 77, 150, 371 pie charts, 452–​54
Ghitza, Y., 353–​54, 411, 412, 417, 421, plotting symbols, 449–​50, 465–​66, 467f,
424,  433–​36 471–​72, 472f, 478n5
Gibson, J., 326 poll design, construction, 412, 414–​15f
Gideon, L., 174n5 purpose of, 448
Gill, J., 7 raw data, 411–​16, 413–​18f
Gillum, R. M., 236 results, interpretation of, 423–​30, 426–​29f
Gimbel, K., 8–9 results, presentation of, 433–37, 435–436f
Gimpel, J. G., 5 sampling weights, 412–​16, 416–​18f
Global Barometer program, 393 univariate (see univariate graphs)
Golder, M., 396, 404n7 Green, K. C., 595
González-​Bailón, S., 563 group consciousness
GPS coordinates, 3, 215–​17, 237, 239t additive measures, 380
graphical perception theory, 447, 450, 476 attachment, 368, 369t
graphs classical test theory, 369–​70
advantages of, 440–​46, 441f, 441t, 443–​45t, data set, 371–​72, 372t
444f, 446f described,  364–​65
age-​weight curves, 428–​29f, 429–​30, 434–​35, differential item functioning (DIF), 364,
435f, 442–​45, 445t, 446f 370, 377–​78, 378t
bar charts, 454–​56, 455f, 457f evaluation, 366, 367t, 374–​75, 375t
best practices, 437–​38, 448–​52, 449f, identity importance, 367–​68, 368t,
451f, 477 374–​75,  375t
bivariate (see bivariate graphs) independent variable approaches, 380
bubble plots, 450, 478n1 item bias, 380
complicated displays, 449–​50, 449–​51f item response theory, 370–​7 1, 380–​81
dot plots, 458–​59, 458f, 460f, 478n2 Kaiser criterion, 372
election turnout, voting patterns, 424–​28, measurement of, 363–​64, 369–​7 1
426–​27f measurement precision, 375, 376f
full scale rectangle, showing, 450–​52, 453f methodology, 372–​78, 373–​75t, 376f, 377–​78t
generational/​cohort trends, 412, 413f, 421–​ model fit assessment, 375–​77, 377t
23, 423f, 427–​29, 428–​29f, 436–​37, 436f, Mokken scale analysis, 372–​73, 382n2
442, 443–​44t, 444f monotonicity, 373, 374t
histograms, 442, 444f, 446, 452, 455, recoded variables, 373, 374t
460–​63,  462f self-​categorization, 365–​66, 365t, 375, 375t
income effects, 419–​22f,  420–​21 summary statistics, 378–​79, 379t
information processing, 442, 443–​44t, 444f, 2PL model, 373–​74
446–​48,  459 unidimensionality, 373, 373t, 382n3
jittering, 466–​68, 469f validity,  379–​80
labeling points, 468–​7 1, 470f, 478n4 Groves, R., 13, 15, 188
line plots, 471–​72, 472f, 478n5
model building, 417–​23, 419–​23f H. M. Wood, 22–​23
model checking, 430–​33, 431–​34f Haberman, S. J., 377
multipanel, 464–​465, 471–​472, 472f, Hanretty, C., 351
478n3, 478n5 hard to reach populations. see also
outliers, 478n4 low-​incidence populations
overview, 8, 410–​11, 439–​40 categories of, 156–​57
hard to reach populations (cont.) expert surveys, 587


contacting,  155–​56 Internet surveys, 80, 82–​85, 83–​84t, 88, 91
contextual factors, 174n10 low-​incidence populations, 190–​92, 199
costs,  160–​61
design, 174n9 incentives
disproportionate sampling, 160, 174n3 CSES, 404n13
forced migrants, 162, 164–​72, 169–​70t, hard to reach populations, 163
174n4, 174nn7–​10, 175n11 mail surveys, 163
full roster approach, 158 in MENA surveys, 231t
identification of, 158–​62, 170–​7 1, 174n4 in qualitative research, 520–​21
incentives, 163 response rates and, 19–​20
insurgency conflict study, 174n8 in survey experiments, 488, 491, 497, 500n10
internally displaced people, 6, 162, 174n4 in-​depth individual interviews, 512–​13,
interviewers, training, 172–​73 521–​24,  531n4
interviewing, 163–​64, 174n6 India, 404n9
locating, 161–​62, 174n4 Informal Sector Service Center
nonresponse, 162–​63, 174n5 (INSEC),  165–​66
persuasion of, 162–​63, 174n5 informed consent, 522, 528–​29
research, approach to, 172–​73 insurgency conflict study, 174n8
respondent-​driven sampling, 159 internally displaced people (IDP), surveying.
respondent identification/​recruitment, 519 see hard to reach populations
response rates, 163, 170–​7 1, 175nn13–​14 International Social Survey Programme,
sampling, 155–​61, 167–​72, 169–​70t, 174n3, 392t, 393
174n10, 175nn11–​14 Internet surveys
scoring, 164, 174n6 advantages of, 76–​77, 90–​91
screening methods, 158, 167–​68 costs, 78–​85,  83–​84t
snowball (chain referral) sampling, 159 coverage issues in, 20, 57–​58
He, R., 565 criticisms of, 77–​78
Hecht, B., 564 data collection, 90
Heckathorn, D. D., 159 designs, 58
Hensler, C., 545 hard to reach populations, 155–​56
Hersh, E. D., 356n8 hypothesis testing, 80, 82–​85, 83–​84t, 88, 91
hierarchical linear regressions, 545–​47, 547f language/​opinion relationships, 259, 262
high-​effort cases,  191–​92 MENA, 241n14
Hillygus, D. S., 5, 32, 35, 43, 492–​93 mixed mode, 84t, 85, 93
HIPAA, 550n17 modality, qualitative differences in, 89–​91
histograms, 442, 444f, 446, 452, 455, mode selection, 91–​94
460–​63,  462f mode studies, 88–​89
Homola, J., 7 nonresponse rates, 80–​81
Hong, Y., 259 online panels, 491–​93
Horn, J. L., 253 open-​ended responses, 65
Huckfeldt, R., 106 panels, 77
Huffington Post, 1, 611, 613, 615, 616f, 618f, 619, presentation effects, 66–​67
621, 622, 626 quality, 78–​85, 83–​84t,  92–​94
hypothesis testing quantifying quality of, 85–​89, 87t
ANES, 300–​301 representativeness effects, 60–​61, 61t, 71n5
context surveys, 106 response rates, 58–​59, 62–​63, 62t, 77, 90–​91
sampling error, 20–​22, 81 Klašnja, M., 9


sampling methods, 77, 79–​81 Knight Foundation, 150
satisficing,  68–​69 Koch, J., 348, 454
statistical inference, 279, 297n5 Kosslyn, S. M., 448
survey mode effects, 22–​23, 70 Krosnick, J., 118
survey mode transitions, 79 Krupnikov, Y., 8
total survey error, 17, 78, 86–​89, 87t, 94n8 Kselman, D. M., 590
TSE approach to, 13 Kuklinski, J. H., 484–​85, 487, 499
weighting (modeling), 77, 81 Kuwait, 222t, 225f, 240n3
interviewer-​administered questionnaires
(IAQs), 54, 65–​68 labeling points, 468–​7 1, 470f, 478n4
inverse probability problem, 293 Laennec, R. T. H., 128
Iraq, 222t, 224, 225f, 240n3 Landry, P., 241n4
Israel, 630 language barriers, 188–​90, 194
item response theory. see also latent constructs language/​opinion relationships
in expert surveys, 592, 600–​601 bilingualism, 189, 196–​97, 250,
group consciousness measurement (see 256–​57,  260–​64
group consciousness) cognitive effects, 255–​57, 266
hierarchical group model, 346–​47, 356n5 cognitive sophistication, 263
latent constructs, modeling, 8, 341–​42, 356n3 culture influences, 256–​59, 262
diglossia, 232
Jackman, S., 32, 300–​301, 342, 343, 600 effect sizes, 265
Jackson, N., 43, 492–​93 framing effects, 261–​62
Jacobs, L. R., 320 future-​time reference,  260–​61
Jacoby, W. G., 8, 448 gendered vs. non-​gendered tongues, 255,
Jerit, J., 499 260, 262
Jessee, S. A., 350–​51 generational status, 263
Jeydel, A. S., 324 grammatical nuances in, 255, 260
jittering, 466–​68, 469f interviewer effects, 262, 267n5
Johnston, R., 101–​2 linguistic determinism, 254
Jordan, 222t, 223, 224, 225f, 240n3, 246 measurement equivalence, 253, 266nn2–​4
Jost, J. T., 568 memory effects, 251, 256
Jungherr, A., 560 MENA surveys, 231t, 231–​32, 241n9
Junn, J., 6 monolingualism, 264
Jürgens, P., 560 multilingual polls, 251
online polls, 262
Kacker, M., 600 overview, 7, 249–​51
Kalman filter model, 617, 618f, 626 regression models, 265–​66
Kalt, J. P., 322 research design, 252–​53, 258–​59, 262
Karp, J. A., 8, 43, 493 research studies, 253–​54
Kastellec, J. P., 439 survey response effects, 259–​64
Katosh, J. P., 38 thinking for speaking, 255, 259–​60, 263
Kaushanskaya, M., 256 thought, automatic influence on, 257
keeping in touch exercises (KITEs), 42 validation of, 258–​59, 264
King, G., 603n13 LAPOP surveys, 212, 218n4
Kitschelt, H., 590 latent constructs. see also item response theory
Klar, S., 497 additive models, 340–​41
latent constructs (cont.) Le Brocque, R., 35


bias-​variance trade-​off,  343–​44 Lee, T., 252, 254
computational challenges, 353, 356nn6–​7 Lenski, J., 151
consumer confidence, 340 Lenz, G., 318
data disaggregation, 325–​28, 345–​46 Lenz, H., 589
data sets, 355 LeoGrande, W., 324
dimensionality assessment, 352–​53 Leoni, E. L., 439
dyadic representation, 349, 351–​52 Lepkowski, J. M., 42
emIRT, 346, 356n3 leverage-​saliency theory, 188
factor analysis, 341 Levine, A. S., 486–​87, 497
group level applications, 348–​49 Levinson, S., 255
group level measurements, 345–​47, 356 Lewis, D. E., 588, 601
income/​opinion relationships, Lewis, J. B., 350
353–​54,  356n8 LGBT surveys. see group consciousness
individual level applications, 344–​45 Libya, 222t, 223, 224, 225f, 229, 232
individual level measurements, 340–​44, 355 Likert, R., 118
IRT modeling, 8, 341–​42, 356n3 Lilien, G. L., 600
Markov chain Monte Carlo algorithms, Lin, Y.-​R., 565
353, 356n6 Lindeberg-​Feller central limit theorem, 290
mixed measurement responses, 342 line plots, 471–​72, 472f, 478n5
multilevel regression/​post-​stratification, Link, M. W., 89
328–​32, 346,  566–​67 list sampling, 185–​86
non-​survey-​based data,  354 Liu, W., 566
no-​U-​turn sampler, 353, 356n6 living with the silence, 524
overview,  338–​39 Local Governance Performance Index
polarization, 344 (LGPI), 241n4
policy liberalism/​mood, 339, 348, 356n2 LOESS lines, 615–​16, 626
political knowledge, 339–​41, 344–​45 log-​ratio transformation, 289–​92, 291t,
racial prejudice, resentment, 340, 349 297nn9–​14
spatial modeling, 356n5 longitudinal (panel) surveys
spatial voting, 350–​51 acquiescence bias, 40
subnational opinion measurement, 353–​ advantages of, 31–​32, 44n3
54, 356n8 (see also subnational public background, 29–​31, 43nn1–​2
opinion) challenges in, 33, 41, 44n4
uncertainty, 356n7 continuity, innovation in, 33–​34
validity/​reliability modeling,  342–​44 cross-​sectional design,  29–​30
variation measurement, 348 designs, 29–​30, 41–​43, 44n11
Latin American Public Opinion Project, measurement error, 37–​42
4t, 536 modeling approaches, 33
Latino Barometer, 392t, 393, 536, 548 online survey panels, 30
Latino National Political Survey, 183 panel attrition in, 34–​37, 43n2, 44nn5–​6
Latino National Survey, 182, 252 panel conditioning, 37–​39, 42, 44nn7–​10
Lauderdale, B. E., 351 question wording in, 33–​34, 42
Lavine, H., 498 retrospective design, 30
Lax, J. R., 330, 345 sampling designs, 30–​31, 42–​43
Lazarfeld, P., 389 seam bias, 39–​41
Lebanon, 222t, 225f, 232 weighting, 33, 36, 41, 44n4, 44n6
low-​incidence populations. see also hard to response rates, 62–​65, 62t, 64t, 402
reach populations sampling designs, 21
American Jews, 193 social desirability bias, 22, 68
American Muslims, 183, 192–​95, 200, survey mode transitions, 79
201nn1–​2 TSE approach to, 13, 79–​81
Asian Americans, 183–​84, 189, 195–​97 validation of, 85–​91
background,  182–​83 Makela, S., 8, 411, 412
cooperation, gaining, 188–​90 Malawi, 223
costs, 189–​90, 199–​200 Malik, M. M., 562
estimation, inference, 190–​92, 199 Malouche, D., 237, 241n4
language barriers, 188–​90, 194 Mann, C. B., 38
measurement error, 189, 199 maps, 464, 474–​75, 474f
Mormons, 193 margin of error, 625–​26
nonresponse bias, 188–​90, 199 Marian, V., 256
political activists, 183, 197–​98 Markov chain Monte Carlo (MCMC)
question wording, 194, 200 method, 617
religious affiliation, 183, 193 Markov chains, 617
sampling, 183–​87,  198–​99 Markus, G., 394
survey methods, 188–​90 Marquis, K. H., 39
Luke, J. V., 56 Martinez i Coma, F., 587
Lupia, A., 484 matching algorithms and weights
Lust, E., 241n4 graphs, 412–​16, 416–​18f
Lynn, P., 39 longitudinal (panel) surveys, 33, 36, 41,
44n4, 44n6
MacKuen, M., 90 nearest-neighbor propensity score
Maestas, C. D., 598 matching,  302–​4
mail surveys propensity scores, 191, 302, 304–​5
advantages, limitations of, 18 sampling in, 20–​22
complex designs, 314n4 subclassifications matching, 302, 304
cost-​error tradeoffs, 13, 22–​23 MCAPI. see developing countries/​CAPI
cross-​national, 401–​3,  401t systems
donation solicitations, 486–​87 McArdle, J. J., 253
don’t know responses, 54 McIver, J. P., 326
exit polls vs., 6, 148 MCMCpack, 342
hard-​to-​count measure,  164 Mechanical Turk, 79, 90, 91, 490, 492,
hard to reach populations, 155–​56, 160, 163, 500n12, 500n17
164, 166, 175n14 MENA surveys
history of, 55, 79 anchoring vignettes, 235
incentives, 163 behavior coding, 233–​34
interviewer gender effects, 241n15 cognitive interviewing, 234–​35, 235t
low-​incidence populations, 199 data quality assessment, 224, 225f, 240n3
mixed mode, 53–​55, 58–​63, 61–​62t, 70 data sets, 220–​23, 221f, 222t, 223f,
nonresponse, 54 240n3, 241n4, 245–​46
open-​ended responses, 65 democracy, support for, 224, 225f,
panel designs, 30, 42 226, 240n3
presentation effects, 66 environmental challenges, 231t,  231–​32
representativeness effects, 60–​61, 61t, 71n5 ethical issues, 239–​40, 239t
MENA surveys (cont.) nonresponse error, 58–​59


gender effects, 229, 235–​36, 241n15 open-​ended responses, 65
household selection, 237–​38 presentation effects, 66–​67
incentives in, 231t representativeness effects, 60–​61, 61t, 71n5
intergroup conflict, bias and, 235–​36 response rates, 58–​59, 62–​63, 62t, 64t, 70
interviewer effects, 235–​36, 246–​48 sampling designs, 58–​59
language barriers, 231t, 232, 241n9 satisficing, 68–​69, 69t
latent constructs variation social desirability effects, 65, 67–​69, 69t
measurement, 348 straight lining, 69
measurement error, 233–​36, 235t, 241n14 survey mode effects, 22–​23
mode impacts, 237, 241n14 validation testing, 59–​63, 61–​62t, 71nn3–​5
nonresponse, 233, 238, 241n15 modus tollens,  292–​93
parliamentary election 2014, 241n7 Mokdad, A. H., 89
public service provision, 241n4 Monte Carlo simulations, 622
questionnaires, 231t, 232 Moore, J. C., 39
question wording, 226–​29, 227–​28t, Morocco, 222t, 223, 224, 225f, 236, 240n3
240n2, 241n10 Morstatter, F., 563
Q x Qs, 232–​33, 241n10 Morton, R. B., 487, 492
refusal, 238 MTurk, 79, 90, 91, 490, 492, 500n12, 500n17
religious dress effects, 236–​37 Multi-​Investigator Study, 484
representation error, 237–​38 multilevel regression/​post-​stratification, 328–​
research challenges, 229, 231t, 241n7 32, 346, 566–​67
respondent vulnerability, bias and, 235–​36 multipanel graphs, 464–​465, 471–​472, 472f,
response rates, 231t, 233, 241n11 478n3, 478n5
social networks, 231t, 233, 241n11 multiple rater design surveys, 586f, 588,
survey genre, 229, 231t 600–​601
total survey error, 233–​34, 234t Muslim American Public Opinion Survey
Messing, S., 354 (MAPOS), 195, 200
Michigan Survey Research Center. Mutz, D., 484, 489, 490, 499n4, 500n11
see American National Election
Study (ANES) Nagler, J., 568
Middle East Governance and Islam Nall, C., 356n8
Dataset, 245 National Asian American Survey, 182, 184, 196
Milgram experiment, 499n4 National Black Election Study, 182
Miller, W. E., 37, 317, 326, 390 National Election Pool, 148, 153n3
Miller-​Stokes problem, 326 National Election Studies, 325, 409
Mitchell, J. S., 235 National Health Interview Survey
Mitofsky, W., 142, 149 (NHIS),  55–​56
mixed mode surveys National Household Education Survey, 574n1
combining modes, 63–​65 National Opinion Research Center, 4t
contextual cues in, 65 National Politics Study, 182
costs, 63, 64t, 70, 515 National Survey of Black Americans, 182
coverage issues in, 55–​58, 56–​57f nearest-​neighbor propensity score
described, 5, 53–​55 matching,  302–​4
designs, 54–​59, 56f Nepal Forced Migration Survey
expert surveys, 586f, 589 background,  164–​65
mode effects, 69–​70 challenges in, 172–​73
data set, 169–​70, 169–​70t Pennacchiotti, M., 566


design, implementation, 166–​67, 174nn8–​9 Perception of Electoral Integrity (PEI), 590,
female respondents, 171 591, 603n8
Maoist insurgency, 165–​66 Pereira, F. B., 345
response rates, 170–​7 1, 175nn13–​14 Pérez, E. O., 7, 252, 254, 260–​62
sampling frame, method, 167–​72, 169–​70t, Pew American Muslim study, 193–​94
174n10, 175nn11–​14 Pew Asian-​American Survey, 197
nested-​experts design surveys, 586f,  587–​88 Pew Global Attitudes Survey, 392t, 393
Newsome, J., 8–​9 Pew Global Research, 4t, 392t
New York Times, 620, 626, 629 Pew Research Center, 4t, 245
New Zealand, 402–​3, 404n12 Phillips, J. H., 330, 345
Nie, N., 198 photographs, 216–​17, 239t
non-​Bayesian modeling, Bayesian vs.,  618–​19 pie charts, 452–​54
nonresponse bias Pilot Asian American Political Survey, 182
ANES,  80–​81 plotting symbols, in graphs, 449–​50, 465–​66,
CSES, 402, 404n11 467f, 471–​72, 472f, 478n5
hard to reach populations, 162–​63, 174n5 Plutzer, E., 330–​31
Internet surveys, 80–​81 political activists, 183, 197–​98
low-​incidence populations, 188–​90, 199 poll aggregation, forecasting
mail surveys, 54 aggregation statistics, 614–​17, 615–​16f, 618f
mixed mode surveys, 58–​59 challenges in, 623–​29
Twitter, 556 data sources, collection, 624
null hypothesis significance test, 292–​94 forecasting statistics, 617–​23
overview,  609–​10
Oberski, D., 5 pollster quality in, 621
O’Brien, R. M., 603n10 predictive value of, 628–​29
O’Connor, B., 561 single polls vs.,  627–​28
Ogunnaike, O., 257 state level polls in, 622
Ohio registered voter study, 100–​103, 102f, statistical inference, 614–​24, 628–​29
104–​5f, 106–​10, 108f, 110nn2–​3 technology developments, 610–​13
Olson, K., 22–​23, 36 uncertainty in, 624–​27
Oman, 222t undecided respondents in, 621–​22
online surveys. see Internet surveys Pollster, 611, 615–​17, 616f, 618f, 626
Popescu, A.-​M., 566
Page, B. I., 318 population average treatment effects
Palestine, 222t, 225f, 240n3 complex data, causal inference with,
Palestinian Center for Policy and Survey 300–​303
Research, 246 methodology,  303–​5
Pan, J., 344 overview, 299–​300,  312–​13
Panel Study on Income Dynamics, 28, 39, 42, 81 post-​stratification weights, 301
panel surveys. see longitudinal (panel) surveys simulation study, 305–​9, 307t, 308f,
Paolacci, G., 490 313nn1–​3
paradata, 212, 215–​16, 218n5 social media/​political participation study,
Park, D. K., 329 309–​12, 310f, 312t, 314n4
PATEs. see population average treatment effects weighting for differential selection
PATT estimation, 303–​5 probabilities, 301
Peltzman, S., 322 weighting to adjust for unit nonresponse, 301
presidential election results, 323–​24 question wording


Proctor, K., 8 agree-​disagree scales,  116–​19
Program on Governance and Local best practices, 115–​16, 116t
Development (GLD), 221, 222t, 224, 225f, characteristics, coding, 122, 125, 125f
241n4, 246 cognitive processes and, 116–​20, 119–​20f
propensity scores, 191, 302, 304–​5 common method variance, 121
Public Opinion Quarterly,  294–​96 described,  5–​6
design choices, 116–​20, 119–​20f
Qatar, 222t, 235, 246 in longitudinal (panel) surveys, 33–​34, 42
qualitative research low-​incidence populations, 194, 200
benefits of, 505 meta-​analysis,  120–​21
cognitive interviewing (see cognitive multi trait-​multi method approach, 118,
interviewing) 121–​23, 126, 127f
concepts, definitions, 506–​7 predictive value, 123–​26, 123f, 126–​27f
concurrent, 509 qualitative research, 514–​16, 521
confidentiality,  529–​30 quality estimation, 121–​22
data management, organization, 525–​26 quasi-​simplex model, 118
ethical issues, 528–​30 reliability, 121
file naming, storage, 526 responses, unreliability in, 113–​14, 118–​19
findings, analysis/​reporting of, 525–​28 satisficing,  116–​17
focus groups, 510–​12, 521–​24, 531n4 scale correspondence, 127
group/​interview management,  523–​24 seam effect reduction via, 40
incentives in, 520–​21 smartphone monitoring, 129
in-​depth individual interviews, 512–​13, SQP project, 124–​30, 126–​27f,  134–​37
521–​24,  531n4 in survey experiments, 483–​84, 486–​87
informed consent, 522, 528–​29 survey mode effects, 22–​23
integration of, 507–​10 Quirk, P. J., 484–​85, 487, 499
limitations of, 507
observers,  524–​25 RAND American Life Panel, 28, 43n1
participants, respect for, 530 random digit dial phone surveys, 90
post-​administration,  509–​10 random sample surveys, 79–​80, 92, 100–​103,
probes, 514, 524 102f, 110n2
professional respondents, 519–​20 Rao, D., 566
project discovery, 507–​8 Rasinski, K., 17
protocol development, 521 Ratkiewicz, J., 568
question asking, 524 Ray, L., 590, 600
question wording, 514–​16, 521 Razo, A., 9
rapport,  522–​23 RealClearPolitics, 611, 614–​15, 615f, 626
reports, formal, 527–​28 referenda results, 324
research plans, 516–​17 regression trees, 123–​24, 123f
respondent identification/​ relational database management systems,
recruitment,  518–​20 541, 541f
screening criteria, 518 representative sampling, 92
standards, guidelines for, 513, 516, 531n1 Révilla, M., 117, 129
survey creation, refinement, 508–​9 Rips, L. J., 17, 40
training, 519, 522, 531n4 RIVA Training Institute, 522, 523, 531n4
usability testing, 514–​15 Rivero, G., 563
Rivers, D., 342 Schoen, H., 560


Robinson, J. G., 164 Schuler, M., 302
Rodden, J., 330, 343 sdcMicro, 550n20
Roper, E., 389 self-​administered questionnaires (SAQs),
Rothschild, D., 77, 565 54,  66–​68
Ruths, D., 566 Senate Election Studies, 535
Ryan, C., 251 Shapiro, R. Y., 320
Shone, B., 289–​90
Saiegh, S. M., 351 Si, Y., 8
Sala, E., 39 Silver, N., 609, 612
sampling designs simulations,  324–​25
address-​based, 20, 55 single-​rater design surveys, 585–​87, 586f
ANES, 58, 80, 94n8, 491, 535, 549n4 Sinharay, S., 377
clustering, 21 Sjoberg, L., 595
context surveys, 97–​98, 101–​2, 110n1, 543–​ Skoric, M., 560
45, 549n7, 550nn24–​26 Slobin, D., 255, 259
density sampling, 186–​87 Smit, J. H., 116t
described, 5 Smyth, J. D., 22–​23
expert surveys, 589–​90, 593–​95, 602, 603n11 Snell, S. A., 5
high-​effort cases,  191–​92 Sniderman, P., 484, 499n2
list sampling, 185–​86 Snyder, J. M, 343
longitudinal (panel) surveys, 30–​31, 42–​43 social desirability bias
mixed mode surveys, 58–​59 face-​to-​face surveys,  67
post-​stratification,  190–​91 mail surveys, 22, 68
primary sampling units (PSUs), 327 telephone surveys, 67
qualitative research, 518–​20 Twitter surveys, 556, 561, 569, 575n3, 575n12
simple random sampling, 543–​44, Social & Economic Survey Research
550nn24–​26 Institute, 246
stratified random sampling, 106–​10, 108f, social exchange theory, 163, 174n5
184–​85,  187 social media data. see Twitter
stratifying, 21 social media/​political participation study,
subnational public opinion, 326–​27 309–​12, 310f, 312t, 314n4
in survey experiments, 488–​91, 494–​95, South Bend Study, 106
498–​99, 500nn10–​11 Spahn, B. T., 300–​301
Saris, W. E., 5 Spatial Durbin model, 551n32
satisficing, 17, 68–​69, 116–​17 spatial voting, 350–​51
Saudi Arabia, 222t, 224, 225f, 240n3 Sprague, J., 106
scatterplots SQP2.0 project, 124–​30, 126–​27f,  134–​37
applications of, 464 standards, guidelines. see best practices
aspect ratio, 475–​76, 475–​76f Stanley, J., 15
axis labels in, 466, 467f statistical inference
data presentation in, 450–​52, 453f aggregation, 614–​17, 615–​16f, 618f
jittering, 468, 469f Bayesian vs. non-​Bayesian
point labels in, 468–​7 1 modeling,  618–​19
Schaffner, B. F., 5, 89 binomial outcomes, 275–​78,
Schlozman, K., 198 277–​78t,  286–​87
Schneider, S. K., 8 Brier scores, 623
statistical inference (cont.) elite preferences, 320, 328


case studies, 294–​96 geographic sorting, 320, 328
certainty, 286 ideology measures, 326
compositional data, 286 income/​opinion relationships,
context surveys, 545–​47, 547f, 551nn27–​33 353–​54,  356n8
data disaggregation, 325–​28, 345–​46 multilevel regression/​post-​stratification,
errors in, 279–​84, 297nn6–​7 328–​32, 346,  566–​67
forecasting,  617–​23 observations, number of, 318–​19
fundamentals vs. poll-​only, 619–​21,  628–​29 opinion-​policy linkage,  317–​21
hierarchical linear regressions, 545–​47, 547f overview, 7–​8, 316–​17,  331–​32
Internet surveys, 279, 297n5 quality/​effects relationships,  317–​18
item characteristic curves, 370–​7 1 reliability, 326
Kalman filter model, 617, 618f, 626 research designs, 319
LOESS lines, 615–​16, 626 research studies, 321
log-​ratio transformation, 289–​92, 291t, sampling,  326–​27
297nn9–​14 simulations,  324–​25
margin of error treatment, 284–​88, surrogates,  321–​24
285t, 297n8 Sudan, 222t, 225f
Markov chains, 617 Sumaktoyo, N. G., 348
multilevel regression/​post-​stratification, surrogate demographic variables, 322–​23
328–​32, 346,  566–​67 surrogates,  321–​24
multinomial outcomes, 275–​78, survey designs. see designs
277–​78t, 289 survey experiments
null hypothesis significance test, 292–​94 applications of, 484, 495
null variance, in expert surveys, 592, 603n9 background,  483–​84
poll aggregation, forecasting, 614–​24 behavioral vs. treatment outcomes,
pooled measures, 593, 603n10 485, 500n6
proportions,  288–​89 benefits of, 484–​88, 494–​95, 498
random sampling, 279, 297n5 concepts, definitions, 483, 487, 499n1
simulations,  324–​25 embedded,  496–​97
uncertainty,  278–​79 expressed preferences, 496
variation matrix, 290 field experiments, 486–​88, 500nn6–​9
Sterba, S. K., 549n11 incentives in, 488, 491, 497, 500n10
Stipak, B., 545 laboratory experiments, 485–​86, 491,
Stokes, D. E., 317, 326, 390 497, 500n5
Stone, W. J., 598, 600 measurement limitations, 495–​98
stratified random sampling, 106–​10, 108f, MTurk, 79, 90, 91, 490, 492, 500n12, 500n17
184–​85,  187 natural experiments, 485, 500n5
structural equation models (SEM), 121–​22 online panels, 491–​93
Stuart, E. A., 302 participant limitations, 488–​89, 500n10
subclassifications matching, 302, 304 professional subjects, 492–​93, 501n18
subnational public opinion question wording in, 483–​84, 486–​87
bias in, 327 random assignments vs., 485, 499nn4–​5
cross-​sectional measures of, 327 real-​world generalizability, 487,
data disaggregation, 325–​28, 345–​46 489–​90,  500n8
data sets, 318, 320–​21 representative sample recruitment,
dyadic representation model, 317 491–​93,  501n19
revealed preferences, 496–​98 item-​level nonresponse,  18–​19


sample diversity, 489–​91, 498–​99, 500n11 measurement,  16–​18
sampling designs in, 488–​91, 494–​95, 498–​ overview, 3–​5,  13–​14
99, 500nn10–​11 post survey error, 23
subject pools, 492 principles of, 14–​15, 16f, 33
time-​in-​sample bias, 492, 501n18 reliability assessment, 32
validity of, 487–​88, 490 respondent error, 16–​17
Survey of Income and Program Participation response modes, 17
(SIPP), 35, 39, 40 response process stages, 17
Survey of LGBT Americans, 364–​66, 365t, 371, sampling error, 20–​22
372t. see also group consciousness standardized interviewing, 18
Swedish National Election Studies, 4t, 409 survey mode effects, 14, 22–​23
Syria, 222t unit-​level nonresponse,  19–​20
validity, internal vs. external, 15
target-​units mapping design, 585–​89, 586f total survey quality, 14, 24
Tausanovitch, C., 330, 348, 349, 350 Tourangeau, R., 17, 158, 161–​63, 174n1
Tavits, M., 260, 262 Transitional Governance Project (TGP), 221,
telephone surveys 222t, 224, 225f, 241n4, 246
coverage issues in, 55–​57, 56–​57f Traugott, M. W., 38
CSES, 401t, 402 true population proportion calculation, 276
in developing countries, 211 TSE. see total survey error
hard to reach populations, 156–​58 TSQ. see total survey quality
history of, 79, 610 Tucker, J., 568
language/​opinion relationships, 259, 262 Tufte, E. R., 439
MENA, 241n14 Tukey, J. W., 448
mixed mode designs, 53 Tumasjan, A., 560
open-​ended responses, 65 Tunisia, 222t, 223, 224, 225f, 229, 230, 232, 237,
presentation effects, 66–​67 241n4,  246–​48
random digit dial phone surveys, 90 Twitter
social desirability bias, 67 benefits of, 555–​57, 575nn4–​5
survey mode transitions, 79 bots, spammers, 562, 567–​68, 575n7
TSE approach to, 13, 79–​81 challenges of, 557–​59, 571
validity of, 90 changes over time, 569
Tessler National Science Foundation, 224, 225f computational focus groups, 565
think-​aloud protocols, 16–​17, 235 contextual data, 570
thinking for speaking, 255, 259–​60, 263 data aggregation, 563–​64, 567–​68
time-​in-​sample bias, 37–​39, 44nn7–​10, data archives, 556, 575n4
492, 501n18 data sets, 559
Time Sharing Experiments for the Social ethical issues, 558–​59
Sciences (TESS), 484, 499n4 fake accounts, 562–​63, 567–​68, 575n7
total survey error ideology estimation, 566–​67
comparability error, 23–​24 keyword selection, 565
conversational/​flexible interviewing, 18 multilevel regression/​post-​stratification,
coverage error, 20 328–​32, 346,  566–​67
data collection, 18–​19 nonresponse bias, 556
Internet surveys, 17, 78, 86–​89, 87t, 94n8 panels, 569
interviewer error, 18 political activist opinions, 570
Twitter (cont.) Varieties of Democracy Project (V-​Dem), 4t,


polling, funding/​interest in, 556, 575n3 583, 589, 603n2
public opinion identification, Verba, S., 198
559–​61,  564–​65 verbal probing, 235, 235t
research agenda, 571–​74 video recording, 525
research collaborations, 570, 575n5 visual perception theory, 447, 450, 476
response rates, 556, 567, 574n1 Vivyan, N., 351
selection bias, 559–​60 vote share plotting. see graphs
sentiment analysis, 561, 565, 569, voting behaviors. see also American National
575n12 Election Study (ANES)
social desirability bias, 556, 561, 569, change, measurement of, 31–​32, 44n3
575n3, 575n12 intention stability, 40
subpopulation studies, 569 mixed mode surveys, validation testing,
topics, 560, 575n11 59–​63, 61–​62t, 71nn3–​5
tweet counting methods, 560–​61 panel conditioning effects, 37–​39, 44nn7–​10
user representativeness, 561–​63, spatial voting, 350–​51
565–​67,  575n9 vote share graphing (see graphs)
validation, 568 Vowles, J., 8
Vox, 629
U. S. Census, 4t, 70, 185, 187
UC-​Davis Congressional Election Study, Wang, W., 411, 412, 430
4t, 587 Ward, R., 257
uncertainty measures Warshaw, C., 8, 330, 347, 348, 349
expert surveys, 591–​93, 603nn8–​10 Washington Post, 620, 629
latent constructs, 356n7 weights. see matching algorithms and weights
in poll aggregation, forecasting, 624–​27 Weisberg, H. F., 3, 14, 15
statistical inference, 278–​79 Whorf, B., 254
unconditional positive regard, 523 Williams, K. C., 487, 492
United Arab Emirates, 222t Witt, L., 36
United Kingdom, 630 World Values Survey, 4t, 221, 224, 225f, 245,
univariate graphs 392, 392t
bar charts, 454–​56, 455f, 457f Wright, G. C., 326
best practices, 448–​52, 449f, 451f, 477
dot plots, 458–​59, 458f, 460f, 478n2 Xu, Y., 344
histograms, 442, 444f, 446, 452, 455,
460–​63,  462f Yemen, 222t, 225f, 240n3
information processing, 459 YouGov, 28, 30, 38, 76, 77, 88, 94n6, 492–​93
overview, 452, 478n2 Young, M., 43, 492–​93
pie charts, 452–​54 Youth-​Parent Socialization Panel study, 30
Unwin, A., 410
Zaller, J., 262
Vaccari, C., 567 Zanutto, E. L., 302
Van Bruggen, G. H., 598, 600 Zell, E. R., 38
Vandecasteele, L., 36 Zogby International, 195
Van Ham, C., 587 Zupan, M. A., 322
