Real
Econometrics
The Right Tools to Answer
Important Questions
Second Edition
Michael A. Bailey
© 2020, 2017 by Oxford University Press
Printing number: 9 8 7 6 5 4 3 2 1
APPENDICES
Math and Probability Background 538
A Summation 538
B Expectation 538
C Variance 539
D Covariance 540
E Correlation 541
F Probability Density Functions 541
G Normal Distributions 543
H Other Useful Distributions 549
I Sampling 551
Further Reading 554 · Key Terms 554 · Computing Corner 554
Bibliography 577
Glossary 587
Index 596
LIST OF FIGURES
1.1 Rule #1 2
1.2 Weight and Donuts in Springfield 4
1.3 Regression Line for Weight and Donuts in Springfield 5
1.4 Examples of Lines Generated by Core Statistical Model (for Review
Question) 7
1.5 Correlation 10
1.6 Possible Relationships between X, ε, and Y (for Discussion
Questions) 12
1.7 Two Scenarios for the Relationship between Flu Shots and Health 14
3.1 Relationship between Income Growth and Vote for the Incumbent
President’s Party, 1948–2016 46
3.2 Elections and Income Growth with Model Parameters Indicated 51
3.3 Fitted Values and Residuals for Observations in Table 3.1 52
3.4 Four Distributions 55
3.5 Distribution of β̂1 58
3.6 Two Distributions with Different Variances of β̂1 62
3.7 Four Scatterplots (for Review Questions) 64
3.8 Distributions of β̂1 for Different Sample Sizes 66
3.9 Plots with Different Goodness of Fit 73
3.10 Height and Wages 75
3.11 Scatterplot of Violent Crime and Percent Urban 77
3.12 Scatterplots of Crime against Percent Urban, Single Parent, and
Poverty with OLS Fitted Lines 79
4.1 Distribution of β̂1 under the Null Hypothesis for Presidential Election
Example 96
4.2 Distribution of β̂1 under the Null Hypothesis with Larger Standard
Error for Presidential Election Example 99
4.3 Three t Distributions 100
4.4 Critical Values for Large-Sample t Tests 102
4.5 Two Examples of p Values 107
4.6 Statistical Power for Three Values of β1 Given α = 0.01 and a
One-Sided Alternative Hypothesis 110
4.7 Power Curves for Two Values of se(β̂1 ) 112
4.8 Tradeoff between Type I and Type II Error 114
4.9 Meaning of Confidence Interval for Example of 0.41 ± 0.196 118
5.1 Monthly Retail Sales and Temperature in New Jersey from 1992 to
2013 128
5.2 Monthly Retail Sales and Temperature in New Jersey with December
Indicated 129
5.3 95 Percent Confidence Intervals for Coefficients in Adult Height,
Adolescent Height, and Wage Models 133
5.4 Economic Growth, Years of School, and Test Scores 142
6.1 Goal Differentials for Home and Away Games for Manchester City
and Manchester United 180
6.2 Bivariate OLS with a Dummy Independent Variable 182
6.3 Scatterplot of Trump Feeling Thermometers and Party Identification 185
6.4 Three Difference of Means Tests (for Review Questions) 186
6.5 Scatterplot of Height and Gender 188
6.6 Another Scatterplot of Height and Gender 189
6.7 Fitted Values for Model with Dummy Variable and Control Variable:
Manchester City Example 192
6.8 Relation between Omitted Variable (Year) and Other Variables 199
6.9 95 Percent Confidence Intervals for Universal Male Suffrage Variable
in Table 6.8 202
6.10 Interaction Model of Salaries for Men and Women 204
6.11 Various Fitted Lines from Dummy Interaction Models (for Review
Questions) 206
12.1 Scatterplot of Law School Admissions Data and LPM Fitted Line 412
12.2 Misspecification Problem in an LPM 413
12.3 Scatterplot of Law School Admissions Data and LPM- and
Probit-Fitted Lines 415
12.4 Symmetry of Normal Distribution 419
12.5 PDFs and CDFs 420
12.6 Examples of Data and Fitted Lines Estimated by Probit 424
12.7 Varying Effect of X in Probit Model 427
12.8 Fitted Lines from LPM, Probit, and Logit Models 435
12.9 Fitted Lines from LPM and Probit Models for Civil War Data
(Holding Ethnic and Religious Variables at Their Means) 442
12.10 Figure Included for Some Respondents in Global Warming Survey
Experiment 452
LIST OF TABLES
5.1 Bivariate and Multivariate Results for Retail Sales Data 130
5.2 Bivariate and Multiple Multivariate Results for Height and Wages Data 132
9.1 Levitt (2002) Results on Effect of Police Officers on Violent Crime 297
9.2 Influence of Distance on NICU Utilization (First-Stage Results) 306
9.3 Influence of NICU Utilization on Baby Mortality 307
9.4 Regression Results for Models Relating to Drinking and Grades 308
9.5 Price and Quantity Supplied Equations for U.S. Chicken Market 321
9.6 Price and Quantity Demanded Equations for U.S. Chicken Market 322
9.7 Variables for Rainfall and Economic Growth Data 327
9.8 Variables for News Program Data 328
9.9 Variables for Fish Market Data 329
9.10 Variables for Education and Crime Data 331
9.11 Variables for Income and Democracy Data 332
13.1 Using OLS and Lagged Residual Model to Detect Autocorrelation 466
13.2 Example of ρ-Transformed Data (for ρ̂ = 0.5) 470
13.3 Global Temperature Model Estimated by Using OLS, Newey-West,
and ρ-Transformation Models 473
13.4 Dickey-Fuller Tests for Stationarity 484
13.5 Change in Temperature as a Function of Change in Carbon Dioxide
and Other Factors 485
13.6 Variables for James Bond Movie Data 492
USEFUL COMMANDS FOR STATA

USEFUL COMMANDS FOR R
Critical value for t distribution, two-sided (qt):
    qt(0.975, 120)  # For α = 0.05 and 120 degrees of freedom; divide α by 2  [Ch. 4]
Critical value for t distribution, one-sided (qt):
    qt(0.95, 120)  # For α = 0.05 and 120 degrees of freedom  [Ch. 4]
Critical value for normal distribution, two-sided (qnorm):
    qnorm(0.975)  # For α = 0.05; divide α by 2  [Ch. 4]
Critical value for normal distribution, one-sided (qnorm):
    qnorm(0.95)  # For α = 0.05  [Ch. 4]
Two-sided p values:
    [Reported in summary(Results) output]  [Ch. 4]
One-sided p values (pt):
    1 - pt(abs(1.69), 120)  # For a model with 120 degrees of freedom and a t statistic of 1.69  [Ch. 4]
Confidence intervals (confint):
    confint(Results, level = 0.95)  # For OLS object "Results"  [Ch. 4]
Produce standardized regression coefficients (scale):
    Res.std = lm(scale(Y) ~ scale(X1) + scale(X2))  [Ch. 5]
Display R squared ($r.squared):
    summary(Results)$r.squared  [Ch. 5]
Critical value for F test (qf):
    qf(0.95, df1 = 2, df2 = 120)  # Degrees of freedom equal 2 and 120; α = 0.05  [Ch. 5]
p value for F statistic (pf):
    1 - pf(7.77, df1 = 2, df2 = 1846)  # For F statistic = 7.77 and degrees of freedom equal 2 and 1846  [Ch. 5]
Include dummies for categorical variable (factor):
    lm(Y ~ factor(X1))  # Includes appropriate number of dummy variables for categorical variable X1  [Ch. 6]
Set reference category (relevel):
    X1 = relevel(X1, ref = "south")  # Sets the "south" category as reference; include before OLS model  [Ch. 6]
Difference of means test using OLS (lm):
    lm(Y ~ Dum)  # Where Dum is a dummy variable  [Ch. 6]
Create an interaction variable:
    DumX = Dum * X  # Or use <- in place of =  [Ch. 6]
Create a squared variable:
    X_sq = X^2  [Ch. 7]
Create a logged variable:
    X_log = log(X)  [Ch. 7]
LSDV model for panel data (factor):
    Results = lm(Y ~ X1 + factor(country))  # factor adds a dummy variable for every value of the variable called country  [Ch. 8]
One-way fixed-effects model, de-meaned (plm):
    library(plm)
    Results = plm(Y ~ X1 + X2 + X3, data = dta, index = c("country"), model = "within")  [Ch. 8]
Two-way fixed-effects model, de-meaned (plm):
    library(plm)
    Results = plm(Y ~ X1 + X2 + X3, data = dta, index = c("country", "year"), model = "within", effect = "twoways")  [Ch. 8]
2SLS model (ivreg):
    library(AER)
    ivreg(Y ~ X1 + X2 + X3 | Z1 + Z2 + X2 + X3)  [Ch. 9]
Probit (glm):
    glm(Y ~ X1 + X2, family = binomial(link = "probit"))  [Ch. 12]
Normal CDF (pnorm):
    pnorm(0)  # The normal CDF evaluated at 0 (which is 0.5)  [Ch. 12]
Logit (glm):
    glm(Y ~ X1 + X2, family = binomial(link = "logit"))  [Ch. 12]
Generate draws from standard normal distribution (rnorm):
    Noise = rnorm(500)  # 500 draws from the standard normal distribution  [Ch. 14]
Panel model with autocorrelation:
    [See Computing Corner in Chapter 15]
Include lagged dependent variable (plm with lag(Y)):
    Results = plm(Y ~ lag(Y) + X1 + X2, data = dta, index = c("ID", "time"), effect = "twoways")  [Ch. 15]
Random effects panel model (plm with model = "random"):
    Results = plm(Y ~ X1 + X2, data = dta, model = "random")  [Ch. 15]
PREFACE FOR STUDENTS:
HOW THIS BOOK CAN HELP YOU
LEARN ECONOMETRICS
“I wish I had had this book when I was first exposed to the material—it would
have saved a lot of time and hair-pulling . . .”—Student J.H.
“Material is easy to understand, hard to forget.”—Student M.H.
This book introduces the econometric tools necessary to answer important ques-
tions. Do antipoverty programs work? Does unemployment affect inflation? Does
campaign spending affect election outcomes? These and many more questions are
not only interesting but also important to answer correctly if we want to support
policies that are good for people, countries, and the world.
When using econometrics to answer such questions, we need always to
remember a single big idea: correlation is not causation. Just because variable
Y rises when variable X rises does not mean that variable X causes variable Y to
rise. The essential goal is to figure out when we can say that changes in variable
X will lead to changes in variable Y.
This book helps us learn how to identify causal relationships with three
features seldom found in other econometrics textbooks. First, it focuses on
the tools that economic researchers use most. These are the real econometric
techniques that help us make reasonable claims about whether X causes Y, and
by using these tools, we can produce analyses that others can respect. We’ll get
the most out of our data while recognizing the limits in what we can say or how
confident we can be.
This emphasis on real econometrics means that we skip obscure econometric
tools that could come up under certain conditions. Econometrics is too often
complicated by books and teachers trying to do too much. This book shows that
we can have a sophisticated understanding of statistical inference without having
to catalog every method that our instructor had to learn as a student.
Second, this book works with a single unifying framework. We don’t start over
with each new concept; instead, we build around a core model. That means there
is a single equation and a unifying set of assumptions that we poke, probe, and
expand throughout the book. This approach reduces the learning costs of moving
through the material and allows us to go back and revisit material. As with any
skill, we probably won’t fully understand any given technique the first time we see
it. We have to work at it; we have to work with it. We’ll get comfortable; we’ll see
connections. Then it will click. Whether the skill is jumping rope, typing, throwing
a baseball, or analyzing data, we have to do things many times to get good at it.
By sticking to a unifying framework, we have more chances to revisit what we
have already learned. You’ll also notice that I’m not afraid to repeat myself on the
important stuff. Really, I’m not afraid to repeat myself.
Third, this book uses many examples from the policy, political, and economic
worlds. So even if you do not care about “two-stage least squares” or “maximum
likelihood” in and of themselves, you will see how understanding these techniques
will affect what you think about education policy, trade policy, election outcomes,
and many other interesting issues. The examples and case studies make it clear
that the tools developed in this book are being used by contemporary applied
economists who are actually making a difference with their empirical work.
Real Econometrics is meant to serve as the primary textbook in an introduc-
tory econometrics course or as a supplemental text providing more intuition and
context in a more advanced econometric methods course. As more and more public
policy and corporate decisions are based on statistical and econometric analysis,
this book can also be used outside of course work. Econometrics has infiltrated
into every area of our lives—from entertainment to sports (I no longer spit out my
coffee when I come across an article on regression analysis of National Hockey
League players)—and a working knowledge of basic econometric techniques can
help anyone make better sense of the world around them.
The five chapters of Part One constitute the heart of the book. They introduce
ordinary least squares (OLS), also known as regression analysis. Chapter 3
introduces the most basic regression model, the bivariate OLS model. Chapter
4 shows how to use OLS to test hypotheses. Chapters 5 through 7 introduce
the multivariate OLS model and applications. By the end of Part One, you will
understand regression and be able to control for anything you can measure. You’ll
also be able to fit curves to data and assess whether the effects of some variables
differ across groups, among other skills that will impress your friends.
Part Two introduces techniques that constitute the contemporary econometric
toolkit. These are the techniques people use when they want to get published—or
paid. These techniques build on multivariate OLS to give us a better chance of
identifying causal relations between two variables. Chapter 8 covers a simple yet
powerful way to control for many factors we can’t measure directly. Chapter 9
covers instrumental variable techniques, which work if we can find a variable
that affects our independent variable but does not directly affect our dependent variable. Instrumental
variable techniques are a bit funky, but they can be very useful for isolating causal
effects. Chapter 10 covers randomized experiments. Although ideal in theory, in
practice such experiments often raise a number of challenges we need to address.
Chapter 11 covers regression discontinuity tools that can be used when we’re
studying the effect of variables that were allocated based on a fixed rule. For
example, Medicare is available to people in the United States only when they turn
65, and admission to certain private schools depends on a test score exceeding
some threshold. Focusing on policies that depend on such thresholds turns out to
be a great context for conducting credible econometric analysis.
Part Three contains a single chapter (Chapter 12) that covers dichotomous
dependent variable models. These are simply models in which the outcome we
care about takes on two possible values. Examples and case studies include high
school graduation (someone graduates or doesn’t), unemployment (someone has
a job or doesn’t), and alliances (two countries sign an alliance treaty or don’t). We
show how to apply OLS to such models and then provide more elaborate models
that address the deficiencies of OLS in this context.
Part Four supplements the book with additional useful material. Chapter 13
covers time series data. The first part of the chapter is a variation on OLS; the
second part introduces dynamic models that differ from OLS models in important
ways. Chapter 14 derives important OLS results and extends discussion on specific
topics. Chapter 15 goes into greater detail on the vast literature on panel data,
showing how the various strands fit together.
Chapter 16 concludes the book with tips on adopting the mind-set of an
econometric realist. In fact, if you are looking for an overall understanding of the
power and limits of statistics, you might want to read this chapter first—and then
read it again once you’ve learned all the statistical concepts covered in the other
chapters.
PREFACE FOR INSTRUCTORS
We econometrics teachers have high hopes for our students. We want them to
understand how econometrics can shed light on important economic and policy
questions. Sometimes they humor us with incredible insight. The heavens part;
angels sing. We want that to happen daily. Sadly, a more common experience is
seeing a furrowed brow of confusion and frustration. It’s cloudy and rainy in that
place.
It doesn’t have to be this way. If we distill the material to the most critical
concepts, we can inspire more insight and less brow-furrowing. Unfortunately,
conventional statistics and econometrics books all too often manage to be too
simple and too confusing at the same time. Many are too simple in that they
provide a semester’s worth of material that hardly gets past rudimentary ordinary
least squares (OLS). Some are too confusing in that they get to OLS by way of
going deep into the weeds of probability theory without showing students how
econometrics can be useful and interesting.
Real Econometrics is predicated on the belief that we are most effective
when we teach the tools we use. What we use are regression-based tools with an
increasing focus on experiments and causal inference. If students can understand
these fundamental concepts, they can legitimately participate in analytically sound
conversations. They can produce analysis that is interesting—and believable!
They can understand experiments and the sometimes subtle analysis required
when experimental methods meet social scientific reality. They can appreciate that
causal effects are hard to tease out with observational data and that standard errors
estimated on crap coefficients, however complex, do no one any good. They can
sniff out when others are being naive or cynical. It is only when we muck around
too long in the weeds of less useful material that statistics becomes the quagmire
students fear.
Hence this book seeks to be analytically sophisticated in a simple and relevant
way. It focuses on tools actually used by real analysts. Nothing useless. No clutter.
To do so, the book is guided by three principles: relevance, opportunity costs, and
pedagogical efficiency.
Relevance
Relevance is a crucial first principle for successfully teaching econometrics in
the social sciences. Every experienced instructor knows that most students care
more about the real world than math. How do we get such students to engage
with econometrics? One option is to cajole them to care more and work harder.
We all know how well that works. A better option is to show them how a
sophisticated understanding of statistical concepts helps them learn more about
the topics that concern them. Think of a mother trying to get a child to commit to
the training necessary to play competitive sports. She could start with a semester
of theory. . . . No, that would be cruel. And counterproductive. Much better to let
the child play and experience the joy of the sport. Then there will be time (and
motivation!) to understand nuances. Thus every chapter is built around examples
and case studies on topics students might actually care about—topics like violent
crime in the United States (Chapter 2), global warming (Chapter 7), and the
relationship between alcohol consumption and grades (Chapter 11).
Learning econometrics is not that different from learning anything else. We
need to care to truly learn. Therefore this book takes advantage of a careful
selection of material to spend more time on the real examples that students care
about.
Opportunity Costs
Opportunity costs are, as we all tell our students, what we have to give up to
do something. So, while some topic might be a perfectly respectable part of an
econometric toolkit, we should include it only if it does not knock out something
more important. The important stuff all too often gets shunted aside as we fill up
the early part of students’ analytical training with statistical knick-knacks, material
“some people still use” or that students “might see.”
Therefore this book goes quickly through descriptive statistics and doesn’t
cover χ² tests for two-way tables, weighted least squares, and other denizens of
conventional statistics books. These concepts—and many, many more—are all
perfectly legitimate. Some are covered elsewhere (descriptive statistics are covered
in elementary schools these days). Others are valuable enough to rate inclusion
here in an “advanced material” section for students and instructors who want
to pursue these topics further. And others simply don’t make the cut. Only by
focusing the material can we get to the tools used by researchers today, tools such
as panel data analysis, instrumental variables, and regression discontinuity. The
core ideas behind these tools are not particularly difficult, but we need to make
time to cover them.
Pedagogical Efficiency
Pedagogical efficiency refers to streamlining the learning process by using a single
unified framework. Everything in this book builds from the standard regression
model. Hypothesis testing, difference of means, and experiments can be—and
often are—taught independently of regression. Causal inference is sometimes
taught with potential outcomes notation. There is nothing intellectually wrong
with these approaches. But is using them pedagogically efficient? If we teach
these as stand-alone concepts we have to take time and, more important, student
brain space to set up each separate approach. For students, this is really hard.
Remember the furrowed brows? Students work incredibly hard to get their heads
around difference of means and where to put degrees of freedom corrections and
how to know if the means come from correlated groups or independent groups and
what the equation is for each of these cases. Then BAM! Suddenly the professor is
talking about residuals and squared deviations. The transition is old hat for us, but
it can overwhelm students first learning the material. It is more efficient to teach the
OLS framework and use that to cover difference of means, experiments, and the
contemporary canon of econometric analysis, including panel data, instrumental
variables, and regression discontinuity. Each tool builds from the same regression
model. Students start from a comfortable place and can see the continuity that
exists.
An important benefit of working with a single framework is that it allows
students to revisit the core model repeatedly throughout the term. Despite the
brilliance of our teaching, students rarely can put it all together with one pass
through the material. I know I didn’t when I was beginning. Students need to see
the material a few times, work with it a bit, and then it will finally click. Imagine
if sports were coached the way we do econometrics. A tennis coach who said
“This week we’ll cover forehands (and only forehands), next week backhands (and
only backhands), and the week after that serves (and only serves)” would not be a
tennis coach for long. Instead, coaches introduce material, practice, and then keep
working on the fundamentals. Working with a common framework throughout
makes it easier to build in mini-drills about fundamentals as new material is
introduced.
Course Adoption
Real Econometrics is organized to work well in three different kinds of courses.
First, it can be used in an introductory econometrics course that follows a semester
of probability and statistics. In such a course, students should probably be able to
move quickly through the early material and then pick up where they left off,
typically with multivariate OLS.
Second, this book can be used with students who have not previously (or
recently) studied statistics, either in a one-semester course covering Part One or
a year-long course covering the whole book. Using this book as a first course
avoids the “warehouse problem,” which occurs when we treat students’ statistical
education as a warehouse, filling it up with tools first and accessing them only
later. One challenge is that things rot in a warehouse. Another challenge is that
instructors tend to hoard a bit, putting things in the warehouse “just in case”
and creating clutter. And students find warehouse work achingly dull. Using this
book in a first-semester course avoids the warehouse problem by going directly
to interesting and useful material, providing students with a more just-in-time
approach. For example, they see statistical distributions, but in the context of trying
to solve a specific problem rather than as an abstract concept that will become
useful later.
Overview
The first two chapters of the book serve as introductory material and introduce
the science of statistics. Chapter 1 lays out the theme of how important—and
hard—it is to generate unbiased estimates. This is a good time to let students offer
hypotheses about questions of the day, because these questions can help bring
to life the subsequent material. Chapter 2, which introduces computer programs
and good practices, is a confidence builder that gets students who are not already
acclimated to statistical computing over the hurdle of using statistical software.
Part One covers core OLS material. Chapter 3 introduces bivariate OLS.
Chapter 4 covers hypothesis testing, and Chapter 5 moves to multivariate OLS.
Chapters 6 and 7 proceed to practical tasks such as use of dummy variables, logged
variables, interactions, and F tests.
Part Two covers essential elements of the contemporary econometric toolkit,
including panel data, instrumental variables, analysis of experiments, and regres-
sion discontinuity. Chapter 10, on experiments, uses instrumental variables.
Chapters 8, 9, and 11 can be covered in any order, however, so instructors can
pick and choose among these chapters as needed.
Part Three contains a single chapter (Chapter 12) on dichotomous dependent
variables. It develops the linear probability model in the context of OLS and
uses the probit and logit models to introduce students to maximum likelihood.
Instructors can cover this chapter any time after Part One if dichotomous
dependent variables play a major role in the course.
Data
Instructor’s Manual
PowerPoint Presentations
Presentation slides offer bullet-point summaries as well as all the tables and graphs
from the book to help guide and design lectures. A separate set of slides containing
only the text tables and graphs is also available.
The computerized test bank that accompanies this text enables instructors to
easily create quizzes and exams, using any combination of publisher-provided
questions and their own questions. Questions can be edited and easily assembled
into assessments that can then be exported for use in learning management systems
or printed for paper-based assessments.
For instructors using an online learning management system (e.g., Moodle, Sakai,
Blackboard, or others), Oxford University Press can provide all the electronic
components of the package in a format suitable for easy upload. Adopting
instructors should contact their local Oxford University Press sales representative
or OUP’s Customer Service (800-445-9714) for more information.
ACKNOWLEDGMENTS
This book has benefited from close reading and probing questions from a large
number of people, including students at the McCourt School of Public Policy
at Georgetown University and my current and former colleagues and students at
Georgetown, including Shirley Adelstein, Rachel Blum, David Buckley, Ian Gale,
Ariya Hagh, Carolyn Hill, Mark Hines, Dan Hopkins, Jeremy Horowitz, Huade
Huo, Wes Joe, Karin Kitchens, Jon Ladd, Jens Ludwig, Paasha Mahdavi, Jean
Mitchell, Paul Musgrave, Sheeva Nesva, Hans Noel, Irfan Nooruddin, Ji Yeon
Park, Parina Patel, Betsy Pearl, Lindsay Pettingill, Carlo Prato, Barbara Schone,
George Shambaugh, Dennis Quinn, Chris Schorr, Frank Vella, and Erik Voeten.
Credit (and/or blame) for the Simpsons figure goes to Paul Musgrave.
Participants at a seminar on the book at the University of Maryland, especially
Antoine Banks, Brandon Bartels, Kanisha Bond, Ernesto Calvo, Sarah Croco,
Michael Hanmer, Danny Hayes, Eric Lawrence, Irwin Morris, and John Sides,
gave excellent early feedback.
In addition, colleagues across the country have been incredibly helpful,
especially Allison Carnegie, Craig Volden, Sarah Croco, and Wendy Tam-Cho.
Reviewers for Oxford University Press and other commentators have provided
supportive yet probing feedback. These individuals include:
Steve Balla, George Washington University; Yong Bao, Purdue University;
James Bland, The University of Toledo; Kwang Soo Cheong, Johns Hopkins
University; Amanda Cook, Bowling Green State University; Renato Corbetta,
University of Alabama at Birmingham; Sarah Croco, University of Maryland;
David E. Cunningham, University of Maryland; Seyhan Erden, Columbia
University; José M. Fernández, University of Louisville; Luca Flabbi, Georgetown
University; Mark A. Gebert, University of Kentucky; Kaj Gittings, Texas Tech
University; Brad Graham, Grinnell College; Jonathan Hanson, University of
Michigan; David Harris, Benedictine College; Daniel Henderson, University of
Alabama; Matthew J. Holian, San Jose State University; Todd Idson, Boston
University; Changkuk Jung, SUNY Geneseo; Manfred Keil, Claremont McKenna
College; Subal C. Kumbhakar, State University of New York at Binghamton; Latika
Lagalo, Michigan Technological University; Matthew Lang, Xavier University;
Jing Li, Miami University; Quan Li, Texas A&M University; Drew A. Linzer,
Civiqs; Steven Livingston, Middle Tennessee State University; Aprajit Mahajan,
Stanford University; Brian McCall, University of Michigan; Phillip Mixon,
Troy University; David Peterson, Iowa State University; Leanne C. Powner,
Christopher Newport University; Zhongjun Qu, Boston University; Robi Ragan,
Stetson School of Business and Economics; Stephen Schmidt, Union College;
Markus P. A. Schneider, University of Denver; Sam Schulhofer-Wohl, Federal
Reserve Bank of Minneapolis; Christina Suthammanont, Texas A&M University,
San Antonio; Kerry Tan, Loyola University Maryland; Robert Turner, Colgate
University; Martijn van Hasselt, University of North Carolina—Greensboro;
David Vera, California State University; Christopher Way, Cornell University;
CHAPTER 1
The Quest for Causality
dependent variable: The outcome of interest, usually denoted as Y.

independent variable: A variable that possibly influences the value of the dependent variable.

The dependent variable, usually denoted as Y, is called that because its value depends on the independent variable. The independent variable, usually denoted by X, is called that because it does whatever the hell it wants. It is potentially the cause of some change in the dependent variable.

At root, social scientific theories posit that a change in one thing (the independent variable) will lead to a change in another (the dependent variable). We'll formalize this relationship in a bit, but let's start with an example. Suppose we're interested in the U.S. obesity epidemic and want to analyze the influence of snack food on health. We may wonder, for example, if donuts cause health problems. Our model is that eating donuts (variable X, our independent variable) causes some change in weight (variable Y, our dependent variable). If we can find data on how many donuts people ate and how much they weighed, we might be on the verge of a scientific breakthrough.
Let’s conjure up a small midwestern town and do a little research. Figure 1.2
plots donuts eaten and weights for 13 individuals from a randomly chosen town:
Springfield, U.S.A. Our raw data is displayed in Table 1.1. Each person has a line in
the table. Homer is observation 1. Since he ate 14 donuts per week, Donuts1 = 14.
We’ll often refer to Xi or Yi , which are the values of X and Y for person i in the
data set. The weight of the seventh person in the data set, Smithers, is 160 pounds,
meaning Weight7 = 160, and so forth.
scatterplot: A plot of data in which each observation is located at the coordinates defined by the independent and dependent variables.

Figure 1.2 is a scatterplot of data, with each observation located at the coordinates defined by the independent and dependent variables. The value of donuts per week is on the X-axis, and weights are on the Y-axis. Just by looking at this plot, we sense there is a positive relationship between donuts and weight because the more donuts eaten, the higher the weight tends to be.
Table 1.1

Observation   Name     Donuts per week   Weight (pounds)
1             Homer    14                275
2             Marge    0                 141
3             Lisa     0                 70
4             Bart     5                 75
...
12            Patty    5                 155
13            Selma    4                 145
FIGURE 1.2: Weight and Donuts in Springfield [scatterplot of donuts per week (X-axis, 0 to 20) against weight in pounds (Y-axis, 50 to 300), with each Springfield resident labeled]
FIGURE 1.3: Regression Line for Weight and Donuts in Springfield [the same scatterplot with a fitted line; the intercept β0 = 123 is marked, and β1 is the slope]
error term  The term associated with unmeasured factors in a regression model, typically denoted as ε.

• εi is the error term that captures anything else that affects weight. (ε is the Greek letter epsilon.)

This equation will help us estimate the two parameters necessary to characterize a line. Remember Y = mX + b from junior high? This is the equation for a line where Y is the value of the line on the vertical axis, X is the value on the horizontal axis, m is the slope, and b is the intercept, or the value of Y when X is zero. Equation 1.1 is essentially the same, only we refer to the "b" term as β0 and call the "m" term β1.
Figure 1.3 shows an example of a possible line from this model for our
Springfield data. The intercept (β0 ) is the value of weight when donut consumption
is zero (X = 0). The slope (β1 ) is the amount that weight increases for each donut
eaten. In this case, the intercept is about 123, which means that the expected weight
for those who eat zero donuts is around 123 pounds. The slope is around 9.1, which
means that for each donut eaten per week, weight is about 9.1 pounds higher.
More generally, our core model can be written as
Yi = β0 + β1 Xi + εi    (1.2)
where β0 is the intercept that indicates the value of Y when X = 0 and β1 is the
slope that indicates how much change in Y is expected if X increases by one unit.
We almost always care a lot about β1 , which characterizes the relationship between
X and Y. We usually don’t care a whole lot about β0 . It plays an important role in
helping us get the line in the right place, but determining the value of Y when X is
zero is seldom our core research interest.
In Figure 1.3, we see that the actual observations do not fall neatly on the
line that we’re using to characterize the relationship between donuts and weight.
The implication is that our model does not perfectly explain the data. Of course
it doesn’t! Springfield residents are much too complicated for donuts to explain
them completely (except, apparently, Comic Book Guy).
The error term, i , comes to the rescue by giving us some wiggle room. The
error term is what is left over after the variables have done their work in explaining
variation in the dependent variable. In doing this service, it plays an incredibly
important role for the entire econometric enterprise. As this book proceeds, we
will keep coming back to the importance of getting to know our error term.
The error term, i , is not simply a Greek letter. It is something real. What it
covers depends on the model. In our simple model—in which weight is a function
only of how many donuts a person eats—oodles of factors are contained in the
error term. Basically, anything else that affects weight will be in the error term:
sex, height, other eating habits, exercise patterns, genetics, and on and on. The
error term includes everything we haven’t measured in our model.
We’ll often see i referred to as random error, but be careful about that one.
Yes, for the purposes of the model we are treating the error term as something
random, but it is not random in the sense of a roll of the dice. It is random more in
the sense that we don’t know what the value of it is for any individual observation.
But as a practical matter every error term reflects, at least in part, some relationship
to real things that we have not measured or included in the model. We will come
back to this point often.
REMEMBER THIS
Our core statistical model is
Yi = β0 + β1 Xi + εi
1. β1 , the slope, indicates how much change in Y (the dependent variable) is expected if X (the
independent variable) increases by one unit.
2. β0 , the intercept, indicates where the regression line crosses the Y-axis. It is the value of Y when
X is zero.
3. β1 is usually more interesting than β0 because β1 characterizes a relationship between X and
Y.
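To see the core model at work, here is a minimal sketch in Python that fits β0 and β1 by ordinary least squares to the six Springfield rows recoverable from Table 1.1 (with only 6 of the 13 residents, the estimates will not match the full-sample line in Figure 1.3):

```python
import numpy as np

# Six rows of Table 1.1 (observations 1-4, 12, 13)
donuts = np.array([14, 0, 0, 5, 5, 4], dtype=float)           # X_i: donuts per week
weight = np.array([275, 141, 70, 75, 155, 145], dtype=float)  # Y_i: pounds

# Fit weight_i = b0 + b1 * donuts_i by ordinary least squares
X = np.column_stack([np.ones_like(donuts), donuts])
(b0, b1), *_ = np.linalg.lstsq(X, weight, rcond=None)
print(f"intercept b0 = {b0:.1f}, slope b1 = {b1:.1f}")

# The fitted errors (residuals) are what epsilon_i absorbs for these people
residuals = weight - (b0 + b1 * donuts)
```

The slope comes out positive, matching the upward pattern visible in the scatterplot.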
FIGURE 1.4: Examples of Lines Generated by Core Statistical Model (for Review Question) [four panels, (a) through (d), each showing a line in X-Y space]

Review Question
For each of the panels in Figure 1.4, determine whether β0 and β1 are greater than, equal to, or less than zero. [Be careful with β0 in panel (d)!]

1.2 Two Major Challenges: Randomness and Endogeneity

But can we be certain that donuts, and not some other factor, cause weight gain? Two core challenges in econometric analysis should make us cautious. One
is randomness. Any time we observe a relationship in data, we need to keep in mind
that some coincidence could explain it. Perhaps we happened to pick some unusual
people for our data set. Or perhaps we picked perfectly representative people, but
they happened to have had unusual measurements on the day we examined them.
In the donut example, the possibility of such randomness should worry us, at
least a little. Perhaps the people in Figure 1.3 are a bit odd. Perhaps if we had more
people, we might get more heavy folks who don’t eat donuts and skinny people
who scarf them down. Adding those folks to the data set would change the figure
and our conclusions. Or perhaps even with the set of folks we observed, we might
have gotten some of them on a bad (or a good) day, whereas if we had looked at
them another day, we might have observed a different relationship.
Every legitimate econometric analysis therefore will account for randomness
in an effort to distinguish results that could happen by chance from those that
would be unlikely to happen by chance. The bad news is that we will never escape
the possibility that the results we observe are due to randomness rather than a
causal effect. The good news, though, is that we can often do a pretty good job
characterizing our confidence that the results are not simply due to randomness.
Another major challenge arises from the possibility that an observed relation-
ship between X and Y is actually due to another variable, which causes Y and
is associated with X. In the donuts example, we should worry about scenarios in which we
wrongly attribute to our key independent variable (in this case, donut consumption)
changes in weight that were caused by other factors. What if tall people eat more
donuts? Height is in the error term as a contributing factor to weight, and if tall
people eat more donuts, we may wrongly attribute to donuts the effect of height.
There are loads of other possibilities. What if men eat more donuts? What if
exercise addicts don’t eat donuts? What if people who eat donuts are also more
likely to down a tub of Ben and Jerry’s ice cream every night? What if thin people
can’t get donuts down their throats? Being male, exercising, bingeing on ice cream,
having itty-bitty throats—all these things are probably in the error term (meaning
they affect weight), and all could be correlated with donut eating.
endogenous  An independent variable is endogenous if changes in it are related to factors in the error term.

Speaking econometrically, we highlight this major statistical challenge by saying that the donut variable is endogenous. An independent variable is endogenous if changes in it are related to factors in the error term. The prefix "endo" refers to something internal, and endogenous independent variables are "in the model" in the sense that they are related to other things that also determine Y (but are not already accounted for by X).
In the donuts example, donut consumption is likely endogenous because how
many donuts a person eats is not independent of other factors that influence weight
gain. Factors that cause weight gain (e.g., eating Ben and Jerry’s ice cream)
might be associated with donut eating; in other words, factors that influence the
dependent variable Y might also be associated with the independent variable X,
muddying the connection between correlation and causation. If we can’t be sure
that our variation in X is not associated with factors that influence Y, we need to
worry about wrongly attributing to X the causal effect of some other variable.
We might wrongly conclude that donuts cause weight gain when really donut
eaters are more likely to eat tubs of Ben and Jerry’s, with the ice cream being
the real culprit.
In all these examples, something in the error term that really causes weight
gain is related to donut consumption. When this connection exists, we risk
spuriously attributing to donut consumption the causal effect of some other factor.
Remember, anything not measured in the model is in the error term, and here,
at least, we have a wildly simple model in which only donut consumption is
measured. So Ben and Jerry’s, genetics, and everything else are in the error term.
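A short simulation makes the danger concrete (a sketch with invented numbers, not any published analysis): here donuts have no true effect on weight, but because donut eating is correlated with ice cream eating, which sits in the error term, a regression of weight on donuts alone attributes the ice cream effect to donuts.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Invented data-generating process: ice cream (unmeasured) drives weight...
ice_cream = rng.normal(size=n)
# ...and donut eating is correlated with ice cream eating
donuts = 0.8 * ice_cream + rng.normal(size=n)
# Donuts have NO true effect on weight; all the action is ice cream
weight = 150 + 5.0 * ice_cream + rng.normal(size=n)

# Regress weight on donuts alone; donuts are endogenous here
X = np.column_stack([np.ones(n), donuts])
b0, b1 = np.linalg.lstsq(X, weight, rcond=None)[0]
print(f"estimated donut effect: {b1:.2f} (true effect is 0)")
```

The estimated donut coefficient is far from zero even though donuts do nothing, which is exactly the spurious attribution described above.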
Endogeneity is everywhere; it’s endemic. Suppose we want to know if raising
teacher salaries increases test scores. It’s an important and timely question.
Answering it may seem easy enough: we could simply see if test scores (a
dependent variable) are higher in places where teacher salaries (an independent
variable) are higher. It’s not that easy, though, is it? Endogeneity lurks. Test
scores might be determined by unmeasured factors that also affect teacher salaries.
Maybe school districts with lots of really poor families don’t have very good test
scores and don’t have enough money to pay teachers high salaries. Or perhaps
the relationship is the opposite—poor school districts get extra federal funds to
pay teachers more. Either way, teacher salaries are endogenous because their
levels depend in part on factors in the error term (like family income) that affect
educational outcomes. Simply looking at the relationship of test scores to teacher
salaries risks confusing the effect of family income and teacher salaries.2
exogenous  An independent variable is exogenous if changes in it are unrelated to factors in the error term.

The opposite of endogeneity is exogeneity. An independent variable is exogenous if changes in it are not related to factors in the error term. The prefix "exo" refers to something external, and exogenous independent variables are "outside the model" in the sense that their values are unrelated to other things that also determine Y. For example, if we use an experiment to randomly set the value of X, then changes in X are not associated with factors that also determine Y. This gives us a clean view of the relationship between X and Y, unmuddied by associations between X and other factors that affect Y.
One of our central challenges is to avoid endogeneity and thereby achieve
exogeneity. If we succeed, we can be more confident that we have moved beyond
correlation and closer to understanding if X causes Y—our fundamental goal. This
process is not automatic or easy. Often we won’t be able to find purely exogenous
variation, so we’ll have to think through how close we can get. Nonetheless, the
bottom line is this: if we can find exogenous variation in X, we will be in a good
position to make reasonable inferences about what will happen to variable Y if we
change variable X.
correlation  Measures the extent to which two variables are linearly related to each other.

To formalize these ideas, we'll use the concept of correlation, which most people know, at least informally. Two variables are correlated ("co-related") if they move together. A positive correlation means that high values of one variable are associated with high values of the other; a negative correlation indicates that high values of one variable are associated with low values of the other.
Figure 1.5 shows examples of variables that have positive correlation
[panel (a)], no correlation [panel (b)], and negative correlation [panel (c)].
2. A good idea is to measure these things and put them in the model so that they are no longer in the error term. That's what we do in Chapter 5.
FIGURE 1.5: Correlation [three panels: (a) positive correlation, (b) no correlation, (c) negative correlation]
Correlations range from −1 to 1. A correlation of 1 means that the variables move perfectly together; a correlation of −1 means they move perfectly in opposite directions.
Correlations close to zero indicate weak relationships between variables.
When the correlation is zero, there is no linear relationship between two variables.3
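These properties are easy to verify numerically; a minimal sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)

y_pos = x + 0.5 * rng.normal(size=1000)    # moves with x
y_neg = -x + 0.5 * rng.normal(size=1000)   # moves against x
y_zero = rng.normal(size=1000)             # no linear relationship to x

# np.corrcoef returns a correlation matrix; entry [0, 1] is corr(x, y)
print(np.corrcoef(x, y_pos)[0, 1])    # strongly positive
print(np.corrcoef(x, y_neg)[0, 1])    # strongly negative
print(np.corrcoef(x, y_zero)[0, 1])   # near zero
```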
We use correlation in our definitions of endogeneity and exogeneity. If our
independent variable has a relationship to the error term like the one in panel (a) of
Figure 1.5 (which shows positive correlation) or in panel (c) (which shows negative
correlation), then we have endogeneity. In other words, we have endogeneity when
the unmeasured stuff that constitutes the error term is correlated with our indepen-
dent variable, and endogeneity will make it difficult to tell whether changes in the
dependent variable are caused by our independent variable or the error term.
On the other hand, if our independent variable has no relationship to the error
term as in panel (b), we have exogeneity. In this case, if we observe Y rising with
X, we can feel confident that X is causing Y.
The challenge is that the true error term is not observable. Hence, much of
what we do in econometrics attempts to get around the possibility that something
3. In Appendix E (page 541), we provide an equation for correlation and discuss how it relates to our ordinary least squares estimates from Chapter 3. Correlation measures linear relationships between variables; we'll discuss non-linear relationships in ordinary least squares on page 221.
unobserved in the error term may be correlated with the independent variable. This
quest makes econometrics challenging and interesting.
As a practical matter, we should begin every analysis by assessing endogene-
ity. First, look away from the model for a moment and list all the things that could
determine the dependent variable. Second, ask if anything on the list correlates
with the independent variable in the model and explain why it might. That’s it. Do
that, and we are on our way to identifying endogeneity.
REMEMBER THIS
1. There are two fundamental challenges in econometrics: randomness and endogeneity.
2. Randomness can produce data that suggests X causes Y even when it does not. Randomness
can also produce data that suggests X does not cause Y even when it does.
3. An independent variable is endogenous if it is correlated with the error term in the model.
(a) An independent variable is exogenous if it is not correlated with the error term in the
model.
(b) The error term is not observable, making it a challenge to know whether an independent
variable is endogenous or exogenous.
(c) It is difficult to assess causality for endogenous independent variables.
Discussion Questions
1. Each panel of Figure 1.6 on page 12 shows relationships among three variables: X is an observed independent variable, ε is a variable reflecting some unobserved characteristic, and Y is the dependent variable. (In our donut example, X corresponds to the number of donuts eaten, ε corresponds to an unobserved characteristic such as exercise, and Y corresponds to the outcome of interest, which is weight.) If an arrow connects X and Y, then X has a causal effect on Y. If an arrow connects ε and Y, then the unobserved characteristic has a causal effect on Y. If a double arrow connects X and ε, then these two variables are correlated (and we won't worry about which causes which).
For each panel, explain whether endogeneity will cause problems for an analysis of the relationship between X and Y. For concreteness, assume X is grades in college, ε is IQ, and Y is salary at age 26.
2. Come up with your own independent variable, unmeasured error variable, and dependent
variable. Decide which of the panels in Figure 1.6 best characterizes the relationship of the
variables you chose, and discuss the implications for econometric analysis.
FIGURE 1.6: Possible Relationships between X, ε, and Y (for Discussion Questions) [eight panels, (a) through (h); each shows X (observed), ε (unobserved), and Y (dependent variable) connected by different arrow patterns]
Deathi = β0 + β1 Flu shoti + εi

where Deathi is a (creepy) variable that is 1 if person i died in the time frame of the study and 0 if he or she did not. Flu shoti is 1 if person i got a flu shot and 0 if not.4
A number of studies have done essentially this analysis and found that people
who get flu shots are less likely to die. According to some estimates, those who
receive flu shots are as much as 50 percent less likely to die. This effect is enormous.
Going home with a Band-Aid that has a little bloodstain is worth it after all.
But are we convinced? Is there any chance of endogeneity? If there exists some
factor in the error term that affected whether someone died and whether he or she
got a flu shot, we would worry about endogeneity.
What is in the error term? Goodness, lots of things affect the probability
of dying: age, health status, wealth, cautiousness—the list is immense. All these
factors and more are in the error term.
How could these factors cause endogeneity? Let’s focus on overall health.
Clearly, healthier people die at a lower rate than unhealthy people. If healthy people
are also more likely to get flu shots, we might erroneously attribute life-saving
power to flu shots when perhaps all that is going on is that people who are healthy
in the first place tend to get flu shots.
It’s hard, of course, to get measures of health for people, so let’s suppose we
don’t have them. We can, however, speculate on the relationship between health
and flu shots. Figure 1.7 shows two possible states of the world. In each figure we
plot flu-shot status on the X-axis. A person who did not get a flu shot is in the 0
group; someone who got a flu shot is in the 1 group. On the Y-axis we plot health
4. We discuss dependent variables that equal only 0 or 1 in Chapter 12 and independent variables that equal 0 or 1 in Chapter 6.
FIGURE 1.7: Two Scenarios for the Relationship between Flu Shots and Health [two panels plotting health (0 to 10 scale, Y-axis) against flu-shot status (0 = no shot, 1 = shot, X-axis)]
related to everything but flu (supposing we could get an index that factors in age,
heart health, absence of disease, etc.). In panel (a) of Figure 1.7, health and flu
shots don’t seem to go together; in other words the correlation is zero. If panel (a)
represents the state of the world, then our results that flu shots are associated with
lower death rates is looking pretty good because flu shots are not reflecting overall
health. In panel (b), health and flu shots do seem to go together, with the flu shot
population being healthier. In this case, we have correlation of our main variable
(flu shots) and something in the error term (health).
Brownlee and Lenzer (2009) discuss some indirect evidence suggesting that flu
shots and health are actually correlated. A clever approach to assessing this matter
is to look at death rates of people in the summer. The flu rarely kills people in the
summer, which means that if people who get flu shots also die at lower rates in the
summer, it is because they are healthier overall. And if people who get flu shots die
at the same rates as others during the summer, it would be reasonable to suggest
that the flu-shot and non-flu-shot populations have similar health. It turns out that
people who get flu shots have an approximately 60 percent lower probability of
dying outside the flu season.
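The logic of this off-season check can be sketched in a simulation (invented numbers, not the cited study's data): if underlying health drives both flu-shot uptake and mortality, the flu-shot group will die at lower rates even in months when the flu kills no one.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Invented scenario: underlying health drives both shot uptake and mortality
health = rng.normal(size=n)
got_shot = (health + rng.normal(size=n)) > 0   # healthier people seek shots

# Summer death probability falls with health; the flu plays no role here
p_death = 1 / (1 + np.exp(2 + health))
summer_death = rng.random(n) < p_death

rate_shot = summer_death[got_shot].mean()
rate_no_shot = summer_death[~got_shot].mean()
print(rate_shot, rate_no_shot)  # shot group dies less with no flu around
```

A lower off-season death rate in the vaccinated group signals selection on health rather than any life-saving power of the shot itself.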
Other evidence backs up the idea that healthier people get flu shots. As it
happened, vaccine production faltered in 2004, and 40 percent fewer people got
vaccinated. What happened? Flu deaths did not increase. And in some years, the flu
vaccine was designed to attack a set of viruses that turned out to be different from
the viruses that actually spread; again, there was no clear change in mortality. This
data suggests that people who get flu shots may live longer because getting flu
shots is associated with other healthy behavior, such as seeking medical care and
eating better.
The point is not to put us off flu shots. We’ve discussed only mortality—whether
people die from the flu—not whether they’re more likely to contract the virus or stay
home from work because they are sick.5 The point is to highlight how hard it is to
really know if something (in this case, a vaccine) works. If something as widespread
and seemingly straightforward as a flu shot is hard to assess definitively, think about
the care we must take when trying to analyze policies that affect fewer people and
have more complicated effects.
Suicide ratesi = β0 + β1 Country musici + εi

where Suicide ratesi is the suicide rate in metropolitan area i and Country musici is the proportion of radio airtime devoted to country music in metropolitan area i.7
It turns out that suicides are indeed higher in metropolitan areas where radio
stations play more country music. But do we believe this is a causal relationship?
5. Demicheli, Jefferson, Ferroni, Rivetti, and Di Pietrantonj (2018) summarize 52 randomized controlled trials of flu vaccines and conclude that the vaccines reduce the incidence of flu in healthy adults from 2.3 to 0.9 percent. The flu vaccine also reduces the incidence of flu-like illness from 21.5 to 18.1 percent. The effect on hospitalization is not large and not statistically significant. There is no evidence of reducing days off of work. See also DiazGranados, Denis, and Plotkin (2012) as well as Osterholm, Kelley, Sommer, and Belongia (2012).
6. Really, this is an actual published paper.
7. Their analysis is based on a more complicated model, but this is the general idea.
(In other words, is country music exogenous?) If radio stations play more country
music, should we expect more suicides?
Let’s work through this example.
What does β0 mean? What does β1 mean? In this model, β0 is the expected level
of suicide in metropolitan areas that play no country music. β1 is the amount by
which suicide rates change for each one-unit increase in the proportion of country
music played in a metropolitan area. We don’t know what β1 is; it could be positive
(suicides increase), zero (no relation to suicides), or negative (suicides decrease). For
the record, we don’t know what β0 is either, but since this variable does not directly
characterize the relationship between music and suicides the way β1 does, we are
less interested in it.
What is in the error term? The error term contains factors that are associated
with higher suicide rates, such as alcohol and drug use, availability of guns, divorce
and poverty rates, lack of sunshine, lack of access to mental health care, and
probably many more.
way, it would be incorrect to conclude that we would save lives by banning country
music.
As it turns out, Snipes and Maguire (1995) account for the amount of guns and
divorce in metropolitan areas and find no relationship between country music and
metropolitan suicide rates. So there’s no reason to turn off the radio and put away
those cowboy boots.
Discussion Questions
1. Labor economists often study the returns on investment in education (see, e.g., Card 1999).
Suppose we have data on salaries of a set of people, some of whom went to college and some
of whom did not. A simple model linking education to salary is

Salaryi = β0 + β1 College graduatei + εi

where the value of Salaryi is the salary of person i and the value of College graduatei is 1 if
person i graduated from college and is 0 if person i did not.
(a) What does β0 mean? What does β1 mean?
(b) What is in the error term?
(c) What are the conditions for the independent variable X to be endogenous?
(d) Is the independent variable likely to be endogenous? Why or why not?
(e) Explain how endogeneity could lead to incorrect inferences.
2. Donuts aren’t the only food that people worry about. Consider the following model based on
Solnick and Hemenway (2011):

Violencei = β0 + β1 Soft drinksi + εi

where Violencei is the number of physical confrontations student i was in during a school year
and Soft drinksi is the average number of cans of soda student i drinks per week.
(a) What does β0 mean? What does β1 mean?
(b) What is in the error term?
(c) What are the conditions for the independent variable X to be endogenous?
(d) Is the independent variable likely to be endogenous? Why or why not?
(e) Explain how endogeneity could lead to incorrect inferences.
3. We know U.S. political candidates spend an awful lot of time raising money. And we know
they use the money to inflict mind-numbing ads on us. Do we know if the money and the ads
it buys actually work? That is, does campaign spending increase vote share? Jacobson (1978),
Erikson and Palfrey (2000), and others have grappled at length with this issue. Consider the
following model:

Vote sharei = β0 + β1 Campaign spendingi + εi

where Vote sharei is the vote share of a candidate in state i and Campaign spendingi is the
spending by candidate i.
(a) What does β0 mean? What does β1 mean?
(b) What is in the error term?
(c) What are the conditions for the independent variable X to be endogenous?
(d) Is the independent variable likely to be endogenous? Why or why not?
(e) Explain how endogeneity could lead to incorrect inferences.
4. Researchers identified every outdoor advertisement in 228 census tracts in Los Angeles and
New Orleans and then interviewed 2,881 residents of the cities about weight. Their results
suggested that a 10 percent increase in outdoor food ads in a neighborhood was associated
with a 5 percent increase in obesity.
(a) Do you think there could be endogeneity?
(b) How would you test for a relationship between food ads and obesity?
(c) Read the article “Does This Ad Make Me Fat?” by Christopher Chabris and Daniel
Simons in the March 10, 2013, issue of the New York Times and see how your answers
compare to theirs.
across groups. Both treated and untreated groups would be virtually identical and
would resemble the composition of the population.
In an experiment like this, the variation in our independent variable X is
exogenous. We have won. If we observe that donut eaters weigh more or have
other health differences from non-eaters of donuts, we can reasonably attribute
these effects to donut consumption.
randomization  The process of determining the experimental value of the key independent variable based on a random process.

Simply put, the goal of such a randomized experiment is to make sure the independent variable, which we also call the treatment, is exogenous. The key element of such experiments is randomization, a process whereby the value of the independent variable is determined by a random process. The value of the independent variable will depend on nothing but chance, meaning that the independent variable will be uncorrelated with everything, including any factor in the error term affecting the dependent variable. In other words, a randomized independent variable is exogenous; analyzing the relationship between an exogenous independent variable and the dependent variable allows us to make inferences about a causal relationship between the two variables.
This is one of those key moments when a concept that may not be very compli-
cated turns out to have enormous implications. By randomly picking some people
to get a certain treatment, we rule out the possibility that there is some other way
for the independent variable to be associated with the dependent variable. If the
randomization is successful, the treated subjects are not systematically taller, more
athletic, or more food conscious—or more left-handed or stinkier, for that matter.
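The point can be sketched in a simulation (invented numbers): assign the donut treatment by coin flip, and it ends up uncorrelated with an error-term factor like ice cream habits, so a simple treatment/control comparison recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# Something in the error term that affects weight (e.g., ice cream habits)
ice_cream = rng.normal(size=n)

# Randomize the treatment: a coin flip decides who gets the donut regimen
donuts = rng.integers(0, 2, size=n)

# True treatment effect is +6 pounds
weight = 150 + 6.0 * donuts + 5.0 * ice_cream + rng.normal(size=n)

# Randomization makes the treatment uncorrelated with the error-term factor
print(np.corrcoef(donuts, ice_cream)[0, 1])   # near zero

# So the simple treatment-minus-control difference is a clean estimate
effect = weight[donuts == 1].mean() - weight[donuts == 0].mean()
print(effect)  # near the true effect of 6
```

Contrast this with the earlier endogeneity simulation: the only change is who decides the value of X, and that change is what buys exogeneity.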
randomized controlled trial  An experiment in which the treatment of interest is randomized.

treatment group  In an experiment, the group that receives the treatment of interest.

control group  In an experiment, the group that does not receive the treatment of interest.

The basic structure of a randomized experiment, often referred to as a randomized controlled trial, is simple. Based on our research question, we identify a relevant population that we randomly split into two groups: a treatment group, which receives the policy intervention, and a control group, which does not. After the treatment, we compare the behavior of the treatment and control groups on the outcome we care about. If the treatment group differs substantially from the control group, we believe the treatment had an effect; if not, then we're inclined to think the treatment had no effect.8

For example, suppose we want to know if an ad campaign increases enrollment in ObamaCare. We would identify a sample of uninsured people and split them into a treatment group that is exposed to the campaign and a control group that is not. After the treatment, we compare the enrollment in ObamaCare of the treatment and control groups. If the treated group enrolled at a substantially higher rate, that outcome would suggest the campaign works.

Because they build exogeneity into the research, randomized experiments
are often referred to as the gold standard for causal inference. The phrase “gold
standard” usually means the best of the best. But experiments also merit the gold
standard moniker in another sense. No country in the world is actually on a gold
standard. The gold standard doesn’t work well in practice, and for many research
questions, neither do experiments. Simply put, experiments are great, but they can
be tricky when applied to real people going about their business.
8. We provide standards for making such judgments in Chapter 3 and beyond.
The human element of social scientific experiments makes them very different
from experiments in the physical sciences. My third grader’s science fair project
compared cucumber seeds planted in peanut butter and in dirt. She did not have to
worry that the cucumber seeds would get up and say, “There is NO way you are
planting me in that.” In the social sciences, though, people can object, not only to
being planted in peanut butter but also to things like watching TV commercials,
attending a charter school, changing health care plans, or pretty much anything
else we might want to study with an experiment.
Therefore, an appreciation of the virtues of experiments should come with
a recognition of their limits. We devote Chapter 10 to discussing the analytical
challenges that accompany experiments. No experiment should be designed
without thinking through these issues, and every experiment should be judged by
how well it deals with them.
Social scientific experiments can’t answer all social scientific research
questions for other reasons as well. The first is that experiments aren’t always
feasible. The financial costs of many experiments are beyond what most major
research organizations can fund, let alone what a student doing a term paper can
afford. And for many important questions, it’s not a matter of money. Do we
want to know if corruption promotes civil unrest? Good luck with our proposal to
randomly end corruption in some countries and not others. Do we want to know
if birthrates affect crime? Are we really going to randomly assign some regions
to have more babies? While the randomizing process could get interesting, we’re
unlikely to pull it off. Or do we want to know something historical? Forget about
an experiment.9
And even if an experiment is feasible, it might not be ethical. We see this
dilemma most clearly in medicine: If we believe a given treatment is better but
are not sure, how ethical is it to randomly subject some people to a procedure that
might not work? The medical community has developed standards relating to level
of risk and informed consent by patients, but such questions will never be easy to
answer.
Consider (again) flu shots. We may think that assessing the efficacy of this
public health measure is a situation made for a randomized experiment. It would
be expensive but conceptually simple. Get a bunch of people who want a flu shot,
tell them they are participating in a random experiment, and randomly give some
a flu shot and the others a placebo shot. Wait and see how the two groups do.
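The assignment step in this hypothetical trial is mechanically simple. Here is a minimal sketch in Python (the book's own examples use Stata and R; the participant names, fixed seed, and 50/50 split are illustrative assumptions, not from the text):

```python
import random

def assign_groups(participants, seed=42):
    # Shuffle a copy of the participant list, then split it in half:
    # the first half gets the flu shot, the second half the placebo.
    # The seed is fixed only so the illustration is reproducible.
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

people = ["person_%d" % i for i in range(100)]
flu_shot, placebo = assign_groups(people)
print(len(flu_shot), len(placebo))  # 50 50
```

Because assignment depends only on the shuffle, nothing about a participant influences which group she lands in; that is what makes the treatment exogenous.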
But would such a randomized trial of flu vaccine be ethical? When we say
“Wait and see how the two groups do,” we actually mean “Wait and see who dies.”
9 The range of randomized controlled trials can be astounding, though, ranging from a study of
layoffs (randomized!) (Heinz, Jeworrek, Mertins, Schumacher, and Sutter 2017) to a study of
epidural pain-relief for women in childbirth (Shen, Li, Xu, Wang, Fan, Qin, Zhou, and Hess 2017).
Here’s how I picture the randomized epidural study went down:
Doctor: About your pain relief during labor. Or should I say [makes air quote gesture] “pain
relief”. . .
Post-delivery mother: [punches doctor in nose]
Doctor: Ok, well yeah, that’s fair . . .
1.3 Randomized Experiments as the Gold Standard 21
That changes the stakes a bit, doesn’t it? The public health community strongly
believes in the efficacy of the flu vaccine and, given that belief, considers it
unethical to deny people the treatment. Brownlee and Lenzer (2009) recount in
The Atlantic how one doctor first told interviewers that a randomized trial might
be acceptable, then got cold feet and called back to say that such an experiment
would be unethical.10
generalizable: A statistical result is generalizable if it applies to populations beyond the sample in the analysis.

Finally, experimental results may not be generalizable. That is, a specific experiment may provide great insight into the effect of a given policy intervention at a given time and place, but how sure can we be that the same policy intervention will work somewhere else? Jim Manzi, the author of Uncontrolled (2012), argues
that the most honest way to describe experimental results is that treatment X was
effective in a certain time and place in which the subjects had the characteristics
they did and the policy was implemented by people with the characteristics
they had. Perhaps people in different communities respond to treatments dif-
ferently. Or perhaps the scale of an experiment could matter: a treatment that
worked when implemented on a small scale might fail if implemented more
broadly.
internal validity: A research finding is internally valid when it is based on a process that is free from systematic error.

external validity: A research finding is externally valid when it applies beyond the context in which the analysis was conducted.

Econometricians make this point by distinguishing between internal validity and external validity. Internal validity refers to whether the inference is biased; external validity refers to whether an inference applies more generally. A well-executed experiment will be internally valid, meaning that the results will on average lead us to make the correct inferences about the treatment and its outcome in the context of the experiment. In other words, with internal validity, we can say confidently that our research design will not systematically lead us astray (even as randomness could point to incorrect conclusions for any given analysis). Even with internal validity, however, an experiment may not be externally valid: the causal relationship between the treatment and the outcome could differ in other contexts. That is, even if we have internally valid evidence from an experiment that aardvarks in Alabama procreate more if they listen to Mozart, we can't really be sure aardvarks in Alaska will respond in the same way.
observational studies: Use data generated in an environment not controlled by a researcher. They are distinguished from experimental studies and are sometimes referred to as non-experimental studies.

Hence, even as experiments offer a conceptually clear approach to defeating endogeneity, they cannot always offer the final word for economic, policy, and political research. Therefore, most scholars in most fields need to grapple with non-experimental data. Observational studies use data that has been generated by non-experimental processes. In contrast to randomized experiments in which a researcher controls at least one of the variables, in observational studies the data is what it is, and we do the best we can to analyze it in a sensible way. Endogeneity will be a chronic problem, but we are not totally defenseless in the fight against it. Even if we have only observational data, the techniques explained in this book can help us achieve, or at least approximate, the exogeneity promised by randomized experiments.

10 Another flu researcher cited in the article came to the opposite conclusion, saying, "What do you do when you have uncertainty? You test . . . We have built huge, population-based policies on the flimsiest of scientific evidence. The most unethical thing to do is to carry on business as usual."
REMEMBER THIS
1. Experiments create exogeneity via randomization.
2. Social science experiments are complicated by practical challenges associated with the
difficulty of achieving randomization and full participation.
3. Experiments are not always feasible, ethical, or generalizable.
4. Observational studies use non-experimental data. They are necessary to answer many
questions.
Discussion Questions
1. Is it possible to have a non-random exogenous independent variable?
2. Think of a policy question of interest. Discuss how an experiment might work to address the
question.
3. Does foreign aid work? How should we create an experiment to assess whether aid to very poor
countries works? What might some of the challenges be?
4. Do political campaigns matter? How should we create an experiment to assess whether phone
calls, mailings, and visits by campaign workers matter? What might some of the challenges be?
5. How are health and medical spending affected when people have to pay each time they see
a doctor? How should we create an experiment to assess whether the amount of co-payments
(payments tendered at every visit to a doctor) affects health costs and quality? What might some
of the challenges be?
Conclusion
The point of econometric research is almost always to learn if X (the independent
variable) causes Y (the dependent variable). If we see high values of Y when
X is high and low values of Y when X is low, we might be tempted to think
X causes Y. We need always to be aware that the observed relationship could
have arisen by chance. Or, if X is endogenous, we need to remember that
interpreting the relationship between X and Y as causal could be wrong, possibly
completely wrong. When another factor both causes Y and is correlated with X,
any relationship we see between X and Y may be due to the effect of that other
factor.
We spend the rest of this book accounting for uncertainty and battling
endogeneity. Some approaches, like randomized experiments, seek to create
exogenous change. Other econometric approaches, like multivariate regression,
winnow down the number of other factors lurking in the background that can
cause endogeneity. These and other approaches have strengths, weaknesses,
tricks, and pitfalls. However, they all are united by a fundamental concern with
counteracting endogeneity. Therefore, if we understand the concepts in this
chapter, we understand the essential challenges of using econometrics to better
understand policy, economics, and politics.
Based on this chapter, we are on the right track if we can do the following:
• Section 1.2: Explain how randomness can make causal inference challeng-
ing, and explain how endogeneity can undermine causal inference.
Key Terms
Constant (4) External validity (21) Randomized controlled trial
Control group (19) Generalizable (21) (19)
Correlation (9) Independent variable (2) Scatterplot (3)
Dependent variable (2) Intercept (4) Slope coefficient (4)
Endogenous (8) Internal validity (21) Treatment group (19)
Error term (5) Observational studies (21)
Exogenous (9) Randomization (19)
2 Stats in the Wild: Good Data Practices
[Figure: two-panel bar chart of real GDP growth (percent) by government debt category (0−30%, 30−60%, 60−90%, above 90% of GDP), panels (a) and (b).]
Growth didn't plummet once government debt passed 90 percent of GDP. While people
can debate whether the slope in panel (b) is a bunny hill or an intermediate hill, it
clearly is nothing like the cliff in the data originally reported.1
Reinhart and Rogoff’s discomfort can be our gain when we realize that even
top scholars can make data mistakes. Hence, we need to create habits that help
us minimize mistakes and maximize the chance that others can find them if
we do.
This chapter focuses on the crucial first steps for any econometric analysis.
First, we need to understand our data. Section 2.1 introduces tools for describing
data and sniffing out possible errors or anomalies. Second, we need to be
prepared to convince others. If others can’t recreate our results, people shouldn’t
1 A deeper question is whether we should treat this observational data as having any causal force.
Government debt levels are probably related to other factors that affect economic growth, like wars
and the quality of a country’s institutions. In other words, government debt likely is endogenous,
meaning that we probably can’t draw any conclusions about the effects of debt on growth without
implementing techniques we cover later in this book.
believe them. Therefore, Section 2.2 helps us establish good habits so that our
code is understandable to ourselves and others. Finally, we sure as heck aren’t
going to do all this work by hand. Therefore, Section 2.3 introduces two major
statistical software programs, Stata and R. This chapter is short because we’ll also
be spending time getting used to our software.
2 Chris Achen (1982, 53) memorably notes, "If the information has been coded by nonprofessionals
and not cleaned at all, as often happens in policy analysis projects, it is probably filthy.”
3 Appendix C contains more details (page 539). Here's a quick refresher. The standard deviation of X is a measure of the dispersion of X. The larger the standard deviation, the more spread out the values. Standard deviation is calculated as √((1/N) Σi (Xi − X̄)²), where X̄ is the mean of X. We record how far each observation is from the mean. We then square each value because for the purposes of calculating dispersion, we don't distinguish whether a value is below the mean or above it; when squared, all these values become positive numbers. We record the average of these squared values. Finally, since they're squared values, taking the square root of the average brings the final value back to the scale of the original variable.
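The footnote's recipe can be followed step by step in code. A quick Python sketch (the data values below are made up for illustration):

```python
import math

def std_dev(x):
    # Follow the footnote's steps: distance of each observation from
    # the mean, squared, averaged, then square-rooted to bring the
    # result back to the original scale. Dividing by N gives the
    # population standard deviation, as in the formula above.
    mean = sum(x) / len(x)
    squared_devs = [(xi - mean) ** 2 for xi in x]
    return math.sqrt(sum(squared_devs) / len(x))

print(std_dev([2, 4, 4, 4, 5, 5, 7, 9]))  # 2.0
```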
2.1 Know Our Data 27
REMEMBER THIS
1. A useful first step toward understanding data is to review sample size, mean, standard deviation,
and minimum and maximum for each variable.
2. Plotting data is useful for identifying patterns and anomalies in data.
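As a sketch of that first step, the five summary statistics can be computed in a few lines of Python (the toy values stand in for a real variable; the Computing Corner shows the Stata and R equivalents):

```python
import statistics

# Toy values standing in for a variable such as donuts consumed
# (made up for illustration, not the book's Springfield data).
donuts = [0, 1, 3, 5, 8, 12, 20]

summary = {
    "n": len(donuts),
    "mean": statistics.mean(donuts),
    "std_dev": statistics.pstdev(donuts),  # population SD, dividing by N
    "min": min(donuts),
    "max": max(donuts),
}
print(summary)
```

Scanning the minimum and maximum is often the quickest way to spot an impossible value, such as a negative age or a 100 in a 0/1 variable.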
[Figure 2.2: Scatterplot of weight (in pounds) against donuts eaten, with Springfield residents labeled by name (Homer, Comic Book Guy, Chief Wiggum, Marge, Bart, Lisa, and others).]
2.2 Replication
replication: Research that meets a replication standard can be duplicated based on the information provided at the time of publication.

replication files: Files that document exactly how data is gathered and organized. When properly compiled, these files allow others to reproduce our results exactly.

At the heart of scientific knowledge is replication. Research that meets a replication standard can be duplicated based on the information provided at the time of publication. In other words, an outsider who used that information would produce identical results.

We need replication files to satisfy this standard. Replication files document exactly how data is gathered and organized. Properly constructed, these files allow others to check our work by following our steps and seeing if they get identical results.

Replication files also enable others to probe our analysis. Sometimes, often in fact, statistical results hinge on seemingly small decisions about what data to include, how to deal with missing data, and so forth. People who really care about getting the answer right will want to see what we've done to our data and, realistically, will be wary until they determine for themselves that other reasonable ways of doing the analysis produce similar results. If a certain coding or statistical
4 We analyze this data on page 74.
REMEMBER THIS
1. Analysis that cannot be replicated cannot be trusted.
2. Replication files document data sources and methods that someone could use to exactly recreate
the analysis in question from scratch.
3. Replication files also allow others to explore the robustness of results by enabling them to assess
alternative approaches to the analysis.
5 Despite the fact that more people live in Washington, DC, than in Vermont or Wyoming! Or so says
the resident of Washington, DC . . .
[Figure: three scatterplots of the violent crime rate (per 100,000 people), with states labeled by postal abbreviation; DC appears as a high outlier in each panel.]
FIGURE 2.3: Scatterplots of Violent Crime against Percent Urban, Single Parent, and Poverty
reality of the urbanization variable helps us better appreciate what the data is
telling us.
Being aware of the data can help us detect possible endogeneity. Many of the
states showing high single-parent populations and high poverty are in the South.
If this leads us to suspect that southern states are distinctive in other social and
political characteristics, we should be on high alert for potential endogeneity in any
analysis that uses the poverty or single-parent variable. These variables capture not
only poverty and single parenthood, but also “southernness.”
REMEMBER THIS
1. Stata is a powerful statistical software program. It is relatively user friendly, but it can be
expensive.
2. R is another powerful statistical software program. It is less user friendly, but it is free.
Conclusion
This chapter prepares us for analyzing real data. We begin by understanding our
data. This vital first step makes sure that we know what we’re dealing with. We
should use descriptive statistics to get an initial feel for how much data we have
and the scales of the variables. Then we should graph our data. It’s a great way to
appreciate what we’re dealing with and to spot interesting patterns or anomalies.
The second step of working with data is documenting our data and analysis.
Social science depends crucially on replication. Analyses that cannot be replicated
6 In the Further Reading section at the end of the chapter, we indicate some good sources for learning
Stata and R and mention some other statistical packages in use.
cannot be trusted. Therefore, all statistical projects should document data and
methods, ensuring that anyone (including the author!) can recreate all results.
We are on track with the key concepts in this chapter when we can do the
following:
• Section 2.2: Explain the importance of replication and the two elements of
a replication file.
• Section 2.3 (and Computing Corner that follows): Do basic data description
in Stata and R.
Further Reading
King (1995) provides an excellent discussion of the replication standard.
Data visualization is a growing field, with good reason, as analysts increas-
ingly communicate primarily via figures. Tufte (2001) is a landmark book.
Schwabish (2004) and Yau (2011) are nice guides to graphics.
Chen, Ender, Mitchell, and Wells (2003) is an excellent online resource for
learning Stata. Gaubatz (2015) is an accessible and comprehensive introduction to
R. Other resources include Verzani (2004) and online tutorials.
Other programs are widely used as well. EViews is a powerful program often
chosen by those doing forecasting models (see eviews.com). Some people use
Excel for basic statistical analysis. It’s definitely useful to have good Excel skills,
but to do serious analysis, most people will need a more specialized program.
Key Terms
Codebook (29) Replication files (28) Standard deviation (26)
Replication (28) Robust (30)
Computing Corner
Stata
• The first thing to know is what to do when we get stuck (when, not if).
In Stata, type help commandname if you have questions about a certain
command. For example, to learn about the summarize command, we can
type help summarize to get a description of that command. Probably the
most useful information comes in the form of the examples at the end of
these files. Often the best approach is to find an example that seems closest
to what we’re trying to do and apply that example to the problem. Googling
usually helps, too.
• A comment line is a line in the code that provides notes for the user. A
comment line does not actually tell Stata to do anything, but it can be
incredibly useful to clarify what is going on in the code. Comment lines
in Stata begin with an asterisk (*). Using ** makes it easier to visually
identify these crucial lines.
• One of the hardest parts of learning new statistical software is loading data
into a program. While some data sets are prepackaged and easy, many
are not, especially those we create ourselves. Be prepared for the process
of loading data to take longer than expected. And because data sets can
sometimes misbehave (columns shifting in odd ways, for example), it is
very important to use the descriptive statistics diagnostics described in this
chapter to make sure the data is exactly what we think it is.
– To load Stata data files (which have .dta at the end of the file name), use the use command:
use "C:\Users\SallyDoe\Documents\DonutData.dta"
The "path" tells the computer where to find the file. In this example, the path is C:\Users\SallyDoe\Documents\. The exact path depends on a computer's file structure.
– To load raw text data files (such as .raw or .csv files), use the insheet command:
insheet using "C:\Users\SallyDoe\Documents\DonutsData.raw"
• To see a list of variables loaded into Stata, look at the variable window that
lists all variables. We can also click on Data – Data editor to see variables.
• To make sure the data loaded correctly, display it with the list command.
To display the first 10 observations of all variables, type list in 1/10. To
display the first eight observations of only the weight variable, type list
weight in 1/8. We can also look at the data in Stata’s “Data Browser”
by going to Data/Data editor in the toolbar.
• To see descriptive statistics on the weight and donut data as in Table 2.1,
use summarize weight donuts.
• To produce a frequency table such as Table 2.2, type tabulate male. Use
this command only for variables that take on a limited number of possible
values.
• Use the if subcommand to limit the data used in Stata analyses. The
syntax list name if male == 1 will list the names of individuals who
are male. The syntax list name if male != 1 will list the names of
individuals who are not male. The syntax list name if male == 1 &
age > 18 will list the names of individuals who are male and over 18.
The syntax list name if male == 1 | age > 18 will list the names
of individuals who are male or over 18.
• To plot the weight and donut data as in Figure 2.2, type scatter weight donuts. There are many options for creating figures. For example, to plot the weight and donut data for males only with labels from a variable called "name," type scatter weight donuts if male == 1, mlabel(name).
R
• To get help in R, type ?commandname for questions about a certain
command. For questions about the mean command, type ?mean to get
a description of the command, options, and most importantly, examples.
Often the best approach is to find an example that seems closest to what
we’re trying to do and apply that example to the problem. Googling usually
helps, too.
• Comment lines in R begin with a pound sign (#). Using ## makes it easier
to visually identify these crucial lines.
• To open a syntax file where we document our analysis, click on File – New
script. It’s helpful to resize this window to be able to see both the commands
and the results. Save the syntax file as “SomethingSomething.R”; the more
informative the name, the better. Including the date in the file name aids
version control. To run any command in the syntax file, highlight the whole
line and then press ctrl-r. The results of the command will be displayed in
the R console window.
• To load R data files (which have .RData at the end of the file name), the
easiest option is to save the file to your computer and then to use the File –
Load Workspace menu option in the R console (where we see results) and
browse to the file. You will see the R code to load the data in the R console
and can paste that to your syntax file.
• Loading non-R data files (files that are in .txt or other such format) requires
more care. For example, to read in data that has commas between variables
on each line, use read.table with the sep option:
RawData = read.table("C:/Users/SallyDoe/Documents/DonutData.raw", header=TRUE, sep=",")
(In R, write file paths with forward slashes or doubled backslashes; single backslashes as in Stata paths will cause errors.) This command saves variables as RawData$VariableName (e.g., RawData$weight, RawData$donuts). It is also possible to install special commands that load in various types of data. For example, search the Web for "read.dta" to see more information on how to install a special command that reads Stata files directly into R.
• To make sure the data loaded correctly, use the following tools to display
the data in R:
1. Use the objects() command to show the variables and objects loaded
into R.
• There are many useful tools to limit the sample. The syntax donuts[male
== 1] tells R to use only values of donuts for which male equals 1. The
syntax donuts[male != 1] tells R to use only values of donuts for which
male does not equal 1. The syntax donuts[male == 1 & age > 18]
tells R to use only values of donuts for which male equals 1 and age is
7 R can load variables directly such that each variable has its own variable name. Or it can load
variables as part of data frames such that the variables are loaded together. For example, our
commands to load the .RData file loaded each variable separately, while our commands to load data
from a text file created an object called “RawData” that contains all the variables. To display a
variable in the “RawData” object called “donuts,” type RawData$donuts in the .R file, highlight it,
and press ctrl-r. This process may take some getting used to, but if you experiment freely with any
data set you load, it should become second nature.
greater than 18. The syntax donuts[male == 1 | age > 18] tells R to
use only values of donuts for which male equals 1 or age is greater than 18.
• To plot the weight and donut data as in Figure 2.2, type plot(donuts,
weight). For example, to plot the weight and donut data for males only
with labels from a variable called “name,” type
plot(donuts[male == 1], weight[male == 1])
text(donuts[male == 1], weight[male == 1], name[male == 1]).
There are many options for creating figures.8
Exercises
1. The data set DonutDataX.dta contains data from our donuts example on
page 26. There is one catch: each of the variables has an error. Use the
tools discussed in this chapter to identify the errors.
year Year
medals Total number of combined medals won
athletes Number of athletes in Olympic delegation
GDP Gross domestic product of country (per capita GDP in $10,000 U.S. dollars)
temp Average high temperature (in Fahrenheit) in January if country is in Northern
Hemisphere or July if Southern Hemisphere (for largest city)
population Population of country (in 100,000)
8 To get a flavor of plotting options, use text(donuts[male == 1], weight[male == 1],
name[male == 1], cex=0.6, pos=4) as the second line of the plot sequence of code. The cex
command controls the size of the label, and the pos=4 puts the labels to the right of the plotted point.
Refer to the help menus in R, or Google around for more ideas.
(b) List the first five observations for the country, year, medals, athletes,
and GDP data.
(e) Explain any suspicion you might have that other factors could
explain the observed relationship between the number of athletes and
medals.
(f) Create a scatterplot of medals and GDP. Briefly describe any clear
patterns.
(a) Summarize the wage, height (both height85 and height81), and
sibling variables. Discuss briefly.
TABLE 2.7 Variables for Height and Wage Data in the United States
Variable name Description
(c) Create a scatterplot of wages and adult height that excludes the
observations with wages above $500 per hour.
4. Anscombe (1973) created four data sets that had interesting properties.
Let’s use tools from this chapter to describe and understand these data
sets. The data is in a Stata data file called AnscombesQuartet.dta. There are
four possible independent variables (X1–X4) and four possible dependent
variables (Y1–Y4). Create a replication file that reads in the data and
implements the analysis necessary to answer the following questions.
Include comment lines that explain the code.
(a) Briefly note the mean and variance for each of the four X variables.
Briefly note the mean and variance for each of the four Y variables.
Based on these, would you characterize the four sets of variables as
similar or different?
(b) Create four scatterplots: one with X1 and Y1, one with X2 and Y2,
one with X3 and Y3, and one with X4 and Y4.
(c) Briefly explain any differences and similarities across the four
scatterplots.
PART I
1 The figure is an updated version of a figure in Noel (2010). The figure plots vote share as a percent
of the total votes given to Democrats and Republicans only. We use these data to avoid the
complication that in some years, third-party candidates such as Ross Perot (in 1992 and 1996) or
George Wallace (in 1968) garnered non-trivial vote share.
2 In the late nineteenth century, Francis Galton used the term regression to refer to the phenomenon
that children of very tall parents tended to be less tall than their parents. He called this phenomenon
“regression to the mean” in heights of children because children of tall parents tend to “regress”
(move back) to average heights. Somehow the term regression bled over to cover statistical methods
for analyzing relationships between dependent and independent variables. Go figure.
46 CHAPTER 3 Bivariate OLS: The Foundation of Econometric Analysis
[Figure: scatterplot of the incumbent party's vote percent (roughly 45 to 62) against income growth in percent (−1 to 6), with each point labeled by election year, 1948 through 2016.]
FIGURE 3.1: Relationship between Income Growth and Vote for the Incumbent President’s Party,
1948–2016
The OLS model allows us to quantify the relationship between two variables
and to assess whether the relationship occurred by chance or resulted from some
real cause. We build on these methods in the rest of the book in ways that help us
differentiate, as best we can, true causes from simple associations.
In this chapter, we learn how to draw a regression line and understand
the statistical properties of the OLS model. Section 3.1 shows how to estimate
coefficients in an OLS model and how those coefficients relate to the regression
line we can draw in scatterplots of our data. Section 3.2 demonstrates that the
OLS coefficient estimates are themselves random variables. Section 3.3 explains
one of the most important concepts in statistics: the OLS estimates of β1 will be
biased if X is endogenous. That is, the estimates will be systematically higher
or lower than the true values if the independent variable is correlated with the
error term. Section 3.4 shows how to characterize the precision of the OLS
estimates. Section 3.5 shows how the distribution of the OLS estimates converges to a point as the sample size gets very, very large. Section 3.6 discusses issues that complicate the calculation of the precision of our estimates. These issues have intimidating names like heteroscedasticity and autocorrelation. Their bark is worse than their bite, however, and statistical software can easily address them. Finally, Sections 3.7 and 3.8 discuss tools for assessing how well the model fits the data and whether any unusual observations could distort our conclusions.

3.1 Bivariate Regression Model

Yi = β0 + β1Xi + εi (3.1)
where Yi is the dependent variable and Xi is the independent variable. The parameter β0 is the intercept (or constant). It indicates the expected value of Y when Xi is zero. The parameter β1 is the slope. It indicates how much Y changes as X changes. The random error term εi captures everything else other than X that affects Y.
Adapting the generic bivariate equation to the presidential election example produces

Incumbent party vote sharei = β0 + β1Income changei + εi (3.2)

where Incumbent party vote sharei is the dependent variable and Income changei is
the independent variable. The parameter β0 indicates the expected vote percentage
for the incumbent when income change equals zero. The parameter β1 indicates
how much more we expect vote share to rise as income change increases by one
unit.
This model is an incredibly simplified version of the world. The data will not
fall on a completely straight line because elections are affected by many other
factors, ranging from wars to scandals to social issues and so forth. These factors
comprise our error term, εi.
For any given data set, OLS produces estimates of the β parameters that best
explain the data. We indicate estimates as β̂ 0 and β̂ 1 , where the “hats” indicate
that these are our estimates. Estimates are different from the true values, β0 and
β1 , which don’t get hats in our notation.3
How can these parameters best explain the data? The β̂’s define a line with an
intercept (β̂ 0 ) and a slope (β̂ 1 ). The task boils down to picking a β̂ 0 and β̂ 1 that
define the line that minimizes the aggregate distance of the observations from the
line. To do so, we use two concepts: the fitted value and the residual.
fitted value: A fitted value, Ŷi, is the value of Y predicted by our estimated equation. For a bivariate OLS model it is Ŷi = β̂0 + β̂1Xi. Also called predicted value.

The fitted value is the value of Y predicted by our estimated equation. The fitted value Ŷ (which we call "Y hat") from our bivariate OLS model is

Ŷi = β̂0 + β̂1Xi (3.3)

Note the differences from Equation 3.1: there are lots of hats and no εi. This is the equation for the regression line defined by the estimated β̂0 and β̂1 parameters and Xi.
regression line The
fitted line from a A fitted value tells us what we would expect the value of Y to be given the
regression. value of the X variable for that observation. To calculate a fitted value for any value
of X, use Equation 3.3. Or, if we plot the line, we can simply look for the value of
the regression line at that value of X. All observations with the same value of Xi
will have the same Ŷi , which is the fitted value of Y for observation i. Fitted values
are also called predicted values.
residual: The difference between the fitted value and the observed value.

A residual measures the distance between the fitted value and an actual observation. In the true model, the error, εi, is that part of Yi not explained by β0 + β1 Xi. The residual is the estimated counterpart to the error. It is the portion of Yi not explained by β̂0 + β̂1 Xi (notice the hats). If our coefficient estimates exactly equaled the true values, then the residual would be the error; in reality, of course, our estimates β̂0 and β̂1 will not equal the true values β0 and β1, meaning that our residuals will differ from the error in the true model.

The residual for observation i is ε̂i = Yi − Ŷi. Equivalently, we can say a residual is ε̂i = Yi − β̂0 − β̂1 Xi. We indicate residuals with ε̂ ("epsilon hat"). As with the β's, a Greek letter with a hat is an estimate of the true value. The residual ε̂i is distinct from εi, which is how we denote the true, but not directly observed, error.
Estimation
The OLS estimation strategy is to identify values of β̂ 0 and β̂ 1 that define
a line that minimizes the sum of the squared residuals. We square the resid-
uals because we want to treat a residual of +7 (as when an observed Yi is
7 units above the fitted line) as equally undesirable as a residual of −7 (as when
an observed Yi is 7 units below the fitted line). Squaring the residuals converts all
residuals to positive numbers. Our +7 residual and −7 residual observations will
both register as +49 in the sum of squared residuals.
3
Another common notation is to refer to estimates with regular letters rather than Greek letters
(e.g., b0 and b1 ). That’s perfectly fine, too, of course, but we stick with the hat notation for
consistency throughout this book.
3.1 Bivariate Regression Model 49
Specifically, the expression for the sum of squared residuals for any given estimates β̂0 and β̂1 is

Σᵢ₌₁ᴺ ε̂i² = Σᵢ₌₁ᴺ (Yi − β̂0 − β̂1 Xi)²
The OLS process finds the β̂ 1 and β̂ 0 that minimize the sum of squared
residuals. The "squares" in "ordinary least squares" come from the fact that we're squaring the residuals. The "least" bit comes from minimizing the sum of squares. The word "ordinary" indicates that we haven't progressed to anything
fancy yet.
As a practical matter, we don’t need to carry out the minimization
ourselves—we can leave that to the software. The steps are not that hard, though,
and we step through a simplified version of the minimization task in Chapter 14
(page 494). This process produces specific equations for the OLS estimates of β̂ 0
and β̂ 1 . These equations provide estimates of the slope (β̂ 1 ) and intercept (β̂ 0 )
combination that characterizes the line that best fits the data.
The OLS estimate of β̂1 is

β̂1 = [Σᵢ₌₁ᴺ (Xi − X̄)(Yi − Ȳ)] / [Σᵢ₌₁ᴺ (Xi − X̄)²]   (3.4)

where X̄ (read as "X bar") is the average value of X and Ȳ is the average value of Y.
Equation 3.4 shows that β̂1 captures how much X and Y move together. The numerator is Σᵢ₌₁ᴺ (Xi − X̄)(Yi − Ȳ). The first bit inside the sum is the difference of X from its mean for the ith observation; the second bit is the difference of Y from its mean for the ith observation. The product of these bits is summed over observations. So, if Y tends to be above its mean [meaning (Yi − Ȳ) is positive] when X is above its mean [meaning (Xi − X̄) is positive], there will be a bunch of positive elements in the sum in the numerator. If Y tends to be below its mean [meaning (Yi − Ȳ) is negative] when X is below its mean [meaning (Xi − X̄) is negative], we'll also get positive elements in the sum because a negative number times a negative number is positive. Such observations will also push β̂1 to be positive.
On the other hand, β̂1 will be negative when the signs of Xi − X̄ and Yi − Ȳ are mostly opposite. For example, if X is above its mean [meaning (Xi − X̄) is positive] when Y is below its mean [meaning (Yi − Ȳ) is negative], we'll get negative elements in the sum, and β̂1 will tend to be negative.4
4 There is a close affinity between the regression coefficient in bivariate OLS and covariance and correlation. By using the equations for variance and covariance from Appendices C and D (pages 539 and 540), we see that Equation 3.4 can be rewritten as cov(X, Y)/var(X). The relationship between covariance and correlation can be used to show that Equation 3.4 can equivalently be written as corr(X, Y) × σY/σX, which indicates that the bivariate regression coefficient is simply a rescaled correlation coefficient.
The OLS estimate of the intercept is

β̂0 = Ȳ − β̂1 X̄   (3.5)

We focus on the equation for β̂1 because this is the parameter that defines the relationship between X and Y, which is what we usually care most about.
For the presidential election data, OLS produces the estimated equation

Incumbent party vote sharei = β̂0 + β̂1 × Income changei
                            = 46.1 + 2.2 × Income changei
Figure 3.2 shows what these coefficient estimates mean. The β̂ 1 estimate
implies that the incumbent party’s vote percentage went up by 2.2 percentage
points for each one-percent increase in income. The β̂ 0 estimate implies that the
expected election vote share for the incumbent president’s party for a year with
zero income growth was 46.1 percent.
Table 3.1 and Figure 3.3 show predicted values and residuals for specific
presidential elections. In 2016, income growth was low (at 0.69 percent). The
value of the dependent variable for 2016 was the vote share of Hillary Clinton,
who, as a Democrat, was in the same party as the incumbent president, Barack
Obama. Hillary Clinton received 51.1 percent of the vote. The fitted value, denoted
by a triangle in Figure 3.3, is 46.1 + 2.2 × 0.69 = 47.6. The residual, which
is the difference between the actual and fitted, is 51.1 − 47.6 = 3.5 percent.
In other words, Hillary Clinton did 3.5 percentage points better than would
be expected based on the regression line in 2016. Think of that as her
“Trump bump.”
We can go through the same process to understand the fitted values and residuals displayed in Figure 3.3 and Table 3.1. In 2000, the fitted value based
on the regression line is 46.1 + 2.2 × 3.87 = 54.6. The residual, which is the
difference between the actual and the fitted, is 50.2 − 54.6 = −4.4 percent. The
negative residual means that Al Gore, who, as a Democrat, was the candidate
of the incumbent president’s party, did 4.4 percentage points worse than would
be expected based on the regression line. In 1964, the Democrats controlled the
presidency at the time of the election, and they received 61.3 percent of the
vote when Democrat Lyndon Johnson trounced Republican Barry Goldwater.
The correlation coefficient indicates the strength of the association, while the bivariate regression
coefficient indicates the effect of a one-unit increase in X on Y. It’s a good lesson to remember. We all
know “correlation does not imply causation”; this little nugget tells us that bivariate regression (also!)
does not imply causation. Appendix E provides additional details (page 541).
FIGURE 3.2: Elections and Income Growth with Model Parameters Indicated. The scatterplot shows the incumbent party's vote percent against the percent change in income for elections from 1948 to 2016, with the regression line's intercept (β0 = 46.1) and slope (β1 = 2.2) labeled.
The fitted value based on the regression line is 46.1 + 2.2 × 5.63 = 58.5. The residual, which is the difference between the actual and the fitted, is 61.3 − 58.5 = 2.8 percent. In other words, in 1964 the incumbent president's party did 2.8 percentage points better than would be expected based on the regression line.
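The fitted-value and residual arithmetic above is easy to check by hand. Here is a short Python sketch (the chapter's Computing Corner uses Stata and R; Python is used here only to keep the example self-contained) that reproduces the numbers in the text:

```python
# Recompute the fitted values and residuals discussed in the text,
# using the estimated intercept (46.1) and slope (2.2) from the chapter.
b0, b1 = 46.1, 2.2

elections = {
    # year: (income growth, incumbent-party vote share), from the text
    2016: (0.69, 51.1),
    2000: (3.87, 50.2),
    1964: (5.63, 61.3),
}

for year, (x, y) in elections.items():
    fitted = b0 + b1 * x   # Y-hat = b0 + b1 * X
    residual = y - fitted  # residual = actual minus fitted
    print(year, round(fitted, 1), round(residual, 1))
```

Running this recovers the fitted values 47.6, 54.6, and 58.5 and the residuals 3.5, −4.4, and 2.8 discussed above.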
FIGURE 3.3: Fitted Values and Residuals for Observations in Table 3.1. The plot shows the incumbent party's vote percent against the percent change in income, with the fitted value and residual for 1964 and the residual for 2016 marked.
REMEMBER THIS
1. The bivariate regression model is
   Yi = β0 + β1 Xi + εi
2. OLS chooses the estimates β̂0 and β̂1 that minimize the sum of squared residuals,
   Σᵢ₌₁ᴺ ε̂i² = Σᵢ₌₁ᴺ (Yi − β̂0 − β̂1 Xi)²
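Equations 3.4 and 3.5 can be applied directly. The Python sketch below uses made-up data (not the book's election data set) to compute β̂1 and β̂0 by hand and checks them against a library routine that minimizes the same sum of squared residuals:

```python
import numpy as np

# Hypothetical data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Equation 3.4: slope = sum((Xi - Xbar)(Yi - Ybar)) / sum((Xi - Xbar)^2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# Equation 3.5: intercept = Ybar - b1 * Xbar
b0 = y.mean() - b1 * x.mean()

residuals = y - (b0 + b1 * x)
print(b1, b0, np.sum(residuals ** 2))  # slope, intercept, sum of squared residuals

# Sanity check: np.polyfit performs the same least-squares minimization.
assert np.allclose(np.polyfit(x, y, 1), [b1, b0])
```

A side effect of the minimization is that the residuals automatically average to zero, a fact used later in footnote 16.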
modeled randomness: Variation attributable to inherent variation in the data-generation process. This source of randomness exists even when we observe data for an entire population.

Second, our estimates will have modeled randomness. Think again of the population of ferrets. Even if we were to get data on every last one of them, our model has random elements. The ferret sleep patterns (the dependent variable) are subject to randomness that goes into the error term. Maybe one ferret had a little too much celery, another got stuck in a drawer, and yet another broke up with his girlferret. Unmeasured factors denoted by εi affect ferret sleep, and having data on every single ferret would not change that fact.
In other words, there is inherent randomness in the data-generation process
even when data is measured for an entire population. So, even if we observe a
complete population at any given time, thus eliminating any sampling variation,
we will have randomness due to the data-generation process. Put another way,
virtually every model has some unmeasured component that explains some of
the variation in our dependent variable, and the modeled-randomness perspective
highlights this.
random variable: A variable that takes on values in a range and with the probabilities defined by a distribution.

An OLS estimate β̂1 inherits randomness, whether from sampling or modeled randomness. The estimate β̂1 is therefore a random variable—that is, a variable that takes on a set of possible different values, each with some probability. An easy way to see why β̂1 is random is to note that the equation for β̂1 (Equation 3.4) depends on the values of the Yi's, which in turn depend on the εi values, which themselves are random.
Distributions of β̂ estimates

distribution: The range of possible values for a random variable and the associated relative probabilities for each value.

probability distribution: A graph or formula that gives the probability for each possible value of a random variable.

To understand these random β̂1's, it is best to think of the distribution of β̂1. That is, we want to think about the various values we expect β̂1 to take and the relative likelihood of these values.
Let's start with random variables more generally. A random variable with discrete outcomes can take on one of a finite set of specific outcomes. The flip of a coin or roll of a die yields a random variable with discrete outcomes. These random variables have probability distributions. A probability distribution is a graph or formula that identifies the probability for each possible value of a random variable.
Many probability distributions of random variables are intuitive. We all know the distribution of a coin toss: heads with 50 percent probability and tails with 50 percent probability. Panel (a) of Figure 3.4 plots this distribution, with the outcome on the horizontal axis and the probability on the vertical axis. We also know the distribution of the roll of a six-sided die. There is a 1/6 probability of seeing each of the six numbers on it, as panel (b) of Figure 3.4 shows. These are examples of random variables with a specific number of possible outcomes: two (as with a coin toss) or six (as with a roll of a die).
continuous variable: A variable that takes on any possible value over some range.

This logic of distributions extends to continuous variables. A continuous variable is a variable that can take on any value in some range. Weight in our donut example from Chapter 1 is essentially a continuous variable. Because weight can be measured to a very fine degree of precision, we can't simply say there is some specific number of possible outcomes. We don't identify a probability
3.2 Random Variation in Coefficient Estimates 55
for each possible outcome for continuous variables because there is an unlimited number of possible outcomes. Instead we identify a probability density, which is a graph or formula that describes the relative probability that a random variable is near a specified value for the range of possible outcomes for the random variable.

probability density: A graph or formula that describes the relative probability that a random variable is near a specified value.

FIGURE 3.4: Four Distributions. Panel (a) shows the coin-toss distribution, panel (b) the six-sided die distribution, panel (c) a normal probability density, and panel (d) an irregular probability density with two peaks.
normal distribution: A bell-shaped probability density that characterizes the probability of observing outcomes for normally distributed random variables.

Probability densities run the gamut from familiar to weird. On the familiar end of things is a normal distribution, which is the classic bell curve in panel (c) of Figure 3.4. This plot indicates the probability of observing realizations of the random variable in any given range. For example, since half of the area of the density shown in panel (c) is less than zero, we know that there is a 50 percent chance that this particular normally distributed random variable will be less than zero. Because the probability density is high in the middle and low on the ends, we can say, for example, that the normal random variable plotted in panel (c) is more likely to take on values around zero than values around −4. The odds of observing values around +1 or −1 are still reasonably high, but the odds of observing values near +3 or −3 are small.
Probability densities for random variables can have odd shapes, as in panel
(d) of Figure 3.4, which shows a probability density for a random variable that has
its most likely outcomes near 64 and 69.5 The point of panel (d) is to make it clear
that not all continuous random variables follow the bell-shaped distribution. We
could draw a squiggly line, and if it satisfied a few conditions, it, too, would be a
valid probability distribution.
If the concept of probability densities is new to you (or you are rusty on the
idea), read more on probability densities in Appendix F starting on page 541. The
normal density in particular will be important for us. Appendix G explains how
to work with the normal distribution, something that we will see again in the next
chapter.
5
The distribution of adult heights measured in inches looks something like this. What explains the
two bumps in the distribution?
6
If the errors in the model (the ’s) are normally distributed, then the β̂ 1 values will be normally
distributed no matter what the sample size is. Therefore, in small samples, if we could make
ourselves believe the errors are normally distributed, that belief would be a basis for treating the β̂ 1
values as coming from a normal distribution. Unfortunately, many people doubt that errors are
normally distributed in most empirical models. Some statisticians therefore pour a great deal of
energy into assessing whether errors are normally distributed (just Google “normality of errors”). But
we don’t need to worry about this debate as long as we have a large sample.
7
Some technical assumptions are necessary. For example, the “distribution” of the values of the error
term cannot consist solely of a single number.
3.3 Endogeneity and Bias 57
more 1s than usual, and those averages will tend to be closer to 3. Crucially, the
shape of the distribution will look more and more like a normal distribution the
larger our sample of averages gets.
Even though the central limit theorem is about averages, it is relevant for OLS.
Econometricians deriving the distribution of β̂ 1 invoke the central limit theorem
to prove that β̂ 1 will be normally distributed for a sufficiently large sample size.8
What sample size is big enough for the central limit theorem and, therefore,
normality to kick in? There is no hard-and-fast rule, but the general expectation is
that around 100 observations is enough. If we have data with some really extreme
outliers or other pathological cases, we may need a larger sample size. Happily,
though, the normality of the β̂ 1 distribution is a reasonable approximation even
for data sets with as few as 100 observations. Exercise 2 at the end of this chapter
provides a chance to see distributions of coefficients for ourselves.
REMEMBER THIS
1. Randomness in coefficient estimates can be the result of
• Sampling variation, which arises due to variation in the observations selected into the
sample. Each time a different random sample is analyzed, a different estimate of β̂ 1 will be
produced even though the population (or “true”) relationship is fixed.
• Modeled variation, which arises because of inherent uncertainty in outcomes. Virtually
any data set has unmeasured randomness, whether the data set covers all observations in a
population or some subsample (random or not).
2. The central limit theorem implies the β̂ 0 and β̂ 1 coefficients will be normally distributed random
variables if the sample size is sufficiently large.
8 One way to see why is to think of the OLS equation for β̂1 as a weighted average of the dependent variable. That's not super obvious, but if we squint our eyes and look at Equation 3.4, we see that we could rewrite it as β̂1 = Σᵢ₌₁ᴺ wi (Yi − Ȳ), where wi = (Xi − X̄) / Σᵢ₌₁ᴺ (Xi − X̄)². (We have to squint really hard!) In other words, we can think of the β̂1's as a weighted sum of the Yi's, where wi is the weight (and we happen to subtract the mean of Y from each Yi). It's not too hard to get from a weighted sum to an average. Doing so opens the door for the central limit theorem (which is, after all, about averages) to work its magic and establish that β̂1 will be normally distributed for large samples.
FIGURE 3.5: Distribution of β̂1. The probability density of β̂1 is centered at the true value, β1.

If the distribution of β̂1 happens to be quite wide, even though the average is the true value, we might still observe values of β̂1 that are far from the true value, β1.
Think of the figure skating judges at the Olympics. Some are biased—perhaps
blinded by nationalism or wads of cash—and they systematically give certain
skaters higher or lower scores than the skaters deserve. Other judges (most?) are
not biased. Still, these judges do not get the right answer every time.9 Sometimes
an unbiased judge will give a score that is higher than it should be, and sometimes
a score that is lower. Similarly, an OLS regression coefficient β̂ 1 that qualifies as
an unbiased estimate of β1 can be too high or too low in a given application.
Here are two thought experiments that shed light on unbiasedness. First, let’s
approach the issue from the sampling-randomness framework from Section 3.2.
Suppose we select a sample of people, measure some dependent variable Yi and
independent variable Xi for each, and use those to estimate the OLS β̂ 1 . We write
that down and then select another sample of people, get the data, estimate the
OLS model again, and write down the new estimate of β̂ 1 . The new estimate will
be different because we’ll have different people in our data set. Repeat the process
again and again, write down all the different β̂ 1 ’s, and then calculate the average
of the estimated β̂ 1 ’s. While any given realization of β̂ 1 could be far from the true
value, we will call the estimates unbiased if the average of the β̂ 1 ’s is the true
value, β1 .
We can also approach the issue from the modeled-randomness framework
from Section 3.2. Suppose we generate our data. We set the true β1 and β0 values
as some specific values. We also fix the value of Xi for each observation. Then we
draw the εi for each observation from some random distribution. These values will
come together in our standard equation to produce values of Y that we then use in
the OLS equation for β̂ 1 . Then we repeat the process of generating random error
terms (while keeping the true β and X values the same). Doing so produces another
set of Yi values and a different OLS estimate for β̂ 1 . We keep running this process
a bunch of times, writing down the β̂ 1 estimates from each run. If the average of
the β̂ 1 ’s we have recorded is equal to the true value, β1 , then we say that β̂ 1 is an
unbiased estimator of β1 .
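The modeled-randomness thought experiment translates directly into a simulation. In this Python sketch (all numbers invented), the true β's and the X values stay fixed while the errors are redrawn each run, exactly as described above:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fix the true parameters and the X values, redraw only the errors,
# and average the resulting OLS slope estimates.
beta0, beta1 = 1.0, 3.0
x = rng.uniform(0, 5, 50)  # X held fixed across repetitions

draws = []
for _ in range(4000):
    eps = rng.normal(0, 2, 50)      # fresh errors each run
    y = beta0 + beta1 * x + eps
    draws.append(np.polyfit(x, y, 1)[0])  # OLS slope estimate

# Any single draw can miss badly, but the average of the draws
# sits at the true value, which is what unbiasedness means.
print(np.mean(draws))
```

The exogeneity condition holds here by construction, since the errors are drawn independently of X; the next passage shows what goes wrong when it fails.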
OLS does not automatically produce unbiased coefficient estimates. A crucial
condition must be satisfied for OLS estimates to be unbiased: the error term cannot
be correlated with the independent variable. The exogeneity condition, which we
discussed in Chapter 1, is at the heart of everything. If this condition is violated,
then something in the error term is correlated with our independent variable and
will contaminate the observed relationship between X and Y. In other words, while
observing large values of Y associated with large values of X naturally inclines us
to think X pushes Y higher, we worry that something in the error term that is big
when X is big is actually what is causing Y to be high. In that case, the relationship
between X and Y is spurious, and the real causal influence is that unidentified factor
in the error term.
9
We’ll set aside for now the debate about whether a right answer even exists. Let’s imagine there is a
score that judges would on average give to a performance if the skater’s identity were unknown.
Violent crimet = β0 + β1 Ice cream salest + εt

where violent crime in period t is the dependent variable and ice cream sales
in period t is the independent variable. We’d find that β̂ 1 is greater than zero,
suggesting crime is indeed higher when ice cream sales go up.
Does this relationship mean that ice cream is causing crime? Maybe. But
probably not. OK, no, it doesn’t. So what’s going on? There are a lot of factors
in the error term, and one of them is probably truly associated with crime and
correlated with ice cream sales. Any guesses?
Heat. Heat makes people want ice cream and, it turns out, makes them cranky
(or gets them out of doors) such that crime goes up. Hence, a bivariate OLS model
with just ice cream sales will show a relationship, but because of endogeneity, this
relationship is really just correlation, not causation.
Characterizing bias
As a general matter, we can say that as the sample size gets large, the estimated
coefficient will on average be off by some function of the correlation between the
included variable and the error term. We show in Chapter 14 (page 495) that the
expected value of our bivariate OLS estimate is

E[β̂1] = β1 + corr(X, ε) × (σε / σX)   (3.8)

where E[β̂1] is short for the expectation of β̂1,11 corr(X, ε) is the correlation of X and ε, σε (the lowercase Greek letter sigma) is the standard deviation of ε, and σX is the standard deviation of X. The fraction at the end of the equation is more of a normalizing factor, so we don't need to worry too much about it.12
The key thing is the correlation of X and ε. The bigger this correlation, the further the expected value of β̂1 will be from the true value. Or, in other words, the more the independent variable and the error are correlated, the more biased OLS will be.
Much of the rest of this book centers on what to do if the correlation of X and ε is not zero. The ideal solution is to use randomized experiments
10 Why would we ever wonder that? Work with me here . . .
11 Expectation is a statistical term that essentially refers to the average value over many realizations of a random variable. We discuss the concept in Appendix C on page 539.
12 If we use corr(X, ε) = cov(X, ε)/(σX σε), we can write Equation 3.8 as E[β̂1] = β1 + cov(X, ε)/σX², where cov is short for covariance.
3.4 Precision of Estimates 61
for which corr(X1, ε) is zero by design. But in the real world, experiments often fall prey to challenges discussed in Chapter 10. For observational studies, which are more common than experiments, we'll discuss lots of tricks in the rest of this book that help us generate unbiased estimates even when corr(X1, ε) is non-zero.
REMEMBER THIS
1. The distribution of an unbiased estimator is centered at the true value, β1.
2. The OLS estimator β̂1 is a biased estimator of β1 if X and ε are correlated.
3. If X and ε are correlated, the expected value of β̂1 is β1 + corr(X, ε) × (σε / σX).
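Equation 3.8 can also be checked by simulation. In this Python sketch (an invented data-generation process, purely illustrative), X and the error share a common component, so OLS overstates the true slope by almost exactly corr(X, ε) × σε/σX:

```python
import numpy as np

rng = np.random.default_rng(2)

# X and the error both depend on a common factor z, violating exogeneity.
beta1, n = 2.0, 10_000
z = rng.normal(size=n)
x = z + rng.normal(size=n)     # X partly driven by z
eps = z + rng.normal(size=n)   # ...and so is the error
y = beta1 * x + eps

b1_hat = np.polyfit(x, y, 1)[0]  # OLS slope estimate

# Bias predicted by Equation 3.8: corr(X, eps) * sd(eps) / sd(X)
predicted_bias = np.corrcoef(x, eps)[0, 1] * eps.std() / x.std()
print(b1_hat - beta1, predicted_bias)  # both near 0.5 with this setup
```

In the sample, the gap between β̂1 and the true β1 matches the predicted bias to numerical precision, because both reduce to cov(X, ε)/var(X) (footnote 12).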
FIGURE 3.6: Two Distributions with Different Variances of β̂1. Both probability densities are centered at β1; one has a smaller variance and the other a larger variance.
The variance and standard error of an estimate contain the same information, just in different form, as the variance is simply the standard deviation squared. We'll see later that it is often more convenient to use standard errors to characterize the precision of estimates because they are on the same scale as the independent variable (meaning, for example, that if X is measured in feet, we can interpret the standard error in terms of feet as well).13
We prefer β̂ 1 to have a smaller variance. With a smaller variance, values close
to the true value are more likely, meaning we’re less likely to be far off when we
generate the β̂ 1 . In other words, our bowl of estimates will be less likely to have
wacky stuff in it.
Under the right conditions, we can characterize the variance (and, by
extension, the standard error) of β̂ 1 with a simple equation. We discuss the
conditions on page 67. If they are satisfied, the estimated variance of β̂ 1 for a
bivariate regression is
var(β̂1) = σ̂² / (N × var(X))   (3.9)
This equation tells us how wide our distribution of β̂ 1 is.14 We don’t need to
calculate the variance of β̂ 1 by hand. That is, after all, why we have computers.
13
The difference between standard errors and standard deviations can sometimes be confusing. The
standard error of a parameter estimate is the standard deviation of the sampling distribution of the
parameter estimate.
14
We derive a simplified version of the equation on page 499 in Chapter 14.
The σ̂² in the numerator of Equation 3.9 is estimated as

σ̂² = Σᵢ₌₁ᴺ (Yi − Ŷi)² / (N − k) = Σᵢ₌₁ᴺ ε̂i² / (N − k)   (3.10)
which is (essentially) the average squared deviation of fitted values of Y from the actual values. It's not quite an average because the denominator is N − k rather than N. The N − k in the denominator is the degrees of freedom, where k is the number of variables (including the constant) in the model.15

degrees of freedom: The sample size minus the number of parameters. It refers to the amount of information we have available to use in the estimation process.

The numerator of Equation 3.10 indicates that the more each individual observation deviates from its fitted value, the higher σ̂² will be. The estimated σ̂² is also an estimate of the variance of ε in our core model, Equation 3.1.16
Next, look at the denominator of the variance of β̂1 (Equation 3.9). It is N × var(X). Yawn. There are, however, two important substantive facts in there. First, the bigger the sample size (all else equal), the smaller the variance of β̂1. In other words, more data means lower variance. More data is a good thing.
Second, we see that the variance of X reduces the variance of β̂1. The variance of X is calculated as Σᵢ₌₁ᴺ (Xi − X̄)² / N. This puts the variance of β̂1 on the same scale as the variance of the X variable. It is also the case that the more our X variable varies, the more precisely we will be able to learn about β1.17
15
For bivariate regression, k = 2 because we estimate two parameters (β̂ 0 and β̂ 1 ). We can think of
the degrees of freedom correction as a penalty for each parameter we estimate; it’s as if we use up
some information in the data with each parameter we estimate and cannot, for example, estimate
more parameters than the number of observations we have. If N is large enough, the k in the
denominator will have only a small effect on the estimate of σ̂ 2 . For small samples, the degrees of
freedom issue can matter more. Every statistical package will get this right, and the core intuition is
that σ̂ 2 measures the average squared distance between actual and fitted values.
16 Recall that the variance of ε̂ will be Σᵢ₌₁ᴺ (ε̂i − mean(ε̂))² / N. The OLS minimization process automatically creates residuals with an average of zero (meaning mean(ε̂) = 0). Hence, the variance of the residuals reduces to Equation 3.10.
17 Here we're assuming a large sample. If we had a small sample, we would calculate the variance of X with a degrees of freedom correction such that it would be Σᵢ₌₁ᴺ (Xi − X̄)² / (N − 1).
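Equations 3.9 and 3.10 involve only sums, so the variance and standard error of β̂1 can be computed by hand. A Python sketch with made-up data (the particular numbers are invented; any small data set works for the arithmetic):

```python
import numpy as np

# Hypothetical data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.9, 3.1, 4.8, 5.2, 6.9])
n, k = len(x), 2  # k = 2 parameters: the intercept and the slope

# OLS estimates via Equations 3.4 and 3.5.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

sigma2_hat = np.sum(resid ** 2) / (n - k)  # Equation 3.10
var_x = np.mean((x - x.mean()) ** 2)       # large-sample variance of X
var_b1 = sigma2_hat / (n * var_x)          # Equation 3.9
se_b1 = np.sqrt(var_b1)                    # standard error of the slope
print(b1, se_b1)
```

Note that N × var(X) equals Σᵢ (Xi − X̄)², so Equation 3.9 is the familiar σ̂²/Σᵢ (Xi − X̄)² in disguise.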
FIGURE 3.7: Four Scatterplots (for Review Questions). Panels (a) through (d) each plot a dependent variable against an independent variable, for use with the review questions below.
Review Questions
1. Will the variance of β̂ 1 be smaller in panel (a) or panel (b) of Figure 3.7? Why?
2. Will the variance of β̂ 1 be smaller in panel (c) or panel (d) of Figure 3.7? Why?
3.5 Probability Limits and Consistency 65
REMEMBER THIS
1. The variance of β̂ 1 measures the width of the β̂ 1 distribution. If the conditions discussed later
in Section 3.6 are satisfied, then the estimated variance of β̂ 1 is
var(β̂1) = σ̂² / (N × var(X))
FIGURE 3.8: Distributions of β̂1 for Different Sample Sizes. The probability density of β̂1 is plotted for N = 10, N = 100, and N = 1,000; the density narrows around β1 as the sample size grows.
consistency: A consistent estimator is one for which the distribution of the estimate gets closer and closer to the true value as the sample size increases. The OLS estimate β̂1 consistently estimates β1 if X is uncorrelated with ε.

As the sample size grows, the distribution of β̂1 narrows toward a vertical line at the true value. If we had an infinite number of observations, we would get the right answer every time. That may be cold comfort if we're stuck with a sad little data set of 37 observations, but it's awesome when we have 100,000 observations.
Consistency is an important property of OLS estimates. An estimator, such as OLS, is a consistent estimator if the distribution of β̂1 estimates shrinks to be closer and closer to the true value, β1, as we get more data. If the exogeneity condition is true, then β̂1 is a consistent estimator of β1.18 Formally, we say

plim β̂1 = β1   (3.11)
REMEMBER THIS
1. The probability limit of an estimator is the value to which the estimator converges as the sample
size gets very, very large.
2. When the error term and X are uncorrelated, OLS estimates of β are consistent, meaning that
plim β̂ = β.
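Consistency can be seen in a simulation like the one behind Figure 3.8: as N grows, the spread of the β̂1 estimates shrinks toward zero. A Python sketch (invented parameter values, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Draw many samples of size n from a fixed true model (beta0 = 1, beta1 = 0.5)
# and record the OLS slope from each sample.
def slope_draws(n, reps=2000):
    out = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(0, 10, n)
        y = 1.0 + 0.5 * x + rng.normal(0, 3, n)
        out[r] = np.polyfit(x, y, 1)[0]  # OLS slope estimate
    return out

results = {}
for n in (10, 100, 1000):
    d = slope_draws(n)
    results[n] = (d.mean(), d.std())
    print(n, round(d.mean(), 3), round(d.std(), 4))

# The mean stays near the true value 0.5 at every sample size, while the
# standard deviation shrinks roughly with the square root of N: the
# distribution collapses toward beta1, which is what consistency means.
```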
19
The two best things you can say about an estimator are that it is unbiased and that it is consistent.
OLS estimators are both unbiased and consistent when the error is uncorrelated with the independent
variable and there are no post-treatment variables in the model (something we discuss in Chapter 7).
These properties seem pretty similar, but they can be rather different. These differences are typically
only relevant in advanced statistical work. For reference, we discuss in the citations and notes section
on page 556 examples of estimators that are unbiased but not consistent, and vice versa.
Homoscedasticity
The first condition for Equation 3.9 to be appropriate is that the variance of εi must be the same for every observation. That is, once we have taken into account the effect of our measured variable (X), the expected degree of uncertainty in the model must be the same for all observations. If this condition holds, the variance of the error term is the same for low values of X as for high values of X. This condition gets a fancy name, homoscedasticity. "Homo" means same. "Scedastic" (yes, that's a word) means variance. Hence, errors are homoscedastic when they all have the same variance.

homoscedastic: Describing a random variable having the same variance for all observations.

heteroscedastic: A random variable is heteroscedastic if the variance differs for some observations.

When errors violate this condition, they are heteroscedastic, meaning that the variance of εi is different for at least some observations. That is, some observations are on average closer to the predicted value than others. Imagine, for example, that we have data on how much people weigh from two sources: some people weighed themselves with a state-of-the-art scale, and others had a guy at a state fair guess their weight. Definite heteroscedasticity there, as the weight estimates on the scale would be very close to the truth (small errors), and the weight estimates from the fair dude will be further from the truth (large errors).
heteroscedasticity-consistent standard errors: Standard errors for the coefficients in OLS that are appropriate even when errors are heteroscedastic.

Violating the homoscedasticity condition doesn’t cause OLS β̂₁ estimates to be biased. It simply means we shouldn’t use Equation 3.9 to calculate the variance of β̂₁. Happily for us, the intuitions we have discussed so far about what causes var(β̂₁) to be big or small still hold, and there are relatively simple procedures for this case. We show how to generate these heteroscedasticity-consistent standard errors in Stata and R in the Computing Corner of this chapter (pages 83 and 86). This approach to accounting for heteroscedasticity does not affect the values of the β̂ estimates.²⁰
²⁰ The equation for heteroscedasticity-consistent standard errors is ugly. If you must know, it is

var(β̂₁) = [ Σᵢ (Xᵢ − X̄)² ε̂ᵢ² ] / [ Σᵢ (Xᵢ − X̄)² ]²    (3.12)

This is less intuitive than Equation 3.9, so we do not emphasize it. As it turns out, we derive heteroscedasticity-consistent standard errors in the course of deriving the standard errors that assume homoscedasticity (see Chapter 14, page 499). Heteroscedasticity-consistent standard errors are also
referred to as robust standard errors (because they are robust to heteroscedasticity) or as
Huber-White standard errors. Another approach to dealing with heteroscedasticity is to use
“weighted least squares.” This approach is more statistically efficient, meaning that the variance of
the estimate will theoretically be lower. The technique produces β̂ 1 estimates that differ from the
OLS β̂ 1 estimates. We point out references with more details on weighted least squares in the Further
Reading section at the end of this chapter.
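Equation 3.12 is easy to compute directly. The sketch below is ours (in Python rather than the chapter's Stata and R): it computes the classical standard error (σ̂² divided by Σ(Xᵢ − X̄)², one common way to write Equation 3.9) alongside the heteroscedasticity-consistent version. This is the basic White form; Stata and R apply small-sample adjustments such as HC1, so packaged numbers will differ slightly.

```python
import math

def ols_fit(x, y):
    """Return (intercept, slope) from bivariate OLS."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

def standard_errors(x, y):
    """Return (classical SE, heteroscedasticity-consistent SE) for the slope."""
    n = len(x)
    b0, b1 = ols_fit(x, y)
    xbar = sum(x) / n
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sxx = sum((xi - xbar) ** 2 for xi in x)
    # Classical: sigma^2-hat / sum((xi - xbar)^2), with sigma^2-hat = SSR/(n - 2)
    sigma2 = sum(e ** 2 for e in resid) / (n - 2)
    se_classical = math.sqrt(sigma2 / sxx)
    # Equation 3.12: sum((xi - xbar)^2 e_i^2) / [sum((xi - xbar)^2)]^2
    var_robust = sum((xi - xbar) ** 2 * e ** 2
                     for xi, e in zip(x, resid)) / sxx ** 2
    return se_classical, math.sqrt(var_robust)

# Made-up data where the spread of y around the line grows with x
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.1, 2.0, 3.4, 3.6, 5.9, 5.0, 8.5, 6.9]
print(standard_errors(x, y))
```

Note that only the standard errors change between the two columns of such a comparison; the slope itself comes from the same ols_fit either way, matching the text's point that robust standard errors do not affect the β̂ estimates.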
There are two fairly common situations in which errors are correlated. The
first involves clustered errors. Suppose, for example, we’re looking at test scores
of all eighth graders in California. It is possible that the unmeasured factors in the
error term cluster by school. Maybe one school attracts science nerds and another
attracts jocks. If such patterns exist, then knowing the error term for a kid in a
school gives some information about the error terms of other kids in the same
school, which means errors are correlated. In this case, the school is the “cluster,”
and errors are correlated within the cluster. It’s inappropriate to use Equation 3.9
when errors are correlated.
This sounds worrisome. And it is, but not terribly so. As with heteroscedasticity, violating the condition that errors must not be correlated doesn’t cause an OLS β̂₁ estimate to be biased. Correlated errors only render Equation 3.9 inappropriate.
So what should we do if errors are correlated? Get a better equation for the
variance of β̂ 1 ! It’s actually a bit more complicated than that, but it is possible to
derive the variance of β̂ 1 when errors are correlated within cluster. We simply note
the issue here and use the computational procedures discussed in the Computing
Corner to deal with clustered standard errors.

time series data: Consists of observations for a single unit over time. Time series data is typically contrasted with cross-sectional and panel data.

Correlated errors are also common in time series data—that is, data on a specific unit over time. Examples include U.S. growth rates since 1945 or data on annual attendance at New York Yankees games since 1913. Errors in time series data are frequently correlated in a pattern we call autocorrelation. Autocorrelation occurs when the error in one time period is correlated with the error in the previous time period.
autocorrelation: Errors are autocorrelated if the error in one time period is correlated with the error in the previous time period. Autocorrelation is common in time series data.

Correlated errors can occur in time series when an unmeasured variable in the error term is sticky, such that a high value in one year implies a high value in the next year. Suppose, for example, we are modeling annual U.S. economic growth since 1945 and we lack a variable for technological innovation (which is very hard to measure). If technological innovation was in the error term boosting the economy in one year, it probably did some boosting to the error term the next year. Similar autocorrelation is likely in many time series data sets, ranging from average temperature in Tampa over time to monthly Frisbee sales in Frankfurt.
As with the other issues raised in this section, autocorrelation does not
cause bias. Autocorrelation only renders Equation 3.9 inappropriate. Chapter 13
discusses how to generate appropriate estimates of the variance of β̂ 1 when errors
are autocorrelated.
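Autocorrelated errors are easy to generate and to spot. In this sketch (our own illustration, in Python rather than the chapter's Stata and R), the error follows εₜ = ρεₜ₋₁ + νₜ with ρ = 0.8, so the sample correlation between the error and its one-period lag comes out close to ρ.

```python
import random

def lag1_corr(e):
    """Sample correlation between e_t and e_{t-1}."""
    a, b = e[1:], e[:-1]
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5

random.seed(7)
rho = 0.8
errors = [random.gauss(0, 1)]
for _ in range(4999):
    # This period's error carries over 0.8 of last period's error.
    errors.append(rho * errors[-1] + random.gauss(0, 1))
print(round(lag1_corr(errors), 2))  # prints a value close to rho = 0.8
```

Setting rho = 0 would make the lag-1 correlation hover near zero, which is the no-autocorrelation case Equation 3.9 assumes.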
It is important to keep these conditions in perspective. Unlike the exogeneity
condition (that X and the errors are uncorrelated), we do not need the homoscedas-
ticity and uncorrelated-errors conditions for unbiased estimates. When these
conditions fail, we simply do some additional steps to get back to a correct
equation for the variance of β̂ 1 . Violations of these conditions may seem to
be especially important because they have fancy labels like “heteroscedasticity”
and “autocorrelation.” They are not. The exogeneity condition matters much
more.
REMEMBER THIS
1. The standard equation for the variance of β̂ 1 (Equation 3.9) requires errors to be homoscedastic
and uncorrelated with each other.
• Errors are homoscedastic if their variance is constant. When errors are heteroscedastic, the
variance of errors is different across observations.
• Correlated errors commonly occur in clustered data in which the error for one observation
is correlated with the error for another observation from the same cluster (e.g., a
school).
• Correlated errors are also common in time series data where errors are autocorrelated,
meaning the error in one period is correlated with the error in the previous period.
2. Violating the homoscedasticity or uncorrelated-error conditions does not bias OLS coefficients.
Discussion Questions
Come up with an example of an interesting relationship you would like to test.
We shouldn’t worry too much about goodness of fit, however, as we can have useful, interesting
results from models with poor fit and biased, useless results from models with
great fit.
Standard error of the regression (σ̂)
We’ve already seen one goodness of fit measure, the variance of the regression
(denoted as σ̂ 2 ). One limitation with this measure is that the scale is not intuitive.
For example, if our dependent variable is salary, the variance of the regression will
be measured in dollars squared (which is odd).
standard error of the regression: A measure of how well the model fits the data. It is the square root of the variance of the regression.

Therefore, the standard error of the regression is commonly used as a measure of goodness of fit. It is simply the square root of the variance of the regression and is denoted as σ̂. It corresponds, roughly, to the average distance of observations from fitted values. The scale of this measure will be the same units as the dependent variable, making it much easier to relate to.

The trickiest thing about the standard error of the regression may be that it goes by so many different names. Stata refers to σ̂ as the root mean squared error (or root MSE for short); root refers to the square root and MSE to mean squared error, which is how we calculate σ̂², or the mean of the squared residuals. R refers to σ̂ as the residual standard error because it is the estimated standard error for the errors in the model based on the residuals.
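The relationship among these names can be checked in a few lines. This sketch is ours (in Python rather than the chapter's Stata and R). One wrinkle: the text describes σ̂² as the mean of the squared residuals, while Stata's root MSE and R's residual standard error divide the sum of squared residuals by the degrees of freedom, N − k (here N − 2), rather than by N; the code below uses the N − 2 divisor.

```python
import math

def regression_sigma_hat(x, y):
    """sigma-hat from a bivariate fit, using the N - 2 degrees-of-freedom divisor."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    b0 = ybar - b1 * xbar
    ssr = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(ssr / (n - 2))

# Made-up data; sigma-hat is in the same units as y (dollars if y is salary)
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
print(round(regression_sigma_hat(x, y), 3))
```

Because σ̂ shares the dependent variable's units, this number can be read directly as a rough "typical miss" of the fitted values.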
R2
Finally, a very common measure of goodness of fit is R2 , so named because it
is a measure of the squared correlation of the fitted values and actual values.21
Correlation is often indicated with an “r,” so R2 is simply the square of this
value. (Why one is lowercase and the other is uppercase is one of life’s little
mysteries.) The value of R2 also represents the percent of the variation in the
dependent variable explained by the included independent variables in the linear
model.
If the model explains the data well, the fitted values will be highly correlated
with the actual values and R2 will be high. If the model does not explain the data
well, the fitted values will not correlate very highly with the actual values and R2
will be near zero. Possible values of R2 range from 0 to 1.
²¹ This interpretation works only if an intercept is included in the model, which it usually is.
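Both readings of R² described above, the squared correlation of fitted and actual values and the share of variance explained, can be checked numerically. In this sketch (ours, in Python rather than the chapter's Stata and R), the two computations agree, as they must for OLS with an intercept.

```python
def r_squared(x, y):
    """Return R-squared two ways: (1 - SSR/SST, squared corr of fitted and actual)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    fitted = [b0 + b1 * xi for xi in x]
    # Version 1: share of the variation in y explained by the model
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - ybar) ** 2 for yi in y)
    r2_variance = 1 - ssr / sst
    # Version 2: squared correlation between fitted and actual y
    fbar = sum(fitted) / n
    cov = sum((fi - fbar) * (yi - ybar) for fi, yi in zip(fitted, y))
    var_f = sum((fi - fbar) ** 2 for fi in fitted)
    r2_corr = cov ** 2 / (var_f * sst)
    return r2_variance, r2_corr

x = [1, 2, 3, 4, 5, 6]
y = [1.2, 2.1, 2.8, 4.5, 4.9, 6.3]
v1, v2 = r_squared(x, y)
print(round(v1, 4), round(v2, 4))  # the two versions match
```

Dropping the intercept breaks the equivalence, which is exactly the caveat in the footnote above.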
R2 values often help us understand how well our model predicts the dependent
variable, but the measure may be less useful than it seems. A high R2 is neither
necessary nor sufficient for an analysis to be useful. A high R2 means the predicted
values are close to the actual values. It says nothing more. We can have a
model loaded with endogeneity that generates a high R2 . The high R2 in this
case means nothing; the model is junk, the high R2 notwithstanding. And to
make matters worse, some people have the intuition that a good fit is necessary
for believing regression results. This intuition isn’t correct, either. There is no
minimum value we need for a good regression. In fact, it is very common for
experiments (the gold standard of statistical analyses) to have low R2 values.
There can be all kinds of reasons for low R2 —the world could be messy, such
that σ 2 is high, for example—but the model could nonetheless yield valuable
insight.
Figure 3.9 shows various goodness of fit measures for OLS estimates of
two different hypothetical data sets of salary at age 30 (measured in thousands
of dollars) and years of education. In panel (a), the observations are pretty
closely clustered around the regression line. That’s a good fit. The variance of
the regression is 91.62; it’s not really clear what to make of that, however, until
we look at its square root, σ̂ (also known as the standard error of the regression,
among other terms), which is 9.57. Roughly speaking, this value of the standard
error of the regression means that the observations are on average within 9.57 units
of their fitted values.22 From this definition, therefore, on average the fitted values
are within $9,570 of actual salary. The R2 is 0.89. That’s pretty high. Is that value
high enough? We can’t answer that question because it is not a sensible question
for R2 values.
In panel (b) of Figure 3.9, the observations are more widely dispersed and so
not as good a fit. The variance of the regression is 444.2. As with panel (a), it’s
not really clear what to make of the variance of the regression until we look at
its square root, σ̂ , which is 21.1. This value means that the observations are on
average within $21,100 of actual salary. The R2 is 0.6. Is that good enough? Silly
question.
REMEMBER THIS
There are four ways to assess goodness of fit.
1. The variance of the regression (σ̂ 2 ) is used in the equation for var(β̂ 1 ). It is hard to interpret
directly.
²² We say “roughly speaking” because this value is actually the square root of the average of the squared residuals. The intuition for that value is the same, but it’s quite a mouthful.
[Figure 3.9: Plots with Different Goodness of Fit. Two scatterplots of salary at age 30 (in $1,000s) against years of education. Panel (a): σ̂² = 91.62, σ̂ = 9.57, R² = 0.89. Panel (b): σ̂² = 444.2, σ̂ = 21.1, R² = 0.6.]
2. The standard error of the regression (σ̂ ) is measured on the same scale as the dependent
variable and roughly corresponds to the average distance between fitted values and actual
values.
3. Scatterplots can be quite informative, not only about goodness of fit but also about possible
anomalies and outliers.
4. R2 is a widely used measure of goodness of fit.
• It is the square of the correlation between the fitted and observed values of the dependent
variable.
• R2 ranges from 0 to 1.
• A high R2 is neither necessary nor sufficient for an analysis to be useful.
The results reported in Table 3.2 look pretty much like the results any
statistical software will burp out. The estimated coefficient on adult height
(β̂ 1 ) is 0.412. The standard error estimate will vary depending on whether
we assume errors are or are not homoscedastic. The column on the left
shows that if we assume homoscedasticity (and therefore use Equation 3.9), the
estimated standard error of β̂ 1 is 0.0975. The column on the right shows that if we
allow for heteroscedasticity, the estimated standard error for β̂ 1 is 0.0953. This isn’t
much of a difference, but the two approaches to estimating standard errors can differ more substantially in other examples. The estimated constant (β̂₀) is −13.093, with standard error estimates of 6.897 and 6.691, depending on whether or not we use heteroscedasticity-consistent standard errors.
Notice that the β̂ 0 and β̂ 1 coefficients are identical across the columns, as
the heteroscedasticity-consistent standard error estimate has no effect on the
coefficient.
²³ The data is adjusted in two ways for the figure. First, we jitter the data to deal with the problem that
many observations overlap perfectly because they have the same values of X and Y. Jittering adds a
small random number to the height, causing each observation to be at a slightly different point. If
there are only two observations with the same specific combination of X and Y values, the jittered
data will show two circles, probably overlapping a bit. If there are many observations with some
specific combination of X and Y values, the jittered data will show many circles, overlapping a bit,
but creating a cloud of data that indicates lots of data near that point. We don’t use jittered data in the
statistical analysis; we use jittered data only for plotting data. Second, six outliers who made a ton of
money ($750 per hour for one of them!) are excluded. If they were included, the scatterplot would be
so tall that most observations would get scrunched up at the bottom.
[Figure 3.10: Height and Wages. Scatterplot of hourly wages (in $, roughly 0 to 80) against height in inches (roughly 60 to 80).]

[Table 3.2, bottom row: R² = 0.009 in both columns.]
What, exactly, do these numbers mean? First, let’s interpret the slope coef-
ficient, β̂ 1 . A coefficient of 0.412 on height implies that a one-inch increase in
height is associated with an increase in wages of 41.2 cents per hour. That’s
a lot!24
The interpretation of the constant, β̂₀, is that someone who is zero inches tall would earn negative $13.09 an hour. Hmmm. Not the most helpful piece of information. What’s going on is that most observations of height (the X variable) are far from zero (they are mostly between 60 and 75 inches). For the regression line to go through this data, it must cross the Y-axis at −13.09 for people who are zero inches tall. This example explains why we don’t spend a lot of time on β̂₀. It’s kind of weird to want to know—or believe—the extrapolation of our results to people who are zero inches tall.
If we don’t care about β̂₀, why do we have it in the model? Because it still plays
a very important role. Remember that we’re fitting a line, and the value of β̂ 0 pins
down where the line starts when X is zero. Failing to estimate the parameter is
the same as setting β̂ 0 to zero (because the fitted value would be Ŷi = β̂ 1 Xi , which
is zero when Xi = 0). Forcing β̂ 0 to be zero will typically lead to a much worse
model fit than letting the data tell us where the line should cross the Y-axis when X
is zero.
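To see how much forcing β̂₀ = 0 can hurt, compare a fit with an intercept to a through-the-origin fit on data whose X values sit far from zero, as with heights. This sketch is ours (in Python rather than the chapter's Stata and R), and the numbers are made up for illustration.

```python
def fit_with_intercept(x, y):
    """Standard bivariate OLS: return (intercept, slope)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    b0 = ybar - b1 * xbar
    return b0, b1

def fit_through_origin(x, y):
    """Forcing b0 = 0 gives b1 = sum(x*y) / sum(x^2)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)

def ssr(x, y, b0, b1):
    """Sum of squared residuals around the line b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Heights cluster far from zero, like the wage example (made-up numbers)
x = [60, 63, 66, 69, 72, 75]
y = [12.0, 13.5, 14.2, 16.0, 17.1, 18.4]
b0, b1 = fit_with_intercept(x, y)
b1_zero = fit_through_origin(x, y)
print(round(ssr(x, y, b0, b1), 2), "<", round(ssr(x, y, 0.0, b1_zero), 2))
```

Because the model with an intercept nests the through-the-origin model, its sum of squared residuals can never be larger; when the data sit far from zero, it is usually much smaller.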
The results are not only about the estimated coefficients. They also include
standard errors, which are quite important as they give us a sense of how accurate
our estimates are. The standard error estimates come from the data and tell us how
wide the distribution of β̂ 1 is. If the standard error of β̂ 1 is huge, then we should
not have much confidence that our β̂ 1 is necessarily close to the true value. If the
standard error of β̂ 1 is small, then we should have more confidence that our β̂ 1 is
close to the true value.
Are these results the final word on the relationship between height and wages?
(Hint: NO!) As with most observational data, a bivariate analysis may not be sufficient.
We should worry about endogeneity. In other words, there could be elements in the
error term (factors that influence wages but have not been included in the model)
that could be correlated with adult height, and if so, then the result that height
causes wages to go up may be incorrect. Can you think of anything in the error
term that is correlated with height? We come back to this question in Chapter 5
(page 131), where we revisit this data set.
Table 3.2 also shows several goodness of fit measures. The σ̂ 2 is 142.4; this
number is pretty hard to get our heads around. Much more useful is the standard
error of the regression, σ̂, which is 11.93, meaning roughly that the average distance between fitted and actual wages is almost $12 per hour. In other words, the fitted
values really aren’t particularly accurate. The R2 is close to 0.01. This value is low, but
as we said earlier, there is no set standard for R2 .
²⁴ To put that estimate in perspective, we can calculate how much being an inch taller is worth per year for someone who works 40 hours a week for 50 weeks per year: 0.41 × 1 × 40 × 50 = $820 per year. Being three inches taller is associated with earning 0.41 × 3 × 40 × 50 = $2,460 more per year. Being tall has its costs, though: tall people live shorter lives (Palmer 2013).
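The footnote's arithmetic, which rounds the coefficient to 0.41, can be checked directly; the factor of 2,000 is just 40 hours times 50 weeks. A tiny check (ours, in Python):

```python
hours_per_year = 40 * 50  # 40 hours/week for 50 weeks = 2,000 hours
extra_per_inch = round(0.41 * 1 * hours_per_year)      # dollars per year
extra_three_inches = round(0.41 * 3 * hours_per_year)  # dollars per year
print(extra_per_inch, extra_three_inches)  # 820 2460
```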
One reasonable concern might be that we should be wary of the OLS results
because the model fit seems pretty poor. That’s not how it works, though. The
coefficients provide the best estimates, given the data. The standard errors of the
coefficients incorporate the poor fit (via the σ̂ 2 ). So, yes, the poor fit matters, but it’s
incorporated into the OLS estimation process.
3.8 Outliers
outliers: Observations that are extremely different from those in the rest of the sample.

One practical concern we have in statistics is dealing with outliers, or observations that are extremely different from the rest of the sample. The concern is that a single goofy observation can skew the analysis.

We saw on page 32 that Washington, DC, is quite an outlier in a plot of crime data for the United States. Figure 3.11 shows a scatterplot of violent crime and percent urban.

[Figure 3.11: Scatterplot of Violent Crime and Percent Urban. Violent crime rate (per 100,000 people) against percent urban for the 50 states and DC; DC sits far above every state.]

Imagine drawing an OLS line by hand when the nation’s capital
is included. Then imagine drawing an OLS line by hand when it’s excluded.
The line with Washington, DC, will be steeper in order to get close to the
observation for Washington, DC; the other line will be flatter because it can
stay in the mass of the data without worrying about Washington, DC. Hence, a
reasonable person may worry that the DC data point could substantially influence
the estimate. On the other hand, if we were to remove an observation in the
middle of the mass of the data, such as Oklahoma, the estimated line would move
little.
We can see the effect of including and excluding DC in Table 3.3, which
shows bivariate OLS results in which violent crime rate is the dependent variable.
In the first column, percent urban is the independent variable and all states
plus DC are included (therefore the N is 51). The coefficient is 5.61 with a
standard error of 1.80. The results in the second column are based on data without
Washington, DC (dropping the N to 50). The coefficient is quite a bit smaller,
coming in at 3.58, which is consistent with our intuition from our imaginary line
drawing.
The table also shows bivariate OLS coefficients for a model with single-parent
percent as the independent variable. The coefficient when we include DC is 23.17.
When we exclude DC, the estimated relationship weakens to 16.91. We see a
similar pattern with crime and poverty percent in the last two columns.
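The mechanics behind Table 3.3 are easy to replicate. We do not have the state data here, so the sketch below (ours, in Python rather than the chapter's Stata and R) uses made-up numbers: a small cloud of "states" with a mild relationship, plus one DC-like observation that is very urban with very high crime. Adding that one point pulls the slope up sharply.

```python
def ols_slope(x, y):
    """Bivariate OLS slope."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
           sum((xi - xbar) ** 2 for xi in x)

# Made-up "states": percent urban vs. violent crime rate
x = [45, 55, 60, 65, 70, 75, 80, 90]
y = [300, 350, 320, 400, 380, 450, 430, 500]
# One DC-like outlier: very urban, very high crime
x_all, y_all = x + [100], y + [1200]
print(round(ols_slope(x_all, y_all), 2), ">", round(ols_slope(x, y), 2))
```

With only nine observations, one extreme point dominates the sum of squared residuals; with thousands of observations the same point would move the line far less, which is the sample-size point made below.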
Figure 3.12 shows scatterplots of the data with the fitted lines included. The
fitted lines based on all data are the solid lines, and the fitted lines when DC is
excluded are the dashed lines. In every case, the fitted lines including DC are
steeper than the fitted lines when DC is excluded.
So what are we to conclude here? Which results are correct? There may be
no clear answer. The important thing is to appreciate that the results in these
cases depend on a single observation. In such cases, we need to let the world
know. We should show results with and without the excluded observation and
FIGURE 3.12: Scatterplots of Crime against Percent Urban, Single Parent, and Poverty with OLS Fitted Lines

[Three panels plot violent crime rate (per 100,000 people) against percent urban, percent single parent, and poverty percent. Each panel shows the fitted line with DC (solid) and without DC (dashed); DC sits far above the other observations in all three panels.]
justify substantively why an observation might merit exclusion. In the case of the
crime data, for example, we could exclude DC on the grounds that it is not (yet!)
a state.
Outlier observations are more likely to influence OLS results when the
number of observations is small. Given that OLS will minimize the sum of squared
residuals from the fitted line, a single observation is more likely to play a big role
when only a few residuals must be summed. When data sets are very large, a single
observation is less likely to move the fitted line substantially.
An excellent way to identify potentially influential observations is to plot the
data and look for unusual observations. If an observation looks out of whack,
it’s a good idea to run the analysis without it to see if the results change. If
they do, explain the situation to readers and justify including or excluding the
outlier.25
²⁵ Most statistical packages provide tools to assess the influence of each observation. For a sample
size N, these commands essentially run N separate OLS models, each one excluding a different
observation. For each of these N regressions, the command stores a value indicating how much the
coefficient changes when that particular observation is excluded. The resulting output reflects how
much the coefficients change with the deletion of each observation. In Stata, the command is
dfbeta, where df refers to difference and beta refers to β̂. In other words, the command will tell us
for each observation the difference in estimated β̂’s when that observation is deleted. In R, the
command is also called dfbeta. Google these command names to find more information on how to
use them.
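The leave-one-out logic behind dfbeta is just N refits. The sketch below is ours (in Python; Stata and R users can rely on the built-in dfbeta commands described in the footnote): for each observation it records how much the slope changes when that observation is dropped, and the made-up outlier shows far more influence than any other point.

```python
def ols_slope(x, y):
    """Bivariate OLS slope."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
           sum((xi - xbar) ** 2 for xi in x)

def slope_dfbetas(x, y):
    """For each observation, the slope change when that observation is dropped."""
    full = ols_slope(x, y)
    out = []
    for i in range(len(x)):
        x_i = x[:i] + x[i + 1:]
        y_i = y[:i] + y[i + 1:]
        out.append(full - ols_slope(x_i, y_i))
    return out

x = [1, 2, 3, 4, 10]          # made-up data; the last point is far from the rest
y = [1.1, 1.9, 3.2, 3.9, 30.0]
changes = slope_dfbetas(x, y)
# The outlier has by far the largest influence on the slope.
print([round(c, 2) for c in changes])
```

This is exactly what the packaged commands do under the hood, minus their computational shortcuts for large N.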
REMEMBER THIS
Outliers are observations that are very different from other observations.
1. When sample sizes are small, a single outlier can exert considerable influence on OLS
coefficient estimates.
2. Scatterplots are useful in identifying outliers.
3. When a single observation substantially influences coefficient estimates, we should
• Inform readers of the issue.
• Report results with and without the influential observation.
• Justify including or excluding that observation.
Conclusion
Ordinary least squares is an odd name that refers to the way in which the β̂
estimates are produced. That’s fine to know, but the real key to understanding OLS
is appreciating the properties of the estimates produced.
The most important property of OLS estimates is that they are unbiased if X is uncorrelated with the error. We’ve all heard “correlation does not imply
causation,” but “regression does not imply causation” is every bit as true. If there
is endogeneity, we may observe a big regression coefficient even in the absence of
causation or a tiny regression coefficient even when there is causation.
OLS estimates have many other useful properties. With a large sample size,
β̂ 1 is a normally distributed random variable. The variance of β̂ 1 reflects the width
of the β̂ 1 distribution and is determined by the fit of the model (the better the
fit, the thinner the distribution), the sample size (the more data, the thinner the
distribution), and the variance of X (the more variance, the thinner the distribution).
If the errors satisfy the homoscedasticity and no-correlation conditions, the
variance of β̂ 1 is defined by Equation 3.9. If the errors are heteroscedastic or
correlated with each other, OLS still produces unbiased coefficients, but we will
need other tools, covered here and in Chapter 13, to get appropriate standard errors
for our β̂ 1 estimates.
We’ll have mastered bivariate OLS when we can accomplish the
following:
• Section 3.1: Write out the bivariate regression equation, and explain all its
elements (dependent variable, independent variable, slope, intercept, error
term). Draw a hypothetical scatterplot with a small number of observations,
and show how bivariate OLS is estimated, identifying residuals and fitted values.
• Section 3.2: Explain why β̂ 1 is a random variable, and sketch its dis-
tribution. Explain two ways to think about randomness in coefficient
estimates.
• Section 3.4: Write out the standard equation for the variance of β̂ 1 in
bivariate OLS, and explain three factors that affect this variance.
• Section 3.6: Identify the conditions required for the standard variance
equation of β̂ 1 to be accurate. Explain why these two conditions are less
important than the exogeneity condition.
• Section 3.7: Explain four ways to assess goodness of fit. Explain why R2
alone does not measure whether or not a regression was successful.
• Section 3.8: Explain what outliers are, how they can affect results, and what
to do about them.
Further Reading
Beck (2010) provides an excellent discussion of what to report from a regression
analysis.
Weighted least squares is a type of generalized least squares that can be
used when dealing with heteroscedastic data. Chapter 8 of Kennedy (2008)
discusses weighted least squares and other issues associated with errors that are
heteroscedastic or correlated with each other. These issues are often referred to
as violations of a “spherical errors” condition. Spherical errors is fancy statistical
jargon meaning that errors are both homoscedastic and not correlated with each
other.
Murray (2006b, 500) provides a good discussion of probability limits and
consistency for OLS estimates.
We discuss what to do with autocorrelated errors in Chapter 13. The Further
Reading section at the end of that chapter provides links to the very large literature
on time series data analysis.
Key Terms
Autocorrelation (69)
Bias (58)
Central limit theorem (56)
Consistency (66)
Continuous variable (54)
Degrees of freedom (63)
Distribution (54)
Fitted value (48)
Goodness of fit (70)
Heteroscedastic (68)
Heteroscedasticity-consistent standard errors (68)
Homoscedastic (68)
Modeled randomness (54)
Normal distribution (55)
Outliers (77)
plim (66)
Probability density (55)
Probability distribution (54)
Probability limit (65)
Random variable (54)
Regression line (48)
Residual (48)
Sampling randomness (53)
Standard error (61)
Standard error of the regression (71)
Time series data (69)
Unbiased estimator (58)
Variance (61)
Variance of the regression (63)
Computing Corner
Stata

1. To estimate bivariate OLS with the donut data from Chapter 1, use Stata’s regress command, listing the dependent variable first: regress weight donuts. The output includes
------------------------------------------------------------------------------
weight | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
donuts | 9.103799 1.919976 4.74 0.001 4.877961 13.32964
_cons | 122.6156 16.36114 7.49 0.000 86.60499 158.6262
------------------------------------------------------------------------------
There is a lot of information here, not all of which is useful. The vital
information is in the bottom table that shows β̂ 1 is 9.10 with a standard
error of 1.92 and β̂ 0 is 122.62 with a standard error of 16.36. We cover t,
P>|t|, and 95% confidence intervals in Chapter 4.
The column on the upper right has some useful information, too, indicating
the number of observations, R2 , and Root MSE. (As we noted in the
chapter, Stata refers to the standard error of the regression, σ̂ , as root MSE,
which is Stata’s shorthand for the square root of the mean squared error.)
We discuss the adjusted R2 later (page 150). The F and Prob > F to the right of the output relate to a test we also cover later (page 159); it’s generally not particularly useful.
The table in the upper left is pretty useless. Contemporary researchers seldom use the information in the Source, SS, df, and MS columns.
We can display the actual values, fitted values, and residuals with the list
command: list weight Fitted Residuals.
²⁶ We jittered the data in Figure 3.10 to make it a bit easier to see more data points. Stata’s jitter
subcommand jitters data [e.g., scatter weight donuts, jitter(3)]. The bigger the number in
parentheses, the more the data will be jittered.
R

1. The following commands use the donut data from Chapter 1 (page 3).
Since R is an object-oriented language, our regression commands create
objects containing information, which we ask R to display as needed. To
estimate an OLS regression, we create an object called “OLSResults” (we
could choose a different name) by typing OLSResults = lm(weight
~ donuts). This command stores information about the regression
results in the object called OLSResults. The lm command stands for
“linear model” and is the R command for OLS. The general format
is lm(Y ~ X) for a dependent variable Y and independent variable X.
To display these regression results, type summary(OLSResults), which
produces
lm(formula = weight ~ donuts)
Residuals:
Min 1Q Median 3Q Max
-93.135 -9.479 0.757 35.108 55.073
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.616 16.361 7.494 0.0000121
donuts 9.104 1.920 4.742 0.000608
Residual standard error: 45.59 on 11 degrees of freedom
Multiple R-squared: 0.6715, Adjusted R-squared: 0.6416
F-statistic: 22.48 on 1 and 11 DF, p-value: 0.0006078
The vital information is in the bottom table that shows that β̂ 1 is 9.104
with a standard error of 1.920 and β̂ 0 is 122.616 with a standard error of
16.361. We cover t value and Pr(>|t|) in Chapter 4.
R refers to the standard error of the regression (σ̂ ) as the residual standard
error and lists it below the regression results. Next to that is the degrees of
freedom. To calculate the number of observations in the data set analyzed,
recall that degrees of freedom equals N − k. Since we know k (the number
of estimated coefficients) is 2 for this model, we can infer the sample
size is 13. (Yes, this is probably more work than it should be to display
sample size.)
The multiple R2 (which is just the R2 ) is below the residual standard error.
We discuss the adjusted R2 later (page 150). The F statistic at the bottom
refers to a test we cover on page 159. It’s usually not a center of attention.
Computing Corner 85
27
Figure 3.10 jittered the data to make it a bit easier to see more data points. To jitter data in an R
plot, type plot(jitter(donuts), jitter(weight)).
28
There are more efficient ways to exclude data when we are using data frames. For example, if the
variables are all included in a data frame called dta, we could type OLSResultsNoHomer =
lm(weight ~ donuts, data = dta[name != "Homer", ]).
The useful AER package must be installed once and loaded at each use, as
follows:
• Tell R to load the package every time we open R and want to use the
commands in the AER (or other) package. We do this with the library
command. We have to use the library command in every session we use
a package.
Assuming the AER package has been installed, we can run OLS
with heteroscedasticity-consistent standard errors via the following
code:
library(AER)
OLSResults = lm(weight ~ donuts)
coeftest(OLSResults, vcov = vcovHC(OLSResults,
type = "HC1"))
The last line is elaborate. The command coeftest is asking for information
on the variance of the estimates (among other things) and the vcov =
vcovHC part of the command is asking for heteroscedasticity-consistent
standard errors. There are multiple ways to estimate such standard errors,
and the HC1 asks for the most commonly used form of these standard
errors.29
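The AER route above is the standard one in R. To make the mechanics concrete, here is a minimal sketch in Python (illustrative only, not the book's code) that computes bivariate OLS estimates and then the HC1 heteroscedasticity-consistent standard error of the slope by hand: the sandwich formula with the n/(n − k) small-sample correction that the "HC1" option applies.

```python
import math

def ols_hc1(x, y):
    """Bivariate OLS with an HC1 (heteroscedasticity-consistent) standard
    error for the slope. The sandwich variance for the slope is
    sum(e_i^2 (x_i - xbar)^2) / Sxx^2, scaled by n/(n - k) with k = 2."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    # "Meat" of the sandwich: squared residuals weighted by (x_i - xbar)^2.
    meat = sum(e ** 2 * (xi - xbar) ** 2 for e, xi in zip(resid, x))
    var_b1 = (n / (n - 2)) * meat / sxx ** 2
    return b0, b1, math.sqrt(var_b1)

# Toy data: y = 3 + 2x plus an error whose spread grows with x
# (i.e., heteroscedasticity), so robust standard errors are appropriate.
x = [1, 2, 3, 4, 5, 6, 7, 8]
e = [0.1, -0.1, 0.3, -0.3, 0.6, -0.6, 1.0, -1.0]
y = [3 + 2 * xi + ei for xi, ei in zip(x, e)]
b0, b1, se_hc1 = ols_hc1(x, y)
```

The estimates land near the true intercept of 3 and slope of 2; the se_hc1 value is what coeftest with vcovHC(..., type = "HC1") reports for the slope in R.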
Exercises
1. Use the data in PresVote.dta to answer the following questions about the
relationship between changes in real disposable income and presidential
election results. Table 3.4 describes the variables.
(b) Estimate an OLS regression in which the vote share of the incumbent
party is regressed on change in real disposable income. Report the
estimated regression equation, and interpret the coefficients.
29
The “vcov” terminology is short for variance-covariance, and “vcovHC” is short for
heteroscedasticity-consistent standard errors.
Exercises 87
TABLE 3.4 Variables for Questions on Presidential Elections and the Economy
Variable name Description
For this problem, we are going to assume that the true model is
Salaryi = β0 + β1 Educationi + εi
The model indicates that the salary for each person is $10,000 plus
$1,000 times the number of years of education plus the error term for the
individual. Our goal is to explore how much our estimate of β̂ 1 varies.
The book’s website provides code that will simulate a data set with
100 observations. (Stata code is in Ch3_SimulateBeta_StataCode.do; R
code is in Ch3_SimulateBeta_StataCode.R.) Values of education for each
observation are between 0 and 16 years. The error term will be a normally
distributed error term with a standard deviation of 10,000.
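The book's website supplies the Stata and R simulation code; as an illustrative stand-in (not the book's code), the Python sketch below runs the same experiment: draw education between 0 and 16, generate salaries from the assumed true model with a 10,000-standard-deviation error, and collect the OLS slope estimate from each simulated data set.

```python
import random

def simulate_beta1(n_obs, n_sims, error_sd, seed=7):
    """Repeatedly simulate Salary = 10,000 + 1,000*Education + error
    and return the OLS slope estimate from each simulated data set."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_sims):
        educ = [rng.uniform(0, 16) for _ in range(n_obs)]
        salary = [10_000 + 1_000 * e + rng.gauss(0, error_sd) for e in educ]
        xbar = sum(educ) / n_obs
        ybar = sum(salary) / n_obs
        sxx = sum((x - xbar) ** 2 for x in educ)
        sxy = sum((x - xbar) * (y - ybar) for x, y in zip(educ, salary))
        slopes.append(sxy / sxx)  # OLS slope: Sxy / Sxx
    return slopes

slopes = simulate_beta1(n_obs=100, n_sims=200, error_sd=10_000)
mean_slope = sum(slopes) / len(slopes)  # should land near the true 1,000
```

Because OLS is unbiased, the average of the slope estimates sits near 1,000 even though individual estimates scatter well above and below it, which previews the answers to parts (a) and (b).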
(a) Explain why the means of the estimated coefficients across the
multiple simulations are what they are.
(b) What are the minimum and maximum values of the estimated
coefficients on education? Explain whether these values are
inconsistent with our statement in the chapter that OLS estimates are
unbiased.
(c) Rerun the simulation with a larger sample size in each simulation.
Specifically, set the sample size to 1,000 in each simulation. Com-
pare the mean, minimum, and maximum of the estimated coefficients
on education to the original results above.
(d) Rerun the simulation with a smaller sample size in each simulation.
Specifically, set the sample size to 20 in each simulation. Compare
the mean, minimum, and maximum of the estimated coefficients on
education to the original results above.
(e) Reset the sample size to 100 for each simulation, and rerun the
simulation with a smaller standard deviation (equal to 500) for each
simulation. Compare the mean, minimum, and maximum of the
estimated coefficients on education to the original results above.
(f) Keeping the sample size at 100 for each simulation, rerun the
simulation with a larger standard deviation for each simulation.
Specifically, set the standard deviation to 50,000 for each simulation.
Compare the mean, minimum, and maximum of the estimated
coefficients on education to the original results above.
(g) Revert to the original model (sample size at 100 and standard deviation
at 10,000). Now run 500 simulations. Summarize the distribution of
the β̂ Education estimates as you’ve done so far, but now also plot the
distribution of these coefficients using code provided. Describe the
density plot in your own words.
(a) Estimate a model where height at age 33 explains income at age 33.
Explain β̂ 1 and β̂ 0 .
(b) Create a scatterplot of height and income at age 33. Identify outliers.
(c) Create a scatterplot of height and income at age 33, but exclude
observations with wages per hour more than 400 British pounds and
height less than 40 inches. Describe the difference from the earlier
plot. Which plot seems the more reasonable basis for statistical
analysis? Why?
(d) Reestimate the bivariate OLS model from part (a), but exclude four
outliers with very high wages and outliers with height below 40
inches. Briefly compare results to earlier results.
(e) What happens when the sample size is smaller? To answer this ques-
tion, reestimate the bivariate OLS model from above (that excludes
outliers), but limit the analysis to the first 800 observations.30 Which
changes more from the results with the full sample: the estimated
coefficient on height or the estimated standard error of the coefficient
on height? Explain.
4. Table 3.6 lists the variables in the WorkWomen.dta and WorkMen.dta data
sets, which are based on Chakraborty, Holter, and Stepanchuk (2012).
Answer the following questions about the relationship between hours
worked and divorce rates:
(a) For each data set (for women and for men), create a scatterplot of
hours worked on the Y-axis and divorce rates on the X-axis.
hours Average yearly labor (in hours) for gender specified in data set
divorcerate Divorce rate per thousand
taxrate Average effective tax rate
30
To do this in Stata, include if _n < 800 at the end of the Stata regress command. Because some
observations have missing data and others are omitted as outliers, the actual sample size in the
regression will fall a bit lower than 800. The _n notation is Stata’s way of indicating the observation
number, which is the row number of the observation in the data set. In R, create and use a new data
set with the first 800 observations (e.g., dataSmall = data[1:800,]).
(b) For each data set, estimate an OLS regression in which hours
worked is regressed on divorce rates. Report the estimated regression
equation, and interpret the coefficients. Explain any differences in
coefficients.
(c) What are the fitted value and residual for men in Germany?
(d) What are the fitted value and residual for women in Spain?
5. Use the data described in Table 3.6 to answer the following questions about
the relationship between hours worked and tax rates:
(a) For each data set (for women and for men), create a scatterplot of
hours worked on the Y-axis and tax rates on the X-axis.
(b) For each data set, estimate an OLS regression in which hours worked
is regressed on tax rates. Report the estimated regression equation,
and interpret the coefficients. Explain any differences in coefficients.
(c) What are the fitted value and residual for men in the United States?
(d) What are the fitted value and residual for women in Italy?
CHAPTER 4
Hypothesis Testing and Interval Estimation: Answering Research Questions
The standard null hypothesis is that height has no effect on wages. Or, more
formally,
H0 : β1 = 0
where the subscript zero after the H indicates that this is the null hypothesis.
4.1 Hypothesis Testing
1
That’s why there is a t-shirt that says “Being a statistician means never having to say you are
certain.”
situations, we must take the threat of Type II error seriously; we consider some
when we discuss statistical power in Section 4.4.
If we reject the null hypothesis, we accept the alternative hypothesis. We do
not prove the alternative hypothesis is true. Rather, the alternative hypothesis is
the idea we hang onto when we have evidence that is inconsistent with the null
hypothesis.
alternative hypothesis: What we accept if we reject the null hypothesis.
An alternative hypothesis is either one sided or two sided. A one-sided
alternative hypothesis has a direction. For example, if we have theoretical reasons
to believe that being taller increases wages, then the alternative hypothesis for the
model

Wagei = β0 + β1 Adult heighti + εi (4.2)

would be written as HA : β1 > 0.
one-sided alternative hypothesis: An alternative to the null hypothesis that has
a direction, for example, HA : β1 > 0 or HA : β1 < 0.
A two-sided alternative hypothesis has no direction. For example, if we
think height affects wages but we're not sure whether tall people get paid
more or less, the alternative hypothesis would be HA : β1 ≠ 0. If we've done
enough thinking to run a statistical model, it seems reasonable to believe that we
should have at least an idea of the direction of the coefficient on our variable
of interest, implying that two-sided alternatives might be rare. They are not,
however, in part because they are more statistically cautious, as we will discuss
shortly.
two-sided alternative hypothesis: An alternative to the null hypothesis that
indicates the coefficient is not equal to 0 (or some other specified value), for
example, HA : β1 ≠ 0.
Formulating appropriate null and alternative hypotheses allows us to translate
substantive ideas into statistical tests. For published work, it is generally a breeze
to identify null hypotheses: just find the β̂ that the authors jabber on about most.
The main null hypothesis is almost certainly that that coefficient is zero.
Consider the presidential election model

Vote sharet = β0 + β1 Change in incomet + εt

where Vote sharet is percent of the vote received by the incumbent president's party
in year t and the independent variable, Change in incomet , is the percent change
in real disposable income in the United States in the year before the presidential
election. The null hypothesis is that there is no effect, or H0 : β1 = 0.
What is the distribution of β̂1 under the null hypothesis? Pretty simple: if the
correlation of change in income and ε is zero (which we assume for this example),
then β̂1 is a normally distributed random variable centered on zero. This is because
OLS produces unbiased estimates, and if the true value of β1 is zero, then an
unbiased distribution of β̂1 will be centered on zero.
How wide is the distribution of β̂1 under the null hypothesis? Unlike the mean
of the distribution, which we know under the null, the width of the β̂1 distribution
depends on the data. In other words, we allow the data to tell us the variance and
standard error of the β̂1 estimate under the null hypothesis.
Table 4.2 shows the results for the presidential election model. Of particular
interest for us at this point is that the standard error of the β̂1 estimate is 0.55.
This number tells us how wide the distribution of the β̂1 will be under the
null.
With this information, we can depict the distribution of β̂1 under the null.
Specifically, Figure 4.1 shows the probability density function of β̂1 under the null
hypothesis, which is a normal probability density centered at zero with a standard
deviation of 0.55. We also refer to this as the distribution of β̂1 under the null
hypothesis. We introduced probability density functions in Section 3.2 and discuss
them in further detail in Appendix F starting on page 541.
Figure 4.1 illustrates the key idea of hypothesis testing. The actual value of β̂1
that we estimated is 2.2. That number seems pretty unlikely, doesn’t it? Under the
null hypothesis, most of the distribution of β̂1 is to the left of the β̂1 observed. We
formalize things in the next section, but intuitively, it’s reasonable to think that the
observed value β̂1 is so unlikely if the null is true that, well, the null hypothesis is
probably not true.
Now name a value of β̂1 that would lead us not to reject the null hypothesis. In
other words, name a value of β̂1 that is perfectly likely under the null hypothesis.
We show one such example in Figure 4.1: the line at β̂1 = −0.3. A value like this
would be completely unsurprising if the null hypothesis were true. Hence, if we
observed such a value for β̂1 , we would deem it to be consistent with the null
hypothesis, and we would not reject the null hypothesis.
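To see why 2.2 is "unlikely" and −0.3 is not, divide each by the standard error of 0.55 to express it in standard-deviation units under the null. This back-of-the-envelope check is easy to script (Python here, purely for illustration):

```python
# Under the null, beta1-hat is normally distributed with mean 0 and sd 0.55.
se = 0.55

z_actual = 2.2 / se    # the estimate we actually observed
z_example = -0.3 / se  # a value consistent with the null

# 2.2 sits 4 standard deviations from zero: wildly unlikely under the null.
# -0.3 sits about half a standard deviation from zero: entirely unsurprising.
```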
Significance level
Given that our strategy is to reject the null hypothesis when we observe a β̂1 that
is quite unlikely under the null, the natural question is: Just how unlikely does
β̂1 have to be? We get to choose the answer to this question. In other words, we
get to decide our standard for what we deem to be sufficiently unlikely to reject
the null hypothesis. We'll call this probability the significance level and denote
it with α (the Greek letter alpha). A significance level determines how unlikely a
result has to be under the null hypothesis for us to reject the null. A very common
significance level is 5 percent (meaning α = 0.05).
significance level: The probability of committing a Type I error for a hypothesis
test (i.e., how unlikely a result has to be under the null hypothesis for us to reject
the null).
[Figure: a normal probability density centered at zero with standard deviation 0.55. The actual value of β̂1 (2.2) lies far in the right tail; β̂1 = −0.3 marks an example value for which we would fail to reject the null.]
FIGURE 4.1: Distribution of β̂1 under the Null Hypothesis for Presidential Election Example
If we set α = 0.05, then we reject the null when we observe a β̂1 so large that
we would expect a 5 percent chance of seeing the observed value or higher under
only the null hypothesis. Setting α = 0.05 means that there is a 5 percent chance
that we would see a value high enough to reject the null hypothesis even when the
null hypothesis is true, meaning that α is the probability of making a Type I error.
If we want to be more cautious (in the sense of requiring a more extreme result
to reject the null hypothesis), we can choose α = 0.01, in which case we will reject
the null if we have a one percent or lower chance of observing a β̂1 as large as we
actually did if the null hypothesis were true.
Reducing α is not completely costless, however. As the probability of making
a Type I error decreases, the probability of making a Type II error increases. In
other words, the more we say we’re going to need really strong evidence to reject
the null hypothesis (which is what we say when we make α small), the more likely
it is that we’ll fail to reject the null hypothesis when the null hypothesis is wrong
(which is the Type II error).
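This trade-off is easy to see by simulation. The sketch below (illustrative Python, using a simple two-sided z-test on a sample mean with known standard deviation rather than a regression) rejects the null at α = 0.05 and at α = 0.01; under the null the rejection rate matches α, and under a true effect the stricter level rejects less often, which is exactly the extra Type II risk.

```python
import math
import random

def rejection_rate(true_mean, alpha, n=30, n_sims=2000, seed=11):
    """Share of simulated samples in which a two-sided z-test of
    H0: mean = 0 (known sd = 1) rejects at significance level alpha."""
    crit = {0.05: 1.96, 0.01: 2.58}[alpha]  # large-sample critical values
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        sample = [rng.gauss(true_mean, 1) for _ in range(n)]
        z = (sum(sample) / n) / (1 / math.sqrt(n))
        if abs(z) > crit:
            rejections += 1
    return rejections / n_sims

type1_rate = rejection_rate(true_mean=0.0, alpha=0.05)  # ~0.05 by design
power_05 = rejection_rate(true_mean=0.4, alpha=0.05)
power_01 = rejection_rate(true_mean=0.4, alpha=0.01)    # smaller: more Type II
```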
REMEMBER THIS
1. A null hypothesis is typically a hypothesis of no effect, written as H0 : β1 = 0.
• We reject a null hypothesis when the statistical evidence is inconsistent with the null
hypothesis. A coefficient estimate is statistically significant if we reject the null hypothesis
that the coefficient is zero.
• We fail to reject a null hypothesis when the statistical evidence is consistent with the null
hypothesis.
• Type I error occurs when we wrongly reject a null hypothesis.
• Type II error occurs when we wrongly fail to reject a null hypothesis.
2. An alternative hypothesis is the hypothesis we accept if we reject the null hypothesis.
• We choose a one-sided alternative hypothesis if theory suggests either β1 > 0 or β1 < 0.
• We choose a two-sided alternative hypothesis if theory does not provide guidance as to
whether β1 is greater than or less than zero.
3. The significance level (α) refers to the probability of a Type I error for our hypothesis test. We
choose the value of the significance level, typically 0.01 or 0.05.
4. There is a trade-off between Type I and Type II errors. If we lower α, we decrease the probability
of making a Type I error but increase the probability of making a Type II error.
Discussion Questions
1. Translate each of the following questions into a bivariate model with a null hypothesis that
could be tested. There is no single answer for each.
(a) “What causes test scores to rise?”
(b) “How can Republicans increase support among young voters?”
(c) “Why did unemployment spike in 2008?”
2. For each of the following, identify the null hypothesis, draw a picture of the distribution of β̂1
under the null, identify values of β̂1 that would lead you to reject or fail to reject the null, and
explain what it would mean to commit Type I and Type II errors in each case.
(a) We want to know if height increases wages.
(b) We want to know if gasoline prices affect the sales of SUVs.
(c) We want to know if handgun sales affect murder rates.
4.2 t Tests
The most common tool we use for hypothesis testing in OLS is the t test. There's
a quick rule of thumb for t tests: if the absolute value of β̂1 /se( β̂1 ) is bigger than 2,
reject the null hypothesis. (Recall that se( β̂1 ) is the standard error of our coefficient
estimate.) If not, don't. This section provides the logic and tools of t testing, which
will enable us to be more precise, but this rule of thumb is pretty much all there is
to it.
t test: A hypothesis test for hypotheses about a normal random variable with an
estimated standard error.
The t distribution
Dividing β̂1 by its standard error solves the scale problem but introduces
another challenge. We know β̂1 is normally distributed, but what is the distribution
of β̂1 /se( β̂1 )? The se( β̂1 ) term is also a random variable because it depends on the
estimated β̂1 . It's a tricky question, and now is a good time to turn to our friends
at Guinness Brewery for help. Really. Not for what you might think, but for work
they did in the early twentieth century demonstrating that the distribution of β̂1 /se( β̂1 )
[Figure: a wider null density centered at zero, with the actual value of β̂1 (2.2) marked.]
FIGURE 4.2: Distribution of β̂1 under the Null Hypothesis with Larger Standard Error for Presidential Election Example
follows a distribution we call the t distribution.2 The t distribution is bell shaped
like a normal distribution but has "fatter tails."3 We say it has fat tails because the
values on the far left and far right have higher probabilities than what we find for
the normal distribution. The extent of these chubby tails depends on the sample
size: as the sample size gets bigger, the tails melt down to become the same as the
normal distribution. What's going on is that we need to be more cautious about
rejecting the null because it is possible that by chance our estimate of se( β̂1 ) will
be too small, which will make β̂1 /se( β̂1 ) appear to be really big. When we have small
amounts of data, the issue is serious because we will be quite uncertain about
se( β̂1 ); when we have lots of data, we'll be more confident about our estimate
of se( β̂1 ) and, as we'll see, the fat tails of the t distribution fade away and the t
distribution and normal distribution become virtually indistinguishable.
t distribution: A distribution that looks like a normal distribution, but with fatter
tails. The exact shape of the distribution depends on the degrees of freedom. This
distribution converges to a normal distribution for large sample sizes.
2
Like many statistical terms, t distribution and t test have quirky origins. William Sealy Gosset
devised the test in 1908 when he was working for Guinness Brewery in Dublin. His pen name was
"Student." There already was an s test (now long forgotten), so Gosset named his test and distribution
after the second letter of his pen name. Technically, the standard error of β̂1 follows a statistical
distribution called a χ2 distribution, and the ratio of a normally distributed random variable and a χ2
random variable follows a t distribution. More details are in Appendix H on page 549. For now, just
note that the Greek letter χ (chi) is pronounced like "ky," as in Kyle.
3
That's a statistical term. Seriously.
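Why small samples need fatter tails can be checked directly by simulation. The hypothetical Python sketch below (a t statistic for a sample mean, which follows a t distribution with n − 1 degrees of freedom) shows that with only three observations, the noisy standard error pushes the t statistic past the large-sample cutoff of 1.96 far more than 5 percent of the time.

```python
import math
import random
import statistics

def share_beyond_196(n, n_sims=4000, seed=3):
    """Simulate t = mean / (s / sqrt(n)) under a true mean of zero and
    report how often |t| exceeds the large-sample cutoff of 1.96."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_sims):
        sample = [rng.gauss(0, 1) for _ in range(n)]
        t = statistics.mean(sample) / (statistics.stdev(sample) / math.sqrt(n))
        if abs(t) > 1.96:
            exceed += 1
    return exceed / n_sims

small = share_beyond_196(n=3)    # t with 2 d.f.: roughly 19%, not 5%
large = share_beyond_196(n=100)  # close to the normal distribution's 5%
```

Using a larger, t-based critical value for small samples restores the intended 5 percent Type I error rate.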
The specific shape of a t distribution depends on the degrees of freedom,
which is sample size minus the number of parameters. A bivariate OLS model
estimates two parameters (β̂ 0 and β̂1 ), which means, for example, that the degrees
of freedom for a bivariate OLS model with a sample size of 50 is 50 − 2 = 48.
Figure 4.3 displays three different t distributions; a normal distribution is
plotted in the background of each panel as a dotted line. Panel (a) shows a t
distribution with degrees of freedom (d.f.) equal to 2. The probability of observing
a value as high as 3 is higher for the t distribution than for the normal distribution.
The same thing goes for the probability of observing a value as low as –3. Panel
(b) shows a t distribution with degrees of freedom equal to 5. If we look closely,
we can see some chubbiness in the tails because the t distribution has higher
probabilities at, for example, values greater than 2. We have to look pretty closely
to see that, though. Panel (c) shows a t distribution with degrees of freedom equal
to 50. It is visually indistinguishable from a normal distribution and, in fact, covers
up the normal distribution so we cannot see it.
Critical values
Once we know the distribution of β̂1 /se( β̂1 ), we can come up with a critical value.
A critical value is the threshold for our test statistic. Loosely speaking, we reject
the null hypothesis if β̂1 /se( β̂1 ) (the test statistic) is greater than the critical value; if
β̂1 /se( β̂1 ) is below the critical value, we fail to reject the null hypothesis.
critical value: In hypothesis testing, a value above which a β̂1 would be so
unlikely that we reject the null.
More precisely, our specific decision rule depends on the nature of the
alternative hypothesis. Table 4.3 displays the specific rules. Rather than trying to
memorize these rules, it is better to concentrate on the logic behind them. If the
alternative hypothesis is two sided, then big values of β̂1 relative to the standard
error incline us to reject the null. We don’t particularly care if they are very positive
or very negative. If the alternative hypothesis is that β > 0, then only large, positive
values of β̂1 will incline us to reject the null hypothesis in favor of the alternative
hypothesis. Observing a very negative β̂1 would be odd, but certainly it would
not incline us to believe the alternative hypothesis that the true value of β is
greater than zero. Similarly, if the alternative hypothesis is that β < 0, then only
very negative values of β̂1 will incline us to reject the null hypothesis in favor of
the alternative hypothesis. We refer to the appropriate critical value in the table
because the actual value of the critical value will depend on whether the test is one
sided or two sided, as we discuss shortly.
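The logic behind these decision rules fits in a few lines. In this illustrative Python sketch (the function and argument names are mine, not the book's), `decide` applies the rule that matches the alternative hypothesis: compare |t| to the critical value for a two-sided test, but only the signed t statistic for a one-sided test.

```python
def decide(t_stat, critical_value, alternative):
    """Return True (reject H0) or False (fail to reject), following the
    decision rules for two-sided and one-sided alternative hypotheses."""
    if alternative == "two-sided":   # HA: beta1 != 0
        return abs(t_stat) > critical_value
    if alternative == "greater":     # HA: beta1 > 0
        return t_stat > critical_value
    if alternative == "less":        # HA: beta1 < 0
        return t_stat < -critical_value
    raise ValueError("unknown alternative")

# A very negative estimate rejects a two-sided null but lends no support
# to the alternative that beta1 > 0.
decide(-3.0, 1.96, "two-sided")  # True
decide(-3.0, 1.64, "greater")    # False
```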
The critical value for t tests depends on the t distribution and identifies the
point at which we decide the observed β̂1 /se( β̂1 ) is unlikely enough under the null
hypothesis to justify rejecting the null hypothesis.
Critical values depend on the significance level (α) we choose, our degrees of
freedom, and whether the alternative is one sided or two sided. Figure 4.4 depicts
critical values for various scenarios. We assume the sample size is large in each,
allowing us to use the normal approximation to the t distribution. Appendix G
explains the normal distribution in more detail. If you have not seen or do not
remember how to work with the normal distribution, it is important to review this
material.
Panel (a) of Figure 4.4 shows critical values for α = 0.05 and a two-sided
alternative hypothesis. The distribution of the t statistic is centered at zero under
the null hypothesis that β1 = 0. For a two-sided alternative hypothesis, we want to
FIGURE 4.4: Critical Values for Large-Sample t Tests, Using Normal Approximation to t Distribution
identify ranges that are far from zero and unlikely under the null hypothesis. For
α = 0.05, we want to find the range that constitutes the least-likely 5 percent of the
distribution under the null. This 5 percent is the sum of the 2.5 percent on the far
left and the 2.5 percent on the far right. Values in these ranges are not impossible,
but they are unlikely. For a large sample size, the critical values that mark off the
least-likely 2.5 percent regions of the distribution are −1.96 and 1.96.
Panel (b) of Figure 4.4 depicts another two-sided alternative hypothesis, this
time α = 0.01. Now we’re saying that to reject the null hypothesis, we’re going to
need to observe an even more unlikely β̂1 under the null hypothesis. The critical
value for a large sample size is 2.58. This number defines the point at which there
is a 0.005 probability (which is half of α) of being higher than the critical
value and a 0.005 probability of being less than the negative of it.
The picture and critical values differ a bit for a one-tailed test in which we look
only at one side of the distribution. In panel (c) of Figure 4.4, α = 0.05 and HA :
β1 > 0. Here 5 percent of the distribution is to the right of 1.64, meaning that we
will reject the null hypothesis in favor of the alternative that β1 > 0 if β̂1 /se( β̂1 ) > 1.64.
Note that the one-sided critical value for α = 0.05 is lower than the two-sided
critical value. One-sided critical values will always be lower for any given value
of α, meaning that it is easier to reject the null hypothesis for a one-sided
alternative hypothesis than for a two-sided alternative hypothesis. Hence, using
critical values based on a two-sided alternative is statistically cautious insofar as
we are less likely to appear overeager to reject the null if we use a two-sided
alternative.
Table 4.4 displays critical values of the t distribution for one-sided and
two-sided alternative hypotheses for common values of α. When the degrees of
freedom are very small (typically owing to a small sample size), the critical values
are relatively large. For example, with 2 degrees of freedom and α = 0.05, we
need to see a t stat above 2.92 to reject the null.4 With 10 degrees of freedom
and α = 0.05, we need to see a t stat above 1.81 to reject the null. With 100 degrees
of freedom and α = 0.05, we need a t stat above 1.66 to reject the null. As the
degrees of freedom get higher, the t distribution looks more and more like a
normal distribution; for infinite degrees of freedom, it is exactly like a normal
distribution, producing identical critical values. For degrees of freedom above
100, it is reasonable to use critical values from the normal distribution as a good
approximation.
We compare β̂1 /se( β̂1 ) to our critical value and reject if the magnitude is larger
than the critical value. We refer to the ratio β̂1 /se( β̂1 ) as the t statistic (or "t stat,"
as the kids say). The t statistic is so named because that ratio will be compared
to a critical value that depends on the t distribution in the manner just outlined.
Tests based on two-sided alternatives with α = 0.05 are very common. When
the sample size is large, the critical value for such a test is 1.96; hence the rule of
thumb is that a t statistic bigger than 2 is statistically significant at conventional
levels.
t statistic: The test statistic used in a t test. It is equal to ( β̂1 − β Null )/se( β̂1 ).
4
It’s unlikely that we would seriously estimate a model with 2 degrees of freedom. For a bivariate
OLS model, that would mean estimating a model with just four observations, which is a silly idea.
The column on the left shows that the t statistic from the homoscedastic
model for the coefficient on adult height is 4.23, meaning that β̂1 is 4.23 standard
deviations away from zero. The t statistic from the heteroscedastic model for
the coefficient on adult height is 4.33, which is essentially the same as in the
homoscedastic model. For simplicity, we’ll focus on the homoscedastic model
results.
Is this coefficient on adult height statistically significant? To answer that
question, we’ll need a critical value. To pick a critical value, we need to choose a
one-sided or two-sided alternative hypothesis and a significance level. Let’s start
with a two-sided test and α = 0.05.
For a t distribution, we also need to know the degrees of freedom. Recall that
to find the degrees of freedom, we take the sample size and subtract the number
of parameters estimated. The smaller the sample size, the more uncertainty we
have about our standard error estimate, hence the larger we make our critical
value. Here the sample size is 1,910 and we estimate two parameters, so the
degrees of freedom are 1,908. For a sample this large, we can reasonably use
the critical values from the last row of Table 4.4. The critical value for a two-sided
test with α = 0.05 and a high number for degrees of freedom is 1.96. Because
our t statistic of 4.23 is higher than 1.96, we reject the null hypothesis. It's
that easy.
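The same worked test takes three lines. With nearly 2,000 degrees of freedom the normal approximation is fine, so this illustrative Python sketch pulls the cutoff from the standard normal distribution rather than from a t table:

```python
from statistics import NormalDist

alpha = 0.05
t_stat = 4.23  # coefficient on adult height divided by its standard error

# Two-sided cutoff: the point leaving alpha/2 in each tail, about 1.96.
crit = NormalDist().inv_cdf(1 - alpha / 2)

reject = abs(t_stat) > crit  # True: reject H0 that beta1 = 0
```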
REMEMBER THIS
1. We use a t test to test a null hypothesis such as H0 : β1 = 0. The steps are as follows:
(a) Choose a one-sided or two-sided alternative hypothesis.
(b) Set a significance level, α, usually equal to 0.01 or 0.05.
(c) Find a critical value based on the t distribution. This value depends on α, whether the
alternative hypothesis is one sided or two sided, and the degrees of freedom (equal to
sample size minus number of parameters estimated).
• For a one-sided alternative hypothesis that β1 < 0, we reject the null hypothesis if
β̂1 /se( β̂1 ) < −1 times the critical value.
2. We can test any hypothesis of the form H0 : β1 = β Null by using ( β̂1 − β Null )/se( β̂1 ) as the
test statistic for a t test.
Review Questions
1. Refer to the results in Table 4.2 on page 95.
(a) What is the t statistic for the coefficient on change in income?
(b) What are the degrees of freedom?
(c) What is the critical value for a two-sided alternative hypothesis and α = 0.01? Do we
accept or reject the null?
(d) What is the critical value for a one-sided alternative hypothesis and α = 0.05? Do we
accept or reject the null?
2. Which is bigger: the critical value from one-sided tests or two-sided tests? Why?
3. Which is bigger: the critical value from a large sample or a small sample? Why?
4.3 p Values
The p value is a useful by-product of the hypothesis testing framework. It indicates the probability of observing a coefficient as extreme as we actually did if the null hypothesis were true. In this section, we explain how to calculate p values and why they're useful.

p value The probability of observing a coefficient as extreme as we actually observed if the null hypothesis were true.

As a practical matter, the thing to remember is that we reject the null if the p value is less than α. Our rule of thumb here is “small p value means reject”: low p values are associated with rejecting the null, and high p values are associated with failing to reject the null hypothesis.
Although p values can be calculated for any null hypothesis, we focus on the
most common null hypotheses in which β1 = 0. Most statistical software reports
a two-sided p value, which indicates the probability that a coefficient is larger in
magnitude (either positively or negatively) than the coefficient we observe.
Panel (a) of Figure 4.5 shows the p value calculation for the β̂1 estimate in the wage and height example we discussed on page 104. There the t statistic is 4.23 and the two-sided p value is 0.0000244; panel (b) shows a case in which the t statistic is 1.73 and the p value is 0.084.

[Figure 4.5: Two-sided p values as tail areas of the probability density of β̂1/se(β̂1). Panel (a): t statistic 4.23, p value 0.0000244. Panel (b): t statistic 1.73, p value 0.084.]
5 Here we are calculating two-sided p values, which are the output most commonly reported by statistical software. If β̂1/se(β̂1) is greater than zero, the two-sided p value is twice the probability of observing a value greater than that; if β̂1/se(β̂1) is less than zero, the two-sided p value is twice the probability of observing a value less than that. A one-sided p value is simply half the two-sided p value.
6 For a two-sided p value, we want to know the probability of observing a t statistic higher in magnitude than the absolute value of the t statistic we actually observe under the null hypothesis. This is 2 × [1 − Φ(|β̂1/se(β̂1)|)], where Φ is the capital Greek letter phi (pronounced like the “fi” in “Wi-Fi”) and Φ() indicates the normal cumulative density function (CDF). (We see the normal CDF in our discussion of statistical power in Section 4.4; Appendix G on page 543 supplies more details.) If the alternative hypothesis is HA : β1 > 0, the p value is the probability of observing a t statistic higher than the observed t statistic under the null hypothesis: 1 − Φ(β̂1/se(β̂1)). If the alternative hypothesis is HA : β1 < 0, the p value is the probability of observing a t statistic less than the observed t statistic under the null hypothesis: Φ(β̂1/se(β̂1)).
REMEMBER THIS
The p value is the probability of observing a coefficient as large in magnitude as actually observed if
the null hypothesis is true.
1. The lower the p value, the less consistent the estimated β̂1 is with the null hypothesis.
2. We reject the null hypothesis if the p value is less than α.
3. A p value can be useful to indicate the weight of evidence against a null hypothesis.
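The calculation in footnote 6 can be sketched in Python (the helper names are ours; `NormalDist().cdf` plays the role of Φ, using the normal approximation appropriate for large samples):

```python
from statistics import NormalDist

def two_sided_p(t_stat):
    # 2 * (1 - Phi(|t|)): twice the upper-tail area beyond the observed |t|
    return 2 * (1 - NormalDist().cdf(abs(t_stat)))

def one_sided_p(t_stat):
    # a one-sided p value is half the two-sided p value
    return two_sided_p(t_stat) / 2
```

A t statistic of 1.73 gives a two-sided p value of about 0.084, as in panel (b) of Figure 4.5, and a t statistic of 4.23 gives a p value well below 0.0001.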
4.4 Power
The hypothesis testing infrastructure we’ve discussed so far is designed to deal
with the possibility of Type I error, which occurs when we reject a null hypothesis
that is actually true. When we set the significance level, we are setting the
probability of making a Type I error. Obviously, we’d really rather not believe
the null is false when it is true.
Type II errors aren’t so hot either, though. We make a Type II error when
β is really something other than zero and we fail to reject the null hypothesis
that β is zero. In this section, we explain statistical power, the statistical concept
associated with Type II errors. We discuss the importance and meaning of Type II
error and how power and power curves help us understand our ability to avoid such
error.
[Figure: three sampling distributions of β̂1, centered on 1, 2, and 3, each with the critical value 2.32 marked. Panel (a): probability of rejecting the null is 0.09; probability of Type II error is 0.91. Panel (b): 0.37 and 0.63. Panel (c): 0.75 and 0.25.]
FIGURE 4.6: Statistical Power for Three Values of β1 Given α = 0.01 and a One-Sided Alternative Hypothesis
Suppose we test H0 : β1 = 0 against the one-sided alternative hypothesis HA : β1 > 0, with α = 0.01. In this case, the critical value is 2.32, which means that we reject the null hypothesis if we observe β̂1/se(β̂1) greater than 2.32. For simplicity, we'll suppose se(β̂1) is 1.
Panel (a) of Figure 4.6 displays the probability of Type II error if the true
value of β equals 1. In this case, the distribution of β̂1 will be centered at 1. Only
9 percent of this distribution is to the right of 2.32, meaning that we have only a
9 percent chance of rejecting the null hypothesis and a 91 percent chance of failing
to reject the null hypothesis. In other words, the probability of Type II error is
0.91.7 This means that even though the null hypothesis actually is false—remember,
β1 = 1 in this example, not 0—we have a roughly 9 in 10 chance of committing a
Type II error. In this example, our hypothesis test is not particularly able to provide
statistically significant results when the true value of β is 1.
Panel (b) of Figure 4.6 displays the probability of a Type II error if the true
value of β equals 2. In this case, the distribution of β̂1 will be centered at 2.
Here 37 percent of the distribution is to the right of 2.32, and therefore, we have
a 63 percent chance of committing a Type II error. Better, but not by much:
even though β1 > 0, we have a roughly 2 in 3 chance of committing a Type
II error.
Panel (c) of Figure 4.6 displays the probability of a Type II error if the true
value of β equals 3. In this case, the distribution of β̂1 will be centered at 3.
Here 75 percent of the distribution is to the right of 2.32, meaning there is a
25 percent probability of committing a Type II error. We’re making progress,
but still far from perfection. In other words, the true value of β must be near or
above 3 before we have a 75 percent chance of rejecting the null hypothesis when
we should.
These examples illustrate why we use the somewhat convoluted “fail to reject
the null” terminology. That is, when we observe a β̂1 less than the critical value,
it is still quite possible that the true value is not zero. Failure to find an effect is
not the same as finding no effect.
power The ability of our data to reject the null. A high-powered statistical test will reject the null with a very high probability when the null is false; a low-powered statistical test will reject the null with a low probability when the null is false.

An important statistical concept related to Type II error is power. The statistical definition of power differs from how we use the word in ordinary conversation. Power in the statistical sense refers to the ability of our data to reject the null hypothesis. A high-powered statistical test will reject the null with a very high probability when the null is false; a low-powered statistical test will reject the null with a low probability when the null is false. Think of statistical power like the power of a microscope. Using a high-powered microscope allows us to distinguish small differences in an object, differences that are there but invisible to us when we look through a low-powered microscope.

The logic of (statistical) power is pretty simple: power is 1 − Pr(Type II error)
for a given true value of β. A key characteristic of power is that it varies with the
true value of β. In the example in Figure 4.6, panel (a) shows that the power of the
test is 0.09 when β = 1. Panel (b) shows that the power rises to 0.37 when β = 2,
and panel (c) shows that the power is 0.75 when β = 3. Calculating power can be
a bit clunky; we leave the details to Section 14.3.
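The power values above can be reproduced directly from the normal CDF. A Python sketch (the function name is ours; it assumes se(β̂1) = 1 and the one-sided critical value of 2.32 from the example):

```python
from statistics import NormalDist

def power(beta1, se=1.0, critical=2.32):
    # Power = Pr(beta1-hat / se > critical) when beta1-hat is centered on the
    # true beta1, i.e., 1 - Phi(critical - beta1/se)
    return 1 - NormalDist().cdf(critical - beta1 / se)
```

Here power(1), power(2), and power(3) return roughly 0.09, 0.37, and 0.75, the values shown in panels (a) through (c) of Figure 4.6.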
Since we don’t know the true value of β (if we did, we would not need
hypothesis testing!), it is common to think about power for a range of possible true
power curve values. We can do this with a power curve, which characterizes the probability
Characterizes the
probability of rejecting 7
the null for each possible Calculating the probability of a Type II error follows naturally from the properties of the normal
distribution described in Appendix G. Using the notation from that appendix, Pr(Type II error) =
value of the parameter.
Φ(2.32 − 1) = 0.09. See also Section 14.3 for more detail on these kinds of calculations.
[Figure 4.7: Power curves: the probability of rejecting the null (vertical axis) for possible true values of β1 from 0 to 10 (horizontal axis), for se(β̂1) = 1.0 (solid line) and se(β̂1) = 2.0 (dashed line), with α = 0.01 and a one-sided alternative hypothesis.]
of rejecting the null for a range of possible values of the parameter of interest
(which is, in our case, β1 ). Figure 4.7 displays two power curves. The solid line on
top is the power curve for when se( β̂1 ) = 1.0 and α = 0.01. On the horizontal
axis are hypothetical values of β1 . The line shows the probability of rejecting
the null for a one-tailed test of H0 : β1 = 0 versus HA : β1 > 0 for α = 0.01
and a sample large enough to permit us to use the normal approximation to the
t distribution. To reject the null under these conditions requires a t stat greater
than 2.32 (see Table 4.4). This power curve plots for each possible value of β1
the probability that β̂1/se(β̂1) (which in this case is β̂1/1.0) is greater than 2.32. This
curve includes the values we calculated in Figure 4.6 but now also covers all
values of β1 between 0 and 10. We can see, for example, that the probability
of rejecting the null when β = 2 is 0.37, which is what we saw in panel (b) of
Figure 4.6.
Look first at the values of β1 that are above zero, but still small. For these
values, the probability of rejecting the null is quite small. In other words, even
though the null hypothesis is false for these values (since β1 > 0), we’re unlikely
to reject the null hypothesis that β1 = 0. As β1 increases, this probability increases,
and by around β1 = 4, the probability of rejecting the null approaches 1.0. That
is, if the true value of β1 is 4 or bigger, we will reject the null with almost
certainty.
The dashed line in Figure 4.7 displays a second power curve for which the
standard error is bigger, here equal to 2.0. The significance level is the same
as for the first power curve, α = 0.01. We immediately see that the statistical
power is lower. For every possible value of β1 , the probability of rejecting the null
hypothesis is lower than when se( β̂1 ) = 1.0 because there is more uncertainty with
the higher standard error for the estimate. For this standard error, the probability
of rejecting the null when β1 equals 2 is 0.09. So even though the null is false, we
will have a very low probability of rejecting it.8
Figure 4.7 illustrates an important feature of statistical power: the higher the
standard error of β̂1 , the lower the power. This implies that anything that increases
se( β̂1 ) (see page 65) will lower power. Since a major determinant of standard
errors is sample size, a useful rule of thumb is that hypothesis tests based on large
samples are usually high in power and hypothesis tests based on small samples
are usually low in power. In Figure 4.7, we can think of the solid line as the power
curve for a large sample and the dashed line as the power curve for a smaller
sample. More generally, though, statistical power is a function of the variance of
β̂1 and all the factors that affect it.
null result A finding in which the null hypothesis is not rejected.

Power is particularly relevant when someone presents a null result, or a finding in which the null hypothesis is not rejected. For example, someone may say class size is not related to test scores or that an experimental treatment does not work. In this case, we need to ask what the power of the test was. It could be, for example, that the sample size is very small, such that the probability of rejecting the null is small even for substantively large values of β1.
What can we do to increase power? If we can lower the standard errors of
our coefficients, we should do that, of course, but usually that’s not an option.
We could also choose a higher value of α, which determines our statistical
significance level. Doing so would make it easier to reject a null hypothesis. The
catch, though, is that doing so would also increase the probability of a Type I
error. In other words, there is an inherent trade-off between Type I and Type
II error.
Figure 4.8 illustrates this trade-off. Panel (a) shows the distribution of β̂1
when the null hypothesis that β1 = 0 is true. The distribution is centered at 0 (for
8 What happens when β1 actually is zero? In this case, the null hypothesis is true and power isn't the
right concept. Instead, the probability of rejecting the null here is the probability of rejecting the null
when it is true. In other words, the probability of rejecting the null when β1 = 0 is the probability of
committing a Type I error, which is the α level we set.
[Figure 4.8: Panel (a): the distribution of β̂1 when the null hypothesis is true, with the critical value of 2.32 for α = 0.01 marked and the Type I error region shaded. Panel (b): the distribution of β̂1 when β1 = 1, with the same critical value of 2.32 marked.]
simplicity, we use an example with the standard error of β̂1 = 1). We use α = 0.01
and a one-sided test, so the critical value is 2.32. The probability of a Type I error is
one percent, as highlighted in the figure. Panel (b) shows an example when the null
hypothesis is false—in this case, β1 = 1. We still use α = 0.01 and a one-sided test,
so the critical value remains 2.32. Here every realization of β̂1 to the left of 2.32
will produce a Type II error because for those realizations we will fail to reject the
null even though the null hypothesis is false. If we wanted to lower the probability
of a Type II error in panel (b), we could chose a higher value of α, which would
shift the critical value to the left (see Table 4.4 on page 103). A higher α would
also move the critical value in panel (a) to the left, increasing the probability of
a Type I error. If we wanted to lower the probability of a Type I error, we could
chose lower value of α, which would shift the critical value to the right in both
panels, lowering the probability of a Type I error in panel (a) but increasing the
probability of a Type II error in panel (b).
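The trade-off can be made concrete with a small calculation. This Python sketch (the helper `error_probs` is ours) uses a one-sided test, se(β̂1) = 1, and a true β1 = 1 as in panel (b):

```python
from statistics import NormalDist

nd = NormalDist()

def error_probs(alpha, true_beta=1.0, se=1.0):
    """Type I and Type II error probabilities for a one-sided test of H0: beta1 = 0."""
    critical = nd.inv_cdf(1 - alpha)           # critical value shifts left as alpha rises
    type2 = nd.cdf(critical - true_beta / se)  # Pr(fail to reject) when the null is false
    return alpha, type2                        # Type I error probability equals alpha

# Raising alpha from 0.01 to 0.05 raises Type I error but lowers Type II error:
strict, lenient = error_probs(0.01), error_probs(0.05)
```

Here the stricter test has Type I error 0.01 and Type II error of about 0.91, while the more lenient test has Type I error 0.05 and Type II error of about 0.74.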
4.5 Straight Talk about Hypothesis Testing 115
REMEMBER THIS
Statistical power refers to the probability of rejecting a null hypothesis for a given value of β1 .
1. A power curve shows the probability of rejecting the null for a range of possible values
of β1 .
2. Large samples typically produce high-power statistical tests. Small samples typically produce
low-power statistical tests.
3. It is particularly important to discuss power in the presentation of null results that fail to reject
the null hypothesis.
4. There is an inherent trade-off between Type I and Type II errors.
The p values we discussed previously are helpful in this, as are the confidence
intervals we’ll discuss shortly.
Fourth, hypothesis tests and their focus on statistical significance can distract us from substantive significance. A substantively significant coefficient is one that, well, matters; it indicates that the independent variable has a meaningful effect on the dependent variable. Deciding how big a coefficient must be for us to believe it matters can be a bit subjective. However, this is a conversation we need to have. And statistical significance is not always a good guide. Remember that t stats depend a lot on se(β̂1), and se(β̂1) in turn depends on sample size and other factors (see page 65). If we have a really big sample, and these days it is increasingly common to have sample sizes in the millions, the standard error will be tiny and our t stat might be huge even for a substantively trivial β̂1 estimate. In these cases, we may reject the null even when the β̂1 coefficient suggests a minor effect.

substantive significance If a reasonable change in the independent variable is associated with a meaningful change in the dependent variable, the effect is substantively significant. Some statistically significant effects are not substantively significant, especially for large data sets.

For example, suppose that in our height and wages example we last discussed on page 104 we had 20 million observations (instead of roughly 2,000 observations). The standard error on β̂1 would be one one-hundredth as big. So while a coefficient of 0.41 was statistically significant in the data we had, a coefficient of 0.004 would be statistically significant in the larger data set. Our results would suggest that an additional inch in height is associated with 0.4 cents per hour, which, while statistically significant, does not really matter that much. In other words, we could describe such an effect as statistically, but not substantively, significant. This is more likely to happen when we have large data sets, something that has become increasingly likely in an era of big data.
Or, conversely, we could have a small sample size that would lead to a large
standard error on β̂1 and, say, to a failure to reject the null. But the coefficient could
be quite big, suggesting a perhaps meaningful relationship. Of course we wouldn’t
want to rush to conclude that the effect is really big, but it’s worth appreciating
that the data in such a case is indicating the possibility of a substantively
significant relationship. In this instance, getting more data would be particularly
valuable.
REMEMBER THIS
Statistical significance is not the same as substantive significance.
[Figure 4.9: Probability densities illustrating the bounds of a confidence interval. Panel (a): the value of the upper bound of a 95% confidence interval is the value of β1 such that we would see the observed β̂1 or lower 2.5% of the time. Panel (b): the value of the lower bound is the value of β1 such that we would see the observed β̂1 or higher 2.5% of the time; if the true value of β1 is 0.214, we would see a β̂1 equal to or greater than 0.41 only 2.5% of the time.]
The 95 percent confidence interval in this example ranges from 0.214 to 0.606 and includes the values of β1 that plausibly generate
the β̂1 we actually observed.9
Figure 4.9 does not tell us how to calculate the upper and lower bounds of
a confidence interval. A confidence interval is β̂1 − critical value × se( β̂1 ) to
β̂1 + critical value × se( β̂1 ). For large samples and α = 0.05, the critical value
is 1.96, giving rise to the rule of thumb that a 95 percent confidence interval is
approximately β̂1 ± 2 × the standard error of β̂1 . In our example, where β̂1 = 0.41
and se( β̂1 ) = 0.1, we can be 95 percent confident that the true value is between
0.214 and 0.606.
Table 4.6 shows some commonly used confidence intervals for large sample
sizes. The large sample size allows us to use the normal distribution to calculate
9 Confidence intervals can also be defined with reference to random sampling. Just as an OLS coefficient estimate is random, so is a confidence interval. And just as a coefficient may randomly be far from the true value, so may a confidence interval fail to cover the true value. The point of confidence intervals is that it is unlikely that a confidence interval will fail to include the true value. For example, if we draw many samples from some population, 95 percent of the confidence intervals generated from these samples will include the true coefficient.
4.6 Confidence Intervals 119
TABLE 4.6 Commonly Used Confidence Intervals for Large Samples

Confidence level  Critical value  Confidence interval     Example (β̂1 = 0.41, se(β̂1) = 0.1)
90%               1.64            β̂1 ± 1.64 × se(β̂1)    0.41 ± 1.64 × 0.1 = 0.246 to 0.574
95%               1.96            β̂1 ± 1.96 × se(β̂1)    0.41 ± 1.96 × 0.1 = 0.214 to 0.606
99%               2.58            β̂1 ± 2.58 × se(β̂1)    0.41 ± 2.58 × 0.1 = 0.152 to 0.668
critical values. A 90 percent confidence interval for our example is 0.246 to 0.574.
The 99 percent confidence interval for a β̂1 = 0.41 and se( β̂1 ) = 0.1 is 0.152
to 0.668. Notice that the higher the confidence level, the wider the confidence
interval.
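The intervals in Table 4.6 follow directly from the formula. A Python sketch (the function name is ours; it uses normal critical values, appropriate for large samples):

```python
from statistics import NormalDist

def conf_interval(beta_hat, se, level=0.95):
    # beta1-hat +/- critical value * se(beta1-hat)
    critical = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return beta_hat - critical * se, beta_hat + critical * se
```

Here conf_interval(0.41, 0.1) returns roughly (0.214, 0.606), and raising level to 0.99 widens the interval to roughly (0.152, 0.668).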
Confidence intervals are closely related to hypothesis tests. Because confi-
dence intervals tell us the range of possible true values that are consistent with
what we’ve seen, we simply need to note whether the confidence interval on our
estimate includes zero. If it does not, zero was not a value that would be likely to
produce the data and estimates we observe; we can therefore reject H0 : β1 = 0.
Confidence intervals do more than hypothesis tests, though, because they
provide information on the likely location of the true value. If the confidence
interval is mostly positive but just barely covers zero, we would fail to reject the
null hypothesis; we would also recognize that the evidence suggests the true value
is likely positive. If the confidence interval does not cover zero but is restricted to a
region of substantively unimpressive values of β1 , we can conclude that while the
coefficient is statistically different from zero, it seems unlikely that the true value
is substantively important. Baicker and Chandra (2017) provide a useful summary:
“There is also a key difference between ‘no evidence of effect’ and ‘evidence of no
effect.’ The first is consistent with wide confidence intervals that include zero as
well as some meaningful effects, whereas the latter refers to a precisely estimated
zero that can rule out effects of meaningful magnitude.”
REMEMBER THIS
1. A confidence interval indicates a range of values in which the true value is likely to be, given
the data.
• The lower bound of a 95 percent confidence interval will be a value of β1 such that there is
less than a 2.5 percent probability of observing a β̂1 as high as the β̂1 actually observed.
• The upper bound of a 95 percent confidence interval will be a value of β1 such that there is
less than a 2.5 percent probability of observing a β̂1 as low as the β̂1 actually observed.
2. A confidence interval is calculated as β̂1 ± t critical value × se( β̂1 ), where the t critical value
is the critical value from the t table. It depends on the sample size and α, the significance level.
For large samples and α = 0.05, the t critical value is 1.96.
Conclusion
“Statistical inference” refers to the process of reaching conclusions based on the
data. Hypothesis tests, particularly t tests, are central to inference. They’re pretty
easy. Honestly, a well-trained parrot could probably do simple t tests. Look at the
damn t statistic! Is it bigger than 2? Then squawk “reject”; if not, squawk “fail to
reject.”
We can do much more, though. With p values and confidence intervals, we
can characterize our findings with some nuance. With power calculations, we
can recognize the likelihood of failing to see effects that are there. Taken as
a whole, then, these tools help us make inferences from our data in a sensible
way.
After reading and discussing this chapter, we should be able to do the
following:
• Section 4.6: Explain confidence intervals and the rule of thumb for
approximating a 95 percent confidence interval.
Further Reading
Ziliak and McCloskey (2008) provide a book-length attack on the hypothesis
testing framework. Theirs is hardly the first such critique, but it may be the
most fun.
An important, and growing, school of thought in statistics called Bayesian
statistics produces estimates of the following form: “There is an 8.2 percent
probability that β is less than zero.” Happily, there are huge commonalities across
Bayesian statistics and the approach used in this (and most other) introductory
books. Simon Jackman’s Bayesian Analysis for the Social Sciences (2009) is an
excellent guide to Bayesian statistics.
Key Terms
Alternative hypothesis (94)
Confidence interval (117)
Confidence levels (117)
Critical value (101)
Hypothesis testing (91)
Null hypothesis (92)
Null result (113)
One-sided alternative hypothesis (94)
p Value (106)
Point estimate (117)
Power (111)
Power curve (111)
Significance level (95)
Statistically significant (93)
Substantive significance (116)
t Distribution (99)
t Statistic (104)
t Test (98)
Two-sided alternative hypothesis (94)
Type I error (93)
Type II error (93)
Computing Corner
Stata
2. To find the critical value from a normal distribution for a given α, use the
inverse normal function in Stata. For a two-sided test with α = 0.05, type
display invnormal(1-0.05/2). For a one-sided test with α = 0.01,
type display invnormal(1-0.01).
10 This is referred to as an inverse t function because we provide a percent (the α) and it returns a
value of the t distribution for which α percent of the distribution is larger in magnitude. For a
non-inverse t function, we typically provide some value for t and the function tells us how much of
the distribution is larger in magnitude. The tail part of the function command indicates that we’re
dealing with the far ends of the distribution.
11 The ttail function in Stata reports the probability of a t distributed random variable being higher
than a t statistic we provide (which we denote here as TSTAT). This syntax contrasts to the convention
4. Use the following code to create a power curve for α = 0.01 and a
one-sided alternative hypothesis covering 71 possible values of the true
β1 from 0 to 7. We discuss calculation of power in Section 14.3.
set obs 71
gen BetaRange = (_n-1)/10 /* Sequence of possible betas from 0 to 7 */
scalar stderrorBeta = 1.0 /* Standard error of beta-hat */
gen PowerCurve = normal(BetaRange/stderrorBeta - 2.32)
/* Probability t statistic is greater than critical value */
/* for each value in BetaRange/stderrorBeta */
graph twoway (line PowerCurve BetaRange)
R

2. To find the critical value from a normal distribution for a given α, use the inverse normal function in R. For a two-sided test, type qnorm(1-a/2). For a one-sided test, type qnorm(1-a).
for normal distribution functions, which typically report the probability of being less than the t
statistic we provide.
2.5% 97.5%
(Intercept) 86.605 158.626
donuts 4.878 13.329
5. Use the following code to create a power curve for α = 0.01 and a
one-sided alternative hypothesis covering 71 possible values of the true
β1 from 0 to 7. We discuss calculation of power in Section 14.3.
Exercises
1. Persico, Postlewaite, and Silverman (2004) analyzed data from the
National Longitudinal Survey of Youth (NLSY) 1979 cohort to assess the
relationship between height and wages for white men. Here we explore the
relationship between height and wages for the full sample, which includes
men and women and all races. The NLSY is a nationally representative
sample of 12,686 young men and women who were 14 to 22 years old
when first surveyed in 1979. These individuals were interviewed annually
through 1994 and biannually after that. Table 4.7 describes the variables
from heightwage.dta we’ll use for this question.
(a) Create a scatterplot of adult wages against adult height. What does
this plot suggest about the relationship between height and wages?
TABLE 4.7 Variables for Height and Wage Data in the United States
Variable name Description
(c) Assess whether the null hypothesis that the coefficient on height81
equals 0 is rejected at the 0.05 significance level for one-sided and
for two-sided hypothesis tests.
(a) Suppose, as in the example, that only one sheep in the treatment
group died and all sheep in the control group died. Is the treatment
coefficient statistically significant? What is the (two-sided) p value?
What is the confidence interval?
(b) Suppose now that only one sheep in the treatment group died and
only 10 sheep in the control group died. Is the treatment coefficient
statistically significant? What is the (two-sided) p value? What is the
confidence interval?
(c) Continue supposing that only one sheep in the treatment group died.
What is the minimal number of sheep in the control group that need
to die for the treatment effect to be statistically significant? (Solve
by trial and error.)
3. Voters care about the economy, often more than any other issue. It is
not surprising, then, that politicians invariably argue that their party is
best for the economy. Who is right? In this exercise, we’ll look at the
U.S. economic and presidential party data in PresPartyEconGrowth.dta to
test if there is any difference in economic performance between Repub-
lican and Democratic presidents. We will use two different dependent
variables:
• ChangeGDPpc is the change in real per capita GDP in each year from
1962 to 2013 (in inflation-adjusted U.S. dollars, available from the
World Bank).
previous year was a Republican. The idea is that the president’s policies do
not take effect immediately, so the economic growth in a given year may
be influenced by who was president the year before.12
(c) Choose an α level and alternative hypothesis, and indicate for each
model above whether you accept or reject the null hypothesis.
(d) Explain in your own words what the p value means for the
LagDemPres variable in each model.
(e) Create a power curve for the model with ChangeGDPpc as the
dependent variable for α = 0.01 and a one-sided alternative hypoth-
esis. Explain what the power curve means by indicating what the
curve means for true β1 = 200, 400, and 800. Use the code in the
Computing Corner, but with the actual standard error of β̂1 from the
regression output.13
(f) Discuss the implications of the power curve for the interpretation of
the results for the model in which ChangeGDPpc was the dependent
variable.
4. Run the simulation code in the initial part of the education and salary
question from the Exercises in Chapter 3 (page 87).
12 Other ways of considering the question are addressed in the large academic literature on presidents
and the economy. See, among others, Bartels (2008), Campbell (2011), Comiskey and Marsh (2012),
and Blinder and Watson (2013).
13 In Stata, start with the following lines to create a list of possible true values of β1 and then set the
“stderrorBeta” variable to be equal to the actual standard error of β̂1 :
clear
set obs 201
gen BetaRange = 4*(_n-1) /* Sequence of true beta values from 0 to 800 */
Note: The first line clears all data; you will need to reload the data set if you wish to run additional
analyses. If you have created a syntax file, it will be easy to reload and re-run what you have done
so far.
In R, start with the code in the Computing Corner and set BetaRange = seq(0, 800, 4).
(b) Generate two-sided p values for the coefficient on education for each
simulation. What are the minimal and maximal values of these p
values?
(d) Re-run the simulations, but set the true value of βEducation to zero. Do
this for 500 simulations, and report what percent of time we reject
the null at the α = 0.05 level with a two-sided alternative hypothesis.
The code provided in Chapter 3 provides tips on how to do this.
5. We will continue the analysis of height and wages in Britain from the
Exercises in Chapter 3 (page 88).
(a) Estimate the model with income at age 33 as the dependent variable
and height at age 33 as the independent variable. (Exclude observa-
tions with wages above 400 British pounds per hour and height less
than 40 inches.) Interpret the t statistics on the coefficients.
(c) Show how to calculate the 95 percent confidence interval for the
coefficient on height.
(f) Limit the sample size to the first 800 observations. Do we accept or
reject the null hypothesis that β1 = 0 for α = 0.01 and a two-sided
alternative? Explain if/how/why this answer differs from the earlier
hypothesis test about β1 .14
14 In Stata, do this by adding & _n < 800 to the end of the “if” statement at the end of the “regress”
command. In R, create and use a new data set with the first 800 observations (e.g., dataSmall =
data[1:800,]).
5 Multivariate OLS: Where the Action Is
Salest = β0 + β1 Temperaturet + εt
where Salest is sales in billions of dollars during month t and Temperaturet is the
average temperature in month t. Figure 5.1 shows monthly data for New Jersey for
about 20 years. We’ve also added the fitted line from a bivariate regression. It’s
negative, implying that people shop less as temperatures rise.
Is that the full story? Could there be endogeneity? Is something correlated
with temperature and associated with more shopping? Think about shopping in the
United States. When is it at its most frenzied? Right before Christmas. Something
that happens in December . . . when it’s cold. In other words, we think there is
something in the error term (Christmas shopping season) that is correlated with
temperature. That’s a recipe for endogeneity.
In this chapter, we learn how to control for other variables so that we can avoid
(or at least reduce) endogeneity and thereby see causal associations more clearly.
Multivariate OLS is the tool that makes this possible. In our shopping example,
multivariate OLS helps us see that once we account for the December effect, higher temperatures are associated with higher sales.

multivariate OLS: OLS with multiple independent variables.

Multivariate OLS refers to OLS with multiple independent variables. We’re simply going to add variables to the OLS model developed in the previous
FIGURE 5.1: Monthly Retail Sales and Temperature in New Jersey from 1992 to 2013 [scatterplot; vertical axis: monthly retail sales (billions of $); horizontal axis: temperature]
chapters. What do we gain? Two things: bias reduction and precision. When we
reduce bias, we get more accurate parameter estimates because the coefficient
estimates are on average closer to the true value. When we increase precision, we
reduce uncertainty because the distribution of coefficient estimates is more closely
clustered toward the true value.
In this chapter, we explain how to use multivariate OLS to fight endogeneity.
Section 5.1 introduces the model and shows how controlling for multiple variables
can lead to better estimates. Section 5.2 discusses omitted variable bias, which
occurs when we fail to control for variables that affect Y and are correlated with
included variables. Section 5.3 shows how the omitted variable bias framework
can be used to understand what happens when we use poorly measured variables.
Section 5.4 explains the precision of our estimates in multivariate OLS. Section 5.5
demonstrates how standardizing variables can make OLS coefficients more
comparable. Section 5.6 shows formal tests of whether coefficients differ from
each other. The technique illustrated can be used for any hypothesis involving
multiple coefficients.
5.1 Using Multivariate OLS to Fight Endogeneity
FIGURE 5.2: Monthly Retail Sales and Temperature in New Jersey with December Indicated [two scatterplots; vertical axes: monthly retail sales (billions of $); horizontal axes: temperature; panel (a) marks December sales against other months; panel (b) shows December sales minus $5 billion against other months]
considered the relationship between temperature and sales. That is what we’ve
done in panel (b) of Figure 5.2, where each December observation is now $5
billion lower than before. When we look at the data this way, the negative
relationship between temperature and sales seems to go away; it may even be that
the relationship is now positive.
In essence, multivariate OLS nets out the effects of other variables when it
controls for additional variables. When we actually implement multivariate OLS,
we (or, really, computers) do everything at once, controlling for the December
effect while estimating the effect of temperature even as we are simultaneously
controlling for temperature while estimating the December effect.
Table 5.1 shows the results for both a bivariate and a multivariate model for
our sales data. In the bivariate model, the coefficient on temperature is –0.019. The
estimate is statistically significant because the t statistic is above 2. The implication
is that people shop less as it gets warmer, or in other words, folks like to shop in
the cold. When we use multivariate OLS to control for December (by including
the December variable that equals 1 for observations from the month of December
and 0 for all other observations), the coefficient on temperature becomes positive
and statistically significant. Our conclusion has flipped! Heat brings out the cash.
Whether this relationship exists because people like shopping when it’s warm or
are going out to buy swimsuits and sunscreen, we can’t say. We can say, though,
that our initial bivariate finding that people shop less as the temperature rises is
not robust to controlling for holiday shopping in December.
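The logic of this sign flip can be sketched in a short simulation. This is a hypothetical Python illustration with made-up parameters, not the book’s actual New Jersey data (the book’s own Computing Corner uses Stata and R):

```python
# Hypothetical simulation of the December effect (made-up numbers).
# Sales truly rise with temperature, but a large December boost occurs in a
# cold month, so the bivariate slope on temperature comes out negative.
import numpy as np

rng = np.random.default_rng(0)
n = 264                                    # 22 years of monthly observations
month = np.arange(n) % 12 + 1              # calendar month, 1..12
december = (month == 12).astype(float)     # dummy: 1 in December, else 0
# Seasonal temperature: coldest around January, warmest around July
temperature = 55 + 25 * np.sin((month - 4) * np.pi / 6) + rng.normal(0, 3, n)
sales = 5 + 0.02 * temperature + 6 * december + rng.normal(0, 0.5, n)

# Bivariate OLS: sales on temperature only
X_biv = np.column_stack([np.ones(n), temperature])
b_biv = np.linalg.lstsq(X_biv, sales, rcond=None)[0]

# Multivariate OLS: control for the December dummy
X_multi = np.column_stack([np.ones(n), temperature, december])
b_multi = np.linalg.lstsq(X_multi, sales, rcond=None)[0]

print(b_biv[1])    # negative: December shopping masquerades as a cold effect
print(b_multi[1])  # positive: near the true 0.02 once December is netted out
```

Because the December dummy is correlated with temperature (December is cold), leaving it out pushes the temperature coefficient negative; including it recovers the positive relationship.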
The way we interpret multivariate OLS regression coefficients is slightly
different from how we interpret bivariate OLS regression coefficients. We still
say that a one-unit increase in X is associated with a β̂1 increase in Y, but now
we need to add the phrase “holding constant the other factors in the model.”
TABLE 5.1 Bivariate and Multivariate Results for Retail Sales Data

       Bivariate   Multivariate
σ̂      1.82        1.07
R2     0.026       0.661
where Wagesi was the wages of men in the sample in 1996 and the adult height
was measured in 1985.
This is observational data, and the reality is that with such data, the bivariate
model is suspect. There are many ways something in the error term could be
correlated with the independent variable.
The authors of the height and wages study identified several additional
variables to include in the model, focusing in particular on one: adolescent height.
They reasoned that people who were tall as teenagers might have developed more
confidence and participated in more high school activities, and that this experience
may have laid the groundwork for higher wages later.
If teen height is actually boosting adult wages in the way the researchers sus-
pected, it is possible that the bivariate model with only adult height (Equation 5.1)
will suggest a relationship even though the real action was to be found between
adolescent height and wages. How can we tell what the real story is?
Multivariate OLS comes to the rescue. It allows us to simply “pull” adolescent
height out of the error term and into the model by including it as an additional
variable in the model. The model then becomes

Wagesi = β0 + β1 AdultHeighti + β2 AdolescentHeighti + εi
where β1 reflects the effect on wages of being one inch taller as an adult when
including adolescent height in the model and β2 reflects the effect on wages of
being one inch taller as an adolescent when adult height is included in the model.
The coefficients are estimated by using logic similar to that for bivariate
OLS. We’ll discuss estimation momentarily. For now, though, let’s concentrate
on the differences between bivariate and multivariate results. Both are presented
in Table 5.2. The first column shows the coefficient and standard error on β̂1
for the bivariate model with only adult height in the model; these are identical
to the results presented in Chapter 3 (page 75). The coefficient of 0.412 implies
that each inch of height is associated with an additional 41.2 cents per hour
in wages.
FIGURE 5.3: 95 Percent Confidence Intervals for Coefficients in Adult Height, Adolescent Height, and Wage Models [dot-and-whisker plot; rows for the adult height and adolescent height coefficients; horizontal axis runs from −1.0 to 1.0]
The second column shows results from the multivariate analysis; they tell
quite a different story. The coefficient on adult height is, at 0.003, essentially zero.
In contrast, the coefficient on adolescent height is 0.48, implying that controlling
for adult height, adult wages were 48 cents higher per hour for each inch taller
someone was when younger. The standard error on this coefficient is 0.19 with a
t statistic that is higher than 2, implying a statistically significant effect.
Figure 5.3 displays the confidence intervals implied by the coefficients and
their standard errors. The dots are placed at the coefficient estimate (e.g., 0.41 for
the coefficient on adult height in the bivariate model and 0.003 for the coefficient
on adult height in the multivariate model). The solid lines indicate the range of the
95 percent confidence interval. As discussed in Chapter 4 (page 119), confidence
intervals indicate the range of true values of β most consistent with the observed
estimate; they are calculated as β̂ ± 1.96 × se(β̂).
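As a quick Python check of this arithmetic, using the adolescent-height coefficient (0.48) and standard error (0.19) from the multivariate model:

```python
# 95 percent confidence interval computed as beta_hat ± 1.96 × se(beta_hat),
# using the adolescent-height coefficient and standard error from the text.
beta_hat = 0.48
se = 0.19
lower = beta_hat - 1.96 * se
upper = beta_hat + 1.96 * se
print(round(lower, 3), round(upper, 3))  # interval excludes zero
</ ```

Because the lower bound is above zero, the interval excludes zero, matching the conclusion in the text that the adolescent-height effect is unlikely to have arisen by chance.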
The confidence interval for the coefficient on adult height in the bivariate
model is clearly positive and relatively narrow, and it does not include zero.
However, the confidence interval for the coefficient on adult height includes zero
in the multivariate model. In other words, the multivariate model suggests that
the effect of adult height on wages is small or even zero when we control for
adolescent height. In contrast, the confidence interval for adolescent height is
positive, reasonably wide, and far from zero when we control for adult height.
These results suggest that the effect of adolescent height on wages is large and the
relationship we see is unlikely to have arisen simply by chance.
In this head-to-head battle of the two height variables, adolescent height wins:
the coefficient on it is large, and its confidence interval is far from zero. The
coefficient on adult height, however, is puny and has a confidence interval that
clearly covers zero. In other words, the multivariate model we have estimated is
telling us that being tall as a kid matters more than being tall as a grown-up. This
conclusion is quite thought provoking. It appears that the height premium in wages
does not reflect a height fetish by bosses. We’ll explore what’s going on in a bit
more detail shortly.
Yi = β0 + β1 X1i + β2 X2i + · · · + βk Xki + εi

where each X is another variable and k is the total number of independent variables.
Often a single variable, or perhaps a subset of variables, is of primary interest. We refer to the other independent variables as control variables, as these are included to control for factors that could affect the dependent variable and also could be correlated with the independent variables of primary interest. We should note here that control variables and control groups are different: a control variable is an additional variable we include in a model, while a control group is the group to which we compare the treatment group in an experiment.1

control variable: An independent variable included in a statistical model to control for some factor that is not the primary factor of interest.
The authors of the height and wage study argue that adolescent height in and
of itself was not causing increased wages. Their view is that adolescent height
translated into opportunities that provided skills and experience that increased
ability to get high wages later. They view increased participation in clubs and
sports activities as a channel for adolescent height to improve wage-increasing
human capital. In statistical terms, the claim is that participation in clubs and
athletics was a factor in the error term of a model with only adult height and
adolescent height. If either height variable turns out to be correlated with any of
the factors in the error term, we could have endogeneity.
With the right data, we can check the claim that the effect of adolescent
height on adult wages is due, at least in part, to the effect of adolescent height on
participation in developmentally helpful activities. In this case, the researchers had
measures of the number of clubs each person participated in (excluding athletics
and academic/honor society clubs), as well as a dummy variable that indicated
whether each person participated in high school athletics.
1 The control variable and control group concepts are related. In an experiment, a control variable is
set to be the same for all subjects of the experiment to ensure that the only difference between treated
and untreated groups is the experimental treatment. If we were experimenting on samples in petri
dishes, for example, we could treat temperature as a control variable. We would make sure that the
temperature is the same for all petri dishes used in the experiment. Hence, the control group has
everything similar to the treatment group except the treatment. In observational studies, we cannot
determine the values of other factors; we can, however, try to net out these other factors, such that
once we have taken them into account, the treated and untreated groups should be the same. In the
Christmas shopping example, the dummy variable for December is our control variable. The idea is
that once we net out the effect of Christmas on shopping patterns in the United States, retail sales
should differ based only on differences in the temperature. If we worry (as we should) that factors in
addition to temperature still matter, we should include other control variables until we feel confident
that the only remaining difference is due to the variable of interest.
ε̂i = Yi − Ŷi
   = Yi − (β̂0 + β̂1 X1i + β̂2 X2i + · · · + β̂k Xki )
Second, square the residuals (for the same reasons given on page 48):

ε̂i2 = (Yi − (β̂0 + β̂1 X1i + β̂2 X2i + · · · + β̂k Xki ))2
Multivariate OLS then finds the β̂’s that minimize the sum of the squared residuals
over all observations. We let computers do that work for us.
The name “ordinary least squares” (OLS) describes the process: ordinary
because we haven’t gotten to the fancy stuff yet, least because we’re minimizing
the deviations between fitted and actual values, and squares because there was
a squared thing going on in there. Again, it’s an absurd name. It’s like calling
a hamburger a “kill-with-stun-gun-then-grill-and-put-on-a-bun.” OLS is what
people call it, though, so we have to get used to it.
REMEMBER THIS
1. Multivariate OLS is used to estimate a model with multiple independent variables.
2. Multivariate OLS fights endogeneity by pulling variables from the error term into the estimated
equation.
3. As with bivariate OLS, the multivariate OLS estimation process selects β̂’s in a way that
minimizes the sum of squared residuals.
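The minimization in point 3 can be sketched numerically. This is a hypothetical Python illustration with made-up data (the book’s Computing Corner uses Stata and R): solve for the β̂’s, then verify that nudging any coefficient only increases the sum of squared residuals.

```python
# Sketch: multivariate OLS chooses the beta-hats that minimize the sum of
# squared residuals (SSR). Solve with np.linalg.lstsq, then check that a
# perturbed coefficient vector yields a larger SSR. Data are made up.
import numpy as np

rng = np.random.default_rng(1)
n = 200
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
Y = 1.0 + 2.0 * X1 - 0.5 * X2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), X1, X2])          # design matrix
beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]    # OLS estimates

def ssr(b):
    resid = Y - X @ b
    return float(resid @ resid)

best = ssr(beta_hat)
perturbed = ssr(beta_hat + np.array([0.1, 0.0, 0.0]))
print(best < perturbed)  # True: deviating from the OLS solution raises SSR
```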
Discussion Questions
1. Mother Jones magazine blogger Kevin Drum (2013a, b, c) offers the following scenario.
Suppose we gathered records of a thousand school children aged 7 to 12, used a bivariate
model, and found that heavier kids scored better on standardized math tests.
(a) Based on these results, should we recommend that kids eat lots of potato chips and french
fries if they want to grow up to be scientists?
(b) Write down a model that embodies Drum’s scenario.
(c) Propose additional variables for this model.
(d) Would inclusion of additional controls bolster the evidence? Would doing so provide
definitive proof?
2. Researchers from the National Center for Addiction and Substance Abuse at Columbia
University (2011) suggest that time spent on Facebook and Twitter increases risks of smoking,
drinking, and drug use. They found that compared with kids who spent no time on social
networking sites, kids who visited the sites each day were five times likelier to smoke cigarettes,
three times more likely to drink alcohol, and twice as likely to smoke pot. The researchers
argue that kids who use social media regularly see others engaged in such behaviors and then
emulate them.
(a) Write down the model implied by the description of the Columbia study, and discuss the
factors in the error term.
5.2 Omitted Variable Bias
(b) What specifically has to be true about these factors for their omission to cause bias?
Discuss whether these conditions will be true for the factors you identify.
(c) Discuss which factors could be measured and controlled for and which would be difficult
to measure and control for.
3. Suppose we are interested in knowing the relationship between hours studied and scores on a
Spanish exam.
(a) Suppose some kids don’t study at all but ace the exam, leading to a bivariate OLS result
that studying has little or no effect on the score. Would you be convinced by these results?
(b) Write down a model, and discuss your answer to (a) in terms of the error term.
(c) What if some kids speak Spanish at home? Discuss implications for a bivariate model
that does not include this factor and a multivariate model that controls for this factor.
Yi = β0 + β1 X1i + β2 X2i + νi     (5.4)

We assume (for now) that the error in this true model, νi , is uncorrelated with X1i
and X2i . (The Greek letter ν, or nu, is pronounced “new”—even though it looks
like a v.) As usual with multivariate OLS, the β1 parameter reflects how much
higher Yi would be if we increased X1i by one; β2 reflects how much higher Yi
would be if we increased X2i by one.
What happens if we omit X2 and estimate the following model?
Yi = β0^OmitX2 + β1^OmitX2 X1i + εi     (5.5)

where β1^OmitX2 indicates the coefficient on X1i we get when we omit variable X2 from the model. While we used νi to refer to the error term in Equation 5.4, we use a different letter (which happens to be εi ) in Equation 5.5 because the error now includes νi and β2 X2i .
How close will β̂1^OmitX2 be to β1 in Equation 5.4? In other words, will β̂1^OmitX2 be an unbiased estimator of β1 ? Or, in English, will our estimate of the effect of X1
suck if we omit X2 ? We ask questions like this every time we analyze observational
data.
It’s useful to first characterize the relationship between the two independent variables, X1 and X2 . To do this, we use an auxiliary regression equation. An auxiliary regression is a regression that is not directly the one of interest but yields information helpful in analyzing the equation we really care about. In this case, we can assess how strongly X1 and X2 are related by means of the equation

X2i = δ0 + δ1 X1i + τi     (5.6)

auxiliary regression: A regression that is not directly the one of interest but yields information helpful in analyzing the equation we really care about.
where δ0 (“delta”) and δ1 are coefficients for this auxiliary regression and τi (“tau,”
rhymes with what you say when you stub your toe) is how we denote the error term
(which acts just like the error term in our other equations, but we want to make it
clear that we’re dealing with a different equation). We assume τi is uncorrelated
with νi and X1 .
This equation for X2i is not based on a causal model. Instead, we are using a
regression model to indicate the relationship between the included variable (X1 )
and the excluded variable (X2 ). If δ1 = 0, then X1 and X2 are not related. If δ1 ≠ 0, then X1 and X2 are related.
If we substitute the equation for X2i (Equation 5.6) into the main equation (Equation 5.4), then do some rearranging and a bit of relabeling, we get

Yi = (β0 + β2 δ0 ) + (β1 + β2 δ1 )X1i + εi
where β1 and β2 come from the main equation (Equation 5.4) and δ1 comes from
the equation for X2i (Equation 5.6).2
Given our assumption that τ and ν are not correlated with any independent variable, we can use our bivariate OLS results to know that β̂1^OmitX2 will be distributed normally with a mean of β1 + β2 δ1 . In other words, when we omit X2 , the distribution of the estimated coefficient on X1 will be skewed away from β1 by β2 δ1 . This is omitted variable bias.
In other words, when we omit X2 , the coefficient on X1 , which is β1^OmitX2 , will pick up not only β1 , which is the effect of X1 on Y, but also β2 , which is the effect of the omitted variable X2 on Y. The extent to which β1^OmitX2 picks up the effect of X2 depends on δ1 , which characterizes how strongly X2 and X1 are related.

omitted variable bias: Bias that results from leaving out a variable that affects the dependent variable and is correlated with the included independent variable.
2 Note that in the derivation, we replace β2 τi + νi with εi . If, as we’re assuming here, τi and νi are
uncorrelated with each other and uncorrelated with X1 , then errors of the form β2 τi + νi will also be
uncorrelated with each other and uncorrelated with X1 .
3 We derive this result more formally on page 502.
always) goes down when we add variables that explain the dependent variable.
We’ll discuss a major exception in Chapter 7: bias can increase when we add a
so-called post-treatment variable.
REMEMBER THIS
1. Two conditions must both be true for omitted variable bias to occur:
(a) The omitted variable affects the dependent variable.
• Mathematically: β2 ≠ 0 in Equation 5.4.
• An equivalent way to state this condition is that X2i really should have been in
Equation 5.4 in the first place.
(b) The omitted variable is correlated with the included independent variable.
• Mathematically: δ1 ≠ 0 in Equation 5.6.
• An equivalent way to state this condition is that X2i needs to be correlated with X1i .
2. Omitted variable bias is more complicated in models with more independent variables, but the
main intuition applies.
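A short simulation can confirm that omitting X2 centers β̂1^OmitX2 on β1 + β2 δ1 rather than on β1. This is a hypothetical Python sketch with made-up parameter values (the book’s own code examples are in Stata and R):

```python
# Sketch of omitted variable bias. True model: Y = b0 + b1*X1 + b2*X2 + nu,
# with auxiliary relationship X2 = d0 + d1*X1 + tau. Regressing Y on X1
# alone should recover roughly b1 + b2*d1, not b1.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
b0, b1, b2 = 1.0, 2.0, 3.0      # true model coefficients (made up)
d0, d1 = 0.5, 0.8               # auxiliary regression linking X2 to X1

X1 = rng.normal(size=n)
X2 = d0 + d1 * X1 + rng.normal(size=n)            # tau is the last term
Y = b0 + b1 * X1 + b2 * X2 + rng.normal(size=n)   # nu is the last term

# Omit X2: bivariate regression of Y on X1 only
X = np.column_stack([np.ones(n), X1])
b_omit = np.linalg.lstsq(X, Y, rcond=None)[0]

print(b_omit[1])   # close to b1 + b2*d1 = 4.4, far from the true b1 = 2.0
```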
The data is structured such that even though information on the economic
growth in these countries for each year is available, we are looking only at the
average growth rate across the 40 years from 1960 to 2000. Thus, each country
gets only a single observation. We control for GDP per capita in 1960 because of
a well-established phenomenon in which countries that were wealthier in 1960
have a slower growth rate. The poor countries simply have more capacity to grow
N     50     50
σ̂     1.13   0.72
R2    0.36   0.74
[Figure 5.4: scatterplots; panels (a) and (b) plot average economic growth (in %); panel (c) plots average test scores against average years of schooling]
Could the real story be that test scores, not years in school, explain growth? If
so, why is there a significant coefficient on average years of schooling in the first column
of Table 5.3? We know the answer: omitted variable bias. As discussed on page 137,
if we omit a variable that matters (and we suspect that test scores matter), the
estimate of the effect of the variable that is included will be biased if the omitted
variable is correlated with the included variable. To address this issue, look at panel
(c) of Figure 5.4, a scatterplot of average test scores and average years of schooling.
5.3 Measurement Error 143
Yes, indeed, these variables look quite correlated, as observations with high years
of schooling also tend to be accompanied by high test scores. Hence, the omission
of test scores could be problematic.
It therefore makes sense to add test scores to the model, as in the right-hand
column of Table 5.3. The coefficient on average years of schooling here differs
markedly from before. It is now very close to zero. The coefficient on average
test scores, on the other hand, is 1.97 and statistically significant, with a t statistic
of 8.28.
Because the scale of the test score variable is not immediately obvious, we
need to do a bit of work to interpret the substantive significance of the coefficient
estimate. Based on descriptive statistics (not reported), the standard deviation of
the test score variable is 0.61. The results therefore imply that increasing average
test scores by a standard deviation is associated with an increase of 0.61 × 1.97 =
1.20 percentage points in the average annual growth rate over these
40 years. This increase is large when we are talking about growth compounding
over 40 years.4
Notice the very different story we have across the two columns. In the first
one, years of schooling is enough for economic growth. In the second specification,
quality of education, as measured with math and science test scores, matters
more. The second specification is better because it shows that a theoretically
sensible variable matters a lot. Excluding this variable, as the first specification does,
risks omitted variable bias. In short, these results suggest that education is about
quality, not quantity. High test scores explain economic growth better than years
in school. Crappy schools do little; good ones do a lot. These results don’t end the
conversation about education and economic growth, but they do move it ahead a
few more steps.
4 Since the scale of the test score variable is different from the years in school variable, we cannot
directly compare the two coefficients. Sections 5.5 and 5.6 show how to make such comparisons.
some overreport and some underreport). And many, perhaps even most, variables
could have error. Just think how hard it would be to accurately measure spending
on education or life expectancy or attitudes toward Justin Bieber in an entire
country.
We keep things simple here by assuming that the measurement error (νi ) has a
mean of zero and is uncorrelated with the true value.
Notice that we can rewrite X1i∗ as the observed value (X1i ) minus the measurement error:

X1i∗ = X1i − νi

Substitute for X1i∗ in the true model, do a bit of rearranging, and we get

Yi = β0 + β1 (X1i − νi ) + εi
   = β0 + β1 X1i − β1 νi + εi     (5.9)
right? If we could observe it, we would fix our darn measure of X1 . So what we
do is treat the measurement error as an unobserved variable that by definition
we must omit; then we can see how this particular form of omitted variable bias
plays out. Unlike the case of a generic omitted variable bias problem, we know
two things that allow us to be more specific than in the general omitted variable case: the coefficient on the omitted term (νi ) is −β1 , and νi relates to X1 as in Equation 5.8.
We go step by step through the logic and math in Chapter 14 (page 508). The
upshot is that as the sample size gets very large, the estimated coefficient when
the independent variable is measured with error is
plim β̂1 = β1 · σ2X1∗ / (σ2ν + σ2X1∗ )
REMEMBER THIS
1. Measurement error in the dependent variable does not bias β̂ coefficients but does increase the
variance of the estimates.
2. Measurement error in an independent variable causes attenuation bias. That is, when X1 is
measured with error, β̂1 will generally be closer to zero than it should be.
• The attenuation bias is a consequence of the omission of the measurement error from the
estimated model.
• The larger the measurement error, the larger the attenuation bias.
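The attenuation result can be checked in a short simulation. This is a hypothetical Python sketch with made-up variances (not code from the book):

```python
# Sketch of attenuation bias. X1_star is the true regressor; we observe it
# with error nu. The estimated slope shrinks by roughly
# var(X1_star) / (var(nu) + var(X1_star)), matching the plim formula.
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
beta0, beta1 = 1.0, 2.0
var_xstar, var_nu = 1.0, 1.0    # equal signal and noise variances (made up)

X_star = rng.normal(0, np.sqrt(var_xstar), n)        # true values
X_obs = X_star + rng.normal(0, np.sqrt(var_nu), n)   # mismeasured regressor
Y = beta0 + beta1 * X_star + rng.normal(size=n)

X = np.column_stack([np.ones(n), X_obs])
b_hat = np.linalg.lstsq(X, Y, rcond=None)[0]

shrink = var_xstar / (var_nu + var_xstar)   # = 0.5 here
print(b_hat[1])   # close to beta1 * shrink = 1.0, well below the true 2.0
```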
var(β̂j ) = σ̂2 / (N · var(Xj ) · (1 − R2j ))     (5.10)
This equation is similar to the equation for variance of β̂1 in bivariate OLS
(Equation 3.9, page 62). The new bit relates to the (1 − R2j ) in the denominator.
Before elaborating on R2j , let’s note the parts from the bivariate variance equation
that carry through to the multivariate context.
• In the numerator, we see σ̂2 , which means that the higher the variance of the regression, the higher the variance of the coefficient estimate.
5.4 Precision and Goodness of Fit
• In the denominator, we see the sample size, N. As for the bivariate model,
more data makes the denominator bigger, which makes var(β̂j ) smaller. In other words, more data means more precise estimates.
• The greater the variation of Xj (as measured by Σi=1..N (Xij − X̄j )2 / N, which is var(Xj ) for large samples), the bigger the denominator will be. The bigger the denominator, the smaller var(β̂j ) will be.
Multicollinearity
The new element in Equation 5.10 compared to the earlier variance equation is the
(1 − R2j ). Notice the j subscript. We use the subscript to indicate that R2j is the
R2 from an auxiliary regression in which Xj is the dependent variable and all
the other independent variables in the full model are the independent variables
in the auxiliary model. The R2 without the j is still the R2 for the main equation,
as discussed on page 72.
There is a different R2j for each independent variable. For example, if our
model is
5 We discuss experiments in their real-world form in Chapter 10.
These R2j tell us how much the other variables explain Xj . If the other
variables explain Xj very well, the R2j will be high and—here’s the key insight—the
denominator will be smaller. Notice that the denominator of the equation for
var(β̂ j ) has (1 − R2j ). Remember that R2 is always between 0 and 1, so as R2j gets
bigger, 1−R2j gets smaller, which in turn makes var(β̂ j ) bigger. The intuition is that
if variable Xj is virtually indistinguishable from the other independent variables,
it should in fact be hard to tell how much that variable affects Y, and we will
therefore have a larger var(β̂ j ).
In other words, when an independent variable is highly related to other independent variables, the variance of the coefficient we estimate for that variable will be high. We use a fancy term, multicollinearity, to refer to situations in which independent variables have strong linear relationships. The term comes from “multi” for multiple variables and “co-linear” because they vary together in a linear fashion. The polysyllabic jargon should not hide a simple fact: The variance of our estimates increases when an independent variable is closely related to other independent variables.
The term 1/(1 − R2j ) is referred to as the variance inflation factor (VIF). It measures how much variance is inflated owing to multicollinearity relative to a case in which there is no multicollinearity.
It’s really important to understand what multicollinearity does and does not do. It does not cause bias. It doesn’t even cause the standard errors of β̂1 to be incorrect. It simply causes the standard errors to be bigger than they would be if there were no multicollinearity. In other words, OLS is on top of the whole multicollinearity thing, producing estimates that are unbiased with appropriately calculated uncertainty. It’s just that when variables are strongly related to each other, we’re going to have more uncertainty—that is, the distributions of β̂1 will be wider, meaning that it will be harder to learn from the data.

multicollinearity: Variables are multicollinear if they are correlated. The consequence of multicollinearity is that the variance of β̂1 will be higher than it would have been in the absence of multicollinearity. Multicollinearity does not cause bias.

variance inflation factor: A measure of how much variance is inflated owing to multicollinearity.

What, then, should we do about multicollinearity? If we have a lot of data,
our standard errors may be small enough to allow reasonable inferences about
the coefficients on the collinear variables. In that case, we do not have to do
anything. OLS is fine, and we’re perfectly happy. Both our empirical examples in
this chapter are consistent with this scenario. In the height and wages analysis in
Table 5.2, adult height and adolescent height are highly correlated (we don’t report
it in the table, but the two variables are correlated at 0.86, which is a very strong
correlation). And yet, the actual effects of these two variables are so different that
we can parse out their differential effects with the amount of data we have. In
the education and economic growth analysis in Table 5.3, the years of school and
test score variables are correlated at 0.81 (not reported in the table). And yet, the
effects are different enough to let us parse out the differential effects of these two
variables with the data we have.
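A sketch of computing R2j and the VIF for a pair of collinear variables may help fix ideas. This is a hypothetical Python example; the variable names and the strength of the relationship are made up:

```python
# Sketch: the variance inflation factor for X_j is 1 / (1 - R2_j), where
# R2_j comes from the auxiliary regression of X_j on the other independent
# variables. Here X1 and X2 are strongly (but not perfectly) related.
import numpy as np

rng = np.random.default_rng(4)
n = 5_000
X1 = rng.normal(size=n)
X2 = 0.9 * X1 + rng.normal(0, 0.5, n)   # X2 strongly related to X1

# Auxiliary regression of X1 on X2 to get R2_1
A = np.column_stack([np.ones(n), X2])
coef = np.linalg.lstsq(A, X1, rcond=None)[0]
resid = X1 - A @ coef
r2_j = 1 - (resid @ resid) / ((X1 - X1.mean()) @ (X1 - X1.mean()))
vif = 1 / (1 - r2_j)
print(vif)   # well above 1: variance inflated by multicollinearity
```

With these made-up numbers the auxiliary R2j is around 0.76, so the variance of β̂1 is inflated several-fold relative to a world with no multicollinearity.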
If we have substantial multicollinearity, however, we may get very large
standard errors on the collinear variables, preventing us from saying much about
any one variable. Some are tempted in such cases to drop one or more of the highly
multicollinear variables and focus only on the results for the remaining variables.
This isn’t quite fair, however, since we may not have solid evidence to indicate
which variables we should drop and which we should keep. A better approach is
to be honest: we should just say that the collinear variables taken as a group seem
to matter or not and that we can’t parse out the individual effects of these variables.
For example, suppose we are interested in predicting undergraduate grades as
a function of two variables: scores from a standardized math test and scores from a
standardized verbal reasoning test. Suppose also that these test score variables are
highly correlated and that when we run a model with both variables as independent
variables, both are statistically insignificant in part because the standard errors
will be very high owing to the high R2j values. If we drop one of the test scores,
the remaining test score variable may be statistically significant, but it would be
poor form to believe, then, that only that test score affected undergraduate grades.
Instead, we should use the tools we present later (Section 5.6, page 158), which
allow us to assess whether both variables taken together explain grades. At that
point, we may be able to say that we know standardized test scores matter, but
we cannot say much about the relative effect of math versus verbal test scores. So
even though it would be more fun to say which test score matters, the statistical
evidence to justify the statement may simply not be there.
perfect multicollinearity Occurs when an independent variable is completely explained by other independent variables.

A lethal dose of multicollinearity, called perfect multicollinearity, occurs when an independent variable is completely explained by other independent variables. If this happens, R2j = 1, and var(β̂1) blows up because (1 − R2j) is in the denominator (in the sense that the denominator becomes zero, which is a big no-no). In this case, statistical software either will refuse to estimate the model or will automatically delete enough independent variables to extinguish perfect multicollinearity. A silly example of perfect multicollinearity is including the same variable twice in a model.
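The "software refuses to estimate" behavior comes straight from the algebra: with a duplicated regressor, the cross-product matrix that OLS must invert is singular. A minimal sketch (the chapter's Computing Corner uses Stata and R; Python is used here only to make the arithmetic concrete, with made-up data):

```python
# Toy illustration of perfect multicollinearity: include the same variable twice.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = x1[:]  # "second" regressor is an exact copy of the first

# Cross-products that enter the (2x2) X'X matrix for these two regressors.
sxx = sum(a * a for a in x1)
sxy = sum(a * b for a, b in zip(x1, x2))
syy = sum(b * b for b in x2)

det = sxx * syy - sxy * sxy  # determinant of the X'X block
print(det)  # 0.0
```

A determinant of zero means the normal equations have no unique solution, which is why statistical packages drop one of the offending variables rather than report coefficients.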
Goodness of fit
Let’s talk about the regular old R2 , the one without a j subscript. As with the R2 for
a bivariate OLS model, the R2 for a multivariate OLS model measures goodness
of fit and is the square of the correlation of the fitted values and actual values (see
Section 3.7).6 As before, it can be interesting to know how well the model explains
the dependent variable, but this information is often not particularly useful. A good
model can have a low R2 , and a biased model can have a high R2 .
There is one additional wrinkle for R2 in the multivariate context. Adding
a variable to a model necessarily makes the R2 go up, at least by a tiny bit.
To see why, notice that OLS minimizes the sum of squared errors. If we add a
new variable, the fit cannot be worse than before because we can simply set the
coefficient on this new variable to be zero, which is equivalent to not having the
variable in the model in the first place. In other words, every time we add a variable
to a model, we do no worse and, as a practical matter, do at least a little better
even if the variable doesn’t truly affect the dependent variable. Just by chance,
estimating a non-zero coefficient on this variable will typically improve the fit
for a couple of observations. Hence, R2 always is the same or larger as we add
variables.
6 The model needs to have a constant term for this interpretation to work—and for R2 to be sensible.
150 CHAPTER 5 Multivariate OLS: Where the Action Is
Review Questions
1. How much will other variables explain Xj when Xj is a randomly assigned treatment?
Approximately what will R2j be?
2. Suppose we are designing an experiment in which we can determine the value of all independent
variables for all observations. Do we want the independent variables to be highly correlated or
not? Why or why not?
7 Our earlier discussion was about the regular R2 , but it also applies to any R2 (from the main
equation or an auxiliary equation). R2 goes up as the number of variables increases.
5.4 Precision and Goodness of Fit 151
REMEMBER THIS
1. If errors are not correlated with each other and are homoscedastic, the variance of the β̂ j
estimate is
var(β̂ j ) = σ̂ 2 / (N × var(Xj )(1 − R2j ))
(a) Model fit: The better the model fits, the lower σ̂ 2 and var(β̂ j ) will be.
(b) Sample size: The more observations, the lower var(β̂ j ) will be.
(c) Variation in X: The more the Xj variable varies, the lower var(β̂ j ) will be.
(d) Multicollinearity: The less the other independent variables explain Xj , the lower R2j and
var(β̂ j ) will be.
3. Independent variables are multicollinear if they are correlated.
(a) The variance of β̂1 is higher when there is multicollinearity than when there is no
multicollinearity.
(b) Multicollinearity does not bias β̂1 estimates.
(c) The se( β̂1 ) produced by OLS accounts for multicollinearity.
(d) An OLS model cannot be estimated when there is perfect multicollinearity—that is,
when an independent variable is perfectly explained by one or more of the other
independent variables.
4. Inclusion of irrelevant variables occurs when variables that do not affect Y are included in a
model.
(a) Inclusion of irrelevant variables causes the variance of β̂1 to be higher than if the
variables were not included.
(b) Inclusion of irrelevant variables does not cause bias.
5. The variance of β̂ j is more complicated when errors are correlated or heteroscedastic,
but the intuitions about model fit, sample size, variance of X, and multicollinearity still
apply.
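The trade-offs in the variance formula above can be made concrete with a small plug-in calculation. A sketch in Python (the book's Computing Corner uses Stata and R; every number below is hypothetical, chosen only for illustration):

```python
# Plug-in illustration of var(beta_j) = sigma^2 / (N * var(X_j) * (1 - R^2_j)).
def var_beta(sigma2_hat, n, var_x, r2_j):
    """Variance of an OLS slope estimate under homoscedastic, uncorrelated errors."""
    return sigma2_hat / (n * var_x * (1 - r2_j))

# Same hypothetical fit, sample size, and var(X_j); only multicollinearity differs.
base = var_beta(sigma2_hat=4.0, n=100, var_x=2.0, r2_j=0.0)
collinear = var_beta(sigma2_hat=4.0, n=100, var_x=2.0, r2_j=0.75)

print(round(collinear / base, 2))  # 4.0
```

Only the (1 − R2j) term changes between the two calls, so the ratio is exactly the variance inflation factor, 1/(1 − 0.75) = 4.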
8 This example is based on La Porta, Lopez-de-Silanes, Pop-Eleches, and Shleifer (2004).
Measurement of abstract concepts like human rights and judicial independence is not simple. See
Harvey (2011) for more details.
[Table 5.4, remaining rows — two specifications: without and with Democracy]
Democracy              —        24.93∗ (2.77) [t = 9.01]
N                      63       63
σ̂                     17.6     11.5
R2                     0.47     0.78
R2Judicial ind.        —        0.153
R2Log GDP              —        0.553
R2Democracy            —        0.552
Before we discuss what Harvey found, let’s think about what would have to be
true if omitting a measure of democracy is indeed causing bias under our conditions
given on page 140. First, the level of democracy in a country actually needs to affect
the dependent variable, human rights (this is the β2 ≠ 0 condition). Is that true here?
Very plausibly. We don’t know beforehand, of course, but it certainly seems possible
that torture tends not to be a great vote-getter. Second, democracy needs to be
correlated with the independent variable of interest, which in this case is judicial
independence. This we know is almost certainly true: democracy and judicial
independence definitely seem to go together in the modern world. In Harvey’s data,
democracy and judicial independence correlate at 0.26: not huge, but not nuthin’.
Therefore, we have a legitimate candidate for omitted variable bias.
The right-hand column of Table 5.4 shows that Harvey’s intuition was right.
When the democracy measure is added, the coefficients on both judicial indepen-
dence and GDP per capita fall precipitously. The coefficient on democracy, however,
is 24.93, with a t statistic of 9.01, a highly statistically significant estimate.
Statistical significance is not the same as substantive significance, though. So
let’s try to interpret our results in a more meaningful way. If we generate descriptive
statistics for our dependent variable, the human rights measure, we see that it ranges from
17 to 99, with a mean of 67 and a standard deviation of 24. Doing the same for the
democracy variable indicates a range of 0 to 2 with a mean of 1.07 and a standard
deviation of 0.79. A coefficient of 24.93 implies that a change in the democracy
measures of one standard deviation is associated with a 24.93 × 0.79 = 19.7 unit
increase on the human rights scale. Given that the standard deviation of
the dependent variable is 24, this is a pretty sizable association between democracy
and human rights.9
This is a textbook example of omitted variable bias.10 When democracy is not
accounted for, judicial independence is strongly associated with human rights.
When democracy is accounted for, however, the effect of judicial independence
fades to virtually nothing. And this is not just about statistics. How we view the
world is at stake, too. The conclusion from the initial model was that courts protect
human rights. The additional analysis suggests that democracy protects human
rights.
The example also highlights the somewhat provisional nature of social scien-
tific conclusions. Someone may come along with a variable to add or another way
to analyze the same data that will change our conclusions. That is the nature of the
social scientific process. We do the best we can, but we leave room (sometimes a
little, sometimes a lot) for a better way to understand what is going on.
Table 5.4 also includes some diagnostics to help us think about multicollinear-
ity, for surely such factors as judicial independence, democracy, and wealth are
correlated. Before looking at specific diagnostics, though, we should note that
collinearity of independent variables does not cause bias. It doesn’t even cause
the variance equation to be wrong. Instead, multicollinearity simply causes the
variance to be higher than it would be without collinearity among the independent
variables.
Toward the bottom of the table we see that R2Judicialind. is 0.153. This
value is the R2 from an auxiliary regression in which judicial independence is the
dependent variable and the GDP and democracy variables are the independent
variables. This value isn’t particularly high, and if we plug it into the equation for
the VIF, which is just the part of the variance of β̂ j associated with multicollinearity,
we see that the VIF for the judicial independence variable is 1/(1 − R2j ) = 1/(1 − 0.153) = 1.18. In other words, the variance of the coefficient on the judicial independence variable is 1.18 times larger than it would have been if the judicial independence variable were completely uncorrelated with the other independent variables in the model.
9 Determining exactly what is a substantively large effect can be subjective. There’s no rule book on
what is “large.” Those who have worked in a substantive area for a long time often get a good sense
of what effects qualify as “large.” An effect might be considered large if it is larger than the effect of
other variables that people think are important. Or an effect might be considered large if we know
that the benefit is estimated to be much higher than the cost. In the human rights case, we can get a
sense of what a 19.7 unit change in the human rights scale means by looking at pairs of countries that
differed by around 20 points on that scale. For example, Pakistan was 22 points higher than North
Korea. Decide if it would make a difference to vacation in North Korea or Pakistan. If it would make
a difference, then 19.7 is a large difference; if not, then it’s not.
10 Or, it is now . . .
That’s pretty small. The R2LogGDP is 0.553. This value corresponds to a VIF of 2.24,
which is higher but still not in a range people get too worried about. And just to
reiterate, this is not a problem to be corrected. Rather, we are simply noting that one
source of variance of the coefficient estimate on GDP is multicollinearity. Another
source is the sample size and another is the fit of the model (indicated by σ̂ , which
indicates that the fitted values are on average, roughly 11.5 units away from the
actual values).
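The VIF arithmetic from Table 5.4's diagnostics is easy to check directly. A quick sketch in Python (illustrative only; the auxiliary R2 values are the ones reported in the text):

```python
def vif(r2_aux):
    """Variance inflation factor implied by an auxiliary-regression R^2."""
    return 1.0 / (1.0 - r2_aux)

print(round(vif(0.153), 2))  # judicial independence: 1.18
print(round(vif(0.553), 2))  # log GDP per capita: 2.24
```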
Standardizing coefficients

standardize Standardizing a variable converts it to a measure of standard deviations from its mean.

A convenient trick is to standardize the variables. To do so, we convert variables to standard deviations from their means. That is, instead of having a variable that indicates a baseball player’s batting average, we have a variable that indicates how many standard deviations above or below the average batting average a player was. Instead of having a variable that indicates home runs, we have a variable that indicates how many standard deviations above or below the average number of home runs a player hit. The attraction of standardizing variables is that a one-unit increase in any standardized independent variable is a one standard deviation increase.
We often (but not always) standardize the dependent variable as well. If we
do so, the coefficient on a standardized independent variable can be interpreted as
“Controlling for the other variables in the model, a one standard deviation increase
in X is associated with a β̂1 standard deviation increase in the dependent variable.”
We standardize variables using the following equation:

VariableStandardized = (Variable − Mean(Variable)) / sd(Variable)    (5.12)

where Mean(Variable) is the mean of the variable for all units in the sample and sd(Variable) is the standard deviation of the variable.

[Table residue: Constant −2,869,439.40∗ (244,241.12) [t = 11.75]; N = 6,762; R2 = 0.30]

TABLE 5.7 Means and Standard Deviations of Baseball Variables for Three Players
(columns: Player ID; Salary, Batting average, and Home runs, each unstandardized and standardized; data rows not captured)
Table 5.6 reports the means and standard deviations of the variables for our
baseball salary example. Table 5.7 then uses these means and standard deviations
to report the unstandardized and standardized values of salary, batting average,
and home runs for three selected players. Player 1 earned $5.85 million. Given that
the standard deviation of salaries in the data set was $2,764,512, the standardized value of this player’s salary is (5,850,000 − 2,024,616)/2,764,512 = 1.38. In other words, player 1
earned 1.38 standard deviations more than the average salary. This player’s batting
average was 0.267, which is exactly the average. Hence, his standardized batting
average is zero. He hit 43 home runs, which is 2.99 standard deviations above the
mean number of home runs.
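Player 1's standardized salary can be reproduced directly from Equation 5.12. A short sketch in Python (not the book's Stata/R code; the mean and standard deviation are the figures reported in the text):

```python
def standardize(x, mean, sd):
    """Equation 5.12: convert a value to standard deviations from the mean."""
    return (x - mean) / sd

# Player 1's salary, with the sample mean and sd of salary from the text.
z_salary = standardize(5_850_000, 2_024_616, 2_764_512)
print(round(z_salary, 2))  # 1.38: player 1 earned 1.38 sd above the mean
```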
Table 5.8 displays standardized OLS results along with the unstandardized
results from Table 5.5. The results with the standardized dependent variable are on the right.

standardized coefficient The coefficient on an independent variable that has been standardized.

The standardized results allow us to reasonably compare the effects on salary of batting average and home runs. We see in Table 5.6 that a standard deviation of batting average is 0.031. The standardized coefficients tell us that an increase of one standard deviation in batting average is associated with an increase in salary of 0.14 standard deviations. So, for example, a player raising his batting average by 0.031, from 0.267 to 0.298, can expect an increase in salary of 0.14 × $2,764,512 = $387,032. A player who increases his home runs by one standard deviation (which Table 5.6 tells us is 10.31 home runs) can expect a 0.48 standard deviation increase in salary (which is 0.48 × $2,764,512 = $1,326,966).
In other words, home runs have a bigger bang for the buck. Eat your steroid-laced
Wheaties, kids.11
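The dollar figures above are just the standardized coefficients rescaled by the standard deviation of salary. A sketch of that conversion in Python (coefficients and the salary sd are the values reported in Tables 5.6 and 5.8):

```python
sd_salary = 2_764_512   # sd of salary, from Table 5.6
beta_avg_std = 0.14     # standardized coefficient on batting average
beta_hr_std = 0.48      # standardized coefficient on home runs

# One-sd improvement in each skill, converted back into dollars of salary.
print(round(beta_avg_std * sd_salary))  # 387032
print(round(beta_hr_std * sd_salary))   # 1326966
```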
While results from OLS models with standardized variables seem quite
different, all they really do is rescale the original results. The model fit is the
same whether standardized or unstandardized variables are used. Notice that
the R2 is identical. Also, the conclusions about statistical significance are the
same in the unstandardized and standardized regressions; we can see that by
comparing the t statistics in the unstandardized and standardized results. Think
of the standardization as something like international currency conversion. In
unstandardized form, the coefficients are reported in different currencies, but
in standardized form, the coefficients are reported in a common currency. The
11 That’s a joke! Wheaties are gross.
underlying real prices, however, are the same whether they are reported in dollars,
euros, or baht.
REMEMBER THIS
Standardized coefficients allow the effects of two independent variables to be compared.
1. When the independent variable, Xk , and dependent variable are standardized, an increase of one
standard deviation in Xk is associated with a β̂ k standard deviation increase in the dependent
variable.
2. Statistical significance and model fit are the same for unstandardized and standardized results.
that we need to take into account. In this section, we discuss F tests as a solution
to this challenge, explain two different types of commonly used hypotheses about
multiple coefficients, and then show how to use R2 results to implement these tests,
including an example for our baseball data.12
F tests
F test A type of hypothesis test commonly used to test hypotheses involving multiple coefficients.

There are several ways to test hypotheses involving multiple coefficients. We focus on an F test. This test shares features with hypothesis tests discussed earlier (page 97). When using an F test, we define null and alternative hypotheses, set a significance level, and compare a test statistic to a critical value. The F test is different in that we use a new test statistic and compare it to a critical value derived from an F distribution rather than a t distribution or a normal distribution. We provide more information on the F distribution in Appendix H (page 549).

F statistic The test statistic used in conducting an F test.

The new test statistic is an F statistic. It is based on R2 values from two separate OLS specifications. We’ll first discuss these OLS models and then describe the F statistic in more detail.

unrestricted model The model in an F test that imposes no restrictions on the coefficients.

The first specification is the unrestricted model, which is simply the full model. For example, if we have three independent variables, our full model might be

Yi = β0 + β1X1i + β2X2i + β3X3i + εi    (5.13)

The model is called unrestricted because we are imposing no restrictions on what the values of β̂1, β̂2, and β̂3 will be.

restricted model The model in an F test that imposes the restriction that the null hypothesis is true.

The second specification is the so-called restricted model in which we force the computer to give us results that comport with the null hypothesis. It’s called restricted because we are restricting the estimated values of β̂1, β̂2, and β̂3 to be consistent with the null
How do we do that? Sounds hard, but actually, it isn’t. We simply take the
relationship implied by the null hypothesis and impose it on the unrestricted
model. We can divide hypotheses involving multiple coefficients into two general
cases.
12 It is also possible to use t tests to compare multiple coefficients, but F tests are more widely used
for this purpose.
both the multicollinear variables equal zero, we can at least learn if one (or both)
of them is non-zero, even as we can’t say which one it is because the two are so
closely related.
In this case, imposing the null hypothesis means making sure that our
estimates of β1 and β2 are both zero. The process is actually easy-schmeasy: just
set the coefficients to zero and see that the resulting model is simply a model
without variables X1 and X2 . Specifically, the restricted model for H0 : β1 = β2 =
0 is

Yi = β0 + β3X3i + εi
The R2Unrestricted will always be higher because the model without restrictions can
generate a better model fit than the same model subject to some restrictions. This
conclusion is a little counterintuitive at first, but note that R2Unrestricted will be higher
than R2Restricted even when the null hypothesis is true. This is because in estimating
the unrestricted equation, the software not only has the option of estimating both
coefficients to be whatever the value is under the null (hence assuring the same
fit as in the restricted model), but also any other deviation, large or small, that
improves the fit.
The extent of difference between R2Unrestricted and R2Restricted depends on whether
the null hypothesis is or is not true. If we are testing H0 : β1 = β2 = 0 and β1 and
β2 really are zero, then restricting them to be zero won’t cause the R2Restricted to be
too far from R2Unrestricted because the optimal values of β̂1 and β̂ 2 really are around
zero. If the null is false and β1 and β2 are much different from zero, there will be a
huge difference between R2Unrestricted and R2Restricted because setting them to non-zero
values, as happens only in the unrestricted model, improves fit substantially.
Hence, the heart of an F test is the difference between R2Unrestricted and R2Restricted . When the difference is small, imposing the null doesn’t do too much
damage to the model fit. When the difference is large, imposing the null damages
the model fit a lot.
An F test is based on the F statistic:

F = ((R2Unrestricted − R2Restricted )/q) / ((1 − R2Unrestricted )/(N − k))    (5.14)
The q term refers to how many constraints are in the null hypothesis. That’s just
a fancy way of saying how many equal signs are in the null hypothesis. So for
H0 : β1 = β2 , the value of q is 1. For H0 : β1 = β2 = 0, the value of q is 2. The N − k
term is a degrees of freedom term, like what we saw with the t distribution. This
is the sample size minus the number of parameters estimated in the unrestricted
model. (For example, k for Equation 5.14 will be 4 because we estimate β̂0, β̂1, β̂2, and β̂3.) We need to know these terms because the shape of the F distribution
depends on the sample size and the number of constraints in the null, just as the t
distribution shifted based on the number of observations.
The F statistic has the difference of R2Unrestricted and R2Restricted in it and also
includes some other bits to ensure that the F statistic is distributed according to an
F distribution. The F distribution describes the relative probability of observing
different values of the F statistic under the null hypothesis. It allows us to know the
probability that the F statistic will be bigger than any given number when the null
is true. We can use this knowledge to identify critical values for our hypothesis
tests; we’ll describe how shortly.
How we approach the alternative hypotheses depends on the type of null
hypothesis. For case 1 null hypotheses (in which multiple coefficients are zero),
the alternative hypothesis is that at least one coefficient is not zero. In other words,
the null hypothesis is that they all are zero, and the alternative is the negation of
that, which is that one or more of the coefficients is not zero.
For case 2 null hypotheses (in which two or more coefficients are equal), it is
possible to have a directional alternative hypothesis that one coefficient is larger
than the other. The critical value remains the same, but we add a requirement that
the coefficients actually go in the direction of the specified alternative hypothesis.
For example, if we are testing H0 : β1 = β2 versus HA : β1 > β2 , we reject the null in
favor of the alternative hypothesis if the F statistic is bigger than the critical value
and β̂1 is actually bigger than β̂ 2 .
This all may sound complicated, but the process isn’t that hard, really. (And,
as we show in the Computing Corner at the end of the chapter, statistical software
makes it easy.) The crucial step is formulating a null hypothesis and using it to
create a restricted equation. This process is not very hard. If we’re dealing with
a case 1 null hypothesis (that multiple coefficients are zero), we simply drop
the variables listed in the null in the restricted equation. If we’re dealing with a
case 2 null hypothesis (that two or more coefficients are equal to each other), we
simply create a new variable that is the sum of the variables referenced in the
null hypothesis and use that new variable in the restricted equation instead of the
individual variables.
The R2Unrestricted is 0.2992. (It’s usually necessary to be more precise than the
0.30 reported in Table 5.8.)
For the restricted model, we simply drop the variables listed in the null
hypothesis, yielding
Salaryi = β0 + εi
The critical value (which we show how to identify in the Computing Corner,
pages 170 and 172) is 3.00. Since the F statistic is (way!) higher than the critical
value, we reject the null handily.
We can also easily test whether the standardized effect of home runs is bigger
than the standardized effect of batting average. The unrestricted equation is, as
before,
The critical value (which we show how to identify in the Computing Corner,
pages 170 and 172) is 3.84. Here, too, the F statistic is vastly higher than the
critical value, and we also reject the null hypothesis that β1 = β2 .
REMEMBER THIS
1. F tests are useful to test hypotheses involving multiple coefficients. To implement an F test for
the following model
(c) Estimate a restricted model by using the conditions in the null hypothesis to restrict the
full model.
• Case 1: When the null hypothesis is that multiple coefficients equal zero, we create a
restricted model by simply dropping the variables listed in the null hypothesis.
• Case 2: When the null hypothesis is that two or more coefficients are equal, we create
a restricted model by replacing the variables listed in the null hypothesis with a single
variable that is the sum of the listed variables.
(d) Use the R2 values from the unrestricted and restricted models to generate an F statistic
using Equation 5.14, and compare the F statistic to the critical value from the F
distribution.
2. The bigger the difference between R2Unrestricted and R2Restricted , the more the null hypothesis is
reducing fit and, therefore, the more likely we are to reject the null.
Table 5.9 presents results necessary to test this null. We use an F test that
requires R2 values from two specifications. The first column presents the unre-
stricted model; at the bottom is the R2Unrestricted , which is 0.06086. The second
column presents the restricted model; at the bottom is the R2Restricted , which is
0.05295. There are two restrictions in this null, meaning q = 2. The sample size
is 1,851, and the number of parameters in the unrestricted model is 5, meaning
N − k = 1,846.
Hence, for H0 : β1 = β2 = 0,
We have to use software (or tables) to find the critical value. We’ll discuss that
process in the Computing Corner (pages 170 and 171). For q = 2 and N − k = 1,846,
the critical value for α = 0.05 is 3.00. Because our F statistic as just calculated is bigger
than that, we can reject the null. In other words, the data is telling us that if the null
were true, we would be very unlikely to see such a big difference in fit between the
unrestricted and restricted models.13
13 The specific value of the F statistic provided by automated software F tests will differ from our
presentation because the automated software tests do not round to three digits, as we have done.
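The F statistic just computed follows mechanically from Equation 5.14. A sketch in Python (illustrative only; the R2 values, q, and N − k are the figures reported for Table 5.9):

```python
def f_stat(r2_unrestricted, r2_restricted, q, df):
    """Equation 5.14: F statistic from unrestricted and restricted R^2 values."""
    return ((r2_unrestricted - r2_restricted) / q) / ((1 - r2_unrestricted) / df)

# Case 1 null H0: beta1 = beta2 = 0, so q = 2; N - k = 1,846.
F = f_stat(0.06086, 0.05295, q=2, df=1846)
print(round(F, 2))  # 7.77, far above the critical value of 3.00
```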
Second, let’s test the following case 2 null, H0 : β1 = β2 . Again, the first column in
Table 5.9 presents the unrestricted model; at the bottom is the R2Unrestricted , which is
0.06086. However, the restricted model is different for this null. Following the logic
discussed on page 160, it is
The third column in Table 5.9 presents the results for this restricted model; at the
bottom is the R2Restricted , which is 0.0605. There is one restriction in this null, meaning
q = 1. The sample size is still 1,851, and the number of parameters in the unrestricted model is still 5, meaning N − k = 1,846.
Hence, for H0 : β1 = β2 ,
We again have to use software (or tables) to find the critical value. For q = 1 and N − k = 1,846, the critical value for α = 0.05 is 3.85. Because our F statistic as
calculated here is less than the critical value, we fail to reject the null that the two
coefficients are equal. The coefficients are quite different in the unrestricted model
(0.03 and 0.35), but notice that the standard errors are large enough to prevent us
from rejecting the null that either coefficient is zero. In other words, we have a lot of
uncertainty in our estimates. The F test formalizes this uncertainty by forcing OLS
to give us the same coefficient on both height variables, and when we do this, the
overall model fit is pretty close to the model fit achieved when the coefficients are
allowed to vary across the two variables. If the null is true, this result is what we
would expect because imposing the null would not lower R2 by very much. If the
null is false, then imposing the null probably would have caused a more substantial
reduction in R2Restricted .
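The same Equation 5.14 arithmetic applies to this case 2 null, now with q = 1 and the restricted R2 from the third column of Table 5.9 (again a Python sketch, using the values reported in the text):

```python
def f_stat(r2_unrestricted, r2_restricted, q, df):
    """Equation 5.14: F statistic from unrestricted and restricted R^2 values."""
    return ((r2_unrestricted - r2_restricted) / q) / ((1 - r2_unrestricted) / df)

# Case 2 null H0: beta1 = beta2, so q = 1; N - k = 1,846.
F = f_stat(0.06086, 0.0605, q=1, df=1846)
print(round(F, 2))  # 0.71, below the critical value of 3.85: fail to reject
```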
Conclusion
Multivariate OLS is a huge help in our fight against endogeneity because it allows
us to add variables to our models. Doing so cuts off at least part of the correlation
between an independent variable and the error term because the included variables
are no longer in the error term. For observational data, multivariate OLS is
very necessary, although we seldom can wholly defeat endogeneity simply by
including variables. For experimental data not suffering from attrition, balance,
or compliance problems, we can beat endogeneity without multivariate OLS.
However, multivariate OLS makes our estimates more precise.
• Section 5.1: Write down the multivariate regression equation and explain
all its elements (dependent variable, independent variables, coefficients,
intercept, and error term). Explain how adding a variable to a multivariate
OLS model can help fight endogeneity.
• Section 5.2: Explain omitted variable bias, including the two conditions
necessary for omitted variable bias to exist.
Yi = β0 + β1X1i + β2X2i + εi
• H0 : β1 = β2 = 0
• H0 : β1 = β2
Further Reading
King, Keohane, and Verba (1994) provide an intuitive and useful discussion of
omitted variable bias.
Goldberger (1991) has a terrific discussion of multicollinearity. His point
is that the real problem with multicollinear data is that the estimates will
be imprecise. We defeat imprecise data with more data; hence, the problem
of multicollinearity is not having enough data, a state of affairs Goldberger
tongue-in-cheekily calls “micronumerosity.”
Morgan and Winship (2014) provide an excellent framework for thinking
about various approaches to controlling for multiple variables. They spend a fair
bit of time discussing the strengths and weaknesses of multivariate OLS and
alternatives.
Statistical results can often be more effectively presented as figures instead of
tables. Kastellec and Leoni (2007) provide a nice overview of the advantages and
options for such an approach.
Achen (1982, 77) critiques standardized variables, in part because they
depend on the standard errors of independent variables in the sample.
Key Terms
Adjusted R2 (150) · Attenuation bias (145) · Auxiliary regression (138) · Ceteris paribus (131) · Control variable (134) · F statistic (159) · F test (159) · Irrelevant variable (150) · Measurement error (143) · Multicollinearity (148) · Multivariate OLS (127) · Omitted variable bias (138) · Perfect multicollinearity (149) · Restricted model (159) · Standardize (156) · Standardized coefficient (157) · Unrestricted model (159) · Variance inflation factor (148)
Computing Corner
Stata
• Calculating the R2j for each variable. For example, calculate the R21 via
reg X1 X2 X3
and calculate the R22 via
reg X2 X1 X3
• Stata also provides a vif command that estimates 1/(1 − R2j ) for each variable. This command needs to be run immediately after the main
model of interest. For example,
reg Y X1 X2 X3
vif
would provide the VIF for all variables from the main model. A VIF
of 5, for example, indicates that the variance is five times higher than
it would be if there were no multicollinearity.
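The Computing Corner uses Stata; as a language-neutral check of the same logic, here is a short Python sketch (the data and variable names are made up) that runs the auxiliary regression itself and computes one variable's VIF from its R²:

```python
import numpy as np

# Simulated data; variable names are hypothetical, not from the chapter's data sets.
rng = np.random.default_rng(0)
n = 500
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = 0.8 * x2 + rng.normal(size=n)  # x1 is deliberately correlated with x2

# Auxiliary regression: x1 on x2 and x3 (plus a constant)
Z = np.column_stack([np.ones(n), x2, x3])
coef, *_ = np.linalg.lstsq(Z, x1, rcond=None)
resid = x1 - Z @ coef
r2_1 = 1 - resid.var() / x1.var()  # R-squared of the auxiliary regression

vif_1 = 1 / (1 - r2_1)  # variance inflation factor for x1
```

With the simulated correlation above, the VIF comes out modestly above 1, matching the intuition that mild multicollinearity inflates variance only mildly.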
The hard way isn’t very hard. Use Stata’s egen command to create
standardized versions of every variable in the model:
egen BattingAverage_std = std(BattingAverage)
egen Homeruns_std = std(Homeruns)
egen Salary_std = std(Salary)
Then run a regression with these standardized variables:
reg Salary_std BattingAverage_std Homeruns_std
The standardized coefficients are listed, as usual, under “Coef.” Notice that
they are identical to the results from using the , beta command.
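The same identity can be verified outside Stata. This Python sketch (with simulated data loosely echoing the baseball example; the numbers are made up) standardizes both variables by hand and confirms that the standardized slope equals the raw slope times sd(x)/sd(y):

```python
import numpy as np

# Simulated data; variable names echo the chapter's baseball example but are invented.
rng = np.random.default_rng(1)
n = 400
batting = rng.normal(0.26, 0.03, size=n)
salary = 2.0 + 50.0 * batting + rng.normal(size=n)

def standardize(v):
    """Subtract the mean and divide by the standard deviation."""
    return (v - v.mean()) / v.std()

def slope(x, y):
    """Bivariate OLS slope: cov(x, y) / var(x)."""
    return np.cov(x, y)[0, 1] / x.var(ddof=1)

b_raw = slope(batting, salary)
b_std = slope(standardize(batting), standardize(salary))

# The standardized coefficient equals the raw slope times sd(x)/sd(y)
same = np.isclose(b_std, b_raw * batting.std() / salary.std())
```

In the bivariate case the standardized coefficient is just the correlation between the two variables, so it always lies between −1 and 1.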
4. Stata has a very convenient way to conduct F tests for hypotheses involving
multiple coefficients. Simply estimate the unrestricted model, then type
test, and then key in the coefficients involved and restriction implied by
the null. For example, to test the null hypothesis that the coefficients on
Height81 and Height85 are both equal to zero, type the following:
reg Wage Height81 Height85 Clubs Athletics
test Height81 = Height85 = 0
To test the null hypothesis that the coefficients on Height81 and Height85
are equal to each other, type the following:
reg Wage Height81 Height85 Clubs Athletics
test Height81 = Height85
Rounding will cause this code to produce F statistics slightly different
from those on page 165.
7. To find the p value from an F distribution for a given F statistic, use disp
Ftail(df1, df2, F), where df1 and df2 are the degrees of freedom
and F is the F statistic. For example, to calculate the p value for the F
statistic on page 165 for H0 : β1 = β2 = 0, type disp Ftail(2, 1846, 7.77).
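Behind Stata's test command is the standard restricted-versus-unrestricted F statistic. As a language-neutral illustration (simulated data, made-up coefficients), this Python sketch computes F = [(SSR_restricted − SSR_unrestricted)/q] / [SSR_unrestricted/(n − k − 1)] directly:

```python
import numpy as np

def ssr(X, y):
    """Sum of squared residuals from an OLS fit of y on X plus a constant."""
    Z = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return float(((y - Z @ coef) ** 2).sum())

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 + 0.5 * x2 + rng.normal(size=n)

ssr_unrestricted = ssr(np.column_stack([x1, x2]), y)  # both regressors included
ssr_restricted = ssr(np.empty((n, 0)), y)             # under H0: beta1 = beta2 = 0
q = 2  # number of restrictions implied by the null
k = 2  # regressors in the unrestricted model
F = ((ssr_restricted - ssr_unrestricted) / q) / (ssr_unrestricted / (n - k - 1))
```

Because the restricted model can never fit better than the unrestricted one, the numerator is non-negative; a large F means the restrictions cost a lot of fit.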
R
2. To assess multicollinearity, calculate the R²j for each variable. For example,
calculate the R²1 via
AuxReg1 = lm(X1 ~ X2 + X3)
and calculate the R²2 via
AuxReg2 = lm(X2 ~ X1 + X3)
Display the R² from an auxiliary regression via, for example,
summary(AuxReg2)$r.squared
To see the degrees of freedom for the unrestricted model, type
summary(Unrestricted1)$df[2]
We’ll have to keep track of q on our own.
Exercises
1. Table 5.10 describes variables from heightwage.dta we will use in this
problem. We have seen this data in Chapter 3 (page 74) and in Chapter 4
(page 123).
(a) Estimate two OLS regression models: one in which adult wages is
regressed on adult height for all respondents, and another in which
adult wages is regressed on adult height and adolescent height for
all respondents. Discuss differences across the two models. Explain
why the coefficient on adult height changed.
TABLE 5.10 Variables for Height and Wages Data in the United States
Variable name Description
Run the plot once without a jitter subcommand and once with it, and
choose the more informative of the two plots.14
(c) Notice that IQ is omitted from the model. Is this a problem? Why or
why not?
(d) Notice that eye color is omitted from the model. Is this a problem?
Why or why not?
(e) You’re the boss! Use the data in this file to estimate a model that
you think sheds light on an interesting relationship. The specification
decisions include whether to limit the sample and what variables to
include. Report only a single additional specification. Describe in no
more than two paragraphs why this is an interesting way to assess the
data.
14 In Stata, add jittering to a scatter plot via scatter X1 X2, jitter(3). In R, add jittering to a plot via plot(jitter(X1), jitter(X2)). Note that in the auxiliary regression, it’s useful to limit the sample to observations for which wage96 is not missing to ensure that the R² from the auxiliary regression is based on the same number of observations as the original regression. In Stata, add if wage96 != . to the end of a regression statement, where the exclamation mark means “not” and the period is how Stata marks missing values. In R, we could limit the sample via data = data[is.na(data$wage96) == 0, ] in the regression command, where the is.na function returns TRUE (treated as 1) for missing observations and FALSE (treated as 0) for non-missing observations.
(b) Suppose someone argues that we need to take into account the
growth of the U.S. population between 1970 and 2000. This
particular data set does not have a population variable, but it does
have a variable called Season, which indicates what season the data
is from (e.g., Season equals 1969 for observations from 1969 and
Season equals 1981 for observations from 1981, etc.). What are the
conditions that need to be true for omission of the season variable to
bias other coefficients? Do you think they hold in this case?
(e) Which matters more for attendance: winning or runs scored? [To
keep us on the same page, use home_attend as the dependent variable
and control for wins, runs_scored, runs_allowed, and season.]
3. Do cell phones distract drivers and cause accidents? Worried that this
is happening, many states recently have passed legislation to reduce
distracted driving. Fourteen states now have laws making handheld cell
phone use while driving illegal, and 44 states have banned texting while
driving. This problem looks more closely at the relationship between cell
phones and traffic fatalities. Table 5.11 describes the variables in the data
set Cellphone_2012_homework.dta.
(a) While we don’t know how many people are using their phones while
driving, we can find the number of cell phone subscriptions in a
state (in thousands). Estimate a bivariate model with traffic deaths
as the dependent variable and number of cell phone subscriptions as
the independent variable. Briefly discuss the results. Do you suspect
endogeneity? If so, why?
(b) Add population to the model. What happens to the coefficient on cell
phone subscriptions? Why?
TABLE 5.11 Variables for Cell Phones and Traffic Deaths Data
Variable name Description
year Year
(c) Add total miles driven to the model. What happens to the coefficient
on cell phone subscriptions? Why?
(d) Based on the model in part (c), calculate the variance inflation factor
for population and total miles driven. Why are they different? Discuss
implications of this level of multicollinearity for the coefficient
estimates and the precision of the coefficient estimates.
4. What determines how much drivers are fined if they are stopped for
speeding? Do demographics like age, gender, and race matter? To answer
this question, we’ll investigate traffic stops and citations in Massachusetts
using data from Makowsky and Stratmann (2009). Even though state law
sets a formula for tickets based on how fast a person was driving, police
officers in practice often deviate from the formula. Table 5.12 describes
data in speeding_tickets_text.dta that includes information on all traffic
stops. An amount for the fine is given only for observations in which the
police officer decided to assess a fine.
(b) Estimate the model from part (a), also controlling for miles per hour
over the speed limit. Explain what happens to the coefficient on age
and why.
(c) Suppose we had only the first thousand observations in the data set.
Estimate the model from part (b), and report on what happens to the
standard errors and t statistics when we have fewer observations.15
(b) Let’s keep going. Add height at age 7 to the above model, and discuss
the results. Be sure to note changes in sample size (and its possible
15 In Stata, use if _n < 1001 at the end of the regression command to limit the sample to the first thousand observations. In R, create and use a new data set with the first 1,000 observations (e.g., dataSmall = data[1:1000,]). Because the ticket amount is missing for drivers who were not fined, the sample size of the regression model will be smaller than 1,000.
16 For the reasons discussed in the exercise in Chapter 3 on page 89, we limit the data set to observations with height greater than 40 inches and self-reported income less than 400 British pounds per hour. We also exclude observations of individuals who grew shorter from age 16 to age 33. Excluding these observations doesn’t substantially affect the results we see here, but since it’s reasonable to believe there is some kind of non-trivial measurement error for these cases, we exclude them from the analysis for this question.
(c) Is there multicollinearity in the model from part (b)? If so, qualify the
degree of multicollinearity, and indicate its consequences. Specify
whether the multicollinearity will bias coefficients or have some
other effect.
(d) Perhaps characteristics of parents affect height (some force kids to eat
veggies, while others give them only french fries and Fanta). Add the
two parental education variables to the model, and discuss the results.
Include only height at age 16 (meaning we do not include the height
at ages 33 and 7 for this question—although feel free to include them
on your own; the results are interesting).
(e) Perhaps kids had their food stolen by greedy siblings. Add the
number of siblings to the model, and discuss the results.
6. Use globaled.dta, the data set on education and growth from Hanushek and
Woessmann (2012) for this question. The variables are given in Table 5.14.
region Region
open Openness of the economy scale
proprts Security of property rights scale
(a) Use standardized variables to assess whether the effect of test scores
on economic growth is larger than the effect of years in school. At
this point, simply compare the different effects in a meaningful way.
We’ll do statistical tests next. The dependent variable is average
annual GDP growth per year. For all parts of this exercise, control
for average test scores, average years of schooling between 1960 and
2000, and GDP per capita in 1960.
(c) Now add controls for openness of economy and security of property
rights. Which matters more: test scores or property rights? Use
appropriate statistical evidence in your answer.
6 Dummy Variables: Smarter than You Think
FIGURE 6.1: Goal Differentials for Home and Away Games for Manchester City and Manchester United [two panels, (a) Manchester City and (b) Manchester United, plotting goal differential (−2 to 5) for away (0) and home (1) games]
In other words, β̂0 is the predicted value of Y for individuals in the control group.
It is not surprising that the value of β̂0 that best fits the data is simply the average
of Yi for individuals in the control group.1
The fitted value for the treatment group (for whom Treatmenti = 1) is
Ŷi = β̂0 + β̂1 × 1 = β̂0 + β̂1
In other words, β̂0 + β̂1 is the predicted value of Y for individuals in the treatment
group. The best predictor of this value is simply the average of Y for individuals
1 The proof is a bit laborious. We show it in the Citations and Additional Notes section on page 557.
in the treatment group. Because β̂0 is the average of individuals in the control
group, β̂1 is the difference in averages between the treatment and control groups.
If β̂1 > 0, then the average Y for those in the treatment group is higher than for
those in the control group. If β̂1 < 0, then the average Y for those in the treatment
group is lower than for those in the control group. If β̂1 = 0, then the average Y
for those in the treatment group is no different from the average Y for those in the
control group.
In other words, our slope coefficient ( β̂1 ) is, in the case of a bivariate OLS
model with a dummy independent variable, a measure of the difference in means
across the two groups. The standard error on this coefficient tells us how much
uncertainty we have and determines the confidence interval for our estimate of β̂1 .
Figure 6.2 graphically displays the difference of means test in bivariate OLS
with a scatterplot of data. It looks a bit different from our previous scatterplots
(e.g., Figure 3.1 on page 46) because here the independent variable takes on only
two values: 0 or 1. Hence, the observations are stacked at 0 and 1.
FIGURE 6.2: [scatterplot with the treatment variable on the horizontal axis (0 = control group, 1 = treatment group) and the dependent variable on the vertical axis; β̂0 marks the average for the control group, β̂0 + β̂1 the average for the treatment group, and β̂1 (the slope) is the difference between them]
In our example,
the values of Y when X = 0 are generally lower than the values of Y when X = 1.
The parameter β̂0 corresponds to the average of Y for all observations for which
X = 0. The average for the treatment group (for whom X = 1) is β̂0 + β̂1 . The
difference in averages across the groups is β̂1 . A key point is that the standard
interpretation of coefficients in bivariate OLS still applies: a one-unit change in X
(e.g., going from X = 0 to X = 1) is associated with a β̂1 change in Y.
This is excellent news. Whenever our independent variable is a dummy
variable—as it typically is for experiments and often is for observational data—we
can simply run bivariate OLS and the β̂1 coefficient tells us the difference of
means. The standard error on this coefficient tells us how precisely we have
measured this difference and allows us to conduct a hypothesis test and determine
a confidence interval.
OLS produces difference of means tests for observational data as well. The
model and interpretation are the same; the difference is how much we worry
about whether the exogeneity assumption is satisfied. Typically, exogeneity will
be seriously in doubt for observational data. And sometimes OLS can be useful
in estimating the difference of means as a descriptive statistic without a causal
interpretation.
Difference of means tests can be conducted without using OLS. Doing so
is totally fine, of course; in fact, OLS and non-OLS difference of means tests
assuming the same variances across groups produce identical estimates and
standard errors. The advantage of the OLS approach is that we can use it within a
framework that also does all the other things OLS does, such as adding multiple
variables to the model.
2 A standard OLS regression model produces a standard error and a t statistic that are equivalent to the standard error and t statistic produced by a difference of means test in which variance is assumed to be the same across both groups. An OLS model with heteroscedasticity-consistent standard errors (as discussed in Section 3.6) produces a standard error and t statistic that are equivalent to a difference of means test in which variance differs across groups. The Computing Corner at the end of the chapter shows how to estimate these models.
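The equivalence between the bivariate OLS slope and the difference of group means can be verified numerically. This Python sketch (simulated data, made-up coefficients) computes OLS by hand and checks both identities:

```python
import numpy as np

# Simulated data: a 0/1 treatment dummy and an outcome.
rng = np.random.default_rng(3)
n = 300
treat = rng.integers(0, 2, size=n).astype(float)
y = 5.0 + 2.0 * treat + rng.normal(size=n)

# Bivariate OLS by hand: slope = cov(x, y) / var(x), intercept from the means
beta1 = np.cov(treat, y)[0, 1] / treat.var(ddof=1)
beta0 = y.mean() - beta1 * treat.mean()

diff_of_means = y[treat == 1].mean() - y[treat == 0].mean()

slope_matches = np.isclose(beta1, diff_of_means)             # beta1-hat = difference of means
intercept_matches = np.isclose(beta0, y[treat == 0].mean())  # beta0-hat = control-group mean
```

The match is exact (up to floating point), not approximate: with a dummy regressor, the OLS algebra reduces to group means.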
Difference of means tests convey the same essential information when the
coding of the dummy variable is flipped. The column on the right in Table 6.1
shows results from a model in which NotRepublican was the independent
variable. This variable is the opposite of the Republican variable, equaling 1
for non-Republicans and 0 for Republicans. The numerical results are different,
but they nonetheless contain the same information. The constant is the mean
evaluation of Trump by Republicans. In the first specification, this mean is
β̂0 + β̂1 = 14.95 + 36.06 = 51.01. In the second specification it is simply β̂0
because this is the mean value for the reference category. In the first specification,
the coefficient on Republican is 36.06, indicating that Republicans evaluated
Trump 36.06 points higher than non-Republicans. In the second specification the
coefficient on NotRepublican is negative, −36.06, indicating that non-Republicans
evaluated Trump 36.06 points lower than Republicans.
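The effect of flipping the coding can be checked directly. In this Python sketch the data is simulated (the means are patterned on Table 6.1's numbers purely for illustration), and the flipped dummy negates the slope while the constant becomes the other group's mean:

```python
import numpy as np

# Simulated data loosely patterned on the Republican/NotRepublican example.
rng = np.random.default_rng(6)
n = 500
republican = rng.integers(0, 2, size=n).astype(float)
therm = 14.95 + 36.06 * republican + rng.normal(scale=5.0, size=n)

def ols(x, y):
    """Bivariate OLS intercept and slope."""
    b1 = np.cov(x, y)[0, 1] / x.var(ddof=1)
    return y.mean() - b1 * x.mean(), b1

b0_rep, b1_rep = ols(republican, therm)
b0_not, b1_not = ols(1.0 - republican, therm)  # flipped coding: NotRepublican

sign_flips = np.isclose(b1_not, -b1_rep)              # slope changes sign only
constant_shifts = np.isclose(b0_not, b0_rep + b1_rep)  # new constant = other group's mean
```

Both identities hold exactly, which is why the two specifications in Table 6.1 carry the same information.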
Figure 6.3 scatterplots the data and highlights the estimated differences in means between non-Republicans and Republicans. Dummy variables can be a bit tricky to plot because the values of the independent variable are only zero or one, causing the data to overlap such that we can’t tell whether a given dot in the scatterplot indicates 2 or 200 observations. A trick of the trade is to jitter each observation by adding a small, random number to the independent and dependent variables. The jittered data gives the cloudlike images in the figure that help us get a decent sense of the data. We jitter only the data that is plotted; we do not jitter the data when running the statistical analysis. The Computing Corner at the end of this chapter shows how to jitter data for plots.3

jitter A process used in scatterplotting data. A small, random number is added to each observation for purposes of plotting only. This procedure produces cloudlike images, which overlap less than the unjittered data, hence providing a better sense of the data.

3 We discussed jittering data earlier, on page 74.
6.1 Using Bivariate OLS to Assess Difference of Means 185
FIGURE 6.3: [jittered scatterplot with partisan identification on the horizontal axis (0 = non-Republicans, 1 = Republicans) and the feeling thermometer toward Trump (0 to 100) on the vertical axis; β̂0 marks the average for non-Republicans, β̂0 + β̂1 the average for Republicans, and β̂1 is the slope]
Non-Republicans’ feelings toward Trump clearly run lower: that group shows
many more observations at the low end of the feeling thermometer scale. The
non-Republicans’ average feeling thermometer rating is 14.95. Feelings toward
Trump among Republicans are higher, with an average of 51.01. When interpreted
correctly, both the specifications in Table 6.1 tell this same story.
REMEMBER THIS
A difference of means test assesses whether the average value of the dependent variable differs
between two groups.
1. We often are interested in the difference of means between treatment and control groups,
between women and men, or between other groupings.
2. Difference of means tests can be implemented in bivariate OLS by using a dummy independent
variable:
Yi = β0 + β1 Treatmenti + εi
(a) The estimate of the mean for the control group is β̂0 .
(b) The estimate of the mean for the treatment group is β̂0 + β̂1 .
(c) The estimate for differences in means between groups is β̂1 .
Review Questions
1. Approximately what are the averages of Y for the treatment and control groups in each panel
of Figure 6.4? Approximately what is the estimated difference of means in each panel?
2. Approximately what are the values of β̂0 and β̂1 in each panel of Figure 6.4?
FIGURE 6.4: [two panels of jittered data with the treatment variable (0 or 1) on the horizontal axis and the dependent variable on the vertical axis]
Heighti = β0 + β1 Malei + εi
4 Sometimes people will name a variable like this “gender.” That’s annoying! Readers will then have to dig through the paper to figure out whether 1 indicates males or females.
[Figure: scatterplot of height in inches (50 to 80) by gender, with women at 0 and men at 1]
Constant 64.23∗ (0.04) [t = 1,633.6]
Male 5.79∗ (0.06) [t = 103.4]
N 10,863
[Figure: average height in inches by gender, with β̂0 marking the average height for women and β̂0 + β̂1 the average height for men]
Now the estimated coefficient β̂0 will tell us the average height for men
(the group for which Female = 0). The estimated coefficients β̂0 + β̂1 will tell us
the average height for women, and the difference between the two groups is
estimated as β̂1 .
The results with the female dummy variable are in the right-hand column of
Table 6.3. The numbers should look familiar because we are learning the same
information from the data. It is just that the accounting is a bit different. What is the
estimate of the average height for men? It is β̂0 in the right-hand column, which is
70.02. Sound familiar? That was the number we got from our initial results (reported
again in the left-hand column of Table 6.3); in that case, we had to add β̂0 + β̂1
because when the dummy variable indicated men, we needed both coefficients to
get the average height for men. What is the difference between males and females
estimated in the right-hand column? It is –5.79, which is the same as before, only
negative. The underlying fact is that women are estimated to be 5.79 inches shorter
on average. If we have coded our dummy variable as Female = 1, then going from
TABLE 6.3
              (Male dummy)      (Female dummy)
Male          5.79∗
              (0.06)
              [t = 103.4]
Female                          −5.79∗
                                (0.06)
                                [t = 103.4]
Constant      64.23∗            70.02∗
              (0.04)            (0.04)
              [t = 1,633.6]     [t = 1,755.9]
N             10,863            10,863
where Opponent qualityi measures the opponent’s overall goal differential in all
other games. The β̂1 estimate will tell us, controlling for opponent quality, whether
the goal differential was higher for Manchester City for home games. The results
are in Table 6.4.
The generic form of such a model is
Yi = β0 + β1 Dummyi + β2 Xi + εi (6.3)
It is useful to think graphically about the fitted lines from this kind of
model. Figure 6.7 shows the data for Manchester City’s results in 2012–2013.
6.2 Dummy Independent Variables in Multivariate OLS 191
The observations for home games (for which the Home dummy variable is 1) are
dots; the observations for away games (for which the Home dummy variable is 0)
are squares.
As discussed on page 181, the intercept for the Homei = 0 observations (the
away games) will be β̂0 , and the intercept for the Homei = 1 observations (the
home games) will be β̂0 + β̂1 , which equals the intercept for away games plus
the bump (up or down) for home games. Note that the coefficient indicating
the difference of means is the coefficient on the dummy variable. (Note also
that the β we should look at depends on how we write the model. For this
model, β1 indicates the difference of means controlling for the other variable,
but it would be β2 if we wrote the model to have β2 multiplied by the dummy
variable.)
The innovation is that our difference of means test here also controls for
another variable—in this case, opponent quality. Here the effect of a one-unit
increase in opponent quality is β̂ 2 ; this effect is the same for the Homei = 1
and Homei = 0 groups. Hence, the fitted lines are parallel, one for each group
separated by β̂1 , the differential bump associated with being in the Homei = 1
group. In Figure 6.7, β̂1 is greater than zero, but it could be less than zero (in
which case the dashed line for β̂0 + β̂1 for the Homei = 1 group would be below
the β̂0 line) or equal to zero (in which case the two dashed lines would overlap
exactly).
We can add independent variables to our heart’s content, allowing us to
assess the difference of means between the Homei = 1 and Homei = 0 groups
in a manner that controls for the additional variables. Such models are incredibly
common.
FIGURE 6.7: Fitted Values for Model with Dummy Variable and Control Variable: Manchester City Example [goal differential plotted against opponent quality; two parallel fitted lines with slope β̂2: the line for home games (Home = 1) with intercept β̂0 + β̂1 and the line for away games (Home = 0) with intercept β̂0]
REMEMBER THIS
1. Including a dummy variable in a multivariate regression allows us to conduct a difference of
means test while controlling for other factors with a model such as
Yi = β0 + β1 Dummyi + β2 Xi + εi
2. The fitted values from this model will be two parallel lines, each with a slope of β̂ 2 and separated
by β̂1 for all values of X.
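The parallel-lines property can be confirmed numerically. In this Python sketch (simulated data; the opponent-quality scale and coefficients are invented), the fitted line for each group shares the slope β̂2 and the gap between the lines equals β̂1 at every value of X:

```python
import numpy as np

# Simulated data; opponent-quality values and coefficients are made up.
rng = np.random.default_rng(5)
n = 38
home = rng.integers(0, 2, size=n).astype(float)
quality = rng.normal(size=n)
goal_diff = 0.5 + 1.0 * home - 0.8 * quality + rng.normal(size=n)

X = np.column_stack([np.ones(n), home, quality])
b0, b1, b2 = np.linalg.lstsq(X, goal_diff, rcond=None)[0]

def fitted_home(q):
    """Fitted line for the Home = 1 group: intercept b0 + b1, slope b2."""
    return (b0 + b1) + b2 * q

def fitted_away(q):
    """Fitted line for the Home = 0 group: intercept b0, slope b2."""
    return b0 + b2 * q

gap_is_b1 = np.isclose(fitted_home(0.7) - fitted_away(0.7), b1)
gap_same = np.isclose(fitted_home(-2.0) - fitted_away(-2.0),
                      fitted_home(3.0) - fitted_away(3.0))
```

Because the model contains no interaction term, the vertical gap between the two fitted lines is constant; Chapter 7 of the book relaxes exactly this restriction.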
6.3 Transforming Categorical Variables to Multiple Dummy Variables 193
Discussion Questions
Come up with an example of an interesting relationship involving a dummy independent variable and
one other independent variable that you would like to test.
5 It is possible to treat ordinal independent variables in the same way as categorical variables in the manner we describe here. Or, it is common to simply include ordinal independent variables directly in a regression model and interpret a one-unit increase as movement from one category to another.
(Here Wagei is the wages of person i and Regioni is the region person i lives in, as
just defined.)
No, no, and no. Though the categorical variable may be coded numerically, it
has no inherent order, which means the units are not meaningful. The Midwest
is not “1” more than the Northeast; the South is not “1” more than the
Midwest.
So what do we do with categorical variables? Dummy variables save the
day. We simply convert categorical variables into a series of dummy variables,
a different one for each category. If region is the categorical variable, we simply
create a Northeast dummy variable (1 for people from the Northeast, 0 otherwise),
a Midwest dummy variable (1 for people from the Midwest, 0 otherwise),
and so on.
The catch is that we cannot include dummy variables for every category—if we did, we would have perfect multicollinearity (as we discussed on page 149). Hence, we exclude one of the dummy variables and treat that category as the reference category (also called the excluded category), which means that coefficients on the included dummy variables indicate the difference between the category designated by the dummy variable and the reference category.

reference category When a model includes dummy variables indicating the multiple categories of a categorical variable, we need to exclude a dummy variable for one of the groups, which we refer to as the reference category. Also referred to as the excluded category.

We’ve already been doing something like this with dichotomous dummy variables. When we used the male dummy variable in our height and wages example on page 187, we did not include a female dummy variable, meaning that females were the reference category and the coefficient on the male dummy variable indicated how much taller men were. When we used the female dummy variable, men were the reference category and the coefficient on the female dummy variable indicated how much shorter females were on average.
TABLE 6.5 Using Different Reference Categories for Women’s Wages and Region
(a) West as reference . (b) South as reference . (c) Midwest as reference . (d) Northeast as reference
The results for this regression are in column (a) of Table 6.5. The β̂0 result
(indicated in the “Constant” line in the table) tells us that the average wage per
hour for women in the West (the reference category) was $12.50. Women in the
Northeast are estimated to receive $2.02 more per hour than those in the West, or
$14.52 per hour. Women in the Midwest earn $1.59 less than women in the West,
which works out to $10.91 per hour. And women in the South receive $2.13 less
than women in the West, or $10.37 per hour.
Column (b) of Table 6.5 shows the results from the same data, but with South
as the reference category instead of West. The β̂0 result tells us that the average
wage per hour for women in the South (the reference category) was $10.37.
Women in the Northeast get $14.52 per hour, which is $4.15 per hour more than
women in the South. Women in the Midwest receive $0.54 per hour more than
women in the South (which works out to $10.91 per hour), and women in the West
get $2.13 per hour more than women in the South (which works out to $12.50 per
hour). The key pattern is that the estimated amount that women in each region get is
the same in columns (a) and (b). Columns (c) and (d) have Midwest and Northeast,
respectively, as the reference categories, and with calculations like those we just
did, we can see that the estimated average wages for each region are the same in
all specifications.
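This invariance is easy to demonstrate numerically. In the Python sketch below, the data is simulated (the regional means are borrowed from the text's dollar figures purely to generate illustrative data), and the fitted average wage for each region comes out the same no matter which category is excluded:

```python
import numpy as np

# Simulated wages by region; the group means ($12.50 West, $10.37 South,
# $10.91 Midwest, $14.52 Northeast) are borrowed from the text's example
# only to generate illustrative data.
rng = np.random.default_rng(4)
n = 400
means = {"Northeast": 14.52, "Midwest": 10.91, "South": 10.37, "West": 12.50}
region = rng.choice(list(means), size=n)
wage = np.array([means[r] for r in region]) + rng.normal(size=n)

def fitted_wages(reference):
    """OLS of wage on region dummies with `reference` excluded; returns the
    fitted average wage for every region."""
    included = [r for r in means if r != reference]
    X = np.column_stack([np.ones(n)] +
                        [(region == r).astype(float) for r in included])
    coef, *_ = np.linalg.lstsq(X, wage, rcond=None)
    fitted = {reference: coef[0]}
    fitted.update({r: coef[0] + b for r, b in zip(included, coef[1:])})
    return fitted

west_ref = fitted_wages("West")
south_ref = fitted_wages("South")
same_fitted = all(np.isclose(west_ref[r], south_ref[r]) for r in means)
```

The individual coefficients differ across the two runs, but once each is added to its constant, the implied average for every region is identical.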
REMEMBER THIS
To use dummy variables to control for categorical variables, we include dummy variables for every
category except one.
1. Coefficients on the included dummy variables indicate how much higher or lower each group
is than the reference category.
2. Coefficients differ depending on which reference category is used, but when interpreted
appropriately, the fitted values for each category do not change across specifications.
Review Questions
1. Suppose we wanted to conduct a cross-national study of opinion in North America and have
a variable named “Country” that is coded 1 for respondents from the United States, 2 for
respondents from Mexico, and 3 for respondents from Canada. Write a model, and explain
how to interpret the coefficients.
2. For the results in Table 6.6 on page 197, indicate what the coefficients are in boxes (a)
through (j).
Hence, at least for earlier times, universal male suffrage was a policy that broadened
the electorate from a narrow slice of property holders to a larger group of
non-property holders—and, thus, less wealthy citizens.
To assess if universal male suffrage led to increases in inheritance taxes, we can
begin with the following model:
Inheritance taxi = β0 + β1 Universal male suffragei + εi
The data is measured every five years. The dependent variable is the top
inheritance tax rate, and the independent variable is a dummy variable for whether
all men were eligible to vote in at least half of the previous five years.6
Table 6.7 shows initial results that corroborate our suspicion. The coefficient on
our universal male suffrage dummy variable β̂1 is 19.33, with a t statistic of 10.66,
indicating strong statistical significance. The results mean that countries without
universal male suffrage had an average inheritance tax of 4.75 (β̂0 ) percent and
that countries with universal male suffrage had an average inheritance tax of 24.08
(β̂0 + β̂1 ) percent.
These results are from a bivariate OLS analysis of observational data. It is likely
that unmeasured factors lurking in the error term are correlated with the universal
suffrage dummy variable, which would induce endogeneity.
One possible source of endogeneity could be that major advances in universal
male suffrage happened at the same time inheritance taxes were rising throughout
the world, whatever the state of voting was. Universal male suffrage wasn’t really
a thing until around 1900 but then took off quickly, and by 1921, a majority of the
6 Measuring these things can get tricky; see the original paper for details. Most countries had an ignominious history of denying women the right to vote until the late nineteenth or early twentieth century (New Zealand was one of the first to extend the right to vote to women, in 1893) and of denying or restricting voting by minorities until even later. Scheve and Stasavage used additional statistical tools we will cover later, including fixed effects (introduced in Chapter 8) and lagged dependent variables (explained in Chapter 13).
FIGURE 6.8: Relation between Omitted Variable (Year) and Other Variables [panel (a): inheritance tax (%) plotted against year (1850 to 2000) with an OLS fitted line; panel (b): universal male suffrage plotted against year]
countries had universal male suffrage (at least in theory). In other words, it seems
quite possible that something in the error term (a time trend) is correlated both with
inheritance taxes and with universal suffrage. So what appears to be a relationship
between suffrage and taxes may be due to the fact that suffrage increased at
a time when inheritance taxes were going up rather than to a causal effect of
suffrage.
Figure 6.8 presents evidence consistent with these suspicions. Panel (a) shows
the relationship between year and the inheritance tax. The line is the fitted line
from a bivariate OLS regression model in which inheritance tax was the dependent
variable and year was the independent variable. Clearly, the inheritance tax was
higher as time went on.
Panel (b) of Figure 6.8 shows the relationship between year and universal male
suffrage. The data is jittered for ease of viewing, and the line is from a bivariate
model. Obviously, this is not a causal model; it instead shows that the mean value
for the year variable was much higher when universal male suffrage equaled 1
than when universal male suffrage equaled 0. Taken together with panel (a), we
have evidence that the two conditions for omitted variable bias are satisfied: the
year variable is associated with the dependent variable and with the independent
variable.
What to do next is simple enough—include a year variable with the following model:
Inheritance taxi = β0 + β1 Universal male suffragei + β2 Yeari + εi
where Year equals the value of the year of the observation. This model allows us to
assess whether a difference exists between countries with universal male suffrage
and countries without universal male suffrage even after we control for a year trend
that may have affected all countries.
Table 6.8 shows the results. The bivariate column is the same as in Table 6.7.
The multivariate (a) column adds the year variable. Whoa! Huge difference. Now
the coefficient on universal male suffrage is –0.38, with a tiny t statistic. In terms
of difference of means testing, we can now say that controlling for a year trend,
the average inheritance tax in countries with universal male suffrage was not
statistically different from that in countries without universal male suffrage.
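The omitted variable logic here can be sketched with a small simulation. The Python snippet below is illustrative only: the data are randomly generated, not Scheve and Stasavage's. We build a year trend that drives both suffrage and the inheritance tax (with, by construction, no true suffrage effect), then show that the bivariate suffrage coefficient is spuriously large while adding year as a control pushes it toward zero. The ols() helper solves the normal equations directly:

```python
import random

def ols(X, y):
    """OLS coefficients via the normal equations (X'X) b = X'y, solved by Gauss-Jordan."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    a = [xtx[i] + [xty[i]] for i in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * p for x, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

random.seed(42)
n = 500
year = [random.uniform(1850, 2000) for _ in range(n)]
# Suffrage is more likely in later years; the tax follows the time trend but,
# by construction, suffrage itself has NO effect on the tax.
suffrage = [1.0 if (yr - 1850) / 150 > random.random() else 0.0 for yr in year]
tax = [0.2 * (yr - 1850) + random.gauss(0, 5) for yr in year]

bivar = ols([[1.0, s] for s in suffrage], tax)
multi = ols([[1.0, s, yr] for s, yr in zip(suffrage, year)], tax)
print(f"bivariate suffrage coefficient:            {bivar[1]:6.2f}")  # spuriously large
print(f"suffrage coefficient controlling for year: {multi[1]:6.2f}")  # near zero
```

With this setup, the bivariate coefficient reflects only the shared time trend, mirroring the Table 6.8 pattern.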
Scheve and Stasavage argue that war was a more important factor behind
increased inheritance taxes. When a country mobilizes to fight, leaders not only
need money to fund the war, they also need a societal consensus in favor of it.
Ordinary people may feel stretched thin, with their sons conscripted and their
taxes increased. An inheritance tax could be a natural outlet that provides the
government with more money while creating a sense of fairness within society.
Column (b) in the multivariate results includes a dummy variable indicating
that the country was mobilized for war for more than half of the preceding
five years. The coefficient on the war variable is 14.05, with a t statistic of 4.68,
meaning that there is a strong connection between war and inheritance taxes. The
coefficient on universal suffrage is negative but not quite statistically significant
(with a t statistic of 1.51). The coefficient on year continues to be highly statistically
significant, indicating that the year trend persists even when we control for war.
Many other factors could affect the dependent variable and be correlated with
one or more of the independent variables. There could, for example, be regional
variation, as perhaps Europe tended to have more universal male suffrage and
higher inheritance taxes. Therefore, we include dummy variables for Europe, Asia,
and Australia/New Zealand in column (c). North America is the reference category,
which means, for example, that European inheritance taxes were 5.65 percentage
points lower than in North America once we control for the other variables.
The coefficient on the war variable in column (c) is a bit lower than in column
(b) but still very significant. The universal male suffrage variable is close to zero and
statistically insignificant. These results therefore suggest that the results in column
(b) are robust to controlling for continent.
Column (d) shows what happens when we use Australia/New Zealand as our
reference category instead of North America. The coefficients on the war and
6.3 Transforming Categorical Variables to Multiple Dummy Variables 201
suffrage variables are identical to those in column (c). Remember that changing
the reference category affects only how we interpret the coefficients on the dummy
variables associated with the categorical variable in question.
The coefficients on the region variables, however, do change with the new
reference category. The coefficient on Europe in column (d) is 2.19 and statistically
insignificant. Wait a minute! Wasn’t the coefficient on Europe –5.65 and statistically
significant in column (c)? Yes, but in column (c), Europe was being compared to
North America, and Europe’s average inheritance taxes were (controlling for the
other variables) 5.65 percentage points lower than North American inheritance
taxes. In column (d), Europe is being compared to Australia/New Zealand, and the
coefficient indicates that European inheritance taxes were 2.19 percentage points
higher than in Australia/New Zealand.
The relative relationship between Europe and North America is the same in
both specifications as the coefficient on the North America dummy variable is 7.84
in column (d), which is 5.65 higher than the coefficient on Europe in column (d).
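We can verify this reference-category arithmetic numerically. The sketch below uses made-up region effects, not the Table 6.8 estimates: it fits the same model twice with different omitted categories and confirms that the slope on the continuous variable is unchanged while the dummy coefficients shift by a common constant:

```python
import random

def ols(X, y):
    """OLS coefficients via the normal equations (X'X) b = X'y, solved by Gauss-Jordan."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    a = [xtx[i] + [xty[i]] for i in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * p for x, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

random.seed(7)
regions = ["NorthAmerica", "Europe", "ANZ"]
rows = [(random.choice(regions), random.uniform(0, 10)) for _ in range(300)]
intercepts = {"NorthAmerica": 8.0, "Europe": 2.0, "ANZ": 0.0}  # made-up region effects
y = [intercepts[r] + 1.5 * x + random.gauss(0, 1) for r, x in rows]

def fit(reference):
    """Fit y on a constant, dummies for every region except `reference`, and x."""
    dummies = [r for r in regions if r != reference]
    X = [[1.0] + [1.0 if r == d else 0.0 for d in dummies] + [x] for r, x in rows]
    b = ols(X, y)
    out = {d: b[1 + i] for i, d in enumerate(dummies)}
    out["intercept"], out["x"] = b[0], b[-1]
    return out

vs_na = fit("NorthAmerica")  # dummies: Europe, ANZ
vs_anz = fit("ANZ")          # dummies: NorthAmerica, Europe
print("slope on x:", round(vs_na["x"], 3), "vs", round(vs_anz["x"], 3))  # identical
# Europe relative to North America is recoverable from either fit:
print(round(vs_na["Europe"], 3), "==", round(vs_anz["Europe"] - vs_anz["NorthAmerica"], 3))
```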
FIGURE 6.9: 95 Percent Confidence Intervals for Universal Male Suffrage Variable in Table 6.8. The plot shows intervals for the bivariate model and multivariate models (a) through (c) on an estimated-coefficient axis running from −5 to 20.
We can go through such a thought process for each of the coefficients and see
the bottom line: as long as we know how to use dummy variables for categorical
variables, the substantive results are exactly the same in multivariate columns (c)
and (d).
Figure 6.9 shows the 95 percent confidence intervals for the coefficient on the
universal suffrage variable for the bivariate and multivariate models. As discussed
in Section 4.6, confidence intervals indicate the range of possible true values
most consistent with the data. In the bivariate model, the confidence interval
ranges from 15.8 to 22.9. This confidence interval does not cover zero, which is
another way of saying that the coefficient is statistically significant. When we move
to the multivariate models, however, the 95 percent confidence intervals shift
dramatically downward and cover zero, indicating that the estimated effect is no
longer statistically significant. We don’t need to plot the results from column (d)
because the coefficient on the suffrage variable is identical to that in column (c).
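The arithmetic linking a confidence interval to statistical significance is easy to check. In the sketch below, the first coefficient and standard error pair (roughly 19.35 and 1.81) is backed out from the bivariate interval reported above; the second pair uses the −0.38 suffrage coefficient with a purely hypothetical standard error:

```python
def ci95(coef, se):
    """95 percent confidence interval under a large-sample normal approximation."""
    return coef - 1.96 * se, coef + 1.96 * se

# First pair implied by the bivariate interval (15.8 to 22.9); second pair
# pairs the multivariate suffrage coefficient with an assumed standard error.
for coef, se in [(19.35, 1.81), (-0.38, 2.50)]:
    lo, hi = ci95(coef, se)
    t = coef / se
    print(f"coef = {coef:6.2f}, 95% CI = ({lo:6.2f}, {hi:6.2f}), "
          f"|t| = {abs(t):5.2f}, covers zero: {lo <= 0 <= hi}")
```

The interval excludes zero exactly when |t| exceeds 1.96, which is why the two statements are interchangeable.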
A dummy variable for gender by itself assumes that all men get paid more by the same amount. It could be that work experience for men is more highly rewarded than work experience for women. We address this possibility with models in which a dummy independent variable interacts with (meaning "is multiplied by") a continuous independent variable.7
The following OLS model allows the effect of X to differ across groups:

Yi = β0 + β1 Xi + β2 Dummyi + β3 Dummyi × Xi + εi
The third variable is produced by multiplying the Dummyi variable times the Xi
variable. In a spreadsheet, we would simply create a new column that is the product
of the Dummy and X columns. In statistical software, we generate a new variable,
as described in the Computing Corner of this chapter.
For the Dummyi = 0 group, the fitted value equation simplifies to

Ŷi = β̂0 + β̂1 Xi

In other words, the estimated intercept for the Dummyi = 0 group is β̂0 and the estimated slope is β̂1.
For the Dummyi = 1 group, the fitted value equation simplifies to

Ŷi = (β̂0 + β̂2) + (β̂1 + β̂3)Xi

In other words, the estimated intercept for the Dummyi = 1 group is β̂0 + β̂2 and the estimated slope is β̂1 + β̂3.
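These two fitted lines can be verified with a quick numerical check; the coefficient values below are invented purely for illustration:

```python
# Hypothetical estimates: b0 (intercept), b1 (slope on X),
# b2 (dummy intercept shift), b3 (dummy slope shift).
b0, b1, b2, b3 = 30.0, 1.0, 5.0, 2.0

def fitted(x, dummy):
    """Fitted value from Y-hat = b0 + b1*X + b2*Dummy + b3*Dummy*X."""
    return b0 + b1 * x + b2 * dummy + b3 * dummy * x

for x in [0, 5, 10]:
    # Dummy = 0 observations lie on the line b0 + b1*X ...
    assert fitted(x, 0) == b0 + b1 * x
    # ... and Dummy = 1 observations on the line (b0 + b2) + (b1 + b3)*X.
    assert fitted(x, 1) == (b0 + b2) + (b1 + b3) * x

print("Dummy = 0 line: intercept", b0, "slope", b1)
print("Dummy = 1 line: intercept", b0 + b2, "slope", b1 + b3)
```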
Figure 6.10 shows a hypothetical example for the following model of salary as a function of experience for men and women:

Salaryi = β0 + β1 Experiencei + β2 Malei + β3 Malei × Experiencei + εi
The dummy variable here is an indicator for men, and the continuous variable
is a measure of years of experience. The intercept for women (the Dummyi = 0
group) is β̂0 , and the intercept for men (the Dummyi = 1 group) is β̂0 + β̂ 2 . The β̂2
coefficient indicates the salary bump that men get even at 0 years of experience.
The slope for women is β̂1 , and the slope for men is β̂1 + β̂ 3 . The β̂ 3 coefficient
indicates the extra salary men get for each year of experience over and above the
salary increase women get for another year of experience. In this figure, the initial
gap between the salaries of men and women is modest (equal to β̂ 2 ), but due to
a positive β̂ 3 , the salary gap becomes quite large for people with many years of
experience.
7 Interactions between continuous variables are created by multiplying two continuous variables together. The general logic is the same. Kam and Franzese (2007) provide an in-depth discussion of all kinds of interactions.
FIGURE 6.10: Salary (in $1,000s) and years of experience (0 to 10), with fitted lines for men (Dummyi = 1 group; intercept β̂0 + β̂2, slope β̂1 + β̂3) and women (Dummyi = 0 group; intercept β̂0, slope β̂1).
How the two slopes relate depends on the signs of β̂1 and β̂3:

β̂1 < 0: The slope for the Di = 0 group is negative. The slope for the Di = 1 group is more negative if β̂3 < 0; the same if β̂3 = 0; and less negative if β̂3 > 0 (it will be positive if β̂1 + β̂3 > 0).

β̂1 = 0: The slope for the Di = 0 group is zero. The slope for the Di = 1 group is negative if β̂3 < 0; zero if β̂3 = 0; and positive if β̂3 > 0.

β̂1 > 0: The slope for the Di = 0 group is positive. The slope for the Di = 1 group is less positive if β̂3 < 0 (it will be negative if β̂1 + β̂3 < 0); the same if β̂3 = 0; and more positive if β̂3 > 0.
The standard error of β̂ 3 is useful for calculating confidence intervals for the
difference in slope coefficients across the two groups. Standard errors for some
quantities of interest are tricky, though. To generate confidence intervals for the
effect of X on Y, we need to be alert. For the Dummyi = 0 group, the effect is
simply β̂1 , and we can simply use the standard error of β̂1 . For the Dummyi = 1
group, the effect is β̂1 + β̂ 3 ; the standard error of the effect is more complicated
because we must account for the standard error of both β̂1 and β̂ 3 in addition to
any correlation between β̂1 and β̂ 3 (which is associated with the correlation of X1
and X3 ). The Citations and Additional Notes section provides more details on how
to do this on page 559.
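The variance arithmetic behind that standard error is var(β̂1 + β̂3) = var(β̂1) + var(β̂3) + 2cov(β̂1, β̂3). The numbers below are invented purely to illustrate the formula and why ignoring the covariance term gives the wrong answer:

```python
import math

# Hypothetical pieces of an estimated coefficient covariance matrix.
var_b1, var_b3, cov_b1_b3 = 0.25, 0.16, -0.10

# Correct standard error of the Dummy = 1 group's slope (b1 + b3).
se_sum = math.sqrt(var_b1 + var_b3 + 2 * cov_b1_b3)
# Naive version that wrongly ignores the covariance term.
naive = math.sqrt(var_b1 + var_b3)

print(f"correct SE of (b1 + b3):  {se_sum:.3f}")
print(f"naive SE (no covariance): {naive:.3f}")
```

Because β̂1 and β̂3 are typically negatively correlated (X and Dummy × X overlap), the naive calculation usually overstates the uncertainty.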
REMEMBER THIS
Interaction variables allow us to estimate effects that depend on more than one variable.
Yi = β0 + β1 Xi + β2 Dummyi + β3 Dummyi × Xi + εi
3. The fitted values from this model will be two lines. For the model as written, the slope for the
group for which Dummyi = 0 will be β̂1 . The slope for the group for which Dummyi = 1 will
be β̂1 + β̂ 3 .
4. The coefficient on a dummy interaction variable indicates the estimated difference in slopes
between two groups.
FIGURE 6.11: Various Fitted Lines from Dummy Interaction Models (for Review Questions). Each of the six panels, (a) through (f), plots Y against X (both 0 to 10) with a fitted line for the Dummyi = 1 group and a fitted line for the Dummyi = 0 group.
Review Questions
1. For each panel in Figure 6.11, indicate whether each of β0 , β1 , β2 , and β3 is less than, equal to,
or greater than zero for the following model:
Yi = β0 + β1 Xi + β2 Dummyi + β3 Dummyi × Xi + εi
6.4 Interaction Variables 207
The results for this model, in column (a) of Table 6.10, indicate that the home-
owner used 13.02 fewer therms of energy in months of using the programmable
thermostat than in months before he acquired it. Therms cost about $1.59 at this
8 For each day, the HDD is measured as the number of degrees that a day's average temperature is below 65 degrees Fahrenheit, the temperature below which buildings may need to be heated. The monthly measure adds up the daily measures and provides a rough measure of the amount of heating needed in the month. If the temperature is above 65 degrees, the HDD measure will be zero.
FIGURE 6.12: Heating Used and Heating Degree-Days for Homeowner Who Installed a Programmable Thermostat
time, so the homeowner saved roughly $20.70 per month on average. That’s not
bad. However, the effect is not statistically significant (not even close, really, as
the t statistic is only 0.54), so based on this result, we should be skeptical that the
thermostat saved money.
The difference of means model does not control for anything else, and we know
that the coefficient on the programmable thermostat variable will be biased if some
other variable matters and is correlated with the programmable thermostat vari-
able. In this case, we know unambiguously that HDD matters, and it is plausible that
the HDD differed in the months with and without the programmable thermostat.
Hence, a better model is clearly

Thermsi = β0 + β1 Programmable thermostati + β2 HDDi + εi
The results for this model are in column (b) of Table 6.10. The HDD variable is
hugely (massively, superlatively) statistically significant. Including it also leads to a
TABLE 6.10 Data from Programmable Thermostat and Home Heating Bills

                              (a)          (b)          (c)
Programmable thermostat    −13.02      −20.05∗       −0.48
                           (23.94)      (4.49)       (4.15)
                          [t = 0.54]  [t = 4.46]   [t = 0.11]
N                              45           45           45
σ̂                           80.12        15.00        10.25
The results for a model that also interacts the thermostat dummy with HDD,

Thermsi = β0 + β1 Programmable thermostati + β2 HDDi + β3 Programmable thermostati × HDDi + εi

are in column (c) of Table 6.10, where the coefficient on Programmable thermostat indicates the difference in therms when the other variables are zero. Because the other variables involve HDD, the coefficient on Programmable thermostat indicates the effect of the thermostat when HDD is zero
(meaning the weather is warm for the whole month). The coefficient of −0.48 with a
t statistic of 0.11 indicates there is no significant bump down in energy usage across
all months. This might seem to be bad news, but is it, given that we have figured out that the programmable thermostat shouldn't reduce heating costs when the furnace isn't running?
Not quite. The overall effect of the thermostat is β̂1 + β̂ 3 × HDD. Although
we have already seen that β̂1 is insignificant, the coefficient on Programmable
thermostat × HDD, −0.062, is highly statistically significant, with a t statistic of
7.00. For every one-unit increase in HDD, the programmable thermostat lowered
the therms used by 0.062. In a month with the HDD variable equal to 500, we estimate that the homeowner changed energy use by β̂1 + β̂3 × 500 = −0.48 + (−0.062 × 500) = −31.48 therms after the programmable thermostat was installed (lowering the bill by $50.05, at $1.59 per therm). In a month with the HDD variable equal to 1,000, we estimate that the homeowner changed energy use by −0.48 + (−0.062 × 1000) = −62.48 therms, lowering the bill by $99.34 at $1.59 per
therm. Suddenly we’re talking real money. And we’re doing so from a model that
makes intuitive sense because the savings should indeed differ depending on how
cold it is.9
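The savings arithmetic can be reproduced directly from the column (c) estimates reported above (β̂1 = −0.48, β̂3 = −0.062, and $1.59 per therm):

```python
b1 = -0.48    # Programmable thermostat coefficient, column (c)
b3 = -0.062   # Programmable thermostat x HDD coefficient
price = 1.59  # dollars per therm

def therm_change(hdd):
    """Estimated change in therms from the thermostat in a month with a given HDD."""
    return b1 + b3 * hdd

for hdd in [0, 500, 1000]:
    d = therm_change(hdd)
    print(f"HDD = {hdd:5d}: change of {d:7.2f} therms, about ${-d * price:.2f} saved")
```

At HDD = 0 the estimated effect is essentially zero, and it grows roughly in proportion to how cold the month is, which is exactly the logic of the interactive model.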
This case provides an excellent example of how useful—and distinctive—the
dummy variable models we’ve presented in this chapter can be. In panel (a) of
Figure 6.13, we show the fitted values based on model (b) in Table 6.10, which
controls for HDD but models the effect of the thermostat as a constant difference
across all values of HDD. The effect of the programmable thermostat is statistically
significant and rather substantial, but it doesn’t ring true because it suggests that
savings from reduced use of gas for the furnace are the same in a sweltering
summer month and in a frigid winter month. Panel (b) of Figure 6.13 shows
the fitted values based on model (c) in Table 6.10, which allows the effect of
the thermostat to vary depending on the HDD. This is an interactive model that
yields fitted lines with different slopes. Just by inspection, we can see the fitted
lines for model (c) fit the data better. The effects are statistically significant and
substantial and, perhaps most important, make more sense because the effect of
the programmable thermostat on heating gas used increases as the month gets
colder.
9 We might be worried about correlated errors given that this is time series data. As discussed on page 68, the coefficient estimates are not biased if the errors are correlated, but standard OLS standard errors might not be appropriate. In Chapter 13, we show how to estimate models with correlated errors. For this data set, the results get a bit stronger.
FIGURE 6.13: Heating Used and Heating Degree-Days with Fitted Values for Different Models. Both panels plot therms (0 to 300), distinguishing months with and without the programmable thermostat.
Conclusion
Dummy variables are incredibly useful. Despite a less-than-flattering name, they
do some of the most important work in all of statistics. Experiments almost
always are analyzed with treatment group dummy variables. A huge proportion
of observational studies care about or control for dummy variables such as gender
or race. And when we interact dummy variables with continuous variables, we can
investigate whether the effects of certain variables differ by group.
We have mastered the core points of this chapter when we can do the
following:
• Section 6.1: Write down a model for a difference of means test using
bivariate OLS. Which parameter measures the estimated difference? Sketch
a diagram that illustrates the meaning of this parameter.
• Section 6.2: Write down a model for a difference of means test using
multivariate OLS. Which parameter measures the estimated difference?
Sketch a diagram that illustrates the meaning of this parameter.
• Section 6.4: Write down a model that has a dummy variable (D) interaction
with a continuous variable (X). How do we explain the effect of X on Y?
Sketch the relationship for Di = 0 observations and Di = 1 observations.
Further Reading
Brambor, Clark, and Golder (2006) as well as Kam and Franzese (2007) provide
excellent discussions of interactions, including the appropriate interpretation of
models with two continuous variables interacted. Braumoeller (2004) does a good
job of injecting caution into the interpretation of coefficients on lower-order terms
in models that include interaction variables.
Key Terms
Categorical variables (193)
Dichotomous variable (181)
Difference of means test (180)
Dummy variable (181)
Jitter (184)
Ordinal variables (193)
Reference category (194)
Computing Corner
Stata
3. Page 559 in the citations and additional notes section discusses how to
generate a standard error in Stata for the effect of X on Y for the Dummyi =
1 group.
R

3. Page 559 in the citations and additional notes section discusses how to generate a standard error in R for the effect of X on Y for the Dummyi = 1 group.
category and three dummy variables will be included. If the data type
of our categorical variable is factor, running lm(Y ~ X1) (notice we
do not need the factor command) will produce an OLS model with the
appropriate dummy variables included. To change the reference value for
a factor variable, use the relevel() command. For example, if we include
X1 = relevel(X1, ref = "south") before our regression model, the
reference category will be south.
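The same reference-category bookkeeping can be sketched outside Stata and R. The Python function below is illustrative only, not part of the book's Computing Corner: it builds dummy columns from a categorical variable, omitting whichever category we choose as the reference:

```python
def dummy_columns(values, reference):
    """One-hot encode a categorical variable, omitting the reference category."""
    levels = sorted(set(values))
    keep = [lev for lev in levels if lev != reference]
    # One row of 0/1 indicators per observation, one column per kept category.
    return keep, [[1.0 if v == lev else 0.0 for lev in keep] for v in values]

region = ["south", "west", "north", "south", "east", "west"]
names, cols = dummy_columns(region, reference="south")
print(names)  # the included dummy columns; "south" is the omitted category
for r, row in zip(region, cols):
    print(r, row)  # every "south" row is all zeros
```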
Exercises
1. Use data from heightwage.dta that we used in Exercise 1 in Chapter 5
(page 172).
(a) Estimate an OLS regression model with adult wages as the dependent
variable and adult height, adolescent height, and a dummy variable
for males as the independent variables. Does controlling for gender
affect the results?
(c) Reestimate the model from part (a) separately for males and females.
Do these results differ from the model in which male was included
as a dummy variable? Why or why not?
(d) Estimate a model in which adult wages is the dependent variable and
there are controls for adult height and adolescent height in addition
to dummy variable interactions of male times each of the two height
variables. Compare the results to the results from part (c).
(a) Create two scatterplots, one for years in which a Democrat was
president and one for years in which a Republican was president,
showing the relationship between the FFR and the quarters since the
previous election. Comment on the differences in the relationships.
The variable Quarters is coded 0 to 15, representing each quarter
from one election to the next. For each presidential term, the value
of Quarters is 0 in the first quarter containing the election and 15 in
the quarter before the next election.
(d) Graph two fitted lines for the relationship between Quarters and
interest rates, one for Republicans and one for Democrats. (In Stata,
use the twoway and lfit commands with appropriate if statements;
label by hand. In R, use the abline command.) Briefly describe the
relationship.
(e) Rerun the model from part (b) controlling for both the interest rate
in the previous quarter (lag_FEDFUND) and inflation, and discuss
the results, focusing on (i) effect of Quarters for Republicans, (ii) the
differential effect of Quarters for Democrats, (iii) impact of lagged
FFR, and (iv) inflation. Simply report the statistical significance of
the coefficient estimates; don’t go through the entire analysis from
part (c).
3. This problem uses the cell phone and traffic data set described in Chapter 5
(page 174) to analyze the relationship between cell phone and texting bans
and traffic fatalities. We add two variables: cell_ban is coded 1 if it is illegal
to operate a handheld cell phone while driving and 0 otherwise; text_ban
is coded 1 if it is illegal to text while driving and 0 otherwise.
(a) Add the dummy variables for cell phone bans and texting bans to the
model from Question 3, part (c) in Chapter 5 (page 175). Interpret
the coefficients on these dummy variables.
(b) Explain whether the results from part (a) allow the possibility that
a cell phone ban saves more lives in a state with a large population
compared to a state with a small population. Discuss the implications
for the proper specification of the model.
(c) Estimate a model in which total miles is interacted with both the
cell phone ban and the prohibition of texting variables. What is the
estimated effect of a cell phone ban for California? For Wyoming?
What is the effect of a texting ban for California? For Wyoming?
What is the effect of total miles?
(d) This question uses material from page 559 in the citations and
additional notes section. Figure 6.14 displays the effect of the cell
phone ban as a function of total miles. The dashed lines depict
confidence intervals. Identify the points on the fitted lines for the
estimated effects for California and Wyoming from the results in
part (c). Explain the conditions under which the cell phone ban has
a statistically significant effect.10
10 Brambor, Clark, and Golder (2006) provide Stata code to create a plot like this for models with interaction variables.
(a) Implement a simple difference of means test that uses OLS to assess
whether the fines for men and women are different. Do we have any
reason to expect endogeneity? Explain.
(b) Implement a difference of means test for men and women that
controls for age and miles per hour. Do we have any reason to expect
endogeneity? Explain.
(c) Building from the model just described, also assess whether fines are
higher for African-Americans and Hispanics compared to everyone
else (non-Hispanic whites, Asians and others). Explain what the
coefficients on these variables mean.
(d) Look at the standard errors on the coefficients for the Female, Black, and
Hispanic variables. Why are they different?
(e) Within a single OLS model, assess whether miles over the speed limit
has a differential effect on the fines for women, African-Americans,
and Hispanics.
(c) The effect of party may go beyond simply giving all Republicans
a bump up or down in their answers. It could be that political
knowledge interacts with being Republican such that knowledge has
different effects on Republicans and non-Republicans. To test this,
estimate a model that includes a dummy interaction term:
11 We could use tools for categorical variables discussed in Section 6.3 to separate non-Republicans into Democrats and Independents. Our conclusions would be generally similar in this particular example.
7 Specifying Models
1 We have used multivariate OLS to net out the effect of income, religiosity, and children from the life satisfaction scores.

2 Or smile shaped, if you will. To my knowledge, there is no study of chocolate and happiness, but I'm pretty sure it would be an upside-down U: people might get happier the more they eat for a while, but at some point, more chocolate has to lead to unhappiness, as it did for the kid in Willy Wonka.
7.1 Quadratic and Polynomial Models 221
FIGURE 7.1: Life satisfaction (4.5 to 8.0) plotted against age (20 to 70).
Yi = β0 + β1² X1i + εi

Yi = β0 + β1 X1i + β2² X1i + εi
The X's, though, are fair game: we can square, cube, log, or otherwise transform X's to produce fitted curves instead of fitted lines. Therefore, both of the following models are OK in OLS because each β simply multiplies itself times some independent variable that may or may not be non-linear:

Yi = β0 + β1 X1i + β2 X1i² + εi

Yi = β0 + β1 X1i + β2 X1i⁷ + εi
Non-linear relationships are common in the real world. Figure 7.2 shows
data on life expectancy and GDP per capita for all countries in the world. We
immediately sense that there is a positive relationship: the wealthier countries
definitely have higher life expectancy. But we also see that the relationship is a
curve rather than a line because life expectancy rises rapidly at the lower levels of
GDP per capita but then flattens out. Based on this data, it’s pretty reasonable to
expect an annual increase of $1,000 in per capita GDP to have a fairly substantial
effect on life expectancy in a country with low GDP per capita, while an increase of
$1,000 in per capita GDP for a very wealthy country would have only a negligible
effect on life expectancy. Therefore, we want to get beyond estimating straight
lines alone.
Figure 7.3 shows the life expectancy data with two different kinds of fitted lines. Panel (a) shows a fitted line from a standard OLS model:

Life expectancyi = β0 + β1 GDP per capitai + εi
As we can see, the fit isn’t great. The fitted line is lower than the data for many
of the observations with low GDP values. For observations with high GDP levels,
the fitted line dramatically overestimates life expectancy. As bad as it is, though,
this is the best possible straight line in terms of minimizing squared error.
3 The world doesn't end if we really want to estimate a model that is non-linear in the β's. We just need something other than OLS to estimate the model. In Chapter 12, we discuss probit and logit models, which are non-linear in the β's.
FIGURE 7.2: Life Expectancy and Per Capita GDP in 2011 for All Countries in the World (y-axis: life expectancy in years, 50 to 80)
Polynomial models

We can generate a better fit by using a polynomial model. Polynomial models include not only an independent variable but also the independent variable raised to some power. By using a polynomial model, we can produce fitted value lines that curve. (polynomial model: A model that includes values of X raised to powers greater than one.)

The simplest example of a polynomial model is a quadratic model that includes X and X². (quadratic model: A model that includes X and X² as independent variables.) The model looks like this:

Yi = β0 + β1 X1i + β2 X1i² + εi        (7.2)
For our life expectancy example, a quadratic model is

Life expectancyi = β0 + β1 GDP per capitai + β2 GDP per capitai² + εi
Panel (b) of Figure 7.3 plots this fitted curve, which better captures the
non-linearity in the data as life expectancy rises rapidly at low levels of GDP and
then levels off. The fitted curve is not perfect. The predicted life expectancy is
FIGURE 7.3: Linear and Quadratic Fitted Lines for Life Expectancy Data (both panels: life expectancy in years, 50 to 90, versus GDP per capita in $1,000s, 0 to 100)
FIGURE 7.3: Linear and Quadratic Fitted Lines for Life Expectancy Data
still a bit low for low values of GDP, and the turn to negative effects seems more
dramatic than the data warrant. We’ll see how to generate fitted lines that flatten
out without turning down when we cover logged models later in this chapter.
Interpreting coefficients in a polynomial model differs from interpreting them in a standard OLS model. Note that the effect of X changes depending on
the value of X. In panel (b) of Figure 7.3, the effect of GDP on life expectancy
is large for low values of GDP. That is, when GDP goes from $0 to $20,000, the
fitted value for life expectancy increases relatively rapidly. The effect of GDP on
life expectancy is smaller as GDP gets higher: the change in fitted life expectancy
when GDP goes from $40,000 to $60,000 is much smaller than the change in fitted
life expectancy when GDP goes from $0 to $20,000. The predicted effect of GDP
even turns negative when GDP goes above $60,000.
We need some calculus to get the specific equation for the effect of X on Y. We refer to the effect of X1 on Y as ∂Y/∂X1:

∂Y/∂X1 = β1 + 2β2 X1        (7.4)
This equation means that when we interpret results from a polynomial regression, we can't look at individual coefficients in isolation; instead, we need to know how the coefficients on X1 and X1² come together to produce the estimated curve.4
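Equation 7.4 is easy to check numerically. With hypothetical coefficients chosen so the curve peaks at X = 60 (echoing the shape of the life expectancy example, not its actual estimates), the marginal effect shrinks as X grows and eventually turns negative:

```python
# Hypothetical quadratic coefficients (not estimates from the text):
b1, b2 = 1.2, -0.01

def marginal_effect(x):
    """dY/dX for Y = b0 + b1*X + b2*X^2 (Equation 7.4): b1 + 2*b2*X."""
    return b1 + 2 * b2 * x

for x in [0, 20, 40, 60, 80]:
    print(f"X = {x:3d}: effect of one more unit of X = {marginal_effect(x):6.2f}")

# The curve peaks where the marginal effect is zero: X = -b1 / (2 * b2).
print("turning point at X =", round(-b1 / (2 * b2), 1))
```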
FIGURE 7.4: Six panels, (a) through (f), each plotting a different quadratic relationship between X (0 to 100) and Y; panel (b), for example, plots Y = 20X − 0.1X².
4 Equation 7.4 is the result of using standard calculus tools to take the derivative of Y in Equation 7.2 with respect to X1. The derivative is the slope evaluated at a given value of X1. For a linear model, the slope is always the same and is β̂1. The ∂Y in the numerator refers to the change in Y; the ∂X1 in the denominator refers to the change in X1. The fraction ∂Y/∂X1 therefore refers to the change in Y divided by the change in X1, which is the slope.

Figure 7.4 illustrates more generally the kinds of relationships that a quadratic model can account for. Each panel illustrates a different quadratic function. In panel (a), the effect of X is getting bigger as X gets bigger. In panel (b), the effect of X on Y is getting smaller. In both panels, Y gets bigger as X gets bigger, but the relationships have a quite different feel.

In panels (c) and (d) of Figure 7.4, there are negative relationships between X and Y: the more X, the less Y. Again, though, we see very different types of relationships. In panel (c), there is a leveling out, while in panel (d), the negative effect of X on Y accelerates as X gets bigger.

A quadratic OLS model can even estimate relationships that change directions. In panel (e) of Figure 7.4, Y initially gets bigger as X increases, but then it levels out. Eventually, increases in X decrease Y. In panel (f), we see the opposite pattern, with Y getting smaller as X rises for small values of X and, eventually, Y rising with X.

One of the nice things about using a quadratic specification in OLS is that we don't have to know ahead of time whether the relationship is curving down or up, flattening out, or getting steeper. The data will tell us. We can simply estimate a quadratic model and, if the relationship is like that in panel (a) of Figure 7.4, the estimated OLS coefficients will yield a curve like the one in the panel; if the relationship is like that in panel (f), OLS will produce coefficients that best fit the data. So if we have data that looks like any of the patterns in Figure 7.4, we can get fitted lines that reflect the data simply by estimating a quadratic OLS model.

Polynomial models with cubed or higher-order terms can account for patterns that wiggle and bounce even more than those in the quadratic model. It's relatively rare, however, to use higher-order polynomial models, which often simply aren't supported by the data. In addition, using higher-order terms without strong theoretical reasons can be a bit fishy, as in raising the specter of the model fishing we warn about in Section 7.4. A control variable with a high order can be more defensible, but ideally, our main results do not depend on untheorized high-order polynomial control variables.

REMEMBER THIS

OLS can estimate non-linear effects via polynomial models.

1. A polynomial model includes X raised to powers greater than one. The general form is

Yi = β0 + β1 Xi + β2 Xi² + εi
Discussion Questions
For each of the following, discuss whether you expect the relationship to be linear or non-linear.
Sketch the relationship you expect with a couple of points on the X-axis, labeled to identify the
nature of any non-linearity you anticipate.
(a) Age and income in France
(b) Height and speed in the Boston Marathon
(c) Height and rebounds in the National Basketball Association
(d) IQ and score on a college admissions test in Japan
(e) IQ and salary in Japan
(f) Gas prices and oil company profits
(g) Sleep and your score on your econometrics final exam
[Figure 7.5: Temperature (deviation from average pre-industrial temperature, in Fahrenheit) plotted against Year, in panels (a), (b), and (c).]
Panel (b) of Figure 7.5 includes the fitted line from a bivariate OLS model with
Year as the independent variable:
Temperaturei = β0 + β1Yeari + εi
Table 7.1 (model fit): σ̂ = 0.12 (linear), 0.11 (quadratic); R² = 0.73 (linear), 0.78 (quadratic).
Column (a) of Table 7.1 shows the coefficient estimates for the linear model.
The estimated β̂1 is 0.006, with a standard error of 0.0003. The t statistic of 18.74
indicates a highly statistically significant coefficient. The result suggests that the
earth has been getting 0.006 degree warmer each year since 1879 (when the data
series begins).
The data looks pretty non-linear, so we also estimate the following quadratic
OLS model:
Temperaturei = β0 + β1Yeari + β2Yeari² + εi
in which Year and Year2 are independent variables. This model allows us to assess
whether the temperature change has been speeding up or slowing down by
enabling us to estimate a curve in which the change per year in recent years is,
depending on the data, larger or smaller than the change per year in earlier years.
We have plotted the fitted line in panel (c) of Figure 7.5; notice it is a curve that gets
steeper over time. It fits the data even better, with less underestimation in recent
years and less overestimation in the 1970s.
Column (b) of Table 7.1 reports results from the quadratic model. The coef-
ficients on Year and Year2 have t stats greater than 5, indicating clear statistical
significance. The coefficient on Year is −0.166, and the coefficient on Year2 is
0.000044. What the heck do those numbers mean? At a glance, not much. Recall,
however, that in a quadratic model, an increase in Year by one unit will be associated
with a β̂1 + 2 β̂2 Yeari increase in estimated average global temperature. This means
the predicted change from an increase in Year by one unit in 1900 is

−0.166 + 2 × 0.000044 × 1900 = 0.0012 degree

The predicted change in temperature from an increase in Year by one unit in 2000 is

−0.166 + 2 × 0.000044 × 2000 = 0.01 degree
230 CHAPTER 7 Specifying Models
In the quadratic model, in other words, the predicted effect of Year changes
over time. In particular, the estimated rate of warming in 2000 (0.01 degree per year)
is around eight times the estimated rate of warming in 1900 (0.0012 degree per
year).
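These back-of-the-envelope calculations are easy to script. The sketch below (ours, not the book's code) plugs the quadratic estimates reported in the text into the marginal-effect formula β̂1 + 2β̂2·Year:

```python
# Marginal effect of Year in the quadratic model, beta1_hat + 2 * beta2_hat * Year,
# using the coefficient estimates reported in the text (Table 7.1, column (b)).
def marginal_effect(year, b1=-0.166, b2=0.000044):
    """Predicted change in temperature from a one-unit increase in Year."""
    return b1 + 2 * b2 * year

print(round(marginal_effect(1900), 4))  # 0.0012 degree per year
print(round(marginal_effect(2000), 4))  # 0.01 degree per year
```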
We won’t pay much attention at this point to the standard errors because errors
are almost surely autocorrelated (as discussed in Section 3.6), which would make
the standard errors reported by OLS incorrect (probably too small). We address
autocorrelation and other time series aspects of this data in Chapter 13.
Review Questions
Figure 7.6 contains hypothetical data on investment by consumer electronics companies as a function
of their profit margins.
1. For each panel, describe the model you think best explains the data.
2. Sketch a fitted line for each panel.
3. For each panel, approximate the predicted effect on R&D investment of changing profits from
0 to 1 percent and of changing profits from 3 to 4 percent.
For our purposes, we won’t be using the mathematical properties of logs too
much.5 We instead note that using logged variables in OLS equations can allow
us to characterize non-linear relationships that are broadly similar to panels (b)
and (c) of Figure 7.4. In that sense, these models don’t differ dramatically from
polynomial models.
Models with logged variables also have an attractive feature that quadratic models
lack. The estimated coefficients can be interpreted
[Figure 7.6: R&D investment plotted against profit margin (0 to 4 percent) in four panels (for the Review Questions).]

5 We derive the marginal effects in log models in the Citations and Additional Notes section (page 560).
directly in percentage terms. That is, with the correct logged model, we can
produce results that tell us how much a one percent increase in X affects Y. Often
this is a good way to think about empirical questions.
Consider the model of GDP and life expectancy we looked at on page 222. If
we estimate a basic OLS model such as

Life expectancyi = β0 + β1GDP per capitai + εi
the estimated β̂1 in this model would tell us the increase in life expectancy that
would be associated with a one-unit increase in GDP per capita (measured in
thousands of dollars in this example). At first glance, this might seem like an OK
model. On second glance, we might get nervous. Suppose the model produces
β̂1 = 0.25; that result would say that every country—whatever its GDP—would
get another 0.25 years of life expectancy for every thousand-dollar increase in GDP
per capita. That means that the effect of a dollar (or, given that we're measuring
GDP in thousands of dollars, a thousand dollars) is the same in a rich country like
the United States and a poor country like Cambodia. One could easily imagine
that the money in a poor country could go to life-extending medicine and nutrition;
in the United States, it seems likely the money would go to iPhone apps and maybe
triple bacon cheeseburgers, neither of which is particularly likely to increase
life expectancy.
It may be better to think of GDP changes in percentage terms rather than
in absolute values. A $1,000 increase in GDP per capita in Cambodia is a large
percentage increase, while in the United States, a $1,000 increase in GDP per
capita is not very large in percentage terms.
linear-log model: A model in which the dependent variable is not logged, but the independent variable is.

Logged models are extremely useful when we want to model relationships in
percentage terms. For example, we could estimate a linear-log model in which the
independent variable is logged (and the dependent variable is not logged). Such a
model would look like

Yi = β0 + β1 ln Xi + εi        (7.6)

where β1 indicates the effect of a one percent increase in X on Y.
We need to divide the estimated coefficient by 100 to convert it to units of Y.
This is one of the odd hiccups in models with logged variables: the units can be
a bit tricky. While we can memorize the way units work in these various models,
the safe course of action here is to simply accept that each time we use logged
models, we’ll probably have to look up how units in logged models work in the
summary on page 236.
Figure 7.7 shows a fitted line from a linear-log model using the GDP and
life expectancy data we saw earlier in Figure 7.3. One nice feature of the fitted
line from this model is that the fitted values keep rising by smaller and smaller
amounts as GDP per capita increases. This pattern contrasts to the fitted values
in the quadratic model, which declined for high values of GDP per capita. The
estimated coefficient in the linear-log model on GDP per capita (measured in
thousands of dollars) is 5.0. This implies that a one percent increase in GDP per
capita is associated with an increase in life expectancy of 0.05 years. For a country
with a GDP per capita of $100,000, then, an increase of GDP per capita of $1,000
7.2 Logged Variables 233

[Figure 7.7: Fitted line from a linear-log model of life expectancy (in years, roughly 50 to 80) on GDP per capita (in thousands of dollars, 0 to 100).]
is an increase of one percent and will increase life expectancy by 0.05 of a year.
For a country with a GDP per capita of $10,000, however, an increase of GDP
per capita of $1,000 is a 10 percent increase, implying that the estimated effect
is to increase life expectancy by about 0.5 of a year. A $1,000 increase in GDP
per capita for a country with GDP per capita of $1,000 would be a 100 percent
increase, implying that the fitted value of life expectancy would rise by about
5 years.
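The arithmetic in this paragraph can be checked with a short script (ours; the coefficient of 5.0 comes from the text, while the exact log formula replaces the text's rule-of-thumb approximation):

```python
import math

B1 = 5.0  # estimated linear-log coefficient on ln(GDP per capita), from the text

def predicted_change(x0, x1, b1=B1):
    """Exact change in fitted life expectancy when GDP per capita moves x0 -> x1.

    The text's rule of thumb (b1/100 years per one percent increase) is the
    first-order approximation of b1 * ln(x1/x0)."""
    return b1 * math.log(x1 / x0)

# A $1,000 increase is 1% at $100,000 but 10% at $10,000 (GDP in thousands):
print(round(predicted_change(100, 101), 2))  # 0.05 years
print(round(predicted_change(10, 11), 2))    # 0.48 years (rule of thumb: ~0.5)
```

Note that for small percentage changes the rule of thumb and the exact formula nearly coincide, which is why the text's approximations work well here.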
log-linear model: A model in which the dependent variable is transformed by taking its natural log.

Logged models come in several flavors. We can also estimate a log-linear
model, in which the dependent variable is transformed by taking its natural log
and the independent variable is not logged. For example, suppose we are
interested in testing whether women get paid less than men. We could run a simple linear
model with wages as the dependent variable and a dummy variable for women.
That's odd, though, because it would say that all women get β̂1 dollars less. It
might be more reasonable to think that discrimination works in percentage terms,
as women may get some percent less than men. The following log-linear model
captures this idea:

ln Wagei = β0 + β1 Femalei + εi
Because of the magic of calculus (shown on page 560), the β̂1 in this model
can be interpreted as the percentage change in Y associated with a one-unit
increase in X. In other words, the model would provide us with an estimate that
the difference in wages women get is β̂1 percent.
log-log model: A model in which the dependent variable and the independent variable are logged.

At the pinnacle of loggy-ness is the so-called log-log model. Log-log models
do a lot of work in economic models. Among other uses, they allow us to
estimate elasticity, which is the percent change in Y associated with a one percent
change in X. For example, if we want to know the elasticity of demand for airline
tickets, we can get data on sales and prices and estimate the following model:6

ln Ticket salesi = β0 + β1 ln Pricei + εi
6 A complete analysis would account for the fact that prices are also a function of the quantity of tickets sold. We address these types of models in Section 9.6.
7 Recall that the (natural) log of k is the exponent to which we have to raise e to obtain k. There is no number that we can raise e to and get zero. We can get close by raising e to minus a huge number; for example, e^−100 = 1/e^100, which is very close to zero, but not quite zero.
8 Some people recode these numbers as something very close to zero (e.g., 0.0000001) on the
reasoning that the log function is defined for low positive values and the essential information (that
the variable is near zero) in such observations is not lost. However, it’s always a bit sketchy to be
changing values (even from zero to a small number), so tread carefully.
TABLE 7.2 Different Logged Models of Relationship between Height and Wages

                          No log        Linear-log    Log-linear    Log-log
Adolescent height         0.412∗                      0.033∗
                          (0.098)                     (0.015)
                          [t = 4.23]                  [t = 2.23]
Log adolescent height                   29.316∗                     2.362∗
                                        (6.834)                     (1.021)
                                        [t = 4.29]                  [t = 2.31]
Constant                  −13.093       −108.778∗     0.001         −7.754
                          (6.897)       (29.092)      (1.031)       (4.348)
                          [t = 1.90]    [t = 3.74]    [t = 0.01]    [t = 1.78]
N                         1,910         1,910         1,910         1,910
R²                        0.009         0.010         0.003         0.003
The second column reports results from a linear-log model in which the
dependent variable is not logged and the independent variable is logged. The
interpretation of β̂1 is that a one percent increase in X (which is adolescent height
in this case) is associated with a 29.316/100 = $0.293 increase in hourly wages. The
dividing by 100 is a bit unusual, but no big deal once we get used to it.
The third column reports results from a model in which the dependent variable
has been logged but the independent variable has not been logged. In such a
log-linear model, the coefficient indicates the percent change in the dependent
variable associated with a one-unit change in the independent variable. The
interpretation of β̂1 here is that a one-inch increase in height is associated with
a 3.3 percent increase in wages.
The fourth column reports a log-log model in which both the dependent
variable and the independent variable have been logged. The interpretation of β̂1
here is that a one percent increase in height is associated with a 2.362 percent
increase in wages. Note that in the log-linear column, the percentage is on a scale
of 0 to 1, and in the log-log column, the percentage is on a 0 to 100 scale. Yeah,
that’s a pain; it’s just how the math works out.
So which model is best? Sadly, there is no magic bullet that will always hit the
perfect model here, another hiccup when we work with logged models. We can’t
simply look at the R2 because those values are not comparable: in the first two
models the dependent variable is Y, and in the last two, the dependent variable is
ln(Y). As is often the case, some judgment will be necessary. If we’re dealing with
an economic problem of estimating price elasticity, a log-log model is natural. In
other contexts, we have to decide whether the causal mechanism makes more sense
in percentage terms and whether it applies to the dependent and/or independent
variables.
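The unit conventions across the columns of Table 7.2 can be condensed into a few lines of code (ours; the model labels follow the text's terminology, and the printed numbers reproduce the interpretations above):

```python
# A quick-reference sketch for converting a logged-model coefficient into a
# substantive statement about Y, following the conventions in the text.
def effect_of(model, beta1):
    if model == "linear-log":  # 1% increase in X -> beta1/100 units of Y
        return beta1 / 100
    if model == "log-linear":  # 1-unit increase in X -> 100*beta1 percent change in Y
        return beta1 * 100
    if model == "log-log":     # 1% increase in X -> beta1 percent change in Y
        return beta1
    raise ValueError(model)

print(round(effect_of("linear-log", 29.316), 5))  # 0.29316 ($0.293 more per hour)
print(round(effect_of("log-linear", 0.033), 1))   # 3.3 percent higher wages per inch
print(round(effect_of("log-log", 2.362), 3))      # 2.362 percent per 1% of height
```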
REMEMBER THIS
1. How to interpret logged models:
(a) Linear-log (Yi = β0 + β1 ln Xi + εi): a one percent increase in X is associated with a β̂1/100 unit change in Y.
(b) Log-linear (ln Yi = β0 + β1Xi + εi): a one-unit increase in X is associated with a 100 × β̂1 percent change in Y.
(c) Log-log (ln Yi = β0 + β1 ln Xi + εi): a one percent increase in X is associated with a β̂1 percent change in Y.
2. Logged models have some challenges not found in other models (the Three Hiccups):
(a) The scale of the β̂ coefficients varies depending on whether the model is log-linear,
linear-log, or log-log.
(b) We cannot log variables that have values less than or equal to zero.
(c) There is no simple test for choosing among log-linear, linear-log, and log-log
models.
[Figure 7.8: Causal paths for a post-treatment variable. X1, the independent variable (example: 9th grade tutoring treatment), affects Y, the dependent variable (example: age 26 earnings), directly with effect γ1; X1 also affects X2, the post-treatment variable (example: 12th grade reading scores), which in turn affects Y with effect γ2.]
A post-treatment variable comes after an independent variable of interest and could be caused by it. Our concern
with post-treatment variables is definitely not limited to experiments, though, as
post-treatment variables can screw up observational studies as well.
mediator bias: Bias that occurs when a post-treatment variable is added and absorbs some of the causal effect of the treatment variable.

Two problems can arise when we include post-treatment variables. The first
is called mediator bias, a type of bias that occurs when a post-treatment
variable is added and absorbs some of the causal effect of the treatment variable.
For example, suppose we provided extra tutoring for a randomly selected group
of ninth graders and then assessed their earnings at age 26. The mechanism for
the tutoring to work had two parts, as shown in Figure 7.8. The arrows indicate a
causal effect, and the Greek letters next to the arrows indicate the magnitude of
the variable's effect.
We see that the tutoring had a direct effect on earnings of γ1 . Tutoring also
increased test scores by α, and reading scores increased earnings by γ2 . In other
words, Figure 7.8 shows that if we plunk a kid in this tutoring program, he or she
will make γ1 + αγ2 more at age 26.
Suppose we estimate a simple model with only the treatment variable:

Earningsi = β0 + β1Tutoringi + εi
While this model doesn’t capture the complexity of the process by which tutoring
increased earnings, it does capture the overall effect of being in the tutoring
program. Simply put, β̂1 will provide an unbiased estimate of the effect of the
tutoring program because the tutoring treatment was randomly assigned and is
therefore not correlated with anything, including . In terms of Figure 7.8, a kid in
the tutoring program will earn γ1 +αγ2 more at age 26, and E[ β̂1 ] will be γ1 +αγ2 .
It might seem that adding reading scores to the model would be useful. Maybe.
But we need to be careful. If we estimate

Earningsi = β0 + β1Tutoringi + β2Reading scoresi + εi        (7.10)
the estimated coefficient on the tutoring treatment will only capture the direct
effect of tutoring and will not capture the indirect effect of the tutoring via
improving reading scores; that is, E[ β̂1 ] = γ1 . That means that if we naively focus
on β̂1 as the effect of the tutoring treatment, we’ll miss that portion of the effect
associated with the tutoring increasing reading scores.9
In a case like this, two steps are most appropriate. First, we should estimate the
simpler model without the post-treatment variable in order to estimate the overall
effect of the treatment. Second, if we want to understand the process by which
the treatment variable affects the outcome we can estimate two equations: one
that looks like Equation 7.10 in order to estimate the direct effect of 12th grade
reading on earnings, and another equation to understand the effect of the tutoring
treatment on 12th grade reading.
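A minimal simulation (ours, simpler than the book's Ch7_PostTreatmentSimulation code) illustrates the first step: with γ1 = α = γ2 = 1, regressing the outcome on the randomly assigned treatment alone recovers the total effect γ1 + αγ2 = 2.

```python
import random

random.seed(0)
n = 100_000
x1 = [random.gauss(0, 1) for _ in range(n)]               # randomized treatment
x2 = [x + random.gauss(0, 1) for x in x1]                 # post-treatment: alpha = 1
y = [a + b + random.gauss(0, 1) for a, b in zip(x1, x2)]  # gamma1 = gamma2 = 1

def slope(x, y):
    """Bivariate OLS slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return num / sum((xi - mx) ** 2 for xi in x)

# Without the post-treatment variable, the slope recovers the total effect
# gamma1 + alpha * gamma2 = 2; adding x2 as a control would instead push the
# estimated treatment coefficient toward the direct effect gamma1 = 1.
print(round(slope(x1, y), 1))  # ~2.0
```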
collider bias: Bias that occurs when a post-treatment variable creates a pathway for spurious effects to appear in our estimation.

The second problem that post-treatment variables can cause is collider bias,
a type of bias that occurs when a post-treatment variable creates a pathway for
spurious effects to appear in our estimation. This bias is more subtle and therefore
more insidious than mediator bias. In particular, if we include a post-treatment
variable that is affected by an unobserved confounder that also affects the
dependent variable, the estimated effect of a variable of interest may look large
when it is zero, look small when it is large, look positive when it is negative, and
so on.10
Here we’ll focus on a case in which including a post-treatment variable can
lead to an appearance of a causal relationship when there is in fact no relationship;
we’re building on an example from Acharya, Blackwell, and Sen (2016). Suppose
we want to know if car accidents cause the flu. It’s a silly question: we don’t really
think that car accidents cause the flu, but let’s see if a post-treatment variable could
lead us to think car accidents do cause (or prevent) the flu. Suppose we have data
9 We provide references to the recent statistical literature on this issue in the Further Reading section at the end of this chapter.
10 The name "collider bias" is not particularly intuitive. It comes from a literature that uses diagrams (like Figure 7.9) to assess causal relations. The two arrows from X1 and U "collide" at X2, hence the name.
7.3 Post-Treatment Variables 239
[FIGURE 7.9: Example in which a Post-Treatment Variable Creates a Spurious Relationship between X1 and Y. X1, the independent variable (example: car accident), affects X2 with effect α; U, the unobserved confounder variable (example: high fever), affects X2 with effect ρ1 and Y with effect ρ2.]
on 100,000 people and whether they were in a car accident (our independent
variable of interest, which we label X1 ), whether they were hospitalized (our
post-treatment variable, which we label X2 ), and whether they had the flu (our
dependent variable, Y). Compared to our discussion of mediator bias, we’ll add
a confounder variable, which is something that is unmeasured but affects both
the post-treatment variable (X2 ) and the dependent variable (Y). We’ll label the
confounder as U to emphasize that it is unobserved.
Figure 7.9 depicts the true state of relationships among variables in our exam-
ple. Car accidents increase hospitalization by α, fever increases hospitalization by
ρ1 , and fever increases the probability of having the flu by ρ2 . In our example, car
accidents have no direct effect on having the flu, and being hospitalized itself does
not increase the probability of having the flu. (We allow for these direct effects in
our more general discussion of collider bias in Section 14.8.)
If we estimate a simple model

Flui = β0 + β1Car accidenti + εi
we will be fine because the car accident variable is uncorrelated with the
unobserved factor, fever (which we can see by noting there is no direct connection
between the car accidents and fever in Figure 7.9). The expected value of β̂1 for
such a model will be the true effect, which is zero in our example depicted in the
figure.
It might seem pretty harmless to also add a variable for hospitalization to the
model, so that our model now looks like

Flui = β0 + β1Car accidenti + β2Hospitalizedi + εi

[Figure 7.10: General causal paths for collider bias. X1, the independent variable, affects Y, the dependent variable, directly with effect γ1 and affects X2, the post-treatment variable, with effect α; X2 affects Y with effect γ2; U, the unobserved confounder variable, affects X2 with effect ρ1 and Y with effect ρ2.]
Here we'll examine how collider bias distorts our estimate of the direct effect
of X1 on Y. The true direct effect of X1 on Y is γ1 (see Figure 7.10); we'll consider
bias to be any deviation of the expected value of the estimated coefficient from
γ1. This bias factor is αρ2/ρ1, meaning that three conditions are necessary for a
post-treatment variable to create bias: α ≠ 0, ρ1 ≠ 0, and ρ2 ≠ 0. The condition that
α ≠ 0 is simply the condition that X2 is in fact a post-treatment variable affected
by X1. If α = 0, then X1 has no effect on X2. The conditions that ρ1 ≠ 0 and ρ2 ≠ 0
are the conditions that make the unobserved variable a confounder: it affects both
the post-treatment variable X2 and Y. If U does not affect both X2 and Y, then there
is no hidden relationship that is picked up by the estimation.
What should we do if we suspect collider bias? One option has a very
multivariate OLS feel: simply add the confounder. If we do this, the bias goes
away. But the thing about confounders is that the reason we’re thinking about them
as confounders in the first place is that they are something we probably haven’t
measured, so this approach is often infeasible.
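A simulation (ours, in the spirit of the book's post-treatment code) makes the danger concrete: accidents have no effect on flu, yet "controlling" for hospitalization manufactures a sizable negative estimate. The effect sizes below are illustrative choices, not the book's.

```python
import random

random.seed(1)
n = 100_000
accident = [random.gauss(0, 1) for _ in range(n)]                         # X1
fever = [random.gauss(0, 1) for _ in range(n)]                            # U, unobserved
hospital = [a + f + random.gauss(0, 1) for a, f in zip(accident, fever)]  # X2: alpha = rho1 = 1
flu = [f + random.gauss(0, 1) for f in fever]                             # Y: rho2 = 1, no X1 effect

def slope(x, y):
    """Bivariate OLS slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)

def residuals(y, x):
    """Residuals from regressing y on x (used to partial x out)."""
    b, mx, my = slope(x, y), sum(x) / len(x), sum(y) / len(y)
    return [yi - (my + b * (xi - mx)) for xi, yi in zip(x, y)]

# Simple model: the accident coefficient is (correctly) near zero.
print(round(slope(accident, flu), 2))
# Controlling for hospitalization (via the Frisch-Waugh two-step): the accident
# coefficient turns spuriously negative, about -0.5 in this setup.
print(round(slope(residuals(accident, hospital), residuals(flu, hospital)), 1))  # ~ -0.5
```

The Frisch-Waugh two-step used here reproduces the multivariate OLS coefficient on the accident variable while keeping the code to bivariate slopes.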
Discussion Questions
1. Suppose we are interested in assessing whether there is gender bias in wages. Our main variable
of interest is X1 , which is a dummy variable for women. Our dependent variable, Y, is wages.
We also know the occupation for each person in our sample. For simplicity, assume that our
occupation variable is simply a dummy variable X2 indicating whether someone is an engineer
or not. Do not introduce other variables into your discussion (at least until you are done with
the following questions!).
(a) Create a figure like Figure 7.8 that indicates potential causal relations.
(b) What is E[β̂1] for Yi = β0 + β1X1i + εi?
(c) What are the signs of E[β̂1] and E[β̂2] for Yi = β0 + β1X1i + β2X2i + εi?
(d) What model specification do you recommend?
2. Suppose we are interested in assessing whether having a parent who was in jail is more likely
to increase the probability that a person will be arrested as an adult. Our main variable of
interest is X1 , which is a dummy variable indicating a person’s parent served time in jail. Our
dependent variable, Y, is an indicator for whether that person was arrested as an adult. We also
have a variable X2 that indicates whether the person was suspended in high school. We do not
observe childhood lead exposure, which we label as U. Do not introduce other factors into your
discussion (at least until you are done with the following questions!).
(a) Create a figure that indicates potential causal relations.
(b) What is E[β̂1] for Yi = β0 + β1X1i + εi?
(c) What are the signs of E[β̂1] and E[β̂2] for Yi = β0 + β1X1i + β2X2i + εi?
(d) What model specification do you recommend?
7.4 Model Specification 243
REMEMBER THIS
1. Post-treatment variables are variables that are affected by the independent variable of interest.
2. Including post-treatment variables in a model can create two types of bias.
(a) Mediator bias: Including post-treatment variables in a model can cause the
post-treatment variable to soak up some of the causal effect of our variable of interest.
(b) Collider bias: Including post-treatment variables in a model can bias the coefficient on
our variable of interest if there is an unmeasured confounder variable that affects both
the post-treatment variable and the dependent variable.
3. It is best to avoid including post-treatment variables in models.
And sometimes the changes in results can be subtle. Sometimes we’re missing
observations for some variables. For example, in survey data it is quite common
for a pretty good chunk of people to decline to answer questions about their
annual income. If we include an income variable in a model, OLS will include
only observations for people who fessed up about how much money they make.
If only half of the survey respondents answered, including income as a control
variable will cut our sample size in half. This change in the sample can cause
coefficient estimates to jump around because, as we talked about with regard to
sampling distributions (on page 53), coefficients will differ for each sample. In
some instances, the effects on a coefficient estimate can be large.11
Two good practices mitigate the dangers inherent in model specification. The
first is to adhere to the replication standard. Some people see how coefficient
estimates can change dramatically depending on specification and become
statistical cynics. They believe that statistics can be manipulated to give any
answer. Such thinking lies behind the aphorism “There are three kinds of lies:
lies, damned lies, and statistics.” A better response is skepticism, a belief that
statistical analysis should be transparent to be believed. In this view, the saying
should be “There are three kinds of lies: lies, damned lies, and statistics that can’t
be replicated.”
A second good practice is to present results from multiple specifications in
a way that allows readers to understand which steps of the specification are the
crucial ones for the conclusion being offered. Begin by presenting a minimal
specification, which is a specification with only the variable of interest and perhaps
some small number of can’t-exclude variables as well (see Lenz and Sahn 2017).
Then explain the addition of additional variables (or other specification changes
such as including non-linearities or limiting the sample). Coefficients may change
when variables are added or excluded—that is, after all, the point of multivariate
analysis. When a specification choice makes a big difference, the researcher owes
the reader a big explanation for why this is a sensible modeling choice. And
because it often happens that two different specifications are reasonable, the reader
should see (or have access to in an appendix) both specifications. This will inform
readers that the results either are robust across reasonable specification choices
or depend narrowly on particular specification choices. The results on height and
wages reported in Table 5.2 offer one example, and we’ll see more throughout the
book.
11 And it is possible that the effects of a variable differ throughout the population. If we limit the
sample to only those who report income (people who tend to make less money, as it happens), we
may be estimating a different effect (the effect of X1 in a lower-income subset) than when we
estimate the model with all the data (the effect of X1 for the full population). Aronow and Samii
(2016) provide an excellent discussion of these and other nuances in OLS estimation.
REMEMBER THIS
1. An important part of model specification is choosing what variables to include in the model.
2. Researchers should provide convincing evidence that they are not model fishing by including
replication materials and by reporting results from multiple specifications, beginning with a
minimal specification.
Conclusion
This chapter has focused on the opportunities and challenges inherent in
model specification. First, the world is not necessarily linear, and the multivariate
model can accommodate a vast array of non-linear relationships. Polynomial mod-
els, of which quadratic models are the most common, can produce fitted lines with
increasing returns, diminishing returns, and U-shaped and upside-down U-shaped
relationships. Logged models allow effects to be interpreted in percentage terms.
Post-treatment variables provide an example in which we can have too many
variables in a model, as post-treatment variables can soak up causal effects or,
more subtly, create pathways for spurious causal effects to appear.
We have mastered the core points of this chapter when we can do the
following:
• Section 7.1: Explain polynomial models and quadratic models. Sketch the
various kinds of relationships that a quadratic model can estimate. Show
how to interpret coefficients from a quadratic model.
• Section 7.2: Explain three different kinds of logged models. Show how to
interpret coefficients in each.
Further Reading
Empirical papers using logged variables are very common; see, for example, Card
(1990). Zakir Hossain (2011) discusses the use of Box-Cox tests to help decide
which functional form (linear, log-linear, linear-log, or log-log) is best.
Key Terms
Collider bias (238) · Elasticity (234) · Linear-log model (232) · Log-linear model (233) · Log-log model (234) · Mediator bias (237) · Model fishing (243) · Model specification (220) · p-hacking (243) · Polynomial model (223) · Post-treatment variable (236) · Quadratic model (223)
Computing Corner
Stata
Exercises
1. The relationship between political instability and democracy is important
and likely to be quite complicated. Do democracies manage conflict in
a way that reduces instability, or do they stir up conflict? Use the data
set called Instability_PS data.dta from Zaryab Iqbal and Christopher Zorn
(2008) to answer the following questions. The data set covers 157 countries
between 1946 and 1997. The unit of observation is the country-year. The
variables are listed in Table 7.3.
Instab Index of instability (revolutions, crises, coups, etc.); ranges from −4.65 to +10.07
Coldwar Cold War year (1 = yes, 0 = no)
12 For the reasons discussed in the homework exercise in Chapter 3 on page 89, we limit the data set
to observations with height greater than 40 inches and self-reported income less than 400 British
pounds per hour. We also exclude observations of individuals who grew shorter from age 16 to age
33. Excluding these observations doesn’t really affect the results, but the observations themselves are
just odd enough to make us think that these cases may suffer from non-trivial measurement error.
(c) Now do the same test, but with log of wages at age 33 as the
dependent variable. Use female as the dummy variable. Interpret the
coefficient on the female dummy variable.
(d) How much does height explain salary differences across genders?
Estimate a difference of means test across genders, using logged
wages as the dependent variable and controlling for height at age
33 and at age 16. Explain the results.
(e) Does the effect of height vary across genders? Use logged wages at
age 33 as the dependent variable, and control for height at age 16
and the number of siblings. Explain the estimated effect of height at
age 16 for men and for women using an interaction with the female
variable. Use an F test to assess whether height affects wages for
women.
(b) Sketch the relationship between age and ticket amount from the
foregoing quadratic model: calculate the fitted value for a white
male with MPHover equals 0 (probably not many people going
zero miles over the speed limit got a ticket, but this simplifies
calculations a lot) for ages equal to 20, 25, 30, 35, 40, and 70.
(In Stata, the following displays the fitted value for a 20-year-old,
(c) Use Equation 7.4 to calculate the marginal effect of age at ages 20,
35, and 70. Describe how these marginal effects relate to your sketch.
(d) Calculate the age that is associated with the lowest predicted fine
based on the quadratic OLS model results given earlier.
(e) Do drivers from out of town and out of state get treated differ-
ently? Do state police and local police treat non-locals differently?
Estimate a model that allows us to assess whether out-of-towners
and out-of-staters are treated differently and whether state police
respond differently to out-of-towners and out-of-staters. Interpret the
coefficients on the relevant variables.
(f) Test whether the two state police interaction terms are jointly
significant. Briefly explain the results.
4. The book’s website provides code that will simulate a data set we can use to
explore the effects of including post-treatment variables. (Stata code is in
Ch7_PostTreatmentSimulation.do; R code is in Ch7_PostTreatmentSim-
ulation.R).
The first section of code simulates what happens when X1 (the
independent variable of interest) affects X2 , a post-treatment variable as
in Figure 7.8 on page 237. Initially, we set γ1 (the direct effect of X1 on Y),
α (the effect of X1 on X2 ), and γ2 (the effect of X2 on Y) all equal to 1.
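The logic of that first simulation can be sketched in a few lines. This is a minimal stand-in for the book's simulation files (not their actual code), assuming only the setup described above, with γ1 = α = γ2 = 1: regressing Y on X1 alone recovers the total effect γ1 + αγ2, while controlling for the post-treatment X2 leaves only the direct effect γ1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
gamma1, alpha, gamma2 = 1.0, 1.0, 1.0   # parameter values used in the text

x1 = rng.normal(size=n)                  # independent variable of interest
x2 = alpha * x1 + rng.normal(size=n)     # post-treatment variable affected by x1
y = gamma1 * x1 + gamma2 * x2 + rng.normal(size=n)

# OLS of y on x1 alone recovers the TOTAL effect: gamma1 + alpha * gamma2 = 2
b_total = np.linalg.lstsq(np.column_stack([np.ones(n), x1]), y, rcond=None)[0][1]

# Controlling for the post-treatment x2 yields only the direct effect: gamma1 = 1
b_direct = np.linalg.lstsq(np.column_stack([np.ones(n), x1, x2]), y, rcond=None)[0][1]

print(round(b_total, 2), round(b_direct, 2))  # approximately 2.0 and 1.0
```

The point of the exercise follows directly: neither regression is "wrong," but they answer different questions, and including a post-treatment variable changes what the coefficient on X1 means.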
256 CHAPTER 8 Using Fixed Effects Models to Fight Endogeneity in Panel Data and Difference-in-Difference Models
The logic behind the fixed effect approach also is important when we conduct
difference-in-difference analysis, which is particularly helpful in the evaluation
of policy changes. We use this model to compare changes in units affected by
some policy change to changes in units not affected by the policy. We show how
difference-in-difference methods rely on the logic of fixed effects models and, in some
cases, use the same tools as panel data analysis.
In this chapter, we show the power and ease of implementing fixed effects
models. Section 8.1 uses a panel data example to illustrate how basic OLS can fail
when the error term is correlated with the independent variable. Section 8.2 shows
how fixed effects can come to the rescue in this case (and others). It describes how
to estimate fixed effects models by using dummy variables or so-called de-meaned
data. Section 8.3 explains the mildly miraculous ability of fixed effects models
to control for variables even as the models are unable to estimate coefficients
associated with these variables. This ability is a blessing in that we control for these
variables; it is a curse in that we sometimes are curious about such coefficients.
Section 8.4 extends fixed effect logic to so-called two-way fixed effects models
that control for both unit- and time-related fixed effects. Section 8.5 discusses
difference-in-difference methods that rely on the fixed effect logic and are widely
used in policy analysis.
city and others from another city is ignored. For all the computer knew when
running that model, there were N separate cities producing the data.
Table 8.1 shows the results. The coefficient on the police variable is positive
and very statistically significant. Yikes. More cops, more crime. Weird. In fact,
for every additional police officer per capita, there were 2.37 more robberies per
capita. Were we to take these results at face value, we would believe that cities
could eliminate more than two robberies per capita for every police officer per
capita they fired.
Of course we don’t believe the pooled results. We worry that there are
unmeasured factors lurking in the error term that could be correlated with the
number of police, thereby causing bias. The error term in Equation 8.1 contains
gangs, drugs, economic hopelessness, broken families, and many more conditions.
If any of those factors is correlated with the number of police in a given city,
we have endogeneity. Given that police are more likely to be deployed when and
where there are gangs, drugs, and economic desolation, endogeneity in our model
seems inevitable.
In this chapter, we try to eliminate some of this endogeneity by focusing on
aspects of the error associated with each city. To keep our discussion relatively
simple, we’ll turn our attention to five California cities: Los Angeles, San
Francisco, Oakland, Fresno, and Sacramento. Figure 8.1 plots their per capita
robbery and police data from 1971 to 1992.
Consistent with the OLS results on all cities, the message seems clear that
robberies are more common when there are more police. However, we actually
have more information than Figure 8.1 displays. We know which city each
observation comes from. Figure 8.2 replots the data from Table 8.1, but in a
way that differentiates by city. The underlying data is exactly the same, but the
observations for each city have different shapes. The observations for Fresno are
the circles in the lower left, the observations for Oakland are the triangles in the
top middle, and so forth. What does the relationship between police and crime
look like now?
It’s still a bit hard to see, so Figure 8.3 adds a fitted line for each city. These
are OLS regression lines estimated on a city-by-city basis. All are negative, some
dramatically so (Los Angeles and San Francisco).

[Figure here: robberies per 1,000 people (vertical axis) plotted against police per
1,000 people (horizontal axis, roughly 2.0 to 3.5) for Oakland, San Francisco,
Sacramento, Los Angeles, and Fresno, with a fitted regression line for each city.]

FIGURE 8.3: Robberies and Police for Specified Cities in California with City-Specific Regression
Lines

The claim that police reduce crime is looking much better. Within each individual city,
robberies tend to decline as police increase.
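The pattern (a positive pooled slope but negative within-city slopes) can be reproduced with a few made-up numbers; the two cities and all values below are hypothetical, not the chapter's data. The high-crime city also has more police, so pooling flips the sign:

```python
import numpy as np

# Hypothetical panel: a low-crime city with few police, a high-crime city with many
police = np.array([1.0, 2.0, 3.0, 5.0, 6.0, 7.0])   # police per 1,000
robbery = np.array([4.0, 3.0, 2.0, 10.0, 9.0, 8.0])  # robberies per 1,000
city = np.array([0, 0, 0, 1, 1, 1])

def slope(x, y):
    """OLS slope from a regression of y on x with an intercept."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

pooled = slope(police, robbery)   # positive: pooling ignores which city is which
within = [slope(police[city == c], robbery[city == c]) for c in (0, 1)]

print(pooled, within)  # pooled slope > 0, each within-city slope = -1
```

The pooled regression sees only that high-police observations are high-robbery observations; the city-by-city regressions see that, within each city, more police go with fewer robberies.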
The difference between the pooled OLS results and these city-specific
regression lines presents a puzzle. How can the pooled OLS estimates suggest
a conclusion so radically different from Figure 8.3? The reason is the villain of
this book—endogeneity.
Here’s how it happens. Think about what’s in the error term εit in Equation 8.1:
gangs, drugs, and all that. These factors almost certainly affect the crime across
cities and are plausibly correlated with the number of police because cities with
bigger gang or drug problems hire more police officers. Many of these elements in
the error term are also stable within each city, at least in our 20-year time frame. A
city that has a culture or history of crime in year 1 probably has a culture or history
of crime in year 20 as well. This is the case in our selected cities: San Francisco
has lots of police and many robberies, while Fresno has not so many police and
not so many robberies.
And here’s what creates endogeneity: these city-specific baseline levels of
crime are correlated with the independent variable. The cities with the most
robberies (Oakland, Los Angeles, and San Francisco) have the most police. The
cities with fewest robberies (Fresno and Sacramento) have the fewest police. If
we are not able to find another variable to control for whatever is causing these
differential levels of baselines—and if it is something hard to measure like history
or culture or gangs or drugs, we may not be able to—then standard OLS will have
endogeneity-induced bias and lead us to the spurious inference we highlighted at
the start of the chapter.
      Test scoresit = β0 + β1 Private schoolit + εit

where Test scoresit is test scores of student i at time t and Private schoolit is a
dummy variable that is 1 if student i is in a private school at time t and 0 if not.
This model is for a (hypothetical) data set in which we observe test scores for
specific children over a number of years.
The following three simple questions help us identify possibly troublesome
endogeneity.
What is in the error term? Test performance potentially depends not only on
whether a child went to a private school (a variable in the model) but also on his or
her intelligence and diligence, the teacher’s ability, family support, and many other
factors in the error term. While we can hope to measure some of these factors, it
is a virtual certainty that we will not be able to measure them all.
Are there any stable unit-specific elements in the error term? Intelligence,
diligence, and family support are likely to be quite stable for individual students
across time.
Are the stable unit-specific elements in the error term likely to be correlated
with the independent variable? It is quite likely that family support, at least, is
correlated with attendance at private schools, since families with the wealth and/or
interest in private schools are likely to provide other kinds of educational support
to their children. This tendency is by no means set in stone, however: countless
kids with good family support go to public schools, and there are certainly kids
with no family support who end up in private schools. On average, though, it is
reasonable to suspect that kids in private schools have more family support. If
this is the case, then what may seem to be a causal effect of private schools on
test scores may be little more than an indirect effect of family support on test
scores.
8.2 Fixed Effects Models 261
REMEMBER THIS
1. A pooled model with panel data ignores the panel nature of the data. The equation is

      Yit = β0 + β1 X1it + εit

2. A common source of endogeneity in the use of a pooled model to analyze panel data is that
the specific units have different baseline levels of Y, and these levels are correlated with X. For
example, cities with higher crime (meaning high unit-specific error terms) also tend to have
more police, creating a correlation in a pooled model between the error term and the police
independent variable.

fixed effects model   A model that controls for unit-specific effects. These fixed effects capture
differences in the dependent variable associated with each unit.

For our city crime data, the fixed effects model is

      Robberiesit = β0 + β1 Policei,t−1 + αi + νit

More generally, fixed effects models look like

      Yit = β0 + β1 X1it + αi + νit        (8.3)

A fixed effects model is simply a model that contains a parameter like αi that captures
differences in the dependent variable associated with each unit and/or period.
The fixed effect αi is the part of the unobserved error that has the same
value for every observation for unit i. It basically reflects the average value of
the dependent variable for unit i, after we have controlled for the independent
variables. The unit is the unit of observation. In our city crime example, the unit
of observation is the city.
Even though we write down only a single parameter (αi ), we’re actually
representing a different value for each unit. That is, this parameter takes on a
potentially different value for each unit. In the city crime model, therefore, the
value of αi will be different for each city. If Pittsburgh has a higher average number
of robberies than Portland, the αi for Pittsburgh will be higher than the αi for
Portland.
The amazing thing about the fixed effects parameter is that it allows us to
control for a vast array of unmeasured attributes of units in the data set. These
could correspond to historical, geographical, or institutional factors. Or these
attributes could relate to things we haven’t even thought of. The key is that the fixed
effect term allows different units to have different baseline levels of the dependent
variable.
Why is it useful to model fixed effects in this way? When fixed effects are in
the error term, as in the pooled OLS model, they can cause endogeneity and bias.
But if we can pull them out of the error term, we will have overcome this source of
endogeneity. We do this by controlling for the fixed effects, which will take them
out of the error term so that they no longer can be a source for the correlation of
the error term and an independent variable. This strategy is similar to the one we
pursued with multivariate OLS: we identified a factor in the error term that could
cause endogeneity and pulled it out of the error term by controlling for the variable
in the regression.
How do we pull the fixed effects out of the error term? Easy! We simply
estimate a different intercept for each unit. This will work as long as we have
multiple observations for each unit. In other words, we can pull fixed effects out
of the error term when we have panel data.
2. It doesn’t really matter which unit we exclude. We exclude the Pth unit for convenience; plus, it is
fun to try to pronounce (P − 1)th.
TABLE 8.2 Example of Robbery and Police Data for Cities in California

City | Year | Robberies per 1,000 | Police per 1,000 (lagged) | D1 (Fresno dummy) | D2 (Oakland dummy) | D3 (San Francisco dummy)
[data rows not preserved in this excerpt]
We are really just running OLS with loads of dummy variables. In other
words, we’ve seen this before. Specifically, on page 193, we showed how to
use multiple dummy variables to account for categorical variables. Here the
categorical variable is whatever the unit of observation denotes (in our city crime
data, it’s city).
De-meaned approach
de-meaned approach   An approach to estimating fixed effects models for panel data involving
subtracting average values within units from all variables.

We shouldn’t let the old-news feel of the LSDV approach lead us to underestimate
fixed effects models. They’re actually doing a lot of work, and work that we can
better appreciate when we consider a second way to estimate fixed effects models, the
de-meaned approach. It’s an odd term—it sounds like we’re trying to humiliate
data—but it describes well what we’re doing. (Data is pretty shameless anyway.)
When using the de-meaned approach, we subtract the unit-specific averages from
both independent and dependent variables. This approach allows us to control
for the fixed effects (the αi terms) without estimating coefficients associated with
dummy variables for each unit.

Why might we want to do this? Two reasons. First, it can be a bit of a
hassle creating dummy variables for every unit and then wading through results
with so many variables. For example, using the LSDV approach to estimate a
country-specific fixed effects model describing voting in the United Nations, we
might need roughly 200 dummy variables.
Second, the inner workings of the de-meaned estimator reveal the intuition
behind fixed effects models. This reason is more important. The de-meaned model
looks like

      Yit − Y i· = β1 (Xit − X i· ) + ν̃it        (8.5)

where Y i· is the average of Y for unit i over all time periods in the data set and
X i· is the average of X for unit i over all time periods in the data set. The dot
notation indicates when an average is calculated. So Y i· is the average for unit i
averaged over all time periods (values of t). In our crime data, Y Fresno· is the average
crime in Fresno over the time frame of our data, and X Fresno· is the average police
per capita in Fresno over the time frame of our data.3 Estimating a model using
this transformed data will produce exactly the same coefficient and standard error
estimates for β̂1 as produced by the LSDV approach.
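This equivalence is easy to check numerically. The sketch below, on simulated data rather than the chapter's, estimates β̂1 both ways: once by LSDV with dummy variables and once on de-meaned data.

```python
import numpy as np

rng = np.random.default_rng(1)
n_units, n_periods = 5, 20
unit = np.repeat(np.arange(n_units), n_periods)
alpha = rng.normal(scale=3, size=n_units)[unit]   # unit fixed effects
x = alpha + rng.normal(size=unit.size)            # x correlated with the fixed effect
y = 2 + (-1.5) * x + alpha + rng.normal(size=unit.size)

# LSDV: intercept, x, and dummies for units 1..4 (unit 0 is the excluded reference)
dummies = (unit[:, None] == np.arange(1, n_units)).astype(float)
X_lsdv = np.column_stack([np.ones(unit.size), x, dummies])
b_lsdv = np.linalg.lstsq(X_lsdv, y, rcond=None)[0][1]

# De-meaned: subtract each unit's average from x and y, then regress
x_dm = x - np.bincount(unit, x)[unit] / n_periods
y_dm = y - np.bincount(unit, y)[unit] / n_periods
b_dm = np.linalg.lstsq(x_dm[:, None], y_dm, rcond=None)[0][0]

print(b_lsdv, b_dm)  # identical beta_1 estimates, near the true value of -1.5
```

Note that the pooled OLS slope on this data would be badly biased (x is built to be correlated with the fixed effects); both fixed effects estimators remove that bias, and they agree to machine precision.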
The de-meaned approach allows us to see that fixed effects models convert
data to deviations from mean levels for each unit and variable. In other words, fixed
effects models are about differences within units, not differences across units. In
the pooled model for our city crime data, the variables reflect differences in police
and robberies in Los Angeles relative to police and robberies in Fresno. In the
fixed effects model, the variables are transformed to reflect how much robberies
in Los Angeles at a specific time differ from average levels in Los Angeles as a
function of how much police in Los Angeles at a specific time differ from average
levels of police in Los Angeles.
An example shows how this works. Recall the data on crime earlier, where we
saw that estimating the model with a pooled model led to very different coefficients
than with the fixed effects model. The reason for the difference was, of course, that
the pooled model was plagued by endogeneity and the fixed effects model was
not. How does the fixed effects model fix things? Figure 8.4 presents illustrative
data for two made-up cities, Fresnomento and Los Frangelese. In panel (a), the
pooled data is plotted as in Figure 8.1, with each observation number indicated.
The relationship between police and robberies looks positive, and indeed, the OLS
β̂1 is positive.
In panel (b) of Figure 8.4, we plot the same data after it has been de-meaned.
Table 8.3 shows how we generated the de-meaned data. Notice, for example, that
observation 1 is from Los Frangelese in 2010. The number of police (the value
of Xit ) was 4, which is one of the bigger numbers in the Xit column. When we
compare this number to the average number of police per thousand people in Los
Frangelese (which was 5.33), though, it is low. In fact, the de-meaned value of the
police variable for Los Frangelese in 2010 is −1.33, indicating that the police per
thousand people was actually 1.33 lower than the average for Los Frangelese in
the time period of the data.
Although the raw values of Y get bigger as the raw values of X get bigger,
the relationship between Yit − Y i· and Xit − X i· is quite different. Panel (b) of
Figure 8.4 shows a clear negative relationship between the de-meaned X and the
de-meaned Y.4
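Using just the Fresnomento rows shown in Table 8.3, the de-meaning arithmetic can be reproduced directly:

```python
import numpy as np

# Fresnomento observations from Table 8.3 (years 2010-2012)
x = np.array([1.0, 2.0, 3.0])   # police per 1,000 (X_it)
y = np.array([4.0, 3.0, 2.0])   # robberies per 1,000 (Y_it)

x_dm = x - x.mean()             # X_it - mean for the unit: [-1, 0, 1]
y_dm = y - y.mean()             # Y_it - mean for the unit: [1, 0, -1]

# Slope through the de-meaned points: one more officer, one fewer robbery
beta1 = np.linalg.lstsq(x_dm[:, None], y_dm, rcond=None)[0][0]
print(x_dm, y_dm, beta1)  # [-1. 0. 1.] [ 1.  0. -1.] -1.0
```

The chapter's full de-meaned regression pools both cities' de-meaned observations; this fragment shows only the within-Fresnomento piece of that calculation.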
3. The de-meaned equation is derived by subtracting the same thing from both sides of Equation 8.3.
Specifically, note that the average dependent variable for unit i over time is Y i· = β0 + β1 X i· + α i + ν i· .
If we subtract the left-hand side of this equation from the left-hand side of Equation 8.3 and the
right-hand side of this equation from the right-hand side of Equation 8.3, we get
Yit − Y i· = β0 + β1 Xit + αi + νit − β0 − β1 X i· − α i· − ν i· . The α terms cancel because α i equals αi (the
average of fixed effects for each unit are by definition the same for all observations of a given unit in
all time periods). Rearranging terms yields something that is almost Equation 8.5. For simplicity, we
let ν̃it = νit − ν i· ; this new error term will inherit the properties of νit (e.g., being uncorrelated with the
independent variable and having a mean of zero).
4. One issue that can seem confusing at first—but really isn’t—is how to interpret the coefficients.
Because the LSDV and de-meaned approaches produce identical estimates, we can stick with our
relatively straightforward way of explaining LSDV results even when we’re describing results from a
de-meaned model. Specifically, we can simply say that a one-unit change in X1 is associated with a
β̂1 increase in Y when we control for unit fixed effects. This interpretation is similar to how we
interpret multivariate OLS coefficients, which makes sense because the fixed effects model is really
just an OLS model with lots of dummy variables.

[Figure 8.4 here: panel (a) plots robberies per 1,000 people against police per 1,000
people for six observations from the hypothetical cities Los Frangelese and
Fresnomento; the pooled regression line slopes upward. Panel (b) plots the same
data de-meaned by city; the regression line for the de-meaned (fixed effects) model
slopes downward.]

TABLE 8.3 Robberies and Police Data for Hypothetical Cities in California

Observation number | City | Year | Xit | X i· | Xit − X i· | Yit | Y i· | Yit − Y i·
4 | Fresnomento | 2010 | 1 | 2 | −1 | 4 | 3 | 1
5 | Fresnomento | 2011 | 2 | 2 | 0 | 3 | 3 | 0
6 | Fresnomento | 2012 | 3 | 2 | 1 | 2 | 3 | −1
[Results table here; only the bottom rows survive in this excerpt: N = 1,232 and
number of cities = 59 in both columns.]
REMEMBER THIS
1. A fixed effects model includes an αi term for every unit:

      Yit = β0 + β1 X1it + αi + νit
2. The fixed effects approach allows us to control for any factor that is fixed within unit for the
entire panel, regardless of whether we observe this factor.
8.3 Working with Fixed Effects Models 267
3. There are two ways to produce identical fixed effects coefficient estimates for the model.
(a) In the LSDV approach, we simply include dummy variables for each unit except an
excluded reference category.
(b) In the de-meaned approach, we transform the data such that the dependent and
independent variables indicate deviations from the unit mean.
Discussion Question
What factors influence student evaluations of professors in college courses? Are instructors who teach
large classes evaluated less favorably? Consider using the following model to assess the question
based on a data set of evaluations of instructors across multiple classes and multiple years:
general matter, however, including extra variables does not cause errors to be
correlated with independent variables.5
If the fixed effects are non-zero, we want to control for them. We should
note, however, that just because some (or many!) αi are non-zero, our fixed
effects model and our pooled model will not necessarily produce different results.
Recall that bias occurs when errors are correlated with an independent variable.
The fixed effects could exist, but they are not necessarily correlated with the
independent variables. To cause bias, in other words, fixed effects must not only
exist, they must be correlated with the independent variables. It’s not unusual
to observe instances in real data where fixed effects exist but don’t cause bias.
In such cases, the coefficients from the pooled and fixed effects models are
similar.6
The prudent approach to analyzing panel data is therefore to control for
fixed effects. If the fixed effects are zero, we’ll get unbiased results even with
the controls for fixed effects. If the fixed effects are non-zero, we’ll get unbiased
results that will differ or not from pooled results depending on whether the fixed
effects are correlated with the independent variable.
A downside to fixed effects models is that they make it impossible to estimate effects
for certain variables that might be of interest. As is often the case, there is no free
lunch (although it’s a pretty cheap lunch).
Specifically, fixed effects models cannot estimate coefficients on any variables
that are fixed for all individuals over the entire time frame. Suppose, for example,
that in the process of analyzing our city crime data we wonder if northern cities are
more crime prone. We studiously create a dummy variable Northi that equals 1 if
a city is in a northern state and 0 otherwise and set about estimating the following
model:

      Robberiesit = β0 + β1 Policei,t−1 + β2 Northi + αi + νit
Sadly, this approach won’t work. The reason is easiest to see by considering
the fixed effects model in de-meaned terms. The North variable will be converted
to Northit − Northi· . What is the value of this de-meaned variable for a city in the
North? The Northit part will equal 1 for all time periods for such a city. But wait,
this means that Northi· will also be 1 because that is the average of this variable
for this northern city. And that means the value of the de-meaned North variable
will be 0 for any city in the North. What is the value for the de-meaned North
5. Controlling for fixed effects when all αi = 0 will lead to larger standard errors, though. So if we can
establish that there is no sign of a non-zero αi for any unit, we may wish to also estimate a model
without fixed effects. To test for unit-specific fixed effects, we can implement an F test following the
process discussed in Chapter 5 (page 158). The null hypothesis is H0 : α1 = α2 = α3 = · · · = 0. The
alternative hypothesis is that at least one of the fixed effects is non-zero. The unrestricted model is a
model with fixed effects (most easily thought of as the LSDV model that has dummy variables for
each specific unit). The restricted model is a model without any fixed effects, which is simply the
pooled OLS model. We provide computer code on pages 285 and 286.
6. A so-called Hausman test can be used to test whether fixed effects are causing bias. If the results
indicate no sign of bias when fixed effects are not controlled for, we can use a random effects model
as discussed in Chapter 15 on page 524.
variable for a non-northern city? Similar logic applies: the Northit part will equal
0 for all time periods, and so will Northi· for a non-northern city. The de-meaned
North variable will therefore also be 0 for non-northern cities. In other words, the
de-meaned North variable will be 0 for all cities in all years. The first job of a
variable is to vary. If it doesn’t, well, that ain’t no variable! Hence, it will not be
possible to estimate a coefficient on this variable.7
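The mechanics are easy to verify: de-meaning any variable that is constant within units produces a column of zeros. The north/south coding below is hypothetical.

```python
import numpy as np

n_periods = 3
unit = np.repeat(np.arange(4), n_periods)    # 4 cities, 3 years each
north = np.array([1, 0, 1, 0])[unit]         # time-invariant within each city

# Subtract each city's average: constant - its own mean = 0 everywhere
north_dm = north - np.bincount(unit, north)[unit] / n_periods
print(north_dm)  # all zeros: no within-city variation left to estimate
```

A regressor that is identically zero carries no information, which is exactly why no coefficient on North can be estimated.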
More generally, a fixed effects model (estimated with either LSDV or the
de-meaned approach) cannot estimate a coefficient on a variable if the variable
does not change within units for all units. So even though the variable varies across
cities (e.g., the Northi variable is 1 for some cities and 0 for other cities), we can’t
estimate a coefficient on it because it does not vary within cities. This issue arises in
many other contexts. In panel data where individuals are the unit of observation,
fixed effects models cannot estimate coefficients on variables such as gender or
race that do not vary within individuals. In panel data on countries, the effect of
variables such as area or being landlocked cannot be estimated when there is no
variation within country for any country in the data set.
Not being able to include such a variable does not mean fixed effects models
do not control for it. The unit-specific fixed effect is controlling for all factors that
are fixed within a unit for the span of the data set. The model cannot parse out
which of these unchanging factors have which effect, but it does control for them
via the fixed effects parameters.
Some variables might be fixed within some units but variable within other
units. Those we can estimate. For example, a dummy variable that indicates
whether a city has more than a million people will not vary for many cities that
have been above or below one million in population for the entire span of the
panel data. However, if at least some cities have risen above or declined below
one million during the period covered in the panel data, then the variable can be
used in a fixed effects model.
Panel data models need not be completely silent with regard to variables that
do not vary. We can investigate how unchanging variables interact with variables
that do change. For example, we can estimate β2 in the following model:

      Robberiesit = β0 + β1 Policei,t−1 + β2 Policei,t−1 × Northi + αi + νit
The β̂2 will tell us how different the coefficient on the police variable is for
northern cities.
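A sketch of such an interaction model on simulated data (all values below are made up): the North dummy itself is absorbed by the unit fixed effects, but Police × North varies within northern cities, so β2 is estimable.

```python
import numpy as np

rng = np.random.default_rng(3)
n_units, n_periods = 6, 10
unit = np.repeat(np.arange(n_units), n_periods)
north = (unit < 3).astype(float)                 # time-invariant city trait
police = rng.normal(size=unit.size) + 2
alpha = rng.normal(size=n_units)[unit]
# True model: the police slope is -1 in the south and -1 + 0.6 in the north
y = -1.0 * police + 0.6 * police * north + alpha + rng.normal(scale=0.5, size=unit.size)

# Full set of unit dummies (no separate intercept); North alone would be collinear
dummies = (unit[:, None] == np.arange(n_units)).astype(float)
X = np.column_stack([police, police * north, dummies])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta[0], beta[1])  # close to -1.0 (police slope) and 0.6 (northern shift)
```

The fixed effects soak up everything constant within cities, including North; the interaction survives because it moves whenever police levels move.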
Sometimes people are tempted to abandon fixed effects because they care
about variables that do not vary within unit. That’s cheating. The point of choosing
a fixed effects model is to avoid the risk of bias, which could creep in if something
fixed within individuals across the panel happened to be correlated with an
independent variable. Bias is bad, and we can’t just close our eyes to it to get
7. Because we know that LSDV and de-meaned approaches produce identical results, we know that we
will not be able to estimate a coefficient on the North variable in an LSDV model as well. This is the
result of perfect multicollinearity: the North variable is perfectly explained as the sum of the dummy
variables for the northern cities.
REMEMBER THIS
1. Fixed effects models do not cause bias when implemented in situations in which αi = 0 for all
units.
2. Pooled OLS models are biased only when fixed effects are correlated with the independent
variable.
3. Fixed effects models cannot estimate coefficients on variables that do not vary within at least
some units. Fixed effects models do control for these factors, though, as they are subsumed
within the unit-specific fixed effect.
Discussion Questions
1. Suppose we have panel data on voter opinions toward government spending in 2010, 2012,
and 2014. Explain why we can or cannot estimate the effect of each of the following in a fixed
effects model.
(a) Gender
(b) Income
(c) Race
(d) Party identification
2. Suppose we have panel data on the annual economic performance of 100 countries from 1960
to 2015. Explain why we can or cannot estimate the effect of each of the following in a fixed
effects model.
(a) Average years of education
(b) Democracy, which is coded 1 if political control is determined by competitive elections
and 0 otherwise
(c) Country size
(d) Proximity to the equator
8.4 Two-Way Fixed Effects Model 271
3. Suppose we have panel data on the annual economic performance of the 50 U.S. states from
1960 to 2015. Explain why we can or cannot estimate the effect of each of the following in a
fixed effects model.
(a) Average years of education
(b) Democracy, which is coded 1 if political control is determined by competitive elections
and 0 otherwise
(c) State size
(d) Proximity to Canada
      Yit = β0 + β1 X1it + αi + τt + νit

where we’ve taken Equation 8.3 from page 261 and added τt (the Greek letter
tau—rhymes with “wow”), which accounts for differences in crime for all units
in year t. This notation provides a shorthand way to indicate that each separate
time period gets its own τt effect on the dependent variable (in addition to the
αi effect on the dependent variable for each individual unit of observation in the
data set).
Similar to our one-way fixed effects model, the single parameter for a time
fixed effect indicates the average difference for all observations in a given year,
after we have controlled for the other variables in the model. A positive fixed
effect for the year 2008 (τ2008 ) would indicate that controlling for all other
factors, the dependent variable was higher for all units in the data set in 2008.
A negative fixed effect for the year 2014 (τ2014 ) would indicate that controlling
for all other factors, the dependent variable was lower for all units in the data set
in 2014.
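As with the one-way model, we can verify on simulated data that LSDV with both unit and year dummies matches the two-way de-meaned transformation (subtracting unit means and time means and adding back the grand mean):

```python
import numpy as np

rng = np.random.default_rng(4)
n_units, n_periods = 6, 12
unit = np.repeat(np.arange(n_units), n_periods)
year = np.tile(np.arange(n_periods), n_units)
alpha = rng.normal(size=n_units)[unit]          # unit effects
tau = rng.normal(size=n_periods)[year]          # time effects
x = alpha + tau + rng.normal(size=unit.size)
y = 0.8 * x + alpha + tau + rng.normal(size=unit.size)

def demean2(v):
    """Two-way de-meaning for a balanced panel: v_it - vbar_i. - vbar_.t + vbar_.."""
    vi = np.bincount(unit, v)[unit] / n_periods
    vt = np.bincount(year, v)[year] / n_units
    return v - vi - vt + v.mean()

b_dm = np.linalg.lstsq(demean2(x)[:, None], demean2(y), rcond=None)[0][0]

# LSDV: intercept, x, unit dummies, and year dummies (one category excluded each)
u_d = (unit[:, None] == np.arange(1, n_units)).astype(float)
t_d = (year[:, None] == np.arange(1, n_periods)).astype(float)
X = np.column_stack([np.ones(unit.size), x, u_d, t_d])
b_lsdv = np.linalg.lstsq(X, y, rcond=None)[0][1]

print(b_dm, b_lsdv)  # identical estimates of beta_1
```

The double de-meaning formula here is the balanced-panel case; with unbalanced panels the dummy-variable approach is the safer default.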
There are lots of situations in which we suspect that a time fixed effect might
be appropriate:
8. The algebra is a bit more involved than for a one-way model, but the result has a similar feel:

      Yit − Y i· − Y ·t + Y ·· = β1 (Xit − X i· − X ·t + X ·· ) + ν̃it

where the dot notation indicates what is averaged over. Thus, Y i· is the average value of Y for unit i
over time, Y ·t is the average value of Y for all units at time t, and Y ·· is the average over all units and
all time periods. Don’t worry, we almost certainly won’t have to create these variables ourselves;
we’re including the dot convention just to provide a sense of how a one-way fixed effects model
extends to a two-way fixed effects model.
9. The additional control variable is called a lagged dependent variable. Inclusion of such a variable is
common in analysis of panel data. These variables often are highly statistically significant, as is the
case here. Control variables of these types raise some complications, which we address in Chapter 15
on advanced panel data models.
It is useful to take a moment to appreciate that not all models are created
equal. A cynic might look at the results in Table 8.5 and conclude that statistics
can be made to say anything. But this is not the right way to think about the
results. The models do indeed produce different results, but there are reasons for
the differences. One of the models is better. A good statistical analyst will know
this. We can use statistical logic to explain why the pooled results are suspect. We
know pretty much what is going on: certain fixed effects in the error term of the
pooled model are correlated with the police variable, thereby biasing the pooled
OLS coefficients. So although there is indeed output from statistical software that
could be taken to imply that police cause crime, we know better. Treating all results
as equivalent is not serious statistics; it’s just pressing buttons on a computer.
Instead of supporting statistical cynicism, this example testifies to the benefits of
appropriate analysis.
REMEMBER THIS
1. A two-way fixed effects model accounts for both unit- and time-specific errors.
2. A two-way fixed effects model is written as

      Yit = β0 + β1 X1it + αi + τt + νit
3. A two-way fixed effects model can be estimated with an LSDV approach (which has dummy
variables for each unit and each period in the data set), with a de-meaned approach, or with a
combination of the two.
dyad   An entity that consists of two elements.

where Bilateral tradeit is total trade volume between countries in dyad i at time t. A
dyad is a unit that consists of two elements. Here, a dyad indicates a pair of countries,
and the data indicates how much trade flows between them. For example, the
United States and Canada form one dyad, the United States and Japan form another
dyad, and so on. Allianceit is a dummy variable that is 1 if countries in the dyad are
entered into a security alliance at time t and 0 otherwise. The αi term captures the
amount by which trade in dyad i is higher or lower over the entire course of the
panel.
Because the unit of observation is a country-pair dyad, fixed effects here entail
factors related to a pair of countries. For example, the fixed effect for the United
States–New Zealand dyad in the trade model may be higher because of the shared
language. The fixed effect for the China-India dyad might be negative because the
countries are separated by mountains (which they happen to fight over, too).
As we consider whether a fixed effects model is necessary, we need to
think about whether the dyad-specific fixed effects could be correlated with the
independent variables. Dyad-specific fixed effects could exist because of a history
of commerce between two countries, a favorable trading geography (not divided
by mountains, for example), economic complementarities of some sort, and so on.
These factors could also make it easier or harder to form alliances.
Table 8.6 reports results from Green, Kim, and Yoon (2001) based on data
covering trade and alliances from 1951 to 1992. The dependent variable is the
amount of trade between the two countries in a given dyad in a given year. In
addition to the alliance measure, the independent variables are GDP (total gross
domestic product of the two countries in the dyad), Population (total population of
the two countries in the dyad), Distance (distance between the capitals of the two
countries), and Democracy (the minimum value of a democracy ranking for the two
countries in the dyad: the higher the value, the more democracy).
The dependent and continuous independent variables are logged. Logging
variables is a common practice in this literature; the interpretation is that a one
percent change in an independent variable is associated with a β percent change in
the dependent variable.
8.4 Two-Way Fixed Effects Model 275
Distance does not vary within a dyad, so its effect cannot be estimated in the fixed effects model; instead, it is controlled for via the fixed effect. And even better, not only is the effect of distance controlled
for, so are hard-to-measure factors such as being on a trade route or having cultural
affinities. That’s what the fixed effect is—a big ball of all the effects that are the same
within units for the period of the panel.
Not all coefficients flip. The coefficient on GDP is relatively stable, indicating
that unlike the variables that do flip signs from the pooled to fixed effects specifica-
tions, GDP does not seem to be correlated with the unmeasured fixed effects that
influence trade between countries.
8.5 Difference-in-Difference
difference-in-difference model A model that looks at differences in changes in treated units compared to untreated units.

The logic of fixed effects plays a major role in difference-in-difference models,
which look at differences in changes in treated units compared to untreated units
and are particularly useful in policy evaluation. In this section, we explain the
logic of this approach, show how to use OLS to estimate these models, and then
link the approach to the two-way fixed effects models we developed for panel
data.
Difference-in-difference logic
To understand difference-in-difference logic, let’s consider a policy evaluation
of “stand your ground” laws, which have the effect of allowing individuals to
use lethal force when they reasonably believe they are threatened.10 Does a law
that removes the duty to retreat when life or property is being threatened prevent
homicides by making would-be aggressors reconsider? Or do such laws increase
homicides by escalating violence?
Naturally, we would start by looking at the change in homicides in a state
that passed a stand your ground law. This approach is what every policy maker in
the history of time uses to assess the impact of a policy change. Suppose we find
homicides rising in the states that passed the law. Is that fact enough to lead us to
conclude that the law increases crime?
It doesn’t take a ton of thinking to realize that such evidence is pretty weak.
Homicides could rise or fall for a lot of reasons, many of them completely
unrelated to stand your ground laws. If homicides went up not only in the state
that passed the law but in all states—even states that made no policy change—we
can’t seriously blame the law for the rise in homicides. Or, if homicides
declined everywhere, we shouldn’t attribute the decline in a particular state to
the law.
What we really want to do is to look at differences in the state that passed
the policy in comparison to differences in similar states that did not pass such a law.
10 See McClellan and Tekin (2012) as well as Cheng and Hoekstra (2013).
A basic difference-in-difference estimator is

ΔYT − ΔYC
where ΔYT is the change in the dependent variable in treated states (those that
passed a stand your ground law) and ΔYC is the change in the dependent variable
in the untreated states (those that did not pass such a law). We call this approach
the difference-in-difference approach because we look at the difference between
differences in treated and control states.
We can generate difference-in-difference estimates from the following OLS model:

Yit = β0 + β1 Treatedi + β2 Aftert + β3 Treatedi × Aftert + εit

where Treatedi equals 1 for a treated state and 0 for a control state, Aftert equals 1
for all after observations (from both control and treated units) and 0 otherwise, and
Treatedi × Aftert is an interaction of Treatedi and Aftert . This interaction variable
will equal 1 for treated states in the post-treatment period and 0 for all other
observations.
The control states have some mean level of homicides, which we denote with
β0; the treated states also have some mean level of homicides, which we denote with
β0 + β1 Treatedi . If β1 is positive, the mean level for the treated states is higher
than in control states. If β1 is negative, the mean level for the treated states is
lower. If β1 is zero, the mean level for the treated states is the same as in control
states. Since this preexisting difference of mean levels was by definition there
before the law was passed, the law can’t be the cause of differences. Instead, these
differences represented by β1 are simply the preexisting differences in the treated
and untreated states. This parameter is analogous to a unit fixed effect, although
here it is for the entire group of treated states rather than individual units.
The model captures national trends with the β2 Aftert term. The dependent
variable for all states, treated and not, changes by β2 in the after period. This
parameter is analogous to a time fixed effect, but it’s for the entire post-treatment
period rather than individual time periods.
The key coefficient is β3 . This is the coefficient on the interaction between
Treatedi and Aftert . This variable equals 1 only for treated units in the after period
and 0 otherwise. The coefficient tells us whether there is an additional change in the treated
states after the policy went into effect, once we have controlled for preexisting
differences between the treated and control states (β1) and differences in the before
and after periods for all states (β2).
If we work out the fitted values for changes in treated and control states, we
can see how this regression model produces a difference-in-difference estimate.
First, note that the fitted value for treated states in the after period is β0 + β1 + β2 +
β3 (because Treatedi , Aftert , and Treatedi × Aftert all equal 1 for treated states in
the after period). Second, note that the fitted value for treated states in the before
period is β0 + β1 , so the change for treated states is β2 + β3 . The fitted value for
control states in the after period is β0 + β2 (because Treatedi and Treatedi × Aftert
equal 0 for control states). The fitted value for control states in the before period is
β0 , so the change for control states is β2 . The difference in differences of treated
and control states will therefore be β3 . Presto!
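The algebra above can be verified numerically. This sketch (hypothetical data and parameter values, not from the book) fits the interaction model by OLS and confirms that β̂3 exactly equals the difference in differences of the four cell means.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 250  # observations per treatment-period cell (invented)
treated = np.repeat([0.0, 0.0, 1.0, 1.0], n)
after = np.tile(np.repeat([0.0, 1.0], n), 2)
# hypothetical truth: beta0=5, beta1=2 (preexisting gap),
# beta2=-1 (common trend), beta3=-3 (treatment effect)
y = 5 + 2*treated - 1*after - 3*treated*after + rng.normal(0, 1, 4*n)

X = np.column_stack([np.ones(4*n), treated, after, treated*after])
b0, b1, b2, b3 = np.linalg.lstsq(X, y, rcond=None)[0]

# difference in differences of cell means
dYT = y[(treated == 1) & (after == 1)].mean() - y[(treated == 1) & (after == 0)].mean()
dYC = y[(treated == 0) & (after == 1)].mean() - y[(treated == 0) & (after == 0)].mean()
```

Because the four dummies saturate the design, OLS fits the four cell means exactly, so b3 and dYT − dYC agree to machine precision.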
Figure 8.5 displays two examples that illustrate the logic of difference-in-
difference models. In panel (a), there is no treatment effect. The dependent
[Figure 8.5: Difference-in-difference examples. Both panels plot Y (0 to 4) against time (before and after) for treated and control groups, with β0, β1, β2, and β3 labeled; panel (a) shows no treatment effect, and panel (b) shows a treatment effect of β3.]
variables for the treated and control states differ in the before period by β1 . Then
the dependent variable for both the treated and control units rose by β2 in the
after period. In other words, Y was bigger for the treated unit than for the control
by the same amount before and after the treatment. The implication is that the
treatment had no effect, even though Y went up in treatment states after they passed
the law.
Panel (b) in Figure 8.5 shows an example with a treatment effect. The
dependent variables for the treated and control states differ in the before period
by β1 . The dependent variable for both the treated and control units rose by β2
in the after period, but the value of Y for the treated unit rose yet another β3 . In
other words, the treated group was β1 bigger than the control before the treatment
and β1 + β3 bigger than the control after the treatment. The implication is that the
treatment caused a β3 bump over and above the differences across unit and time
that are accounted for in the model.
Consider how the difference-in-difference approach would assess outcomes
in our stand your ground law example. If homicides declined in states with such
laws more than in states without them, the evidence supports the claim that the
law prevented homicides. Such an outcome could happen if homicides went down
by 10 in states with the law but decreased by only 2 in other states. Such an
outcome could also happen if homicides actually went up by 2 in states with
stand your ground laws but went up by 10 in other states. In both instances, the
difference-in-difference estimate is −8.
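As arithmetic, the two scenarios look like this (numbers from the text):

```python
# difference-in-difference estimate: change in treated minus change in control
def diff_in_diff(change_treated, change_control):
    return change_treated - change_control

print(diff_in_diff(-10, -2))  # homicides fell by 10 with the law, by 2 without
print(diff_in_diff(2, 10))    # homicides rose by 2 with the law, by 10 without
```

Both calls return −8: relative to the states without the law, homicides in the treated states fell by 8.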
One great thing about using OLS to estimate difference-in-difference models
is that it is easy to control for other variables with this method. Simply include
them as covariates, and do what we’ve been doing. In other words, simply add
a β4 Xit term (and additional variables, if appropriate), yielding the following
difference-in-difference model:

Yit = β3 Treatedi × Aftert + β4 Xit + αi + τt + εit

where
• The αi terms (the unit-specific fixed effects) capture differences that exist
across units both before and after the treatment.
• The τt terms (the time-specific fixed effects) capture differences that exist
across all units in every period. If homicide rates are higher in 2007 than in
2003, then the τt for 2007 will be higher than the τt for 2003.
Controlling for national trends in homicide rates (via time fixed effects) and additional variables related to race, age, and percent of residents living in urban areas, McClellan and Tekin (2012) found that homicide rates went up by 0.033 after states implemented these laws.11
REMEMBER THIS
A difference-in-difference model estimates the effect of a change in policy by comparing changes in
treated units to changes in control units.
1. A basic difference-in-difference estimator is ΔYT − ΔYC , where ΔYT is the change in the
dependent variable for the treated unit and ΔYC is the change in the dependent variable for
a control unit.
2. Difference-in-difference estimates can be generated from the following OLS model:

Yit = β0 + β1 Treatedi + β2 Aftert + β3 Treatedi × Aftert + εit
3. For panel data, we can use a two-way fixed effects model to estimate difference-in-difference
effects:

Yit = β3 Treatedi × Aftert + β4 Xit + αi + τt + εit
where the αi fixed effects capture differences in units that existed both before and after treatment
and τt captures differences common to all units in each time period.
Discussion Question
For each of the following examples, explain how to create (i) a simple difference-in-difference
estimate of policy effects and (ii) a fixed effects difference-in-difference model.
(a) California implemented a first-in-the-nation program of paid family leave in 2004. Did this
policy increase use of maternity leave?a
(b) Fourteen countries engaged in “expansionary austerity” policies in response to the 2008
financial crisis. Did these austerity policies work? (For simplicity, treat austerity as a dummy
variable equal to 1 for countries that engaged in it and 0 for others.)
(c) Some neighborhoods in Los Angeles changed zoning laws to make it easier to mix commercial
and residential buildings. Did these changes reduce crime?b
a See Rossin-Slater, Ruhm, and Waldfogel (2013).
b See Anderson, Macdonald, Bluthenthal, and Ashwood (2013).
11 Cheng and Hoekstra (2013) found similar results.
[Figure 8.6: Four panels (for the Review Question), each plotting Y (0 to 4) against time (before and after) for treated and control groups.]
Review Question
For each of the four panels in Figure 8.6, indicate the values of β0 , β1 , β2 , and β3 for the basic
difference-in-difference OLS model:

Yit = β0 + β1 Treatedi + β2 Aftert + β3 Treatedi × Aftert + εit
Conclusion
Again and again, we’ve emphasized the importance of exogeneity. If X is uncorrelated
with ε, we get unbiased estimates and are happy. Experiments are sought after
because the randomization in them ensures—or at least aids—exogeneity. With
OLS we can sometimes, maybe, almost, sort of, kind of approximate exogeneity
by soaking up so much of the error term with measured variables that what remains
correlates little or not at all with X.
Realistically, though, we know that we will not be able to measure everything.
Real variables with real causal force will almost certainly lurk in the error term.
Are we stuck? Turns out, no (or at least not yet). We’ve got a few more tricks up our
sleeve. One of the best tricks is to use fixed effects tools. Although uncomplicated,
the fixed effects approach can knock out a whole class of unmeasured (and even
unknown) variables that lurk in the error term. Simply put, any factor that is fixed
across time periods for each unit or fixed across units for each time period can be
knocked out of the error term. Fixed effects tools are powerful, and as we have
seen in real examples, they can produce results that differ dramatically from those
produced by basic OLS models.
We will have mastered the material in this chapter when we can do the
following:
• Section 8.1: Explain how a pooled model can be problematic in the analysis
of panel data.
• Section 8.2: Write down a fixed effects model, and explain the fixed
effect. Give examples of the kinds of factors subsumed in a fixed effect.
Explain how to estimate a fixed effects model with LSDV and de-meaned
approaches.
• Section 8.3: Explain why coefficients on variables that do not vary within
a unit cannot be estimated in fixed effects models. Explain how these
variables are nonetheless controlled for in fixed effects models.
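The first point can be seen in a two-line sketch (hypothetical data, not from the book): de-meaning turns any variable that is constant within each unit into all zeros, so no coefficient on it can be estimated; the variable is absorbed into the fixed effect.

```python
import numpy as np

# an invented "South" dummy for a three-unit, four-period panel:
# constant within each unit (a state is either in the South or not)
south = np.array([[1.0] * 4, [0.0] * 4, [1.0] * 4])
south_demeaned = south - south.mean(axis=1, keepdims=True)
print(south_demeaned)  # every entry is zero after de-meaning
```

A regressor that is identically zero carries no variation, which is why fixed effects software drops time-invariant variables.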
Further Reading
Chapter 15 discusses advanced panel data models. Baltagi (2005) is a more
technical survey of panel data methods.
Green, Kim, and Yoon (2001) provide a nice discussion of panel data methods
in international relations. Wilson and Butler (2007) reanalyze articles that did not
use fixed effects and find results changed, sometimes dramatically.
If we use pooled OLS to analyze panel data sets, we are quite likely to
have errors that are correlated within unit in the manner discussed on page 69.
This correlation of errors will not cause OLS β̂1 estimates to be biased, but
it will make the standard OLS equation for the variance of β̂1 inappropriate.
While fixed effects models typically account for a substantial portion of the
Key Terms
De-meaned approach (263)
Difference-in-difference model (276)
Dyad (274)
Fixed effect (261)
Fixed effects model (261)
Least squares dummy variable (LSDV) approach (262)
One-way fixed effects model (271)
Panel data (255)
Pooled model (256)
Rolling cross-sectional data (279)
Two-way fixed effects model (271)
Computing Corner
Stata
1. To use the LSDV approach to estimate a panel data model, we run an OLS
model with dummy variables for each unit.
(c) To use an F test to examine whether fixed effects are all zero, the
unrestricted model is the model with the dummy variables we just
estimated. The restricted model is a regression model without the
dummy variables (also known as the pooled model):
regress Y X1 X2 X3.
(b) Run Stata’s built-in, one-way fixed effects model and include the
dummies for the years:
xtreg Y X1 X2 X3 Yr2-Yr10, fe i(City)
where Yr2-Yr10 is a shortcut way of including every Yr variable
from Yr2 to Yr10.
R
1. To use the LSDV approach to estimate a panel data model, we run an OLS
model with dummy variables for each unit.
(a) It’s possible to name and include dummy variables for every unit,
but doing this can be a colossal pain when we have lots of units.
It is usually easiest to use the factor command, which will
automatically include dummy variables for each unit. The code
is lm(Y ~ X1 + factor(unit)). This command will estimate a
model in which there is a dummy variable for every unique value
indicated in the unit variable. For example, if our data looked
like Table 8.2, including a factor(city) term in the regression
code would lead to the inclusion of dummy variables for each
city.
(b) To implement an F test on the hypothesis that all fixed effects (both
unit and time) are zero, the unrestricted equation is the full model
and the restricted equation is the model with no fixed effects.
Unrestricted = lm(Y ~ X1 + factor(unit)+ factor
(time))
Restricted = lm(Y ~ X1)
Refer to page 171 for more details on how to implement an F test
in R.
and then refer to that data frame in the plm command.12 For a
one-way fixed effects model, include model="within".
library(plm)
All.data = data.frame(Y, X1, X2, city, time)
plm(Y ~ X1 + X2, data=All.data, index=c("city"),
model="within")
(b) We can use the plm command and indicate the unit and time variables
with the index=c("city", "year") command. These are the
variable names that indicate your units and time variables, which
will vary depending on your data set. We also need to include the
subcommand effect="twoways".
plm(Y ~ X1 + X2, data=All.data, index=c("city",
"year"), model="within", effect="twoways")
12 A data frame is a convenient way to package data in R. Not only can you put variables together in one named object, but you can also include text variables like names of countries.
Exercises
1. Researchers have long been interested in the relationship between eco-
nomic factors and presidential elections. The PresApproval.dta data set
includes data on presidential approval polls and unemployment rates by
state over a number of years. Table 8.8 lists the variables.
(a) Use pooled data for all years to estimate a pooled OLS regression
explaining presidential approval as a function of state unemployment
rate. Report the estimated regression equation, and interpret the
results.
(b) Many political observers believe politics in the South are different.
Add South as an additional independent variable, and reestimate the
model from part (a). Report the estimated regression equation. Do
the results change?
(c) Reestimate the model from part (b), controlling for state fixed effects
by using the de-meaned approach. How does this approach affect the
results? What happens to the South variable in this model? Why?
Does this model control for differences between southern and other
states?
(d) Reestimate the model from part (c) controlling for state fixed effects
using the LSDV approach. (Do not include a South dummy variable).
Compare the coefficients and standard errors for the unemployment
variable.
(e) Estimate a two-way fixed effects model. How does this model affect
the results?
Year Year
PresApprov Percent positive presidential approval
UnemPct State unemployment rate

year Year
stateshort First three letters of state name (for labeling scatterplot)
appspc Applications to the Peace Corps from each state per capita
unemployrate State unemployment rate
(b) Run a pooled regression of Peace Corps applicants per capita on the
state unemployment rate and year dummies. Describe and critique
the results.
(c) Plot the relationship between the state economy and Peace Corps
applications. Does any single state stick out? How may this outlier
affect the estimate on unemployment rate in the pooled regression in
part (b)? Create a scatterplot without the unusual state, and comment
briefly on the difference from the scatterplot with all observations.
(d) Run the pooled model from part (b) without the outlier. Comment
briefly on the results.
(e) Use the LSDV approach to run a two-way fixed effects model
without the outlier. Do your results change from the pooled analysis?
Which results are preferable?
(f) Run a two-way fixed effects model without the outlier; use the fixed
effects command in Stata or R. Compare to the LSDV results.
(a) Estimate a model ignoring the panel structure of the data. Use overall
evaluation of the instructor as the dependent variable and the class
(b) Explain what a fixed effect for each of the following would control
for: instructor, course, and year.
(c) Use the equation from part (a) to estimate a model that includes
a fixed effect for instructor. Report your results, and explain any
differences from part (a).
(b) Calculate the percent of people in the sample in college from the fol-
lowing four groups: (i) Before 1993/non-Georgia, (ii) Before 1993/
Georgia, (iii) After 1992/non-Georgia, and (iv) After 1992/Georgia.
First, use the mean function (e.g., in Stata use mean Y if X1 == 0
& X2 == 0 and in R use mean(Y[X1 == 0 & X2 == 0])). Second,
use the coefficients from the OLS output in part (a).
13 For simplicity, we will not use the sample weights used by Dynarski. The results are stronger, however, when these sample weights are used.
(c) Graph the fitted lines for the Georgia group and non-Georgia
samples.
(f) The way the program was designed, Georgia high school graduates
with a B or higher average and annual family income over $50,000
could qualify for HOPE by filling out a simple one-page form. Those
with lower income were required to apply for federal aid with a
complex four-page form and had any federal aid deducted from
their HOPE scholarship. Run separate basic difference-in-difference
models for these two groups, and comment on the substantive
implication of the results.
LnAvgSalary The average salary of teachers in the district, adjusted for inflation and logged
OnCycle A dummy variable that equals 1 for districts where school boards were elected
“on-cycle” (i.e., they were elected at the same time people were voting on other offices)
and 0 if the school board was elected “off-cycle” (i.e., school board members were
elected in a separate election)
CycleSwitch A dummy variable indicating that the district switched from off-cycle to on-cycle
elections starting in 2007
on-cycle elections, and teachers and teachers unions will have relatively
less influence.
From 2003 to 2006, all districts in the sample elected their school board
members off-cycle. A change in state policies in 2006 led some, but not all,
districts to elect their school board members on-cycle from 2007 onward.
The districts that switched then stayed switched for the period 2007–2009,
and no other district switched.
(c) Run a one-way fixed effects model in which the fixed effect relates to
individual school districts. Interpret the results, and explain whether
this model accounts for time trends that could affect all districts.
(e) Suppose that we tried to estimate the two-way fixed effects model on
only the last three years of the data (2007, 2008, and 2009). Would
we be able to estimate the effect of OnCycle for this subset of the
data? Why or why not?
6. This problem uses a panel version of the data set described in Chapter 5
(page 174) to analyze the effect of cell phone and texting bans on traffic
fatalities. Use deaths per mile as the dependent variable because this
variable accounts for the pattern we saw earlier that miles driven is a strong
predictor of the number of fatalities. Table 8.13 describes the variables
in the data set Cellphone_panel_homework.dta; it covers all states plus
Washington, DC, from 2006 to 2012.
(a) Estimate a pooled OLS model with deaths per mile as the dependent
variable and cell phone ban and text ban as the two independent
variables. Briefly interpret the results.
(b) Describe a possible state-level fixed effect that could cause endo-
geneity and bias in the model from part (a).
(c) Estimate a one-way fixed effects model that controls for state-level
fixed effects. Include deaths per mile as the dependent variable and
cell phone ban and text ban as the two independent variables. Does
TABLE 8.13 Variables for the Cell Phones and Traffic Deaths Data
Variable name Description
year Year
cell_ban Coded 1 if a ban on handheld cell phone use while driving is in effect; 0 otherwise
text_ban Coded 1 if a ban on texting while driving is in effect; 0 otherwise
the coefficient on cell phone ban change in the manner you would
expect based on your answer from part (a)?
(d) Describe a possible year fixed effect that could cause endogeneity
and bias in the fixed effects model in part (c).
(f) The model in part (e) is somewhat sparse with regard to control
variables. Estimate a two-way fixed effects model that includes
control variables for cell phones per 10,000 people and percent
urban. Briefly describe changes in inference about the effect of cell
phone and text bans.
(g) Estimate the same two-way fixed effects model by using the
LSDV approach. Compare the coefficient and t statistic on the cell
phone variable to the results from part (f).
(h) Based on the LSDV results, identify states with large positive and
negative fixed effects. Explain what these mean (being sure to note
the reference category), and speculate about how the positive and
negative fixed effect states differ. (It is helpful to connect the state
number to state name; in Stata, do this with the command list
state state_numeric if year ==2012.)
9 Instrumental Variables: Using Exogenous Variation to Fight Endogeneity
Like many powerful tools, 2SLS can be a bit dangerous. We won’t cut off a
finger using it, but if we aren’t careful, we could end up with worse estimates than
we would have produced with OLS. And like many powerful tools, the approach is
not cheap. In this case, the cost is that the estimates produced by 2SLS are typically
quite a bit less precise than OLS estimates.
In this chapter, we provide the instruction manual for this tool. Section 9.1
presents an example in which an instrumental variables approach proves useful.
Section 9.2 gives the basics for the 2SLS model. Section 9.3 discusses what
to do when we have multiple instruments. Section 9.4 reveals what happens to
2SLS estimates when the instruments are flawed. Section 9.5 explains why 2SLS
estimates tend to be less precise than OLS estimates. And Section 9.6 applies 2SLS
tools to so-called simultaneous equation models in which X causes Y but Y also
causes X.
Levitt’s (2002) idea is that while some police are hired for endogenous reasons
(city leaders expect more crime and so hire more police), other police are hired
for exogenous reasons (the city simply has more money to spend). In particular,
Levitt argues that the number of firefighters in a city reflects voters’ tastes for
public services, union power, and perhaps political patronage. These factors also
partially predict the size of the police force and are not directly related to crime.
In other words, to the extent that changes in the number of firefighters predict
changes in police numbers, those changes in the numerical strength of a police
force are exogenous because they have nothing to do with crime. The idea, then,
is to isolate the portion of changes in the police force associated with changes in
the number of firefighters and see if crime went down (or up) in relation to those
changes.
We’ll work through the exact steps of the process soon. For now, we can get
a sense of how instrumental variables can matter by looking at Levitt’s results.
The left column of results in Table 9.1 shows the coefficient on police estimated
9.1 2SLS Example 297
TABLE 9.1 Levitt (2002) Results on Effect of Police Officers on Violent Crime
OLS with year dummies only    OLS with year and city dummies    2SLS
All models include controls for prison population, per capita income, abortion, city size, and racial demographics.
via a standard OLS estimation of Equation 9.1 based on an OLS analysis with
covariates and year dummy variables but no city fixed effects. The coefficient is
positive and significant, implying that police cause crime. Yikes!
We’re pretty sure, however, that endogeneity distorts simple OLS results
in this context. The second column in Table 9.1 shows that the results change
dramatically when city fixed effects are included. As discussed in Chapter 8,
fixed effects account for the tendency of cities with chronically high crime to also
have larger police forces. The estimated effect of police is negative, but small and
statistically insignificant at usual levels.
The third column in Table 9.1 shows the results obtained when the instru-
mental variables technique is used. The coefficient on police is negative and
almost statistically significant. This result differs dramatically from the OLS result
without city fixed effects and non-trivially from the fixed effects results.
Levitt’s analysis essentially treats changes in firefighters as a kind of experi-
ment. He estimates the number of police that cities add when they add firefighters
and assesses whether crime changed in conjunction with these particular changes
in police.

instrumental variable A variable that explains the endogenous independent variable of interest but does not directly explain the dependent variable.

Levitt is using the firefighter variable as an instrumental variable, a
variable that explains the endogenous independent variable of interest (which in
this case is the log of the number of police per capita) but does not directly explain
the dependent variable (which in this case is violent crimes per capita).
The example also highlights some limits to instrumental variables methods.
First, the increase in police associated with changes in firefighters may not
really be exogenous. That is, can we be sure that the firefighter variable is
truly independent of the error term in Equation 9.1? It is possible, for example,
that reelection-minded political leaders provide other public services when they
boost the number of firefighters—goodies such as tax cuts, roads, and new
stadiums—and that these policy choices may affect crime (perhaps by improving
economic growth). In that case, we worry that our exogenous bump in police
is actually associated with factors that also affect crime, and that those factors
may be in the error term. Therefore, as we develop the logic of instrumental
variables, we also spend a lot of time worrying about the exogeneity of our
instruments.
REMEMBER THIS
1. An instrumental variable is a variable that explains the endogenous independent variable of
interest but does not directly explain the dependent variable.
2. When we use the instrumental variables approach, we focus on changes in Y due to the changes
in X that are attributable to changes in the instrumental variable.
3. Major challenges associated with using instrumental variables include the following:
(a) It is often hard to find an appropriate instrumental variable that is exogenous.
(b) Estimates based on instrumental variables are often imprecise.
Our model is

Yi = β0 + β1 X1i + β2 X2i + εi    (Equation 9.2)

where Yi is our dependent variable, X1i is our main variable of interest, and X2i is
a control variable (and we could easily add additional control variables).
The difference is that X1i is an endogenous variable, which means that it is
correlated with the error term. Our goal with 2SLS is to replace the endogenous
9.2 Two-Stage Least Squares (2SLS) 299
X1i with a different variable that measures only the portion of X1i that is not related
to the error term in the main equation.
We model X1i as

X1i = γ0 + γ1 Zi + γ2 X2i + νi    (9.3)
where Zi is a new variable we are adding to the analysis, X2i is the control variable
in Equation 9.2, the γ’s are coefficients that determine how well Zi and X2i explain
X1i , and νi is an error term. (Recall that γ is the Greek letter gamma and ν is the
Greek letter nu.) We call Z our instrumental variable; this variable is the star of
this chapter, hands down. The variable Z is the source of our exogenous variation
in X1i .
In Levitt’s police and crime example, “police officers per capita” is the
endogenous variable (X1 in our notation) and “firefighters” is the instrumental
variable (Z in our notation). The instrumental variable is the variable that causes
the endogenous variable to change for reasons unrelated to the error term. In other
words, in Levitt’s model, Z (firefighters) explains X1i (police per capita) but is not
correlated with the error term in the equation explaining Y (crime).
The first stage produces fitted values of the endogenous variable:

X̂1i = γ̂0 + γ̂1 Zi + γ̂2 X2i

Notice that X̂1i is a function only of Z, X2 , and the γ’s. That fact has important
implications for what we are trying to do. The error term when X1i is the dependent variable is νi ; it is almost certainly correlated with εi , the error term in the Yi equation. That is, drug use and criminal history are likely to affect both the number of police (X1 ) and crime (Y). This means the actual value of X1 is correlated with ε; the fitted value X̂1i , on the other hand, is only a function of Z, X2 , and the γ’s. So even though police forces in reality may be ebbing and flowing as related to drug use and other factors in the error term of Equation 9.2, the fitted value X̂1i will not change. Our X̂1i will ebb and flow only with changes in Z and X2 , which means our fitted value of X has been purged of the association between X and ε.
All control variables from the second-stage model must be included in the
first stage. We want our instrument to explain variation in X1 over and above any
variation that can be explained by the other independent variables.
In the second stage, we estimate our outcome equation, but (key point here)
we use X̂1i —the fitted value of X1i —rather than the actual value of X1i . In other
words, instead of using X1i , which we suspect is endogenous (correlated with εi ), we use the measure of X̂1i , which has been purged of X1i ’s association with the error.
Specifically, the second stage of the 2SLS model is

Yi = β0 + β1 X̂1i + β2 X2i + εi    (9.4)
The little hat on X̂1i is a big deal. Once we appreciate why we’re using it and
how to generate it, 2SLS becomes easy. We are now estimating how much the
exogenous variation in X1i affects Y. Notice also that there is no Z in Equation 9.4.
By the logic of 2SLS, Z affects Y only indirectly, by affecting X.
Control variables play an important role, just as in OLS. If a factor that
affects Y is correlated with Z, we need to include it in the second-stage regression.
Otherwise, the instrument may soak up some of the effect of this omitted factor
rather than merely exogenous variation in X1 . For example, suppose that cities in
the South started facing more arson and hence hired more firefighters. In that case,
Levitt’s firefighter instrument for police officers will also contain variation due to
region. If we do not control for region in the second-stage regression, some of the
region effect may work its way through the instrument, potentially creating a bias.
Actual estimation via 2SLS is a bit more involved than simply running OLS
with X̂1 because X̂1i is itself an estimate, and the standard errors need to be adjusted
to account for this. In practice, though, statistical packages do this adjustment
automatically with their 2SLS commands.1
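The mechanics of the two stages can be sketched with a small simulation (shown here in Python with NumPy; every variable name and parameter value is invented for illustration). Note that the standard errors a naive second-stage regression would report are not corrected the way packaged 2SLS commands correct them; this sketch recovers only the coefficient:

```python
import numpy as np

# Hypothetical simulation of 2SLS; all names and numbers here are invented.
rng = np.random.default_rng(0)
n = 100_000

z = rng.normal(size=n)         # instrument: exogenous by construction
x2 = rng.normal(size=n)        # control variable
confound = rng.normal(size=n)  # unmeasured factor that ends up in the error term
nu = rng.normal(size=n)

# X1 is endogenous: the confound drives both X1 and Y.
x1 = 0.5 + 1.0 * z + 0.7 * x2 + 2.0 * confound + nu
eps = confound + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + eps   # true beta1 = 2.0

def ols(X, y):
    """OLS coefficients with an intercept prepended to the regressor list."""
    X = np.column_stack([np.ones(len(y)), *X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Naive OLS: biased upward because corr(x1, eps) > 0.
beta_ols = ols([x1, x2], y)[1]

# Stage 1: regress x1 on the instrument AND all second-stage controls.
g = ols([z, x2], x1)
x1_hat = g[0] + g[1] * z + g[2] * x2

# Stage 2: replace x1 with its fitted value.
beta_2sls = ols([x1_hat, x2], y)[1]

print(beta_ols, beta_2sls)  # OLS is biased upward; 2SLS lands near 2.0
```

Because the first stage includes the second-stage control x2, the fitted values capture only variation in x1 over and above what the control explains.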
1 When there is a single endogenous independent variable and a single instrument, the 2SLS estimator reduces to β̂1 = cov(Z, Y)/cov(Z, X) (Murnane and Willett 2011, 229). While it may be computationally
simpler to use this ratio of covariances to estimate β̂1 , it becomes harder to see the intuition about
exogenous variation if we do so. In addition, the 2SLS estimator is more general: it allows for
multiple independent variables and instruments.
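In the single-instrument case without controls, the equivalence in the footnote is easy to verify numerically; a hypothetical check on simulated data (all names invented):

```python
import numpy as np

# Hypothetical data: one instrument, one endogenous regressor, no controls.
rng = np.random.default_rng(1)
n = 50_000
z = rng.normal(size=n)
confound = rng.normal(size=n)
x = 1.5 * z + confound + rng.normal(size=n)
y = 2.0 * x + confound + rng.normal(size=n)   # true slope = 2.0

# Ratio-of-covariances form of the estimator
beta_ratio = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

# Two-stage form: slope of y on the fitted values from regressing x on z
g1 = np.cov(z, x)[0, 1] / np.var(z, ddof=1)
g0 = x.mean() - g1 * z.mean()
x_hat = g0 + g1 * z
beta_2sls = np.cov(x_hat, y)[0, 1] / np.var(x_hat, ddof=1)

print(beta_ratio, beta_2sls)  # numerically identical
```

The two forms agree algebraically because the fitted value is just a linear function of Z, so the slope on x_hat reduces to cov(Z, Y)/cov(Z, X).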
an argument that the number of firefighters in a city was uncorrelated with these
elements of the error term.
Unfortunately, there is no direct test of whether Z is uncorrelated with ε. The
whole point of the error term is that it covers unmeasured factors. We simply
cannot directly observe whether Z is correlated with these unmeasured factors.
A natural instinct is to try to test the exclusion condition by including Z
directly in the second stage, but this won’t work. If Z is a good instrument, it
will explain X1i , which in turn will affect Y. We will observe some effect of Z
on Y, which will be the effect of Z on X1i , which in turn can have an effect on
Y. Instead, the discussion of the exclusion condition will need to be primarily
conceptual rather than statistical. We will need to justify our assertion, without
statistical analysis, that Z does not affect Y directly. Yes, that’s a bummer and,
frankly, a pretty weird position to be in for a statistical analyst. Life is like that
sometimes.2
Figure 9.1 illustrates the two conditions necessary for Z to be an appropriate
instrument. The inclusion condition is that Z explains X. We test this simply by
regressing X on Z. The exclusion restriction is that Z does not cause Y. The
exclusion condition is tricky to test because if the inclusion condition holds, Z
causes X, which in turn may cause Y. In this case, there would be an observed
relationship between Z and Y but only via Z’s effect on X. Hence, we can’t test
the exclusion restriction statistically and must make substantive arguments about
why we believe Z has no direct effect on Y.
2 A test called the Hausman test (or the Durbin-Wu-Hausman test) is sometimes referred to as a test
of endogeneity. We should be careful to recognize that this is not a test of the exclusion restriction.
Instead, the Hausman test assesses whether X is endogenous. It is not a test of whether Z is
exogenous. Hausman derived the test by noting that if Z is exogenous and X is endogenous, then OLS
and 2SLS should produce very different β̂ estimates. If Z is exogenous and X is exogenous, then OLS
and 2SLS should produce similar β̂ estimates. The test involves assessing how different the β̂
estimates are from OLS and 2SLS. Crucially, we need to assume that Z is exogenous for this test.
That’s the claim we usually want to test, so the Hausman test of endogeneity is often less valuable
than it sounds.
[Figure 9.1. Two conditions for an instrument Z (the instrumental variable): the inclusion condition, that Z must explain X (the independent variable), and the exclusion restriction, that Z must not explain Y (the dependent variable).]
have laws that say that young people have to stay in school until they are 16. For
a school district that starts kids in school based on their age on September 1, kids
born in July would be in eleventh grade when they turn 16, whereas kids born in
October (who started a year later) would be only in tenth grade when they turn
16. Hence, kids born in July can’t legally drop out until they are in the eleventh
grade, but kids born in October can drop out in the tenth grade. The effect is not
huge, but with a lot of data (and Angrist and Krueger had a lot of data), this effect
is statistically significant.
Quarter of birth also seems to satisfy the exclusion condition because birth
month doesn’t seem to be related to such unmeasured factors that affect salary as
smarts, diligence, and family wealth. (Astrologers disagree, by the way.)
Bound, Jaeger, and Baker (1995), however, showed that quarter of birth
has been associated with school attendance rates, behavioral difficulties, mental
health, performance on tests, schizophrenia, autism, dyslexia, multiple sclerosis,
region, and income. [Wealthy families, for example, have fewer babies in the
winter (Buckles and Hungerman 2013). Go figure.] That this example may fail
the exclusion condition is disappointing: if quarter of birth doesn’t satisfy the
exclusion condition, it’s fair to say a lot of less clever instruments may be in trouble
as well. Hence, we should exercise due caution in using instruments, being sure
both to implement the diagnostics discussed next and to test theories with multiple
instruments or analytical strategies.
REMEMBER THIS
Two-stage least squares uses exogenous variation in X to estimate the effect of X on Y.
1. In the first stage, the endogenous independent variable is the dependent variable and the
instrument, Z, is an independent variable:
X1i = γ0 + γ1 Zi + γ2 X2i + νi
2. In the second stage, X̂1i (the fitted values from the first stage) is an independent variable:
Yi = β0 + β1 X̂1i + β2 X2i + εi
Discussion Questions
1. Some people believe cell phones and platforms like Twitter, which use related technology, have
increased social unrest by making it easier to organize protests or acts of violence. Pierskalla
and Hollenbach (2013) used data from Africa to test this view. In its most basic form, the model
was

Violencei = β0 + β1 Cell phone coveragei + εi

where Violencei is data on organized violence in city i and Cell phone coveragei measures availability of mobile coverage in city i.
where Republican votei is the vote for the Republican candidate for Congress in district i in
2010 and Tea Party protest turnouti measures the number of people who showed up at Tea Party
protests in district i on April 15, 2009, a day of planned protests across the United States.
(a) Explain why endogeneity may be a concern.
(b) Consider local rainfall on April 15, 2009, as an instrument for Tea Party protest turnout.
Explain how to test whether the rain variable satisfies the inclusion condition.
(c) Does the local rainfall variable satisfy the exclusion condition? Can we test whether this
condition holds?
3. Do economies grow more when their political institutions are better? Consider the following
simple model:

Economic growthi = β0 + β1 Institutional qualityi + εi

where Economic growthi is the growth of country i and Institutional qualityi is a measure of the quality of governance of country i.
(a) Explain why endogeneity may be a concern.
(b) Acemoglu, Johnson, and Robinson (2001) proposed country-specific mortality rates
faced by European soldiers, bishops, and sailors in their countries’ colonies in the seventeenth, eighteenth, and nineteenth centuries as an instrument for current institutions.
The logic is that European powers were more likely to set up worse institutions in
places where the people they sent over kept dying. In these places, the institutions were
oriented more toward extracting resources than toward creating a stable, prosperous
society. Explain how to test whether the settler mortality variable satisfies the inclusion
condition.
(c) Does the settler mortality variable satisfy the exclusion condition? Can we test whether
this condition holds?
Deathi = β0 + β1 NICUi + εi

where Death equals 1 if the baby passed away (and 0 otherwise) and NICU equals 1 if the delivery occurred in a high-level NICU facility (and 0 otherwise).
It is highly likely that the coefficient in this case would be positive. It is beyond
doubt that the riskiest births go to the NICU, so clearly, the key independent variable
(NICU) will be correlated with factors associated with a higher risk of death. In other
words, we are quite certain endogeneity will bias the coefficient upward. We could,
of course, add covariates that indicate risk factors in the pregnancy. Doing so would
reduce the endogeneity by taking factors correlated with NICU out of the error term
and putting them in the equation. Nonetheless, we would still worry that cases that
are riskier than usual in reality, but perhaps in ways that are difficult to measure,
would still be more likely to end up in NICUs, with the result that endogeneity would
be hard to fully purge with multivariate OLS.
Perhaps experiments could be helpful. They are, after all, designed to ensure
exogeneity. They are also completely out of bounds in this context. It is shocking to
even consider randomly assigning mothers and newborns to NICU and non-NICU
facilities. It won’t and shouldn’t happen.
So are we done? Do we have to accept multivariate OLS as the best we can do?
Not quite. Instrumental variables, and 2SLS in particular, give us hope for producing
more accurate estimates. What we need is something that explains exogenous
variation in use of the NICU. That is, can we identify a variable that explains usage
of NICUs but is not correlated with pregnancy risk factors?
Lorch, Baiocchi, Ahlberg, and Small (2012) identified a good prospect: distance
to a NICU. Specifically, they created a dummy variable we’ll call Near NICU, which
equals 1 for mothers who could get to NICU in at most 10 minutes more than it took
to get to a regular hospital (and 0 otherwise). The idea is that mothers who lived
closer to a NICU-equipped hospital would be more likely to deliver there. At the
same time, distance to a NICU should not directly affect birth outcomes; it should
affect birth outcomes only to the extent that it affects utilization of NICUs.
Does this variable satisfy the conditions necessary for an instrument? The first
condition is that the instrumental variable explains the endogenous variable, which
in this case is whether the mother delivered at a NICU. Table 9.2 shows the results
from a multivariate analysis in which the dependent variable was a dummy variable
indicating delivery at a NICU and the main independent variable was the variable
indicating that the mother lived near a NICU.
Clearly, mothers who live close to a NICU hospital are more likely to deliver
at such a hospital. The estimated coefficient on Near NICU is highly statistically
significant, with a t statistic exceeding 178. Distance does a very good job explaining NICU usage. Table 9.2 shows coefficients for two other variables as well (the
actual analysis has 60 control variables). Gestational age indicates how far along
the pregnancy was at the time of delivery. ZIP code poverty indicates the percent
of people in a ZIP code living below the poverty line. Both these control variables
are significant, with babies that are gestationally older less likely to be delivered in
NICU hospitals and women from high-poverty ZIP codes more likely to deliver in
NICU hospitals.
The second condition that a good instrument must satisfy is that its variable not
be correlated with the error term in the second stage. This is the exclusion condition,
which holds that we can justifiably exclude the instrument from the second stage.
Certainly, it seems highly unlikely that the mere fact of living near a NICU would
help a baby unless the mother used that facility. However, living near a NICU might
be correlated with a risk factor. What if NICUs tended to be in large urban hospitals
in poor areas? In that case, living near one could be correlated with poverty, which
in turn might itself be a pregnancy risk factor. Hence, it is crucial in this analysis that
poverty be a control variable in both the first and second stages. In the first stage,
controlling for poverty allows us to identify how much more likely women are to go
The multivariate OLS and 2SLS models include many controls for pregnancy
risk and demographic factors. Results based on Lorch, Baiocchi, Ahlberg,
and Small (2012).
Review Questions
Table 9.4 provides results on regressions used in a 2SLS analysis of the effect of alcohol consumption
on grades. This is from hypothetical data on grades, standardized test scores, and average weekly
alcohol consumption from 1,000 undergraduate students at universities in multiple states. The beer
tax variable measures the amount of tax on beer in the state in which the student attends university.
The test score is the composite SAT score from high school. Grades are measured as grade point
average in the student’s most recent semester.
1. Identify the first-stage model and the second-stage model. What is the instrument?
2. Is the instrument a good instrument? Why or why not?
3. Is there evidence about the exogeneity of the instrument in the table? Why or why not?
4. What would happen if we included the beer tax variable in the grades model?
5. Do the (hypothetical!) results here present sufficient evidence to argue that alcohol has no effect
on grades?
If these are all valid instruments, we have multiple sources of exogeneity that could
improve the fit in the first stage.
When we have multiple instruments, the best way to assess whether the
instruments adequately predict the endogenous variable is to use an F test for the
null hypothesis that the coefficients on all instruments in the first stage are zero.
For our example, the F test would test H0 : γ1 = γ2 = γ3 = 0. We presented the F
test in Chapter 5 (page 159). In this case, rejecting the null leads us to conclude that at least one of the instruments helps explain X1i . We discuss a rule of thumb
for this test shortly on page 312.
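The first-stage F test amounts to comparing a restricted regression (controls only) with an unrestricted one (controls plus instruments); a hypothetical sketch with three invented instruments:

```python
import numpy as np

# Hypothetical first stage with three invented instruments and one control.
rng = np.random.default_rng(2)
n = 5_000
z1, z2, z3 = rng.normal(size=(3, n))
x2 = rng.normal(size=n)
x1 = 0.3 * z1 + 0.2 * z2 + 0.1 * z3 + 0.5 * x2 + rng.normal(size=n)

def rss(X, y):
    """Residual sum of squares from an OLS fit with intercept."""
    X = np.column_stack([np.ones(len(y)), *X])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return resid @ resid

rss_r = rss([x2], x1)              # restricted: controls only
rss_u = rss([x2, z1, z2, z3], x1)  # unrestricted: controls + instruments

q = 3  # restrictions tested (H0: all three instrument coefficients are zero)
k = 5  # parameters in the unrestricted model (intercept, x2, z1, z2, z3)
F = ((rss_r - rss_u) / q) / (rss_u / (n - k))
print(F)  # far above the rule-of-thumb threshold of 10
```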
Overidentification tests
overidentification test A test used for 2SLS models having more than one instrument. The logic of the test is that the estimated coefficient on the endogenous variable in the second-stage equation should be roughly the same when each individual instrument is used alone.

Having multiple instruments also allows us to implement an overidentification test. The name of the test comes from the fact that we say an instrumental variable model is identified if we have an instrument that can explain X without directly influencing Y. When we have more than one instrument, the equation is overidentified; that sounds a bit ominous, like something will explode.3 Overidentification is actually a good thing. Having multiple instruments allows us to do some additional analysis that will shed light on the performance of the instruments.

The references in this chapter’s Further Reading section point to a number of formal tests regarding multiple instruments. These tests can get a bit involved, but the core intuition is rather simple. If each instrument is valid—that is, if each satisfies the two conditions for instruments—then using each one alone should
produce an unbiased estimate of β1 . Therefore, as an overidentification test, we
can simply estimate the 2SLS model with each individual instrument alone. The
coefficient estimates should look pretty much the same given that each instrument
alone under these circumstances produces an unbiased estimator. Hence, if each
of these models produces coefficients that are similar, we can feel pretty confident
3 Everyone out now! The model is going to blow any minute . . . it’s way overidentified!
that each is a decent instrument (or that they all are equally bad, which is the skunk
at the garden party for overidentification tests).
If the instruments produce vastly different β̂1 coefficient estimates, we have
to rethink our instruments. This can happen if one of the instruments violates the
exclusion condition. The catch is that we don’t know which instrument is the bad
one. Suppose that β̂1 found by using Z1 as an instrument is very different from
β̂1 found by using Z2 as an instrument. Is Z1 a bad instrument? Or is the problem
with Z2 ? Overidentification tests can’t say.
An overidentification test is like having two clocks. If the clocks show
different times, we know at least one is wrong, and possibly both. If both clocks
show the same time, we know they’re either both right or both wrong in the same exact way.
Overidentification tests are relatively uncommon, not because they aren’t
useful but because it’s hard to find one good instrument, let alone two
or more.
REMEMBER THIS
An instrumental variable model is overidentified when there are multiple instruments for a single endogenous variable.
1. To estimate a 2SLS model with multiple valid instruments, simply include all of them in the
first stage.
2. To use overidentification tests to assess instruments, run 2SLS models separately with each
instrumental variable. If the second-stage coefficients on the endogenous variable in question
are similar across models, this result is evidence that all the instruments are valid.
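The informal version of this check can be sketched on simulated data (invented names and values); a third, deliberately invalid "instrument" is included to show what a violation of the exclusion condition looks like:

```python
import numpy as np

# Hypothetical data with two valid instruments and one invalid one.
rng = np.random.default_rng(3)
n = 200_000
z1, z2 = rng.normal(size=(2, n))
confound = rng.normal(size=n)
x = 0.8 * z1 + 0.6 * z2 + confound + rng.normal(size=n)
y = 1.5 * x + confound + rng.normal(size=n)   # true beta = 1.5

# An invented "instrument" that violates exclusion: it is the confound plus
# noise, so it is correlated with the error term in the y equation.
z_bad = confound + rng.normal(size=n)

def iv_slope(z, x, y):
    """Single-instrument 2SLS slope via the covariance ratio."""
    return np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

b1 = iv_slope(z1, x, y)
b2 = iv_slope(z2, x, y)
b_bad = iv_slope(z_bad, x, y)
print(b1, b2, b_bad)  # b1 and b2 agree near 1.5; b_bad is far off
```

As the text notes, agreement between b1 and b2 is also consistent with both instruments being wrong in exactly the same way; the test cannot rule that out.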
bit, or at least a lot less than X1 correlates with ε. Such an instrument is called a quasi-instrument.

quasi-instrument An instrumental variable that is not strictly exogenous.

It can sometimes be useful to estimate a 2SLS model with a quasi-instrument because a bit of correlation between Z and ε does not necessarily render 2SLS useless. To see why, let’s consider a simple case: one independent variable and
one instrument. We examine the probability limit of β̂1 because the properties
of probability limits are easier to work with than expectations in this context.4
For reference, we first note that the probability limit for the OLS estimate of β̂1 is

plim β̂1^OLS = β1 + corr(X1 , ε) (σε / σX1 )    (9.7)

where plim refers to the probability limit and corr indicates the correlation of the two variables in parentheses. If corr(X1 , ε) is zero, then the probability limit of β̂1^OLS is β1 . That’s a good thing! If corr(X1 , ε) is non-zero, the OLS estimate of β̂1 will converge to something other than β1 as the sample size gets very large. That’s not good.
If we use a quasi-instrument to estimate a 2SLS, the probability limit for the 2SLS estimate of β̂1 is

plim β̂1^2SLS = β1 + [corr(Z, ε) / corr(Z, X1 )] (σε / σX1 )    (9.8)

If corr(Z, ε) is zero, then the probability limit of β̂1^2SLS is β1 .5 Another good thing! Otherwise, the 2SLS estimate of β̂1 will converge to something other than β1 as the sample size gets very large.
Equation 9.8 has two very different implications. On the one hand, the
equation can be grounds for optimism about 2SLS. Comparing the probability
limits from the OLS and 2SLS models shows that if there is only a small
correlation between Z and ε and a high correlation between Z and X1 , then 2SLS will perform better than OLS when the correlation of X and ε is large. This can happen when an instrument does a great job predicting X but
has a wee bit of correlation with the error in the main equation. In other
words, quasi-instruments may help us get estimates that are closer to the true
value.
On the other hand, the correlation of Z and X1 in the denominator of Equation 9.8 implies that when the instrument does a poor job of explaining X1 , even a small amount of correlation between Z and ε can become magnified
by virtue of being divided by a very small number. In the education and wages
4 Section 3.5 introduces probability limits.
5 The form of this equation is from Wooldridge (2009), based on Bound, Jaeger, and Baker (1995).
example, the month of birth explained so little of the variation in education that the
danger was substantial distortion of the 2SLS estimate if even a dash of correlation
existed between month of birth and ε.
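A simulation can illustrate Equation 9.8; in this hypothetical setup (all values invented), the predicted large-sample bias reduces to cov(Z, ε)/cov(Z, X1 ), and the 2SLS estimate converges to the true coefficient plus that bias:

```python
import numpy as np

# Hypothetical quasi-instrument: corr(z, eps) is small but not zero.
rng = np.random.default_rng(4)
n = 1_000_000
z = rng.normal(size=n)

a, c = 0.2, 0.05     # weak-ish first stage (a), small exclusion violation (c)
x1 = a * z + rng.normal(size=n)
eps = c * z + rng.normal(size=n)
beta1 = 2.0
y = beta1 * x1 + eps

beta_2sls = np.cov(z, y)[0, 1] / np.cov(z, x1)[0, 1]

# Equation 9.8's predicted large-sample value:
ratio = np.corrcoef(z, eps)[0, 1] / np.corrcoef(z, x1)[0, 1]
predicted = beta1 + ratio * (eps.std() / x1.std())

print(beta_2sls, predicted)  # both near 2.0 + c/a = 2.25
```

Shrinking the first-stage coefficient a while holding c fixed magnifies the bias, which is exactly the weak-instrument danger described above.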
6 The rule of thumb is from Staiger and Stock (1997). We can, of course, run an F test even when we have only a single instrument. A cool curiosity is that the F statistic in this case will be the square of the t statistic. This means that when we have only a single instrument, we can simply look for a t
9.5 Precision of 2SLS 313
REMEMBER THIS
1. A quasi-instrument is an instrument that is correlated with the error term in the main equation.
If the correlation of the quasi-instrument (Z) and the error term (ε) is small relative to the
correlation of the quasi-instrument and the endogenous variable (X), then as the sample size
gets very large, the 2SLS estimate based on Z will converge to something closer to the true
value than the OLS estimate.
2. A weak instrument does a poor job of explaining the endogenous variable (X). Weak
instruments magnify the problems associated with quasi-instruments and also can cause bias
in small samples.
3. All 2SLS analyses should report tests of the independent explanatory power of the instrumental variable or variables in the first-stage regression. A rule of thumb is that the F statistic should be
at least 10 for the hypothesis that the coefficients on all instruments in the first-stage regression
are zero.
statistic that is bigger than √10, which we approximate (roughly!) by saying the t statistic should be bigger than 3. Appendix H provides more information on the F distribution on page 549.
X̂1i = π0 + π1 X2i + ηi

where we use π , the Greek letter pi, as coefficients and η, the Greek letter eta (which rhymes with β), to emphasize that this is a new model, different from earlier models. Notice that Z is not in this regression, meaning that the R2 from it measures the extent to which X̂1 is a function of the other independent variables. If this R2 is high, X̂1 is explained by X2 but not by Z, which will push up var(β̂X̂1^2SLS).
The point here is not to learn how to calculate standard error estimates by
hand. Computer programs do the chore perfectly well. The point is to understand
the sources of variance in 2SLS. In particular, it is useful to see the importance of the ability of Z to explain X1 . If Z lacks this ability, our β̂1^2SLS estimates will be imprecise.
As for goodness of fit, the conventional R2 for 2SLS is basically broken. It is
possible for it to be negative. If we really need a measure of goodness of fit, the
square of the correlation of the fitted values and actual values will do. However, as
we discussed when we introduced R2 on page 71, the validity of the results does
not depend on the overall goodness of fit.
9.6 Simultaneous Equation Models 315
REMEMBER THIS
1. Four factors influence the variance of 2SLS β̂j estimates.

(a) Model fit: The better the model fits, the lower σ̂2 and var(β̂j^2SLS) will be.

(b) Sample size: The more observations, the lower var(β̂j^2SLS) will be.

(c) The overall fit of the first-stage regression: The better the fit of the first-stage model, the higher var(X̂1 ) and the lower var(β̂1^2SLS) will be.

(d) The explanatory power of the instrument in explaining X:

• If Z is a weak instrument (i.e., if it does a poor job of explaining X1 when we control for the other X variables), then R2X̂1,NoZ will be high because X̂1 will depend almost completely on the other independent variables. The result will be a high var(β̂1^2SLS).

• If Z explains X1 when we control for the other X variables, then R2X̂1,NoZ will be low, which will lower var(β̂1^2SLS).
2. R2 is not meaningful for 2SLS models.
The labels X and Y don’t really work anymore when the variables cause each
other because no variable is only an independent variable or only a dependent
variable. Therefore, we use the following equations to characterize the basic model of simultaneous causality:
[Figure: simultaneous causality, in which Y1 and Y2 are endogenous variables that cause each other, an independent variable appears in both equations, and Z1 and Z2 serve as instruments.]
where Ŷ2i is the fitted value from the first-stage regression (Equation 9.14).
REMEMBER THIS
We can use instrumental variables to estimate coefficients for the following simultaneous equation
model:
1. Use the following steps to estimate the coefficients in the first equation:
• In the first stage, we estimate a model in which the endogenous variable is the dependent
variable and all W and Z variables are the independent variables. Importantly, the other
endogenous variable (Y1 ) is not included in this first stage:
• In the second stage, we estimate a model in which the fitted values from the first stage, Ŷ2i ,
are an independent variable:
2. We proceed in a similar way to estimate coefficients for the second equation in the model:
• First, estimate a model with Y1i as the dependent variable and the W and Z variables (but
not Y2 !) as independent variables.
• Estimate the final model by using Ŷ1i instead of Y1i as an independent variable.
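The recipe above can be sketched end to end on simulated data (Python/NumPy; coefficients and names invented; for brevity the W control variables are omitted, so each equation contains only the other endogenous variable and its own instrument):

```python
import numpy as np

# Hypothetical simultaneous system (W controls omitted for brevity):
#   y1 = a1*y2 + d1*z1 + e1,   y2 = a2*y1 + d2*z2 + e2
rng = np.random.default_rng(5)
n = 200_000
z1, z2 = rng.normal(size=(2, n))
e1, e2 = rng.normal(size=(2, n))

a1, a2, d1, d2 = 0.5, -0.4, 1.0, 1.0   # invented structural coefficients
det = 1 - a1 * a2                      # 1.2, so the system can be solved

# Reduced form: solve the two structural equations for y1 and y2.
y1 = (d1 * z1 + a1 * d2 * z2 + e1 + a1 * e2) / det
y2 = (a2 * d1 * z1 + d2 * z2 + a2 * e1 + e2) / det

def ols(X, y):
    """OLS coefficients with an intercept prepended to the regressor list."""
    X = np.column_stack([np.ones(len(y)), *X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# First stage for the y1 equation: y2 on both instruments, y1 excluded.
g = ols([z1, z2], y2)
y2_hat = g[0] + g[1] * z1 + g[2] * z2

# Second stage: y1 on the fitted y2 and its own instrument z1.
a1_hat = ols([y2_hat, z1], y1)[1]
print(a1_hat)  # close to the true value a1 = 0.5
```

Estimating the second equation proceeds symmetrically, regressing y1 on the instruments in the first stage and using its fitted values in place of y1.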
CASE STUDY Supply and Demand Curves for the Chicken Market
Even though nothing defines the field of economics like supply and demand, estimating supply
and demand curves can be tricky. We can’t simply
estimate an equation in which quantity supplied is
the dependent variable and price is an independent variable because price itself is a function of
how much is supplied. In other words, quantity and
price are simultaneously determined.
Our simultaneous equation framework can
help us navigate this challenge. First, though, let’s
be clear about what we’re trying to do. We want to
estimate two relationships: a supply function and a
demand function. Each of these characterizes the relationship between price and
amount, but they do so in pretty much opposite ways. We expect the quantity
supplied to increase as the price increases. After all, we suspect a producer will
say, “You’ll pay more? I’ll make more!” On the other hand, we expect the quantity
demanded to decrease when the price increases, as consumers will say, “It costs
more? I’ll buy less!”
As we pose the question, we can see this won’t be easy as we typically observe
one price and one quantity for each period. How are we going to get two different
slopes out of this same information?
If we only had information on price and quantity, we could not, in fact, estimate
the supply and demand functions. In that case, we should shut the computer off and
go to bed. If we have other information, however, that satisfies our conditions for
instrumental variables, then we can potentially estimate both supply and demand
functions. Here’s how.
Let’s start with the supply side and write down equations for quantity and price
supplied:
where Pricet is instrumented with change in income, change in the price of beef,
and the lagged price.
We can do a similar exercise when estimating the demand function. We still
work with quantity and price equations. However, now we’re looking for factors
7 We simplify things a fair bit; see the original article as well as Brumm, Epple, and McCallum (2008) for a more detailed discussion.
TABLE 9.5 Price and Quantity Supplied Equations for U.S. Chicken Market
[Columns: Price equation (first stage); Quantity supplied equation (second stage)]
that affect the price via the supply side but do not directly affect how much
chicken people will want to consume. Epple and McCallum proposed the price of
chicken feed, the amount of meat (non-chicken) demanded for export, and the lagged amount produced as instrumental variables that satisfy these
conditions. For example, the price of feed will affect how much it costs to produce
chicken, but it should not affect the amount consumed except by affecting the
price. This leads to the following model:
where Pricet is instrumented with price of chicken feed, the change in meat exports,
and the lagged production.
There are two additional challenges. First, we will log variables in order to
generate price elasticities. We discussed reasons why in Section 7.2. Hence, every
variable except the time trend will be logged. Second, we’re dealing with time
series data. We saw a bit about time series data when we covered autocorrelation in
TABLE 9.6 Price and Quantity Demanded Equations for U.S. Chicken Market
[Columns: Price equation (first stage); Quantity demanded equation (second stage)]
Section 3.6, and we’ll discuss time series data in much greater detail in Chapter 13.
For now, we simply note that a concern with strong time dependence led Epple and
McCallum to conclude the best approach was to use differenced variables for the
demand equation. Differenced variables measure the change in a variable rather
than the level. Hence, the value of a differenced variable for year 2 of the data is the
change from period 1 to period 2 rather than the amount in period 2.
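Differencing itself is a one-line transformation. The sketch below uses Python rather than the book's Stata/R, with made-up production levels:

```python
# Hypothetical yearly production levels (units arbitrary).
levels = [100, 104, 103, 110, 118]

# The differenced series starts in year 2: each entry is this year's level
# minus last year's level.
diffs = [levels[t] - levels[t - 1] for t in range(1, len(levels))]

print(diffs)  # [4, -1, 7, 8]
```

The differenced series is one observation shorter than the original, which is why the first usable year is year 2.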
Table 9.5 on page 321 shows the results for the supply equation. The first-stage
results are from a reduced form model in which the price is the dependent variable
and all the control variables and instruments are the independent variables.
Notably, we do not include the quantity as a control variable in this first-stage
regression. Each of the instruments is statistically significant, and the F statistic for
the null hypothesis that all coefficients on the instruments equal zero is 11.16, which
satisfies the rule of thumb that the F statistic for the test regarding all instruments
should be over 10.
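The first-stage logic can be sketched by hand. The Python snippet below (invented data, not the chicken-market series) regresses an endogenous variable X on a single instrument Z and computes the t statistic on the instrument; with exactly one instrument, the first-stage F statistic is just that t statistic squared, which ties the F > 10 rule of thumb to the t > 3 rule used in the Computing Corner:

```python
import statistics

# Hypothetical first stage: regress the endogenous X on a single instrument Z.
Z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
X = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8]  # strongly related to Z

n = len(Z)
zbar, xbar = statistics.mean(Z), statistics.mean(X)
szz = sum((z - zbar) ** 2 for z in Z)
beta1 = sum((z - zbar) * (x - xbar) for z, x in zip(Z, X)) / szz
beta0 = xbar - beta1 * zbar

# Standard error of the slope, then the t statistic on the instrument.
resid = [x - (beta0 + beta1 * z) for z, x in zip(Z, X)]
s2 = sum(e ** 2 for e in resid) / (n - 2)
t_stat = beta1 / (s2 / szz) ** 0.5

# With one instrument, the first-stage F equals t squared,
# so F > 10 corresponds roughly to t > 3.
print(t_stat, t_stat ** 2)
```

With multiple instruments the F test jointly restricts all instrument coefficients to zero, which is what the Stata `test` command below does.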
The second-stage supply equation uses the fitted value of the price of chicken.
We see that the elasticity is 0.203, meaning that a one percent increase in price
is associated with a 0.203 percent increase in production. We also see that a one
percent increase in the price of chicken feed, a major input, is associated with a
0.141 percent reduction in quantity of chicken produced.
Table 9.6 on page 322 shows the results for the demand equation. The
first-stage price equation uses the price of chicken feed, meat exports, and the
lagged price of chicken as instruments. Chicken feed prices should affect suppliers
but not directly affect the demand side. The volume of meat exports should
affect suppliers’ output but not what consumers in the United States want. Our
dependent variable in the second stage is the amount of chicken consumed by
people in the United States.
Each instrument performs reasonably well, with the t statistics above 2. The F
statistic for the null hypothesis that all coefficients are zero is 10.86, which satisfies
our first-stage inclusion condition.
The second-stage demand equation reported in Table 9.6 is quite sensible. A
one percent increase in price is associated with a 0.257 percent decline in amount
demanded. This is pretty neat. Whereas Table 9.5 showed an increase in quantity
supplied as price rises, Table 9.6 shows a decrease in quantity demanded as price
rises. This is precisely what economic theory says should happen.
The other coefficients in Table 9.6 make sense as well. A one percent increase in
incomes is associated with a 0.408 percent increase in consumption, although this
is not quite statistically significant. In addition, the amount of chicken demanded
increases as the price of beef rises. In particular, if beef prices go up by one percent,
people in the U.S. eat 0.232 percent more chicken. Think of that coefficient as
basically a Chick-fil-A commercial, but with math.
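Because the model is log-log, the 0.232 coefficient reads directly as an elasticity. A quick arithmetic check (Python; the base quantity of 1.0 is arbitrary):

```python
# Log-log coefficients are elasticities: log(Quantity) = ... + b*log(BeefPrice) + ...
b = 0.232   # cross-price elasticity of chicken demand with respect to beef prices

q_before = 1.0                  # arbitrary base quantity
q_after = q_before * 1.01 ** b  # beef price rises by one percent

pct_change = 100 * (q_after / q_before - 1)
print(round(pct_change, 3))  # roughly a 0.23 percent rise in chicken demanded
```

For small percentage changes the exact answer is essentially b itself, which is why we read log-log coefficients as elasticities.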
Conclusion
• Section 9.2: Explain the first- and second-stage regressions in 2SLS. What
two conditions are necessary for an instrument to be valid?
• Section 9.5: Explain how the first-stage results affect the precision of the
second-stage results.
Further Reading
Murray (2006a) summarizes the instrumental variables approach and is particularly
good at discussing finite sample bias and many statistical tests that are useful in
diagnosing whether instrumental variables conditions are met. Baiocchi, Cheng, and
Small (2014) provide an intuitive discussion of instrumental variables in health
research.

One topic that has generated considerable academic interest is the possibility
that the effect of X differs within a population. In this case, 2SLS estimates the
local average treatment effect, which is the causal effect only for those affected
by the instrument. This effect is considered "local" in the sense of describing the
effect for the specific class of individuals for whom the endogenous X1 variable
was influenced by the exogenous Z variable.8

local average treatment effect: The causal effect for those people affected by the
instrument only. Relevant if the effect of X on Y varies within the population.

In addition, scholars who study instrumental variables methods discuss the
importance of monotonicity, which is a condition under which the effect of the
instrument on the endogenous variable goes in the same direction for everyone in
a population. This condition rules out the possibility that an increase in Z causes
some units to increase X and other units to decrease X.

monotonicity: Monotonicity requires that the effect of the instrument on the
endogenous variable go in the same direction for everyone in a population.

Finally, scholars also discuss the stable unit treatment value assumption, the
condition under which the treatment doesn't vary in unmeasured ways across
individuals and there are no spillover effects that might be anticipated: for
example, if an untreated neighbor of someone in the treatment group somehow
benefits from the treatment via the neighbor who is in the group.

stable unit treatment value assumption: The condition that an instrument has no
spillover effect.

8 Suppose, for example, that the effect of education on future wages differs for
students who like school (they learn a lot in school, so more school leads to higher
wages) and students who hate school (they learn little in school, so more school
does not lead to higher wages for them). If we use month of birth as an instrument,
then the variation in years of schooling we are looking at is only the variation
among people who would or would not drop out of high school after their
sophomore year, depending on when they turned 16. The effect of schooling for
those folks might be pretty small, but that's what the 2SLS approach will estimate.
Imbens (2014) and Chapter 4 of Angrist and Pischke (2009) discuss these
points in detail and provide mathematical derivations. Sovey and Green (2011)
discuss these and related points, with a focus on instrumental variables in
political science.
Key Terms
Exclusion condition (300)
Identified (318)
Inclusion condition (300)
Instrumental variable (297)
Local average treatment effect (324)
Monotonicity (324)
Overidentification test (309)
Quasi-instrument (311)
Reduced form equation (317)
Simultaneous equation model (315)
Stable unit treatment value assumption (324)
Two-stage least squares (295)
Weak instrument (312)
Computing Corner
Stata
• The rule of thumb when there is only one instrument is that the t
statistic on the instrument in the first stage should be greater than
3. The higher, the better.
• When there are multiple instruments, run an F test using the test
command. The rule of thumb is that the F statistic should be larger
than 10.
reg X1 Z1 Z2 X2 X3
test Z1=Z2=0
R

1. To estimate a 2SLS model in R, we can use the ivreg command from the
AER package.
• See page 85 on how to install the AER package. Recall that we need
to tell R to use the package with the library command below for
each R session in which we use the package.
• If there is only one instrument, the rule of thumb is that the t statistic
on the instrument in the first stage should be greater than 3. The
higher, the better.
lm(X1 ~ Z1 + X2 + X3)
• For a simultaneous equation model, indicate after the | symbol the
reduced form variables that will be included (which is all variables but the
other dependent variable):
library(AER)
ivreg(Y1 ~ Y2 + W1 + Z1 | Z1 + W1 + Z2)
ivreg(Y2 ~ Y1 + W1 + Z2 | Z1 + W1 + Z2)
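To see the two stages that ivreg automates, here is a hand-rolled sketch in Python (simulated data; the variable names are illustrative, not from the book). It fits the endogenous X from the instrument Z, regresses Y on the fitted values, and compares the result to naive OLS:

```python
import random
import statistics

random.seed(2)

def slope(x, y):
    """OLS slope of y on x (a regression with an intercept)."""
    xbar, ybar = statistics.mean(x), statistics.mean(y)
    num = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    den = sum((a - xbar) ** 2 for a in x)
    return num / den

n = 5000
Z = [random.gauss(0, 1) for _ in range(n)]                 # instrument
U = [random.gauss(0, 1) for _ in range(n)]                 # unobserved confounder
X = [z + u + random.gauss(0, 1) for z, u in zip(Z, U)]     # endogenous regressor
Y = [2.0 * x + 3.0 * u + random.gauss(0, 1) for x, u in zip(X, U)]  # true effect is 2

naive = slope(X, Y)  # biased upward: U sits in the error and is correlated with X

# First stage: fit X from Z. Second stage: regress Y on the fitted values.
gamma1 = slope(Z, X)
zbar, xbar = statistics.mean(Z), statistics.mean(X)
X_hat = [xbar + gamma1 * (z - zbar) for z in Z]
two_sls = slope(X_hat, Y)

print(round(naive, 2), round(two_sls, 2))  # naive well above 2; 2SLS near 2
```

In practice we still use ivreg (or Stata's ivregress) rather than this two-step shortcut, because the canned routines compute the correct second-stage standard errors.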
Exercises
1. Does economic growth reduce the odds of civil conflict? Miguel,
Satyanath, and Sergenti (2004) used an instrumental variables approach
to assess the relationship between economic growth and civil war. They
provided data (available in RainIV.dta) on 41 African countries from 1981
to 1999, including the variables listed in Table 9.7.
(b) Add control variables for initial GDP, democracy, mountains, and
ethnic and religious fractionalization to the model in part (a). Do
these results establish a causal relationship between the economy
and civil conflict?
InternalConflict Coded 1 if civil war with greater than 25 deaths and 0 otherwise
LaggedGDPGrowth Lagged GDP growth
InitialGDPpercap GDP per capita at the beginning of the period of analysis, 1979
Democracy A measure of democracy (called a “polity” score); values range from −10 to 10
(d) Explain in your own words how instrumenting for GDP with rain
could help us identify the causal effect of the economy on civil conflict.
(e) Use the dependent and independent variables from part (b), but
now instrument for lagged GDP growth with lagged rainfall growth.
Comment on the results.
(f) Redo the 2SLS model in part (e), but this time, use dummy
variables to add country fixed effects. Comment on the quality of the
instrument in the first stage and the results for the effect of lagged
economic growth in the second stage.
(g) (funky) Estimate the first stage from the 2SLS model in part (f), and
save the residuals. Then estimate a regular OLS model that includes
the same independent variables from part (f) and country dummies.
Use lagged GDP growth (do not use fitted values), and now include
the residuals from the first stage you just saved. Compare the
coefficient on lagged GDP growth you get here to the coefficient on
that variable in the 2SLS. Discuss how endogeneity is being handled
in this specification.
2. Can television inform people about public affairs? It’s a tricky question
because the nerds (like us) who watch public-affairs-oriented TV are
pretty well informed to begin with. Therefore, political scientists Bethany
Albertson and Adria Lawrence (2009) conducted a field experiment in
which they randomly assigned people to treatment and control conditions.
Those assigned to the treatment condition were told to watch a specific
television broadcast about affirmative action and that they would be
interviewed about what they had seen. Those in the control group were not
told about the program but were told that they would be interviewed again
later. The program they studied aired in California prior to the vote on
Proposition 209, a controversial proposition relating to affirmative action.
Their data (available in NewsStudy.dta) includes the variables listed in
Table 9.8.
Education Education level (eighth grade or less = 1 to advanced graduate degree = 13)
TreatmentGroup Assigned to watch program (treatment = 1; control = 0)
WatchProgram Actually watched program (watched = 1, did not watch = 0)
InformationLevel Information about Proposition 209 prior to election (none = 1 to great deal = 4)
(b) Estimate the model in part (a), but now include measures of political
interest, newspaper reading, and education. Are the results different?
Have we defeated endogeneity?
(e) What do the 2SLS results suggest about the effect of watching the
program on information levels? Compare the results to those in part
(b). Have we defeated endogeneity?
3. Suppose we want to understand the demand curve for fish. We’ll use the
following demand curve equation:
QuantityDt = β0 + β1 Pricet + εDt
(a) To see that prices and quantities are endogenous, draw supply and
demand curves and discuss what happens when the demand curve
shifts out (which corresponds to a change in the error term of the
demand function). Note also what happens to price in equilibrium
and discuss how this event creates endogeneity.
(b) The data set fishdata.dta (from Angrist, Graddy, and Imbens 2000)
provides data on prices and quantities of a certain kind of fish (called
whiting) over 111 days at the Fulton Street Fish Market, which then
existed in Lower Manhattan. The variables are indicated in Table 9.9.
The price and quantity variables are logged. Estimate a naive OLS
model of demand in which quantity is the dependent variable and
price is the independent variable. Briefly interpret results, and then
discuss whether this analysis is useful.
(c) Angrist, Graddy, and Imbens suggest that a dummy variable indicat-
ing a storm at sea is a good instrumental variable that should affect
the supply equation but not the demand equation. Stormy is a dummy
variable that indicates a wave height greater than 4.5 feet and wind
speed greater than 18 knots. Use 2SLS to estimate a demand function
in which Stormy is an instrument for Price. Discuss first-stage and
second-stage results, interpreting the most relevant portions.
(d) Reestimate the demand equation but with additional controls. Con-
tinue to use Stormy as an instrument for price, but now also include
covariates that account for the days of the week and the weather on
shore. Discuss first-stage and second-stage results, interpreting the
most relevant portions.
(a) Estimate a model with prison as the dependent variable and educa-
tion, age, and African-American as independent variables. Make this
a fixed effects model by including dummies for state of residence
(state) and year of census data (year). Report and briefly describe
the results.
(b) Based on the OLS results, can we causally conclude that increasing
education will reduce crime? Why is it difficult to estimate the effect
of education on criminal activity?
(c) Lochner and Moretti used 2SLS to improve upon their OLS esti-
mates. They used changes in compulsory attendance laws (set by
ca11 Dummy equals 1 if state compulsory schooling is 11 or more years and 0 otherwise
FIPS codes are Federal Information Processing Standards codes for states (and also countries).
(d) Estimate a 2SLS model using the instruments just described and the
control variables from the OLS model above (including state and
year dummy variables). Briefly explain the results.
(e) 2SLS is known for being less precise than OLS. Is that true here? Is
this a problem for the analysis in this case? Why or why not?
(a) Are countries with higher income per capita more democratic? Run
a pooled regression model with democracy (democracy_fh) as the
dependent variable and logged GDP per capita (log_gdp) as the
year Year
YearCode Order of years of data set (1955 = 1, 1960 = 2, 1965 = 3, etc.)
(b) Rerun the model from part (a), but now include fixed effects for year
and country. Describe the model. How does including these fixed
effects change the results?
(c) To better establish causality, the authors use 2SLS. One of the
instruments that they use is changes in the income of trading partners
(worldincome). They theorize that the income of a given country’s
trading partners should predict its own GDP but should not directly
affect the level of democracy in the country. Discuss the viability
of this instrument with specific reference to the conditions that
instruments need to satisfy. Provide evidence as appropriate.
CHAPTER 10 Experiments: Dealing with Real-World Challenges
Yi = β0 + β1 Treatmenti + εi (10.1)
where Yi is the outcome we care about and Treatmenti equals one for subjects in
the treatment group. In reality, randomized experiments face a host of challenges.
Not only are they costly, potentially infeasible, and sometimes unethical, as
discussed in Section 1.3, they run into several challenges that can undo the desired
exogeneity of randomized experiments. This chapter focuses on these challenges.
Section 10.1 discusses the challenges raised by possible dissimilarity of the
treatment and control groups. If the treatment group differs from the control group
in ways other than the treatment, we can’t be sure whether it’s the treatment or
other differences that explain differences across these groups. Section 10.2 moves
on to the challenges raised by non-compliance with assignment to an experimental
group. Section 10.3 shows how to use the 2SLS tools from Chapter 9 to deal
with non-compliance. Section 10.4 discusses the challenge posed to experiments
by attrition, a common problem that arises when people leave an experiment.
Section 10.5 changes gears to discuss natural experiments, which occur without
intervention by researchers.
We refer to the attrition, balance, and compliance challenges facing experiments
as ABC issues.2 Every analysis of experiments should discuss these ABC issues
explicitly.

ABC issues: Three issues that every experiment needs to address: attrition,
balance, and compliance.

1 Often the control group is given a placebo treatment of some sort. In medicine, this is the
well-known sugar pill. In social science, a placebo treatment may be an experience that shares the
form of the treatment but not the content. For example, in a study of advertising efficacy, a placebo
group might be shown a public service ad. The idea is that the mere act of viewing an ad, any ad,
could affect respondents and that ad designers want their ad to cause changes over and above that
baseline effect.
2 We actually discuss balance first, followed by compliance and then attrition, because this order
follows the standard sequence of experimental analysis. We’ll stick with calling them ABC issues,
though, because BCA doesn’t sound as cool as ABC.
10.1 Randomization and Balance
group for their own reasons. Or maybe the folks doing the randomization
screwed up.
In other cases, the treatment and control groups may differ simply due to
chance. Suppose we want to conduct a random experiment on a four-person family
of mom, dad, big sister, and little brother. Even if we pick the two-person treatment
and control groups randomly, we’ll likely get groups that differ in important ways.
Maybe the treatment group will be dad and little brother—too many guys there. Or
maybe the treatment group will be mom and dad—too many middle-aged people
there. In these cases, any outcome differences between the treatment and control
groups would be due not only to the treatment but also possibly to the sex or
age differences. Of course the odds that the treatment and control groups differ
substantially fall rapidly as the sample size increases (a good reason to have a big
sample!). The chance that such differences occur never completely disappears,
however.
Xi = γ0 + γ1 TreatmentAssignedi + νi (10.2)
where TreatmentAssignedi is 1 for those assigned to the treatment group and 0 for
those assigned to the control group. We use γ (gamma) to indicate the coefficients
and ν (nu) to indicate the error term. We do not use β and ε here, to emphasize that
the model differs from the main model (Equation 10.1). We estimate Equation 10.2
for each potential independent variable; each equation will produce a different γ̂1
estimate. A statistically significant γ̂1 estimate indicates that the X variable differed
across those assigned to the treatment and control groups.3
Ideally, we won’t see any statistically significant γ̂1 estimates; this outcome
would indicate that the treatment and control groups are balanced. If the γ̂1
estimates are statistically significant for many X variables, we do not have balance
in our experimentally assigned groups, which suggests systematic interference
with the planned random assignments.
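With a binary TreatmentAssigned variable, Equation 10.2 is just a difference of means test: the intercept is the control-group mean and the slope is the treatment-minus-control gap. A minimal sketch in Python with invented data:

```python
import statistics

# Hypothetical pre-treatment ages, with assignment indicators (1 = treatment).
assigned = [1, 1, 1, 0, 0, 0]
age      = [34, 29, 42, 36, 30, 38]

treat_mean   = statistics.mean(a for a, d in zip(age, assigned) if d == 1)
control_mean = statistics.mean(a for a, d in zip(age, assigned) if d == 0)

# OLS of age on assigned recovers exactly these two quantities:
gamma0_hat = control_mean               # intercept = control-group mean
gamma1_hat = treat_mean - control_mean  # slope = difference of means

print(gamma0_hat, gamma1_hat)
```

Running the regression instead of computing the means directly gives the same estimates plus a standard error on γ̂1, which is what we use to judge statistical significance.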
We should keep statistical power in mind when we evaluate balance tests. As
discussed in Section 4.4, statistical power relates to the probability of rejecting
3 More advanced balance tests also allow us to assess whether the variance of a variable is the same
across treatment and control groups. See, for example, Imai (2005).
the null hypothesis when we should. Power is low in small data sets, since
when there are few observations, we are unlikely to find statistically significant
differences in treatment and control groups even when there really are differences.
In contrast, power is high for large data sets; that is, we may observe statistically
significant differences even when the actual differences are substantively small.
Hence, balance tests are sensitive not only to whether there are differences across
treatment and control groups but also to the factors that affect power. We should
therefore be cautious in believing we have achieved balance in a small sample set,
and we should be sure to assess the substantive importance of any differences we
see in large samples.
What if the treatment and control groups differ for only one or two variables?
Such an outcome is not enough to indicate that randomization failed. Recall that
even when there is no difference between treatment and control groups, we will
reject the null hypothesis of no difference 5 percent of the time when α = 0.05.
Thus, for example, if we look at 20 variables, it would be perfectly natural for
the means of the treatment and control groups to differ statistically significantly
for one of those variables.
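The arithmetic behind this point: at α = 0.05, running 20 balance tests yields one false rejection on average, and (treating the tests as independent for illustration) the chance of at least one is well over half. In Python:

```python
alpha = 0.05  # significance level for each balance test
k = 20        # number of variables tested

expected_false_rejections = k * alpha        # one false rejection on average
p_at_least_one = 1 - (1 - alpha) ** k        # chance of at least one

print(expected_false_rejections, round(p_at_least_one, 2))  # 1.0 and about 0.64
```

In a real sample the tests are not independent, but the basic message stands: an isolated significant difference is expected, not alarming.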
Good results on balancing tests also suggest (without proving) that balance
has been achieved even on the variables we can’t measure. Remember, the key to
experiments is that no unmeasured factor in the error term is correlated with the
independent variable. Given that we cannot see the darn things in the error term,
it seems a bit unfair to expect us to have any confidence about what’s going on in
there. However, if balance has been achieved for everything we can observe, we
can reasonably (albeit cautiously) speculate that the treatment and control groups
are also balanced for factors we cannot observe.
Yi = β0 + β1 Treatmenti + β2 Agei + εi
For example, suppose we are analyzing an experiment in which job training was
randomly assigned within a certain population. In assessing whether the training
helped people get jobs, we would not want to control for test scores measured after
the treatment because the scores could have been affected by the training. Since
part of the effect of treatment may be captured by this post-treatment variable,
including such a post-treatment variable will muddy the analysis.
REMEMBER THIS
1. Experimental treatment and control groups are balanced if the average values of independent
variables are not substantially different for people assigned to treatment and control groups.
2. We check for balance by conducting difference of means tests for all possible independent
variables.
3. When we assess the effect of a treatment, it is a good idea to control for imbalanced variables.
Healthit = β0 + β1 Aidit + β2 Xit + εit

where Healthit is the health of person i at time t, Aidit is the amount of foreign aid
going to person i’s village at time t, and Xit represents one or more variables that
affect the health of person i at time t. The problem is that the error may be correlated
with aid. Aid may flow to places where people are truly needy, with economic and
social problems that go beyond any simple measure of poverty. Or resources may
flow to places that are actually better off and better able to attract attention than
simple poverty statistics would suggest.
In other words, aid is probably endogenous. And because we cannot know if
aid is positively or negatively correlated with the error term, we have to admit that
we don’t know whether the actual effects are larger or smaller than what we observe
with the observational analysis. That’s not a particularly satisfying study.
If the government resources flowed exogenously, however, we could analyze
health and other outcomes and be much more confident that we are measuring
the effect of the aid. One example of a confidence-inspiring study is the Progresa
experiment in Mexico, described in Gertler (2004). In the late 1990s the Mexican
government wanted to run a village-based health care program but realized it
did not have enough resources to cover all villages at once. The government
decided the fairest way to pick villages was to pick them randomly, and voila! an
experiment was born. Government authorities randomly selected 320 villages as
treatment cases and implemented the program there. The Mexican government
also monitored 185 control villages, where no new program was implemented. In
the program, eligible families received a cash transfer worth about 20 to 30 percent
of household income if they participated in health screening and education
activities, including immunizations, prenatal visits, and annual health checkups.
Before assessing whether the treatment worked, analysts needed to assess
whether randomization worked. Were villages indeed selected randomly, and if
so, were they similar with regard to factors that could influence health? Table 10.1
provides results for balancing tests for the Progresa program. The first column has
the γ̂0 estimates from Equation 10.2 for various X variables. These are the averages
of the variable in question for the young children in the control villages. The second
column displays the γ̂1 estimates, which indicate how much higher or lower the
TABLE 10.1 Balancing Tests for the Progresa Experiment: Difference of Means Tests
Using OLS
Dependent variable                         γ̂0      γ̂1      t stat (γ̂1)   p value (γ̂1)
11. Male daily wage rate (pesos) 31.22 −0.74 0.90 0.37
12. Female daily wage rate (pesos) 27.84 −0.58 0.69 0.49
Results from 12 different OLS regressions in which the dependent variable is as listed at left. The coefficients are
from the model Xi = γ0 + γ1 Treatmenti + νi (see Equation 10.2).
average of the variable in question is for children in the treatment villages. For
example, the first line indicates that the children in the treatment village were 0.01
year older than the children in the control village. The t statistic is very small for
this coefficient and the p value is high, indicating that this difference is not at all
statistically significant. For the second row, the male variable equals 1 for boys and
0 for girls. The average of this variable indicates the percent of the sample that
were boys. In the control villages, 49 percent of the children were males; 51 percent
(γ̂0 + γ̂1 ) of the children in the treatment villages were male. This 2 percent difference
is statistically significant at the 0.10 level (given that p < 0.10). The most statistically
significant difference we see is in mother’s years of education, for which the p value
is 0.06. In addition, houses in the treatment group were less likely to have electricity
(p = 0.09).
The study author took the results to indicate that balance had been achieved.
We see, though, that achieving balance is an art, rather than a science, because
for 12 variables, only one or perhaps two would be expected to be statis-
tically significant at the α = 0.10 level if there were, in fact, no differences
across the groups. These imbalances should not be forgotten; in this case, the
analysts controlled for all the listed variables when they estimated treatment
effects.
And by the way, did the Progresa program work? In a word, yes. Results from
difference of means tests revealed that kids in the treatment villages were sick less
often, taller, and less likely to be anemic.
4 Researchers in this area are careful to analyze only students who actually applied for the vouchers.
This is because the students (and parents) who apply for vouchers for private schools almost certainly
differ systematically from students (and parents) who do not.
FIGURE 10.1 Treatment assignment and compliance. Random assignment splits
subjects into Zi = 1 and Zi = 0. For Zi = 1, compliance is non-random: Ti = 1 for
compliers and Ti = 0 for non-compliers. For Zi = 0, compliance is unobserved and
Ti = 0.
lines in Figure 10.1 indicate that we can’t know who among the control group are
would-be compliers and would-be non-compliers.5
We can see the mischief caused by non-compliance when we think about
how to compare treatment and control groups in this context. We could compare
the students who actually went to the private school (Ti = 1) to those who didn’t
(Ti = 0). Note, however, that the Ti = 1 group includes only compliers—students
who, when given the chance to go to a private school, took it. These students
are likely to be more academically ambitious than the non-compliers. The Ti =
0 group includes non-compliers (for whom Zi = 1) and those not assigned
to treatment (for whom Zi = 0). This comparison likely stacks the deck in
favor of finding that the private schools improve test scores because this Ti =
1 group has a disproportionately high proportion of educationally ambitious
students.
Another option is to compare the compliers (the Zi = 1 and Ti = 1 students)
to the whole control group (the Zi = 0 students). This method, too, is problematic.
The control group has two types of students—would-be compliers and would-be
non-compliers—while the treatment group in this approach only has compliers.
Any differences found with this comparison could be attributed either to the effect
of the private school or to the absence of non-compliers from the complier group,
whereas the control group includes both complier types and non-complier types.
5 An additional wrinkle in the real world is that people from the control group may find a way to
receive the treatment without being assigned to treatment. For example, in the New York voucher
experiment just discussed, 5 percent of the control group ended up in private schools without having
received a voucher.
10.2 Compliance and Intention-to-Treat Models
Intention-to-treat models

A better approach is to conduct an intention-to-treat (ITT) analysis. To conduct an
ITT analysis, we compare the means of those assigned treatment (the whole Zi = 1
group, which consists of those who complied and those who did not comply with
the treatment) to those not assigned treatment (the Zi = 0 group, which consists
of would-be compliers and would-be non-compliers). The ITT approach sidesteps
non-compliance endogeneity at the cost of producing estimates that are statistically
conservative (meaning that we expect the estimated coefficients to be smaller
than the actual effect of the treatment).

intention-to-treat (ITT) analysis: ITT analysis addresses potential endogeneity
that arises in experiments owing to non-compliance. We compare the means of
those assigned treatment and those not assigned treatment, irrespective of whether
the subjects did or did not actually receive the treatment.

To understand ITT, let's start with the non-ITT model we really care about:

Yi = β0 + β1 Treatmenti + εi (10.4)

Yi = δ0 + δ1 Zi + νi (10.5)
the individuals who, being more academically ambitious kids, may have been
more likely to use the private school vouchers. ITT avoids this problem by
comparing all kids given a chance to use the vouchers to all kids not given that
chance.
ITT is not costless, however. When there is non-compliance, ITT will under-
estimate the treatment effect. This means the ITT estimate, δ̂1, is a lower-bound
estimate of β1, the effect of the treatment itself from Equation 10.4. In other
words, we expect the magnitude of the δ̂1 estimate from Equation 10.5 to be
smaller than or equal to the β1 parameter in Equation 10.4.
To see why, consider the two extreme possibilities: zero compliance and full
compliance. If there is zero compliance, such that no one assigned treatment
complied (Ti = 0 for all Zi = 1), then δ1 = 0 because there is no difference
between the treatment and control groups. (No one took the treatment!) At the
other extreme, if everyone assigned treatment (Zi = 1) also complied (Ti = 1),
then the Treatmenti variable in Equation 10.4 will be identical to Zi (treatment
assignment) in Equation 10.5. In this instance, β̂1 will be an unbiased estimator
of β1 because there are no non-compliers messing up the exogeneity of the
random experiment. In this case, β̂1 = δˆ1 because the variables in the models are
identical.
Hence, we know that the ITT estimate δ̂1 is going to be somewhere
between zero and an unbiased estimator of the true treatment effect. The lower
the compliance, the more the ITT estimate will be biased toward zero. The
ITT estimator is still preferable to β̂1 from a model with treatment received
when there are non-compliance problems; this is because β̂1 can be biased
when compliers differ from non-compliers, causing endogeneity to enter the
model.
The ITT approach is a cop-out, but in a good way. When we use it, we’re
being conservative in the sense that the estimate will be prone to underestimate
the magnitude of the treatment effect. If the ITT approach reveals an effect, it will
be due to treatment, not to endogenous non-compliance issues.
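To see this attenuation concretely, here is a small simulation sketch (all numbers are hypothetical, not from any study in the chapter): the true treatment effect is 10, but only 40 percent of those assigned treatment comply, so the ITT difference in means by assignment lands near 0.4 × 10 = 4, well below the true effect.

```python
import random

random.seed(7)
n = 100_000
true_effect = 10.0    # hypothetical beta_1 from Equation 10.4
compliance = 0.4      # hypothetical share of the assigned group that complies

y_assigned, y_control = [], []
for _ in range(n):
    z = random.random() < 0.5                   # random treatment assignment
    t = z and (random.random() < compliance)    # treatment actually received
    y = 50 + true_effect * t + random.gauss(0, 5)
    (y_assigned if z else y_control).append(y)

# ITT estimate: difference in means by *assignment* (delta_1 in Equation 10.5)
itt = sum(y_assigned) / len(y_assigned) - sum(y_control) / len(y_control)
# itt lands near compliance * true_effect = 4, not near true_effect = 10
```

The simulation also illustrates the two extremes discussed above: set `compliance = 0` and the ITT estimate goes to zero; set `compliance = 1` and it recovers the full treatment effect.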
Researchers regularly estimate ITT effects. Sometimes whether someone did
or did not comply with a treatment is not known. For example, if the experimenter
mailed advertisements to randomly selected households, it will be very hard, if
not impossible, to know who actually read the ads (Bailey, Hopkins, and Rogers
2015).
Or sometimes the ITT effect is the most relevant quantity of interest.
Suppose, for example, we know that compliance will be spotty and we want
to build non-compliance into our estimate of a program’s effectiveness. Miguel
and Kremer (2004) analyzed an experiment in Kenya that provided medical
treatment for intestinal worms to children at randomly selected schools. Some
children in the treated schools, however, missed school the day the medicine
was administered. An ITT analysis in this case compares kids assigned to
treatment (whether or not they were in school on that day) to kids not assigned
to treatment. Because some kids will always miss school for a treatment like this,
policy makers may care more about the ITT estimated effect of the treatment
10.2 Compliance and Intention-to-Treat Models 345
because ITT takes into account both the treatment effect and the less-than-perfect
compliance.
REMEMBER THIS
1. In an experimental context, a person assigned to receive a treatment who actually receives the
treatment is said to comply with the treatment.
2. When compliers differ from non-compliers, non-compliance creates endogeneity.
3. ITT analysis compares people assigned to treatment (whether they complied or not) to people
in the control group.
• ITT is not vulnerable to endogeneity due to non-compliance.
• ITT estimates will be smaller in magnitude than the true treatment effect. The more
numerous the non-compliers, the closer to zero the ITT estimates will be.
Discussion Questions
1. Will there be balance problems if there is non-compliance? Why or why not?
2. Suppose there is non-compliance but no signs of balance problems. Does this mean the
non-compliance must be harmless? Why or why not?
3. For each of the following scenarios, discuss (i) whether non-compliance is likely to be an issue,
(ii) the likely implication of non-compliance for comparing those who received treatment to the
control group, and (iii) what exactly an ITT variable would consist of.
(a) Suppose an international aid group working in a country with low literacy rates randomly
assigned children to a treatment group that received one hour of extra reading help each
day and a control group that experienced only the standard curriculum. The dependent
variable is a reading test score after one year.
(b) Suppose an airline randomly upgraded some economy class passengers to business class.
The dependent variable is satisfaction with the flight.
(c) Suppose the federal government randomly selected a group of school districts that could
receive millions of dollars of aid for revamping their curriculum. The control group
receives nothing from the program. The dependent variable is test scores after three years.
346 CHAPTER 10 Experiments: Dealing with Real-World Challenges
where Turnouti equals 1 for people who voted and 0 for those who did
not.6 The independent variable is whether or not someone was contacted by a
campaign.
What is in the error term? Certainly, political interest will be in it, because more
politically attuned people are more likely to vote. We'll have endogeneity if
political interest (incorporated in the error term) is correlated with contact by a
campaign (the independent variable). We will probably have endogeneity because
campaigns do not want to waste time contacting people who won’t vote. Hence,
we’ll have endogeneity unless the campaign is incompetent (or, ironically, run by
experimentalists).
Such endogeneity could corrupt the results easily. Suppose we find a positive
association between campaign contact and turnout. We should worry that the
relationship is due not to the campaign contact but to the kind of people who
were contacted—namely, those who were more likely to vote before they were
contacted. Such concerns make it very hard to analyze campaign effects with
observational data.
6 The dependent variable is a dichotomous variable. We discuss such dependent variables in more detail in Chapter 12.
10.3 Using 2SLS to Deal with Non-compliance 347
Professors Alan Gerber and Don Green (2000, 2005) were struck by these
problems with observational studies and have almost single-handedly built an
empire of experimental studies in American politics.7 As part of their signature
study, they randomly assigned citizens to receive in-person visits from a get-
out-the-vote campaign. In their study, all the factors that affect turnout would be
uncorrelated with assignment to receive the treatment.8
Compliance is a challenge in such studies. When campaign volunteers
knocked on doors, not everyone answered. Some people weren’t home. Some were
in the middle of dinner. Maybe a few ran out the back door screaming when they
saw a hippie volunteer ringing their doorbell.
Non-compliance, of course, could affect the results. If the more socially
outgoing types answered the door (hence receiving the treatment) and the more
reclusive types did not (hence not receiving the treatment even though they were
assigned to it), the treatment variable as delivered would depend not only on the
random assignment but also on how outgoing a person was. If more outgoing
people are more likely to vote, then treatment as delivered will be correlated with
the sociability of the experimental subject, and we will have endogeneity.
To get around this problem, Gerber and Green used treatment assignment as
an instrument. This variable, which we’ve been calling Zi , indicates whether a
person was randomly selected to receive a treatment. This variable is well suited
to satisfy the requisite conditions for a good instrument discussed in Section 9.2.
First, Zi should be included in the first stage because being randomly assigned to
be contacted by the campaign does indeed increase campaign contact. Table 10.2
shows the results from the first stage of Gerber and Green’s turnout experiment.
The dependent variable, treatment delivered, is 1 if the person actually talked to
the volunteer canvasser and 0 otherwise. The independent variable is whether the
person was or was not assigned to treatment.
TABLE 10.2 First Stage of the Turnout Experiment (excerpt; the coefficient on treatment assignment, 0.279, is discussed below)

Constant    0.000
            (0.000)
            [t = 0.00]
N           29,380
7 Or should we say double-handedly? Or, really, quadruple-handedly?
8 The study also looked at other campaign tactics, such as phone calls and mailing postcards. These didn't work as well as the personal visits; for simplicity, we focus on the in-person visits.
These results suggest that 27.9 percent of those assigned to be visited were
actually visited. In other words, 27.9 percent of the treatment group complied
with the treatment. This estimate is hugely statistically significant, in part owing
to the large sample size. The intercept is 0.0, implying that no one in the
non-contact-assigned group was contacted by this particular get-out-the-vote
campaign.
The treatment assignment variable Zi is also highly likely to satisfy the 2SLS
exclusion condition because it affects Y only through people actually getting
campaign contact. Being assigned
to be contacted by the campaign in and of itself does not affect turnout. Note that
we are not saying that the people who actually complied (received a campaign
contact) are random; all the concerns about compliance just discussed come into
play here. We are simply saying that when we put a check
next to randomly selected names indicating that they should be visited, these folks
were indeed randomly selected. That means that Z is uncorrelated with the error term and can
therefore be excluded from the main equation.
In the second-stage regression, we use the fitted values from the first-stage
regression as the independent variable. Table 10.3 shows that the effect of a
personal visit is to increase probability of turning out to vote by 8.7 percentage
points. This estimate is statistically significant, as we can see from the t stat, 3.34.
We could improve the precision of the estimates by adding covariates, but doing
so is not necessary to avoid bias.
          Assigned contact (Z)   Actual contact (T)   Fitted contact (T̂)
Laura     1                      1                    0.279
Bryce     1                      0                    0.279
Gio       0                      0                    0.000
This selection was randomly determined. In the second column is actual contact,
which is observed contact by the campaign. Laura answered the door when the
campaign volunteer knocked, but Bryce did not. (No one went to poor Gio’s door.)
The third column displays the fitted value from the first-stage equation for the
treatment variable. These fitted values depend only on contact assignment. Laura
and Bryce were randomly assigned to be visited (Z = 1), so both their fitted values
were T̂ = 0.0 + 0.279 × 1 = 0.279 even though Laura was actually contacted and
Bryce wasn't. Gio was not assigned to be visited (Z = 0), so his fitted contact
value was T̂ = 0.0 + 0.279 × 0 = 0.0.
2SLS uses the “contact-fitted” (T̂) variable. It is worth taking the time to really
understand T̂, which might be the weirdest thing in the whole book.9 Even though
Bryce was not contacted, his T̂i is 0.279, just the same as Laura, who was in fact
visited. Clearly, this variable looks very different from actual observed campaign
contact. Yes, this is odd, but it’s a feature, not a bug. The core inferential problem,
as we’ve noted, is endogeneity in actual observed contact. Bryce might be avoiding
contact because he loathes politics. That’s why we don’t want to use observed
contact as a variable—it would capture not only the effect of contact but also the
fact that the type of people who get contact in observational data are different. The
fitted value, however, varies only according to Z—something that is exogenous.
In other words, by looking at the bump up in expected contact associated with
being in the randomly assembled contact-assigned group, we have isolated the
exogenous bump up in contact associated with the exogenous factor and can assess
whether it is associated with a corresponding bump up in voting turnout.
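The logic of the last few paragraphs can be sketched in a short simulation (all numbers hypothetical, not Gerber and Green's actual data). Unobserved sociability drives both answering the door and voting, so a raw treated-versus-untreated comparison would be biased; dividing the ITT difference by the first-stage compliance rate, which is what 2SLS amounts to with a single binary instrument, recovers the assumed effect.

```python
import random

random.seed(0)
n = 50_000
true_effect = 0.087   # hypothetical effect of an actual visit on turnout

rows = []
for _ in range(n):
    z = 1 if random.random() < 0.5 else 0        # randomly assigned a visit
    social = 1 if random.random() < 0.5 else 0   # unobserved sociability
    # outgoing people answer the door more often: endogenous compliance
    t = 1 if z == 1 and random.random() < 0.28 + 0.2 * social else 0
    vote = 1 if random.random() < 0.45 + true_effect * t + 0.10 * social else 0
    rows.append((z, t, vote))

assigned = [r for r in rows if r[0] == 1]
control = [r for r in rows if r[0] == 0]

# First stage: compliance rate among the assigned (the intercept is zero here
# because no one in the control group was visited)
first_stage = sum(r[1] for r in assigned) / len(assigned)

# ITT: difference in turnout by *assignment*
itt = (sum(r[2] for r in assigned) / len(assigned)
       - sum(r[2] for r in control) / len(control))

# IV estimate = ITT / first stage; with one binary instrument and no
# covariates, this ratio equals the 2SLS estimate
iv_estimate = itt / first_stage
```

In practice we would use canned 2SLS routines (as in the Computing Corner) to get correct standard errors; this sketch only shows why the fitted-value logic removes the bias from endogenous door answering.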
REMEMBER THIS
1. 2SLS is useful for analyzing experiments when there is imperfect compliance with the
experimental treatment.
2. Assignment to treatment typically satisfies the inclusion and exclusion conditions necessary
for instruments in 2SLS analysis.
9 Other than the ferret thing in Chapter 3—also weird.
where Arrested later is 1 if the person is arrested at some later date for domestic
violence and 0 otherwise, Arrested initially is 1 if the suspect was arrested at the
time of the initial domestic violence report and 0 otherwise, and X refers to other
variables, such as whether a weapon or drugs were involved in the first incident.
Why might there be endogeneity? (That is, why might we suspect a cor-
relation between Arrested initially and the error term?) Elements in the error
term include person-specific characteristics. Some people who have police called
on them are indeed nasty; let’s call them the bad eggs. Others are involved
in a once-in-a-lifetime incident; in the overall population of people who have
police called on them, they are the (relatively) good eggs. Such personality
traits are in the error term of the equation predicting domestic violence in the
future.
We could also easily imagine that people’s good or bad eggness will affect
whether they are arrested initially. Police who arrive at the scene of a domestic
violence incident involving a bad egg will, on average, find more threat; police who
arrive at the scene of an incident involving a (relatively) good egg will likely find
the environment less threatening. We would expect police to arrest the bad egg
types more often, and we would expect these folks to have more problems in the
future. Observational data could therefore suggest that arrests make things worse
because those arrested are more likely to be bad eggs and therefore more likely to
be rearrested.
N 314

               Assigned arrest (Z)   Actual arrest (T)   Fitted arrest (T̂)
Observation 1  1                     1                   0.989
Observation 2  1                     0                   0.989
Observation 3  0                     1                   0.216
Observation 4  0                     0                   0.216
The first column shows that OLS estimates a decrease of 7 percentage points in
probability of a rearrest later. The independent variable was whether someone was
actually arrested. This group includes people who were randomly assigned to be
arrested and people in the no-arrest-assigned treatment group who were arrested
anyway. We worry about bias when we use this variable because we suspect that
the bad eggs were more likely to get arrested.10
The second column shows that ITT estimates being assigned to the arrest
treatment lowers the probability of being arrested later by 10.8 percentage points.
This result is more negative than the OLS estimate and is statistically significant. The
ITT model avoids endogeneity because treatment assignment cannot be correlated
with the error term. The approach will understate the true effect when there was
non-compliance, either because some people not assigned to the treatment got it
or because not everyone who was assigned to the treatment actually received it.
The third column shows the 2SLS results. In this model, the independent
variable is the fitted value of the treatment. The estimated coefficient on arrest is
even more negative than the ITT estimate, indicating that the probability of rearrest
for individuals who were arrested is 14 percentage points lower than for individuals
who were not initially arrested. The magnitude is double the effect estimated by
OLS. This result implies that Minneapolis can on average reduce the probability of
another incident by 14 percentage points by arresting individuals on the initial call.
2SLS is the best model because it accounts for non-compliance and provides an
unbiased estimate of the effect that arresting someone initially has on likelihood of
a future arrest.

10 The OLS model reported here is still based on partially randomized data because many people were arrested owing to the randomization in the police protocol. If we had purely observational data with no randomization, the bias of OLS would be worse, as it's likely that only bad eggs would have been arrested.
This study was quite influential and spawned similar investigations elsewhere;
see Berk, Campbell, Klap, and Western (1992) for more details.
10.4 Attrition
where Attritioni equals 1 for observations for which we do not observe the
dependent variable and equals 0 when we observe the dependent variable. A
statistically significant δˆ1 would indicate differential attrition across treatment and
control groups.
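A minimal sketch of this diagnostic, with made-up data in which weaker students drop out of the treatment group more often: the difference in attrition rates across groups plays the role of δ̂1, and a rough z statistic for a difference in proportions flags differential attrition.

```python
import math
import random

random.seed(1)
n = 4_000
data = []
for _ in range(n):
    treat = random.random() < 0.5
    weak = random.random() < 0.3
    # hypothetical: weak students in the treatment group drop out more often
    attrit = random.random() < (0.15 if treat and weak else 0.05)
    data.append((treat, attrit))

treat_attrit = [a for tr, a in data if tr]
control_attrit = [a for tr, a in data if not tr]
p_t = sum(treat_attrit) / len(treat_attrit)
p_c = sum(control_attrit) / len(control_attrit)
delta1_hat = p_t - p_c    # difference in attrition rates across groups

# rough z statistic for a difference in proportions
p_pool = (sum(treat_attrit) + sum(control_attrit)) / n
se = math.sqrt(p_pool * (1 - p_pool)
               * (1 / len(treat_attrit) + 1 / len(control_attrit)))
z = delta1_hat / se    # a large z flags differential attrition
```

Running the attrition regression described above (attrition on treatment assignment) gives the same difference in rates as its slope coefficient; the hand calculation here just makes the diagnostic transparent.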
We can add some nuance to our evaluation of attrition by looking for
differential attrition patterns in the treatment and control groups. Specifically,
we can investigate whether the treatment variable interacted with one or more
covariates in a model explaining attrition. In our analysis of a randomized charter
school experiment, we might explore whether high test scores in earlier years were
associated with differential attrition in the treatment group. If we use the tools for
interaction variables discussed in Section 6.4, the model would be
One approach here would be to trim the control group by removing another 5 percent
of the weakest students before doing our analysis so that both groups in the data
now have 10 percent attrition rates. This practice is statistically conservative in the
sense that it makes it harder to observe a statistically significant treatment effect
because it is unlikely that literally all of those who dropped out from the treatment
group were the worst students.
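The trimming step can be sketched as follows (scores and attrition rates are hypothetical): if the treatment group has lost 10 percent of its students but the control group has lost only 5 percent, we drop the weakest remaining 5 percent of the control group so both groups have equal attrition rates before comparing means.

```python
import random

random.seed(2)
# hypothetical post-attrition control-group test scores
control_scores = sorted(random.gauss(70, 10) for _ in range(1_000))

extra_attrition = 0.05                      # gap in attrition rates
drop = int(extra_attrition * len(control_scores))
trimmed_control = control_scores[drop:]     # remove the weakest 5 percent

# conservative comparison: treatment-group mean vs. trimmed control mean
trimmed_mean = sum(trimmed_control) / len(trimmed_control)
```

Because dropping the weakest control students raises the control mean, any remaining treatment effect estimated against the trimmed group is, as the text notes, statistically conservative.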
selection model Simultaneously accounts for whether we observe the dependent variable and what the dependent variable is.

A third approach to attrition is to use a selection model. The most famous
selection model is the Heckman selection model (Heckman 1979). In this approach, we
would model both the process of being observed (a dichotomous
variable equaling 1 for those for whom we observe the dependent variable and
0 for others) and the outcome (the model with the dependent variable of interest,
such as test scores). These models build on the probit model we shall discuss in
Chapter 12. More details are in the Further Reading section at the end of this
chapter.
REMEMBER THIS
1. Attrition occurs when individuals drop out of an experiment, causing us to lack outcome data
for them.
2. Non-random attrition can cause endogeneity even when treatment is randomly assigned.
3. We can detect problematic attrition by looking for differences in attrition rates across treated
and control groups.
4. Attrition can be addressed by using multivariate OLS, trimmed data sets, or selection
models.
Discussion Question
Suppose each of the following experimental populations suffered from attrition. Speculate on
the likely implications of not accounting for attrition in the analysis.
(a) Researchers were interested in the effectiveness of a new drug designed to lower
cholesterol. They gave a random set of patients the drug; the rest got a placebo pill.
(b) Researchers interested in rehabilitating former prisoners randomly assigned some
newly released individuals to an intensive support group. The rest received no such
access. The dependent variable was an indicator for returning to prison within five
years.
plans, which would mean the people in the generous plans would be healthier than
others. Here, too, the diabetes in the error term would be correlated with the type
of health plan, although in the other direction.
Thus, we have a good candidate for a randomized experiment, which is exactly
what ambitious researchers at RAND Corporation designed in the 1970s. They
randomly assigned people to various health plans, including a free plan that
covered medical care at no cost and various cost-sharing plans that had different
levels of co-payments. With randomization, the type of people assigned to a free
plan should be expected to be the same as the type of people assigned to a
cost-sharing plan. The only expected difference between the groups should be their
health plans; hence, to the extent that the groups differed in utilization or health
outcomes, the differences could be attributed to differences in the health plans.
The RAND researchers found that medical expenses were 45 percent higher for
people in plans with no out-of-pocket medical expenses than for those who had
stingy insurance plans (which required people to pay 95 percent of costs, up to a
$1,000 yearly maximum). In general, health outcomes were no worse for those in
the stingy plans.11 This experiment has been incredibly influential—it is the reason
we pay $10 or whatever when we check out of the doctor’s office.
Attrition is a crucial issue in evaluating the RAND experiment. Not everyone
stayed in the experiment. Inevitably in such a large study, some people moved,
some died, and others opted out of the experiment because they were unhappy
with the plan in which they were randomly placed. The threat to the validity of this
experiment is that this attrition may have been non-random. If the type of people
who stayed with one plan differed systematically from the type of people who
stayed with another plan, comparing health outcomes or utilization rates across
these groups may be inappropriate, given that the groups differ both in their health
plans and in the type of people who remain in the wake of attrition.
Aron-Dine, Einav, and Finkelstein (2013) reexamined the RAND data in light of
attrition and other concerns. They showed that 1,894 people had been randomly
assigned to the free plan. Of those, 114 (6 percent) were non-compliers who
declined to participate. Of the remainder who participated, 89 (5 percent) left the
experiment. These low numbers for non-compliance and attrition are not very sur-
prising. The free plan was gold plated, covering everything. The cost-sharing plan
requiring the highest out-of-pocket expenditures had 1,121 assigned participants.
Of these, 269 (24 percent) declined the opportunity to participate, and another
145 (13 percent) left the experiment. These patterns contrast markedly with the
non-compliance and attrition patterns for the free plan.
What kind of people would we expect to leave a cost-sharing plan? Probably
people who ended up paying a lot of money under the plan. And what kind
of people would end up paying a lot of money under a cost-sharing plan? Sick
people, most likely. So that means we have reason to worry that the free plan
had all kinds of people, but that the cost-sharing plans had a sizable hunk of
sick people who pulled out. So any finding that the cost-sharing plans yielded
the same health outcomes could have one of two causes: the plans did not
have different health impacts, or the free plan was better but had a sicker
population.

11 Outcomes for people in the stingy plans were worse for some subgroups and some conditions, however, leading the researchers to suggest programs targeted at specific conditions rather than providing fee-free service for all health care.
Aron-Dine, Einav, and Finkelstein (2013) therefore conducted an analysis on
a trimmed data set based on techniques from Lee (2009). They dropped the
highest spenders in the free-care plan until they had a data set with the same
proportion of observations from those assigned to the free plan and to the costly
plan. Comparing these two groups is equivalent to assuming that those who
left the costly plan were the patients requiring the most expensive care; since
this is unlikely to be completely true, the results from such a comparison are
considered a lower bound—actual differences between the groups would be
larger if some of the people who dropped out from the costly plan were not
among the most expensive patients. The results indicated that the effect of the
cost-sharing plan was still negative, meaning that it lowered expenditures. However,
the magnitude of the effect was less than the magnitude reported in the initial
study, which did little to account for differential attrition across the various types
of plans.
Review Questions
Consider a hypothetical experiment in which researchers evaluated a program that paid teachers a
substantial bonus if their students’ test scores rose. The researchers implemented the program in 50
villages and also sought test score data in 50 randomly selected villages.
Table 10.8 on the next page provides results from regressions using data available to the
researchers. Each column shows a bivariate regression in which Treatment was the independent
variable. This variable equaled 1 for villages where teachers were paid for student test scores and
0 for the control villages.
Researchers also had data on average village income, village population, and whether or not test
scores were available (a variable that equals 1 for villages that reported test scores and 0 for villages
that did not report test scores.)
1. Is there a balance problem? Use specific results in the table to justify your answer.
2. Is there an attrition problem? Use specific results in the table to justify your answer.
3. Did the treatment work? Justify your answer based on results here, and discuss what, if any,
additional information you would like to see.
TABLE 10.8 Regression Results for Models Relating Teacher Payment Experiment
(for Review Questions)

                               Dependent Variable
            Test scores   Village population   Village income   Test score availability
Treatment   24.0∗         −20.00               500.0∗           0.20∗
            (8.00)        (100.0)              (200.0)          (0.08)
            [t = 3.00]    [t = 0.20]           [t = 2.50]       [t = 2.50]
12 Greene went on to get only 28 percent of the vote in the general election but vowed to run for president anyway.
10.5 Natural Experiments 361
Ideally, we would randomize which candidates are listed first to see whether those at the
top of the ballot do better. Conceptually, that's not too hard, but practically, it is a lot to
ask, given that election officials are pretty protective of how they run elections.
In the 1998 Democratic primary in New York City, however, election officials
decided on their own to rotate the order of candidates’ names by precinct. Political
scientists Jonathan Koppell and Jennifer Steen got wind of this decision and
analyzed the election as a natural experiment. Their 2004 paper found that in
71 of 79 races, candidates received more votes in precincts where they were
listed first. In seven of those races, the differences were enough to determine the
election outcome. That’s pretty good work for an experiment the researchers didn’t
even set up.
Researchers have found other clever opportunities for natural experiments.
An important question is whether economic stimulus packages of tax cuts and
government spending increases that were implemented in response to the 2008
recession boosted growth. At a first glance, such analysis should be easy. We know
how much the federal government cut taxes and increased spending. We also know
how the economy performed. Of course things are not so simple because, as former
chair of the Council of Economic Advisers Christina Romer (2011) noted, “Fiscal
actions are often taken in response to other things happening in the economy.”
When we look at the relationship between two variables, like consumer spending
and the tax rebate, we “need to worry that a third variable, like the fall in wealth,
is influencing both of them. Failing to take account of this omitted variable leads
to a biased estimate of the relationship of interest.”
One way to deal with this challenge is to find exogenous variation in stimulus
spending that is not correlated with any of the omitted variables we worry about.
This is typically very hard, but sometimes natural experiments pop up. For
example, Parker, Souleles, Johnson, and McClelland (2013) noted that the 2008
stimulus consisted of tax rebate checks that were sent out in stages according to
the last two digits of recipients’ Social Security numbers. Thus, the timing was
effectively random for each family. After all, the last two digits are essentially
randomly assigned to people when they are born. This means that the timing of
the government spending by family was exogenous. An analyst’s dream come
true! The researchers found that family spending among those that got a check
was almost $500 higher than among those who did not, bolstering the case that the fiscal
stimulus boosted consumer spending.
REMEMBER THIS
1. In a natural experiment, the values of the independent variable have been determined by a
random, or at least exogenous, process.
2. Natural experiments are widely used and can be analyzed with OLS, 2SLS, or other
tools.
Conclusion
Experiments are incredibly promising for statistical inference. To find out if X
causes Y, do an experiment. Change X for a random subset of people. Compare
what happens to Y for the treatment and control groups. The approach is simple,
elegant, and has been used productively countless times.
For all their promise, though, experiments are like movie stars—idealized by
many but tending to lose some luster in real life. Movie stars’ teeth are a bit yellow,
and they aren’t particularly witty without a script. By the same token, experiments
don’t always achieve balance; they sometimes suffer from non-compliance and
attrition; and in many circumstances they aren’t feasible, ethical, or generalizable.
For these reasons, we need to take particular care when examining experi-
ments. We need to diagnose and, if necessary, respond to ABC issues (attrition,
balance, and compliance). Every experiment needs to assess balance to ensure
that the treatment and control groups do not differ systematically except for the
treatment. Many social science experiments also have potential non-compliance
problems since people can choose not to experience the randomly assigned
treatment. Non-compliance can induce endogeneity if we use Treatment delivered
as the independent variable, but we can get back to unbiased inference if we use
ITT or 2SLS to analyze the experiment. Finally, at least some people invariably
leave the experiment, which can be a problem if the attrition is related to the
treatment. Attrition is hard to overcome but must be diagnosed, and if it is a
problem, we should at least use multivariate OLS or trimmed data to lessen the
validity-degrading effects.
The following steps provide a general guide to implementing and analyzing
a randomized experiment:
2. Randomly pick a subset of the population and give them the treatment.
The rest are the control group.
(a) Assess balance with difference of means tests for all possible
independent variables.
(b) If there are imbalances, use multivariate OLS, controlling for vari-
ables that are unbalanced across treatment and control groups.
(d) If there is attrition, use multivariate OLS, trim the data, or use a
selection model.
• Section 10.3: Explain how 2SLS can be useful for experiments with
imperfect compliance.
• Section 10.4: Explain how attrition can create endogeneity, and describe
some steps we can take to diagnose and deal with attrition.
Further Reading
Experiments are booming in the social sciences. Gerber and Green (2012) provide
a comprehensive guide to field experiments. Banerjee and Duflo (2011) give
an excellent introduction to experiments in the developing world, and Duflo,
Glennerster, and Kremer (2008) provide an experimental toolkit that’s useful for
experiments in the developing world and beyond. Dunning (2012) has published
a detailed guide to natural experiments. Manzi (2012) provides a readable critique of
randomized experiments in social science and business; Manzi
(2012, 190) refers to a 2008 report to Congress that identified policies that
demonstrated significant results in randomized field trials.
Attrition is one of the harder things to deal with, and different analysts take
different approaches. Gerber and Green (2012, 214) discuss their approaches
to dealing with attrition. The large literature on selection models includes, for
example, Das, Newey, and Vella (2003). Some experimentalists resist using
selection models because those models rely heavily on assumptions about the
distributions of error terms and functional form.
Imai, King, and Stuart (2008) discuss how to use blocking to get more
efficiency and less potential for bias in randomized experiments.
Key Terms
ABC issues (334)
Attrition (354)
Balance (336)
Blocking (335)
Compliance (340)
Intention-to-treat analysis (343)
Natural experiment (360)
Selection model (356)
Trimmed data set (355)
Computing Corner
Stata
1. To assess balance, estimate a series of bivariate regression models with
all X variables as dependent variables and treatment assignment as
independent variables:
reg X1 TreatmentAssignment
reg X2 TreatmentAssignment
R
1. To assess balance, estimate a series of bivariate regression models with
all “X” variables as dependent variables and treatment assignment as
independent variables:
lm(X1 ~ TreatmentAssignment)
lm(X2 ~ TreatmentAssignment)
Exercises
1. In an effort to better understand the effects of get-out-the-vote messages
on voter turnout, Gerber and Green (2005) conducted a randomized field
experiment involving approximately 30,000 individuals in New Haven,
Connecticut, in 1998. One of the experimental treatments was randomly
assigned in-person visits where a volunteer visited the person’s home and
encouraged him or her to vote. The file GerberGreenData.dta contains the
variables described in Table 10.10.
(c) Use ITT to estimate the effect of being assigned treatment on whether
someone turned out to vote. Is this estimate likely to be higher
or lower than the actual effect of being contacted? Is it subject to
endogeneity?
(d) Use 2SLS to estimate the effect of contact on voting. Compare the
results to the ITT results. Justify your choice of instrument.
(e) We can use ITT results and compliance rates to generate a Wald
estimator, which is an estimate of the treatment effects calculated by
dividing the ITT effect by the coefficient on the treatment assignment
variable in the first-stage model of the 2SLS model. (If no one in the
non-treatment-assignment group gets the treatment, this coefficient
will indicate the compliance rate; more generally, this coefficient
indicates the net effect of treatment assignment on probability of
treatment observed.) Calculate this quantity by using the results in
parts (b) and (c), and compare it to the 2SLS results. It helps to be as
precise as possible. Are they different? Discuss.
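To make the Wald logic concrete, here is an illustrative simulation (Python, with invented numbers; this is not the Gerber–Green data). Assignment is random, compliance depends on an unobserved trait, and dividing the ITT effect by the first-stage coefficient recovers the effect of contact even though a naive regression of turnout on contact is biased:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
assigned = rng.integers(0, 2, size=n)             # random treatment assignment
ability = rng.normal(size=n)                      # unobserved confounder
# One-sided noncompliance: more able people are more likely to be contacted.
contacted = assigned * (rng.random(n) < 1 / (1 + np.exp(-ability)))
turnout = 0.4 * contacted + 0.5 * ability + rng.normal(size=n)

def diff_in_means(y, d):
    """Coefficient on d from a bivariate regression of y on a dummy d."""
    return y[d == 1].mean() - y[d == 0].mean()

itt = diff_in_means(turnout, assigned)            # intention-to-treat effect
first_stage = diff_in_means(contacted, assigned)  # net effect of assignment on contact
wald = itt / first_stage                          # Wald estimator of the contact effect
naive = diff_in_means(turnout, contacted)         # biased: contact tracks ability
```

Here the Wald ratio lands near the true effect (0.4), while the naive estimate is inflated because compliers differ systematically from non-compliers.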
(g) Estimate a 2SLS model including controls for Ward 2 and Ward 3
residence and the number of people in the household. Do you expect
the results to differ substantially? Why or why not? Explain how the
first-stage results differ from the balance tests described earlier.
(a) Check balance in treatment versus control for all possible independent variables.
(c) Are the compliers different from the non-compliers? Provide evidence to support your answer.
(d) In the first round of the experiment, 805 participants were interviewed and assigned to either the treatment or the control condition.
After the program aired, 507 participants were re-interviewed about
the program. With only 63 percent of the participants re-interviewed,
what problems are created for the experiment?
(e) In this case, data (even pretreatment data) is available only for the
507 people who did not leave the sample. Is there anything we
can do?
3. In their 2004 paper “Are Emily and Greg More Employable than
Lakisha and Jamal? A Field Experiment on Labor Market Discrimination,”
Marianne Bertrand and Sendhil Mullainathan discuss the results
of their field experiment on randomizing names on job resumes. To
assess whether employers treated African-American and white applicants
similarly, they created fictitious resumes and randomly assigned
white-sounding names (e.g., Emily and Greg) to half of the resumes
and African-American-sounding names (e.g., Lakisha and Jamal) to
the other half. They sent these resumes in response to help-wanted
ads in Chicago and Boston and collected data on the number of
callbacks received. Table 10.11 describes the variables in the data set
resume_HW.dta.
education 0 = not reported; 1 = some high school; 2 = high school graduate; 3 = some college;
4 = college graduate or more
yearsexp Number of years of work experience
(a) What issues are associated with studying the effects of new schools
in Afghanistan that are not randomly assigned?
(d) On page 68, we noted that if errors are correlated, the standard
OLS estimates for the standard error of β̂ are incorrect. In this
(h) Calculate the effect on test scores of being in a treatment village,
controlling for age of child, sex, number of sheep family owns, length of
time family lived in village, farmer, years of education for household
head, number of people in household, and distance to nearest school.
Use the standard errors that account for within-village correlation of
errors. Is the coefficient on treatment substantially different from the
bivariate OLS results? Why or why not? Briefly note any control
variables that are significantly associated with higher test scores.
(i) Compare the sample size for the enrollment and test score data. What
concern does this comparison raise?
(j) Assess whether attrition was associated with treatment. Use the
standard errors that account for within-village correlation of errors.
11 Regression Discontinuity: Looking for Jumps in Data
[Figure: normalized grade plotted against age relative to the 21st-birthday cutoff.]
We’ve included fit lines to help make the pattern clear. Those who had not
yet turned 21 scored higher. There is a discontinuity at the zero point in the figure
(corresponding to students taking a test on their 21st birthday). If we can’t come
up with another explanation for test scores to change at this point, we have pretty
good evidence that drinking hurts grades.
regression discontinuity (RD) analysis  Techniques that use regression analysis to identify possible discontinuities at the point at which some treatment applies.

Regression discontinuity (RD) analysis formalizes this logic by using
regression analysis to identify possible discontinuities at the point of application
of the treatment. For the drinking age case, RD analysis involves fitting an OLS
model that allows us to see if there is a discontinuity at the point students become
legally able to drink.
RD analysis has been used in a variety of contexts in which a treatment of
interest is determined by a strict cutoff. Card, Dobkin, and Maestas (2009) used RD
analysis to examine the effect of Medicare on health because Medicare eligibility
kicks in the day someone turns 65. Lee (2008) used RD analysis to study the
effect of incumbency on reelection to Congress because incumbents are decided
by whoever gets more votes. Lerman (2009) used RD analysis to assess the effect
11.1 Basic RD Model
Yi = β0 + β1 Ti + β2 (X1i − C) + εi    (11.1)
where
Ti = 1 if X1i ≥ C
Ti = 0 if X1i < C
[Figure 11.2: The dependent variable (Y) plotted against the assignment variable (X1). The fitted line has slope β2; at the cutoff C, the intercept jumps from β0 to β0 + β1, so the jump at the cutoff is β1.]
Figure 11.3 displays more examples of results from RD models. In panel (a),
β1 is positive, just as in Figure 11.2, but β2 is negative, creating a downward slope
for the assignment variable. In panel (b), the treatment has no effect, meaning that
β1 = 0. Even though everyone above the cutoff received the treatment, there is no
discernible discontinuity in the dependent variable at the cutoff point. In panel (c),
β1 is negative because there is a jump downward at the cutoff, implying that the
treatment lowered the dependent variable.
[Figure 11.3: Examples of results from RD models, panels (a)–(c).]
One of the cool things about RD analysis is that even if the error term is
correlated with the assignment variable, the estimated effect of the treatment is still
valid. To see why, suppose C = 0, the error and assignment variable are correlated,
and we characterize the correlation as follows:
εi = ρX1i + νi    (11.2)
where the Greek letter rho (ρ, pronounced “row”) captures how strongly the
error and X1i are related and νi is a random term that is uncorrelated with X1i .
In the Medicare example, mortality is the dependent variable, the treatment T is
Medicare (which kicks in the second someone turns 65), age is the assignment
variable, and health is in the error term. It is totally reasonable to believe that health
is related to age, and we use Equation 11.2 to characterize such a relationship.
If we estimate the following model, which does not control for the assignment
variable (X1i) and thus treats the outcome as a function of Medicare only,

Yi = β0 + β1 Ti + εi

the Medicare variable will pick up not only the effect
of the program but also the effect of health, which is in the error term, which is
correlated with age, which is in turn correlated with Medicare.
If we control for X1i, however, the correlation between T and ε disappears. To
see why, we begin with the basic RD model (Equation 11.1). For simplicity, we
assume C = 0. Using Equation 11.2 to substitute for εi yields

Yi = β0 + β1 Ti + β2 X1i + εi
   = β0 + β1 Ti + β2 X1i + ρX1i + νi
   = β0 + β1 Ti + (β2 + ρ)X1i + νi
   = β0 + β1 Ti + β̃2 X1i + νi
Notice that we have an equation in which the error term is now νi (the part of
Equation 11.2 that is uncorrelated with anything). Hence, the treatment variable,
T, in the RD model is uncorrelated with the error term even though the assignment
variable is correlated with the error term. This means that OLS will provide an
unbiased estimate of β1 , the coefficient on Ti .
Meanwhile, the coefficient we estimate on the X1i assignment variable is β̃ 2
(notice the squiggly on top), a combination of β2 (with no squiggly on top and
the actual effect of X1i on Y) and ρ (the degree of correlation between X1i and the
error term in the original model, ε).
Thus, we do not put a lot of stock in the estimated coefficient on the
assignment variable, because it combines the actual effect of the
assignment variable and the correlation of the assignment variable with the error.
That’s OK, though, because our main interest is in the effect of the treatment, β1 .
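This algebra can be checked with a quick simulation (an illustrative sketch, not from the text; all parameter values are invented). The error is built to be correlated with the assignment variable. Omitting the assignment variable badly biases the treatment estimate, while the RD model recovers the true jump; the coefficient on the assignment variable estimates β2 + ρ, not β2:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
x1 = rng.uniform(-1, 1, size=n)        # assignment variable, cutoff C = 0
T = (x1 >= 0).astype(float)            # treatment indicator
rho = 2.0
eps = rho * x1 + rng.normal(size=n)    # error correlated with the assignment variable
y = 1.0 + 3.0 * T + 1.5 * x1 + eps     # true treatment effect is 3, true beta2 is 1.5

def ols(y, *cols):
    """OLS coefficients with an intercept prepended."""
    X = np.column_stack([np.ones_like(y)] + list(cols))
    return np.linalg.lstsq(X, y, rcond=None)[0]

b1_no_control = ols(y, T)[1]           # omits x1: badly biased (picks up rho too)
coefs_rd = ols(y, T, x1)
b1_rd = coefs_rd[1]                    # close to the true treatment effect of 3
slope_rd = coefs_rd[2]                 # estimates beta2 + rho = 3.5, not beta2
```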
REMEMBER THIS
An RD analysis can be used when treatment depends on an assignment variable being above some
cutoff C.
Yi = β0 + β1 Ti + β2 (X1i − C) + εi
where
Ti = 1 if X1i ≥ C
Ti = 0 if X1i < C
2. RD models require that the error term be continuous at the cutoff. That is, the value of the error
term must not jump up or down at the cutoff.
3. RD analysis identifies a causal effect of treatment because the assignment variable soaks up
the correlation of error and treatment.
Discussion Questions
1. Many school districts pay for new school buildings with bond issues that must be approved by
voters. Supporters of these bond issues typically argue that new buildings improve schools and
thereby boost housing values. Cellini, Ferreira, and Rothstein (2010) used RD analysis to test
whether passage of school bonds caused housing values to rise.
(a) What is the assignment variable?
(b) Explain how to use a basic RD approach to estimate the effect of school bond passage
on housing values.
(c) Provide a specific equation for the model.
2. U.S. citizens are eligible for Medicare the day they turn 65 years old. Many believe that people
with health insurance are less likely to die prematurely because they will be more likely to seek
treatment and doctors will be more willing to conduct tests and procedures for them. Card,
Dobkin, and Maestas (2009) used RD analysis to address this question.
(a) What is the assignment variable?
(b) Explain how to use a basic RD approach to estimate the effect of Medicare coverage on
the probability of dying prematurely.
(c) Provide a specific equation for the model. (Don’t worry that the dependent variable is a
dummy variable; we’ll deal with that issue later on in Chapter 12.)
Yi = β0 + β1 Ti + β2 (X1i − C) + β3 Ti × (X1i − C) + εi

where
Ti = 1 if X1i ≥ C
Ti = 0 if X1i < C
The new term at the end of the equation is an interaction between T and
X1 − C. The coefficient on that interaction, β3 , captures how different the slope is
for observations where X1 is greater than C. The slope for untreated observations
(for which Ti = 0) will simply be β2 , which is the slope for observations to
the left of the cutoff. The slope for the treated observations (for which Ti = 1)
will be β2 + β3 , which is the slope for observations to the right of the cutoff.
(Recall our discussion in Chapter 6, page 202, regarding the proper interpretation
of coefficients on interactions.)
Figure 11.4 displays examples in which the slopes differ above and below the
cutoff. In panel (a), β2 = 1 and β3 = 2. Because β3 is greater than zero, the slope
is steeper for observations to the right of the cutoff. The slope for observations to
the left of the cutoff is 1 (the value of β2 ), and the slope for observations to the
right of the cutoff is β2 + β3 = 3.
In panel (b) of Figure 11.4, β3 is zero, meaning that the slope is the same (and
equal to β2 ) on both sides of the cutoff. In panel (c), β3 is less than zero, meaning
that the slope is less steep for observations for which X1 is greater than C. Note
that just because β3 is negative, the slope for observations to the right of the cutoff
need not be negative (although it may be). A negative value of β3 simply means
that the slope is less steep for observations to the right of the cutoff. In panel (c),
β3 = −β2 , which is why the slope is zero to the right of the cutoff.
In estimating an RD model with varying slopes, it is important to use X1i − C
instead of X1i for the assignment variable. In this model, we’re estimating two
separate lines. The intercept for the line for the untreated group is β̂0 , and the
intercept for the line for the treated group is β̂0 + β̂1 . If we used X1i as the
assignment variable (instead of X1i − C), the β̂1 estimate would indicate the
differences in treated and control when X1i is zero even though we care about
the difference between treated and control when X1i equals the cutoff. By using
X1i − C instead of X1i for the assignment variable, we have ensured that β̂1 will
indicate the difference between treated and control when X1i − C is zero, which
occurs, of course, when X1i = C.
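As an illustrative sketch (not from the text; the parameters are invented), the varying slopes model can be estimated by OLS on a treatment dummy, the centered assignment variable, and their interaction. Centering is what makes the coefficient on the dummy the jump at the cutoff:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000
C = 100.0
x1 = rng.uniform(50, 150, size=n)      # assignment variable with cutoff at 100
T = (x1 >= C).astype(float)
xc = x1 - C                            # centered assignment variable
# True model: jump of 4 at the cutoff; slope 1 below, slope 1 + 2 = 3 above.
y = 2.0 + 4.0 * T + 1.0 * xc + 2.0 * T * xc + rng.normal(size=n)

X = np.column_stack([np.ones(n), T, xc, T * xc])
b0, b1, b2, b3 = np.linalg.lstsq(X, y, rcond=None)[0]
# b1 is the jump at the cutoff only because xc = 0 there;
# the slope below the cutoff is b2, and the slope above is b2 + b3.
```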
Polynomial model
Once we start thinking about how the slope could vary across different values of
X1 , it is easy to start thinking about other possibilities. Hence, more technical RD
analyses spend a lot of effort estimating relationships that are even more flexible
than the varying slopes model. One way to estimate more flexible relationships
between the assignment variable and outcome is to use our polynomial regression
model from Chapter 7 (page 221) to allow the relationship between X1 and Y to
wiggle and curve. The RD insight is that however wiggly that line gets, we’re still
looking for a jump (a discontinuity) at the point where the treatment kicks in.
11.2 More Flexible RD Models
For example, we can use polynomial models to allow the estimated lines to
curve differently above and below the treatment threshold with a model like the
following:
Yi = β0 + β1 Ti + β2 (X1i − C) + β3 (X1i − C)² + β4 Ti × (X1i − C) + β5 Ti × (X1i − C)² + εi

where
Ti = 1 if X1i ≥ C
Ti = 0 if X1i < C
Figure 11.5 shows two relationships that can be estimated with such a
polynomial model. In panel (a), the value of Y accelerates as X1 approaches the
cutoff, dips at the point of treatment, and accelerates again from that lower point. In
panel (b), the relationship appears relatively flat for values of X1 below the cutoff,
but there is a fairly substantial jump up in Y at the cutoff. After that, Y rises sharply
with X1 and then falls sharply.
It is virtually impossible to predict funky non-linear relationships like these
ahead of time. The goal is to find a functional form for the relationship between
X1 − C and outcomes that soaks up any relation between X1 − C and outcomes to
ensure that any jump at the cutoff reflects only the causal effect of the treatment.
This means we can estimate the polynomial models and see what happens even
without a full theory about how the line should wiggle.
With this flexibility comes danger, though. Polynomial models are quite
sensitive and sometimes can produce jumps at the cutoff that are bigger than they
should be. Therefore, we should always report simple linear models as well to
avoid seeming to be fishing around for a non-linear model that gives us the answer
we’d like.
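Here is a hedged sketch of one such polynomial fit (simulated data with invented parameters; a quadratic on each side of the cutoff is only one of many possible specifications):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5_000
xc = rng.uniform(-4, 6, size=n)        # assignment variable minus the cutoff
T = (xc >= 0).astype(float)
# Curvature differs on each side of the cutoff; the true jump at the cutoff is 2.
y = (1.0 + 2.0 * T + 0.3 * xc + 0.2 * xc**2
     - 0.15 * T * xc**2 + rng.normal(0, 0.5, size=n))

# Quadratic in xc, fully interacted with the treatment dummy.
X = np.column_stack([np.ones(n), T, xc, xc**2, T * xc, T * xc**2])
coefs = np.linalg.lstsq(X, y, rcond=None)[0]
jump = coefs[1]                        # estimated discontinuity at the cutoff
```

However wiggly the fitted curves, the quantity of interest remains the jump at the cutoff; reporting the simple linear fit alongside guards against over-fitting.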
REMEMBER THIS
When we conduct RD analysis, it is useful to allow for a more flexible relationship between
assignment variable and outcome.
• A varying slopes model allows the slope to vary on different sides of the treatment cutoff:
Yi = β0 + β1 Ti + β2 (X1i − C) + β3 Ti × (X1i − C) + εi
• We can also use polynomial models to allow for non-linear relationships between the
assignment and outcome variables.
Review Question
For each panel in Figure 11.6, indicate whether each of β1 , β2 , and β3 is less than, equal to, or greater
than zero for the varying slopes RD model:
[Figure 11.6: Six panels, (a)–(f), each plotting Y against X with the cutoff marked.]
Binned graphs
A convenient trick that helps us understand non-linearities and discontinuities in
our RD data is to create binned graphs. Binned graphs look like scatterplots
but are a bit different. To construct a bin plot, we divide the X1 variable into
[FIGURE 11.7: Smaller windows for fitted lines for polynomial RD model in Figure 11.5.]
multiple regions (or “bins”) above and below the cutoff; we then calculate the
average value of Y within each of those regions. When we plot the data, we get
something that looks like panel (a) of Figure 11.8. Notice that there is a single
observation for each bin, producing a graph that’s cleaner than a scatterplot of all
observations.
The bin plot provides guidance for selecting the right RD model. If the
relationship is highly non-linear or seems dramatically different above and below
the cutpoint, the bin plot will let us know. In panel (a) of Figure 11.8, we
see a bit of non-linearity because there is a U-shaped relationship between
X1 and Y for values of X1 below the cutoff. This relationship suggests that a
quadratic could be appropriate, or even simpler, the window could be narrowed
to focus only on the range of X1 where the relationship is more linear. Panel
(b) of Figure 11.8 shows the fitted lines based on an analysis that used only
observations for which X1 is between 900 and 2,200. The implied treatment
effect is the jump in the data indicated by β1 in the figure. We do not actually
use the binned data to estimate the model; we use the original data in our
regressions.
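A binned-means calculation of this sort can be sketched as follows (an illustrative Python version with invented data; the helper and its arguments are hypothetical). The bin edges are aligned to the cutoff so that no bin straddles it:

```python
import numpy as np

def binned_means(x, y, cutoff, width):
    """Average y within equal-width bins of x, with bin edges aligned
    to the cutoff so that no bin straddles it."""
    lo = np.arange(cutoff, x.min() - width, -width)[::-1]  # edges below, ascending
    hi = np.arange(cutoff, x.max() + width, width)         # edges at and above
    edges = np.concatenate([lo[:-1], hi])                  # cutoff included once
    idx = np.digitize(x, edges)
    centers, means = [], []
    for b in np.unique(idx):
        in_bin = idx == b
        centers.append(x[in_bin].mean())
        means.append(y[in_bin].mean())
    return np.array(centers), np.array(means)

rng = np.random.default_rng(5)
x = rng.uniform(-1000, 1000, size=8000)                    # assignment minus cutoff
y = 1500 + 500 * (x >= 0) + 0.4 * x + rng.normal(0, 300, size=8000)
centers, means = binned_means(x, y, cutoff=0.0, width=100.0)
# 'means' has one point per bin: far cleaner to plot than 8,000 raw points.
```

Plotting `means` against `centers` gives the kind of bin plot described above; the estimation itself still uses the raw data.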
11.3 Windows and Bins
REMEMBER THIS
1. It is useful to look at smaller window sizes when possible by considering only data close to the
treatment cutoff.
2. Binned graphs help us visualize the discontinuity and the possibly non-linear relationship
between assignment variable and outcome.
[Figure 11.9: Test score plotted against age relative to the pre-K eligibility cutoff.]
the kids who didn’t go to pre-K would score lower than the kids who did, simply
because they were younger. But unless the program boosted test scores, there is
no obvious reason for a discontinuity to be located right at the cutoff.
Table 11.1 shows results for the basic and varying slopes RD models.
For the basic model, the coefficient on the variable for pre-K is 3.492 and highly
significant, with a t statistic of 10.31. The coefficient indicates the jump that we see
in Figure 11.9. The age variable is also highly significant. No surprise there, as older
children did better on the test.
In the varying slopes model, the coefficient on the treatment is virtually
unchanged from the basic model, indicating a jump of 3.479 in test scores for
the kids who went to pre-K. The effect is again highly statistically significant, with
a t statistic of 10.23. The coefficient on the interaction is insignificant, however,
indicating that the slope on age is the same for kids who had pre-K and those who
didn’t.
11.4 Limitations and Diagnostics
[Table 11.1, bottom rows: N = 2,785 and R² = 0.323 for both models.]
Imperfect assignment
One drawback to the RD approach is that it’s pretty rare to have an assignment
variable that decisively determines treatment. If we’re looking at the effect of going
to a certain college, for example, we probably cannot use RD analysis because
admission was based on multiple factors, none of them cut and dried. Or if we're
trying to assess the effectiveness of a political advertising campaign, it’s unlikely
that the campaign simply advertised in cities where its poll results were less than
some threshold; instead, the managers probably selected certain criteria to identify
where they might advertise and then decided exactly where to run ads on the basis
of a number of factors (including gut feel).
fuzzy RD models  RD models in which the assignment variable imperfectly predicts treatment.

In the Further Reading section at the end of the chapter, we point to readings
on so-called fuzzy RD models, which can be used when the assignment variable
imperfectly predicts treatment. Fuzzy RD models can be useful when there
is a point at which treatment becomes much more likely but isn't necessarily
guaranteed. For example, a college might look only at people with test scores on an
admission exam of 160 or higher. Being above 160 may not guarantee admission,
but there is a huge leap in probability of admission for those who score 160 instead
of 159.
assignment variable itself acts peculiar at the cutoff. If the values of the assignment
variable cluster just above the cutoff, we should worry that people know about
the cutoff and are able to manipulate things to get over it. In such a situation,
it’s quite plausible that the people who are able to just get over the cutoff differ
from those who do not, perhaps because the former have more ambition (as in our
GPA example), or better contacts, or better information, or other advantages. To
the extent that these factors also affect the dependent variable, we’ll violate the
assumption that the error term does not have a discrete jump at the cutoff.
The best way to assess whether there is clustering on one side of the
cutoff is to create a histogram of the assignment variable and see if it shows
unusual activity at the cutoff point. Panel (a) in Figure 11.10 is a histogram of
assignment values in a case with no obvious clustering. The frequency of values
in each bin for the assignment variable bounces around a bit here and there,
but it’s mostly smooth. There is no clear jump up or down at the cutoff. In
contrast, the histogram in panel (b) shows clear clustering just above the cutoff.
When faced with data like panel (b), it’s pretty reasonable to suspect that the
word is out about the cutoff and people have figured out how to get over the
threshold.1
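As a rough numeric companion to the visual check (this is not a formal test, just an illustrative sketch with invented data), one can compare the counts in the bins immediately on either side of the cutoff:

```python
import numpy as np

def density_jump(x, cutoff, width):
    """Crude clustering check: ratio of the count of observations in the
    bin just above the cutoff to the count in the bin just below it."""
    above = np.sum((x >= cutoff) & (x < cutoff + width))
    below = np.sum((x >= cutoff - width) & (x < cutoff))
    return above / below

rng = np.random.default_rng(6)
clean = rng.uniform(-5, 5, size=4000)              # no manipulation
ratio_clean = density_jump(clean, 0.0, 1.0)        # should be near 1

# Manipulation: half the observations just below the cutoff get pushed above it.
manipulated = clean.copy()
push = (manipulated > -1) & (manipulated < 0) & (rng.random(4000) < 0.5)
manipulated[push] = rng.uniform(0, 1, size=push.sum())
ratio_manip = density_jump(manipulated, 0.0, 1.0)  # well above 1
```

A ratio far above 1, like the histogram spike it summarizes, is a warning sign that units are sorting across the cutoff.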
[Figure 11.10: Histograms of the assignment variable. Panel (a): no clustering at the cutoff. Panel (b): clear clustering just above the cutoff.]
¹ Formally testing for discontinuity of the assignment variable at the cutoff is a bit tricky. McCrary
(2008) has more details. Usually, a visual assessment provides a good sense of what is going on,
although it’s a good idea to try different bin sizes to make sure that what you’re seeing is not an
artifact of one particular choice for bin size.
The second diagnostic test involves assessing whether other variables act
weird at the discontinuity. For RD analysis to be valid, we want only Y, nothing
else, to jump at the point where T = 1. If some other variable jumps at the
discontinuity, we may wonder if people are somehow self-selecting (or being
selected) based on unknown additional factors. If so, it could be that the jump
at Y is being caused by these other factors jumping at the discontinuity, not the
treatment. A basic diagnostic test of this sort looks like
X2i = γ0 + γ1 Ti + γ2 (X1i − C) + νi
where
Ti = 1 if X1i ≥ C
Ti = 0 if X1i < C
A statistically significant γ̂1 coefficient from this model means that X2 jumps at
the treatment discontinuity, which casts doubt on the main assumption of the RD
model—namely, that the only thing happening at the discontinuity is movement
from the untreated to the treated category.
A significant γ̂1 from this diagnostic test doesn’t necessarily kill the RD
analysis, but we would need to control for X2 in the RD model and explain
why this additional variable jumps at the discontinuity. It also makes sense to
use varying slopes models, polynomial models, and smaller window sizes in
conducting balance tests.
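The diagnostic regression above can be sketched numerically (an illustrative simulation; the covariate and all numbers are invented). A covariate that varies smoothly through the cutoff should produce a small t statistic on the treatment dummy:

```python
import numpy as np

def covariate_jump_test(x2, T, xc):
    """Regress covariate x2 on the treatment dummy T and the centered
    assignment variable xc; return (gamma1_hat, t statistic) for the
    coefficient on T."""
    X = np.column_stack([np.ones_like(x2), T, xc])
    coefs, *_ = np.linalg.lstsq(X, x2, rcond=None)
    resid = x2 - X @ coefs
    sigma2 = resid @ resid / (len(x2) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coefs[1], coefs[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(7)
n = 4_000
xc = rng.uniform(-50, 50, size=n)                   # assignment minus cutoff
T = (xc >= 0).astype(float)
sat = 600.0 + 0.5 * xc + rng.normal(0, 40, size=n)  # smooth covariate: no jump
g1, t_stat = covariate_jump_test(sat, T, xc)
# A large |t_stat| would cast doubt on the RD assumption; here it should be small.
```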
Including any variable that jumps at the discontinuity is only a partial fix,
though, because if we observe a difference at the cutoff in a variable we can
measure, it's plausible that there is also a difference at the cutoff in a variable we can't
measure. We can measure education reasonably well. It’s a lot harder to measure
intelligence, however. And it’s extremely hard to measure conscientiousness. If
we see that people are more educated at the cutoff, we’ll worry that they are also
more intelligent and conscientious—that is, we’ll worry that at the discontinuity,
our treated group may differ from the untreated group in ways we can’t
measure.
Generalizability of RD results
An additional limitation of RD is that it estimates a very specific treatment effect,
also known as the local average treatment effect. This concept comes up for
instrumental variables models as well (as discussed in the Further Reading Section
of Chapter 9 on page 324). The idea is that the effects of the treatment may differ
within the population: a training program might work great for some types of
people but do nothing for others. The treatment effect estimated by RD analysis is
the effect of the treatment on folks who have X1 equal to the threshold. Perhaps the
treatment would have no effect on people with very low values of the assignment
variable. Or perhaps the treatment effect grows as the assignment variable grows.
RD analysis will not be able to speak to these possibilities because we observe only
the treatment happening at one cutoff. Hence, it is possible that the RD results will
not generalize to the whole population.
REMEMBER THIS
To assess the appropriateness of RD analysis:
1. Qualitatively assess whether people have control over the assignment variable.
2. Conduct diagnostic tests.
• Assess the distribution of the assignment variable by using a histogram to see if there is
clustering on one side of the cutoff.
• Run RD models, and use other covariates as dependent variables. The treatment should not
be associated with any discontinuity in any covariate.
FIGURE 11.11: Histogram of Age Observations for Drinking Age Case Study
We can also run diagnostic tests. Figure 11.11 shows the frequency of observa-
tions for students above and below the age cutoff. There is no sign of manipulation
of the assignment variable: the distribution of ages is mostly constant, with some
apparently random jumps up and down.
We can also assess whether other covariates showed discontinuities at the
21st birthday. Since, as discussed earlier, the defining RD assumption is that the
only discontinuity at the cutoff is in the dependent variable, we hope to see no
All three specifications control for age, allowing the slope to vary on either side of the cutoff. The second and third
specifications control for semester, SAT scores, and other demographic factors.
* indicates significance at p < 0.05, two-tailed.
Covariatei = γ0 + γ1 Ti + γ2 (Agei − C) + νi
where
Ti = 1 if X1i ≥ C
Ti = 0 if X1i < C
Table 11.3 shows results for three covariates: SAT math scores, SAT verbal scores, and
physical fitness. For none of these covariates is γ̂1 statistically significant, suggesting
that there is no jump in covariates at the point of the discontinuity, a conclusion that
is consistent with the idea that the only thing changing at the discontinuity is the
treatment.
Conclusion
RD analysis is a powerful statistical tool. It works even when the treatment we
are trying to analyze is correlated with the error. It works because the assignment
variable—a variable that determines whether a unit gets the treatment—soaks up
endogeneity. The only assumption we need is that there is no discontinuity in the
error term at the cutoff in the assignment variable X1 .
If we have such a situation, the basic RD model is super simple. It is just an
OLS model with a dummy variable (indicating treatment) and a variable indicating
distance to the cutoff. More complicated RD models allow more complicated
relationships between the assignment variable and the dependent variable. No
matter the model, however, the heart of RD analysis remains looking for a jump
in the value of Y at the cutoff point for assignment to treatment. As long as there
is no discontinuity in the relationship between the error and the outcome at the cutoff, we
can attribute any jump in the dependent variable to the effect of the treatment.
• Section 11.1: Write down a basic RD model, and explain all terms,
including treatment variable, assignment variable, and cutoff, as well as
how RD models overcome endogeneity.
• Section 11.2: Write down and explain RD models with varying slopes and
non-linear relationships.
Further Reading
Imbens and Lemieux (2008) and Lee and Lemieux (2010) go into additional detail
on RD designs, including discussions of fuzzy RD models. Bloom (2012) gives
another useful overview of RD methods. Cook (2008) provides a history of RD
applications. Buddelmeyer and Skoufias (2003) compare the performance of RD and
experiments and find that RD analysis works well as long as the discontinuity is
rigorously enforced.
See Grimmer, Hersh, Feinstein, and Carpenter (2010) for an example of using
diagnostics to critique RD studies with election outcomes as an RD assignment
variable.
Key Terms
Assignment variable (375)
Binned graphs (386)
Discontinuity (373)
Fuzzy RD models (392)
Regression discontinuity analysis (374)
Window (386)
Computing Corner
Stata
To estimate an RD model in Stata, create a dummy treatment variable and an
X1 − C variable and use the syntax for multivariate OLS.
4. To create a scatterplot with the fitted lines from a varying slopes RD model,
use the following:
graph twoway (scatter Y Assign) (lfit Y Assign if T == 0) /*
*/ (lfit Y Assign if T == 1)
R
To estimate an RD model in R, we create a dummy treatment variable and an
X1 − C variable and use the syntax for multivariate OLS.
4. There are many different ways to use R to create a scatterplot with the
fitted lines from a varying slopes RD model. Here is one example for a
model in which the assignment variable ranges from −1,000 to 1,000 with
a cutoff at zero. This example uses the results from the OLS regression
model RDResults:
plot(Assign, Y)
lines(-1000:0, RDResults$coef[1] + RDResults$coef[3] * (-1000:0))
lines(0:1000, RDResults$coef[1] + RDResults$coef[2] +
  (RDResults$coef[3] + RDResults$coef[4]) * (0:1000))
Exercises
1. As discussed on page 389, Gormley, Phillips, and Gayer (2008) used
RD analysis to evaluate the impact of pre-K on test scores in Tulsa.
Children born on or before September 1, 2001, were eligible to enroll
in the program during the 2005–2006 school year, while children born
after this date had to wait to enroll until the 2006–2007 school year.
Table 11.4 lists the variables. The pre-K data set covers 1,943 children
just beginning the program in 2006–2007 (preschool entrants) and 1,568
children who had just finished the program and began kindergarten in
2006–2007 (preschool alumni).
(a) Why should there be a jump in the dependent variable right at the
point where a child’s birthday renders him or her eligible to have
participated in preschool the previous year (2005–2006) rather than
the current year (2006–2007)? Should we see jumps at other points
as well?
(b) Assess whether there is a discontinuity at the cutoff for the free-lunch
status, gender, and race/ethnicity covariates.
[TABLE 11.4]
age: Days from the birthday cutoff. The cutoff value is coded as 0; negative values indicate days born after the cutoff; positive values indicate days born before the cutoff.
cutoff: Treatment indicator (1 = born before cutoff, 0 = born after cutoff).
wjtest01: Woodcock-Johnson letter-word identification test score.
freelunch: Eligible for free lunch based on low income in 2006–07 (1 = yes, 0 = no).
(c) Repeat the tests for covariate discontinuities, restricting the sample
to a one-month (30-day) window on either side of the cutoff. Do the
results change? Why or why not?
(f) Add controls for lunch status, gender, and race/ethnicity to the
model. Does adding these controls change the results? Why or why
not?
(g) Reestimate the model from part (f), limiting the window to one
month (30 days) on either side of the cutoff. Do the results change?
How do the standard errors in this model compare to those from the
model using the full data set?
(a) Assess whether there is a discontinuity at the cutoff for the free-lunch
status, gender, and race/ethnicity covariates.
(b) Repeat the tests for covariate discontinuities, restricting the sample
to a one-month (30-day) window on either side of the cutoff. Do the
results change? Why or why not?
(e) Add controls for lunch status, gender, and race/ethnicity to the
model. Do the results change? Why or why not?
(f) Reestimate the model from part (e), limiting the window to one
month (30 days) on either side of the cutoff. Do the results change?
How do the standard errors in this model compare to those from the
model using the full data set?
3. Congressional elections are decided by a clear rule: whoever gets the most
votes in November wins. Because virtually every congressional race in the
United States is between two parties, whoever gets more than 50 percent
of the vote wins.2 We can use this fact to estimate the effect of political
party on ideology. Some argue that Republicans and Democrats are very
distinctive; others argue that members of Congress have strong incentives
to respond to the median voter in their districts, regardless of party. We can
assess how much party matters by looking at the ideology of members of
Congress in the 112th Congress (which covered the years 2011 and 2012).
Table 11.5 lists the variables.
2 We'll look only at votes going to the two major parties, Democrats and Republicans, to ensure a nice 50 percent cutoff.
(b) How can an RD model fight endogeneity when we are trying to assess
if and how party affects congressional ideology?
(d) Write down a basic RD model for this question, and explain the
terms.
(g) Reestimate the varying slopes model, but use the unadjusted variable
(and unadjusted interaction). Compare the coefficient estimates
to your results in part (f). Calculate the fitted values for four
observations: a Democrat with GOP2party2010 = 0, a Democrat
with GOP2party2010 = 0.5, a Republican with GOP2party2010 =
0.5, and a Republican with GOP2party2010 = 1.0. Compare to the
fitted values in part (f).
[TABLE 11.5]
GOP2party2010: The percent of the vote received by the Republican congressional candidate in the district in 2010. Ranges from 0 to 1.
GOPwin2010: Dummy variable indicating the Republican won; equals 1 if GOP2party2010 > 0.5 and equals 0 otherwise.
Ideology: The conservatism of the member of Congress as measured by Carroll, Lewis, Lo, Poole, and Rosenthal (2009, 2014). Ranges from −0.779 to 1.293. Higher values indicate more conservative voting in Congress.
ChildPoverty: Percentage of district children living in poverty. Ranges from 0.03 to 0.49.
WhitePct: Percent of the district that is non-Hispanic white. Ranges from 0.03 to 0.97.
CHAPTER 11 Regression Discontinuity: Looking for Jumps in Data
Mortality: County mortality rate for children aged 5 to 9 from 1973 to 1983, limited to causes plausibly affected by Head Start.
Poverty: Poverty rate in 1960, transformed by subtracting the cutoff and divided by 10 for easier interpretation.
HeadStart: Dummy variable indicating counties that received Head Start assistance; counties with poverty greater than 59.2 are coded as 1, and counties with poverty less than 59.2 are coded as 0.
Bin: The "bin" label for each observation, based on dividing the poverty variable into 50 bins.
(l) Estimate a varying slopes model with a window of GOP vote share
from 0.4 to 0.6. Discuss any meaningful differences in coefficients
and standard errors from the earlier varying slopes model.
(a) Write out an equation for a basic RD design to assess the effect
of Head Start assistance on child mortality rates. Draw a picture
of what you expect the relationship to look like. Note that in
this example, treatment occurs for low values of the assignment
variable.
(b) Explain how RD analysis can identify a causal effect of Head Start
assistance on mortality.
(c) Estimate the effect of Head Start on mortality rate by using a basic
RD design.
(d) Estimate the effect of Head Start on mortality rate by using a varying
slopes RD design.
(e) Estimate a basic RD model with (adjusted) poverty values that are
between –0.8 and 0.8. Comment on your findings.
(g) Create a scatterplot of the mortality and poverty data. What do you
see?
(h) Use the following code to create a binned graph of the mortality and
poverty data. What do you see?3
egen BinMean = mean(Mortality), by(Bin)
graph twoway (scatter BinMean Bin, ytitle("Mortality") /*
*/ xtitle("Poverty") msize(large) xline(0.0) )/*
*/ (lfit BinMean Bin if HeadStart == 0, clcolor(blue)) /*
*/ (lfit BinMean Bin if HeadStart == 1, clcolor(red))
3 The trick to creating a binned graph is associating each observation with a bin label that is in the middle of the bin. The Stata code that created the Bin variable is:

scalar BinNum = 50
scalar BinMin = -6
scalar BinMax = 3
scalar BinLength = (BinMax-BinMin)/BinNum
gen Bin = BinMin + BinLength*(0.5+(floor((Poverty-BinMin)/BinLength)))

This sets the value for each observation to the middle of its bin; there are likely to be other ways to do it.
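The footnote's binning formula can also be sketched in Python (an illustration, not the book's code; the names mirror the Stata scalars):

```python
import math

def bin_midpoint(poverty, bin_min=-6.0, bin_max=3.0, bin_num=50):
    """Assign an observation to the midpoint of its bin,
    mirroring the Stata formula in the footnote."""
    bin_length = (bin_max - bin_min) / bin_num
    return bin_min + bin_length * (0.5 + math.floor((poverty - bin_min) / bin_length))

# With 50 bins on [-6, 3], each bin is 0.18 wide; an observation with
# Poverty = 0 falls in the bin centered at 0.03.
print(bin_midpoint(0.0))
```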
PART III

CHAPTER 12 Dummy Dependent Variables
where E[Yi | X1 , X2 ] is the expected value of Yi given the values of X1i and X2i . This term is also referred to as the conditional value of Y.2

When the dependent variable is dichotomous, the expected value of Y is equal to the probability that the variable equals 1. For example, consider a dependent variable that is 1 if it rains and 0 if it doesn't. If there is a 40 percent chance of rain, the expected value of this variable is 0.40. If there is an 85 percent chance of rain, the expected value of this variable is 0.85. In other words, because E[Y | X] = Probability(Y = 1 | X), OLS with a dichotomous dependent variable provides estimates of the probability that Y equals 1.
1 We discussed dichotomous independent variables in Chapter 7.

2 The terms linear and non-linear can get confusing. A linear model is one of the form Yi = β0 + β1 X1i + β2 X2i + · · · , where none of the parameters to be estimated is multiplied, divided, or raised to powers of other parameters. In other words, all the parameters enter in their own little plus term. In a non-linear model, some of the parameters are multiplied, divided, or raised to powers of other parameters. Linear models can estimate some non-linear relationships (by creating terms that are functions of the independent variables, not the parameters). We described this process in Section 7.1. Such polynomial models will not, however, solve the deficiencies of OLS for dichotomous dependent variables. The models that do address the problems, the probit and logit models we cover later in this chapter, are complex functions of other parameters and are therefore necessarily non-linear models.
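The equivalence between the expected value of a 0/1 variable and the probability it equals 1 is easy to check with a quick simulation; a Python sketch (not from the book):

```python
import random

random.seed(7)
# Simulate a 0/1 "rain" variable with a 40 percent chance of equaling 1.
draws = [1 if random.random() < 0.40 else 0 for _ in range(100_000)]
mean = sum(draws) / len(draws)
print(round(mean, 2))  # close to 0.40, the probability that the variable is 1
```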
12.1 Linear Probability Model

[TABLE 12.1]
GPA          0.032* (0.003) [t = 9.68]
Constant    −2.28* (0.256) [t = 8.91]
N            514
R²           0.23
Minimum Ŷi  −0.995
Table 12.1 displays the results from an LPM of the probability of admission
into a competitive Canadian law school (see Bailey, Rosenthal, and Yoon 2014).
The independent variable is college GPA (measured on a 100-point scale, as is
common in Canada). The coefficient on GPA is 0.032, meaning that an increase
in one point on the 100-point GPA scale is associated with a 3.2 percentage point
increase in the probability of admission into this law school.
Figure 12.1 is a scatterplot of the law school admissions data. It includes the
fitted line from the LPM. The scatterplot looks different from a typical regression
model scatterplot because the dependent variable is either 0 or 1, creating two
horizontal lines of observations. Each point is a light vertical line, and when there
are many observations, the scatterplot appears as a dark bar. We can see that folks
with GPAs under 80 mostly do not get admitted, while people with GPAs above
85 tend to get admitted.
The expected value of Y based on the LPM is a straight line with a slope
of 0.032. Clearly, as GPAs rise, the probability of admission rises as well.
The difference from OLS is that instead of interpreting β̂1 as the increase
in the value of Y associated with a one-unit increase in X, we now interpret
β̂1 as the increase in the probability Y equals 1 associated with a one-unit
increase in X.
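With the rounded coefficients from Table 12.1, the LPM fitted line can be evaluated directly. A Python sketch (the text's −0.995 for a GPA of 40 uses unrounded estimates, so these values are approximate):

```python
def lpm_fitted(gpa, b0=-2.28, b1=0.032):
    """Fitted 'probability' from the LPM in Table 12.1 (rounded coefficients)."""
    return b0 + b1 * gpa

print(lpm_fitted(85))  # around 0.44: a plausible probability
print(lpm_fitted(40))  # around -1.0: an impossible negative "probability"
```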
Limits to LPM
While Figure 12.1 is generally sensible, it also has a glaring flaw. The fitted line
goes below zero. In fact, the fitted line goes far below zero. The poor soul with
a GPA of 40 has a fitted value of −0.995. This is nonsensical (and a bit sad).
Probabilities must lie between 0 and 1. For a low enough value of X, the predicted value falls below zero; for a high enough value of X, the predicted value exceeds one.3

3 In this particular figure, the fitted probabilities do not exceed 1 because GPAs can't go higher than 100. In other cases, though, the independent variable may not have such a clear upper bound. Even so, it is extremely common for LPM fitted values to be less than 0 for some observations and greater than 1 for other observations.

[FIGURE 12.1: Scatterplot of Law School Admissions Data and LPM Fitted Line. Probability of admission plotted against GPA (on a 100-point scale).]

That LPM sometimes provides fitted values that make no sense isn't the only problem. We could, after all, simply say that any time we see a fitted value below 0, we'll call that a 0, and any time we see a fitted value above 1, we'll call that a 1. The deeper problem is that fitting a straight line to data with a dichotomous dependent variable runs the risk of misspecifying the relationship between the independent variables and the dichotomous dependent variable.

Figure 12.2 illustrates an example of LPM's problem. Panel (a) depicts a fitted line from an LPM that uses law school admissions data based on the six hypothetical observations indicated. The line is reasonably steep, implying a clear relationship.

[FIGURE 12.2: Hypothetical admissions data with LPM fitted lines; both panels plot probability of admission against GPA (on a 100-point scale).]

Now suppose that we add three observations from applicants
with very high GPAs, all of whom were admitted. These observations are the
triangles in the upper right of panel (b). Common sense suggests these observations
should strengthen our belief that GPAs predict admission into law school. Sadly,
LPM lacks common sense. The figure shows that the LPM fitted line with the new
observations (the dashed line) is flatter than the original estimate, implying that the
estimated relationship is weaker than the relationship we estimated in the original
model with less data.
What’s that all about? Once we come to appreciate that the LPM needs to fit
a linear relationship, it’s pretty easy to understand. If these three new applicants
had higher GPAs, then from an LPM perspective, we should expect them to have
a higher probability of admission than the applicants in the initial sample. But
the dependent variable can’t be higher than 1, so the LPM interprets the new data
as suggesting a weaker relationship. In other words, because these applicants had
higher independent variables but not higher dependent variables, the LPM suggests
that the independent variable is not driving the dependent variable higher.
What really is going on is that once GPAs are high enough, students are
pretty much certain to be admitted. In other words, we expect a non-linear
relationship—the probability of admission rises with GPAs up to a certain level,
then levels off as most applicants whose GPAs are above that level are admitted.
The probit and logit models we develop next allow us to capture precisely this
possibility.4
In LPM’s defense, it won’t systematically estimate positive slopes when
the actual slope is negative. And we should not underestimate its convenience
and practicality. Nonetheless, we should worry that LPM may leave us with an
incomplete view of the relationship between the independent and dichotomous
dependent variables.
REMEMBER THIS
The LPM uses OLS to estimate a model with a dichotomous dependent variable.
1. The coefficients are easy to interpret: a one-unit increase in Xj is associated with a βj increase
in the probability that Y equals 1.
2. Limitations of the LPM include the following:
• Fitted values of Ŷi may be greater than 1 or less than 0.
• Coefficients from an LPM may mischaracterize the relationship between X and Y.
4 LPM also has a heteroscedasticity problem. As discussed earlier, heteroscedasticity is a less serious problem than endogeneity, but heteroscedasticity forces us to cast a skeptical eye toward standard errors estimated by LPM. A simple fix is to use the heteroscedasticity-robust standard errors we discussed on page 68 in Chapter 3; for more details, see Long (1997, 39). Rather than get too in-the-weeds solving heteroscedasticity in LPMs, however, we might as well run the probit or logit models described shortly.
12.2 Using Latent Variables to Explain Observed Variables

[FIGURE 12.3: Scatterplot of Law School Admissions Data and LPM- and Probit-Fitted Lines. Probability of admission plotted against GPA (on a 100-point scale).]
S-curves
Figure 12.3 shows the law school admissions data. The LPM fitted line, in all
its negative probability glory, is still there, but we have also added a fitted curve
from a probit model. The probit-fitted line looks like a tilted letter “S,” and so the
relationship between X and the dichotomous dependent variable is non-linear. We
explain how to generate such a curve over the course of this chapter, but for now,
let’s note some of its nice features.
For applicants with GPAs below 70 or so, the probit-fitted line has flattened
out. This means that no matter how low students’ GPAs go, their fitted probability
of admission will not go below zero. For applicants with very high GPAs,
increasing scores lead to only small increases in the probability of admission. Even
if GPAs were to go very, very high, the probit-fitted line flattens out, and no one
will have a predicted probability of admission greater than one.
Not only does the S-shaped curve of the probit-fitted line avoid nonsensical
probability estimates, it also reflects the data better in several respects. First, there
is a range of GPAs in which the effect on admissions is quite high. Look in the
range from around 80 to around 90. As GPA rises in this range, the effect on
probability of admission is quite high, much higher than implied by the LPM fitted
line. Second, even though the LPM fitted values for the high GPAs are logically
possible (because they are between 0 and 1), they don’t reflect the data particularly
well. The person with the highest GPA in the entire sample (a GPA of 92) is
predicted by the LPM model to have only a 68 percent probability of admission.
The probit model, in contrast, predicts a 96 percent probability of admission for
this GPA star.
Latent variables
latent variable: For a probit or logit model, an unobserved continuous variable reflecting the propensity of an individual observation of Yi to equal 1.

To generate such non-linear fitted lines, we're going to think in terms of a latent variable. Something is latent if you don't see it, and a latent variable is something we don't see, at least not directly. We'll think of the observed dummy dependent variable (which is zero or one) as reflecting an underlying continuous latent variable. If the value of an observation's latent variable is high, then the dependent variable for that observation is likely to be one; if the value of an observation's latent variable is low, then the dependent variable for that observation is likely to be zero. In short, we're interested in a latent variable that is an unobserved continuous variable reflecting the propensity of an individual observation of Yi to equal 1.
Here’s an example. Pundits and politicians obsess over presidential approval.
They know that a president’s reelection and policy choices are often tied to the
state of his approval. Presidential approval is typically measured with a yes-or-no
question: Do you approve of the way the president is handling his job? That’s
our dichotomous dependent variable, but we know full well that the range of
responses to the president covers far more than two choices. Some people froth
at the mouth in anger at the mention of the president. Others think “Meh.” Others
giddily support the president.
It’s useful to think of these different views as different latent attitudes toward
the president. We can think of the people who hate the president as having very
negative values of a latent presidential approval variable. People who are so-so
about the president have values of a latent presidential approval variable near zero.
People who love the president have very positive values of a latent presidential
approval variable.
We think in terms of a latent variable because it is easy to write down a
model of the propensity to approve of the president. It looks like an OLS model.
Specifically, Yi∗ (pronounced "Y-star") is the latent propensity to be a 1 (an ugly phrase, but that's really what it is). It depends on some independent variable X and the β's:

Yi∗ = β0 + β1 X1i + εi

We observe Yi = 1 for people whose latent feelings are above zero.5 If the latent variable is less than zero, we observe Yi = 0. (We ignore non-answers to keep things simple.)
This latent variable approach is consistent with how the world works. There
are folks who approve of the president but differ in the degree to which they
approve; they are all ones in the observed variable (Y) but vary in the latent variable
(Y ∗ ). There are folks who disapprove of the president but differ in the degree of
their disapproval; they are all zeros in the observed variable (Y) but vary in the
latent variable (Y ∗ ).
Formally, we connect the latent and observed variables as follows. The observed variable is

Yi = 0 if Yi∗ < 0
Yi = 1 if Yi∗ ≥ 0

Because Yi∗ = β0 + β1 X1i + εi , we observe Yi = 1 when

β0 + β1 X1i + εi ≥ 0
εi ≥ −β0 − β1 X1i

In other words, if the random error term is greater than or equal to −β0 − β1 X1i , we'll observe Yi = 1. This implies

Pr(Yi = 1 | X1i ) = Pr(εi ≥ −β0 − β1 X1i )

With this characterization, the probability that the dependent variable is one is necessarily bounded between 0 and 1 because it is expressed in terms of the probability that the error term is greater or less than some number. Our task in the next section is to characterize the distribution of the error term as a function of the β parameters.
REMEMBER THIS
Latent variable models are helpful to analyze dichotomous dependent variables.
Yi∗ = β0 + β1 X1i + εi

5 Because the latent variable is unobserved, we have the luxury of using zero to label the point in the latent variable space at which folks become ones.
Probit model

probit model: A way to analyze data with a dichotomous dependent variable. The key assumption is that the error term is normally distributed.

The key assumption in a probit model is that the error term (εi ) is itself normally distributed. We've worked with the normal distribution a lot because the central limit theorem (from page 56) implies that with enough data, OLS coefficient estimates are normally distributed no matter how εi is distributed. For the probit model, we're saying that εi itself is normally distributed. So while normality of β̂1 is a proven result for OLS, normality of εi is an assumption in the probit model.
Before we explain the equation for the probit model, it is useful to do a bit of bookkeeping. We have shown that Pr(Yi = 1 | X1 ) = Pr(εi ≥ −β0 − β1 X1i ), but this equation can be hard to work with given the widespread convention in probability of characterizing the distribution of a random variable in terms of the probability that it is less than some value. Therefore, we're going to do a quick trick based on the symmetry of the normal distribution: because the distribution is symmetrical (it has the same shape on each side of the mean), the probability of seeing something larger than some number is the same as the probability of seeing something less than the negative of that number. Figure 12.4 illustrates this property. In panel (a), we shade the probability of being greater than −1.5. In panel (b), we shade the probability of being less than 1.5. The symmetry of the normal distribution backs up what our eyes suggest: the shaded areas are equal in size, indicating equal probabilities. In other words, Pr(εi > −1.5) = Pr(εi < 1.5). This fact allows us to rewrite Pr(Yi = 1 | X1 ) = Pr(εi ≥ −β0 − β1 X1i ) as

Pr(Yi = 1 | X1 ) = Pr(εi ≤ β0 + β1 X1i )
There isn’t a huge conceptual issue here, but now it’s much easier to characterize
the model with conventional tools for working with normal distributions. In
cumulative particular, stating the condition in this way simplifies our use of the cumulative
distribution function distribution function (CDF) of a standard normal distribution. The CDF tells us
(CDF) Indicates how how much of normal distribution is to the left of any given point. Feed the CDF a
much of normal
number, and it will tell us the probability that a standard normal random variable
distribution is to the left
of any given point.
is less than that number.
[FIGURE 12.4: Two standard normal densities. Panel (a) shades the probability of being greater than −1.5; panel (b) shades the probability of being less than 1.5.]

Figure 12.5 on page 420 shows examples for several values of β0 + β1 X1i . In panel (a), the portion of a standard normal probability density to the left of −0.7 is shaded. Below that, in panel (d), the CDF function with the value of the CDF at −0.7 is highlighted. The value is roughly 0.25, which is the area of the normal curve that is to the left of −0.7 in panel (a).
Panel (b) in Figure 12.5 shows a standard normal density curve with the
portion to the left of +0.7 shaded. Clearly, this is more than half the distribution.
The CDF below it, in panel (e), shows that in fact roughly 0.75 of a standard
normal density is to the left of +0.7. Panel (c) shows a standard normal probability
density function (PDF) with the portion to the left of 2.3 shaded. Panel (f), below
that, shows a CDF and highlights its CDF value at 2.3, which is about 0.99. Notice
that the CDF can’t be less than 0 or more than 1 because it is impossible to have
less than 0 percent or more than 100 percent of the area of the normal density to
the left of any number.
Since we know Yi = 1 if εi ≤ β0 + β1 X1i , the probability Yi = 1 will be the CDF evaluated at the point β0 + β1 X1i .
[FIGURE 12.5: Standard normal PDFs (top row) and CDFs (bottom row) evaluated at −0.7, 0.7, and 2.3.]
The notation we'll use for the normal CDF is Φ() (the Greek letter Φ is pronounced "fi," as in Wi-Fi), which indicates the probability that a normally distributed random variable (εi in this case) is less than the number in parentheses. In other words,

Pr(Yi = 1 | X1 ) = Φ(β0 + β1 X1i )
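The symmetry trick and the CDF values above can be verified numerically. A Python sketch using a standard normal CDF built from math.erf (an illustration, not the book's code):

```python
import math

def norm_cdf(z):
    """Standard normal CDF: Pr(epsilon < z)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Symmetry: Pr(eps > -1.5) equals Pr(eps < 1.5), as in Figure 12.4.
print(1 - norm_cdf(-1.5), norm_cdf(1.5))

# CDF values for the points in Figure 12.5:
for z in (-0.7, 0.7, 2.3):
    print(z, round(norm_cdf(z), 2))  # roughly 0.24, 0.76, 0.99
```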
The probit model produces estimates of β that best fit the data. That is, to
the extent possible, probit estimates will produce β̂’s that lead to high predicted
probabilities for observations that actually were ones. Likewise, to the extent
possible, probit estimates will produce β̂’s that lead to low predicted probabilities
for observations that actually were zeros. We discuss estimation after we introduce
the logit model.
Logit model

logit model: A way to analyze data with a dichotomous dependent variable. The error term in a logit model is logistically distributed. Pronounced "low-jit."

A logit model also allows us to estimate parameters for a model with a dichotomous dependent variable in a way that forces the fitted values to lie between 0 and 1. Logit models are functionally very similar to probit models. The difference from a probit model is the equation that characterizes the error term. The equation differs dramatically from the probit equation, but it turns out that this difference has little practical import. In a logit model,

Pr(Yi = 1) = e^(β0 + β1 X1i) / (1 + e^(β0 + β1 X1i))

To get a feel for the logit equation, consider what happens when β0 + β1 X1i
is humongous. In the numerator, e is raised to that big number, which leads
to a super big number. In the denominator will be that same number plus 1,
which is pretty much the same number. Hence, the probability will be very, very
close to 1. But no matter how big β0 + β1 X1i gets, the probability will never
exceed 1.
If β0 + β1 X1i is super negative, the numerator of the logit function will have e raised to a huge negative number, which is the same as one over e raised to a big number, which is essentially zero. The denominator will have that number plus one, meaning that the fraction is very close to 0/1, and therefore the probability that Yi = 1 will be very, very close to 0. No matter how negative β0 + β1 X1i gets, the probability will never go below 0.6
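The boundedness argument can be checked directly; a Python sketch of the logit function (illustration only):

```python
import math

def logit_prob(z):
    """Logit probability Pr(Y=1) for z = beta0 + beta1*X1."""
    return math.exp(z) / (1 + math.exp(z))

print(logit_prob(20))   # extremely close to 1, but never above it
print(logit_prob(-20))  # extremely close to 0, but never below it
print(logit_prob(0))    # exactly 0.5
```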
The probit and logit models are rivals, but friendly rivals. When properly
interpreted, they yield virtually identical results. Do not sweat the difference.
Simply pick probit or logit and get on with life. Back in the early days of
computers, the logit model was often preferred because it is computation-
ally easier than the probit model. Now powerful computers make the issue
moot.
6 If β0 + β1 X1i is zero, then Pr(Yi = 1) = 0.5. It's a good exercise to work out why. The logit function can also be written as Pr(Yi = 1) = 1 / (1 + e^(−(β0 + β1 X1i))).
REMEMBER THIS
The probit and logit models are very similar. Both estimate S-shaped fitted lines that are always above
0 and below 1.
1. In a probit model,

Pr(Yi = 1) = Φ(β0 + β1 X1i )

where Φ() is the standard normal CDF indicating the probability that a standard normal random variable is less than the number in parentheses.

2. In a logit model,

Pr(Yi = 1) = e^(β0 + β1 X1i) / (1 + e^(β0 + β1 X1i))
Discussion Questions
1. Come up with an example of a dichotomous dependent variable of interest, and then do the
following:
(a) Describe the latent variable underlying the observed dichotomous variable.
(b) Identify a continuous independent variable that may explain this dichotomous dependent
variable. Create a scatterplot of what you expect observations of the independent and
dependent variables to be.
(c) Sketch and explain the relationship you expect between your independent variable and
the probability of observing the dichotomous dependent variable equal to 1.
2. Come up with another example of a dichotomous dependent variable of interest. This
time, identify a dichotomous independent variable as well, and finish up by doing the
following:
(a) Create a scatterplot of what you expect observations of the independent and dependent
variables to be.
(b) Sketch and explain the relationship you expect between your independent variable and
the probability of observing the dichotomous dependent variable equal to 1.
12.4 Estimation

maximum likelihood estimation (MLE): The estimation process used to generate coefficient estimates for probit and logit models, among others.

So how do we select the best β̂ for the data given? The estimation process for the probit and logit models is called maximum likelihood estimation (MLE). It is more complicated than estimating coefficients using OLS. Understanding the inner workings of MLE is not necessary to implement or understand probit and logit models. Such an understanding can be helpful for more advanced work, however, and we discuss the technique in more detail in the citations and additional notes section on page 561.

In this section, we explain the properties of MLE estimates, describe the fitted values produced by probit and logit models, and show how goodness of fit is measured in MLE models.
Ŷi = Pr(Yi = 1) = Φ(β̂0 + β̂1 Xi ) = Φ(−3 + 2 × 0) = Φ(−3) = 0.001
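The Φ(−3) value can be reproduced with the standard normal CDF; a quick Python check (illustration only, using math.erf):

```python
import math

def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Fitted probability for beta0 = -3, beta1 = 2 at X = 0, as in the text:
print(round(norm_cdf(-3 + 2 * 0), 3))  # 0.001
```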
[FIGURE 12.6: Probit fitted lines for four patterns of hypothetical data. Each panel plots the probability that Y = 1 against X (from 0 to 3). Panel (a): β0 = −3, β1 = 2. Panel (b): β0 = −4, β1 = 6. Panel (c): β0 = −1, β1 = 1. Panel (d): β0 = 3, β1 = −2.]
Panel (b) of Figure 12.6 shows a somewhat similar relationship, but the
transition between the Y = 0 and Y = 1 observations is starker. When X is less
than about 0.5, the Y's are all zero; when X is greater than about 1.0, the Y's are all
one. This pattern of data indicates a strong relationship between X and Y, and β̂1 is,
not surprisingly, larger in panel (b) than in panel (a). The fitted line is quite steep.
Panel (c) of Figure 12.6 shows a common situation in which the relationship
between X and Y is rather weak. The estimated coefficients produce a fitted line
that is pretty flat; we don’t even see the full S-shape emblematic of probit models.
If we were to display the fitted line for a much broader range of X values, we
would see the S-shape because the fitted probabilities would flatten out at zero
for sufficiently negative values of X and would flatten out at one for sufficiently
positive values of X. Sometimes, as in this case, the flattening of a probit-fitted
line occurs outside the range of observed values of X.
Panel (d) of Figure 12.6 shows a case of a positive β̂ 0 coefficient and negative
β̂1 . This case best fits the pattern of the data in which Y = 1 for low values of X
and Y = 0 for high values of X.
REMEMBER THIS
1. Probit and logit models are estimated via MLE instead of OLS.
2. We can assess the statistical significance of MLE estimates of β̂ by using z tests, which closely
resemble t tests in large samples for OLS models.
[TABLE 12.2]
          (a)       (b)
X1        0.5       1.0
         (0.1)     (1.0)
X2       −0.5      −3.0
         (0.1)     (1.0)
N         500       500
log L  −1,000    −1,200
Review Questions
1. For each panel in Figure 12.6, identify the value of X that produces Ŷi = 0.5. Use the probit
equation.
2. Based on Table 12.2, indicate whether the following statements are true, false, or indetermi-
nate.
(a) The coefficient on X1 in column (a) is statistically significant.
(b) The coefficient on X1 in column (b) is statistically significant.
(c) The results in column (a) imply that a one-unit increase in X1 is associated with a
50-percentage-point increase in the probability that Y = 1.
(d) The fitted probability found by using the estimate in column (a) for X1i = 0 and
X2i = 0 is 0.
(e) The fitted probability found by using the estimate in column (b) for X1i = 0 and X2i = 0
is approximately 1.
3. Based on Table 12.2, indicate the fitted probability for the following:
(a) Column (a) and X1i = 4 and X2i = 0.
(b) Column (a) and X1i = 0 and X2i = 4.
(c) Column (b) and X1i = 0 and X2i = 1.
Probit and logit models have their strengths, but being easy to interpret is
not one of them. This is because the β̂’s feed into the complicated equations
defining the probability of observing Y = 1. These complicated equations keep
the predicted values above zero and less than one, but they can do so only by
allowing the effect of X to vary across values of X.
In this section, we explain how the estimated effect of X1 on Y in probit and
logit models depends not only on the value of X1 , but also on the value of the other
independent variables. We then describe approaches to interpreting the coefficient
estimates from these models.
[Figure: Probability of admission (vertical axis, 0 to 1) plotted against GPA on a
100-point scale (horizontal axis, 65 to 100). Along the S-shaped curve, the probability
rises by 0.03 when GPA goes from 70 to 75, by 0.30 when GPA goes from 85 to 90, but
by only 0.01 when GPA goes from 95 to 100.]
Then we want to know, for each observation, the estimated effect of increasing
X1 by one standard deviation. We could create a variable called P2 that is the fitted
value given the estimated β̂ coefficients and the actual values of the independent
variables for each observation with one important exception: for each observation,
we use the true value of X1 plus a standard deviation of X1 :

P2i = Φ(β̂0 + β̂1 (X1i + σX1 ) + β̂2 X2i )
The average difference in these two fitted values across all observations is
the simulated effect of increasing X1 by one standard deviation. The difference
for each observation will be driven by the magnitude of β̂1 because the difference
in these fitted values is all happening in the term multiplied by β̂1 . In short, this
means that the bigger β̂1 , the bigger the simulated effect of X1 will be.
It is not set in stone that we add one standard deviation. Sometimes it may
make sense to calculate these quantities by simply using an increase of one or
some other amount.
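To make the mechanics concrete, here is a hypothetical Python sketch of the calculation. The data, variable names, and "estimated" coefficient values are all invented stand-ins; in practice the b_hat values would come from a probit fit.

```python
# Hypothetical sketch of the observed-value, discrete-differences
# calculation for a continuous X1. Data and coefficients are invented.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 500
X1 = rng.normal(size=n)           # continuous independent variable
X2 = rng.normal(size=n)           # another independent variable

b0_hat, b1_hat, b2_hat = -0.2, 0.5, -0.3   # assumed probit estimates

# P1: fitted probability at each observation's actual values
P1 = norm.cdf(b0_hat + b1_hat * X1 + b2_hat * X2)

# P2: fitted probability with X1 raised by one standard deviation
P2 = norm.cdf(b0_hat + b1_hat * (X1 + X1.std()) + b2_hat * X2)

# Simulated effect: the average difference across all observations
effect = (P2 - P1).mean()
```

The Computing Corner gives Stata and R versions of this same difference-in-fitted-probabilities logic.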
These simulations make the coefficients interpretable in a commonsense
way. We can say things like, “The estimates imply that increasing GPA by one
standard deviation is associated with an average increase of 15 percentage points
in predicted probability of being admitted to law school.” That’s a mouthful, but
much more meaningful than the β̂ itself.
If X1 is a dummy variable, we summarize the effect of X1 slightly differently.
We calculate what the average increase in fitted probabilities would be if the value
of X1 for every observation were to go from zero to one. We’ll need to estimate
two quantities for each observation. First, we’ll need to calculate the estimated
probability that Y = 1 if X1 (the dummy variable) were equal to 0, given our β̂
estimates and the actual values of the other independent variables. For this purpose,
we could, for example, create a new variable called P0 :

P0i = Φ(β̂0 + β̂1 · 0 + β̂2 X2i )
Then we want to know, for each observation, what the estimated probability
that Y = 1 is if the dummy variable were to equal 1. For this purpose, we could,
for example, create a new variable called P1 that is the estimated probability that
Y = 1 if X1 (the dummy variable) were equal to 1 given our β̂ estimates and the
actual values of the other independent variables:

P1i = Φ(β̂0 + β̂1 · 1 + β̂2 X2i )
Notice that the only difference between P0 and P1 is that in P0 , X1 = 0 for all
observations (no matter what the actual value of X1 is) and in P1 , X1 = 1 for all
observations (no matter what the actual value of X1 is). The larger the value of β̂1 ,
the larger the difference between P0 and P1 will be for each observation. If β̂1 = 0,
then P0 = P1 for all observations and the estimated effect of X1 is clearly zero.
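The dummy-variable version differs only in how the two fitted probabilities are constructed. A hypothetical Python sketch, with invented data and coefficients:

```python
# Hypothetical sketch for a dummy X1: force X1 to 0 for everyone, then
# to 1 for everyone, holding the other variables at observed values.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 500
X2 = rng.normal(size=n)                    # other independent variable
b0_hat, b1_hat, b2_hat = -0.2, 0.8, -0.3   # assumed probit estimates

P0 = norm.cdf(b0_hat + b1_hat * 0 + b2_hat * X2)  # X1 set to 0
P1 = norm.cdf(b0_hat + b1_hat * 1 + b2_hat * X2)  # X1 set to 1

# Simulated effect of going from 0 to 1 on the dummy variable
effect = (P1 - P0).mean()
```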
The approach we have just described is called the observed-value,
discrete-differences approach to estimating the effect of an independent variable
on the probability Y = 1. “Observed value” comes from our use of these
observed values in the calculation of simulated probabilities. The alternative to
the observed-value approach is the average-case approach, which creates a single
composite observation whose independent variables equal sample averages. We
discuss the average-case approach in the citations and additional notes section on
page 562.
The “discrete-differences” part of our approach involves the use of specific
differences in the value of X1 when simulating probabilities. The alternative to the
discrete-differences approach is a marginal-effects approach based on slopes, discussed
on page 563.
REMEMBER THIS
1. Use the observed-value, discrete-differences method to interpret probit coefficients as
follows:
• If X1 is continuous:
(a) For each observation, calculate P1i as the standard fitted probability from the probit
results:

P1i = Φ(β̂0 + β̂1 X1i + β̂2 X2i )
(b) For each observation, calculate P2i as the fitted probability when the value of X1i is
increased by one standard deviation (σX1 ) for each observation:

P2i = Φ(β̂0 + β̂1 (X1i + σX1 ) + β̂2 X2i )
(c) The simulated effect of increasing X1 by one standard deviation is the average
difference P2i − P1i across all observations.
• If X1 is a dummy variable:
(a) For each observation, calculate P0i as the fitted probability but with X1i set to 0 for
all observations:

P0i = Φ(β̂0 + β̂1 · 0 + β̂2 X2i )
(b) For each observation, calculate P1i as the fitted probability but with X1i set to 1 for
all observations:

P1i = Φ(β̂0 + β̂1 · 1 + β̂2 X2i )
(c) The simulated effect of going from 0 to 1 for the dummy variable X1 is the average
difference P1i − P0i across all observations.
2. To interpret logit coefficients by using the observed-value, discrete-differences method,
proceed as with the probit model, but use the logit equation to generate fitted
values.
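For the logit version, only the fitted-probability formula changes. A hypothetical Python sketch with invented data and coefficients:

```python
# Hypothetical sketch of the logit version: identical logic, but fitted
# values come from the logistic function rather than the normal CDF.
import numpy as np

def logistic(z):
    # exp(z) / (1 + exp(z)), the logit fitted-probability formula
    return np.exp(z) / (1 + np.exp(z))

rng = np.random.default_rng(3)
n = 500
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
b0_hat, b1_hat, b2_hat = -0.3, 0.9, -0.5   # assumed logit estimates

P1 = logistic(b0_hat + b1_hat * X1 + b2_hat * X2)
P2 = logistic(b0_hat + b1_hat * (X1 + X1.std()) + b2_hat * X2)
effect = (P2 - P1).mean()
```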
Review Questions
Suppose we have data on restaurants in Los Angeles and want to understand what causes them to
go out of business. Our dependent variable is a dummy variable indicating bankruptcy in the year
of the study. One independent variable is the owner’s years of experience running a restaurant.
Another independent variable is a dummy variable indicating whether or not the restaurant had a
liquor license.
1. Explain how to calculate the effects of the owner’s years of experience on the probability a
restaurant goes bankrupt.
2. Explain how to calculate the effects of having a liquor license on the probability a restaurant
goes bankrupt.
where the price difference is the average price of the brand-name products
(e.g., Heinz and Hunt’s) minus the store-brand product, the display variables are
dummy variables indicating whether brand-name and store-brand products were
displayed, and the featured variables are dummy variables indicating whether
brand-name and store-brand products were featured in advertisements.7
A probit model of the purchase decision is
Table 12.3 presents the results. As always, the LPM results are easy to interpret.
The coefficient on price difference indicates that consumers are 13.4 percentage
points more likely to purchase the store-brand ketchup if the average price of the
brand-name products is $1 more expensive than the store brand, holding all else
equal. (The average price difference in this data set is only $0.13, so $1 is a big
difference in prices.) If the brand-name ketchup is displayed, consumers are 4.2
percentage points less likely to buy the store-brand product, while consumers
are 21.4 percentage points more likely to buy the store-brand ketchup when it is
displayed. These, and all the other coefficients, are statistically significant.
The fitted probabilities of buying store-brand ketchup from the LPM range
from –13 percent to +77 percent. Yeah, the negative fitted probability is weird.
Probabilities below zero do not make sense, and that’s one of the reasons why the
LPM makes people a little squeamish.
The second and third columns of Table 12.3 display probit and logit results.
These models are, as we know, designed to avoid nonsensical fitted values and to
better capture the relationship between the dependent and independent variables.
7 The income variable ranges from 1 to 14, with each value corresponding to a specified income
range. This approach to measuring income is pretty common, even though it is not super precise.
Sometimes people break an income variable coded this way into dummy variables; doing so does not
affect our conclusions in this particular case.
12.5 Interpreting Probit and Logit Coefficients 433
Interpreting the coefficients in the probit and logit models is not straightforward,
however. Does the fact that the coefficient on the price difference variable
in the probit model is 0.685 mean that consumers are 68.5 percentage points more
likely to buy store-brand ketchup when brand-name ketchup is $1 more expensive?
Does the coefficient on the store brand display variable imply that consumers
are 68.3 percentage points more likely to buy store-brand ketchup when it is on
display?
No. No. (No!) The coefficient estimates from the probit and logit models feed
into the complicated probit and logit equations on pages 418 and 421. We need
extra steps to understand what they mean. Table 12.4 shows the results when we
use our simulation technique to understand the substantive implications of our
estimates. The estimated effect of a $1 price increase from the probit model is
calculated by comparing the average fitted value for all individuals at their actual
values of their independent variables to the average fitted value for all individuals
when the price difference variable is increased by one for every observation (and all
other variables remain at their actual values). This value is 0.191, a bit higher than
the LPM estimate of 0.134 we see in Table 12.3 but still in the same ballpark.
Our simulations are slightly different for dummy independent variables. For
example, to calculate the estimated effect of displaying the brand-name ketchup
from the probit model, we first calculate fitted values from the probit model
assuming the value of this variable is equal to 0 for every consumer while using
the actual values of the other variables. Then, we calculate fitted values from the
probit model assuming the value of the brand name displayed variable is equal to
1 for everyone, again using the actual values of the other variables. The average
difference in these fitted probabilities is –0.049, indicating our probit estimates
imply that displaying the brand-name ketchup lowers the probability of buying the
store brand by 4.9 percentage points, on average.
The logit-estimated effects in Table 12.4 are generated via a similar process,
using the logit equation instead of the probit equation. The logit-estimated effects
for each variable track the probit-estimated effects pretty closely. This pattern is
not surprising because the two models are doing the same work, just with different
assumptions about the error term.
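A quick numerical check shows why the two sets of fitted values line up so closely: over a range of index values, the normal CDF and the logistic function trace nearly the same S-curve. The grid of index values below is arbitrary.

```python
# Sketch comparing probit and logit fitted probabilities over an
# arbitrary grid of index values (beta0 + beta1*X1 + ...).
import numpy as np
from scipy.stats import norm

z = np.linspace(-3, 3, 121)              # arbitrary index values
probit_p = norm.cdf(z)                   # probit probability
logit_p = np.exp(z) / (1 + np.exp(z))    # logit probability

# The two S-curves are nearly perfectly correlated
corr = np.corrcoef(probit_p, logit_p)[0, 1]
```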
Figure 12.8 on page 435 helps us visualize the results by displaying the fitted
values from the LPM, probit, and logit estimates. We’ll display fitted values as a
function of the price difference variable.
FIGURE 12.8: Fitted Lines from LPM, Probit, and Logit Models
One of the LPM lines dips below zero. That’s what LPMs do. It’s screwy. On the
whole, however, the LPM lines are pretty similar to the probit and logit lines. The
probit and logit lines are quite similar to each other as well. In fact, the fitted values
from the probit and logit models are very similar, as is common. Their correlation is
0.998. The probit and logit fitted values don’t quite show the full S-shaped curve;
they would, however, if we were to extend the graph to include even higher (and
less realistic) price differences.
8 It may seem odd that this is called a likelihood ratio test when the statistic is the difference in log
likelihoods. The test can also be considered as the log of the ratio of the two likelihoods. Because
12.6 Hypothesis Testing about Multiple Coefficients 437
An example makes this process clear. It’s not hard. Suppose we want to
know if displaying the store-brand ketchup is more effective than featuring it in
advertisements. This is the kind of thing people get big bucks for when they do
marketing studies.
Using our LR test framework, we first want to characterize the unrestricted
version of the model, which is simply the model with all the covariates in it:
This is considered unrestricted because we are letting the coefficients on the store
brand display and store brand featured variables be whatever best fit the data.
The null hypothesis is that the effect of displaying and featuring store-brand
ketchup is the same—that is, H 0 : β3 = β5 . We impose this null hypothesis on the
model by forcing the computer to give us results in which the coefficients on the
display and featured variables for the store brand are equal. We do this by replacing
β5 with β3 in the model (which we can do because under the null hypothesis they
are equal), yielding a restricted model of
Look carefully, and notice that the β3 is multiplied by (Store brand displayi +
Store brand featuredi ) in this restricted equation.
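In code, imposing the restriction amounts to replacing the two variables with their sum before estimating. The following is a hypothetical Python sketch that fits both probit models by maximum likelihood on simulated data; the variable names, sample, and coefficient values are invented, not the book's ketchup data.

```python
# Sketch of imposing H0: beta3 = beta5 by summing the two dummies in
# the restricted model, then forming the LR statistic (Equation 12.3).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def probit_loglik(beta, y, X):
    # Sum of y*log(Phi(Xb)) + (1 - y)*log(1 - Phi(Xb))
    p = np.clip(norm.cdf(X @ beta), 1e-12, 1 - 1e-12)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def max_loglik(y, X):
    # Maximize the probit log likelihood; return log L at the optimum
    res = minimize(lambda b: -probit_loglik(b, y, X),
                   np.zeros(X.shape[1]), method="BFGS")
    return -res.fun

rng = np.random.default_rng(2)
n = 1000
display = rng.integers(0, 2, n).astype(float)
featured = rng.integers(0, 2, n).astype(float)
y = ((-0.5 + 0.4 * display + 0.9 * featured
      + rng.normal(size=n)) > 0).astype(float)

ones = np.ones(n)
X_ur = np.column_stack([ones, display, featured])   # unrestricted
X_r = np.column_stack([ones, display + featured])   # forces equal betas

LR = 2 * (max_loglik(y, X_ur) - max_loglik(y, X_r))
```

Because the restricted model is nested in the unrestricted one, the LR statistic is never negative.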
log(LUR /LR ) = log LUR − log LR , however, we can use the form given. Most software reports the log
likelihood, not the (unlogged) likelihood, so it’s more convenient to use the difference of log
likelihoods than the ratio of likelihoods. The 2 in Equation 12.3 is there to make things work; don’t
ask.
column in Table 12.3. At the bottom is the unrestricted log likelihood that will
feed into the LR test.
This is a good time to do a bit of commonsense approximating. The
coefficients on the store brand display and store brand featured variables in the
unrestricted model in Table 12.5 are both positive and statistically significant, but
the coefficient on the store brand featured variable is quite a bit higher than the
coefficient on the store brand displayed variable. Both coefficients have relatively
small standard errors, so it is reasonable to expect that there’s a difference,
suggesting that H 0 is false.
From Table 12.5, it is easy to calculate the LR test statistic:

LR = 2(log LUR − log LR ) = 8.57
Using the tools described in this chapter’s Computing Corner, we can calculate that
the p value associated with an LR value of 8.57 is 0.003, well below a conventional
significance level of 0.05.
Or, equivalently, we can reject the null hypothesis if the LR statistic is greater
than the critical value for our significance level. The critical value for a significance
level of 0.05 is 3.84, and our LR test statistic of 8.57 exceeds that. This means we
can reject the null that the coefficients on the display and featured variables are
the same. In other words, we have good evidence that consumers responded more
strongly when the product was featured in ads than when it was displayed.
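The arithmetic of the test can be checked directly. The sketch below uses the LR value of 8.57 from the text and the χ² distribution with one degree of freedom (one equal sign in the null).

```python
# Check the LR test arithmetic: p value and 5 percent critical value
# for an LR statistic of 8.57 with 1 degree of freedom.
from scipy.stats import chi2

LR = 8.57
df = 1

p_value = chi2.sf(LR, df)        # upper-tail probability, about 0.003
critical = chi2.ppf(0.95, df)    # 5 percent critical value, about 3.84
```

Both routes give the same verdict: the p value is below 0.05, and the statistic exceeds the critical value, so we reject the null.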
REMEMBER THIS
Use the LR test to examine hypotheses involving multiple coefficients for probit and logit models.
4. Use the log likelihood values from the unrestricted and restricted models to calculate the LR
test statistic:

LR = 2(log LUR − log LR )
5. The larger the difference between the log likelihoods, the more the null hypothesis is reducing
fit and, therefore, the more likely we are to reject the null.
• The test statistic is distributed according to a χ 2 distribution with degrees of freedom equal
to the number of equal signs in the null hypothesis.
• Code for generating critical values and p values for this distribution is in the Computing
Corner on pages 446 and 448.
[Bottom rows of a results table: σ̂ = 0.128, 0.128; log L = −549.092, −508.545]
• GDP is lagged GDP per capita. The GDP measure is lagged to avoid any
taint from the civil war itself, which almost surely had an effect on the
economy. It is measured in thousands of inflation-adjusted U.S. dollars. The
variable ranges from 0.05 to 66.7, with a mean of 3.65 and a standard
deviation of 4.53.
Table 12.6 shows results for LPM and probit models. For each method, we
present results with and without GDP. We see a similar pattern when GDP is omitted.
In the LPM (a) specification, ethnic fractionalization is statistically significant and
religious fractionalization is not. The same is true for the probit (a) specification that
does not have GDP.
Fearon and Laitin’s suspicion, however, was supported by both LPM and probit
analyses. When GDP is included, the ethnic fractionalization variable becomes
insignificant in both LPM and probit (although it is close to significant in the LPM).
The GDP variable is highly statistically significant in both LPM and probit models.
So the general conclusion that GDP seems to matter more than ethnic fraction-
alization does not depend on which model we use to estimate this dichotomous
dependent variable model.
[Figure: Probability of civil war (vertical axis, −0.08 to 0.04) plotted against GDP per
capita in $1,000 U.S. (horizontal axis, 0 to 70), showing fitted values from the LPM and
probit models; the LPM line falls below zero at high levels of GDP.]
FIGURE 12.9: Fitted Lines from LPM and Probit Models for Civil War Data (Holding Ethnic and
Religious Variables at Their Means)
Yet, the two models do tell slightly different stories. Figure 12.9 shows the fitted
lines from the LPM and probit models for the specifications that include the GDP
variable. When calculating these lines, we held the ethnic and religious variables
at their mean values. The LPM model has its characteristic brutally straight fitted
line. It suggests that whatever its wealth, a country sees its probability of civil war
decline as it gets even wealthier. It does this to the point of not making sense—the
fitted probabilities are negative (hence meaningless) for countries with per capita
GDP above about $20,000 per year.
In contrast, the probit model has a curve. We’re seeing only a hint of the S
curve because even the poorest countries have less than a 4 percent probability of
experiencing civil war. But we do see that the effect of GDP is concentrated among
the poorest countries. For them, the effect of income is relatively higher, certainly
higher than the LPM suggests. But for countries with about $10,000 per capita GDP
per year, income shows basically no effect on the probability of a civil war. So even
as the broad conclusion that GDP matters is similar in the LPM and probit models,
the way in which GDP matters is quite different across the models.
Conclusion
Things we care about are often dichotomous. Think of unemployment, vote choice,
graduation, war, or countless other phenomena. We can use OLS to analyze
such data via LPM, but we risk producing models that do not fully reflect the
relationships in the data.
The solution is to fit an S-shaped relationship via probit or logit models. Probit
and logit models are, as a practical matter, interchangeable as long as sufficient
care is taken in the interpretation of coefficients. The cost of these models is that
they are more complicated, especially with regard to interpreting the coefficients.
We’re in good shape when we can:
• Section 12.2: Describe what a latent variable is and how it relates to the
observed dichotomous variable.
• Section 12.3: Describe the probit and logit models. What is the equation for
the probability that Yi = 1 for a probit model? What is the equation for the
probability that Yi = 1 for a logit model?
• Section 12.4: Discuss the estimation procedure used for probit and logit
models and how to generate fitted values.
Further Reading
There is no settled consensus on the best way to interpret probit and logit
coefficients. Substantive conclusions rarely depend on the mode of presentation,
so any of the methods is legitimate. Hanmer and Kalkan (2013) argue for the
observed-value approach and against the average-case approach.
MLE models do not inherit all properties of OLS models. In OLS, het-
eroscedasticity does not bias coefficient estimates; it only makes the conventional
equation for the standard error of β̂1 inappropriate. In probit and logit models,
heteroscedasticity can induce bias (Alvarez and Brehm 1995), but correcting for
heteroscedasticity may not always be feasible or desirable (Keele and Park 2006).
King and Zeng (2001) discuss small-sample properties of logistic models,
noting in particular that small-sample bias can be large when the dependent
variable is a rare event, with only a few observations falling in the less frequent
category.
Probit and logit models are examples of limited dependent variable models.
In these models, the dependent variable is restricted in some way. As we have
seen, the dependent variable in probit models is limited to two values, 1 and 0.
MLE can be used for many other types of limited dependent variable models. If
the dependent variable is ordinal with more than two categories (e.g., answers to
a survey question for which answers are very satisfied, satisfied, dissatisfied, and
very dissatisfied), an ordered probit model is useful. It is based on MLE methods
and is a modest extension of the probit model. Some dependent variables are
categorical. For example, we may be analyzing the mode of transportation to work
(with walking, biking, driving, and taking public transportation as options). In
such a case, multinomial logit, another MLE technique, is useful. Other dependent
variables are counts (number of people on a bus) or lengths of time (how long
between buses or how long someone survives after a disease diagnosis). Models
with these dependent variables also can be estimated with MLE methods, such as
count models and duration models. Long (1997) introduces maximum likelihood
and covers a broad variety of MLE techniques. King (1989) explains the general
approach. Box-Steffensmeier and Jones (2004) provide an excellent guide to
duration models.
Key Terms
Cumulative distribution function (418)
Dichotomous (409)
Latent variable (416)
Likelihood ratio test (436)
Linear probability model (410)
Log likelihood (425)
Logit model (421)
Maximum likelihood estimation (423)
Probit model (418)
z test (423)
Computing Corner
Stata
To implement the observed-value, discrete-differences approach to interpreting
estimated effects for probit in Stata, do the following.
• If X1 is continuous:
** Estimate probit model
probit Y X1 X2

** "normal" refers to normal CDF function
** _b[_cons] is beta0 hat, _b[X1] is beta1 hat, etc.
gen P1 = normal(_b[_cons] + _b[X1]*X1 + _b[X2]*X2)

** Increase X1 by one standard deviation
egen sdX1 = sd(X1)
gen X1Plus = X1 + sdX1
gen P2 = normal(_b[_cons] + _b[X1]*X1Plus + _b[X2]*X2)
gen PDiff = P2 - P1

** Display results
** "e(sample)" tells Stata to only use observations
** used in probit analysis
sum PDiff if e(sample)
• If X1 is a dummy variable:
** Estimate probit model
probit Y X1 X2
** Fitted probabilities with dummy X1 set to 0, then to 1
gen P0 = normal(_b[_cons] + _b[X1]*0 + _b[X2]*X2)
gen P1 = normal(_b[_cons] + _b[X1]*1 + _b[X2]*X2)
gen PDiff = P1 - P0
** Display results
sum PDiff if e(sample)
• The margins command produces average marginal effects, which are the
average of the slopes with respect to each independent variable evaluated
at observed values of the independent variables. See the discussion on
marginal effects on page 563 for more details. These are easy to implement
in Stata, with similar syntax for both probit and logit models.
probit Y X1 X2
margins, dydx(X1)
To ascertain the p value for an LR test with d.f. = 1, substituting log
likelihood values in for logLunrestricted and logLrestricted, type
display 1-chi2(1, 2*(logLunrestricted - logLrestricted))
Even easier, we can use Stata’s test command to conduct a Wald test,
which is asymptotically equivalent to the LR test (which is a fancy way of
saying the test statistics get really close to each other as the sample size
goes to infinity). For example,
probit Y X1 X2 X3
test X2 = X3 = 0
• To estimate a logit model in Stata, use logic and structure similar to those
for a probit model. Here are the key differences for the continuous variable
example:
logit Y X1 X2
gen LogitP1 = exp(_b[_cons]+_b[X1]*X1+_b[X2]*X2)/*
*/(1+exp(_b[_cons]+_b[X1]*X1+_b[X2]*X2))
gen LogitP2 = exp(_b[_cons]+_b[X1]*X1Plus+_b[X2]*X2)/*
*/(1+exp(_b[_cons]+_b[X1]*X1Plus+_b[X2]*X2))
Computing Corner 447
• To graph fitted lines from a probit or logit model that has only one
independent variable, first estimate the model and save the fitted values.
Then use the following command:
graph twoway (scatter ProbitFit X)
R
To implement a probit or logit analysis in R, we use the glm function, which stands
for “generalized linear model” (as opposed to the lm function, which stands for
“linear model”).
• If X1 is continuous:
## Estimate probit model and name it Result
Result = glm(Y ~ X1 + X2, family = binomial(link = "probit"))
P1 = pnorm(Result$coef[1] + Result$coef[2]*X1 + Result$coef[3]*X2)
X1Plus = X1 + sd(X1)
P2 = pnorm(Result$coef[1] + Result$coef[2]*X1Plus + Result$coef[3]*X2)
mean(P2 - P1)  ## simulated effect of a one-standard-deviation increase
• If X1 is a dummy variable:
## Estimate probit model and name it Result
Result = glm(Y ~ X1 + X2, family = binomial(link = "probit"))
P0 = pnorm(Result$coef[1] + Result$coef[2]*0 + Result$coef[3]*X2)
P1 = pnorm(Result$coef[1] + Result$coef[2]*1 + Result$coef[3]*X2)
mean(P1 - P0)  ## simulated effect of going from 0 to 1
BushVote04 Dummy variable = 1 if person voted for President Bush in 2004 and 0 otherwise
ProIraqWar02 Position on Iraq War, ranges from 0 (opposed war) to 3 (favored war)
Party02 Partisan affiliation, ranges from 0 for strong Democrats to 6 for strong Republicans
BushVote00 Dummy variable = 1 if person voted for President Bush in 2000 and 0 otherwise
CutRichTaxes02 Views on cutting taxes for wealthy, ranges from 0 (oppose) to 2 (favor)
To conduct the LR test, estimate the restricted model by using the restricted
equation:
## Restricted probit model
RModel = glm(Y ~ X3, family = binomial(link =
"probit"))
and proceed with the rest of the test.
• To estimate a logit model in R, use logic and structure similar to those for
a probit model. Here are the key differences for the continuous variable
example:
Result = glm(Y ~ X1+X2, family = binomial(link ="logit"))
P1 = exp(Result$coef[1]+Result$coef[2]*X1+Result$coef[3]*X2)/
(1+exp(Result$coef[1]+Result$coef[2]*X1+Result$coef[3]*X2))
P2 = exp(Result$coef[1]+Result$coef[2]*X1Plus+Result$coef[3]*X2)/
(1+exp(Result$coef[1]+Result$coef[2]*X1Plus+Result$coef[3]*X2))
• To graph fitted lines from a probit or logit model that has only one
independent variable, first estimate the model and save it. In this case, we’ll
save a probit model as ProbResults. Create a new variable that spans the
range of the independent variable. In this case, we create a variable called
Xsequence that ranges from 1 to 7 in steps of 0.05 (the first value is 1,
the next is 1.05, etc.). We then use the coefficients from the ProbResults
model and this Xsequence variable to plot fitted lines:
Xsequence = seq(1, 7, 0.05)
plot(Xsequence, pnorm(ProbResults$coef[1] +
ProbResults$coef[2]*Xsequence), type = "l")
Exercises
1. In this question, we use the data set BushIraq.dta to explore the effect
of opinion about the Iraq War on the presidential election of 2004. The
variables we will focus on are listed in Table 12.7.
(a) Estimate two probit models: one with only ProIraqWar02 as the
independent variable and the other with all the independent variables listed in the table.
(b) Use the model with all the independent variables and the
observed-value, discrete-differences approach to calculate the effect
of a one standard deviation increase in ProIraqWar02 on support for
Bush.
(c) Use the model with all the independent variables listed in the table
and the observed-value, discrete-differences approach to calculate
the effect of an increase of one standard deviation in Party02 on
support for Bush. Compare to the effect of ProIraqWar02.
(f) Calculate the correlation of the fitted values from the probit and logit
models.
(ii) What are the minimum and maximum fitted values from this
model? Discuss implications briefly.
(b) Use a probit model to estimate the probability of saying that global
warming is real and caused by humans (the dependent variable
is HumanCause2). Use the independent variables from part (a),
including the age-squared variable.
(ii) What are the minimum and maximum fitted values from this
model? Discuss implications briefly.
Party7 Partisan identification, ranging from 1 for strong Republican, 2 for not-so-strong
Republican, 3 leans Republican, 4 undecided/independent, 5 leans Democrat, 6
not-so-strong Democrat, 7 strong Democrat
(c) The survey described in this item also included a survey experiment
in which respondents were randomly assigned to different question
wordings for an additional question about global warming. The idea
was to see which frames were most likely to lead people to agree that
the earth is getting warmer. The variable we analyze here is called
WarmAgree. It records whether respondents agreed that the earth’s
average temperature is rising. The experimental treatment consisted
of four different ways to phrase the question.
• The variable Treatment equals 2 for people who were given the
following information before being asked if they agreed that
the average temperature of the earth is getting warmer: “The
following figure [Figure 12.10] shows the average global tem-
perature compared to the average temperature from 1951–1980.
The temperature analysis comes from weather data from more
than 1,000 meteorological stations around the world, satellite
observations of sea surface temperature, and Antarctic research
station measurements.”
• The variable Treatment equals 3 for people who were given the
following information before being asked if they agreed that the
average temperature of the earth is getting warmer: “Scientists
working at the National Aeronautics and Space Administration
(NASA) have concluded that the average global temperature
[Figure: Global average temperature anomaly over time (vertical axis, roughly −0.4 to
0.4), showing the annual mean and the 5-year mean, with uncertainty indicated; the
series runs through 2011.]
FIGURE 12.10: Figure Included for Some Respondents in Global Warming Survey Experiment
Exercises 453
(a) Run a probit model explaining whether the coach was fired as a
function of winning percentage. Graph fitted values from this model
on the same graph with fitted values from a bivariate LPM (use
the lfit command to plot LPM results). Explain the differences in
the plots.
(b) Estimate LPM, probit, and logit models of coach firings by using
winning percentage, lagged winning percentage, a new coach
dummy, strength of schedule, and coach tenure as independent
variables. Are the coefficients substantially different? How about the
z statistics?
(c) Indicate the minimum, mean, and maximum of the fitted values for
each model, and briefly discuss each.
FiredCoach A dummy variable indicating whether the football coach was fired during or after the season
(1 = fired, 0 = otherwise)
WinPct The winning percentage of the team
Tenure The number of years the coach has coached the team
(e) It’s kind of odd to say that lagged winning percentage affects the
probability that new coaches were fired because they weren’t
coaching for the year associated with the lagged winning percentage.
Include an interaction for the new coach dummy variable and lagged
winning percentage. The effect of lagged winning percentage on
probability of being fired is the sum of the coefficients on lagged
winning percentage and the interaction. Test the null hypothesis that
lagged winning percentage has no effect on new coaches (meaning
coaches for whom NewCoach = 1). Use a Wald test (which is most
convenient) and an LR test.
4. Are members of Congress more likely to meet with donors than with
mere constituents? To answer this question, Kalla and Broockman (2015)
conducted a field experiment in which they had political activists attempt
to schedule meetings with 191 congressional offices regarding efforts to
ban a potentially harmful chemical. The messages the activists sent out
were randomized. Some messages described the people requesting the
meeting as “local constituents,” and others described the people requesting
the meeting as “local campaign donors.” Table 12.10 describes two key
variables from the experiment.
(b) Use a probit model to estimate the effect of the donor treatment
condition on probability of meeting with a member of Congress.
Interpret the results.
(c) What factors are missing from the model? What does this omission
mean for our results?
donor_treat Dummy variable indicating that activists seeking meeting were identified as donors
(1 = donors, 0 = otherwise)
staffrank Highest-ranking person attending the meeting: 0 for no one attended meeting, 1 for
non-policy staff, 2 for legislative assistant, 3 for legislative director, 4 for chief of
staff, 5 for member of Congress
(d) Use an LPM to make your estimate. Interpret the results. Assess the
correlation of the fitted values from the probit model and LPM.
(e) Use an LPM to assess the probability of meeting with a senior staffer
(defined as staffrank > 2).
(g) Table 12.11 shows results for balance tests (covered in Section 10.1)
for two variables: Obama vote share in the congressional district
and the overall campaign contributions received by the member
of Congress contacted. Discuss the implication of these results for
balance.
[Table 12.11, bottom row: N = 191 in both columns]
Advanced Material
13 Time Series: Dealing with Stickiness over Time
the bump in year 1 will percolate through the entire data series. Such a dynamic
model includes a lagged dependent variable as an independent variable. Dynamic
models might seem pretty similar to other OLS models, but they actually differ in
important and funky ways.
This chapter covers both approaches to dealing with time series data.
Section 13.1 introduces a model for autocorrelation. Section 13.2 shows how
to use this model to detect autocorrelation, and Section 13.3 presents two ways
to properly account for autocorrelated errors. Section 13.4 introduces dynamic
models, and Section 13.5 discusses an important but complicated aspect of
dynamic models called stationarity.
13.1 Modeling Autocorrelation

Yt = β0 + β1 Xt + εt (13.1)
The notation for this differs from the notation for our standard OLS model. Instead
of using i to indicate each individual observation, we use t to indicate each time
period. Yt therefore indicates the dependent variable at time t; Xt indicates the
independent variable at time t.
This model helps us appreciate the potential that errors may be correlated in
time series data. To get a sense of how this happens, first let us consider a seemingly
random fact: sunspots are a solar phenomenon that may affect temperature and that
strengthen and weaken somewhat predictably over a roughly 11-year cycle. Now
suppose that we’re trying to assess if carbon emissions affect global temperature
with a data set that does not have a variable for sunspot activity. The fact that we
haven’t measured sunspots means that they will be in the error term, and the fact
that they cycle up and down over an 11-year period means that the errors in the
model will be correlated.1

1 We show how the OLS equation for the variance of β̂1 depends on the errors being
uncorrelated on page 499.

autoregressive process: A process in which the value of a variable depends directly
on the value from the previous period.

Here we will model autocorrelated errors by assuming the errors follow an
autoregressive process. In an autoregressive process, the value of a variable
depends directly on the value from the previous period. The equation for an
autoregressive error process is
εt = ρεt−1 + νt (13.2)

This equation says that the error term for time period t equals ρ times the error
in the previous period plus a random error, νt. We assume that νt is uncorrelated
with the independent variable and other error terms. We call εt−1 the lagged error
because it is the error from the previous period. We indicate a lagged variable
with the subscript t − 1 instead of t. A lagged variable is a variable with the values
from the previous period.2

lagged variable: A variable with the values from the previous period.
The absolute value of ρ must be less than one in autoregressive models. If ρ
were greater than one, the errors would tend to grow larger in each time period
and would spiral out of control.
We often refer to autoregressive models as AR models. In AR models, the
errors are a function of errors in previous periods. If errors are a function of only
the errors from the previous period, the model is referred to as an AR(1) model
(pronounced A-R-1). If the errors are a function of the errors from the two previous
periods, the model is referred to as an AR(2) model, and so on. We’ll focus on
AR(1) models in this book.

AR(1) model: A model in which the errors are assumed to depend on their value
from the previous period.
2 Some important terms here sound similar but have different meanings. Autocorrelation refers to
errors being correlated with each other. An autoregressive model is the most common, but not the
exclusive, way to model autocorrelation. It is possible to model correlated errors differently. In a
moving average error process, for example, errors can be the average of errors from some number of
previous periods. In Section 13.4 we’ll use an autoregressive model for the dependent variable rather
than for the error.
[Figure 13.1: Three simulated error series plotted over time, 1950–2000: (a) positive autocorrelation, (b) no autocorrelation, (c) negative autocorrelation (ρ = −0.8).]

Panel (a) of Figure 13.1 shows positive autocorrelation. The telltale signature of
positive autocorrelation is smoothness: the error tends to linger
above zero for a few periods, then below zero for a few periods, and so on. This
graph is telling us that if we know the error in one period, we have some sense of
what it will be in the following period. That is, if the error is positive in period t,
then it’s likely (but not certain) to be positive in period t + 1.
Panel (b) of Figure 13.1 shows a case of no autocorrelation. The error in time
t is not a function of the error in the previous period. The telltale signature of no
autocorrelation is the randomness: the plot is generally spiky, but here and there
the error might linger above or below zero, without a strong pattern.
Panel (c) of Figure 13.1 shows negative serial correlation with ρ = −0.8. The
signature of negative serial correlation is extreme spikiness because a positive error
is more likely to be followed by a negative error, and vice versa.
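The three panels can be reproduced with a small simulation. Here is a minimal sketch in Python with numpy, an illustration of the general idea rather than the book's code (the book's Computing Corner uses Stata and R):

```python
import numpy as np

# Simulate AR(1) errors: e_t = rho * e_{t-1} + nu_t.
def simulate_ar1(rho: float, n: int = 500, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    nu = rng.normal(size=n)
    e = np.zeros(n)
    for t in range(1, n):
        e[t] = rho * e[t - 1] + nu[t]
    return e

def lag1_corr(e: np.ndarray) -> float:
    """Sample correlation between e_t and e_{t-1}."""
    return float(np.corrcoef(e[1:], e[:-1])[0, 1])

pos = lag1_corr(simulate_ar1(0.8))     # smooth, lingering series: strongly positive
none = lag1_corr(simulate_ar1(0.0))    # spiky, patternless series: near zero
neg = lag1_corr(simulate_ar1(-0.8))    # extremely spiky series: strongly negative
```

The lag-1 sample correlations mirror the visual signatures described above: smooth series have strongly positive lag-1 correlation, spiky series strongly negative.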
REMEMBER THIS
1. Autocorrelation refers to the correlation of errors with each other.
2. A standard way to model autocorrelated errors is to assume they come from an autoregressive
process in which the error term in period t is a function of the error in previous periods.
3. The equation for error in an AR(1) model is
εt = ρεt−1 + νt

13.2 Detecting Autocorrelation

Discussion Question
The temperature hovers above the trend line for periods (such as around World
War II and now) and below the line for other periods (such as 1950–1980). This
hovering is a sign that the error in one period is correlated with the error in the
next period. Panel (b) of Figure 13.2 shows the residuals from this regression. For
each observation, the residual is the distance from the fitted line; the residual plot
is essentially panel (a) tilted so that the fitted line in panel (a) is now the horizontal
line in panel (b).
[Figure 13.2: (a) Global temperature by year with an OLS fitted line; (b) residuals from that regression by year.]
where ε̂t and ε̂t−1 are, respectively, simply the residuals and lagged residuals from
the initial OLS estimation of Yt = β0 + β1 Xt + εt. If ρ̂ is statistically significantly
different from zero, we have evidence of autocorrelation.3
3 This approach is closely related to a so-called Durbin-Watson test for autocorrelation. This test
statistic is widely reported, but it has a more complicated distribution than a t distribution and
requires use of specific tables. In general, it produces results similar to those from the auxiliary
regression process described here.
Table 13.1 shows the results of such a lagged residual model for the climate
data in Figure 13.2. The dependent variable in this model is the residual from
Equation 13.3, and the independent variable is the lagged value of that residual.
We’re using this model to estimate how closely ε̂t and ε̂t−1 are related. The answer?
They are strongly related. The coefficient on ε̂t−1 is 0.608, meaning that our ρ̂
estimate is 0.608, which is quite a strong relation. The standard error is 0.072,
implying a t statistic of 8.39, which is well beyond any conventional critical value.
We can therefore handily reject the null hypothesis that ρ = 0 and conclude that
errors are autocorrelated.
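The two-step detection logic can be sketched in a few lines of Python with numpy. This uses simulated data with a true ρ of 0.6, not the book's climate series:

```python
import numpy as np

# Simulate y = 2 + 1.5*x + e with AR(1) errors (true rho = 0.6), then run the
# two-step detection: OLS residuals, then residuals on lagged residuals.
rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=n)
nu = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.6 * e[t - 1] + nu[t]
y = 2.0 + 1.5 * x + e

# Step 1: initial OLS, save residuals
X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b

# Step 2: auxiliary regression of residuals on lagged residuals (no constant,
# matching the equation in the text)
r_t, r_lag = resid[1:], resid[:-1]
rho_hat = float(r_lag @ r_t / (r_lag @ r_lag))
nu_hat = r_t - rho_hat * r_lag
se = np.sqrt((nu_hat @ nu_hat) / (len(r_t) - 1) / (r_lag @ r_lag))
t_stat = rho_hat / se                  # large t statistic: reject rho = 0
```

With a true ρ of 0.6 and 300 observations, ρ̂ lands near 0.6 and the t statistic is far beyond conventional critical values, just as in the climate example.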
REMEMBER THIS
To detect autocorrelation in time series data:
1. Graph the residuals from a standard OLS model over time. If the plot is relatively smooth,
positive autocorrelation is likely to exist. If the plot is relatively spiky, negative autocorrelation
is likely.
2. Estimate the following OLS model:
ε̂t = ρε̂t−1 + νt
Yt = β0 + β1 Xt + ρεt−1 + νt (13.5)

This equation looks like a standard OLS equation except for a pesky ρεt−1 term.
Our goal is to zap that term. Here’s how:
1. Write an equation for the lagged value of Yt that simply requires replacing
the t subscripts with t − 1 subscripts in the original model:
3. Subtract the equation for ρYt−1 (Equation 13.7) from Equation 13.5.
That is, subtract the left side of Equation 13.7 from the left side of
Equation 13.5, and subtract the right side of Equation 13.7 from the right
side of Equation 13.5.
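Step 2, multiplying the lagged equation by ρ, appears to have dropped out of this copy. Written in full, and numbered to match the text's references to Equations 13.6 through 13.8 (the numbering is inferred from those references), the steps are:

```latex
Y_{t-1} = \beta_0 + \beta_1 X_{t-1} + \epsilon_{t-1} \qquad (13.6)
```
```latex
\rho Y_{t-1} = \rho\beta_0 + \rho\beta_1 X_{t-1} + \rho\epsilon_{t-1} \qquad (13.7)
```
```latex
Y_t - \rho Y_{t-1} = \beta_0(1-\rho) + \beta_1(X_t - \rho X_{t-1}) + \nu_t \qquad (13.8)
```

Subtracting Equation 13.7 from Equation 13.5 cancels the ρεt−1 term exactly, leaving only the well-behaved error νt.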
4 Technically, GLS is an approach that accounts for known aspects of the error structure such as
autocorrelation. Since we need to estimate the extent of autocorrelation, the approach we discuss here
is often referred to as “feasible GLS,” or FGLS, because it is the only feasible approach given
uncertainty about the true error structure.
13.3 Fixing Autocorrelation
The key thing is to look at the error term in this new equation. It is νt ,
which we said at the outset is the well-behaved (not autocorrelated) part of the
error term. Where is t , the naughty, autocorrelated part of the error term? Gone!
That’s the thing. That’s what we accomplished with these equations: we end up
with an equation that looks pretty similar to our OLS equation with a dependent
variable (Ỹt ), parameters to estimate (β̃0 and β1 ), an independent variable (X̃t ),
and an error term (νt ). The difference is that unlike our original model (based on
Equations 13.1 and 13.2), this model has no autocorrelation. By using Ỹt and X̃t ,
we have transformed the model from one that suffers from autocorrelation to one
that does not.
While the coefficients produced by OLS (and used for Newey-West) need only
that the error term be uncorrelated with Xt (our standard condition for exogeneity),
the coefficients produced by the ρ-transformed model need the error term to be
uncorrelated with Xt, Xt−1, and Xt+1 in order to be unbiased.5
The ρ-transformed model is also referred to as a Cochrane-Orcutt model or a
Prais-Winsten model.6
5 Wooldridge (2013, 424) has more detail on the differences in the two approaches. To get a sense of
why the ρ-transformed model has more conditions for exogeneity, note that the independent variable
in Equation 13.8 is composed of both Xt and Xt−1, and we know from past results that the correlation
of errors and independent variables is a problem.
6 The Prais-Winsten model approximates the values for the missing first observation in the
ρ-transformed data. These names are useful to remember when we’re looking for commands in Stata
and R to analyze time series data.
data will be missing because we don’t know the lagged value for that
observation.
Once we’ve created these transformed variables, things are easy. If we think
in terms of a spreadsheet, we’ll simply use the columns Ỹ and X̃ when we estimate
the ρ-transformed model.
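The spreadsheet logic can be sketched in Python with numpy. Only the first row of Table 13.2 survives in this copy (Year 2000, Y = 100, X = 50), so the later Y and X values below are hypothetical:

```python
import numpy as np

# rho-transform two columns with rho_hat = 0.5, as in Table 13.2. The first
# table row has Y = 100, X = 50; subsequent values here are hypothetical.
rho_hat = 0.5
Y = np.array([100.0, 110.0, 104.0, 116.0])
X = np.array([50.0, 52.0, 49.0, 55.0])

Y_tilde = np.full(len(Y), np.nan)
X_tilde = np.full(len(X), np.nan)
Y_tilde[1:] = Y[1:] - rho_hat * Y[:-1]   # Y_t - rho_hat * Y_{t-1}
X_tilde[1:] = X[1:] - rho_hat * X[:-1]   # X_t - rho_hat * X_{t-1}
# The first observation stays missing: it has no lagged value.
```

We then estimate OLS using the Y_tilde and X_tilde columns in place of Y and X.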
It is worth emphasizing that the β̂1 coefficient we estimate in the
ρ-transformed model is an estimate of β1 . Throughout all the rigmarole of the
transformation process, the value of β1 doesn’t change. The value of β1 in
the original equation is the same as the value of β1 in the transformed equation.
Hence, when we get results from ρ-transformed models, we still speak of them in
the same terms as β1 estimates from standard OLS. That is, a one-unit increase in
X is associated with a β̂1 increase in Y.7
One non-intuitive thing is that even though the underlying data is the same,
the β̂ estimates differ from the OLS estimates. Both OLS and ρ-transformed
coefficient estimates are unbiased and consistent, which means that in expectation,
the estimates equal the true value, and as we get more data they converge to the true
value. These things can be true and the models can still yield different coefficient
estimates. If we flip a coin 100 times, we are likely to get a different number of
heads each time we go through the process, even though the expected number of
heads is 50. That’s pretty much what is going on here: the two approaches
are different realizations of random processes that are correct on average but still
have random noise.
Should we use Newey-West standard errors or the ρ-transformed approach?
Each approach has its virtues. The ρ-transformed approach is more statistically
“efficient,” meaning that it will produce smaller (yet still appropriate) standard
errors than Newey-West. The downside of the ρ-transformed approach is that it
requires additional assumptions to produce unbiased estimates of β1.
TABLE 13.2 Example of ρ-Transformed Data (for ρ̂ = 0.5)

          Original data         ρ-Transformed data
Year      Y        X            Ỹ (= Yt − ρ̂Yt−1)     X̃ (= Xt − ρ̂Xt−1)
2000      100      50           —                     —
7 The intercept estimated in a ρ-transformed model is actually β0(1 − ρ̂). If we want to know the
fitted value for Xt = 0 (which is the meaning of the intercept in a standard OLS model), we need to
divide β̃0 by (1 − ρ̂).
REMEMBER THIS
1. Newey-West standard errors account for autocorrelation. These standard errors are used with
OLS β̂ estimates.
2. We can also correct for autocorrelation by ρ-transforming the data, a process that purges
autocorrelation from the data and produces estimates of β̂ that differ from those of OLS.
a. The model is Ỹt = β̃0 + β1 X̃t + νt, where Ỹt = Yt − ρYt−1 and X̃t = Xt − ρXt−1.
The first column of Table 13.3 shows results from a standard OLS analysis of the
model. The t statistics for β̂1 and β̂2 are greater than 5 but, as we have discussed,
are not believable due to the corruption of standard OLS standard errors by the
correlation of errors.
The first column of Table 13.3 also reports that ρ̂ = 0.514; this result was
generated by estimating an auxiliary regression with residuals as the dependent variable
and lagged residuals as the independent variable. The autocorrelation is lower
than in the model that does not include year squared as an independent variable
[Figure: Global temperature (deviation from pre-industrial average) plotted by year.]
(as reported on page 466) but is still highly statistically significant, suggesting that
we need to correct for autocorrelation.
The second column of Table 13.3 shows results with Newey-West standard
errors. The coefficient estimates do not change, but the standard errors and t
statistics do change. Note also that the standard errors are bigger and the t statistics
are smaller than in the OLS model.
The third column of Table 13.3 shows results from a ρ-transformed model.
β̂1 and β̂2 haven’t changed much from the first column. This outcome isn’t too
surprising given that both OLS and ρ-transformed models produce unbiased
estimates of β1 and β2. The difference is in the standard errors. The standard error on
each of the Year and Year² variables has almost doubled, which has almost halved
the t statistics for β̂1 and β̂2 to near 3. In this particular instance, the relationship
between year and temperature is so strong that even with these larger standard
errors, we reject the null hypotheses of no relationship at conventional significance
levels (such as α = 0.05 or α = 0.01). What we see, though, is the large effect on the
standard errors of addressing autocorrelation.
Several aspects of the results from the ρ-transformed model are worth noting.
First, the ρ̂ from the auxiliary regression is now very small (−0.021) and statistically
insignificant, indicating that we have indeed purged the model of first-order
autocorrelation. Well done! Second, the R² is lower in the ρ-transformed model. It
reports the traditional goodness-of-fit statistic for the transformed model, but it
is not directly meaningful or comparable to the R² in the original OLS model. Third,
the constant changes quite a bit, from 155.68 to 79.97. Recall from the footnote on
page 470 that the constant in the ρ-transformed model is actually β0(1 − ρ̂),
where ρ̂ is the estimate of autocorrelation in the untransformed model. This means
that the estimate of β0 is 79.97/(1 − 0.514) = 164.5, which is reasonably close to the
estimate of β̂0 in the OLS model.
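The back-transformation of the constant is just arithmetic; a quick check in Python using the numbers reported in the text:

```python
# Back out beta0 from the rho-transformed intercept, using the values from the
# text: reported constant 79.97, rho_hat = 0.514.
beta0_hat = 79.97 / (1 - 0.514)
print(round(beta0_hat, 1))  # prints 164.5
```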
13.4 Dynamic Models

In this section, we introduce the dynamic model and discuss three ways in which
the model differs from OLS models.

Yt = γYt−1 + β0 + β1 Xt + εt (13.9)
where the new term is γ times the value of the lagged dependent variable, Yt−1 .
The coefficient γ indicates the extent to which the dependent variable depends on
its lagged value. The higher it is, the more the dependence across time. If the data
is really generated according to a dynamic process, omitting the lagged dependent
variable would put us at risk for omitted variable bias, and given that the coefficient
on the lagged dependent variable is often very large, that means we risk large
omitted variable bias if we omit the lagged dependent variable when γ ≠ 0.
As a practical matter, a dynamic model with a lagged dependent variable is
super easy to implement: just add the lagged dependent variable as an independent
variable.
If γ is near one, the dependent variable has a lot of memory. A change in one
period strongly affects the value of the dependent
variable in the next period. In this case, the long-term effect of X will be much
bigger than β̂1 because the estimated long-term effect will be β̂1 divided by a
small number, 1 − γ̂. If γ is near zero, on the other hand, then the dependent variable has
little memory, meaning that the dependent variable depends little on its value in
the previous period. In this case, the long-term effect of X will be pretty much β̂1
because the estimated long-term effect will be β̂1 divided by a number close to 1.
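A tiny numerical illustration of the long-term effect, using hypothetical estimates rather than values from the chapter:

```python
# Long-term effect of X in a dynamic model: beta1_hat / (1 - gamma_hat).
# The values below are hypothetical, chosen for illustration only.
def long_term_effect(beta1_hat: float, gamma_hat: float) -> float:
    return beta1_hat / (1.0 - gamma_hat)

high_memory = long_term_effect(2.0, 0.8)    # roughly 10: small denominator
low_memory = long_term_effect(2.0, 0.05)    # roughly 2.1: denominator near 1
```

With a lot of memory (γ̂ = 0.8), a one-unit permanent increase in X eventually moves Y by five times the one-period effect; with little memory (γ̂ = 0.05), the long-term and short-term effects are nearly the same.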
8 The condition that the absolute value of γ is less than 1 rules out certain kinds of explosive
processes where Y gets increasingly bigger or smaller every period. This condition is related to a
requirement that data be “stationary,” as discussed on page 476.
REMEMBER THIS
1. A dynamic time series model includes a lagged dependent variable as a control variable. For
example,
Yt = γYt−1 + β0 + β1 Xt + εt
13.5 Stationarity
stationarity: A time series term indicating that a variable has the same distribution
throughout the entire time series. Statistical analysis of non-stationary variables
can yield spurious regression results.

We also need to think about stationarity when we analyze time series data. A
stationary variable has the same distribution throughout the entire time series. This
is a complicated topic, and we’ll only scratch the surface here. The upshot is that
stationarity is good and its opposite, non-stationarity, is bad. When working with
time series data, we want to make sure our data is stationary.
In this section, we define non-stationarity as a so-called unit root problem and
then explain how spurious regression results are a huge danger with non-stationary
data. Spurious regression results are less likely with stationary data. We also show
how to detect non-stationarity and what to do if we find it.
Yt = γYt−1 + εt (13.10)
Y3 = γY2 + ε3
   = γ(γY1 + ε2) + ε3
   = γ(γ(γY0 + ε1) + ε2) + ε3
   = γ³Y0 + γ²ε1 + γε2 + ε3

When γ < 1, the effect of any given value of Y will decay over time. In this case,
the effect of Y0 on Y3 is γ³Y0; because γ < 1, γ³ will be less than one. We could
extend the foregoing logic to show that the effect of Y0 on Y4 will be γ⁴Y0, which is
less than γ³Y0, when γ < 1. The effect of the error terms in a given period will also
have a similar pattern. This case presents some differences from standard OLS, but
it turns out that because the effects of previous values of Y and error fade away,
we will not face a fundamental problem when we estimate coefficients.
What if we have γ > 1? In this case, we’d see an explosive process because
the value of Y would grow by an increasing amount. Time series analysts rule out
such a possibility on theoretical grounds. Variables just don’t explode like this,
certainly not indefinitely, as implied by a model with γ > 1.
The tricky case occurs when γ = 1 exactly. In this case, the variable is said
to have a unit root. In a model with a single lag of the dependent variable, a unit
root simply means that the coefficient on the lagged dependent variable (γ for the
model as we’ve written it) is equal to one. The terminology is a bit quirky: “unit”
refers to the number 1, and “root” refers to the source of something, in this case the
lagged dependent variable that is a source for the value of the dependent variable.

unit root: A variable with a unit root has a coefficient equal to one on the lagged
variable in an autoregressive model.
9 Other problems are that the coefficient on the lagged dependent variable will be biased downward,
preventing the coefficient divided by its standard error from following a t distribution.
10 Zorro’s slashes would probably go more side to side, so maybe think of unit root variables as
slashed by an inebriated Zorro.
11 The citations and additional notes section on page 565 has code to simulate variables with unit
roots and run regressions using those variables. Using the code makes it easy to see that the
proportion of simulations with statistically significant (spurious) results is very high.
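The simulation the footnote describes can be sketched as follows in Python with numpy. The sample size and number of simulations are our choices for illustration, not the book's:

```python
import numpy as np

# Simulate the spurious regression problem: two independent random walks
# (unit root variables), regressed on each other many times.
rng = np.random.default_rng(0)
n, sims = 100, 200
significant = 0
for _ in range(sims):
    y = np.cumsum(rng.normal(size=n))   # random walk: gamma = 1
    x = np.cumsum(rng.normal(size=n))   # an independent random walk
    X = np.column_stack([np.ones(n), x])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    s2 = resid @ resid / (n - 2)
    se_b1 = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    if abs(b[1] / se_b1) > 1.96:        # "significant" at the 5 percent level
        significant += 1

share_spurious = significant / sims     # typically well over half
```

Even though y and x are generated completely independently, far more than 5 percent of the regressions look statistically significant, which is exactly the spurious regression danger.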
[Figure: Two unrelated unit root variables, Y and X, plotted over time in panels (a) and (b), and Y plotted against X with an OLS fitted line in panel (c). The regression of Y on X yields β1 = −0.81 with a t statistic for β1 of −36.1, a spurious result.]

[Figure: Two unrelated stationary variables, Y and X, plotted over time in panels (a) and (b), and Y plotted against X in panel (c). Here the regression yields β1 = −0.08 with a t statistic for β1 of −0.997, correctly suggesting no relationship.]
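The differenced equation that the following "where" clause refers to appears to have been lost in extraction; it can be reconstructed from context by subtracting Yt−1 from both sides of Equation 13.10:

```latex
\Delta Y_t = Y_t - Y_{t-1} = (\gamma - 1)Y_{t-1} + \epsilon_t = \alpha Y_{t-1} + \epsilon_t
```

where, as the text explains below, the new coefficient is labeled α = γ − 1.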
where the dependent variable ΔYt is now the change in Y in period t and the
independent variable is the lagged value of Y. We pronounce ΔYt as “delta Y.”
Here we’re using notation suggesting a unit root test for the dependent variable.
We also run unit root tests with the same approach for independent variables.
This transformation allows us to reformulate the model in terms of a new
coefficient we label as α = γ − 1. Under the null hypothesis that γ = 1, our
new parameter α equals 0. Under the alternative hypothesis that γ < 1, our new
parameter α is less than 0.
It’s standard to estimate a so-called augmented Dickey-Fuller test that
includes a time trend and a lagged value of the change of Y (ΔYt−1):

ΔYt = αYt−1 + β0 + β1 Timet + β2 ΔYt−1 + εt (13.11)

where Timet is a variable indicating which time period observation t is. Time is
equal to 1 in the first period, 2 in the second period, and so forth.

augmented Dickey-Fuller test: A test for a unit root in time series data that
includes a time trend and lagged values of the change in the variable as
independent variables.
The focus of the Dickey-Fuller approach is the estimate of α. What we do
with our estimate of α takes some getting used to. The null hypothesis is that Y is
non-stationary. That’s bad. We want to reject the null hypothesis. The alternative
is that Y is stationary. That’s good. If we reject the null hypothesis in favor of
the alternative hypothesis that α < 0, then we are rejecting the non-stationarity of
Y in favor of inferring that Y is stationary.
The catch is that if the variable actually is non-stationary, the estimated
coefficient is not normally distributed, which means the coefficient divided by
its standard error will not have a t distribution. Hence, we have to use so-called
Dickey-Fuller critical values, which are bigger than standard critical values,
making it hard to reject the null hypothesis that the variable is non-stationary. We
show how to implement Dickey-Fuller tests in the Computing Corner at the end
of this chapter; more details are in the references indicated in the Further Reading
section.
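A hand-rolled version of the augmented Dickey-Fuller regression in Equation 13.11, in Python with numpy, makes the mechanics concrete. This sketch computes only the t statistic on α; in practice we compare it to Dickey-Fuller critical values, which canned routines (such as Stata's dfuller or R's tseries::adf.test) supply:

```python
import numpy as np

# Augmented Dickey-Fuller regression (Equation 13.11):
#   dY_t = alpha*Y_{t-1} + b0 + b1*Time_t + b2*dY_{t-1} + error
def adf_t_stat(y: np.ndarray) -> float:
    dy = np.diff(y)                      # dY_t = Y_t - Y_{t-1}
    dep = dy[1:]                         # dY_t for t = 2, ..., T
    X = np.column_stack([
        y[1:-1],                         # lagged level Y_{t-1}
        np.ones(len(dep)),               # constant
        np.arange(2, len(y)),            # time trend
        dy[:-1],                         # lagged change dY_{t-1}
    ])
    b, *_ = np.linalg.lstsq(X, dep, rcond=None)
    resid = dep - X @ b
    s2 = resid @ resid / (len(dep) - X.shape[1])
    se_alpha = np.sqrt(s2 * np.linalg.inv(X.T @ X)[0, 0])
    return float(b[0] / se_alpha)

rng = np.random.default_rng(2)
t_stationary = adf_t_stat(rng.normal(size=300))       # white noise: very negative
t_walk = adf_t_stat(np.cumsum(rng.normal(size=300)))  # unit root: much closer to zero
```

For a stationary white-noise series, the t statistic on α is far into the negative tail; for a random walk, it typically is not negative enough to clear the Dickey-Fuller critical values.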
REMEMBER THIS
A variable is stationary if its distribution is the same for the entire data set. A common violation of
stationarity occurs when data has a persistent trend.
1. Non-stationary data can lead to statistically significant regression results that are spurious when
two variables have similar trends.
2. The test for stationarity is a Dickey-Fuller test. Its most widely used format is an augmented
Dickey-Fuller test:
ΔYt = αYt−1 + β0 + β1 Timet + β2 ΔYt−1 + εt
If we reject the null hypothesis that α = 0, we conclude that the data is stationary and can use
untransformed data. If we fail to reject the null hypothesis that α = 0, we conclude the data is
non-stationary and therefore should use a model with differenced data.
12 Including these variables is not a no-brainer. One might argue that the independent variables are
causing the non-linear time trend, and we don’t want the time trend in there to soak up variance.
Welcome to time series analysis. Without definitively resolving the question, we’ll include time
trends as an analytically conservative approach in the sense that it will typically make it harder, not
easier, to find statistical significance for independent variables.
[Figure: Global temperature (deviation from pre-industrial average, in Fahrenheit; left-hand scale) and carbon dioxide (parts per million; right-hand scale) plotted over time.]
The model is
Temperaturet = γTemperaturet−1 + β0 + β1 Yeart + β2 Yeart² + β3 CO2t + εt (13.12)
where CO2t is a measure of the concentration of carbon dioxide in the atmosphere
at time t. This is a much (much!) simpler model than climate scientists use; our model
simply gives us a broad-brush picture of whether the relationship between carbon
dioxide and temperature can be ascertained in macro-level data.
Our first worry is that the data might not be stationary. If that is the case, there
is a risk of spurious regression. Therefore, the first two columns of Table 13.4 show
Dickey-Fuller results for the substantive variables, temperature and carbon dioxide.
We use an augmented Dickey-Fuller test of the following form:
ΔTemperaturet = αTemperaturet−1 + β1 Yeart + β2 ΔTemperaturet−1 + εt
Recall that the null hypothesis in a Dickey-Fuller test is that the data is
non-stationary. The alternative hypothesis in a Dickey-Fuller test is that the data
is stationary; we will accept this alternative only if the coefficient is sufficiently
negative. (Yes, this way of thinking takes a bit of getting used to.)
To show that data is stationary (which is a good thing!), we need a sufficiently
negative t statistic on the estimate of α. For the temperature variable, the t statistic
in the Dickey-Fuller test is −4.22.13 As we discussed earlier, the critical values for the
Dickey-Fuller test are not the same as those for standard t tests. They are listed at
the bottom of Table 13.4. Because the t statistic on the lagged value of temperature
is more negative than the critical value, even at the one percent level, we can reject
the null hypothesis of non-stationarity. In other words, the temperature data is
stationary. We get a different answer for carbon dioxide. The t statistic is positive.
That immediately dooms a Dickey-Fuller test because we need to see t statistics
more negative than the critical values to be able to reject the null. In this case, we do
not reject the null hypothesis and therefore conclude that the carbon dioxide data
is non-stationary. This means that we should be wary of using the carbon dioxide
variable directly in a time series model.
A good way to begin to deal with non-stationarity is to use differenced data,
which we generate by creating a variable that is the change of a variable in period
t, as opposed to the level of the variable.
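Differencing itself is a one-liner; a minimal Python sketch with hypothetical values (not the chapter's data):

```python
import numpy as np

# Differenced data: the change in a variable rather than its level.
co2_levels = np.array([310.0, 312.5, 315.2, 318.0])  # hypothetical levels
co2_changes = np.diff(co2_levels)                    # level_t - level_{t-1}
# One observation is lost: the first period has no lagged level.
```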
We still need to check for stationarity with the differenced data, though, so back
we go to Table 13.4 for the Dickey-Fuller tests. This time we see that the last two
columns use the changes in the temperature and carbon dioxide variables to test
for stationarity. The t statistic on the lagged value of the change in temperature
of −12.04 allows us to easily reject the null hypothesis of non-stationarity for
temperature. For carbon dioxide, the t statistic on the lagged value of the change
in carbon dioxide is −3.31, which is more negative than the critical value at the
13 So far in this book we have been reporting the absolute value of t statistics, as the sign does not
typically matter. Here we focus on negative t statistics to emphasize the fact that the α coefficient
needs to be negative to reject the null hypothesis of non-stationarity.
[Regression table fragment: (Intercept) 0.992 (0.830) [t = 1.20]; N = 126; R² = 0.110.]
14 Dickey-Fuller tests tend to be low powered (see, e.g., Kennedy 2008, 302). This means that these
tests may fail to reject the null hypothesis when the null is false. For this reason, some people are
willing to use relatively high significance levels (e.g., α = 0.10). The costs of failing to account for
non-stationarity when it is present are high, while the costs of accounting for non-stationarity when
data is stationary are modest. Thus, many researchers are inclined to use differenced data when there
are any hints of non-stationarity (Kennedy 2008, 309).
Conclusion
Time series data is all over: prices, jobs, elections, weather, migration, and
much more. To analyze it correctly, we need to address several econometric
challenges.
One is autocorrelation. Autocorrelation does not cause coefficient estimates
from OLS to be biased and is therefore not as problematic as endogeneity.
Autocorrelation does, however, render the standard equation for the variance of
β̂ (from page 146) inaccurate. Often standard OLS will produce standard errors
that are too small when there is autocorrelation, giving us false confidence about
how precise our understanding of the relationship is.
We can correct for autocorrelation with one of two approaches. We can use
Newey-West standard errors that use OLS β̂ estimates and calculate standard
errors in a way that accounts for autocorrelated errors. Or we can ρ-transform
the data to produce unbiased estimates of β1 and correct standard errors of β̂1 .
Another, more complicated challenge associated with time series data is the
possibility that the dependent variable is dynamic, which means that the value of
the dependent variable in one period depends directly on its value in the previous
period. Dynamic models include the lagged dependent variable as an independent
variable.
Dynamic models exist in an alternative statistical universe. Coefficient
interpretation has short-term and long-term elements. Autocorrelation creates
bias. Including a lagged dependent variable when we shouldn’t creates bias, too.
As a practical matter, time series analysis can be hard. Very hard. This chapter
lays the foundations, but there is a much larger literature that gets funky fast.
In fact, sometimes the many options can feel overwhelming. Here are some
considerations to keep in mind when working with time series data:
• Deal with stationarity. It’s often an advanced topic, but it can be a serious
problem. If either a dependent or an independent variable is non-stationary,
one relatively easy fix is to use variables that measure changes (commonly
referred to as differenced data) to estimate the model.
• It’s probably a good idea to use a lagged dependent variable—and it’s then
advisable to check for autocorrelation. Autocorrelation does not cause bias
in standard OLS, but when a lagged dependent variable is included, it can
cause bias.
should probably lean toward the results from the model with the lagged dependent
variable. If not, we might lean toward the ρ-transformed result. Sometimes we may
simply have to report both and give our honest best sense of which one seems more
consistent with theory and the data.
After reading and discussing this chapter, we should be able to describe and
explain the following key points:
Further Reading
Researchers do not always agree on whether lagged dependent variables should
be included in models. Achen (2000) discusses bias that can occur when lagged
dependent variables are included. Keele and Kelly (2006) present simulation
evidence that the bias that occurs when one includes a lagged dependent variable is
small unless the autocorrelation of errors is quite large. Wilson and Butler (2007)
discuss how the bias is worse for the coefficient on the lagged dependent variable.
De Boef and Keele (2008) discuss error correction models, which can accom-
modate a broad range of time series dynamics. Grant and Lebo (2016) critique
error correction methods. Box-Steffensmeier and Helgason (2016) introduce a
symposium on the approach.
Another relatively advanced concept in time series analysis is cointegration,
a phenomenon that occurs when a linear combination of possibly non-stationary
variables is stationary. Pesaran, Shin, and Smith (2001) provide a widely used
approach that integrates unit root and cointegration tests; Philips (2018)
provides an accessible introduction to these tools.
Pickup and Kellstedt (2017) present a very useful guide to thinking about
models that may have both stationary and non-stationary variables in them.
Stock and Watson (2011) provide an extensive introduction to the use of time
series models to forecast economic variables.
For more on the Dickey-Fuller test and its critical values, see Greene (2003,
638).
488 CHAPTER 13 Time Series: Dealing with Stickiness over Time
Key Terms
AR(1) model (461)
Augmented Dickey-Fuller test (481)
Autoregressive process (460)
Cross-sectional data (459)
Dickey-Fuller test (480)
Dynamic model (474)
Generalized least squares (467)
Lagged variable (461)
Newey-West standard errors (467)
Spurious regression (477)
Stationarity (476)
Time series data (459)
Unit root (477)
Computing Corner
Stata
1. To detect autocorrelation, proceed in the following steps:
** Estimate basic regression model
regress Temp Year
** Save residuals using resid subcommand
predict Err, resid
** Plot residuals over time
scatter Err Year
** Tell Stata which variable indicates time
tsset Year
** Auxiliary regression of residuals on lagged residuals
reg Err L.Err
** "L." for lagged values requires the tsset command
R
1. To detect autocorrelation in R, first make sure that the data is ordered from
earliest to latest observation, and then proceed in the following steps:
# Estimate basic regression model
ClimateOLS = lm(Temp ~ Year)
# Save residuals
Err = resid(ClimateOLS)
# Plot residuals over time
plot(Year, Err)
# Generate lagged residual variable
LagErr = c(NA, Err[1:(length(Err)-1)])
# Auxiliary regression
LagErrOLS = lm(Err ~ LagErr)
# Display results
summary(LagErrOLS)
15. Wooldridge (2013, 425) notes that there is no clear benefit from iterating more than once.
2. To calculate Newey-West standard errors, use the NeweyWest function in the sandwich package (which we need to load with the library command and install the first time we use it; we describe how to install packages on page 86):
library(sandwich)
sqrt(diag(NeweyWest(ClimateOLS, lag = 3, prewhite = FALSE,
adjust = TRUE)))
where ClimateOLS is the OLS model estimated above. The Newey-West
command produces a variance-covariance matrix for the standard errors.
We use the diag function to pull out the relevant parts of it, and we then
take the square root of that. The prewhite and adjust subcommands
are set to produce the same results that the Stata Newey-West command
provides.
The rule of thumb is to set the number of lags equal to the fourth root of
the number of observations. (Yes, that seems a bit obscure, but that’s what
it is.) To calculate this in R use
length(X1)^(0.25)
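For intuition about what the NeweyWest function computes, here is a from-scratch sketch of the Bartlett-weighted Newey-West variance for a bivariate regression (Python for illustration; the data and function name are made up, and this simple version skips the small-sample adjustment that the adjust subcommand applies). Setting lags = 0 collapses it to the ordinary heteroscedasticity-robust (White) variance.

```python
import numpy as np

def newey_west_se(x, y, lags):
    """Newey-West (HAC) standard errors for bivariate OLS, Bartlett weights."""
    n = len(y)
    X = np.column_stack([np.ones(n), x])      # add an intercept
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta                          # residuals
    # "Meat" of the sandwich: sum of e_t^2 x_t x_t' plus Bartlett-weighted
    # cross products of residuals up to the chosen number of lags
    S = (X * e[:, None]).T @ (X * e[:, None])
    for l in range(1, lags + 1):
        w = 1 - l / (lags + 1)                # Bartlett weight
        G = (X[l:] * e[l:, None]).T @ (X[:-l] * e[:-l, None])
        S += w * (G + G.T)
    V = XtX_inv @ S @ XtX_inv                 # sandwich variance estimate
    return beta, np.sqrt(np.diag(V))

# Example with simulated AR(1) errors
rng = np.random.default_rng(1)
x = rng.normal(size=200)
u = np.zeros(200)
for t in range(1, 200):
    u[t] = 0.7 * u[t - 1] + rng.normal()
y = 1 + 2 * x + u

beta, se_nw = newey_west_se(x, y, lags=3)
_, se_white = newey_west_se(x, y, lags=0)     # lags=0 gives White SEs
print(beta, se_nw, se_white)
```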
Exercises
1. The Washington Post published data on bike share ridership (measured
in trips per day) over the month of January 2014. Bike share ridership
is what we want to explain. The Post also provided data on daily low temperatures.
(a) Use an auxiliary regression to assess whether the errors are autocorrelated.
(b) Estimate a model with Newey-West standard errors. Compare the
coefficients and standard errors to those produced by a standard OLS
model.
(c) Estimate a model that corrects for AR(1) autocorrelation using
the ρ-transformation approach.16 Are these results different from a
model in which we do not correct for AR(1) autocorrelation?
(a) Estimate a model of the federal funds rate, controlling for whether
the president was a Democrat, the number of quarters from the last
election, an interaction of the Democrat dummy variable and the
number of quarters from the last election, and inflation. Use a plot
and an auxiliary regression to assess whether there is first-order
autocorrelation.
(b) Estimate the model from part (a) with Newey-West standard errors.
Compare the coefficients and standard errors to those produced by a
standard OLS model.
(c) Estimate the model from part (a) by using the ρ-transformation
approach, and interpret the coefficients.
(d) Estimate the model from part (a), but add a variable for the lagged
value of the federal funds rate. Interpret the results, and use a plot
and an auxiliary regression to assess whether there is first-order
autocorrelation.
(e) Estimate the model from part (c) with the lagged dependent variable.
Use the ρ-transformation approach, and interpret the coefficients.
16. Stata users should use the subcommands as discussed in the Computing Corner.
17. As discussed in the Computing Corner, Stata needs us to specify a variable that indicates the
chronological order of the data. (Not all data sets are ordered sequentially from earliest to latest
observation.) The “date” variable in the data set for this exercise is not formatted to indicate order as
needed by Stata. Therefore, we need to create a variable indicating sequential order:
gen time = _n
which will be the observation number for each observation (which works in this case because the data
is sequentially ordered). Then we need to tell Stata that this new variable is our time series sequence
identifier with
tsset time
which allows us to proceed with Stata’s time series commands. In R, we can use the tools discussed
in the Computing Corner without necessarily creating the “time” variable.
3. The file BondUpdate.dta contains data on James Bond films from 1962 to
2012. We want to know how budget and ratings mattered for how well the
movies did at the box office. Table 13.6 describes the variables.

Table 13.6
GrossRev — Gross revenue, measured in millions of U.S. dollars and adjusted for inflation
Rating — Average rating by viewers on online review sites (IMDb and Rotten Tomatoes) as of April 2013
Budget — Production budget, measured in millions of U.S. dollars and adjusted for inflation
Actor — Name of main actor
Order — A variable indicating the order of the movies; we use this variable as our “time” indicator even though movies are not evenly spaced in time
(a) Estimate an OLS model in which the amount each film grossed is
the dependent variable and ratings and budgets are the independent
variables. Assess whether there is autocorrelation.
(b) Estimate the model from part (a) with Newey-West standard errors.
Compare the coefficients and standard errors to those produced by a
standard OLS model.
(c) Correct for autocorrelation using the ρ-transformation approach. Did
the results change? Did the autocorrelation go away?
(d) Now estimate a dynamic model. Find the short-term and (approximate) long-term effects of a 1-point increase in rating.
(e) Assess the stationarity of the revenue, rating, and budget variables.
(f) Estimate a differenced model and explain the results.
(g) Build from the above models to assess the worth (in terms of revenue)
of specific actors.
14 Advanced OLS
In this section, we derive the equation for β̂1 for a simplified regression model
and then show how β̂1 is unbiased if X and ε are not correlated. The simplified model is

Yi = β1 Xi + εi   (14.1)
Not having β0 in the model simplifies the derivation considerably while retaining
the essential intuition about how the assumptions matter.1
Our goal is to find the value of β̂1 that minimizes the sum of the squared
residuals; this value will produce a line that best fits the scatterplot. The residual
for a given observation is

ε̂i = Yi − β̂1 Xi

and the sum of squared residuals is Σ ε̂i² = Σ (Yi − β̂1 Xi)². We want to figure out what value of β̂1 minimizes this sum. A little simple
calculus does the trick. A function reaches a minimum or maximum at a point
where its slope is flat—that is, where the slope is zero. The derivative is the slope,
so we simply have to find the point at which the derivative is zero.2 The process is
the following:
1. We're actually just forcing β0 to be zero, which means that the fitted line goes through the origin.
In real life, we would virtually never do this; in real life, we probably would be working with a
multivariate model, too.
2. For any given “flat” spot, we have to figure out if we are at a peak or in a valley. It is very easy to do
this. Simply put, if we are at a peak, our slope should get more negative as X gets bigger (we go
downhill); if we are at a minimum, our slope should get bigger as X goes higher. The second
derivative measures changes in the derivative, so it must be negative for a flat spot to be a maximum
(and we need to be aware of things like “saddle points”—topics covered in any calculus book).
14.1 How to Derive the OLS Estimator and Prove Unbiasedness
Equation 14.3, then, is the OLS estimate for β̂1 in a model with no β0. It looks
quite similar to the equation for the OLS estimate of β̂1 in the bivariate model with
β0 (which is Equation 3.4 on page 49). The only difference is that here we do not
subtract X̄ from Xi and Ȳ from Yi. To derive Equation 3.4, we would do steps 1
through 7 using Σ ε̂i² = Σ (Yi − β̂0 − β̂1 Xi)², taking the derivative with respect
to β̂0 and with respect to β̂1 to produce two equations, which we would then solve
simultaneously.
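We can confirm numerically that the closed-form expression minimizes the sum of squared residuals: a brute-force search over candidate slopes lands on (essentially) the same value as the formula. A sketch in Python with made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(1, 5, size=50)
Y = 2.0 * X + rng.normal(size=50)

# Closed-form OLS slope for the model with no constant term
b_formula = np.sum(X * Y) / np.sum(X ** 2)

# Brute force: evaluate the sum of squared residuals on a fine grid of slopes
grid = np.linspace(0, 4, 40001)
ssr = [np.sum((Y - b * X) ** 2) for b in grid]
b_grid = grid[np.argmin(ssr)]

print(b_formula, b_grid)  # the two agree up to the grid spacing
```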
2. Use Equation 14.1 (which is the simplified model we're using here, in
which β0 = 0) to substitute for Yi:

β̂1 = Σ (β1 Xi + εi) Xi / Σ Xi²

3. Distribute the Xi and simplify, using the fact that Σ β1 Xi² = β1 Σ Xi²:

β̂1 = β1 + Σ εi Xi / Σ Xi²   (14.4)
In other words, β̂1 is β1 (the true value) plus an ugly fraction with sums of ε and
X in it.
From this point, we can show that β̂1 is unbiased. Here we need to show the
conditions under which the expected value of β̂1 = β1 . In other words, the expected
value of β̂1 is the value of β̂1 we would get if we repeatedly regenerated data
sets from the original model and calculated the average of all the β̂1 ’s estimated
from these multiple data sets. It’s not that we would ever do this—in fact, with
observational data the task is impossible. Instead, thinking of estimating β̂1 from
multiple realizations from the true model is a conceptual way for us to think about
whether the coefficient estimates on average skew too high, too low, or are just
right.
It helps the intuition to note that we could, in principle, generate the
expected value of β̂1 ’s for an experiment by running it over and over again and
calculating the average of the β̂1 ’s estimated. Or, more plausibly, we could run
a computer simulation in which we repeatedly regenerated data (which would
involve simulating a new i for each observation for each iteration) and calculating
the average of the β̂1 ’s estimated.
To show that β̂1 is unbiased, we use the formal statistical concept of expected
value. The expected value of a random variable is the average value of a large
number of realizations of the random variable—the value we expect the random
variable to be, on average. (For more discussion, see Appendix B on page 538.)

1. Take expectations of both sides of Equation 14.4:

E[β̂1] = E[β1] + E[Σ εi Xi / Σ Xi²]

2. Because β1 is a fixed number, E[β1] = β1.
3. Use the fact that E[k × g(ε)] = k × E[g(ε)] for constant k and random
function g(ε). Here 1/Σ Xi² is a constant (equaling 1 over whatever the sum
of Xi² is), and Σ εi Xi is a function of random variables (the εi's).

E[β̂1] = β1 + (1/Σ Xi²) E[Σ εi Xi]

4. We can move the expectation operator inside the summation because the
expectation of a sum is the sum of expectations:

E[β̂1] = β1 + (1/Σ Xi²) Σ E[εi Xi]   (14.5)
Equation 14.5 means that the expectation of β̂1 is the true value (β1) plus some
number 1/Σ Xi² times the sum of the E[εi Xi]'s. At this point, we use our Very Important
Condition, which is the exogeneity condition that εi and Xi be uncorrelated. We
show next that this condition is equivalent to saying that E[εi Xi] = 0, which means
Σ E[εi Xi] = 0, which will imply that E[β̂1] = β1, which is what we're trying to
show.

The correlation of Xi and εi is

correlation(Xi, εi) = cov(Xi, εi) / √(var(Xi) var(εi))

Setting this correlation (and hence the covariance) to zero and expanding the
covariance, E[(Xi − μX)(εi − με)] = 0, gives

E[Xi εi − Xi με − μX εi + μX με] = 0
4. Using the fact that the expectation of a sum is the sum of expectations, we
can rewrite the equation as

E[Xi εi] − E[Xi με] − E[μX εi] + E[μX με] = 0

5. Using the fact that με and μX are fixed numbers, we can pull them out of
the expectations. Because the error term has mean zero (με = 0), every term
but the first drops out, leaving

E[Xi εi] = 0
If E[Xi εi] = 0, Equation 14.5 tells us that the expected value of β̂1 will be β1.
In other words, if the error term and the independent variable are uncorrelated, the
OLS estimate β̂1 is an unbiased estimator of β1 . The same logic carries through
in the bivariate model that includes β0 and in multivariate OLS models as well.
Showing that β̂1 is unbiased does not say much about whether any given
estimate will be near β1 . The estimate β̂1 is a random variable after all, and it
is possible that some β̂1 will be very low and some will be very high. All that
unbiasedness says is that on average, β̂1 will not run higher or lower than the true
value.
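Unbiasedness is easy to see in a small Monte Carlo: hold X fixed, repeatedly redraw errors that are independent of X, and average the resulting estimates. A sketch in Python with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(3)
n, sims, beta1 = 50, 2000, 2.0
X = rng.uniform(1, 3, size=n)        # fixed across simulations

estimates = []
for _ in range(sims):
    eps = rng.normal(size=n)         # errors drawn independently of X
    Y = beta1 * X + eps              # true model with no constant
    estimates.append(np.sum(X * Y) / np.sum(X ** 2))

print(np.mean(estimates))            # averages out very close to beta1 = 2.0
```

Any single β̂1 can land well above or below 2, but the average across simulated data sets does not.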
REMEMBER THIS
1. We derive the β̂1 equation by setting the derivative of the sum of squared residuals equation to
zero and solving for β̂1 .
2. The key step in showing that β̂1 is unbiased depends on the condition that X and ε are
uncorrelated.
3. In a model that has a non-zero β0, the estimated constant coefficient would absorb any non-zero
mean in the error term. For example, if the mean of the error term is actually 5, the estimated constant
is 5 bigger than what it would be otherwise. Because we so seldom care about the constant term, it’s
reasonable to think of the β̂0 estimate as including the mean value of any error term.
14.2 How to Derive the Equation for the Variance of β̂1
In this section, we show how to derive an equation for the standard error of β̂1 .
This in turn reveals how we use the conditions that errors are homoscedastic and
uncorrelated with each other. Importantly, these assumptions are not necessary for
unbiasedness of OLS estimates. If these assumptions do not hold, we can still use
OLS, but we’ll have to do something different (as discussed in Chapter 13, for
example) to get the right standard error estimates.
We’ll combine two assumptions and some statistical properties of the variance
operator to produce a specific equation for the variance of β̂1 . We assume that the
Xi are fixed numbers and the ’s are random variables.
1. We start with the β̂1 equation (Equation 14.4) and take the variance of both
sides:

var[β̂1] = var[β1 + Σ εi Xi / Σ Xi²]

2. Use the fact that the variance of a sum of a constant (the true value
β1) and a function of a random variable is simply the variance of the
function of the random variable (see variance fact 1 in Appendix C
on page 539).

var[β̂1] = var[Σ εi Xi / Σ Xi²]

3. Note that 1/Σ Xi² is a constant (as we noted on page 497 too), and use
variance fact 2 (on page 540) that the variance of k times a random variable is
k² times the variance of that random variable.

var[β̂1] = (1/Σ Xi²)² var[Σ εi Xi]

4. Because the errors are uncorrelated with each other, the variance of the sum
is the sum of the variances:

var[β̂1] = (1/Σ Xi²)² Σ var[Xi εi]
5. Because the Xi are fixed numbers, var[Xi εi] = Xi² var[εi]:

var[β̂1] = (1/Σ Xi²)² Σ Xi² var[εi]

6. Homoscedasticity means that var[εi] = σ² for every observation:

var[β̂1] = (1/Σ Xi²)² Σ Xi² σ²
= σ² Σ Xi² / (Σ Xi²)²
= σ² / Σ Xi²   (14.6)

If the errors are heteroscedastic, we cannot pull a single σ² out of the sum.
Replacing var[εi] with the squared residual for each observation yields the
heteroscedasticity-consistent version of the variance:

var[β̂1] = (1/Σ Xi²)² Σ Xi² ε̂i²   (14.7)
REMEMBER THIS
1. We derive the variance of β̂1 by starting with the β̂1 equation.
2. If the errors are homoscedastic and not correlated with each other, the variance equation is in
a convenient form.
3. If the errors are heteroscedastic or correlated with each other, OLS estimates are still
unbiased, but the easy-to-use standard OLS equation for the variance of β̂1 is no longer
appropriate.
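Equation 14.6 can be checked by simulation: with fixed X's and homoscedastic, uncorrelated errors, the variance of β̂1 across repeated samples should match σ²/Σ Xi². A sketch in Python with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(4)
n, sims, beta1, sigma = 50, 5000, 2.0, 1.5
X = rng.uniform(1, 3, size=n)               # fixed regressors

theory = sigma ** 2 / np.sum(X ** 2)        # Equation 14.6

estimates = []
for _ in range(sims):
    eps = rng.normal(scale=sigma, size=n)   # homoscedastic, uncorrelated
    Y = beta1 * X + eps
    estimates.append(np.sum(X * Y) / np.sum(X ** 2))

print(np.var(estimates), theory)            # the two should be close
```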
14.3 How to Calculate Power

Pr(Type II error given β1 = β1True) = Pr( β̂1 / se(β̂1) < Critical value | β1 = β1True )   (14.8)

This probability will depend on the actual value of β1, since we know that the
distribution of β̂1 will depend on the true value of β1.

The key element of this equation is Pr( β̂1 / se(β̂1) < Critical value | β1 = β1True ).
This mathematical term seems complicated, but we actually know a fair bit about
it. For a large sample size, the t statistic, which is β̂1 / se(β̂1), will be normally
distributed with a variance of 1 around the true value divided by the standard error
of the estimated coefficient. And from the properties of the normal distribution
(see Appendix G on page 543 for a review), this means that

Pr(Type II error given β1 = β1True) = Φ( Critical value − β1True / se(β̂1) )   (14.9)

where Φ() indicates the normal cumulative distribution function (see page 420 for
more details).
Review Questions
1. For each of the following, indicate the power of the test of the null hypothesis H0 : β1 = 0 against
the alternative hypothesis of HA : β1 > 0 for a large sample size and α = 0.01 for the given true
value of β1 . We’ll assume se( β̂1 ) = 0.75. Draw a sketch to help explain your numbers.
(a) β1True = 1
(b) β1True = 2
2. Suppose the estimated se( β̂1 ) doubled. What will happen to the power of the test for the two
cases in question 1? First, answer in general terms. Then calculate specific answers.
3. Suppose se( β̂1 ) = 2.5. What is the probability of committing a Type II error for each of the
true values given for β1 in question 1?
4. And we can make the calculation a bit easier by using the fact that 1 − Φ(−Z) = Φ(Z) to write the power as Φ( β1True / se(β̂1) − Critical value ).
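The power formula in footnote 4 can be computed directly with the standard normal CDF; no tables needed. A sketch in Python (2.326 is the one-sided critical value for α = 0.01, as in the review questions):

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF, written via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def power(beta1_true, se, critical_value=2.326):
    # Power = Phi(beta1_true / se - critical value), per footnote 4
    return norm_cdf(beta1_true / se - critical_value)

print(power(1, 0.75))   # modest power when the true effect is small
print(power(2, 0.75))   # power rises with the true effect size
```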
14.4 How to Derive the Omitted Variable Bias Conditions
Suppose the true model is

Yi = β0 + β1 X1i + β2 X2i + νi   (14.10)

where Yi is the dependent variable, X1i and X2i are two independent variables, and
νi is an error term that is not correlated with any of the independent variables.
For example, suppose the dependent variable is test scores and the independent
variables are class size and family wealth. We assume (for this discussion) that νi
is uncorrelated with X1i and X2i.
What happens if we omit X2 and estimate the following model?

Yi = β0^OmitX2 + β1^OmitX2 X1i + εi^OmitX2   (14.11)

where we will use the OmitX2 superscript to indicate estimates from the model that omits
variable X2. How close will β̂1^OmitX2 (the coefficient on X1i in Equation 14.11) be to
the true value (β1 in Equation 14.10)? In other words, will β̂1^OmitX2 be an unbiased
estimator of β1? This situation is common with observational data because we
will almost always suspect that we are missing some variables that explain our
dependent variable.
The equation for β̂1^OmitX2 is the equation for a bivariate slope coefficient (see
Equation 3.4). It is

β̂1^OmitX2 = Σ (X1i − X̄1)(Yi − Ȳ) / Σ (X1i − X̄1)²   (14.12)
Will β̂1^OmitX2 be an unbiased estimator of β1? With a simple substitution and a
bit of rearranging, we can answer this question. We know from Equation 14.10 that
the true value of Yi is β0 + β1 X1i + β2 X2i + νi. Because the values of β are fixed,
the average of each is simply its value. That is, β̄0 = β0, and so forth. Therefore,
Ȳ will be β0 + β1 X̄1 + β2 X̄2 + ν̄. Substituting for Yi and Ȳ in Equation 14.12 and
doing some rearranging yields

β̂1^OmitX2 = Σ (X1i − X̄1)(β0 + β1 X1i + β2 X2i + νi − β0 − β1 X̄1 − β2 X̄2 − ν̄) / Σ (X1i − X̄1)²
= Σ (X1i − X̄1)(β1 (X1i − X̄1) + β2 (X2i − X̄2) + νi − ν̄) / Σ (X1i − X̄1)²

Gathering terms and recalling that Σ (X1i − X̄1)² / Σ (X1i − X̄1)² = 1 yields

β̂1^OmitX2 = β1 + β2 Σ (X1i − X̄1)(X2i − X̄2) / Σ (X1i − X̄1)² + Σ (X1i − X̄1)(νi − ν̄) / Σ (X1i − X̄1)²

Taking expectations—and noting that the last term has expectation zero because ν
is uncorrelated with X1—leaves us with

E[β̂1^OmitX2] = β1 + β2 Σ (X1i − X̄1)(X2i − X̄2) / Σ (X1i − X̄1)²   (14.13)
meaning that the expected value of β̂1^OmitX2 is β1 plus β2 times a messy fraction.
In other words, the estimate β̂1^OmitX2 will deviate, on average, from the true value,
β1, by β2 Σ (X1i − X̄1)(X2i − X̄2) / Σ (X1i − X̄1)².

Note that Σ (X1i − X̄1)(X2i − X̄2) / Σ (X1i − X̄1)² is simply the equation for the estimate of δ̂1 from
the following model:

X2i = δ0 + δ1 X1i + τi

See, for example, page 49, and note the use of X2i and X̄2 where we had Yi and Ȳ
in the standard bivariate OLS equation.
We can therefore conclude that our coefficient estimate β̂1^OmitX2 from the
model that omitted X2 will be an unbiased estimator of β1 if β2 δ̂1 = 0. This
condition is most easily satisfied if β2 = 0. In other words, if X2 has no effect
on Y (meaning β2 = 0), then omitting X2 does not cause our coefficient estimate
to be biased. This is excellent news. If it were not true, our model would have to
include variables that had nothing to do with Y. That would be a horrible way to
live.
The other way for β2 δ̂1 to be zero is for δ̂1 to be zero, which happens
whenever X1 would have a coefficient of zero in a regression in which X2
is the dependent variable and X1 is the independent variable. In short, if X1
and X2 are independent (such that regressing X2 on X1 yields a slope coefficient
of zero), then even though we omitted X2 from the model, β̂1^OmitX2 will
be an unbiased estimate of β1, the true effect of X1 on Y (from Equation 14.10).
No harm, no foul.
The flip side of these conditions is that when we estimate a model that
omits a variable that affects Y (meaning that β2 ≠ 0) and is correlated with
the included variable, OLS will be biased. The extent of the bias depends on
how much the omitted variable explains Y (which is determined by β2 ) and
how much the omitted variable is related to the included variable (which is
reflected in δ̂1).
What is the takeaway here? Omitted variable bias is a problem if both of the
following conditions are met: (1) the omitted variable actually matters (β2 ≠ 0)
and (2) X2 (the omitted variable) is correlated with X1 (the included variable).
This shorthand is remarkably useful in evaluating OLS models.
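The omitted variable bias algebra can be verified on simulated data: the coefficient from the model that omits X2 equals the full-model coefficient on X1 plus β̂2 times δ̂1 from the auxiliary regression, as an exact in-sample identity. A sketch in Python with made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
X1 = rng.normal(size=n)
X2 = 0.6 * X1 + rng.normal(size=n)              # X2 correlated with X1
Y = 1 + 2 * X1 + 3 * X2 + rng.normal(size=n)    # true model: beta1=2, beta2=3

def ols(y, *xs):
    """OLS coefficients (constant first) via least squares."""
    X = np.column_stack([np.ones(len(y))] + list(xs))
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_full = ols(Y, X1, X2)      # [const, beta1_hat, beta2_hat]
b_omit = ols(Y, X1)[1]       # slope from the model that omits X2
delta1 = ols(X2, X1)[1]      # auxiliary regression of X2 on X1

print(b_omit, b_full[1] + b_full[2] * delta1)   # identical up to rounding
```

Here β2 > 0 and the variables are positively correlated, so the omitting model's slope is biased upward.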
14.5 Anticipating the Sign of Omitted Variable Bias
REMEMBER THIS
The conditions for omitted variable bias can be derived by substituting the true value of Y into the β̂1
equation for the model with X2 omitted.
Suppose, for example, that we want to know the effect of education on income and estimate

Incomei = β0 + β1 Educationi + εi   (14.14)

where Incomei is the monthly salary or wages of individual i and Educationi is the
number of years of schooling individual i completed. We are worried, as usual,
that certain factors in the error term are correlated with education.
We worry, for example, that some people are more productive than others
(a factor in the error term that affects income) and that productive folks are more
likely to get more schooling (school may be easier for them). In other words, we
fear the true equation is
6. Another option is to use panel data that allows us to control for certain unmeasured factors, as we
did in Chapter 8. Or we can try to find exogenous variation in education (variation in education that is
not due to differences in productivity); that’s what we did in Chapter 9.
TABLE 14.1 Sign of Omitted Variable Bias

The bias on β̂1 is β2 times the relationship between X1 and X2:
• β2 > 0 and X1, X2 positively correlated: positive bias
• β2 > 0 and X1, X2 negatively correlated: negative bias
• β2 < 0 and X1, X2 positively correlated: negative bias
• β2 < 0 and X1, X2 negatively correlated: positive bias

Cell entries show the sign of bias for an omitted variable bias problem in which a single variable (X2) is omitted.
The true equation is Equation 14.10 and the estimated model is Equation 14.11. If β2 > 0 and X1 and X2 are
positively correlated, β̂1^OmitX2 (the expected value of the coefficient on X1 from a model that omits X2) will be
larger than the actual value of β1.
Hence, the bias will be positive: it is β2 > 0 (the effect of productivity on
income) times the positive relationship between productivity and education. A
positive bias implies that omitting productivity inflates the estimated coefficient
on education. In other words, the effect of education on income in a model that
does not control for productivity will be overstated. The magnitude of the bias
will be related to how strong these two components are.
If we think productivity has a huge effect on income and is strongly related to
education levels, then the size of the bias is large.
In this example, this bias would lead us to be skeptical of a result from a
model like Equation 14.14 that omits productivity. In particular, if we were to find
that β̂1 is greater than zero, we would worry that the omitted variable bias had
inflated the estimate. On the other hand, if the results showed that education did
not matter or had a negative coefficient, we would be more confident in our results
because the bias would on average make the results larger than the true value, not
smaller. This line of reasoning, called “signing the bias,” would lead us to treat the
estimated effects based on Equation 14.14 as an upper bound on the likely effects
of education on income.
Table 14.1 summarizes the relationship for the simple case of one omitted
variable. If X2, the omitted variable, has a positive effect on Y (meaning β2 > 0)
and X2 and X1 are positively correlated, then a model with only X1 will produce
a coefficient on X1 that is biased upward: the estimate will be too big because
some of the effect of the unmeasured X2 will be absorbed by the variable X1.
REMEMBER THIS
We can use the equation for omitted variable bias to anticipate the effect of omitting a variable on the
coefficient estimate for an included variable.
14.6 Omitted Variable Bias with Multiple Variables
Discussion Questions
1. Suppose we are interested in knowing how much social media affect people’s income. Suppose
also that Facebook provided us data on how much time each individual spent on the site during
work hours. The model is
What is the implication of not being able to measure innate productivity for our estimate
of β1 ?
2. Suppose we are interested in knowing the effect of campaign spending on election outcomes.
We believe that the personal qualities of a candidate also matter. Some are more charming
and/or hardworking than others, which may lead to better election results for them. What is the
implication of not being able to measure “candidate quality” (which captures how charming
and hardworking candidates are) for our estimate of β1 ?
Assuming that the error in the true model (ν) is not correlated with any of the
independent variables, the expected value for β̂1^OmitX3 is

E[β̂1^OmitX3] = β1 + β3 × [(r31 − r21 r32) / (1 − r21²)] × √(V3 / V1)   (14.18)

where r31 is the correlation of X3 and X1, r21 is the correlation of X2 and X1, r32
is the correlation of X3 and X2, and V3 and V1 are the variances of X3 and X1,
respectively. Clearly, there are more moving parts in this case than in the case we
discussed earlier.
REMEMBER THIS
When there are multiple variables in the true equation, the effect of omitting one of them depends in
a complicated way on the interrelations of all variables.
1. As in the simpler model, if the omitted variable does not affect Y, there is no omitted variable
bias.
2. The equation for omitted variable bias when the true equation has only two variables often
provides a reasonable approximation of the effects for cases in which there are multiple
independent variables.
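The correlation expression in Equation 14.18 can be checked on simulated data: computed from sample correlations and standard deviations, it reproduces, as an exact in-sample identity, the gap between the coefficient that omits X3 and the full-model coefficient. A sketch in Python with made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2000
X1 = rng.normal(size=n)
X2 = 0.5 * X1 + rng.normal(size=n)
X3 = 0.4 * X1 + 0.3 * X2 + rng.normal(size=n)
Y = 1 + 2 * X1 + 1 * X2 + 3 * X3 + rng.normal(size=n)

def ols(y, *xs):
    X = np.column_stack([np.ones(len(y))] + list(xs))
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_full = ols(Y, X1, X2, X3)     # [const, b1, b2, b3]
b_omit = ols(Y, X1, X2)[1]      # coefficient on X1 when X3 is omitted

# Sample versions of the ingredients in Equation 14.18
r21 = np.corrcoef(X2, X1)[0, 1]
r31 = np.corrcoef(X3, X1)[0, 1]
r32 = np.corrcoef(X3, X2)[0, 1]
sd_ratio = np.std(X3) / np.std(X1)
bias = b_full[3] * (r31 - r21 * r32) / (1 - r21 ** 2) * sd_ratio

print(b_omit, b_full[1] + bias)   # identical up to rounding
```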
14.7 Measurement Error

Suppose the true model is

Yi = β0 + β1 X1i∗ + εi   (14.19)

but we observe X1i∗ only with error:

X1i = X1i∗ + νi   (14.20)

where we assume that νi is uncorrelated with X1i∗. This little equation will do a lot
of work for us in helping us understand the effect of measurement error.

Substituting X1i∗ = X1i − νi into the true model yields

Yi = β0 + β1 (X1i − νi) + εi
= β0 + β1 X1i − β1 νi + εi   (14.21)

Let's treat ν as the omitted variable and −β1 as the coefficient on the omitted
variable. (Compare these to X2 and β2 in Equation 5.7.) Doing so allows us to
write the omitted variable bias equation as

β1^OmitX2 = β1 − β1 cov(X1, ν) / var(X1)   (14.22)

Because cov(X1, ν) = σν² and var(X1) = σX∗² + σν², this becomes

β1^OmitX2 = β1 − β1 σν² / (σX∗² + σν²)   (14.23)

which we can rewrite as

plim β̂1 = β1 (1 − σν² / (σν² + σX∗²))
7. First, note that cov(X1, ν) = cov(X1∗ + ν, ν) = cov(X1∗, ν) + cov(ν, ν) = cov(ν, ν) because ν is not correlated with X1∗. Finally, note that cov(ν, ν) = σν² by standard rules of covariance.
Finally, we use the fact that 1 − σν² / (σν² + σX∗²) = σX∗² / (σν² + σX∗²) to produce

plim β̂1 = β1 × σX∗² / (σν² + σX∗²)
REMEMBER THIS
1. We can use omitted variable logic to derive the effect of a poorly measured independent
variable.
2. A single poorly measured independent variable can cause other coefficients to be biased.
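The attenuation result is easy to see by simulation: adding measurement error with variance σν² = 0.25 to a regressor with variance σX∗² = 1 should shrink the slope by the factor 1/(1 + 0.25) = 0.8. A sketch in Python with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(7)
n, beta1 = 100_000, 2.0
X_star = rng.normal(size=n)              # true regressor, variance 1
nu = rng.normal(scale=0.5, size=n)       # measurement error, variance 0.25
X_obs = X_star + nu                      # mismeasured regressor we observe
Y = 1 + beta1 * X_star + rng.normal(size=n)

slope = np.cov(X_obs, Y)[0, 1] / np.var(X_obs, ddof=1)
attenuation = 1.0 / (1.0 + 0.25)         # sigma_X*^2 / (sigma_X*^2 + sigma_nu^2)
print(slope, beta1 * attenuation)        # both close to 1.6
```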
[Figure: causal diagram. X1, the independent variable (example: 9th-grade tutoring), affects the post-treatment variable X2 (example: 12th-grade reading score) by α and the dependent variable Y (example: age-26 earnings) directly by γ1. X2 affects Y by γ2. An unobserved confounder U (example: intelligence) affects X2 by ρ1 and Y by ρ2.]
If we estimate a model with only the tutoring variable,

Earningsi = β0 + β1 Tutori + εi

then β̂1, the estimated coefficient on X1, will in expectation equal the true effect
of X1, which is γ1 + αγ2.

Our interest here is in what happens when we include a post-treatment
variable:

Earningsi = β0 + β1 Tutori + β2 ReadingScorei + εi   (14.26)

In this case, we can work out the expected values of the estimated coefficients
β̂1 and β̂2 . First, note that in the true model, the effect on Earnings of a one-unit
increase in Tutor is γ1 + γ2 α. The direct effect is γ1 , and the indirect effect is γ2 α
(because tutoring also affects reading by α and reading affects earnings by γ2 ).
Also note that in the true model, the effect on Earnings of a one-unit increase
in Intelligence is ρ2 + γ2ρ1. The direct effect of intelligence is ρ2, and the indirect
effect of intelligence is γ2 ρ1 (because intelligence also affects reading by ρ1 and
reading affects earnings by γ2 ).
We first substitute the true equation for reading scores (Equation 14.24) into
the estimated equation for earnings (14.26), producing
TABLE 14.2 Parameter values and the resulting expected coefficient values

Row   γ1   ρ1     ρ2   α   γ2   E[β̂1]   E[β̂2]
1     1    1      1    1   1    0        2
2     1    0.5    1    1   1    −1       3
3     1    0.01   1    1   1    −99      101
4     1    1      5    1   1    −4       6
8. If you must know, do the following: (1) isolate β1 on the left-hand side of the equation: E[β̂1] = E[−β̂2 α + γ1 + γ2 α]; (2) substitute for E[β̂2]: −α(γ2 + ρ2/ρ1) + γ1 + γ2 α = γ1 − α ρ2/ρ1.
Table 14.2 shows several parameter combinations and the expected values of the coefficients from a model with the
independent variable and post-treatment variable both included. The first line has
an extremely simple case in which α, the ρ's, and the γ's all equal 1. The actual direct
effect of X1 is 1, but the expected value of the coefficient on X1 will be 0. Not
great. In row 2, we set the effect of U on X2 to 0.5, and the expected value
of the coefficient on X1 falls to −1 even though the actual direct effect is still 1.
In row 3, we set the effect of U on X2 to 0.01, and now things get really crazy:
the expected value of the effect of X1 plummets to −99 even though the true direct
effect (γ1) is still just 1. This is nuts! Row 4 shows another example, still not good.
Exercise 4 in Chapter 7 provides a chance to simulate more examples.
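The first row of the table can be reproduced by simulation. A sketch in Python; giving the reading-score equation no error term of its own matches the exact expected values in footnote 8:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 50_000
alpha, rho1, rho2, gamma1, gamma2 = 1.0, 1.0, 1.0, 1.0, 1.0   # row 1 values

Tutor = rng.normal(size=n)                  # X1, randomly assigned
U = rng.normal(size=n)                      # unobserved intelligence
Reading = alpha * Tutor + rho1 * U          # X2, the post-treatment variable
Earnings = gamma1 * Tutor + gamma2 * Reading + rho2 * U + rng.normal(size=n)

def ols(y, *xs):
    X = np.column_stack([np.ones(len(y))] + list(xs))
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_total = ols(Earnings, Tutor)[1]           # about gamma1 + alpha*gamma2 = 2
b_post = ols(Earnings, Tutor, Reading)[1]   # about gamma1 - alpha*rho2/rho1 = 0

print(b_total, b_post)
```

Conditioning on the post-treatment reading score wipes out the estimated tutoring effect even though the true direct effect is 1.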
Conclusion
OLS goes a long way with just a few assumptions about the model and the
error terms. Exogeneity gets us unbiased estimates if there are no post-treatment
variables. Homoscedasticity and non-correlated errors get us an equation for the
variance of our estimates.
How important is it to be able to know exactly how these assumptions come
together to provide all this good stuff? On a practical level, not very. We can
go about most of our statistical business without knowing how to derive these
results.
On a deeper level, though, it is useful to know how the assumptions matter.
The statistical properties of OLS are not magic. They’re not even that hard, once
we break the derivations down step by step. The assumptions we rely on play
specific roles in figuring out the properties of our estimates, as we have seen in the
derivations in this chapter. We also formalized and extended our understanding of
bias. First, we focused on omitted variable bias, deriving the omitted variable bias
conditions and exploring how omitted variable bias arises in various contexts. Then we
derived post-treatment collider bias for a reasonably general context.
We don’t need to be able to produce all the derivations from scratch. If we can
do the following, we will have a solid understanding of the statistical foundations
of OLS:
• Section 14.1: Explain the steps in deriving the equation for the OLS
estimate of β̂1 . What assumption is crucial for β̂1 to be an unbiased
estimator of β1 ?
• Section 14.3: Show how to calculate power for a given true value of β.
• Section 14.4: Show how to derive the omitted variable bias equation.
• Section 14.5: Show how to use the omitted variable bias equation to “sign
the bias.”
• Section 14.6: Explain how omitted variable bias works when the true model
contains multiple variables.
• Section 14.7: Show how to use omitted variable bias tools to characterize
the effect of measurement error.
Further Reading
See Clarke (2005) for further details on omitted variables. Greene (2003, 148)
offers a generalization that uses matrix notation.
Greene (2003, 86) also discusses the implications of measurement error when
the model contains multiple independent variables. Cragg (1994) provides an
accessible overview of problems raised by measurement error and offers strategies
for dealing with them.
Key Term
Expected value (496)
Computing Corner
Stata
1. To estimate OLS models, use the tools discussed in the Computing Corner
in Chapter 5.
R
1. To estimate OLS models, use the tools discussed in the Computing Corner
in Chapter 5.
Exercises
1. Apply the logic developed in this chapter to the model Yi = β0 + β1 Xi + i .
(There was no β0 in the simplified model we used in Section 14.1.) Derive
the OLS estimate for β̂0 and β̂1 .
2. Show that the OLS estimate β̂1 is unbiased for the model Yi = β0 +
β1 Xi + i .
(a) Run a model with medals as the dependent variable and population
as the independent variable, and briefly interpret the results.
(b) The model given omits GDP (among other things). Use tools
discussed in Section 14.5 to anticipate the sign of the omitted variable
bias for β̂1 in the results in part (a) that is due to omission of GDP
from that model.
(c) Estimate a model explaining medals with both population and GDP.
Was your prediction about omitted variable bias correct?
(d) Note that we have also omitted a variable for whether a country is
the host for the Winter Olympics. Sign the bias of the coefficient
on population in part (a) that is due to omission of the host country
variable.
Variables in the Winter Olympics data (excerpt):
time: A time variable equal to 1 for first Olympics in data set (1980), 2 for second Olympics (1984), and so forth. Useful for time series analysis.
medals: Total number of combined medals won
host: Dummy variable indicating if country hosted Olympics in that year (1 = hosted, 0 = otherwise)
temp: Average high temperature (in Fahrenheit) in January (in July for countries in the Southern Hemisphere)
elevation: Highest peak elevation in the country
(e) Estimate a model explaining medals with both population and host
(do not include GDP at this point). Was your prediction about
omitted variable bias correct?
(g) Use tips in the Computing Corner to create a new GDP variable
called NoisyGDP that is equal to the actual GDP plus a standard
normally distributed random variable. Think of this as a measure of
GDP that has been corrupted by a measurement error. (Of course,
the actual GDP variable itself is almost certainly tainted by some
measurement error already.) Estimate the model from part (f), but
use NoisyGDP instead of GDP. Explain changes in the coefficient
on GDP, if any.
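As a rough guide to what part (g) should produce, the sketch below simulates attenuation bias in Python (the Computing Corners use Stata and R; this is only an illustration). All numbers here — the true slope of 2, unit-variance regressor, and standard normal measurement error — are assumptions, not values from the exercise's data.

```python
# Sketch of attenuation bias from measurement error in a regressor.
# The setup (true slope, variances) is hypothetical, chosen so the
# attenuation factor var(x) / (var(x) + var(noise)) equals 1/2.
import random

random.seed(42)
n = 20000
beta = 2.0
x = [random.gauss(0, 1) for _ in range(n)]
y = [beta * xi + random.gauss(0, 1) for xi in x]
x_noisy = [xi + random.gauss(0, 1) for xi in x]  # regressor corrupted by noise

def ols_slope(xs, ys):
    """Bivariate OLS slope: cov(x, y) / var(x)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

# The slope on the noisy regressor shrinks toward zero by roughly
# var(x) / (var(x) + var(noise)), here about one-half.
print(ols_slope(x, y), ols_slope(x_noisy, y))
```

The same logic explains why the coefficient on NoisyGDP should move toward zero relative to the coefficient on GDP.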
(b) Use the standard error from your results to calculate the statistical
power of a test of H0 : βruns_scored = 0 versus HA : βruns_scored > 0
(c) Suppose we had much less data than we actually do, such that the
standard error on the coefficient on runs_scored were 900 (which
is much larger than what we estimated). Use the standard error
of 900 to calculate the statistical power of a test of
H0 : βruns_scored = 0 versus HA : βruns_scored > 0 with α = 0.05
(assuming a large sample for simplicity) for the three cases described
in part (b).
(d) Suppose we had much more data than we actually do, such that
the standard error on the coefficient on runs_scored were 200 (which
is much smaller than what we estimated). Use the standard error
of 200 to calculate the statistical power of a test of
H0 : βruns_scored = 0 versus HA : βruns_scored > 0 with α = 0.05
(assuming a large sample for simplicity) for the three cases described
in part (b).
(e) Discuss the differences across the power calculations for the different
standard errors.
15 Advanced Panel Data
15.1 Panel Data Models with Serially Correlated Errors
the stuff in the error term? Lots of that will stick around for a while. Unmeasured
factors in year 1 may linger to affect what is going on in year 2, and so on. In this
section, we explain how to deal with autocorrelation in panel models, first without
fixed effects and then with fixed effects.
Before we get into diagnosing and addressing the problem, let’s recall the
stakes. Autocorrelation does not cause bias in the standard OLS framework, but it
does cause OLS estimates of standard errors to be incorrect. In fact, it often causes
the OLS estimates of standard errors to be too small because we don’t really have
the number of independent observations that OLS thinks we do.
Yit = β0 + β1 X1it + · · · + εit
εit = ρ εi,t−1 + νit

where νit is a mean-zero, random error term that is not correlated with the
independent variables. There are N units and T time periods in the panel data
set. We limit ourselves to first-order autocorrelation (where the error this period is
a function of the error last period). The tools we discuss generalize pretty easily
to higher orders of autocorrelation.1
Estimation is relatively simple. First, we use standard OLS to estimate the
model. We then use the residuals from the OLS model to test for evidence of
autocorrelated errors. This works because OLS β̂ estimates are unbiased even if
errors are autocorrelated, which means that the residuals (which are functions of
the data and β̂) are unbiased estimates, too.
We test for autocorrelated errors in this context using something called a
Lagrange multiplier (LM) test. The LM test is similar to our test for autocorrelation
in Chapter 13 on page 465. It involves estimating the following:

ε̂it = ρ ε̂i,t−1 + γ1 X1it + · · · + ηit

where ηit (η is the Greek letter eta) is a mean-zero, random error term. We use
the fact that N × R² from this auxiliary regression is distributed χ²₁ under the null
hypothesis of no autocorrelation.
If the LM test indicates autocorrelation, we will use ρ-transformation
techniques we discussed in Section 13.3 to estimate an AR(1) model.
1. A second-order autocorrelated process would also have the error in period t correlated with the error in period t − 2, and so on.
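The LM procedure can be sketched in a small Python simulation (the book's Computing Corners use Stata and R; this is only an illustration, and it omits the X terms from the auxiliary regression for brevity). The true model here, with slope 0.5 and AR(1) parameter 0.6, is entirely made up.

```python
# Rough simulation of the LM test: fit OLS, regress the residuals on
# their lag, and compare N * R^2 to the chi-squared(1) critical value 3.84.
import random

random.seed(7)
T = 2000
rho = 0.6  # hypothetical AR(1) parameter for the errors
e = [random.gauss(0, 1)]
for _ in range(T - 1):
    e.append(rho * e[-1] + random.gauss(0, 1))
x = [random.gauss(0, 1) for _ in range(T)]
y = [1.0 + 0.5 * x[t] + e[t] for t in range(T)]  # made-up true model

def ols(xs, ys):
    """Return (intercept, slope) from a bivariate OLS fit."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((a - mx) * (c - my) for a, c in zip(xs, ys)) / \
        sum((a - mx) ** 2 for a in xs)
    return my - b * mx, b

a_hat, b_hat = ols(x, y)
resid = [y[t] - a_hat - b_hat * x[t] for t in range(T)]

# Auxiliary regression: residual on lagged residual; compute its R^2.
r_now, r_lag = resid[1:], resid[:-1]
a2, b2 = ols(r_lag, r_now)
fitted = [a2 + b2 * rl for rl in r_lag]
mean_r = sum(r_now) / len(r_now)
r2 = sum((f - mean_r) ** 2 for f in fitted) / \
     sum((v - mean_r) ** 2 for v in r_now)
lm_stat = len(r_now) * r2
print(lm_stat, lm_stat > 3.84)  # large statistic: reject no-autocorrelation
```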
The de-meaned error term for unit i will include the mean of the error terms for
unit i (ε̄i· ), which in turn means that 1/T of any given error term will appear in all
of unit i’s de-meaned error terms. This means that εi1 (the raw error in the first
period) is in the first de-meaned error term, the second de-meaned error term, and
so on via the ε̄i· term. The result will be at least a little autocorrelation because
the de-meaned error terms in the first and second periods, for example, will move
together at least a little bit: both contain some of the same terms.
Therefore, to test for AR(1) errors in a panel data model with fixed effects,
we need to use robust errors that account for autocorrelation.
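The mechanical point above can be checked with a small Python simulation (an illustration, not the book's code): de-meaning errors that are truly independent within a unit induces a small negative correlation between adjacent de-meaned errors, about −1/(T − 1). The panel dimensions here are arbitrary.

```python
# De-meaning i.i.d. errors within each unit induces a little negative
# autocorrelation, roughly -1/(T-1). N and T below are hypothetical.
import random

random.seed(1)
N, T = 5000, 5
pairs = []  # (demeaned e_t, demeaned e_{t-1}), pooled across units
for _ in range(N):
    e = [random.gauss(0, 1) for _ in range(T)]
    ebar = sum(e) / T
    d = [v - ebar for v in e]
    pairs.extend(zip(d[1:], d[:-1]))

def corr(xs, ys):
    """Sample correlation of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

now = [p[0] for p in pairs]
lag = [p[1] for p in pairs]
print(corr(now, lag))  # close to -1/(T-1) = -0.25 here
```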
REMEMBER THIS
To estimate panel models that account for autocorrelated errors, proceed as follows:
1. Estimate an initial model that does not address autocorrelation. This model can be either an
OLS model or a fixed effects model.
2. Use residuals from the initial model to test for autocorrelation, and apply the LM test based on
the R² from the following model:

ε̂it = ρ ε̂i,t−1 + γ1 X1it + · · · + ηit
3. If we reject the null hypothesis of no autocorrelation (which will happen when the R2 in the
equation above is high), then we should remove the autocorrelation by ρ-transforming the data
as discussed in Chapter 13.
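The ρ-transformation in step 3 can be sketched in a few lines of Python (an illustration of the Chapter 13 quasi-differencing idea, not the book's Stata/R code; the series and ρ̂ below are made up).

```python
# Sketch of the rho-transformation (quasi-differencing): given rho_hat,
# replace each observation y_t with y_t - rho_hat * y_{t-1}, then re-run
# OLS on the transformed y and X series.
def rho_transform(series, rho_hat):
    """Return y_t - rho_hat * y_{t-1}; this simple version drops the
    first observation (a Prais-Winsten step would rescale it instead)."""
    return [series[t] - rho_hat * series[t - 1] for t in range(1, len(series))]

y = [2.0, 2.5, 3.1, 2.9, 3.6]  # hypothetical series
print(rho_transform(y, 0.5))
```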
Yit = γ Yi,t−1 + β0 + β1 X1it + β2 X2it + εit (15.1)

where γ is the effect of the lagged dependent variable, the β’s are the immediate
effects of the independent variables, and εit is uncorrelated with the independent
variables and homoscedastic.
We see how tricky this model is when we try to characterize the effect of
X1it on Yit . Obviously, if X1it increases by one unit, there will be a β1 increase in
Yit that period. Notice, though, that an increase in Yit in one period affects Yit in
future periods via the γYi,t−1 term in the model. Hence, increasing X1it in the first
period, for example, will affect the value of Yit in the first period, which will then
affect Y in the next period. In other words, if we change X1it , we get not only β1
more Yit but also γ × β1 more Y in the next period, and so on. That is, change in
X1it today dribbles on to affect Y forever through the lagged dependent variable in
Equation 15.1.
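The "dribbling on" of effects can be added up explicitly: a one-period unit increase in X raises Y by β1 now, γβ1 next period, γ²β1 after that, and so on, for a cumulative effect of β1/(1 − γ). A quick Python check with made-up parameter values:

```python
# Cumulative effect of a one-period unit increase in X in a dynamic model:
# beta1 + gamma*beta1 + gamma^2*beta1 + ... = beta1 / (1 - gamma).
# The parameter values are hypothetical.
gamma, beta1 = 0.5, 2.0
effect, total = beta1, 0.0
for _ in range(200):  # each later period's effect decays by a factor gamma
    total += effect
    effect *= gamma
print(total, beta1 / (1 - gamma))  # the two agree
```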
As a practical matter, including a lagged dependent variable is a double-edged
sword. On the one hand, it is often highly significant, which is good news in that we
have a control variable that soaks up variance that’s unexplained by other variables.
On the other hand, the lagged dependent variable can be too good—so highly
significant that it sucks the significance out of the other independent variables. In
fact, if there is serial autocorrelation and trending in the independent variable,
including a lagged dependent variable causes bias. In such a case, Princeton
political scientist Chris Achen (2000, 7) has noted, the lagged dependent
variable can dominate the regression, artificially suppressing the explanatory
power of the other independent variables.
This conclusion does not mean that lagged dependent variables are evil, but
rather that we should tread carefully when we are deciding whether to include
them. In particular, we should estimate models both with them and without. If
results differ substantially, we should decide to place more weight on the model
with or without the lagged dependent variable only after we’ve run all the tests
and absorbed the logic described next.
The good news is that if the errors are not autocorrelated, using OLS for
a model with lagged dependent variables works fine. Given that the lagged
dependent variable commonly soaks up any serial dependence in the data, this
approach is reasonable and widely used.2
If the errors are autocorrelated, however, OLS will produce biased estimates
of β when a lagged dependent variable is included. In this case, autocorrelation
does more than render conventional OLS standard error estimates
inappropriate—autocorrelation in models with lagged dependent variables
actually messes up the estimates. This bias is worth mulling over a bit. It happens
because models with lagged dependent variables are outside the conventional
2. See Beck and Katz (2011).
OLS framework. Hence, even though autocorrelation does not cause bias in OLS
models, autocorrelation can cause bias in dynamic models.
Why does autocorrelation cause bias in a model when we include a lagged
dependent variable? It’s pretty easy to see: Yi,t−1 of course contains εi,t−1 . And
if εi,t−1 is correlated with εit (which is exactly what first-order autocorrelation
implies), then one of the independent variables in Equation 15.1, Yi,t−1 , will be
correlated with the error.
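The size of this bias can be startling. The Python simulation below (an illustration with made-up parameters, not the book's code) uses γ = 0.5 and error autocorrelation ρ = 0.5; OLS on the lagged dependent variable converges to roughly 0.8 rather than 0.5.

```python
# Simulation sketch: with AR(1) errors, the OLS coefficient on a lagged
# dependent variable is biased upward. True gamma and rho are both 0.5
# (hypothetical values); the OLS estimate lands near 0.8.
import random

random.seed(3)
T, gamma, rho = 50000, 0.5, 0.5
y, e = [0.0], 0.0
for _ in range(T):
    e = rho * e + random.gauss(0, 1)   # autocorrelated error
    y.append(gamma * y[-1] + e)        # dynamic model without X, for clarity

lag, now = y[:-1], y[1:]
ml, mn = sum(lag) / T, sum(now) / T
g_hat = sum((a - ml) * (b - mn) for a, b in zip(lag, now)) / \
        sum((a - ml) ** 2 for a in lag)
print(g_hat)  # well above the true gamma of 0.5
```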
This problem is not particularly hard to deal with. Suppose there is no
autocorrelation. In that case, OLS estimates are unbiased, meaning that the
residuals from the OLS model are consistent, too. We can therefore use these
residuals in an LM test like the one we described earlier (on page 519). If we
fail to reject the null hypothesis (which is quite common since lagged dependent
variables often zap autocorrelation), then OLS it is. If we reject the null hypothesis
of no autocorrelation, we can use an AR(1) model like the one discussed in Chapter
13 to rid the data of autocorrelation and thereby get us back to unbiased and
consistent estimates.
Two ways to estimate dynamic panel data models with fixed effects
What to do? One option is to follow instrumental variable (IV) logic, covered in
Chapter 9. In this context, the IV approach relies on finding some variable that
is correlated with the independent variable in question and not correlated with
the error. Most IV approaches rely on using lagged values of the independent
variables, which are typically correlated with the independent variable in question
but not correlated with the error, which happens later. The Arellano and Bond
(1991) approach, for example, uses all available lags as instruments. These models
are quite complicated and, like many IV models, imprecise.
Another option is to use OLS, accepting some bias in exchange for better
accuracy and less complexity. While we have talked a lot about bias, we have not
yet discussed the trade-off between bias and accuracy, largely because in basic
models such as OLS, unbiased models are also the most accurate, so we don’t
have to worry about the trade-off. But in more complicated models, it is possible
to have an estimator that produces coefficients that are biased but still pretty close
to the true value. It is also possible to have an estimator that is unbiased but very
imprecise. IV estimators are in the latter category—they are, on average, going to
get us the true value, but they have higher variance.
Here’s a goofy example of the trade-off between bias and accuracy. Consider
two estimators of average height in the United States. The first is the height of
a single person, randomly sampled. This estimator is unbiased—after all, the
average of this estimator will have to be the average of the whole population.
Clearly, however, this estimator isn’t very precise because it is based on a single
person. The second estimator of average height in the United States is the average
height of 500 randomly selected people, but measured with a measuring stick that
is inaccurate by 0.25 inch (making every measurement a quarter-inch too big).3
Which estimate of average height would we rather have? The second one may
well make up what it loses in bias by being more precise. That’s the situation here
because the OLS estimate is biased but more precise than the IV estimates.
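The height example can be made numeric with mean squared error, which combines variance and squared bias. The population standard deviation of roughly 4 inches below is an assumption added for illustration; it is not from the text.

```python
# Mean squared error comparison for the height example:
# MSE = variance + bias^2. Assumes a (hypothetical) population standard
# deviation of 4 inches.
sigma = 4.0
mse_one_person = sigma ** 2 / 1                   # unbiased, huge variance
mse_biased_stick = sigma ** 2 / 500 + 0.25 ** 2   # small variance + bias^2
print(mse_one_person, mse_biased_stick)
```

Under this assumption the biased 500-person estimator has a far smaller MSE, which is exactly the trade-off at work when we prefer OLS to IV here.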
Nathaniel Beck and Jonathan Katz (2011) have run a series of simulations of
several options for estimating models with lagged dependent variables and fixed
effects. They find that OLS performs better in that it’s actually more likely to
produce estimates close to the true value than the IV approach, even though OLS
estimates are a bit biased. The performance of OLS models improves relative to
the IV approach as T increases.
H. L. Mencken said that for every problem there is a solution that is simple,
neat, and wrong. Usually that’s a devastating critique. Here it is a compliment.
OLS is simple. It is neat. And yet, it is wrong in the sense of being biased when
we have a lagged dependent variable and fixed effects. But OLS is more accurate
(meaning the variance of β̂1 is smaller) than the alternatives, which nets out to a
pretty good approach.
3. Yes, yes, we could subtract the quarter-inch from all the height measurements. Work with me here. We’re trying to make a point!
REMEMBER THIS
1. Researchers often include lagged dependent variables to account for serial dependence. A
model with a lagged dependent variable is called a dynamic model.
(a) Dynamic models differ from conventional OLS models in many respects.
(b) In a dynamic model, a change in X has an immediate effect on Y, as well as an ongoing
effect on future Y’s, since any change in Y associated with a change in X will affect
future values of Y via the lagged dependent variable.
(c) If there are no fixed effects in the model and no autocorrelation, then using OLS for a
model with a lagged dependent variable will produce unbiased coefficient estimates.
(d) If there are no fixed effects in the model and there is autocorrelation, the autocorrelation
must be purged from the data before unbiased estimates can be generated.
2. OLS estimates from models with both a lagged dependent variable and fixed effects are
biased.
(a) One alternative to OLS is to use an IV approach. This approach produces unbiased
estimates, but it’s complicated and yields imprecise estimates.
(b) OLS is useful to estimate a model with a lagged dependent variable and fixed effects.
• The bias is not severe and decreases as T, the number of observations for each unit,
increases.
• OLS in this context produces relatively accurate parameter estimates.
the error term might be correlated with the independent variable; this problem
continues with random effects models, which address correlation of errors across
observations but not correlation of errors and independent variables. Hence,
random effects models fail to take advantage of a major attraction of panel data,
which is that we can deal with the possible correlation of the unit-specific effects
that might cause spurious inferences regarding the independent variables.
The Hausman test is a statistical test that pits random against fixed effects
models. Once we understand this test, we can see why the bang-for-buck payoff
with random effects models is generally pretty low. In a Hausman test, we use
the same data to estimate both a fixed effects model and a random effects model.
Under the null hypothesis that the αi ’s are uncorrelated with the X variables, the β̂
estimates should be similar. Under the alternative, the estimates should be different
because the random effects estimates should be corrupted by the correlation of the
αi ’s with the X variables and the fixed effects estimates should not.
The decision rules for a Hausman test are the following:
• If fixed effects and random effects give us pretty much the same β̂, we fail
to reject the null hypothesis and can use random effects.
• If the two approaches provide different answers, we reject the null and
should use fixed effects.
Ultimately, we believe either the fixed effects estimate (when we reject the null
hypothesis of no correlation between αi and Xi ) or pretty much the fixed effects
answer (when we fail to reject the null hypothesis of no correlation between αi
and Xi ).4
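For a common single-coefficient version of the Hausman statistic, the test compares the squared gap between the two estimates with the difference in their variances and refers the result to a χ²₁ distribution. The Python sketch below uses entirely made-up numbers to illustrate the decision rule; it is not output from any model in the book.

```python
# Sketch of a one-coefficient Hausman statistic with hypothetical numbers:
# H = (b_FE - b_RE)^2 / (var_FE - var_RE), compared to the
# chi-squared(1) critical value of 3.84.
def hausman_stat(b_fe, b_re, var_fe, var_re):
    """var_FE exceeds var_RE because random effects is more efficient
    under the null of no correlation between the alphas and the X's."""
    return (b_fe - b_re) ** 2 / (var_fe - var_re)

H = hausman_stat(b_fe=1.40, b_re=0.90, var_fe=0.04, var_re=0.01)
verdict = "reject null: use fixed effects" if H > 3.84 \
          else "fail to reject: random effects OK"
print(H, verdict)
```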
If used appropriately, random effects have some advantages. When the αi
are uncorrelated with the Xi , random effects models will generally produce
smaller standard errors on coefficients than fixed effects models. In addition, as
T gets large, the differences between fixed and random effects decline; in many
real-world data sets, however, the differences can be substantial.
REMEMBER THIS
Random effects models do not estimate fixed effects for each unit, but rather adjust standard errors
and estimates to account for unit-specific elements of the error term.
1. Random effects models produce unbiased estimates of β1 only when the αi ’s are uncorrelated
with the X variables.
2. Fixed effects models are unbiased regardless of whether the αi ’s are uncorrelated with the X
variables, making fixed effects a more generally useful approach.
4. For more details on the Hausman test, see Wooldridge (2002, 288).
Conclusion
Serial dependence in panel data models is an important and complicated
challenge. There are two major approaches to dealing with it. One is to treat
the serial dependence as autocorrelated errors. In this case, we can test for
autocorrelation and, if necessary, purge it from the data by ρ-transforming
the data.
The other approach is to estimate a dynamic model that includes a lagged
dependent variable. Dynamic models are quite different from standard OLS
models. Among other things, each independent variable has a short- and a
long-term effect on Y.
Our approach to estimating a model with a lagged dependent variable hinges
on whether there is autocorrelation and whether we include fixed effects. If there is
no autocorrelation and we do not include fixed effects, the model is easy to estimate
via OLS and produces unbiased parameter estimates. If there is autocorrelation, the
correlation of error needs to be purged via standard ρ-transformation techniques.
If we include fixed effects in a model with a lagged dependent variable,
OLS will produce biased results. However, scholars have found that the bias is
relatively small and that OLS is likely to work better than alternatives such as IV
or bias-correction approaches.
We will have a good start on understanding advanced panel data analysis when
we can answer the following questions:
• Section 15.3: What are random effects models? When are they appropriate?
Further Reading
There is a large and complicated literature on accounting for time dependence
in panel data models. Beck and Katz (2011) is an excellent guide. Among other
things, these authors discuss how to conduct an LM test for AR(1) errors in a
model without fixed effects, the bias in models with autocorrelation and lagged
dependent variables, and the bias of fixed effects models with lagged dependent
variables.
There are many other excellent resources. Wooldridge (2002) is a valuable
reference for more advanced issues in analysis of panel data. An important article
by Achen (2000) pushes for caution in the use of lagged dependent variables.
Wawro (2002) provides a nice overview of Arellano and Bond (1991) methods.
Another approach to dealing with bias in dynamic models with fixed effects
is to correct for bias directly, as suggested by Kiviet (1995). This procedure works
reasonably well in simulations, but it is quite complicated.
Key Term
Random effects model (524)
Computing Corner
Stata
1. It can be useful to figure out which variables vary within unit, as this will
determine if the variable can be included in a fixed effects model. Use
tabulate unit, summarize(X1)
which will show descriptive statistics for X1 grouped by the variable called
unit. If the standard deviation of X1 is zero for all units, there is no
within-unit variation, and for the reasons discussed in Section 8.3, this
variable cannot be included in a fixed effects model.
5. If we want to know the αi + εit portion of the error term, we type
predict ResidAE, ue
Note that Stata uses the letter u to refer to the fixed effect we denote with α in our notation.
R
1. It can be useful to figure out which variables vary within unit, as this
will determine if the variable can be included in a fixed effects model.
Use tapply(X1, unit, sd), which will show the standard deviation of
the variable X1 grouped by the variable called unit. If the standard deviation
of X1 is zero for all units, there is no within-unit variation, and for the
reasons discussed in Section 8.3, this variable cannot be included in a fixed
effects model.
(a) Make sure your data is listed by stacking units—that is, the
observations for unit 1 are first—and then ordered by time period.
Below the lines for unit 1 are observations for unit 2, ordered by
time period, and so on.6
(c) Create lag variables, one for the unit identifier and one for residuals.
Then set all the lag residuals to missing for the first observation for
each unit.
LagID = c(NA, ID[1:(length(ID)-1)])
LagResid = c(NA, Resid[1:(length(Resid)-1)])
LagResid[LagID != ID] = NA
(d) Use the variables from part (c) to estimate the model from
Chapter 13.
RhoHat = lm(Resid ~ LagResid)
6. It is possible to stack data by year. The way we’d create lagged variables would be different, though.
Exercises
1. Use the data in olympics_HW.dta on medals in the Winter Olympics from
1980 to 2014 to answer the following questions. Table 15.1 describes the
variables.
time: A time variable equal to 1 for first Olympics in data set (1980), 2 for second Olympics (1984), and so forth; useful for time series analysis.
unit for fixed effects.7 Briefly discuss the results, and explain what
is going on with the coefficients on temperature and elevation.
(b) Estimate a two-way fixed effects model with population, GDP, and
host country as independent variables. Use country and time as the
fixed effects. Explain any differences from the results in part (a).
(c) Estimate ρ̂ for the two-way fixed effects model. Is there evidence of
autocorrelation? What are the implications of your finding?
(d) Estimate a two-way fixed effects model that has population, GDP,
and host country as independent variables and accounts for autocor-
relation. Discuss any differences from results in part (b). Which is a
better statistical model? Why?
(h) Section 15.2 discusses potential bias when a fixed effects model
includes a lagged dependent variable. What is an important deter-
minant of this bias? Assess this factor for this data set.
(i) Use the concepts presented at the end of Section 13.4 to discuss
whether it is better to approach the analysis in an autocorrelation
or a lagged dependent variable framework.
(j) Use the concept of model robustness from Section 2.2 to discuss
which results are robust and which are not.
2. Answer the following questions using the Winter Olympics data described
in Table 15.1 that can be found in olympics_HW.dta
7. For simplicity, use the de-meaned approach, implemented with the xtreg command in Stata and the plm command in R.
(c) Estimate a random effects model with the same variables as in part
(b). Briefly explain the results, noting in particular what happens to
variables that have no within-unit variation.
(d) What is necessary to avoid bias for a random effects model? Do you
think this condition is satisfied in this case? Why or why not?
16 Conclusion: How to Be an Econometric Realist
The problem with this approach is that there really is no alternative to statistics and
econometrics. As baseball analyst Bill James says, the alternative to statistics is
not “no statistics.” The alternative to statistics is bad statistics. Anyone who makes
any empirical argument about the world is making a statistical argument. It might
be based on vague data that is not systematically analyzed, but that’s what people
do when they judge from experience or intuition. Hence, despite the inability of
statistics and econometrics to answer all questions or be above manipulation, a
serious effort to understand the world will involve some econometric reasoning.
A better approach is realism about econometrics. After all, in the right hands,
even chain saws are awesome. If we learn how to use the tool properly, realizing
what it can and can’t do, we can make a lot of progress.
An econometric realist is committed to robust and thoughtful evaluation of
theories. Five behaviors characterize this approach.
First, an econometric realist prioritizes. A model that explains everything is
impossible. We must simplify. And if we’re going to simplify the world, let’s do
it usefully. Statistician George Box (1976, 792) made this point wonderfully:
Since all models are wrong the scientist must be alert to what is
importantly wrong. It is inappropriate to be concerned about mice
when there are tigers abroad.
– All too often, a given theoretical claim is tested with the very data that
suggested the result. That’s not much to go on; a random or spurious
relationship in one data set does not a full-blown theory make. Hence,
we should be cautious about claims until they have been observed across
multiple contexts. With that requirement met, it is less likely that the
result is due to chance or to an analyst’s having leaned on the data to get
a desired result.
– If results are not observed across multiple contexts, are there contextual
differences? Perhaps the real finding would lie in explaining why a
relationship exists in one context and not in others.
– If other results are different, can we explain why the other results are
wrong? It is emphatically not the case that we should interpret two
competing statistical results as a draw. One result could be based on a
mistake. If that’s true, explain why (nicely, of course). If we can’t explain
why one approach is better, though, and we are left with conflicting
results, we need to be cautious about believing we have identified a real
relationship.
• Specificity: Are the patterns in the data consistent with the specific claim?
Each theory should be mined for as many specific claims as possible, not
only about direct effects but also about indirect effects and mechanisms. As
important, the theory should be mined for claims about when we won’t see
the relationship. This line of thinking allows us to conduct placebo tests in
which we should see null results. In other words, the relationship should be
observable everywhere we expect it and nowhere we don’t.
• Plausibility: Given what we know about the world, does the result make
sense? Sometimes results are implausible on their face: if someone found
that eating french fries led to weight loss, we should probably ask some
probing questions before supersizing. That doesn’t mean we should treat
implausible results as wrong. After all, the idea that the earth revolves
around the sun was pretty implausible before Copernicus. Implausible
results that happen to be true just need more evidence to overcome the
implausibility.
grammar for good analysis. It is not the story. No one reads a book and says, “Great
grammar!” A terrible book might have bad grammar, but a good book needs more
than good grammar. The material we covered in this book provides the grammar
for making convincing claims about the way the world works. The rest is up to
you. Think hard, be creative, take chances. Good luck.
Further Reading
In his 80-page paean to statistical realism, Achen (1982, 78) puts it this way: “The
uninitiated are often tempted to trust every statistical study or none. It is the task
of empirical social scientists to be wiser.” Achen followed this publication in 2002
with an often-quoted article arguing for keeping models simple.
The criteria for evaluating research discussed here are strongly influenced by
the Bradford-Hill criteria from Bradford-Hill (1965). Nevin (2013) assesses the
Bradford-Hill criteria for the theory that lead in gasoline was responsible for the
1980s crime surge in the United States (and elsewhere).
APPENDICES:
MATH AND PROBABILITY
BACKGROUND
A. Summation
• Σ_{i=1}^{N} Xi = X1 + X2 + X3 + · · · + XN
• If a variable in the summation does not have a subscript, it can be “pulled out” of
the summation. For example,
Σ_{i=1}^{N} βXi = βX1 + βX2 + βX3 + · · · + βXN
= β(X1 + X2 + X3 + · · · + XN )
= β Σ_{i=1}^{N} Xi
B. Expectation
• Expectation is the value we expect a random variable to have. The expectation is
basically the average of the random variable if we could sample from the variable’s
distribution a huge (infinite, really) number of times.
• For example, the expected value of a six-sided die is 3.5. If we roll a
die a huge number of times, we’d expect each side to come up an equal proportion
of times, so the expected average will equal the average of 1, 2, 3, 4, 5, and
6. More formally, the expected value will be Σ_{i=1}^{6} p(Xi )Xi , where Xi is 1, 2, 3, 4, 5, and
6 and p(Xi ) is the probability of each outcome, which in this example is 1/6 for each
value.
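The die calculation can be written out directly; this is just the formula from the bullet above, computed in Python for illustration:

```python
# Expected value of a fair six-sided die: sum of p(x) * x over all outcomes.
outcomes = [1, 2, 3, 4, 5, 6]
p = 1 / 6  # each outcome is equally likely
expected = sum(p * x for x in outcomes)
print(expected)  # matches the 3.5 in the text (up to floating-point rounding)
```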
C. Variance
The variance of a random variable is a measure of how spread out the distribution
is. In a large sample, the variance can be estimated as

var(X) = (1/N) Σ_{i=1}^{N} (Xi − X̄)²
Here are some useful properties of variance (the “variance facts” cited in
Chapter 14):
var(kε) = k² var(ε)
= k² σ²
3. When random variables are correlated, the variance of a sum (or differ-
ence) of random variables depends on the variances and covariance of the
variables. Letting ε and τ be random variables:

var(ε + τ ) = var(ε) + var(τ ) + 2cov(ε, τ )
var(ε − τ ) = var(ε) + var(τ ) − 2cov(ε, τ )
D. Covariance
• Covariance measures how much two random variables vary together. In large
samples, the covariance of two variables is
cov(X1 , X2 ) = [Σ_{i=1}^{N} (X1i − X̄1 )(X2i − X̄2 )] / N (A.1)
• As with variance, several useful properties apply when we are dealing with
covariance:
E. Correlation
The equation for correlation is

corr(X, Y) = cov(X, Y) / (σX σY )
where σX is the standard deviation of X and σY is the standard deviation of Y.
If X = Y for all observations, cov(X, Y) = cov(X, X) = var(X) and σX = σY ,
implying that the denominator will be σX², which is the variance of X. These
calculations therefore imply that the correlation for X = Y will be +1, which is
the upper bound for correlations.1 For perfect negative correlation, X = −Y and
correlation is −1.
The equation for covariance (Equation A.1) looks a bit like the equation for
the slope coefficient in bivariate regression on page 49 in Chapter 3. The bivariate
regression coefficient is simply a restandardized correlation:

β̂1^{BivariateOLS} = corr(X, Y) × (σY / σX )
1. We also get perfect correlation if the variables are identical once normalized. That is, X and Y are perfectly correlated if X = 10Y or if X = 5 + 3Y, and so forth. In these cases, (Xi − X̄)/σX = (Yi − Ȳ)/σY for all observations.
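The identities in sections C through E can be verified numerically. The Python sketch below (illustrative only; the data points are made up) uses population (divide-by-N) formulas and confirms that corr(X, X) = 1 and that the bivariate OLS slope equals corr(X, Y) × σY/σX.

```python
# Numeric check of the variance/covariance/correlation identities, using
# population (divide-by-N) formulas and hypothetical data.
def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def sd(a):
    return cov(a, a) ** 0.5  # standard deviation = sqrt of variance

def corr(a, b):
    return cov(a, b) / (sd(a) * sd(b))

X = [1.0, 2.0, 4.0, 7.0]   # made-up data
Y = [2.0, 3.0, 9.0, 12.0]
slope = cov(X, Y) / cov(X, X)  # bivariate OLS slope = cov(X, Y) / var(X)
print(corr(X, X), slope, corr(X, Y) * sd(Y) / sd(X))
```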
would exceed 1 because there are always more possible values very near to any
given value. Instead, we need to think in terms of probabilities that the random
variable is in some (possibly small) region of values. Hence, we need the tools
from calculus to calculate probabilities from a PDF.
Figure A.1 shows the PDF for an example of a random variable. Although
we cannot use the PDF to simply calculate the probability the random variable
equals, say, 1.5, it is possible to calculate the probability that the random
variable is between 1.5 and any other value. The figure highlights the area
under the PDF curve between 1.5 and 1.8. This area corresponds to the
probability this random variable is between 1.5 and 1.8. In Appendix G, we
show example calculations of such probabilities based on PDFs from the normal
distribution.2
[FIGURE A.1: The PDF of an example random variable, with the area under the curve between 1.5 and 1.8 shaded. Axes: value of x (horizontal) and probability density (vertical, 0 to 0.75).]
2 More formally, we can indicate a PDF as a function, f(x), that is greater than zero for all values of x. Because the total area under the curve defined by the PDF equals one, we know that ∫_{−∞}^{∞} f(x)dx = 1. The probability that the random variable x is between a and b is ∫_a^b f(x)dx = F(b) − F(a), where F() is the integral of f().
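The footnote's integral can be illustrated numerically. Since the figure's PDF is not specified, the sketch below uses the standard normal density as a stand-in and checks a midpoint Riemann sum over [1.5, 1.8] against the exact CDF difference F(b) − F(a):

```python
import math

def normal_pdf(x):
    """Standard normal density f(x)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def normal_cdf(x):
    """Standard normal CDF F(x), via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

a, b, steps = 1.5, 1.8, 10_000
width = (b - a) / steps

# Riemann sum: add up f(x) * width over many thin slices of [a, b].
area = sum(normal_pdf(a + (i + 0.5) * width) * width for i in range(steps))

exact = normal_cdf(b) - normal_cdf(a)   # F(b) - F(a)
print(round(area, 4), round(exact, 4))  # both about 0.0309
```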
G. Normal Distributions
standard normal distribution: A normal distribution with a mean of zero and a variance (and standard error) of one.

We work a lot with the standard normal distribution. (Only to us stats geeks does "standard normal" not seem repetitive.) A normal distribution is a specific (and famous) type of PDF, and a standard normal distribution is a normal distribution with a mean of zero and a variance of one. The standard deviation of a standard normal distribution is also one, because the standard deviation is the square root of the variance.
One important use of the standard normal distribution is to calculate
probabilities of observing standard normal random variables that are less than or
equal to some number. We denote by Φ(Z) = Pr(X < Z) the probability that a standard normal random variable X is less than Z. This is known as the
cumulative distribution function (CDF) because it indicates the probability of
seeing a random variable less than some value. It simply expresses the area under
a PDF curve to the left of some value.
Figure A.2 shows four examples of the use of the CDF for standard normal PDFs.

[FIGURE A.2: Probabilities that a Standard Normal Random Variable Is Less than Some Value. Four panels of the standard normal PDF with shaded left-tail areas: (a) Φ(0) = Pr(X < 0) = 0.500; (b) Φ(−2) = Pr(X < −2) = 0.023; (c) and (d) shade the areas to the left of 1.96 and 1, respectively.]

Panel (a) shows Φ(0), which is the probability that a standard normal
random variable will be less than zero. It is the area under the PDF to the left of zero. We can see that it is half the total area: the area to the left of zero is 0.500, so the probability of observing a value of a standard normal random variable that is less than zero is 0.500. Panel (b) shows Φ(−2), which is
the probability that a standard normal random variable will be less than –2. It is
the proportion of the total area that is left of –2, which is 0.023. Panel (c) shows
Φ(1.96), which is the probability that a standard normal random variable will be
less than 1.96. It is 0.975. Panel (d) shows Φ(1), which is the probability that a
standard normal random variable will be less than 1. It is 0.841.
We can also use our knowledge of the standard normal distribution to calculate
the probability that β̂1 is greater than some value. The trick here is to recall that
if the probability of something happening is P, then the probability of its not
happening is 1 − P. This property tells us that if there is a 15 percent chance of
rain, then there is an 85 percent probability of no rain.
To calculate the probability that a standard normal variable is greater than
some value Z, use 1 − Φ(Z). Figure A.3 shows four examples.

[FIGURE A.3: Probabilities that a Standard Normal Random Variable Is Greater than Some Value. Four panels of the standard normal PDF with shaded right-tail areas: (a) 1 − Φ(0) = Pr(X > 0) = 0.500; (b) 1 − Φ(−2) = Pr(X > −2) = 0.977; (c) 1 − Φ(1.96) = Pr(X > 1.96) = 0.025; (d) 1 − Φ(1.0) = Pr(X > 1.0) = 0.159.]

Panel (a) shows
1 − Φ(0), which is the probability that a standard normal random variable will be
greater than zero. This probability is 0.500. Panel (b) highlights 1 − Φ(−2), which
is the probability that a standard normal random variable will be greater than –2. It
is 0.977. Panel (c) shows 1 − Φ(1.96), which is the probability that a standard normal random variable will be greater than 1.96. It is 0.025. Panel (d) shows 1 − Φ(1), which is the probability that a standard normal random variable will be greater than 1. It is 0.159.
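The probabilities above can be computed directly rather than read off a figure. A minimal Python sketch using the identity Φ(z) = (1 + erf(z/√2))/2, which involves only the standard library:

```python
import math

def phi(z):
    """Standard normal CDF: Pr(X < z) for standard normal X."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Left-tail probabilities (Figure A.2).
print(round(phi(0), 3))      # 0.5
print(round(phi(-2), 3))     # 0.023
print(round(phi(1.96), 3))   # 0.975
print(round(phi(1), 3))      # 0.841

# Right-tail probabilities (Figure A.3): Pr(X > z) = 1 - phi(z).
print(round(1 - phi(-2), 3))    # 0.977
print(round(1 - phi(1.96), 3))  # 0.025
print(round(1 - phi(1), 3))     # 0.159
```

With scipy installed, scipy.stats.norm.cdf does the same job; the erf identity keeps the sketch dependency-free.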
Figure A.4 shows some key information about the standard normal distri-
bution. In the left-hand column of the figure’s table are some numbers, and in
the right-hand column are the corresponding probabilities that a standard normal
random variable will be less than the respective numbers. There is, for example, a
0.010 probability that a standard normal random variable will be less than –2.32.
[FIGURE A.4: Key values of the standard normal distribution. The table pairs SD, the number of standard deviations above or below the mean β1, with the probability that β̂1 ≤ SD:

SD          Probability β̂1 ≤ SD
−3.00       0.001
−2.58       0.005
−2.32       0.010
−2.00       0.023
−1.96       0.025
−1.64       0.050
…
0.00        0.500
…
1.96        0.975

Panel (a) shades the area of the standard normal PDF to the left of −2.32, which corresponds to the 0.010 probability in the table.]
We can see this graphically in panel (a). In the top bell-shaped curve, the portion
that is to the left of –2.32 is shaded. It is about one percent.
Because the standard deviation of a standard normal is 1, all the numbers in
the left-hand column can be considered as the number of standard deviations above
or below the mean. That is, the number −1 refers to a point that is a single standard
deviation below the mean, and the number +3 refers to a point that is 3 standard
deviations above the mean.
The third row of the table in Figure A.4 shows there is a probability of 0.010
that we’ll observe a value less than –2.32 standard deviations below the mean.
Going down to the shaded row SD = 0.00, we see that if β̂1 is standard normally
distributed, it has a 0.500 probability of being below zero. This probability is
intuitive: the normal distribution is symmetric, and we have the same chance of
seeing something above its mean as below it. Panel (b) shows this graphically.
In the last shaded row, where SD = 1.96, we see that there is a 0.975
probability that a standard normal random variable will be less than 1.96. Panel
(c) in Figure A.4 shows this graphically, with 97.5 percent of the standard
normal distribution shaded. We see this value a lot in statistics because twice
the probability of being greater than 1.96 is 0.05, which is a commonly used
significance level for hypothesis testing.
We can convert any normally distributed random variable to a standard
normally distributed random variable. This process, known as standardizing
values, is pretty easy. This trick is valuable because it allows us to use the intuition
and content of Figure A.4 to work with any normal distribution, whatever its mean
and standard deviation.
For example, suppose we have a normal random variable with a mean of 10
and a standard deviation of 1 and we want to know the probability of observing
a value less than 8. From common sense, we realize that in this case 8 is 2
standard deviations below the mean. Hence, we can use Figure A.4 to see that
the probability of observing a value less than 8 from a normal distribution with
mean 10 and standard deviation of 1 is 0.023; accordingly, the fourth row of the
table shows that the probability a standard normal random variable is less than –2
is 0.023.
How did we get there? First, subtract the mean from the value in question
to see how far it is from the mean. Then divide this quantity by the standard
deviation to calculate how many standard deviations away from the mean it is.
More generally, for any given number B drawn from a distribution with mean β1
and standard deviation se( β̂1 ), we can calculate the number of standard deviations
B is away from the mean via the following equation:
Standard deviations from mean = (B − β1) / se(β̂1)    (A.2)
Notice in Equation A.2 that the β1 has no hat but se( β̂1 ) does. Seems odd,
doesn’t it? There is a logic to it, though. We’ll be working a lot with hypothetical
values of β1 , asking, for example, what the probability β̂1 is greater than some
number would be if the true β1 were zero. But since we’ll want to work with the
precision implied by our actual data, we’ll use se( β̂1 ).3
To get comfortable with converting the distribution of β̂1 to the standard
normal distribution, consider the examples in Table A.1. In the first example
(the first two rows), β1 is 0 and the standard error of β̂1 is 3. Recall that the
standard error of β̂1 measures the width of the β̂1 distribution. In this case, 3
is one standard deviation above the mean, and 1 is 0.33 standard deviation above
the mean.
In the third and fourth rows of Table A.1, β1 = 4 and the standard deviation is
3. In this case, 7 is one standard deviation above the mean, and 1 is one standard
deviation below the mean. In the bottom portion of the table (the last two rows), β1
is 8 and the standard deviation of β̂1 is 2. In this case, 6 is one standard deviation
below the mean, and 1 is 3.5 standard deviations below the mean.
To calculate Φ(Z), we use a table such as the one in Figure A.4 or, more
likely, computer software as discussed in the Computing Corner at the end of the
appendices.
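Putting the two steps together, a minimal Python sketch of standardize-then-Φ (using erf in place of the table) reproduces the mean-10, standard-deviation-1 example:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def prob_below(value, mean, sd):
    """Pr(X < value) for X ~ Normal(mean, sd): standardize, then apply phi."""
    z = (value - mean) / sd  # number of standard deviations from the mean
    return phi(z)

# The worked example: Normal(mean=10, sd=1), probability of a value below 8.
print(round(prob_below(8, mean=10, sd=1), 3))  # 0.023, matching the -2 row
```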
3
Another thing that can be hard to get used to is the mixing of standard deviation and standard error.
Standard deviation measures the variability of a distribution, and in the case of the distribution of β̂1 ,
its standard deviation is the se( β̂1 ). The distinction between standard deviation and standard error
seems larger when calculating the mean of a variable. The standard deviation of X indicates the
variability of X, while the standard error of a sample mean indicates the variability of the estimate of
the mean. The standard error of the mean depends on the sample size while the standard deviation of
X is only a measure of the variability of X. Happily, this distinction tends not to be a problem in
regression.
REMEMBER THIS
1. A standard normal distribution is a normal distribution with a mean of zero and a standard
deviation of one.
(a) Any normally distributed random variable can be converted to a variable distributed
according to a standard normal distribution.
(b) If β̂1 is distributed normally with mean β1 and standard deviation se(β̂1), then (β̂1 − β1)/se(β̂1) will be distributed as a standard normal random variable.
(c) Converting random variables to standard normal random variables allows us to use
standard normal tables to discuss any normal distribution.
2. To calculate the probability β̂1 ≤ B, where B is any number of interest, do the following:
(a) Convert B to the number of standard deviations above or below the mean using (B − β1)/se(β̂1).
(b) Use the table in Figure A.4 or software to calculate the probability that β̂1 is less than B
in standardized terms.
3. To calculate the probability that β̂1 > B, use the fact that the probability β̂1 is greater than B is
1 minus the probability that β̂1 is less than or equal to B.
Review Questions
1. What is the probability that a standard normal random variable is less than or equal
to 1.64?
2. What is the probability that a standard normal random variable is less than or equal
to –1.28?
3. What is the probability that a standard normal random variable is greater than 1.28?
4. What is the probability that a normal random variable with a mean of zero and a standard
deviation of 2 is less than –4?
5. What is the probability that a normal random variable with a mean of zero and a variance of 9
is less than –3?
6. Approximately what is the probability that a normal random variable with a mean of 7.2 and a
variance of 4 is less than 9?
The χ² distribution

χ² distribution: A probability distribution that characterizes the distribution of squared standard normal random variables.

The χ² distribution (pronounced "kai-squared") describes the distribution of squared normal variables. The distribution of a squared standard normal random variable is a χ² distribution with one degree of freedom. The sum of n independent squared standard normal random variables is distributed according to a χ² distribution with n degrees of freedom.
The χ² distribution arises in many different statistical contexts. We'll show that it is a component of the all-important t distribution. The χ² distribution also arises when we conduct likelihood ratio tests for maximum likelihood estimation models.

The shape of the χ² distribution varies according to the degrees of freedom. Figure A.5 shows two examples of χ² distributions. Panel (a) shows a χ² distribution with 2 degrees of freedom. We have highlighted the most extreme 5 percent of the distribution, which demonstrates that the critical value from a χ²(2) distribution is roughly 6. Panel (b) shows a χ² distribution with 4 degrees of freedom. The critical value from a χ²(4) distribution is around 9.5.
The Computing Corner in Chapter 12 (pages 446 and 448) shows how to identify critical values from a χ² distribution. Software will often, but not always, provide critical values for us automatically.
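The critical values quoted for Figure A.5 can be recovered without a table. For even degrees of freedom the χ² CDF has a closed form, so a short bisection finds the 95th percentile; a minimal Python sketch:

```python
import math

# Chi-squared CDFs have closed forms for even degrees of freedom:
#   df = 2: F(x) = 1 - exp(-x/2)
#   df = 4: F(x) = 1 - exp(-x/2) * (1 + x/2)
def chi2_cdf_df2(x):
    return 1 - math.exp(-x / 2)

def chi2_cdf_df4(x):
    return 1 - math.exp(-x / 2) * (1 + x / 2)

def critical_value(cdf, p=0.95, lo=0.0, hi=50.0):
    """Bisection: find x with cdf(x) = p."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(critical_value(chi2_cdf_df2), 2))  # 5.99 -- "roughly 6" in the text
print(round(critical_value(chi2_cdf_df4), 2))  # 9.49 -- "around 9.5"
```

For odd degrees of freedom there is no elementary closed form; statistical software (or scipy.stats.chi2.ppf) is the practical route.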
The t distribution
The t distribution characterizes the distribution of the ratio of a normal random variable and the square root of a χ² random variable divided by its degrees of freedom. While such a ratio may seem to be a pretty obscure combination of things to worry about, we've seen in Section 4.2 that the t distribution is incredibly useful.

We know that our OLS coefficients (among other estimators) are normally distributed. We also know (although we talk about this less) that the estimates of the standard errors are distributed according to a χ² distribution. Since we need to standardize our OLS coefficients by dividing by our standard error estimates, we want to know the distribution of the ratio of the coefficient divided by the standard error.

Formally, if z is a standard normal random variable and x is a χ² variable with n degrees of freedom, the following represents a t distribution with n degrees of freedom:

t(n) = z / √(x/n)
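This construction is easy to simulate. A sketch (simulation-based and seeded, so the tail shares are approximate) that builds t(5) draws as z/√(x/5) and compares tail behavior with the standard normal:

```python
import random

random.seed(42)

def chi2_draw(df):
    """A chi-squared draw: the sum of df squared standard normal draws."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(df))

def t_draw(df):
    """A t(df) draw: z / sqrt(x / df), with z standard normal, x chi-squared."""
    z = random.gauss(0, 1)
    x = chi2_draw(df)
    return z / (x / df) ** 0.5

draws = 100_000
t_tail = sum(abs(t_draw(5)) > 1.96 for _ in range(draws)) / draws
z_tail = sum(abs(random.gauss(0, 1)) > 1.96 for _ in range(draws)) / draws

# The t(5) distribution has fatter tails than the standard normal: roughly
# 11 percent of t(5) draws exceed 1.96 in absolute value, versus about
# 5 percent of standard normal draws.
print(t_tail, z_tail)
```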
[FIGURE A.5: Two χ² distributions. Panel (a): the χ²(2) distribution over values 0 to 10, with its most extreme 5 percent shaded. Panel (b): the χ²(4) distribution, shaded beyond the 9.49 tick mark. Axes: value of x (horizontal) and probability density (vertical).]
The F distribution
F distribution: A probability distribution that characterizes the distribution of a ratio of two χ² random variables.

The F distribution characterizes the distribution of a ratio of two χ² random variables divided by their degrees of freedom. The distribution is named in honor of legendary statistician R. A. Fisher.

Formally, if x1 and x2 are independent χ² random variables with n1 and n2 degrees of freedom, respectively, the following represents an F distribution with n1 and n2 degrees of freedom:

F(n1, n2) = (x1/n1) / (x2/n2)
Since χ² variables are positive, a ratio of two of them must be positive as well, meaning that random variables following F distributions are greater than or equal to zero.
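The construction can be simulated from first principles. A sketch (simulation-based, so the mean is approximate) that builds F(9, 10) draws from sums of squared standard normal draws, confirms every draw is nonnegative, and checks the mean against the known value n2/(n2 − 2):

```python
import random

random.seed(0)

def chi2_draw(df):
    """Sum of df squared standard normal draws."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(df))

def f_draw(n1, n2):
    """An F(n1, n2) draw: the ratio of two scaled chi-squared draws."""
    return (chi2_draw(n1) / n1) / (chi2_draw(n2) / n2)

draws = [f_draw(9, 10) for _ in range(50_000)]

# Every draw is nonnegative, since it is a ratio of sums of squares.
print(min(draws) >= 0)  # True

# For n2 > 2 the mean of an F(n1, n2) variable is n2 / (n2 - 2); here 10/8 = 1.25.
print(round(sum(draws) / len(draws), 2))
```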
An interesting feature of the F distribution is that the square of a t-distributed variable with n degrees of freedom follows an F(1, n) distribution. To see this, note that a t-distributed variable is a normal random variable divided by the square root of a χ² random variable. Squaring the t-distributed variable gives us a squared normal in the numerator, which is χ², and a χ² in the denominator. In other words, this gives us the ratio of two χ² random variables, which follows an F distribution. We used this fact when noting on page 312 that in certain cases we can square a t statistic to produce an F statistic that can be compared to a rule of thumb about F statistics in the first stage of 2SLS analyses.
We use the F distribution when doing F tests, which, among other things, allow us to test hypotheses involving multiple parameters. We discussed F tests
in Section 5.6.
The F distribution depends on two degrees of freedom parameters. In the F
test examples, the degrees of freedom for the test statistic depend on the number
of restrictions on the parameters and the sample size. The order of the degrees of
freedom is important and is explained in our discussion of F tests.
The F distribution does not have an easily identifiable shape like the normal
and t distributions. Instead, its shape changes rather dramatically, depending on
the degrees of freedom. Figure A.6 plots four examples of F distributions, each
with different degrees of freedom. For each panel we highlight the extreme 5
percent of the distribution, providing a sense of the values necessary to reject the
null hypotheses for each case. Panel (a) shows an F distribution with degrees of
freedom equal to 3 and 2,000. This would be the distribution of an F statistic if
we were testing a null hypothesis that β1 = β2 = β3 = 0 based on a data set with
2,010 observations and 10 parameters to be estimated. The critical value is 2.61,
meaning that an F test statistic greater than 2.61 would lead us to reject the null
hypothesis. Panel (b) displays an F distribution with degrees of freedom equal to
18 and 300, and so on.
The Computing Corner in Chapter 5 on pages 170 and 172 shows how to
identify critical values from an F distribution. Often, but not always, software will
automatically provide critical values.
I. Sampling
Section 3.2 discussed two sources of variation in our estimates: sampling random-
ness and modeled randomness. Here we elaborate on sampling randomness.
[FIGURE A.6: Four F distributions, each with its most extreme 5 percent shaded. Panel (a): F(3, 2,000), shaded beyond 2.61. Panel (b): F(18, 300), shaded beyond 1.64. Panel (c): F(2, 100), shaded beyond 3.09. Panel (d): F(9, 10), shaded beyond 3.02. Axes: value of x (horizontal) and probability density (vertical).]
Imagine that we are trying to figure out some feature of a given population.
For example, suppose we are trying to ascertain the average age of everyone in
the world at a given time. If we had (accurate) data from every single person,
we’d be done. Obviously, that’s not going to happen, so we take a random sample.
Since this random sample will not contain every single person, the average age of
people from it probably will not exactly match the population average. And if we
were to take another random sample, it’s likely that we’d get a different average
because we’d have different people in our sample. Maybe the first time our sample
contained more babies than usual, and the second time we got the world’s oldest
living person.
The genius of the sampling perspective is that we can characterize the degree
of randomness we should observe in our random sample. The variation will depend
on the sample size we observe and on the underlying variation in the population.
A useful exercise is to take some population, say the students in your
econometrics class, and gather information about every person in the population
for some variable. Then, if we draw random samples from this population, we will
see that the mean of the variable in the sampled group will bounce around for each
random sample we draw. The amazing thing about statistics is that we will be able
to say certain things about the mean of the averages we get across the random
samples and the variance of the averages. If the sample size is large, we will be
able to approximate the distribution of these averages with a normal distribution
having a variance we can calculate based on the sample size and the underlying
variance in the overall population.
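The classroom exercise described above is easy to simulate. A sketch (with an artificial population of ages 0 through 79; the seed and sizes are arbitrary) showing that sample means bounce around the population mean with variance close to the population variance divided by the sample size:

```python
import random

random.seed(1)

# An artificial "population": one person at each age 0 through 79.
population = list(range(80))
pop_mean = sum(population) / len(population)  # 39.5
pop_var = sum((a - pop_mean) ** 2 for a in population) / len(population)

# Draw many random samples and record each sample's mean.
n, reps = 25, 2000
sample_means = [sum(random.choices(population, k=n)) / n for _ in range(reps)]

mean_of_means = sum(sample_means) / reps
var_of_means = sum((m - mean_of_means) ** 2 for m in sample_means) / reps

# The sample means center on the population mean, with variance close to
# pop_var / n (here 533.25 / 25 = 21.33).
print(round(mean_of_means, 1), round(var_of_means, 1))
```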
This logic applies to regression coefficients as well. Hence, if we want to know
the relationship between age and wealth in the whole world, we can draw a random
sample and know that we will have variation related to the fact that we observe
only a subset of the target population. And recall from Section 6.1 that OLS easily
estimates means and difference of means, so even our average-age example works
in an OLS context.
It may be tempting to think of statistical analysis only in terms of sampling
variation, but this is not very practical. First, it is not uncommon to observe an
entire population. For example, if we want to know the relationship between
education and wages in European countries from 2000 to 2014, we could probably
come up with data for each country and year in our target population. And yet, we
would be naive to believe that there is no uncertainty in our estimates. Hence,
there is almost always another source of randomness, something we referred to as
modeled randomness in Section 3.2.
Second, the sampling paradigm requires that the samples from the underlying
target population be random. If the sampling is not random, the type of
observations that make their way into our analysis may systematically differ from
the people or units that we do not observe, thus causing us to risk introducing
endogeneity. A classic example is observing the wages of women who work, but
this subsample is unlikely to be a random sample from all women. The women who
work are likely more ambitious, more financially dependent on working, or both.
Even public opinion polling data, a presumed bastion of random sampling,
seldom provides random samples from underlying populations. Commercial
polls often have response rates of less than 20 percent, and even academic
surveys struggle to get response rates near 50 percent. It is reasonable to
believe that the people who respond differ in economic, social, and personality traits, and thus simply attributing variation to sampling variation may be
problematic.
So even though sampling variation is incredibly useful as an idealized source
of randomness in our coefficient estimates, we should not limit ourselves to
Further Reading
Rice (2007) is an excellent guide to probability theory as used in statistical
analysis.
Key Terms
F distribution (550)
Probability density function (541)
Standard normal distribution (543)
χ² distribution (549)
Computing Corner
Excel
Sometimes Excel offers the quickest way to calculate quantities of interest related
to the normal distribution.
• There are several ways to find the probability a standard normal is less than
some value.
1. Use the NORMSDIST function, which assumes a standard normal and takes only the value of interest: =NORMSDIST(2) returns the probability that a standard normal variable is less than 2.
2. Use the NORMDIST function and indicate the mean and the standard deviation, which for a standard normal are 0 and 1, respectively. Use a 1 after the last comma to produce the cumulative probability, which is the percent of the distribution to the left of the number indicated: =NORMDIST(2, 0, 1, 1).
• For a non-standard normal variable, use the NORMDIST function and indicate
the mean and the standard deviation. For example, if the mean is 9 and the
standard deviation is 3.2, the probability that this distribution will yield a
random variable less than 7 is =NORMDIST(7, 9, 3.2, 1).
Stata

• To calculate the probability that a standard normal is less than some value in Stata, use the normal command. For example, display normal(2) will return the probability that a standard normal variable is less than 2.

R

• To calculate the probability that a standard normal is less than some value in R, use the pnorm command. For example, pnorm(2, mean = 0, sd = 1) (or simply pnorm(2), since those are the defaults) will return the probability that a standard normal variable is less than 2.
Chapter 1
• Page 3 Gary Burtless (1995, 65) provides the initial motivation for this
example—he used Twinkies.
Chapter 3
• Page 45 Sides and Vavreck (2013) provide a great look at how theory can help cut
through some of the overly dramatic pundit-speak on elections.
• Page 57 For a discussion of the central limit theorem and its connection to the
normality of OLS coefficient estimates, see, for example, Lumley et al. (2002).
They note that for errors that are themselves nearly normal or do not have severe
outliers, 80 or so observations are usually enough.
• Page 67 Stock and Watson (2011, 674) present examples of estimators that
highlight the differences between bias and inconsistency. The estimators are silly,
but they make the authors’ point.
– Suppose we tried to estimate the mean of a variable with the first observation
in a sample. This will be unbiased because in expectation it will be equal to
the average of the population. Recall that expectation can be thought of as the
average value we would get for an estimator if we ran an experiment over and
over again. This estimator will not be consistent, though, because no matter
how many observations we have, we’re using only the first observation,
which means that the variance of the estimator will not get smaller as the
sample size gets very large. So yes, no one in their right mind would use this
estimator; it is unbiased but nonetheless inconsistent.
CITATIONS AND ADDITIONAL NOTES 557
– Suppose we tried to estimate the mean of a variable with the sample mean plus 1/N. This will be biased because the expectation of this estimator will be the population average plus 1/N. However, this estimator will be consistent because the variance of a sample mean goes down as the sample size increases, and the 1/N bit will go to zero as the sample size goes to infinity. Again, this is a nutty estimator that no one would use in practice, but it shows how it is possible for an estimator that is biased to be consistent.
Chapter 4
• Page 91 For a report on the Pasteur example, see Manzi (2012, 73) and
http://pyramid.spd.louisville.edu/∼eri/fos/Pasteur_Pouilly-le-fort.pdf.
• Page 109 The medical example is from Wilson and Butler (2007, 105).
Chapter 5
• Page 138 In Chapter 14, we show on page 497 that the bias term in a simplified example for a model with no constant is E[Σ Xi εi / Σ Xi²]. For the more standard case that includes a constant in the model, the bias term is E[Σ (Xi − X̄)εi / Σ (Xi − X̄)²], which is the covariance of X and ε divided by the variance of X. See Greene (2003, 148) for a generalization of the omitted variable bias formula for any number of included and excluded variables.
• Page 153 Harvey’s analysis uses other variables, including a measure of how
ethnically and linguistically divided countries are and a measure of distance
from the equator (which is often used in the literature to capture a historical
pattern that countries close to the equator have tended to have weaker political
institutions).
Chapter 6
• Page 181 To formally show that the OLS β̂1 and β̂0 estimates are functions of the means of the treated and untreated groups requires a bit of a slog through some algebra.

From page 49, we know that the bivariate OLS equation for the slope is

β̂1 = Σ_{i=1}^N (Ti − T̄)(Yi − Ȳ) / Σ_{i=1}^N (Ti − T̄)²

where we use Ti to indicate that our independent variable is the treatment variable.

The (1 − p) in the numerator and denominator of the first and second terms cancel out. Note also that the sum of Ȳ for the observations where Ti = 1 equals NT Ȳ, allowing us to express the OLS estimate of β̂1 as

β̂1 = (Σ_{Ti=1} Yi)/NT − Ȳ − (p Σ_{Ti=0} (Yi − Ȳ)) / (NT (1 − p))

We're almost there. Now note that p/(NT (1 − p)) in the third term can be written as 1/NC, where NC is the number of observations in the control group (for whom Ti = 0).2

We denote the average of the treated group, (Σ_{Ti=1} Yi)/NT, as ȲT and the average of the control group, (Σ_{Ti=0} Yi)/NC, as ȲC. We can rewrite our equation as

β̂1 = ȲT − Ȳ − (Σ_{Ti=0} Yi)/NC + (Σ_{Ti=0} Ȳ)/NC

Using the fact that Σ_{Ti=0} Ȳ = NC Ȳ, we can cancel some terms and (finally!) get our result:

β̂1 = ȲT − ȲC

To show that β̂0 is ȲC, use Equation 3.5 from page 50, noting that Ȳ = (ȲT NT + ȲC NC)/N.

2 To see this, rewrite Σ_{i=1}^N (Ti − p)² as Σ_{i=1}^N Ti² − 2p Σ_{i=1}^N Ti + Σ_{i=1}^N p². Note that both Σ_{i=1}^N Ti² and Σ_{i=1}^N Ti equal NT because the squared value of a dummy variable is equal to itself and because the sum of a dummy variable is equal to the number of observations for which Ti = 1. We also use the facts that Σ_{i=1}^N p² = Np² and p = NT/N, which allow us to write the denominator as NT − 2NT²/N + NT²/N.
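The algebraic result can be checked numerically. A minimal Python sketch with a made-up six-observation data set: the page-49 slope formula reproduces ȲT − ȲC exactly, and the intercept equals ȲC:

```python
# Toy data: three control (T=0) and three treated (T=1) observations.
T = [0, 0, 0, 1, 1, 1]
Y = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
N = len(T)

t_bar = sum(T) / N
y_bar = sum(Y) / N

# Bivariate OLS slope from the page-49 formula.
slope = sum((t - t_bar) * (y - y_bar) for t, y in zip(T, Y)) / sum(
    (t - t_bar) ** 2 for t in T
)

# Group means.
y_treated = sum(y for t, y in zip(T, Y) if t == 1) / T.count(1)
y_control = sum(y for t, y in zip(T, Y) if t == 0) / T.count(0)

intercept = y_bar - slope * t_bar

print(slope, y_treated - y_control)  # 6.0 6.0
print(intercept, y_control)          # 2.0 2.0
```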
• Page 183 Discussions of non-OLS difference of means tests sometimes get bogged
down in whether the variance is the same across the treatment and control groups.
If the variance varies across treatment and control groups, we should adjust our
analysis according to the heteroscedasticity that will be present.
• Page 194 These data are from Persico, Postlewaite, and Silverman (2004). Results are broadly similar even when we exclude outliers with very high salaries.
• Page 205 See Kam and Franzese (2007, 48) for the derivation of the variance of estimated effects. The variance of β̂1 + Di β̂3 is var(β̂1) + Di² var(β̂3) + 2Di cov(β̂1, β̂3), where cov is the covariance of β̂1 and β̂3 (see variance fact 3 on page 540).
– In Stata, we can display cov( β̂1 , β̂3 ) with the following commands:
regress Y X1 D X1D
matrix V = get(VCE)
disp V[3,1]
For more details, see Kam and Franzese (2007, 136–146).
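The variance formula above is simple arithmetic once the variance–covariance estimates are in hand. A minimal Python sketch with made-up values for var(β̂1), var(β̂3), and cov(β̂1, β̂3) (hypothetical numbers, not output from any actual regression):

```python
import math

def effect_variance(var_b1, var_b3, cov_b1_b3, D):
    """var(b1_hat + D * b3_hat) = var(b1) + D^2 var(b3) + 2 D cov(b1, b3)."""
    return var_b1 + D ** 2 * var_b3 + 2 * D * cov_b1_b3

# Hypothetical regression output (made-up numbers for illustration).
var_b1, var_b3, cov_b1_b3 = 0.040, 0.010, -0.005

# Variance and standard error of the estimated effect at several values of D.
for D in [0, 1, 2]:
    v = effect_variance(var_b1, var_b3, cov_b1_b3, D)
    print(D, round(v, 3), round(math.sqrt(v), 3))
```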
Chapter 7
• Page 223 The data on life expectancy and GDP per capita are from the World
Bank’s World Development Indicators database available at http://data.worldbank
.org/indicator/.
• Page 228 Temperature data is from National Aeronautics and Space Administra-
tion (2012).
Y = e^{β0} e^{β1X} e^{ε}

If we use the fact that log(e^A e^B e^C) = A + B + C and log both sides, we get the log-linear formulation:

lnY = β0 + β1X + ε

dY/dX = e^{β0} β1 e^{β1X} e^{ε}
Chapter 8
• Page 263 See Bailey, Strezhnev, and Voeten (2015) for United Nations voting data.
Chapter 9
• Page 295 Endogeneity is a central concern of the Medicaid literature. See, for
example, Currie and Gruber (1996), Finkelstein et al. (2012), and Baicker et al.
(2013).
• Page 317 The reduced form is simply the model rewritten to be only a function
of the non-endogenous variables (which are the X and Z variables, not the Y
variables). This equation isn’t anything fancy, although it takes a bit of math to
see where it comes from. Here goes:
3. Rearrange some more by moving all Y2 terms to the left side of the equation:

5. Relabel (γ0 + γ1β0)/(1 − γ1β1) as π0, (γ1β2 + γ2)/(1 − γ1β1) as π1, γ1β3/(1 − γ1β1) as π2, and γ3/(1 − γ1β1) as π3, and combine the error terms into ε̃:
This “reduced form” equation isn’t a causal model in any way. The π coefficients
are crazy mixtures of the coefficients in Equations 9.12 and 9.13, which are the
equations that embody the story we are trying to evaluate. The reduced form
equation is simply a useful way to write down the first-stage model.
Chapter 10
• Page 358 See Newhouse (1993), Manning, Newhouse, et al. (1987), and Gerber and Green (2012, 212–214) for more on the RAND experiment.
Chapter 12
• Page 423 A good place to start a consideration of maximum likelihood estimation
(MLE) is with the name. Maximum is, well, maximum; likelihood refers to the
probability of observing the data we observe; and estimation is, well, estimation.
For most people, the new bit is the likelihood. The concept is actually quite close to
ordinary usage. Roughly 20 percent of the U.S. population is under 15 years of age.
What is the likelihood that when we pick three people randomly, we get two people
under 15 and one over 15? The likelihood is L = 0.2 × 0.2 × 0.8 = 0.032. In other
words, if we pick three people at random in the United States, there is a roughly 3 percent
chance (or, "likelihood") we will observe two people under 15 and one over 15.
We can apply this concept when we do not know the underlying probability.
Suppose that we want to figure out what proportion of the population has health
insurance. Let’s call pinsured the probability that someone is insured (which is
simply the proportion of insured in the United States). Suppose we randomly select
three people, ask them if they are insured, and find out that two are insured and
one is not. The probability (or "likelihood") of observing that combination is

L = pinsured × pinsured × (1 − pinsured)

MLE finds an estimate of pinsured that maximizes the likelihood of observing the
data we actually observed.
We can get a feel for what values lead to high or low likelihoods by trying out a few
possibilities. If our estimate were pinsured = 0, the likelihood, L, would be 0. That’s
a silly guess. If our estimate were pinsured = 0.5, then L = 0.5 × 0.5 × (1 − 0.5) =
0.125, which is better. If we chose pinsured = 0.7, then L = 0.7 × 0.7 × 0.3 = 0.147,
which is even better. But if we chose pinsured = 0.9, then L = 0.9×0.9×0.1 = 0.081,
which is not as high as some of our other guesses.
Conceivably, we could keep plugging different values of pinsured into the likelihood
equation until we found the best value. Or, calculus gives us tools to quickly find
maxima.3 When we observe two people with insurance and one without, the value
of pinsured that maximizes the likelihood is 2/3, which, by the way, is the
common-sense estimate when we know that two of three observed people are insured.
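The "keep plugging in different values" idea is easy to carry out in R (a sketch, not the book's code):

```r
# Sketch: evaluate the likelihood L = p * p * (1 - p) over a grid of
# candidate values of p and find the maximizer
p <- seq(0, 1, by = 0.001)
L <- p * p * (1 - p)
p[which.max(L)]  # maximized near 2/3
```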
To use MLE to estimate a probit model, we extend this logic. Instead of estimating
a single probability parameter (pinsured in our previous example), we estimate the
probability that Yi = 1 as a function of independent variables. In other words, we
substitute Φ(β0 + β1 Xi ) for pinsured into the likelihood equation just given. In this
case, the thing we are trying to learn about is no longer pinsured ; it’s now the β’s that
determine the probability for each individual based on their respective Xi values.
If we observe two people who are insured and one who is not, we have

L = Φ(β0 + β1X1) × Φ(β0 + β1X2) × (1 − Φ(β0 + β1X3))
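In practice this likelihood is maximized numerically; in R, probit MLE is available through glm. A sketch on simulated data (variable names and coefficient values are illustrative):

```r
# Sketch: probit estimated by MLE via glm (true b0 = 0, b1 = 0.5)
set.seed(42)
X <- rnorm(500)
Y <- rbinom(500, 1, pnorm(0 + 0.5 * X))
ProbitFit <- glm(Y ~ X, family = binomial(link = "probit"))
coef(ProbitFit)  # estimates near 0 and 0.5
```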
• Page 429 To use the average-case approach, create a single “average” person for
whom the value of each independent variable is the average of that independent
variable. We calculate a fitted probability for this person. Then we add one to the
value of X1 for this average person and calculate how much the fitted probability
goes up. The downside of the average-case approach is that in the real data, the
variables might typically cluster together, with the result that no one is average
3 Here's the formal way to do this via calculus. First, calculate the derivative of the likelihood with
respect to p: ∂L/∂p = 2pinsured − 3p²insured. Second, set the derivative to zero and solve for pinsured; this
yields pinsured = 2/3.
across all variables. It's also kind of weird because dummy variables for the
"average" person will be between 0 and 1 even though no single observation can
have any value other than 0 or 1. This means, for example, that the "average"
person will be 0.52 female, 0.85 right-handed, and so forth.
To interpret probit coefficients using the average-case approach, use the following
guide:
– If X1 is a continuous variable: the estimated effect of a one-unit increase in X1 is Φ(β̂0 + β̂1(X̄1 + 1) + β̂2X̄2) − Φ(β̂0 + β̂1X̄1 + β̂2X̄2).
– If X1 is a dummy variable: the estimated effect is Φ(β̂0 + β̂1 × 1 + β̂2X̄2) − Φ(β̂0 + β̂1 × 0 + β̂2X̄2), with the other variables held at their averages.
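The average-case calculation for a continuous variable can be sketched in R as follows (simulated data; all names and coefficient values are illustrative, not from the text):

```r
# Sketch: average-case approach for a probit with two independent variables
set.seed(42)
X1 <- rnorm(500); X2 <- rnorm(500)
Y <- rbinom(500, 1, pnorm(0.2 + 0.5 * X1 - 0.3 * X2))
Fit <- glm(Y ~ X1 + X2, family = binomial(link = "probit"))
B <- coef(Fit)
# Fitted probability for the "average" person, then after adding 1 to X1
Base <- pnorm(B[1] + B[2] * mean(X1) + B[3] * mean(X2))
PlusOne <- pnorm(B[1] + B[2] * (mean(X1) + 1) + B[3] * mean(X2))
Effect <- PlusOne - Base
Effect  # estimated average-case effect of a one-unit increase in X1
```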
• Page 430 The marginal-effects approach uses calculus to determine the slope of
the fitted line. Obviously, the slope of the probit fitted line varies, so we have
to decide where to evaluate it, for example by averaging the slope across the
observations in the sample.
Chapter 13
• Page 460 Another form of correlated errors is spatial autocorrelation, which
occurs when the error for one observation is correlated with the error for another
observation that is spatially close to it. If we polled two people per household,
there may be spatial autocorrelation because those who live close to each other
(and sleep in the same bed!) may have correlated errors. This kind of situation
can arise with geography-based data, such as state- or county-level data, because
certain unmeasured similarities (meaning stuff in the error term) may be common
within regions. The consequences of spatial autocorrelation are similar to the
consequences of serial autocorrelation. Spatial autocorrelation does not cause
bias. It does, however, cause the conventional standard error
equation for OLS coefficients to be incorrect. The easiest first step for dealing
with this situation is simply to include a dummy variable for region. Often this
step will capture any regional correlations not captured by the other independent
variables. A more technically complex way of dealing with this situation is via
spatial regression statistical models. The intuition underlying these models is
similar to that for serial correlation, but the math is typically harder. See, for
example, Tam Cho and Gimpel (2012).
• Page 465 Wooldridge (2009, 416) discusses inclusion of X variables in this test.
The so-called Breusch-Godfrey test is a more general test for autocorrelation. See,
for example, Greene (2003, 269).
• Page 469 Wooldridge (2009, 424) notes that the ρ-transformed approach also
requires that εt not be correlated with Xt−1 or Xt+1. In a ρ-transformed model,
the independent variable is Xt − ρXt−1 and the error is εt − ρεt−1. If the lagged
error term (εt−1) is correlated with Xt, then the independent variable in the
ρ-transformed model will be correlated with the error term in the ρ-transformed
model.
• Page 478 R code to generate multiple simulations with unit root (or other) time
series variables:
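A minimal sketch of such a simulation (the details are illustrative and not necessarily the book's code): regress one random-walk series on another, independent one, many times, and record how often the slope appears "significant."

```r
# Sketch: spurious regression with independent unit-root (random walk) series
set.seed(42)
Reps <- 500
TStats <- rep(NA, Reps)
for (i in 1:Reps) {
  Y <- cumsum(rnorm(100))  # random walk: Y_t = Y_(t-1) + e_t
  X <- cumsum(rnorm(100))  # an independent random walk
  TStats[i] <- summary(lm(Y ~ X))$coefficients["X", "t value"]
}
mean(abs(TStats) > 1.96)  # far above the nominal 0.05 rejection rate
```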
• Page 490 To estimate a Cochrane-Orcutt manually in R, begin with the R code for
diagnosing autocorrelation and then
# Rho is rho-hat
Rho = summary(LagErrOLS)$coefficients[2]
# Length of Temp variable
N = length(Temp)
# Lagged temperature
LagTemp = c(NA, Temp[1:(N-1)])
# Lagged year
LagYear = c(NA, Year[1:(N-1)])
# Rho-transformed temperature
TempRho = Temp - Rho*LagTemp
# Rho-transformed year
YearRho = Year - Rho*LagYear
# Rho-transformed model
ClimateRho = lm(TempRho ~ YearRho)
# Display results
summary(ClimateRho)
Chapter 14
• Page 510 The attenuation bias result was introduced in Section 5.3. We can
also derive it by using the general form of endogeneity from page 60, which
is plim β̂1 = β1 + corr(X1, ε)(σε/σX1) = β1 + cov(X1, ε)/σ²X1. Note that the error term in
Equation 14.21 (which is analogous to ε in the plim equation) actually contains
−β1νi + εi. Solving for cov(X1, −β1νi + ε) yields −β1σ²ν.
Chapter 16
• Page 534 Professor Andrew Gelman, of Columbia University, directed me to this
saying of Bill James.
GUIDE TO REVIEW QUESTIONS
Chapter 1
Review question on page 7:
Panel (d): Note that the X-axis ranges from about −6 to +6. β0 is the value of
Y when X is zero and is therefore 2, which can be seen in Figure R.1. β0 is not
the value of Y at the left-most point in the figure, as it was for the other panels
in Figure 1.4.
Chapter 3
Review questions on page 64:
1. Note that the variance of the independent variable is much smaller in panel (b).
From the equation for the variance of β̂1 , we know that higher variance of X is
associated with lower variance of β̂1 , meaning the variance of β̂1 in panel (a)
should be lower.
2. Note that the number of observations is much larger in panel (d). From the
equation for the variance of β̂1 , we know that higher sample size is associated
with lower variance, meaning the variance of β̂1 in panel (d) should be lower.
Chapter 4
Review questions on page 106:
(b) The degrees of freedom is sample size minus the number of parameters
estimated, so it is 17 − 2 = 15.
[Figure R.1 (for the Chapter 1 review question): Y plotted against the independent variable X, with X ranging from −6 to 6.]
(c) The critical value for a two-sided alternative hypothesis and α = 0.01 is
2.95. We reject the null hypothesis.
(d) The critical value for a one-sided alternative hypothesis and α = 0.05 is
1.75. We reject the null hypothesis.
2. The critical value from a two-sided test is bigger because it marks the point
beyond which α/2 (rather than α) of the distribution lies. As Table 4.4 shows, the two-sided
critical values are larger than the one-sided critical values for all values of α.
3. The critical values from a small sample are larger because the t distribution
accounts for additional uncertainty about our estimate of the standard error of
β̂1 . In other words, even when the null hypothesis is true, the data could work
out to give us an unusually small estimate of se(β̂1 ), which would push up our t
statistic. That is, the more uncertainty there is about se(β̂1 ), the more we could
expect to see higher values of the t statistic even when the null hypothesis is
true. As the sample size increases, uncertainty about se(β̂1 ) decreases, so even
when the null hypothesis is true, this source of large t statistics diminishes.
Chapter 5
Review questions on page 150:
1. Not at all. R2j will be approximately zero. In a random experiment, the treatment
is uncorrelated with anything, including the other covariates. This buys us
exogeneity, but it also buys us increased precision.
2. We’d like to have a low variance for estimates, and to get that we want the R2j to
be small. In other words, we want the independent variables to be uncorrelated
with each other.
Chapter 6
Review questions on page 186:
The estimated constant (β̂0 ) is the average value of Yi for units in the excluded
category (in this case, U.S. citizens) after we have accounted for the effect
of X1 . The coefficient on the Canada dummy variable (β̂2 ) estimates how
much more or less Canadians feel about Y compared to Americans, the
excluded reference category. The coefficient on the Mexico dummy variable
(β̂3 ) estimates how much more or less Mexicans feel about Y compared to
Americans. Using Mexico or Canada as the reference category would be equally valid.
2. (a) 25
(b) 20
(c) 30
(d) 115
(e) 5
(f) −20
(g) 120
(h) −5
(i) −25
(j) 5
Chapter 7
Review questions on page 230:
1. Panel (a) looks like a quadratic model with effect accelerating as profits rise.
Panel (b) looks like a quadratic model with effect accelerating as profits rise.
Panel (c) is a bit of a trick question, as the relationship is largely linear but
with a few unusual observations for profits around 4. A quadratic model would
estimate an upside-down U-shape, but it would also be worth exploring whether these
are outliers or whether these observations can perhaps be explained by other variables.
Panel (d) looks like a quadratic model with rising and then falling effect of
profits on investment. For all quadratic models, we would simply include a
variable with the squared value of profits and let the computer program tell us
the coefficient values that produce the appropriate curve.
2. The sketches would draw lines through the masses of data for panels (a), (b),
and (d). The sketch for panel (c) would depend on whether we stuck with a
quadratic model or treated the unusual observations as outliers to be excluded
or modeled with other variables.
Chapter 8
Review question on page 282—see Table R.1:
β0 2 3 2 3
β1 −1 −1 0 0
β2 0 −2 2 −2
β3 2 2 −1 1
Chapter 9
Review questions on page 308:
1. The first stage is the model explaining drinks per week. The second stage is
the model explaining grades. The instrument is beer tax, as we can infer based
on its inclusion in the first stage and exclusion from the second stage.
3. There is no evidence on exogeneity of the beer tax in the table because this is
not something we can assess empirically.
5. No. The first stage results do not satisfy the inclusion condition, and we
therefore cannot place any faith in the results of the second stage.
Chapter 10
Review questions on page 359:
1. There is a balance problem as the treatment villages have higher income, with
a t statistic of 2.5 on the treatment variable. Hence, we cannot be sure that the
differences in the treated and untreated villages are due to the treatment or to
the fact that the treated villages are wealthier. There is no difference in treated
and untreated villages with regard to population.
2. There is a possible attrition problem as treated villages are more likely to report
test scores. This is not surprising as teachers from treated villages have more
of an incentive to report test scores. The implication of this differential attrition
is not clear, however. It could be that the low-performing school districts tend
not to report among the control village while even low-performing school
districts report among the treated villages. Hence, the attrition is not necessarily
damning of the results. Rather, it calls for further analysis.
3. The first column reports that students in treated villages had substantially
higher test scores. However, we need to control for village income as well
because the treated villages also tended to have higher income. In addition, we
should be somewhat wary of the fact that 20 villages did not report test scores.
As discussed earlier, the direction of the bias is not clear, but it would be useful
to see additional analysis of the kinds of districts that did and did not report
test scores. Perhaps the data set could be trimmed and reanalyzed.
Chapter 11
Review question on page 384:
(a) β1 = 0, β2 = 0, β3 < 0
(f) β1 < 0, β2 < 0, β3 > 0 (here, too, β3 = −β2 , which means β3 is positive because
β2 is negative)
Chapter 12
Review questions on page 426:
Panel (b): X = 2/3
(b) False. The t statistic is 1, which is not statistically significant for any
reasonable significance level.
(a) The fitted probability is Φ(0 + 0.5 × 4 − 0.5 × 0) = Φ(2), which is 0.978.
Chapter 14
Review questions on page 502:
(a) The power when β1True = 1 is 1 − Φ(2.32 − 1/0.75) = 0.162.
(b) The power when β1True = 2 is 1 − Φ(2.32 − 2/0.75) = 0.636.
2. If the estimated se(β̂1) doubled, the power will go down because the center of
the t statistic distribution will shift toward zero (because β1True/se(β̂1) gets smaller as
the standard error increases). For this higher standard error, the power when
β1True = 1 is 1 − Φ(2.32 − 1/1.5) = 0.049, and the power when β1True = 2 is
1 − Φ(2.32 − 2/1.5) = 0.161.
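These power numbers can be checked with R's normal CDF function, pnorm:

```r
# Checking the power calculations with the standard normal CDF
1 - pnorm(2.32 - 1 / 0.75)  # about 0.162
1 - pnorm(2.32 - 2 / 0.75)  # about 0.636
1 - pnorm(2.32 - 1 / 1.5)   # about 0.049
1 - pnorm(2.32 - 2 / 1.5)   # about 0.161
```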
Appendix
Review questions on page 548:
1. The table in Figure A.4 shows that the probability a standard normal random
variable is less than or equal to 1.64 is 0.950, meaning there is a 95 percent
chance that a normal random variable will be less than or equal to whatever
value is 1.64 standard deviations above its mean.
2. The table in Figure A.4 shows that the probability a standard normal random
variable is less than or equal to −1.28 is 0.100, meaning there is a 10 percent
chance that a normal random variable will be less than or equal to whatever
value is 1.28 standard deviations below its mean.
3. The table in Figure A.4 shows that the probability that a standard normal
random variable is greater than 1.28 is 0.900. Because the probability of being
above some value is 1 minus the probability of being below some value, there
is a 10 percent chance that a normal random variable will be greater than or
equal to whatever number is 1.28 standard deviations above its mean.
5. First, convert −3 to standard deviations above or below the mean. In this case,
if the variance is 9, then the standard deviation (the square root of the variance)
is 3. Therefore, −3 is the same as one standard deviation below the mean. From
the table in Figure A.4, we see that there is a 0.16 probability a normal variable
will be more than one standard deviation below its mean. In other words, the
probability of being less than (−3 − 0)/√9 = −1 standard deviations is 0.16.
6. First, convert 9 to standard deviations above or below the mean. The standard
deviation (the square root of the variance) is 2. The value 9 is (9 − 7.2)/2 = 1.8/2 =
0.9 standard deviations above the mean. The value 0.9 does not appear in
Figure A.4. However, it is close to 1, and the probability of being less than
1 is 0.84. Therefore, a reasonable approximation is in the vicinity of 0.8. The
actual value is 0.82 and can be calculated as discussed in the Computing Corner
on page 554.
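The table lookups in these answers can be reproduced exactly with pnorm, the approach the Computing Corner on page 554 points to:

```r
# Checking the normal-distribution answers with pnorm
pnorm(1.64)                # about 0.95 (question 1)
pnorm(-1.28)               # about 0.10 (question 2)
1 - pnorm(1.28)            # about 0.10 (question 3)
pnorm((-3 - 0) / sqrt(9))  # about 0.16 (question 5)
pnorm((9 - 7.2) / 2)       # about 0.82 (question 6)
```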
BIBLIOGRAPHY
Acemoglu, Daron, Simon Johnson, and James A. Robinson. 2001. The Colonial Origins of Comparative Development: An Empirical Investigation. American Economic Review 91(5): 1369–1401.
Acemoglu, Daron, Simon Johnson, James A. Robinson, and Pierre Yared. 2008. Income and Democracy. American Economic Review 98(3): 808–842.
Acharya, Avidit, Matthew Blackwell, and Maya Sen. 2016. Explaining Causal Findings without Bias: Detecting and Assessing Direct Effects. American Political Science Review 110(3): 512–529.
Achen, Christopher H. 1982. Interpreting and Using Regression. Newbury Park, CA: Sage Publications.
Achen, Christopher H. 2000. Why Lagged Dependent Variables Can Suppress the Explanatory Power of Other Independent Variables. Manuscript, University of Michigan.
Achen, Christopher H. 2002. Toward a New Political Methodology: Microfoundations and ART. Annual Review of Political Science 5: 423–450.
Albertson, Bethany, and Adria Lawrence. 2009. After the Credits Roll: The Long-Term Effects of Educational Television on Public Knowledge and Attitudes. American Politics Research 37(2): 275–300.
Alvarez, R. Michael, and John Brehm. 1995. American Ambivalence towards Abortion Policy: Development of a Heteroskedastic Probit Model of Competing Values. American Journal of Political Science 39(4): 1055–1082.
Anderson, James M., John M. Macdonald, Ricky Bluthenthal, and J. Scott Ashwood. 2013. Reducing Crime by Shaping the Built Environment with Zoning: An Empirical Study of Los Angeles. University of Pennsylvania Law Review 161: 699–756.
Angrist, Joshua. 2006. Instrumental Variables Methods in Experimental Criminological Research: What, Why and How. Journal of Experimental Criminology 2(1): 23–44.
Angrist, Joshua, and Alan Krueger. 1991. Does Compulsory School Attendance Affect Schooling and Earnings? Quarterly Journal of Economics 106(4): 979–1014.
Angrist, Joshua, and Jörn-Steffen Pischke. 2009. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton, NJ: Princeton University Press.
Angrist, Joshua, and Jörn-Steffen Pischke. 2010. The Credibility Revolution in Empirical Economics: How Better Research Design Is Taking the Con out of Econometrics. National Bureau of Economic Research working paper. http://www.nber.org/papers/w15794
Angrist, Joshua, Kathryn Graddy, and Guido Imbens. 2000. The Interpretation of Instrumental Variables Estimators in Simultaneous Equations Models with an Application to the Demand for Fish. Review of Economic Studies 67(3): 499–527.
Anscombe, Francis J. 1973. Graphs in Statistical Analysis. American Statistician 27(1): 17–21.
Anzia, Sarah. 2012. The Election Timing Effect: Evidence from a Policy Intervention in Texas. Quarterly Journal of Political Science 7(3): 209–248.
Arellano, Manuel, and Stephen Bond. 1991. Some Tests of Specification for Panel Data. Review of Economic Studies 58(2): 277–297.
Aron-Dine, Aviva, Liran Einav, and Amy Finkelstein. 2013. The RAND Health Insurance Experiment, Three Decades Later. Journal of Economic Perspectives 27(1): 197–222.
Aronow, Peter M., and Cyrus Samii. 2016. Does Regression Produce Representative Estimates of Causal Effects? American Journal of Political Science 60(1): 250–267.
Baicker, Katherine, and Amitabh Chandra. 2017. Evidence-Based Health Policy. New England Journal of Medicine 377(25): 2413–2415.
Baicker, Katherine, Sarah Taubman, Heidi Allen, Mira Bernstein, Jonathan Gruber, Joseph P. Newhouse, Eric Schneider, Bill Wright, Alan Zaslavsky, Amy Finkelstein, and the Oregon Health Study Group. 2013. The Oregon Experiment—Medicaid's Effects on Clinical Outcomes. New England Journal of Medicine 368(18): 1713–1722.
Bailey, Michael A., and Elliott Fullmer. 2011. Balancing in the States, 1978–2009. State Politics and Policy Quarterly 11(2): 149–167.
Bailey, Michael A., Daniel J. Hopkins, and Todd Rogers. 2015. Unresponsive and Unpersuaded: The Unintended Consequences of Voter Persuasion Efforts. Manuscript, Georgetown University.
Bailey, Michael A., Jon Mummolo, and Hans Noel. 2012. Tea Party Influence: A Story of Activists and Elites. American Politics Research 40(5): 769–804.
Bailey, Michael A., Jeffrey S. Rosenthal, and Albert H. Yoon. 2014. Grades and Incentives: Assessing Competing Grade Point Average Measures and Postgraduate Outcomes. Studies in Higher Education.
Bailey, Michael A., Anton Strezhnev, and Erik Voeten. 2015. Estimating Dynamic State Preferences from United Nations Voting Data. Journal of Conflict Resolution.
Baiocchi, Michael, Jing Cheng, and Dylan S. Small. 2014. Tutorial in Biostatistics: Instrumental Variable Methods for Causal Inference. Statistics in Medicine 33(13): 2297–2340.
Baltagi, Badi H. 2005. Econometric Analysis of Panel Data, 3rd ed. Hoboken, NJ: Wiley.
Banerjee, Abhijit Vinayak, and Esther Duflo. 2011. Poor Economics: A Radical Rethinking of the Way to Fight Global Poverty. New York: Public Affairs.
Bartels, Larry M. 2008. Unequal Democracy: The Political Economy of the New Gilded Age. Princeton, NJ: Princeton University Press.
Beck, Nathaniel. 2010. Making Regression and Related Output More Helpful to Users. The Political Methodologist 18(1): 4–9.
Beck, Nathaniel, and Jonathan N. Katz. 1996. Nuisance vs. Substance: Specifying and Estimating Time-Series–Cross-Section Models. Political Analysis 6: 1–36.
Beck, Nathaniel, and Jonathan N. Katz. 2011. Modeling Dynamics in Time-Series–Cross-Section Political Economy Data. Annual Review of Political Science 14: 331–352.
Berk, Richard A., Alec Campbell, Ruth Klap, and Bruce Western. 1992. The Deterrent Effect of Arrest in Incidents of Domestic Violence: A Bayesian Analysis of Four Field Experiments. American Sociological Review 57(5): 698–708.
Bertrand, Marianne, and Sendhil Mullainathan. 2004. Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination. American Economic Review 94(4): 991–1013.
Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. 2004. How Much Should We Trust Differences-in-Differences Estimates? Quarterly Journal of Economics 119(1): 249–275.
Blinder, Alan S., and Mark W. Watson. 2013. Presidents and the Economy: A Forensic Investigation. Manuscript, Princeton University.
Bloom, Howard S. 2012. Modern Regression Discontinuity Analysis. Journal of Research on Educational Effectiveness 5(1): 43–82.
Bound, John, David Jaeger, and Regina Baker. 1995. Problems with Instrumental Variables Estimation When the Correlation Between the Instruments and the Endogenous Explanatory Variable Is Weak. Journal of the American Statistical Association 90(430): 443–450.
Box, George E. P. 1976. Science and Statistics. Journal of the American Statistical Association 71(356): 791–799.
BIBLIOGRAPHY 579
Box-Steffensmeier, Janet, and Agnar Freyr Campbell, James E. 2011. The Economic Records
Helgason. 2016. Introduction to Symposium on of the Presidents: Party Differences and Inherited
Time Series Error Correction Methods in Economic Conditions. Forum 9(1): 1–29.
Political Science. Political Analysis 24(1):1–2. Card, David. 1990. The Impact of the Mariel
Box-Steffensmeier, Janet M., and Bradford S. Jones. Boatlift on the Miami Labor Market. Industrial
2004. Event History Modeling: A Guide for and Labor Relations Review 43(2): 245–257.
Social Scientists. Cambridge, U.K.: Cambridge
Card, David. 1999. The Causal Effect of Education
University Press.
on Earnings. In Handbook of Labor Economics,
Bradford-Hill, Austin. 1965. The Environment and vol. 3, O. Ashenfelter and D. Card, eds.
Disease: Association or Causation? Proceedings Amsterdam: Elsevier Science.
of the Royal Society of Medicine 58(5): 295–300.
Card, David, Carlos Dobkin, and Nicole Maestas.
Brambor, Thomas, William Roberts Clark, and Matt 2009. Does Medicare Save Lives? Quarterly
Golder. 2006. Understanding Interaction Models: Journal of Economics 124(2): 597–636.
Improving Empirical Analyses. Political Analysis
Carrell, Scott E., Mark Hoekstra, and James E.
14: 63–82.
West. 2010. Does Drinking Impair College
Braumoeller, Bear F. 2004. Hypothesis Testing and Performance? Evidence from a Regression
Multiplicative Interaction Terms. International Discontinuity Approach. National Bureau of
Organization 58(4): 807–820. Economic Research Working Paper.
Brown, Peter C., Henry L. Roediger III, and Mark http://www.nber.org/papers/w16330
A. McDaniel. 2014. Making It Stick: The Science Carroll, Royce, Jeffrey B. Lewis, James Lo, Keith
of Successful Learning. Cambridge, MA: T. Poole, and Howard Rosenthal. 2009.
Harvard University Press. Measuring Bias and Uncertainty in
Brownlee, Shannon, and Jeanne Lenzer. 2009. Does DW-NOMINATE Ideal Point Estimates via the
the Vaccine Matter? The Atlantic, November. Parametric Bootstrap. Political Analysis 17:
www.theatlantic.com/doc/200911/brownlee-h1n1/2 261–27. Updated at http://voteview.com/
dwnominate.asp
Brumm, Harold J., Dennis Epple, and Bennett T.
McCallum. 2008. Simultaneous Equation Carroll, Royce, Jeffrey B. Lewis, James Lo, Keith
Econometrics: Some Weak-Instrument and T. Poole, and Howard Rosenthal. 2014.
Time-Series Issues. Manuscript, Carnegie DW-NOMINATE Scores with Bootstrapped
Mellon. Standard Errors. Updated February 17, 2013, at
http://voteview.com/dwnominate.asp
Buckles, Kasey, and Dan Hungerman. 2013. Season
of Birth and Later Outcomes: Old Questions, Cellini, Stephanie Riegg, Fernando Ferreira, and
New Answers. The Review of Economics and Jesse Rothstein. 2010. The Value of School
Statistics 95(3): 711–724. Facility Investments: Evidence from a Dynamic
Regression Discontinuity Design. Quarterly
Buddlemeyer, Hielke, and Emmanuel Skofias. 2003.
Journal of Economics 125(1): 215–261.
An Evaluation on the Performance of Regression
Discontinuity Design on PROGRESA. Institute Chabris, Christopher, and Daniel Simmons. Does
for Study of Labor, Discussion Paper 827. the Ad Make Me Fat? New York Times, March
10, 2013.
Burde, Dana, and Leigh L. Linden. 2013. Bringing
Education to Afghan Girls: A Randomized Chakraborty, Indraneel, Hans A. Holter, and Serhiy
Controlled Trial of Village-Based Schools. Stepanchuk. 2012. Marriage Stability, Taxation,
American Economic Journal: Applied Economics and Aggregate Labor Supply in the U.S. vs.
5(3): 27–40. Europe. Uppsala University Working Paper
Burtless, Gary, 1995. The Case for Randomized 2012: 10.
Field Trials in Economic and Policy Research. Chen, Xiao, Philip B. Ender, Michael Mitchell, and
Journal of Economic Perspectives 9(2): 63–84. Christine Wells. 2003. Regression with Stata.
Finkelstein, Amy, Sarah Taubman, Bill Wright, Mira Bernstein, Jonathan Gruber, Joseph P. Newhouse, Heidi Allen, Katherine Baicker, and the Oregon Health Study Group. 2012. The Oregon Health Insurance Experiment: Evidence from the First Year. Quarterly Journal of Economics 127(3): 1057–1106.
Gaubatz, Kurt Taylor. 2015. A Survivor's Guide to R: An Introduction for the Uninitiated and the Unnerved. Los Angeles: Sage.
Gerber, Alan S., and Donald P. Green. 2000. The Effects of Canvassing, Telephone Calls, and Direct Mail on Voter Turnout: A Field Experiment. American Political Science Review 94(3): 653–663.
Gerber, Alan S., and Donald P. Green. 2005. Correction to Gerber and Green (2000), Replication of Disputed Findings, and Reply to Imai (2005). American Political Science Review 99(2): 301–313.
Gerber, Alan S., and Donald P. Green. 2012. Field Experiments: Design, Analysis, and Interpretation. New York: Norton.
Gertler, Paul. 2004. Do Conditional Cash Transfers Improve Child Health? Evidence from PROGRESA's Control Randomized Experiment. American Economic Review 94(2): 336–341.
Goldberger, Arthur S. 1991. A Course in Econometrics. Cambridge, MA: Harvard University Press.
Gormley, William T., Jr., Deborah Phillips, and Ted Gayer. 2008. Preschool Programs Can Boost School Readiness. Science 320(5884): 1723–1724.
Grant, Taylor, and Matthew J. Lebo. 2016. Error Correction Methods with Political Time Series. Political Analysis 24(1): 3–30.
Green, Donald P., Soo Yeon Kim, and David H. Yoon. 2001. Dirty Pool. International Organization 55(2): 441–468.
Green, Joshua. 2012. The Science Behind Those Obama Campaign E-Mails. Business Week (November 29). http://www.businessweek.com/articles/2012-11-29/the-science-behind-those-obama-campaign-e-mails
Greene, William. 2003. Econometric Analysis, 5th ed. Upper Saddle River, NJ: Prentice Hall.
Greene, William. 2008. Econometric Analysis, 6th ed. Upper Saddle River, NJ: Prentice Hall.
Grimmer, Justin, Eitan Hersh, Brian Feinstein, and Daniel Carpenter. 2010. Are Close Elections Randomly Determined? Manuscript, Stanford University.
Hanmer, Michael J., and Kerem Ozan Kalkan. 2013. Behind the Curve: Clarifying the Best Approach to Calculating Predicted Probabilities and Marginal Effects from Limited Dependent Variable Models. American Journal of Political Science 57(1): 263–277.
Hanushek, Eric, and Ludger Woessmann. 2012. Do Better Schools Lead to More Growth? Cognitive Skills, Economic Outcomes, and Causation. Journal of Economic Growth 17(4): 267–321.
Harvey, Anna. 2011. What's So Great about Independent Courts? Rethinking Crossnational Studies of Judicial Independence. Manuscript, New York University.
Hausman, Jerry A., and William E. Taylor. 1981. Panel Data and Unobservable Individual Effects. Econometrica 49(6): 1377–1398.
Heckman, James J. 1979. Sample Selection Bias as a Specification Error. Econometrica 47(1): 153–161.
Heinz, Matthias, Sabrina Jeworrek, Vanessa Mertins, Heiner Schumacher, and Matthias Sutter. 2017. Measuring Indirect Effects of Unfair Employer Behavior on Worker Productivity: A Field Experiment. MPI Collective Goods Preprint, No. 2017/22.
Herndon, Thomas, Michael Ash, and Robert Pollin. 2014. Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff. Cambridge Journal of Economics 38(2): 257–279.
Howell, William G., and Paul E. Peterson. 2004. The Use of Theory in Randomized Field Trials: Lessons from School Voucher Research on Disaggregation, Missing Data, and the Generalization of Findings. American Behavioral Scientist 47(5): 634–657.
Imai, Kosuke. 2005. Do Get-Out-the-Vote Calls Reduce Turnout? The Importance of Statistical Methods for Field Experiments. American Political Science Review 99(2): 283–300.
Imai, Kosuke, Gary King, and Elizabeth A. Stuart. 2008. Misunderstandings among Experimentalists and Observationalists about Causal Inference. Journal of the Royal Statistical Society, Series A (Statistics in Society) 171(2): 481–502.
Imbens, Guido W. 2014. Instrumental Variables: An Econometrician's Perspective. IZA Discussion Paper 8048. Bonn: Forschungsinstitut zur Zukunft der Arbeit (IZA).
Imbens, Guido W., and Thomas Lemieux. 2008. Regression Discontinuity Designs: A Guide to Practice. Journal of Econometrics 142(2): 615–635.
Iqbal, Zaryab, and Christopher Zorn. 2008. The Political Consequences of Assassination. Journal of Conflict Resolution 52(3): 385–400.
Jackman, Simon. 2009. Bayesian Analysis for the Social Sciences. Hoboken, NJ: Wiley.
Jacobson, Gary C. 1978. Effects of Campaign Spending in Congressional Elections. American Political Science Review 72(2): 469–491.
Kalla, Joshua L., and David E. Broockman. 2015. Congressional Officials Grant Access due to Campaign Contributions: A Randomized Field Experiment. American Journal of Political Science 60(3): 545–558.
Kam, Cindy D., and Robert J. Franceze, Jr. 2007. Modeling and Interpreting Interactive Hypotheses in Regression Analysis. Ann Arbor: University of Michigan Press.
Kastellec, Jonathan P., and Eduardo L. Leoni. 2007. Using Graphs Instead of Tables in Political Science. Perspectives on Politics 5(4): 755–771.
Keele, Luke, and Nathan J. Kelly. 2006. Dynamic Models for Dynamic Theories: The Ins and Outs of Lagged Dependent Variables. Political Analysis 14: 186–205.
Kennedy, Peter. 2008. A Guide to Econometrics, 6th ed. Malden, MA: Blackwell Publishing.
Khimm, Suzy. 2010. Who Is Alvin Greene? Mother Jones. http://motherjones.com/mojo/2010/06/alvin-greene-south-carolina
King, Gary. 1989. Unifying Political Methodology: The Likelihood Theory of Statistical Inference. Cambridge: Cambridge University Press.
King, Gary. 1995. Replication, Replication. PS: Political Science and Politics 28(3): 444–452.
King, Gary, and Langche Zeng. 2001. Logistic Regression in Rare Events Data. Political Analysis 9: 137–163.
King, Gary, Robert Keohane, and Sidney Verba. 1994. Designing Social Inquiry: Scientific Inference in Qualitative Research. Princeton, NJ: Princeton University Press.
Kiviet, Jan F. 1995. On Bias, Inconsistency, and Efficiency of Various Estimators in Dynamic Panel Data Models. Journal of Econometrics 68(1): 53–78.
Klick, Jonathan, and Alexander Tabarrok. 2005. Using Terror Alert Levels to Estimate the Effect of Police on Crime. Journal of Law and Economics 48(1): 267–279.
Koppell, Jonathan G. S., and Jennifer A. Steen. 2004. The Effects of Ballot Position on Election Outcomes. Journal of Politics 66(1): 267–281.
La Porta, Rafael, F. Lopez-de-Silanes, C. Pop-Eleches, and A. Schliefer. 2004. Judicial Checks and Balances. Journal of Political Economy 112(2): 445–470.
Lee, David S. 2008. Randomized Experiments from Non-random Selection in U.S. House Elections. Journal of Econometrics 142(2): 675–697.
Lee, David S. 2009. Training, Wages, and Sample Selection: Estimating Sharp Bounds on Treatment Effects. Review of Economic Studies 76(3): 1071–1102.
Lee, David S., and Thomas Lemieux. 2010. Regression Discontinuity Designs in Economics. Journal of Economic Literature 48(2): 281–355.
Lenz, Gabriel, and Alexander Sahn. 2017.
Keele, Luke, and David Park. 2006. Difficult Achieving Statistical Significance with
Choices: An Evaluation of Heterogenous Choice Covariates and without Transparency.
Models. Manuscript, Ohio State University. Manuscript.
BIBLIOGRAPHY 583
Lerman, Amy E. 2009. The People Prisons Make: Manning, Willard G., Joseph P. Newhouse, Naihua
Effects of Incarceration on Criminal Psychology. Duan, Emmett B. Keeler, and Arleen Leibowitz.
In Do Prisons Make Us Safer? Steve Raphael 1987. Health Insurance and the Demand for
and Michael Stoll, eds. New York: Russell Sage Medical Care: Evidence from a Randomized
Foundation. Experiment. American Economic Review 77(3):
Levitt, Steven D. 1997. Using Electoral Cycles in 251–277.
Police Hiring to Estimate the Effect of Police on Manzi, Jim. 2012. Uncontrolled: The Surprising
Crime. American Economic Review 87(3): Payoff of Trial-and-Error for Business, Politics
270–290. and Society. New York: Basic Books.
Levitt, Steven D. 2002. Using Electoral Cycles in Marvell, Thomas B., and Carlisle E. Moody. 1996.
Police Hiring to Estimate the Effect of Police on Specification Problems, Police Levels and Crime
Crime: A Reply. American Economic Review Rates. Criminology 34(4): 609–646.
92(4): 1244–1250. McClellan, Chandler B., and Erdal Tekin. 2012.
Lochner, Lance, and Enrico Moretti. 2004. The Stand Your Ground Laws and Homicides.
Effect of Education on Crime: Evidence from National Bureau of Economic Research Working
Prison Inmates, Arrests, and Self-Reports. Paper No. 18187.
American Economic Review 94(1): 155–189. McCrary, Justin. 2002. Using Electoral Cycles in
Long, J. Scott. 1997. Regression Models for Police Hiring to Estimate the Effect of Police on
Categorical and Limited Dependent Variables. Crime: Comment. American Economic Review
London: Sage Publications. 92(4): 1236–1243.
Lorch, Scott A., Michael Baiocchi, Corinne S. McCrary, Justin. 2008. Manipulation of the Running
Ahlberg, and Dylan E. Small. 2012. The Variable in the Regression Discontinuity Design:
Differential Impact of Delivery Hospital on the A Density Test. Journal of Econometrics 142(2):
Outcomes of Premature Infants. Pediatrics 698–714.
130(2): 270–278. Miguel, Edward, and Michael Kremer. 2004.
Ludwig, Jens, and Douglass L. Miller. 2007. Does Worms: Identifying Impacts on Education and
Head Start Improve Children’s Life Chances? Health in the Presence of Treatment
Evidence from a Regression Discontinuity Externalities. Econometrica 72(1): 159–217.
Design. Quarterly Journal of Economics 122(1): Miguel, Edward, Shanker Satyanath, and Ernest
159–208. Sergenti. 2004. Economic Shocks and Civil
Lumley, Thomas, Paula Diehr, Scott Emerson, and Conflict: An Instrumental Variables Approach.
Lu Chen. 2002. The Importance of the Normality Journal of Political Economy 112(4): 725–753.
Assumption in Large Public Health Data Sets. Montgomery, Jacob M., Brendan Nyhan, and
Annual Review of Public Health 23: 151–169. Michelle Torres. 2017. How Conditioning on
Madestam, Andreas, Daniel Shoag, Stan Veuger, Post-Treatment Variables Can Ruin Your
and David Yanagizawa-Drott. 2013. Do Political Experiment and What to Do about It.
Protests Matter? Evidence from the Tea Party Manuscript, Washington University.
Movement. Quarterly Journal of Economics Morgan, Stephen L., and Christopher Winship.
128(4): 1633–1685. 2014. Counterfactuals and Causal Inference:
Makowsky, Michael, and Thomas Stratmann. 2009. Methods and Principals for Social Research,
Political Economy at Any Speed: What 2nd ed. Cambridge, U.K.: Cambridge University
Determines Traffic Citations? American Press.
Economic Review 99(1): 509–527. Murnane, Richard J., and John B. Willett. 2011.
Malkiel, Burton G. 2003. A Random Walk Down Methods Matter: Improving Causal Inference in
Wall Street: The Time-Tested Strategy for Educational and Social Science Research.
Successful Investing. New York: W.W. Norton. Oxford, U.K.: Oxford University Press.
584 BIBLIOGRAPHY
Murray, Michael P. 2006a. Avoiding Invalid Payments of 2008. American Economic Review
Instruments and Coping with Weak Instruments. 103(6): 2530–2553.
Journal of Economic Perspectives 20(4):
Persico, Nicola, Andrew Postlewaite, and Dan
111–132.
Silverman. 2004. The Effect of Adolescent
Murray, Michael P. 2006b. Econometrics: A Modern Experience on Labor Market Outcomes: The
Introduction. Boston: Pearson Addison Wesley. Case of Height. Journal of Political Economy
National Aeronautics and Space Administration. 112(5): 1019–1053.
2012. Combined Land-Surface Air and Pesaran, M. Hasehm, Yongcheol Shin, and Richard
Sea-Surface Water Temperature Anomalies J. Smith. 2001. Bounds Testing Approaches to
(Land-Ocean Temperature Index, LOTI) the Analysis of Level Relationships. Journal of
Global-Mean Monthly, Seasonal, and Annual Applied Econometrics 16(3): 289–326.
Means, 1880–Present, Updated through Most
Recent Months at https://data.giss.nasa.gov/ Philips, Andrew Q. 2018. Have Your Cake and Eat It
gistemp/ Too? Cointegration and Dynamic Inference from
Autoregressive Distributed Lag Models.
National Center for Addiction and Substance American Journal of Political Science 62(1):
Abuse at Columbia University. 2011. National 230–244.
Survey of American Attitudes on Substance
Abuse XVI: Teens and Parents (August). Pickup, Mark, and Paul M. Kellstedt. 2017.
Accessed November 10, 2011, at Equation Balance in Time Series Analysis: What
www.casacolumbia.org/download.aspx?path= It Is and How to Apply It. Manuscript, Simon
/UploadedFiles/ooc3hqnl.pdf Fraser University.
Newhouse, Joseph. 1993. Free for All? Lessons from Pierskalla, Jan H., and Florian M. Hollenbach. 2013.
the RAND Health Insurance Experiment. Technology and Collective Action: The Effect of
Cambridge, MA: Harvard University Press. Cell Phone Coverage on Political Violence in
Africa. American Political Science Review
Nevin, Rick. 2013. Lead and Crime: Why This
107(2): 207–224.
Correlation Does Mean Causation. January 26.
http://ricknevin.com/uploads/Lead_and_Crime_ Reinhart, Carmen M., and Kenneth S. Rogoff. 2010.
_Why_This_Correlation_Does_Mean_ Growth in a Time of Debt. American Economic
Causation.pdf Review: Papers & Proceedings 100(2): 573–578.
Noel, Hans. 2010. Ten Things Political Scientists Rice, John A. 2007. Mathematical Statistics and
Know that You Don’t. The Forum 8(3): article 12. Data Analysis, 3rd ed. Belmont, CA: Thomson.
Orwell, George. 1946. In Front of Your Nose. Roach, Michael A. 2013. Mean Reversion or a
Tribune. London (March 22). Breath of Fresh Air? The Effect of NFL
Osterholm, Michael T., Nicholas S. Kelley, Alfred Coaching Changes on Team Performance in the
Sommer, and Edward A. Belongia. 2012. Salary Cap Era. Applied Economics Letters
Efficacy and Effectiveness of Influenza Vaccines: 20(17): 1553–1556.
A Systematic Review and Meta-analysis. Lancet: Romer, Christina D. 2011. What Do We Know about
Infectious Diseases 12(1): 36–44. the Effects of Fiscal Policy? Separating Evidence
Palmer, Brian. 2013. I Wish I Was a Little Bit from Ideology. Talk at Hamilton College,
Shorter. Slate. July 30. http://www.slate.com/ November 7.
articles/health_and_science/science/2013/07/ Rossin-Slater, Maya, Christopher J. Ruhm, and Jane
height_and_longevity_the_research_is_clear_ Waldfogel. 2014. The Effects of California’s Paid
being_tall_is_hazardous_to_your.html Family Leave Program on Mothers’
Parker, Jonathan A., Nicholas S. Souleles, David Leave-Taking and Subsequent Labor Market
S. Johnson, and Robert McClelland. 2013. Outcomes. Journal of Policy Analysis and
Consumer Spending and the Economic Stimulus Management 32(2): 224–245.
BIBLIOGRAPHY 585
Scheve, Kenneth, and David Stasavage. 2012. Tam Cho, Wendy K., and James G. Gimpel. 2012.
Democracy, War, and Wealth: Lessons from Two Geographic Information Systems and the Spatial
Centuries of Inheritance Taxation. American Dimensions of American Politics. Annual Review
Political Science Review 106(1): 81–102. of Political Science 15: 443–460.
Schrodt, Phil. 2014. Seven Deadly Sins of Tufte, Edward R. 2001. The Visual Display of
Contemporary Quantitative Political Science. Quantitative Information, 2nd ed. Cheshire, CT:
Journal of Peace Research 51: 287–300. Graphics Press.
Schwabish, Jonathan A. 2004. An Economist’s Verzani, John. 2004. Using R for Introductory
Guide to Visualizing Data. Journal of Economic Statistics. London: Chapman and Hall.
Perspectives 28(1): 209–234.
Wawro, Greg. 2002. Estimating Dynamic Models in
Shen XiaoFeng, Yunping Li, ShiQin Xu, Nan Wang, Political Science. Political Analysis 10: 25–48.
Sheng Fan, Xiang Qin, Chunxiu Zhou and Philip
Wilson, Sven E., and Daniel M. Butler. 2007. A Lot
Hess. 2017. Epidural Analgesia During the
More to Do: The Sensitivity of Time-Series
Second Stage of Labor: A Randomized
Cross Section Analyses to Simple Alternative
Controlled Trial. Obstetrics & Gynecology
Specifications. Political Analysis 15: 101–123.
130(5): 1097–1103.
Sides, John, and Lynn Vavreck. 2013. The Gamble: Wooldridge, Jeffrey M. 2002. Econometric Analysis
Choice and Chance in the 2012 Presidential of Cross Section and Panel Data. Cambridge,
Election. Princeton, NJ: Princeton University MA: MIT Press.
Press. Wooldridge, Jeffrey M. 2009. Introductory
Snipes, Jeffrey B., and Edward R. Maguire. 1995. Econometrics, 4th ed. Mason, OH:
Country Music, Suicide, and Spuriousness. South-Western Cengage Learning.
Social Forces 74(1): 327–329. Wooldridge, Jeffrey M. 2013. Introductory
Solnick, Sara J., and David Hemenway. 2011. The Econometrics, 5th ed. Mason, OH:
“Twinkie Defense”: The Relationship between South-Western Cengage Learning.
Carbonated Non-diet Soft Drinks and Violence World Values Survey. 2008. Integrated EVS/WVS
Perpetration among Boston High School 1981–2008 Data File. http://www.world
Students. Injury Prevention. valuessurvey.org/
Sovey, Allison J., and Donald P. Green. 2011. Yau, Nathan. 2011. Visualize This: The Flowing
Instrumental Variables Estimation in Political Data Guide to Design, Visualization, and
Science: A Reader’s Guide. American Journal of Statistics. Hoboken, NJ: Wiley.
Political Science 55(1): 188–200.
Zakir Hossain, Mohammad. 2011. The Use of
Stack, Steven, and Jim Gundlach. 1992. The Effect Box-Cox Transformation Technique in Economic
of Country Music on Suicide. Social Forces and Statistical Analyses. Journal of Emerging
71(1): 211–218. Trends in Economics and Management Sciences
Staiger, Douglas, and James H. Stock. 1997. 2(1): 32–39.
Instrumental Variables Regressions with Weak Ziliak, Stephen, and Deirdre N. McCloskey. 2008.
Instruments. Econometrica 65(3): 557–586. The Cult of Statistical Significance: How the
Stock, James H, and Mark W. Watson. 2011. Standard Error Costs Us Jobs, Justice, and
Introduction to Econometrics, 3rd ed. Boston: Lives. Ann Arbor: University of Michigan
Addison-Wesley. Press.
GLOSSARY

χ2 distribution  A probability distribution that characterizes the distribution of squared standard normal random variables. Standard errors are distributed according to this distribution, which means that the χ2 plays a role in the t distribution. Also relevant for many statistical tests, including likelihood ratio tests for maximum likelihood estimations. 549

ABC issues  Three issues that every experiment needs to address: attrition, balance, and compliance. 334

adjusted R2  The R2 with a penalty for the number of variables included in the model. Widely reported, but rarely useful. 150

alternative hypothesis  An alternative hypothesis is what we accept if we reject the null hypothesis. It's not something that we are proving (given inherent statistical uncertainty), but it is the idea we hang onto if we reject the null. 94

AR(1) model  A model in which the errors are assumed to depend on their value from the previous period. 461

assignment variable  An assignment variable determines whether someone receives some treatment. People with values of the assignment variable above some cutoff receive the treatment; people with values of the assignment variable below the cutoff do not receive the treatment. 375

attenuation bias  A form of bias in which the estimated coefficient is closer to zero than it should be. Measurement error in the independent variable causes attenuation bias. 145

attrition  Occurs when people drop out of an experiment altogether such that we do not observe the dependent variable for them. 354

augmented Dickey-Fuller test  A test for a unit root in time series data that includes a time trend and lagged values of the change in the variable as independent variables. 481

autocorrelation  Errors are autocorrelated if the error in one time period is correlated with the error in the previous time period. One of the assumptions necessary to use the standard equation for variance of OLS estimates is that errors are not autocorrelated. Autocorrelation is common in time series data. 69

autoregressive process  A process in which the value of a variable depends directly on the value from the previous period. Autocorrelation is often modeled as an autoregressive process such that the error term is a function of previous error terms. A standard dynamic model is also autoregressive, as the dependent variable is modeled to depend on the lagged value of the dependent variable. 460

auxiliary regression  A regression that is not directly the one of interest but yields information helpful in analyzing the equation we really care about. 138

balance  Treatment and control groups are balanced if the distributions of control variables are the same for both groups. 336

bias  A biased coefficient estimate will systematically be higher or lower than the true value. 58

binned graphs  Used in regression discontinuity analysis. The assignment variable is divided into bins, and the average value of the dependent variable is plotted for each bin. The plots allow us to visualize a discontinuity at the treatment cutoff. Binned graphs also are useful to help us identify possible non-linearities in the relationship between the assignment variable and the dependent variable. 386

blocking  Picking treatment and control groups so that they are equal in covariates. 335

categorical variables  Variables that have two or more categories but do not have an intrinsic ordering. Also known as nominal variables. 179, 193

central limit theorem  The mean of a sufficiently large number of independent draws from any
distribution will be normally distributed. Because OLS estimates are weighted averages, the central limit theorem implies that β̂1 will be normally distributed. 56

ceteris paribus  All else being equal. A phrase used to describe multivariate regression results as a coefficient is said to account for change in the dependent variable with all other independent variables held constant. 131

codebook  A file that describes sources for variables and any adjustments made. A codebook is a necessary element of a replication file. 29

collider bias  Bias that occurs when a post-treatment variable creates a pathway for spurious effects to appear in our estimation. 238

compliance  The condition of subjects receiving the experimental treatment to which they were assigned. A compliance problem occurs when subjects assigned to an experimental treatment do not actually experience the treatment, often because they opt out in some way. 340

confidence interval  Defines the range of true values that are consistent with the observed coefficient estimate. Confidence intervals depend on the point estimate, β̂1, and the measure of uncertainty, se(β̂1). 117, 133

confidence levels  Term referring to confidence intervals and based on 1 − α. 117

consistency  A consistent estimator is one for which the distribution of the estimate gets closer and closer to the true value as the sample size increases. For example, the bivariate OLS estimate β̂1 consistently estimates β1 if X is uncorrelated with ε. 66

constant  The parameter β0 in a regression model. It is the point at which a regression line crosses the Y-axis. It is the expected value of the dependent variable when all independent variables equal 0. Also referred to as the intercept. 4

continuous variable  A variable that takes on any possible value over some range. Continuous variables are distinct from discrete variables, which can take on only a limited number of possible values. 54

control group  In an experiment, the group that does not receive the treatment of interest. 19

control variable  An independent variable included in a statistical model to control for some factor that is not the primary factor of interest. 134, 298

correlation  Measures the extent to which two variables are linearly related to each other. A correlation of 1 indicates the variables move together in a straight line. A correlation of 0 indicates the variables are not linearly related to each other. A correlation of −1 indicates the variables move in opposite directions. 9

critical value  In hypothesis testing, a value above which a β̂1 would be so unlikely that we reject the null. 101

cross-sectional data  Data having observations for multiple units for one time period. Each observation indicates the value of a variable for a given unit for the same point in time. Cross-sectional data is typically contrasted to panel and time series data. 459

cumulative distribution function  Indicates how much of a normal distribution is to the left of any given point. 418, 543

de-meaned approach  An approach to estimating fixed effects models for panel data involving subtracting average values within units from all variables. This approach saves us from having to include dummy variables for every unit and highlights the ability of fixed effects models to estimate parameters based on variation within units, not between them. 263

degrees of freedom  The sample size minus the number of parameters. It refers to the amount of information we have available to use in the estimation process. As a practical matter, degrees of freedom corrections produce more uncertainty for smaller sample sizes. The shape of a t distribution depends on the degrees of freedom. The higher the degrees of freedom, the more a t distribution looks like a normal distribution. 63, 100

dependent variable  The outcome of interest, usually denoted as Y. It is called the dependent variable because its value depends on the values of the independent variables, parameters, and error term. 2, 47

dichotomous  Divided into two parts. A dummy variable is an example of a dichotomous variable. 409
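The correlation entry maps directly onto the usual Pearson formula: the covariance of the two variables scaled by their standard deviations. As a minimal illustration (not from the book; the function name and data here are invented for this sketch), a few lines of Python reproduce the 1, −1, and in-between cases described above:

```python
from math import sqrt

def corr(x, y):
    """Pearson correlation: covariance of x and y scaled by their standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var_x = sum((xi - mx) ** 2 for xi in x)
    var_y = sum((yi - my) ** 2 for yi in y)
    return cov / sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
print(corr(x, [2, 4, 6, 8, 10]))  # exactly linear, moving together -> 1.0
print(corr(x, [9, 7, 5, 3, 1]))   # exactly linear, opposite directions -> -1.0
print(corr(x, [3, 1, 4, 1, 5]))   # weakly related -> strictly between -1 and 1
```

Because correlation measures only linear association, a strong nonlinear relationship can still produce a correlation near zero.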
difference of means test  A test that involves comparing the mean of Y for one group (e.g., the treatment group) against the mean of Y for another group (e.g., the control group). These tests can be conducted with bivariate and multivariate OLS and other statistical procedures. 180

difference-in-difference model  A model that looks at differences in changes in treated units compared to untreated units. These models are particularly useful in policy evaluation. 276

discontinuity  Occurs when the graph of a line makes a sudden jump up or down. 373

distribution  The range of possible values for a random variable and the associated relative probabilities for each value. Examples of four distributions are displayed in Figure 3.4. 54

dummy variable  A dummy variable equals either 0 or 1 for all observations. Dummy variables are sometimes referred to as dichotomous variables. 181

dyad  An entity that consists of two elements. 274

dynamic model  A time series model that includes a lagged dependent variable as an independent variable. Among other differences, the interpretation of coefficients differs in dynamic models from that in standard OLS models. Sometimes referred to as an autoregressive model. 460, 473

elasticity  The percent change in Y associated with a percent change in X. Elasticity is estimated with log-log models. 234

endogenous  An independent variable is endogenous if changes in it are related to other factors that influence the dependent variable. 8

error term  The term associated with unmeasured factors in a regression model, typically denoted as ε. 5

exclusion condition  For two-stage least squares, a condition that the instrument exert no direct effect in the second-stage equation. This condition cannot be tested empirically. 300

external validity  A research finding is externally valid when it applies beyond the context in which the analysis was conducted. 21

F distribution  A probability distribution that characterizes the distribution of a ratio of χ2 random variables. Used in tests involving multiple parameters, among other applications. 550

F statistic  The test statistic used in conducting an F test. Used in testing hypotheses about multiple coefficients, among other applications. 159

F test  A type of hypothesis test commonly used to test hypotheses involving multiple coefficients. 159

fitted value  A fitted value, Ŷi, is the value of Y predicted by our estimated equation. For a bivariate OLS model, it is Ŷi = β̂0 + β̂1 Xi. Also called predicted value. 48

fixed effect  A parameter associated with a specific unit in a panel data model. For a model Yit = β0 + β1 X1it + αi + νit, the αi parameter is the fixed effect for unit i. 261

fixed effects model  A model that controls for unit- and/or period-specific effects. These fixed effects capture differences in the dependent variable associated with each unit and/or period. Fixed effects models are used to analyze panel data and can control for both measurable and unmeasurable elements of the error term that are stable within unit. 261

fuzzy RD models  Regression discontinuity models in which the assignment variable imperfectly predicts treatment. 392

generalizable  A statistical result is generalizable if it applies to populations beyond the sample in the analysis. 21

generalized least squares (GLS)  An approach to estimating linear regression models that allows for correlation of errors. 467

goodness of fit  How well a model fits the data. 70

heteroscedastic  A random variable is heteroscedastic if the variance differs for some observations. Heteroscedasticity does not cause bias in OLS models
but does violate one of the assumptions necessary to use the standard equation for variance of OLS estimates. 68

heteroscedasticity-consistent standard errors  Standard errors for the coefficients in OLS that are appropriate even when errors are heteroscedastic. 68

homoscedastic  Describing a random variable having the same variance for all observations. Homoscedasticity is one of the assumptions necessary to use the standard equation for variance of OLS estimates. 68

hypothesis testing  A process assessing whether the observed data is or is not consistent with a claim of interest. The most widely used tools in hypothesis testing are t tests and F tests. 91

identified  A statistical model is identified on the basis of assumptions that allow us to estimate the model. 318

inclusion condition  For two-stage least squares, a condition that the instrument exert a meaningful effect in the first-stage equation in which the endogenous variable is the dependent variable. 300

independent variable  A variable that possibly influences the value of the dependent variable. It is usually denoted as X. It is called independent because its value is typically treated as independent of the value of the dependent variable. 2, 47

instrumental variable  Explains the endogenous independent variable of interest but does not directly explain the dependent variable. Two-stage least squares (2SLS) uses instrumental variables to produce unbiased estimates. 297

intention-to-treat (ITT) analysis  ITT analysis addresses potential endogeneity that arises in experiments owing to non-compliance. We compare the means of those assigned treatment and those not assigned treatment, irrespective of whether the subjects did or did not actually receive the treatment. 343

intercept  The parameter β0 in a regression model. It is the point at which a regression line crosses the Y-axis. It is the expected value of the dependent variable when all independent variables equal 0. Also referred to as the constant. 4, 47

internal validity  A research finding is internally valid when it is based on a process free from systematic error. Experimental results are often considered internally valid, but their external validity may be debatable. 21

irrelevant variable  A variable in a regression model that should not be in the model, meaning that its coefficient is zero. Including an irrelevant variable does not cause bias, but it does increase the variance of the estimates. 150

jitter  A process used in scatterplotting data. A small, random number is added to each observation for purposes of plotting only. This procedure produces cloudlike images, which overlap less than the unjittered data and therefore provide a better sense of the data. 74, 184

lagged variable  A variable with the values from the previous period. 461

latent variable  For a probit or logit model, an unobserved continuous variable reflecting the propensity of an individual observation of Yi to equal 1. 416

least squares dummy variable approach  An approach to estimating fixed effects models in the analysis of panel data. 262

likelihood ratio (LR) test  A statistical test for maximum likelihood models that is useful in testing hypotheses involving multiple coefficients. 436

linear probability model  Used when the dependent variable is dichotomous. This is an OLS model in which the coefficients are interpreted as the change in probability of observing Yi = 1 for a one-unit change in X. 410

linear-log model  A model in which the dependent variable is not logged but the independent variable is. In such a model, a one percent increase in X is associated with a β1/100 change in Y. 232

local average treatment effect  The causal effect for those people affected by the instrument only. Relevant if the effect of X on Y varies within the population. 324

log likelihood  The log of the probability of observing the Y outcomes we report, given the X data and the β̂'s. It is a by-product of the maximum likelihood estimation process. 425

log-linear model  A model in which the dependent variable is transformed by taking its natural log. A one-unit change in X in a log-linear model is associated with a β1 percent change in Y (on a 0-to-1 scale). 233

log-log model  A model in which the dependent variable and the independent variables are logged. 234
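The jitter entry describes a purely graphical transformation: a small random offset is added to each plotted point so that identical observations do not stack on top of one another, while the underlying data are left untouched. A short Python sketch (illustrative only; the function name, seed, and data are invented for this example) shows the idea:

```python
import random

def jitter(values, scale=0.1, seed=42):
    """Return a copy of values with a small uniform random offset added to
    each observation, for plotting purposes only (the data are unchanged)."""
    rng = random.Random(seed)  # fixed seed so the plot is reproducible
    return [v + rng.uniform(-scale, scale) for v in values]

# Survey-style data with heavy overlap: many identical x values.
years_of_school = [12, 12, 12, 16, 16, 16, 16, 18]
plotted = jitter(years_of_school)
print(plotted)  # each point moved by at most 0.1, so points no longer stack
```

Only the jittered copy is passed to the plotting routine; any statistics are still computed on the original, unjittered values.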
GLOSSARY 591
logit model A way to analyze data with a dichotomous dependent variable. The error term in a logit model is logistically distributed. Pronounced "low-jit". 418, 421

maximum likelihood estimation The estimation process used to generate coefficient estimates for probit and logit models, among others. 423, 549

measurement error Measurement error occurs when a variable is measured inaccurately. If the dependent variable has measurement error, OLS coefficient estimates are unbiased but less precise. If an independent variable has measurement error, OLS coefficient estimates suffer from attenuation bias, with the magnitude of the attenuation depending on how large the measurement error variance is relative to the variance of the variable. 143

mediator bias Bias that occurs when a post-treatment variable is added and absorbs some of the causal effect of the treatment variable. 237

model fishing Model fishing is a bad statistical practice that occurs when researchers add and subtract variables until they get the answers they were looking for. 243

model specification The process of specifying the equation for our model. 220

modeled randomness Variation attributable to inherent variation in the data-generation process. This source of randomness exists even when we observe data for an entire population. 54

monotonicity A condition invoked in discussions of instrumental variable models. Monotonicity requires that the effect of the instrument on the endogenous variable go in the same direction for everyone in a population. 324

multicollinearity Variables are multicollinear if they are correlated. The consequence of multicollinearity is that the variance of β̂1 will be higher than it would have been in the absence of multicollinearity. Multicollinearity does not cause bias. 148, 159

multivariate OLS OLS with multiple independent variables. 127

natural experiment Occurs when a researcher identifies a situation in which the values of the independent variable have been determined by a random, or at least exogenous, process. 334, 360

Newey-West standard errors Standard errors for the coefficients in OLS that are appropriate even when errors are autocorrelated. 467

normal distribution A bell-shaped probability density that characterizes the probability of observing outcomes for normally distributed random variables. Because of the central limit theorem, many statistical quantities are distributed normally. 55

null hypothesis A hypothesis of no effect. Statistical tests will reject or fail to reject such hypotheses. The most common null hypothesis is β1 = 0, written as H0: β1 = 0. 92

null result A finding in which the null hypothesis is not rejected. 113

observational studies Use data generated in an environment not controlled by a researcher. They are distinguished from experimental studies and are sometimes referred to as non-experimental studies. 21

omitted variable bias Bias that results from leaving out a variable that affects the dependent variable and is correlated with the independent variable. 138

one-sided alternative hypothesis An alternative to the null hypothesis that indicates whether the coefficient (or function of coefficients) is higher or lower than the value indicated in the null hypothesis. Typically written as HA: β1 > 0 or HA: β1 < 0. 94

one-way fixed effects model A panel data model that allows for fixed effects at the unit level. 271

ordinal variables Variables that express rank but not necessarily relative size. An ordinal variable, for example, is one indicating answers to a survey question that is coded 1 = strongly disagree, 2 = disagree, 3 = agree, 4 = strongly agree. 193

outliers Observations that are extremely different from those in the rest of the sample. 77

overidentification test A test used for two-stage least squares models having more than one instrument. The logic of the test is that the estimated coefficient on the endogenous variable in the second-stage equation should be roughly the same when each individual instrument is used alone. 309

p-hacking Occurs when a researcher changes the model until the p value on the coefficient of interest reaches a desired level. 243
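The attenuation bias described in the measurement error entry above can be seen in a short simulation. This is an illustrative sketch, not code from the book's Computing Corner; all variable names are invented for the example. With var(X) = var(noise) = 1, the slope is attenuated by the factor var(X)/(var(X) + var(noise)) = 0.5:

```python
import numpy as np

# Simulate attenuation bias from measurement error in an independent variable.
rng = np.random.default_rng(0)
n = 200_000
beta1 = 2.0

x = rng.normal(size=n)                      # true regressor, var(X) = 1
y = 1.0 + beta1 * x + rng.normal(size=n)    # true model

x_noisy = x + rng.normal(size=n)            # X measured with error, var(noise) = 1

def ols_slope(x, y):
    """Bivariate OLS slope: cov(X, Y) / var(X)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

b_true = ols_slope(x, y)         # close to the true value, 2.0
b_noisy = ols_slope(x_noisy, y)  # attenuated toward zero, close to 2.0 * 0.5 = 1.0
```

Note that the dependent variable is untouched: adding noise to Y instead would leave the slope estimate unbiased, consistent with the measurement error entry.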
p value The probability of observing a coefficient as extreme as we actually observed if the null hypothesis were true. 106

panel data Has observations for multiple units over time. Each observation indicates the value of a variable for a given unit at a given point in time. Panel data is typically contrasted to cross-sectional and time series data. 255

perfect multicollinearity Occurs when an independent variable is completely explained by a linear combination of the other independent variables. 149

plim A widely used abbreviation for probability limit, the value to which an estimator converges as the sample size gets very, very large. 66

point estimates Point estimates describe our best guess as to what the true value is. 117

polynomial model A model that includes values of X raised to powers greater than one. A polynomial model is an example of a non-linear model in which the effect of X on Y varies depending on the value of X. The fitted values will be defined by a curve. A quadratic model is an example of a polynomial model. 223, 226

pooled model Treats all observations as independent observations. Pooled models contrast with fixed effects models that control for unit-specific or time-specific fixed effects. 256

post-treatment variable A variable that is causally affected by an independent variable. 236

power The ability of our data to reject the null hypothesis. A high-powered statistical test will reject the null with a very high probability when the null is false; a low-powered statistical test will reject the null with a low probability when the null is false. 111

power curve Characterizes the probability of rejecting the null hypothesis for each possible value of the parameter. 111

predicted value The value of Y predicted by our estimated equation. For a bivariate OLS model, it is Ŷi = β̂0 + β̂1Xi. Also called fitted values. 48

probability density A graph or formula that describes the relative probability that a random variable is near a specified value. 55

probability density function A mathematical function that describes the relative probability for a continuous random variable to take on a given value. 541

probability distribution A graph or formula that gives the probability across the possible values of a random variable. 54

probability limit The value to which a distribution converges as the sample size gets very large. When the error is uncorrelated with the independent variables, the probability limit of β̂1 is β1 for OLS models. The probability limit of a consistent estimator is the true value of the parameter. 65, 145, 311

probit model A way to analyze data with a dichotomous dependent variable. The key assumption is that the error term is normally distributed. 418

quadratic model A model that includes X and X² as independent variables. The fitted values will be defined by a curve. A quadratic model is an example of a polynomial model. 223, 227

quasi-instrument An instrumental variable that is not strictly exogenous. Two-stage least squares with a quasi-instrument may produce a better estimate than OLS if the correlation of the quasi-instrument and the error in the main equation is small relative to the correlation of the quasi-instrument and the endogenous variable. 311

random effects model Treats unit-specific error as a random variable that is uncorrelated with the independent variable. 524

random variable A variable that takes on values in a range and with the probabilities defined by a distribution. 54

randomization The process of determining the experimental value of the key independent variable based on a random process. If successful, randomization will produce an independent variable that is uncorrelated with all other potential independent variables, including factors in the error term. 19

randomized controlled trial An experiment in which the treatment of interest is randomized. 19

reduced form equation In a reduced form equation, Y1 is only a function of the non-endogenous variables (which are the X and Z variables, not the Y variables). Used in simultaneous equation models. 317

reference category When a model includes dummy variables indicating the multiple categories of a nominal variable, we need to exclude a dummy variable for one of the groups, which we refer to as the reference category. The coefficients on all the included dummy variables indicate how much higher or lower the dependent variable is for each group relative to the reference category. Also referred to as the excluded category. 194

regression discontinuity (RD) analysis Techniques that use regression analysis to identify possible discontinuities at the point at which some treatment applies. 374

regression line The fitted line from a regression. 48

replication Research that meets a replication standard can be duplicated based on the information provided at the time of publication. 28

replication files Files that document how data is gathered and organized. When properly compiled, these files allow others to reproduce our results exactly. 28

residual The difference between the fitted value and the observed value. Graphically, it is the distance between an estimated line and an observation. Mathematically, a residual for a bivariate OLS model is ε̂i = Yi − β̂0 − β̂1Xi. An equivalent way to calculate a residual is ε̂i = Yi − Ŷi. 48

restricted model The model in an F test that imposes the restriction that the null hypothesis is true. If the fit of the restricted model is much worse than the fit of the unrestricted model, we infer that the null hypothesis is not true. 159

robust Statistical results are robust if they do not change when the model changes. 30, 130, 244, 534

rolling cross section data Repeated cross sections of data from different individuals at different points in time (e.g., an annual survey of U.S. citizens in which different citizens are chosen each year). 279

sampling randomness Variation in estimates that is seen in a subset of an entire population. If a given sample had a different selection of people, we would observe a different estimated coefficient. 53, 551

scatterplot A plot of data in which each observation is located at the coordinates defined by the independent and dependent variables. 3

selection model Simultaneously accounts for whether we observe the dependent variable and what the dependent variable is. Often used to deal with attrition problems in experiments. The most famous selection model is the Heckman selection model. 356

significance level For each hypothesis test, we set a significance level that determines how unlikely a result has to be under the null hypothesis for us to reject the null hypothesis. The significance level is the probability of committing a Type I error for a hypothesis test. 95

simultaneous equation model A model in which two variables simultaneously cause each other. 315

slope coefficient The coefficient on an independent variable. It reflects how much the dependent variable increases when the independent variable increases by one. In a plot of fitted values, the slope coefficient characterizes the slope of the fitted line. 4

spurious regression A regression that wrongly suggests X has an effect on Y. Can be caused by, for example, omitted variable bias and nonstationary data. 477

stable unit treatment value assumption The condition that an instrument has no spillover effect. This condition rules out the possibility that the value of an instrument going up by one unit will cause a neighbor to become more likely to change X as well. 324

standard deviation The standard deviation describes the spread of the data. For large samples, it is calculated as √(Σ(Xi − X̄)²/N). For probability distributions, the standard deviation refers to the width of the distribution. For example, we often refer to the standard deviation of the distribution as σ; it is the square root of the variance (which is σ²). To convert a normally distributed random variable into a standard normal variable, we subtract the mean and divide by the standard deviation of the distribution of the random variable. 26

standard error The square root of the variance. Commonly used to refer to the precision of a parameter estimate. The standard error of β̂1 from a bivariate OLS model is the square root of the variance of the estimate. It is √(σ̂²/(N × var(X))). The difference between standard errors and standard deviations can sometimes be confusing. The standard error of a parameter estimate is the standard deviation of the sampling distribution of the parameter estimate. For example, the standard deviation of the distribution of β̂1 is estimated by the standard error of β̂1. A good rule of thumb is to associate standard errors with parameter estimates and standard deviations with the spread of a variable or distribution, which may or may not be a distribution associated with a parameter estimate. 61

standard error of the regression A measure of how well the model fits the data. It is the square root of the variance of the regression. 71

standard normal distribution A normal distribution with a mean of zero and a variance (and standard deviation) of one. 543

standardize Standardizing a variable converts it to a measure of standard deviations from its mean. This is done by subtracting the mean of the variable from each observation and dividing the result by the standard deviation of the variable. 156

standardized coefficient The coefficient on an independent variable that has been standardized according to X1^Standardized = (X1 − X̄1)/sd(X1). A one-unit change in a standardized variable is a one-standard-deviation change no matter what the unit of X is (e.g., inches, dollars, years). Therefore, effects across variables can be compared because each β̂ represents the effect of a one-standard-deviation change in X on Y. 157

stationarity A time series term indicating that a variable has the same distribution throughout the entire time series. Statistical analysis of nonstationary variables can yield spurious regression results. 476

statistically significant A coefficient is statistically significant when we reject the null hypothesis that it is zero. In this case, the observed value of the coefficient is a sufficient number of standard deviations from the value posited in the null hypothesis to allow us to reject the null. 93

substantive significance If a reasonable change in the independent variable is associated with a meaningful change in the dependent variable, the effect is substantively significant. Some statistically significant effects are not substantively significant, especially for large data sets. 116

t distribution A distribution that looks like a normal distribution, but with fatter tails. The exact shape of the distribution depends on the degrees of freedom. This distribution converges to a normal distribution for large sample sizes. 99, 549

t statistic The test statistic used in a t test. It is equal to (β̂1 − βNull)/se(β̂1). If the t statistic is greater than our critical value, we reject the null hypothesis. 104

t test A test for hypotheses about a normal random variable with an estimated standard error. We compare |β̂1/se(β̂1)| to a critical value from a t distribution determined by the chosen significance level (α). For large sample sizes, a t test is closely approximated by a z test. 98

time series data Consists of observations for a single unit over time. Each observation indicates the value of a variable at a given point in time. The data proceed in order, indicating, for example, annual, monthly, or daily data. Time series data is typically contrasted to cross-sectional and panel data. 459

treatment group In an experiment, the group that receives the treatment of interest. 19

trimmed data set A set for which observations are removed in a way that offsets potential bias due to attrition. 355

two-sided alternative hypothesis An alternative to the null hypothesis that indicates the coefficient is not equal to 0 (or some other specified value). Typically written as HA: β1 ≠ 0. 94

two-stage least squares Uses exogenous variation in X to estimate the effect of X on Y. In the first stage, we estimate a model in which the endogenous independent variable is the dependent variable and the instrument, Z, is an independent variable. In the second stage, we estimate a model in which we use the fitted values from the first stage, X̂1i, as an independent variable. 295

two-way fixed effects model A panel data model that allows for fixed effects at the unit and time levels. 271

Type I error A hypothesis testing error that occurs when we reject a null hypothesis that is in fact true. 93

Type II error A hypothesis testing error that occurs when we fail to reject a null hypothesis that is in fact false. 93

unbiased estimator An estimator that produces estimates that are on average equal to the true value of the parameter of interest. 58

unit root A variable with a unit root has a coefficient equal to 1 on the lagged variable in an autoregressive model. A variable with a unit root is nonstationary and must be modeled differently than a stationary variable. 477

unrestricted model The model in an F test that imposes no restrictions on the coefficients. If the fit of the restricted model is much worse than the fit of the unrestricted model, we infer that the null hypothesis is not true. 159

variance A measure of how much a random variable varies. In graphical terms, the variance of a random variable characterizes how wide the distribution is. 61

variance inflation factor A measure of how much variance is inflated owing to multicollinearity. It can be estimated for each variable and is equal to 1/(1 − R²j), where R²j is from an auxiliary regression in which Xj is the dependent variable and all other independent variables from the main equation are included as independent variables. 148

variance of the regression The variance of the regression measures how well the model explains variation in the dependent variable. For large samples, it is estimated as σ̂² = Σ(Yi − Ŷi)²/N. 63

weak instrument An instrumental variable that adds little explanatory power to the first-stage regression in a two-stage least squares analysis. 312

window The range of observations we analyze in a regression discontinuity analysis. The smaller the window, the less we need to worry about non-linear functional forms. 386

z test A hypothesis test involving comparison of a test statistic and a critical value based on a normal distribution. 423
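The bivariate OLS formulas collected in the glossary (the slope, the variance of the regression, the standard error of β̂1, and the t statistic) fit together in a few lines. This is an illustrative sketch on simulated data, not code from the book; all names are invented for the example:

```python
import numpy as np

# Bivariate OLS by hand, following the glossary formulas.
rng = np.random.default_rng(1)
n = 1_000
x = rng.normal(size=n)
y = 0.5 + 1.5 * x + rng.normal(size=n)   # true slope is 1.5

beta1_hat = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # OLS slope
beta0_hat = y.mean() - beta1_hat * x.mean()          # OLS intercept

resid = y - beta0_hat - beta1_hat * x                # residuals: Yi - Yhat_i
sigma2_hat = (resid**2).sum() / n                    # variance of the regression
se_beta1 = np.sqrt(sigma2_hat / (n * np.var(x)))     # standard error of beta1
t_stat = beta1_hat / se_beta1                        # t statistic for H0: beta1 = 0
```

With a true slope of 1.5 and N = 1,000, the t statistic is far above any conventional critical value, so the null hypothesis β1 = 0 is rejected.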
INDEX
Entries with page numbers followed by t will be found in Tables, by an f in Figures and by an n in footnotes.
  panel data and, 521
  robustness and, 520
auxiliary regression, 138, 173n14
  for autocorrelation, 464–66
  independent variable and, 465–66n3
  for institutions and human rights, 154
averages
  central limit theorem and, 56
  de-meaned approach and, 263
  of dependent variables, 261
  of independent variables, 338
  of random variables, 56
  standard deviation and, 26n2
  for treatment group, 182
Baicker, Katherine, 119
Baiocchi, Michael, 305, 324
Baker, Regina, 302, 311n5
balance
  2SLS for, 366
  bivariate OLS and, 337
  checking for, 336–37
  for congressional members and donors, 454, 455t
  in control group, 335–40
  control variables and, 337–38
  in education and wages, 359, 360t
  foreign aid for poverty and, 338–40, 339t
  ITT for, 365, 366
  multivariate OLS and, 337
  in randomized experiments, 335–40
  R for, 366
  Stata for, 365–66
  in treatment group, 335–40
Bayesian Analysis for the Social Sciences (Jackman), 120
Beck, Nathaniel, 81, 523, 526
Berk, Richard, 354
Bertrand, Marianne, 368
bias. See also attenuation bias; omitted variable bias; unbiased estimator; unbiasedness
  2SLS and, 312
  attrition and, 355
  autocorrelation and, 69, 459, 464, 476
  in bivariate OLS, 58–61
  characterization of, 60–61
  collider, 238–43, 510–13
  from fixed effects, 268n6
  mediator, 237
  modeled randomness and, 59
  in multivariate OLS, 167
  random effects model and, 524
  sampling randomness and, 59
  weak instruments and, 313
binned graphs, RD and, 386–91, 388f, 393n1
bivariate OLS, 45–90
  balance and, 337
  bias in, 58–61
  causality and, 50–51n4
  central limit theorem for, 56–57
  coefficient estimates in, 46–50, 48n3, 53–59, 76–77, 97
  consistency in, 66–67, 66f, 66n16
  correlated errors in, 68
  d.f. in, 63, 63n13
  for difference of means test, 180–90
  distributions of, 54–56, 55f
  dummy independent variables in, 180–90, 182f
  equation for, 57n8
  exogeneity in, 57–61
  goodness of fit in, 70–77
  for height and wages, 74–77, 75f, 132, 132t, 133f
  homoscedasticity in, 68, 74, 75t, 80
  hypothesis testing and, 92
  normal distribution in, 55, 55f
  null hypothesis and, 97
  observational data for, 78, 127, 131, 198
  outliers in, 77–80
  plim in, 65, 65f
  precision in, 61–64
  for presidential elections, 46f, 50–51, 51f, 51t, 94–95, 95t, 96f
  probability density in, 55–56, 55f, 58f
  randomness of, 53–57
  random variables in, 53–57
  regression coefficient and, 50–51n4
  for retail sales and temperature, 130, 130t
  sample size and, 80
  sampling randomness in, 53
  standard error in, 61–63, 74–75
  standard error of the regression in, 71
  Stata for, 81–84
  t test for, 97–106
  unbiased estimator in, 58–60, 58f
  unbiasedness in, 57–61
  variance in, 50–51n4, 61–63, 62f, 63n14, 67
  variance of the regression in, 63
  for violent crime, 77–80, 77f, 78t, 79f
  for violent crime and ice cream, 60
Blackwell, Matthew, 238, 246
blocking, in randomized experiments, 335
Bloom, Howard, 398
Bound, John, 302, 311n5
Box, George, 534
Box-Cox tests, 245
Box-Steffensmeier, Janet, 444
Bradford-Hill, Austin, 537
Brambor, Thomas, 212
Braumoeller, Bear, 212
Broockman, David, 454
Brownlee, Shannon, 14–15, 21
Buddlemeyer, Hielke, 398
Bush, George W., 449–50
Butler, Daniel, 283
campaign contributions, for President Obama, 333
Campbell, Alec, 354
car accidents and hospitalization, 238–40
Card, David, 245, 374
Carpenter, Daniel, 398
Carrell, Scott, 374
categorical variables, 194n5
  to dummy independent variables, 193–202
  in R, 213–14
  regional wage differences and, 194–96, 195t, 197t
  in regression models, 193–94
  in Stata, 213
causality, 1–23
  bivariate OLS and, 50–51n4
  core model for, 2–7, 7f
  correlation and, 2, 2f
  with country music and suicide, 15–17
  data and, 1
  dependent variable and, 2–3, 12f
  donuts and weight and, 3–9, 3t
  endogeneity and, 7–18
  independent variable and, 2–3, 12f
  indicators of, 535–36
  observational data and, 25n1
  randomized experiments and, 18–22
  randomness and, 7–18
CDF. See cumulative distribution function
central limit theorem, for bivariate OLS, 56–57
ceteris paribus, 131
Chandra, Amitabh, 119
Chen, Xiao, 34
Cheng, Jing, 324
χ² (chi-squared) distribution. See χ² distribution
Ching, Andrew, 431–35
civil war. See economic growth and civil war
Clark, William, 212
Clarke, Kevin, 514
Cochrane-Orcutt model. See ρ-transformed model
codebooks
  for data, 29, 29t
  for height and wages, 29, 29t
coefficient estimates
  assignment variables and, 343
  attenuation bias and, 144
  bias in, 58
  in bivariate OLS, 46–50, 48n3, 53–59, 76–77
  exogeneity of, 57–59
  for logit model, 426–29, 434
  in multivariate OLS, 128, 133, 144, 146–47
  in OLS, 493–98
  outliers and, 79
  overidentification test and, 310
  for probit model, 426–29, 427f, 434
  random effects model and, 524
  random variables in, 53–57
  in simultaneous equation models, 318–19
  unbiasedness of, 57–59
  variance of, 146–47, 313–14
coefficients
  comparing, 155
  standardized, 155–58
cointegration, 487
collider bias, 238–43, 510–13
Columbia University National Center for Addiction and Substance Abuse, 136
commandname, in Stata, 34–35
comment lines
  in R, 37
  in Stata, 35
compliance. See also non-compliance
  in randomized experiments, 340–54
  in treatment group, 342, 348
confidence intervals
  autocorrelation and, 460
  equations for, 118–19, 119t
  in hypothesis testing, 117–19, 118f
  for interaction variables, 205
  for multivariate OLS, 133
  probability density and, 117, 118f
  sampling randomness and, 118n9
confint, in R, 122
congressional elections, RD for, 402–4, 403t
congressional members and donors
  balance for, 454, 455t
  LPM for, 454–55, 455t
  probit model for, 454–55, 455t
consistency
  in bivariate OLS, 66–67, 66f, 66n16
  causality and, 535–36
constant (intercept)
  in bivariate OLS, 47, 53
  fixed effects model and, 262
  in regression model, 4, 5f
continuous variables
  in bivariate OLS, 54
  dummy independent variables and, 191t, 203
  for trade and alliances, 274–75
control group
  attrition in, 354–55
  balance in, 335–40
  blocking for, 335–36
  ITT and, 343
  multivariate OLS and, 134, 134n1
  placebo to, 334n1
  in randomized experiments, 19
  treatment group and, 134, 134n1, 180, 334
  variables in, 337
control variables
  for 2SLS, 300
  balance and, 337–38
  multivariate OLS and, 134, 134n1
economic growth and civil war
  instrumental variable for, 327–29, 327t
  LPM for, 441–43, 442f
  probit model for, 441–43, 441f, 442f
economic growth and democracy, instrumental variables for, 331–32, 332t
economic growth and education, multivariate OLS for, 140–43, 141t, 142f
economic growth and elections, 45
economic growth and government debt, 24–26, 25f, 25n1
education. See also alcohol consumption and grades; crime and education; economic growth and education; law school admission
  in Afghanistan, 370–72, 371t
  vouchers for, non-compliance with, 341, 342n4
education and wages, 9, 359, 360t
  2SLS for, 301–3
Einav, Liran, 358
elasticity, 234
elections. See also presidential elections
  congressional elections, RD for, 402–4, 403t
  economic growth and, 45
  get-out-the-vote efforts, non-compliance for, 346–48, 347n8, 347t, 348t, 366–67, 367t
Ender, Philip, 34
endogeneity, 11
  attrition and, 354
  causality and, 7–18
  correlation and, 10
  for country music and suicide, 16–17
  for crime and police, 299
  data and, 24
  dependent variable and, 10
  in difference-in-difference models, 276–83
  in domestic violence in Minneapolis, 350–51
  fixed effects models and, 255–94
  flu shots and health and, 13–15, 14f
  Hausman test for, 301n2
  hypothesis testing and, 115
  independent variable and, 10
  instrumental variables and, 295–332
  multivariate OLS and, 129–37, 166
  non-compliance and, 340–41
  observational data and, 21, 127
  omitted variable bias and, 139
  overidentification test and, 310
  in panel data, 255–94
  pooled model and, 256–57
  RD and, 373–405
  simultaneous equation models and, 315–23
  unmeasured factors, 198
  for violent crime, 32
energy efficiency, dummy independent variables for, 207–10, 208f, 209t, 211f
Epple, Dennis, 320, 321
equations
  for 2SLS, 298, 299
  for AR(1) model, 463
  for attrition, 355
  for baseball players' salaries, 155
  for bivariate OLS, 50, 57n8
  for confidence interval, 118–19, 119t
  for core model, 5
  for country music and suicide, 15
  for de-meaned approach, 263, 264n3
  for difference-in-difference models, 277
  for difference of means test, 334, 336
  for fixed effect model, 261
  for flu shots and health, 13
  for F test, 166
  for heteroscedasticity-consistent standard errors, 68n18
  for independent and dependent variable relationship, 4
  for logit model, 421, 421n6
  for LR test, 436–37
  for multicollinearity, 147
  for omitted variable bias, 138, 502–4
  for polynomial models, 224–25, 225n4
  for power, 113n7
  for probit model, 420
  for p value, 108n5
  for quasi-instrumental variables, 310, 311n5
  for standard deviation, 26n3
  for simultaneous equation model, 316
  for two-way fixed effects model, 271
  for variance, 313–14
  for variance of standard error, 499–501
  for ρ-transformed model, 468–69
Erdem, Tülin, 431–35
errors. See also correlated errors; measurement error; standard error; Type I errors; Type II errors
  autocorrelated, 461–62
  autoregressive, 460–62, 461n2
  heteroscedasticity-consistent standard errors, 68–70, 68n18
  lagged, 461, 466, 466t
  MSE, 71
  random, 6, 417
  root mean squared error, in Stata, 71
  spherical, 81
errors (continued)
  standard error of the regression, 71, 83
error term
  for 2SLS, 299
  autocorrelation and, 460–62
  autoregressive error and, 460–62
  in bivariate OLS, 46, 47, 59–60, 198
  for country music and suicide, 16
  dependent variable and, 12f
  for donuts and weight, 9
  endogeneity and, 8, 198
  fixed effects models and, 262
  for flu shots and health, 13–14
  homoscedasticity of, 68
  independent variable and, 10, 12f, 16, 46, 59, 334, 465–66n3
  ITT and, 343
  in multivariate OLS, 137–39
  normal distribution of, 56n6
  observational data and, 198, 323
  in OLS, 525
  omitted variable bias and, 503
  quasi-instruments and, 310–13
  random effects model and, 524
  randomized experiments and, 337
  RD and, 377–79
  in regression model, 5–6
  for test scores, 260
  ρ-transformed model and, 469
EViews, 34
Excel, 34
excluded category, 194
exclusion condition
  for 2SLS, 300–301, 302f
  observational data and, 303
exogeneity, 9
  in bivariate OLS, 46, 57–61, 67
  of coefficient estimates, 57–59
  consistency and, 67
  correlation and, 10
  correlation errors and, 68–70
  for crime and police, 296–98
  independent variable and, 10
  in natural experiments, 362
  observational data and, 21, 182
  quasi-instrumental variables and, 310–12
  randomized experiments for, 18–19, 334
expected value, of random variables, 496–97
experiments. See randomized experiments
external validity, of randomized experiments, 21
Facebook, 333
false-negative results, 501
Fearon, James, 440–41
Feinstein, Brian, 398
Finkelstein, Amy, 358
fish market, instrumental variables for, 329–30, 329t
fitted lines
  independent variables and, 449
  latent variables and, 416–17
  logit model and, 434–35, 435f, 449
  for LPM, 411–13, 412f, 415f, 434–35, 435f
  probit model and, 423–25, 424f, 434–35, 435f, 449
  for RD, 385f, 387f
  for violent crime, 79f
fitted values
  for 2SLS, 299, 314, 348
  based on regression line, 50–51, 52f
  in bivariate OLS, 47, 53
  for difference-in-difference models, 278
  from logit model, 425
  for LPM, 412, 412n3
  for Manchester City soccer, 192f
  observations and, 428–29
  for presidential elections, 50, 52f
  from probit model, 423–25, 424f
  variance of, 314
fixed effects, 261, 268
  alternative hypothesis and, 268n5
  AR(1) model and, 521
  autocorrelation and, 519–20
  bias from, 268n6
  lagged dependent variables and, 520–23
  random effects model and, 524–25
fixed effects models
  constant and, 262
  for crime and police, 256–61, 297
  for difference-in-difference models, 255–83
  dyads and, 274–76, 275t
  endogeneity and, 255–94
  error term and, 262
  independent variable and, 268
  for instructor evaluation, 289–90, 290t
  LSDV and, 262–63, 263t
  multivariate OLS and, 262
  for panel data, 255–94
  for Peace Corps, 288–89, 289t
  for presidential elections, 288, 288t
  R for, 528–30
  Stata for, 285, 527–28
  for Texas school boards, 291–93, 292t
  for trade and alliances, 274–76, 275t
  two-way, 271–75
  for Winter Olympics, 530–32, 530t
flu shots and health, 21n9
  correlation with, 14–15
  endogeneity and, 13–15, 14f
foreign aid for poverty, balance and, 338–40, 339t
Franceze, Robert, 212
Freakonomics (Levitt), 296
frequency table
  for donuts and weight, 26–27, 26t, 27t
  in R, 38
F statistic, 159n10, 165n13
  defined, 159
  multiple instruments and, 312
F tests, 159–66
  and baseball salaries, 162–64
  defined, 159
  for multiple coefficients, 162, 436
  with multiple instruments, 309
  for null hypothesis, 162, 309
  OLS and, 436
  restricted model for, 160–62, 165t
  in Stata, 170
  t statistic and, 312n6
  unrestricted model for, 160–62, 165t
  using R2 values, 160–62
fuzzy RD models, 392
Galton, Francis, 45n2
Gaubatz, Kurt Taylor, 34
Gayer, Ted, 389, 400
GDP per capita. See life expectancy and GDP per capita
gender and wages
  assessing bias in, 242
  interaction variables for, 203–4, 204f
generalizability
  in randomized experiments, 21
  of RD, 394
generalized least squares, 467–68
generalized linear model (glm), in R, 447
Gerber, Alan, 347, 365, 366
Gertler, Paul, 339
get-out-the-vote efforts, non-compliance for, 346–48, 347n8, 347t, 348t, 366–67, 367t
glm. See generalized linear model
global education, 177–78, 177t
global warming, 227–30, 228f, 229t
  AR(1) model for, 471–73, 472f, 473t
  autocorrelation for, 471–73, 472f, 473t
  Dickey-Fuller test for, 483–84, 483t
  dynamic model for, 482–85, 483f, 485t
  LPM for, 450–53, 451t, 452f
  time series data for, 459
GLS. See generalized least squares
Goldberger, Arthur, 168
Golder, Matt, 212
gold standard, randomized experiments as, 18–22
Goldwater, Barry, 50
goodness of fit
  for 2SLS, 314
  in bivariate OLS, 70–77
  for MLE, 425
  in multivariate OLS, 149–50
  scatterplots for, 71–72, 72f, 74
  standard error of the regression and, 71
Gore, Al, 50
Gormley, William, Jr., 389, 400
Gosset, William Sealy, 99n1
governmental debt. See economic growth and government debt
Graddy, Kathryn, 330
grades. See alcohol consumption and grades
Green, Donald P., 274, 283, 325, 347, 365, 366
Greene, William, 487, 514
Grimmer, Justin, 398
Gundlach, Jim, 15
Hanmer, Michael, 443
Hanushek, Eric, 140, 141, 177
Harvey, Anna, 152–53
Hausman test, 268n6, 301n2
  random effects model and, 525
HDD. See heating degree-days
Head Start, RD for, 401–2, 404–5, 404t
health. See donuts and weight; flu shots and health
health and Medicare, 374, 375–76
health insurance, attrition and, 357–59, 358n11
heating degree-days (HDD), dummy independent variables for, 207–10, 208f, 209t, 211f
Heckman, James, 356
height and gender, difference of means test for, 187–90, 188f, 188t, 189f, 190t
height and wages
  bivariate OLS for, 74–77, 75f, 132, 132t, 133f
  codebooks for, 29, 29t
  and comparing effects of height measures, 164–66
  heteroscedasticity for, 75t
  homoscedasticity for, 75t
  hypothesis testing for, 123–24, 123t, 126
  logged variables for, 234–36, 235t
  multivariate OLS for, 131–34, 132t, 133f
  null hypothesis for, 92
  p value for, 107f
  scatterplot for, 75f
  t statistic for, 104–5, 104t
  two-sided alternative hypothesis for, 94
  variables for, 40, 40t
Herndon, Thomas, 24
Hersh, Eitan, 398
heteroscedasticity
  bivariate OLS and, 68, 75t, 80
  for height and wages, 75t
  LPM and, 414n4
  R and, 86
  weighted least squares and, 81
heteroscedasticity-consistent standard errors, 68–70, 68n18
604 INDEX
high-security prison and inmate aggression, 374
histograms
   for alcohol consumption and grades, 396f
   for RD, 393, 393f, 396f
Hoekstra, Mark, 374
homicide. See stand your ground laws and homicide
homoscedasticity
   in bivariate OLS, 68, 74, 75t, 80
   for height and wages, 74, 75t
hospitalization, car accidents and, 238–40
Howell, William, 341
Huber-White standard errors. See heteroscedasticity-consistent standard errors
human rights. See institutions and human rights
hypothesis testing, 91–126. See also alternative hypothesis; null hypothesis
   alternative hypothesis and, 94, 97, 105
   bivariate OLS and, 92
   confidence intervals in, 117–19, 118f
   critical value in, 101–4
   Dickey-Fuller test for, 480–81
   for dummy dependent variables, 434–43
   endogeneity and, 115
   for height and wages, 123–24, 123t, 126
   log likelihood for, 425, 436
   LR test for, 434–40
   MLE and, 423
   for multiple coefficients, 158–64, 171–72, 434–43
   power and, 109–11
   for presidential elections, 124–26
   p value and, 106–9, 107f
   R for, 122–23
   significance level and, 95–96, 105
   Stata for, 121–22
   statistically significant in, 93, 120
   substantive significance and, 115
   t test for, 97–106
   Type I errors and, 93, 93t
   Type II errors and, 93t

ice cream, violent crime and, 60
identification, simultaneous equation model and, 318
Imai, Kosuke, 365
Imbens, Guido, 325, 330, 398
inclusion condition, for 2SLS, 300, 302f
independent variables. See also dummy independent variables
   attenuation bias and, 144
   auxiliary regression and, 465–66n3
   averages of, 338
   in bivariate OLS, 46, 47, 59, 65f, 66n16
   causality and, 2–3, 12f
   consistency and, 66n16
   constant and, 4
   for country music and suicide, 16
   defined, 3
   as dichotomous variables, 181
   as dummy independent variables, 179–219
   dynamic models and, 476
   endogeneity and, 8, 10
   error term and, 10, 12f, 16, 46, 59, 334, 465–66n3
   exogeneity and, 9, 10
   fitted lines and, 449
   fixed effects methods and, 268
   for flu shots and health, 13
   instrumental variables and, 295–308
   logit model and, 430
   LPM and, 414
   measurement error in, 144–45
   multicollinearity and, 148
   multivariate OLS and, 127–28, 134, 144–45
   observed-value, discrete differences approach and, 429
   omitted variable bias and, 503, 508–10
   probability limits and, 65f
   probit model and, 430
   randomization of, 19, 334
   slope coefficient on, 4
   substantive significance and, 115
   for test scores, 260
   for trade and alliances, 274
   ρ-transformed model and, 469
inheritance tax, public policy and, 197–202
inmate aggression. See high-security prison and inmate aggression
institutions and human rights, multivariate OLS for, 152–55, 153t
instructor evaluation, fixed effects model for, 289–90, 290t
instrumental variables
   2SLS and, 295–308, 313
   for chicken market, 319–23
   for crime and education, 330–31, 331t
   for economic growth and civil war, 327–29, 327t
   for economic growth and democracy, 331–32, 332t
   endogeneity and, 295–332
   for fish market, 329–30, 329t
   for Medicaid enrollment, 295
   multiple instruments for, 309–10
   simultaneous equation models and, 315–23
   for television and public affairs, 328–29, 328t
   weak instruments for, 310–13
   for instrumental variables, 309–10
multiple variables
   difference of means tests and, 182
   in multivariate OLS, 128, 135, 167
   omitted variable bias with, 507–8
multivariate OLS, 127–77
   attenuation bias and, 144
   balance and, 337
   bias in, 167
   coefficient estimates in, 128, 133, 144, 146–47
   confidence interval for, 133
   control group and, 134, 134n1
   control variables in, 134, 134n1
   dependent variable and, 143–44
   dummy independent variables in, 190–93
   for economic growth and education, 140–43, 141t, 142f
   endogeneity and, 129–37, 166
   error term in, 137–39
   estimation process for, 134–36
   fixed effects models and, 262
   goodness of fit in, 149–50
   for height and wages, 131–34, 132t, 133f
   independent variables and, 127–28, 134, 144–45
   for institutions and human rights, 152–55, 153t
   irrelevant variables in, 150
   for judicial independence, 152–55, 153t
   measurement error in, 143–45
   multicollinearity in, 147–49, 154, 167
   multiple variables in, 128, 135, 167
   observational data for, 166
   omitted variable bias in, 137–39, 144, 154, 167
   precision in, 146–50
   R2 and, 149
   for retail sales and temperature, 127, 128f, 129–31, 129f, 130t
   R for, 170–71
   standard errors in, 133
   Stata for, 168
   variance in, 146–47
   for wealth and universal male suffrage, 200–201, 201t, 202f
Murnane, Richard, 300n1
Murray, Michael, 81, 324

_n, in Stata, 89n29
National Center for Addiction and Substance Abuse (Columbia University), 136
National Longitudinal Survey of Youth (NLSY), 40–41, 123
natural experiments, on crime and terror alerts, 360–62, 363t
natural logs, 230, 234n6
negative autocorrelation, 462, 462f
negative correlation, 9–10, 10f
neonatal intensive care unit (NICU), 2SLS for, 305–8, 306t, 307t
Nevin, Rick, 537
Newey, Whitney, 365
Newey-West standard errors, 467, 470, 489–90
NFL coaches, probit model for, 452t, 453–54
NICU. See neonatal intensive care unit
NLSY. See National Longitudinal Survey of Youth
nominal variables, 193
non-compliance
   2SLS for, 346–56
   for domestic violence in Minneapolis, 350–54, 353t, 354t
   with educational vouchers, 341, 342n4
   endogeneity and, 340–41
   for get-out-the-vote efforts, 346–48, 347n8, 347t, 348t, 366–67, 367t
   ITT and, 343–45
   schematic representation of, 341–43, 342f
   variables for, 348–49
non-linear models
   latent variables and, 416–17
   linear models and, 410n2
   OLS and, 220–21
normal distributions
   in bivariate OLS, 55, 55f
   CDF and, 418–21, 420f
   of error term, 56n6
   probit model and, 418, 419f
   t distribution and, 100, 100f
null hypothesis, 92–126
   alternative hypothesis and, 94, 97, 105
   augmented Dickey-Fuller test and, 481
   autocorrelation and, 460
   bivariate OLS coefficient estimates and, 97
   Dickey-Fuller test and, 480–81
   distributions for, 94, 96f
   F test for, 159–60, 309
   for height and athletics, 164–66
   log likelihood and, 436
   power and, 109–11, 336–37, 502
   for presidential elections, 94–95, 95t, 96f
   p value and, 106–9, 107f
   significance level and, 95–96, 105
   statistically significant and, 93
   t test for, 105
   Type I errors and, 93, 95, 97
   Type II errors and, 93, 95, 97
   types of, 105
null result, power and, 113
Obama, President Barack
   campaign contributions for, 333
ObamaCare, 19
   simultaneous equation models for, 316
observational data
   for 2SLS, 323, 346, 349, 350
   for bivariate OLS, 78, 127, 131, 198
   causality and, 25n1
   for crime and terror alerts, 362
   difference of means test for, 182
   dummy independent variables and, 182
   for education and wages, 301
   endogeneity and, 21, 127
   error term and, 198, 323
   exclusion condition and, 303
   exogeneity and, 21, 182
   and fitted values, 428–29
   latent variables and, 414–17
   messiness of, 24
   for multivariate OLS, 166
   in natural experiments, 362
   for NICU, 305
   RD and, 375
observed-value, discrete differences approach
   dummy dependent variables and, 429, 443
   independent variable and, 429
   for probit model, 430–31
   Stata for, 444–47
OLS. See ordinary least squares
omitted variable bias
   anticipating sign of, 505–6, 506t
   for institutions and human rights, 154
   from measurement error, 508–10
   with multiple variables, 507–8
   in multivariate OLS, 137–39, 144, 154, 167
   in OLS, 502–14
one-sided alternative hypothesis, 94
   critical value and, 101–3, 102f
one-way fixed effect models, 271
orcutt, 490
ordinal variables, 193, 194n5
ordinary least squares (OLS). See also bivariate OLS; multivariate OLS
   2SLS and, 298, 301n2
   advanced, 493–512
   autocorrelation and, 460, 464, 466, 466t, 519
   autocorrelation for, 459
   balance and, 336
   coefficient estimates in, 493–98
   for crime and police, 256–61, 257t, 258f, 259f
   for dichotomous variables, 409
   for difference-in-difference models, 277–79, 278f
   difference of means test and, 334, 336
   for domestic violence in Minneapolis, 352–53, 353n10
   dynamic models and, 474–75
   error term in, 525
   F test and, 436
   Hausman test for, 301n2
   lagged dependent variables in, 519–24
   logged variables in, 230–36
   LPM and, 410, 414
   LSDV and, 262–63, 263t
   MLE and, 423
   model specification and, 220
   for multiple coefficients, 436
   omitted variable bias in, 502–14
   for panel data, 284
   polynomial models and, 224
   probit model and, 418
   quadratic models and, 226
   quantifying relationships between variables with, 46
   quasi-instruments and, 311
   R for, 515
   se for, 499–501
   Stata for, 170–72, 514
   for television and public affairs, 368
   unbiased estimator and, 493–98
   variance for, 314, 499–501
   for Winter Olympics, 515–16
Orwell, George, 533
outcome variables
   for Medicaid, 295
   RD and, 384
outliers
   in bivariate OLS, 77–80
   coefficient estimates and, 80
   sample size and, 80
   scatterplots for, 80
overidentification test, 2SLS and, 309–10

panel data
   advanced, 518–32
   AR(1) model and, 521
   with correlated errors, 518–20
   difference-in-difference models for, 279–81, 280t
   endogeneity in, 255–94
   fixed effects models for, 255–94
   lagged dependent variable and, 520–24
   OLS for, 284
   random effects model and, 524–25
parent in jail, effect of, 242
Park, David, 487
Pasteur, Louis, 91
Peace Corps, fixed effects model for, 288–89, 289t
perfect multicollinearity, 149
Persico, Nicola, 40, 74, 123
Pesaran, Hashem, 487
Peterson, Paul E., 341
p-hacking, 243–45
Philips, Andrew, 487
Phillips, Deborah, 389, 400
Pickup, Mark, 487
Pischke, Jörn-Steffen, 325
placebo, to control group, 334n1
plausibility, causality and, 536
plim. See probability limits (plim)
point estimate, 117
police. See crime and police
Pollin, Robert, 24
polynomial models, 221–30
   dichotomous variables and, 410n2
   equations for, 224–25, 225n4
   for life expectancy and GDP per capita, 222–26, 223f, 224f
   OLS and, 224
   for RD, 383–84, 383f, 387f
pooled model
   for crime and police, 256–61, 257t, 258f, 259f
   two-way fixed effects model and, 272
positive autocorrelation, 462, 462f
positive correlation, 9–10, 10f
Postlewaite, Andrew, 40, 74, 123
post-treatment variables, 236–43
   collider bias with, 510–13
   defined, 236
pound sign (#), in R, 37
poverty. See foreign aid for poverty
power
   balance and, 336–37
   calculating, 501–2
   equations for, 113n7
   hypothesis testing and, 109–11
   null hypothesis and, 336–37, 502
   null result and, 113
   and standard error, 113
   Type II errors and, 109–11, 110f, 501–2
power curve, 111–13, 112f
   R for, 123
Prais-Winsten model. See ρ-transformed model
precision
   in 2SLS, 313–15
   in bivariate OLS, 61–64
   in multivariate OLS, 146–50
predict, in Stata, 83
predicted values
   in bivariate OLS, 47
   bivariate OLS for, 46f, 50–51, 51f, 51t, 94–95, 95t, 96f
   fixed effects model for, 288, 288t
   hypothesis testing for, 124–26
   null hypothesis for, 94–95, 95t, 96f
   for presidential elections, 50
   variables for, 87t
presidential elections
   bivariate OLS for, 46f, 50–51, 51f, 51t, 94–95, 95t, 96f
   fitted values for, 50, 52f
   fixed effects models for, 288, 288t
   hypothesis testing for, 124–26
   null hypothesis for, 94–95, 95t, 96f
   predicted values for, 45, 50
   residuals for, 50
   scatterplots for, 45, 46f
   variables for, 87t
prison. See high-security prison and inmate aggression
probability, of Type II error, 111n7
probability density
   in bivariate OLS, 55–56, 55f, 58f
   confidence interval and, 117, 118f
   critical value and, 102f
   for null hypothesis, 95
   p value and, 107f
probability distribution, in bivariate OLS, 54, 55f
probability limits (plim), in bivariate OLS, 65, 65f
probit model
   coefficient estimates for, 426–29, 427f
   for congressional members and donors, 454–55, 455t
   dependent variables in, 443
   for dummy dependent variables, 418–21, 423–25, 424f
   for economic growth and civil war, 441–43, 441f, 442f
   equation for, 420
   fitted lines and, 423–25, 424f, 434–35, 435f, 449
   fitted values from, 423–25, 424f
   independent variables and, 430
   for Iraq War and President Bush, 449–50, 449t
   ketchup econometrics, 431–34, 434t, 435f
   for law school admission, 415f, 427–28, 427f
   LR test and, 438t, 439–40
   for NFL coaches, 452t, 453–54
   normal distribution and, 418, 419f
   observed-value, discrete differences approach for, 430–31
   R for, 446–49
   Stata for, 444–47
Progresa experiment, in Mexico, 338–40, 339t
public affairs. See television and public affairs
p-value
   hypothesis testing and, 106–9, 107f
   for LR test, 446–47
   in Stata, 446–47

quadratic models, 221–30
   fitted curves for, 225f
   for global warming, 227–30, 228f, 229t
   OLS and, 226
   R for, 246
   Stata for, 246
quarter of birth, 2SLS for, 301–3
quasi-instrumental variables
   equation for, 310, 311n5
   exogeneity and, 310–12

R (software), 33, 36–39, 39n8
   for 2SLS, 326
R (software) (continued)
   AER package for, 85–86, 326
   for autocorrelation, 488–90
   for balance, 366
   data frames in, 286–87
   for dummy variables, 213–14
   for fixed effects models, 528–30
   for hypothesis testing, 122–23
   installing packages, 86
   for logit model, 446–49
   for LSDV, 286–87
   for multivariate OLS, 170–71
   for Newey-West standard errors, 489–90
   for OLS, 515
   for probit model, 446–49
   for quadratic models, 246
   residual standard error in, 71, 85
   sample limiting with, 38–39
   for scatterplots, 400
   variables in, 37–38, 38n7
R2
   for 2SLS, 314
   adjusted, 150
   F tests using, 160–62
   goodness of fit and, 71–72, 74
   multiple, 85
   multivariate OLS and, 149
racial discrimination. See job resumes and racial discrimination
RAND, 358
random effects model, panel data and, 524–25
random error, 6
   latent variables and, 417
randomization
   of independent variable, 19, 334
   in Progresa experiment, 339
randomized experiments, 333–34
   2SLS for, 308, 308t
   ABC issues in, 334, 334n2
   attrition in, 354–59
   balance in, 335–40
   blocking in, 335
   causality and, 18–22
   compliance in, 340–54
   for congressional members and donors, 454–55, 455t
   control group in, 19
   discontinuity in, 373–74
   error term and, 337
   for exogeneity, 18–19, 334
   external validity of, 21
   for flu shots and health, 13–15, 14f, 21n9
   generalizability of, 21
   as gold standard, 18–22
   internal validity of, 21
   for job resumes and racial discrimination, 368–70, 369t
   RD for, 395–97, 396f, 397t
   for television and public affairs, 328–29, 328t, 366–68
   treatment group in, 19, 334
randomness. See also modeled randomness; sampling randomness
   of bivariate OLS estimates, 53–57
   causality and, 7–18
random variables
   averages of, 56
   in bivariate OLS, 46, 53–57, 54
   central limit theorem and, 56
   χ2 distribution and, 99n1
   in coefficient estimates, 53–57
   expected value of, 496–97
   probability density for, 55–56, 55f
   probit model and, 418
random walks. See unit roots
RD. See regression discontinuity
reduced form equation, 317
reference category, 194
reg, in Stata, 325
regional wage differences, categorical variables and, 194–96, 195t, 197t
regression coefficient, bivariate OLS and, 50–51n4
regression discontinuity (RD)
   for alcohol consumption and grades, 395–97, 396f, 397t
   assignment variable in, 375–76, 384, 391–95, 393n1
   basic model for, 375–80
   binned graphs and, 386–91, 388f, 393n1
   χ2 distribution for, 394
   for congressional elections, 402–4, 403t
   covariates in, 395
   dependent variable in, 395
   diagnostics for, 393–97
   discontinuous error distribution at threshold in, 392
   endogeneity and, 373–405
   error term and, 377–79
   fitted lines for, 385f, 387f
   flexible models for, 381–84
   fuzzy RD models, 392
   generalizability of, 394
   for Head Start, 401–2, 404–5, 404t
   histograms for, 393, 393f, 396f
   LATE and, 394
   limitations of, 391–97
   Medicare and, 374–76
   outcome variables and, 384
   polynomial models for, 383–84, 383f, 387f
   scatterplots for, 376–77, 377f, 378f, 400
   slope and, 381, 381f
   treatment group and, 376
   for universal prekindergarten, 389–90, 389f, 390t, 400–402, 401t
   windows and, 386–91, 387f
regression line
   in bivariate OLS, 47
   fitted values based on, 50–51
   scatterplot with, 85
regression models
   categorical variables in, 193–94
   for chicken market, 319–23
   constant in, 4, 5f
   error term in, 5–6
regression to the mean, 45n2
Reinhart, Carmen, 24–25
stable unit treatment value assumption (SUTVA), 324
Stack, Steven, 15
Staiger, Douglas, 312n6
standard deviation (SD)
   averages and, 26n2
   with data, 26
   equation for, 26n3
   se and, 61
standard error of the regression
   in bivariate OLS, 71
   in Stata, 83
standard error (se)
   for 2SLS, 300, 313
   autocorrelation and, 464
   in bivariate OLS, 61, 74
   fixed effects and, 268n5
   for height and wages, 74, 133
   heteroscedasticity-consistent standard errors, 68–70, 68n18
   for interaction variables, 205
   multicollinearity and, 149–50
   in multivariate OLS, 133
   Newey-West, 467
   for null hypothesis, 95
   for OLS, 499–501
   and power, 113
   in R, 86
   and sample size, 113–14
   substantive significance and, 115
   t tests and, 98
   variance of, 499–501
standardization, of variables, 156
standardized coefficients, 155–58
standardized regression coefficients
   in Stata, 169–71
stand your ground laws and homicide, 276–77, 280–81, 280t
Stasavage, David, 197, 200
Stata, 34–36
   for 2SLS, 325
   for autocorrelation, 488–90
   for balance, 365–66
   for bivariate OLS, 81–84
   for categorical variables, 213
   critical value in, 121, 170
   dfbeta in, 79n24
   for dummy variables, 212–13
   for fixed effects models, 285, 527–28
   F test in, 170
   for hypothesis testing, 121–22
   interaction variables in, 212
   ivregress in, 325
   jitter in, 83n25, 173n14
   limit sample in, 176n15
   linear-log model in, 246
   logit model in, 446–47
   LR test in, 446
   for LSDV, 284–85
   marginal-effects approach and, 451–52
   multicollinearity in, 169
   for multivariate OLS, 168
   _n in, 89n29
   for observed-value, discrete differences approach, 444–47
   for OLS, 170, 514
   for probit model, 444–47
   for quadratic models, 246
   reg in, 325
   robust in, 83, 168, 212
   root mean squared error in, 71
   scalar variables in, 405n3
   scatterplots in, 399
   standard error of the regression in, 83
   for standardized regression coefficients, 169–71
   test in, 446–47
   ttail in, 121n10
   twoway in, 83
   VIF in, 169
stationarity, 485n12
   augmented Dickey-Fuller test for, 482
   Dickey-Fuller test for, 482
   global warming and, 482–85, 483f, 485t
   time series data and, 476–82
   unit roots and, 477–81, 479f, 480f
statistically significant
   balance and, 336
   in hypothesis testing, 93, 120
statistical realism, 533–37
statistical software, 32–33
Stock, James, 312n6, 487
strength, causality and, 535
Stuart, Elizabeth, 365
substantive significance, hypothesis testing and, 115
suicide. See country music and suicide
summarize, in Stata, 34–35
supply equation, 320–22
SUTVA. See stable unit treatment value assumption
Swirl, 34
syntax files
   in R, 37
   in Stata, 35

Tabarrok, Alexander, 362
t distribution, 99–100, 99n1
   critical value for, 101, 103t
   d.f. and, 103
   inverse t function and, 121n9
   MLE and, 423
   normal distribution and, 100, 100f
teacher salaries. See education and wages
Tekin, Erdal, 280
television and public affairs, 367–68
   instrumental variables for, 328–29, 328t
temperature. See global warming; retail sales and temperature
terror alerts. See crime and terror alerts
test, in Stata, 446–47
test scores. See education and wages
Texas school boards, fixed effects model for, 291–93, 292t
time series data, 459–92
   autocorrelation in, 460–63
   correlated errors in, 68–70
   dependent variable and, 460
   dynamic models for, 473–76
   for global warming, 459
   stationarity and, 476–82
Torres, Michelle, 246
trade and alliances, fixed effects model for, 274–76, 275t
treatment group, 335
   2SLS for, 329
   attrition in, 354–55
   averages for, 182
   balance in, 335–40
   blocking for, 335–36
   compliance in, 342, 348
   control group and, 134, 134n1, 180, 334
   difference-in-difference models and, 285
   difference of means test for, 181, 186f
   dummy independent variables and, 181, 182f
   ITT and, 343
   in randomized experiments, 19, 334
   RD and, 376
   SUTVA and, 324
   variables in, 337
trimmed data set, attrition and, 355–56
Trump, President Donald, 1, 45, 183–85
TSTAT, 121–22, 121n10
t statistic
   critical value and, 104
   for economic growth and education, 143
   F test and, 312n6
   for height and wages, 104–5, 104t
   p value and, 108
ttail, in Stata, 121n10
t tests, 99n1
   for bivariate OLS, 97–106
   critical value for, 101
   for hypothesis testing, 97–106
   MLE and, 423
   for null hypothesis, 105
   se and, 98
Tufte, Edward, 34
two-sided alternative hypothesis, 94
   critical value and, 101–3, 102f
two-stage least squares (2SLS), 300n1
   for alcohol consumption and grades, 308, 308t
   assignment variable and, 348
   for balance, 366
   bias and, 312
   for crime and police, 296–98, 297t
   for domestic violence in Minneapolis, 352–53
   for education and wages, 301–3
   exclusion condition for, 300–301, 302f
   fitted value for, 299, 314, 348
   goodness of fit for, 314
   Hausman test for, 301n2
   inclusion condition for, 300, 302f
   instrumental variables and, 295–308, 313
   LATE with, 324
   with multiple instruments, 309
   for NICU, 305–8, 306t, 307t
   for non-compliance, 346–56
   observational data for, 323, 346, 349, 350
   OLS and, 298, 301n2
   overidentification test and, 309–10
   precision of, 313–15
   for quarter of birth, 301–3
   R2 for, 314
   R for, 326
   se for, 300, 313
   for simultaneous equation model, 317–18
   Stata for, 325
   for television and public affairs, 368
   for treatment group, 329
   variables in, 348–49
   variance of, 313–14
twoway, in Stata, 83
two-way fixed effects models, 271–75
Type I errors
   hypothesis testing and, 93, 93t
   null hypothesis and, 95, 97
   significance level and, 95–96
Type II errors
   hypothesis testing and, 93t
   null hypothesis and, 93, 95, 97
   power and, 109–11, 110f, 501–2
   probability of, 111n7
   significance level and, 95–96

unbiased estimator
   in bivariate OLS, 58–60, 58f
   correlation of, 61
   distributions of, 61
   ITT and, 344
   OLS and, 493–98
unbiasedness
   in bivariate OLS, 57–61
   of coefficient estimates, 57–59
Uncontrolled (Manzi), 21
unit roots
   augmented Dickey-Fuller test for, 481
   Dickey-Fuller test for, 480–81
   lagged dependent variable and, 477
   stationarity and, 477–81, 479f, 480f
universal prekindergarten, RD for, 389–90, 389f, 390t, 400–402, 401t
unrestricted model
   defined, 159
   for LR test, 439–40

variables
   in 2SLS, 348–49