Real
Econometrics
The Right Tools to Answer
Important Questions
Second Edition
Michael A. Bailey
© 2020, 2017 by Oxford University Press
Printing number: 9 8 7 6 5 4 3 2 1
APPENDICES
Math and Probability Background 538
A Summation 538
B Expectation 538
C Variance 539
D Covariance 540
E Correlation 541
F Probability Density Functions 541
G Normal Distributions 543
H Other Useful Distributions 549
I Sampling 551
Further Reading 554 · Key Terms 554 · Computing Corner 554
Bibliography 577
Glossary 587
Index 596
LIST OF FIGURES
1.1 Rule #1 2
1.2 Weight and Donuts in Springfield 4
1.3 Regression Line for Weight and Donuts in Springfield 5
1.4 Examples of Lines Generated by Core Statistical Model (for Review
Question) 7
1.5 Correlation 10
1.6 Possible Relationships between X, ε, and Y (for Discussion
Questions) 12
1.7 Two Scenarios for the Relationship between Flu Shots and Health 14
3.1 Relationship between Income Growth and Vote for the Incumbent
President’s Party, 1948–2016 46
3.2 Elections and Income Growth with Model Parameters Indicated 51
3.3 Fitted Values and Residuals for Observations in Table 3.1 52
3.4 Four Distributions 55
3.5 Distribution of β̂1 58
3.6 Two Distributions with Different Variances of β̂1 62
3.7 Four Scatterplots (for Review Questions) 64
3.8 Distributions of β̂1 for Different Sample Sizes 66
3.9 Plots with Different Goodness of Fit 73
3.10 Height and Wages 75
3.11 Scatterplot of Violent Crime and Percent Urban 77
3.12 Scatterplots of Crime against Percent Urban, Single Parent, and
Poverty with OLS Fitted Lines 79
4.1 Distribution of β̂1 under the Null Hypothesis for Presidential Election
Example 96
4.2 Distribution of β̂1 under the Null Hypothesis with Larger Standard
Error for Presidential Election Example 99
4.3 Three t Distributions 100
4.4 Critical Values for Large-Sample t Tests 102
4.5 Two Examples of p Values 107
4.6 Statistical Power for Three Values of β1 Given α = 0.01 and a
One-Sided Alternative Hypothesis 110
4.7 Power Curves for Two Values of se(β̂1 ) 112
4.8 Tradeoff between Type I and Type II Error 114
4.9 Meaning of Confidence Interval for Example of 0.41 ± 0.196 118
5.1 Monthly Retail Sales and Temperature in New Jersey from 1992 to
2013 128
5.2 Monthly Retail Sales and Temperature in New Jersey with December
Indicated 129
5.3 95 Percent Confidence Intervals for Coefficients in Adult Height,
Adolescent Height, and Wage Models 133
5.4 Economic Growth, Years of School, and Test Scores 142
6.1 Goal Differentials for Home and Away Games for Manchester City
and Manchester United 180
6.2 Bivariate OLS with a Dummy Independent Variable 182
6.3 Scatterplot of Trump Feeling Thermometers and Party Identification 185
6.4 Three Difference of Means Tests (for Review Questions) 186
6.5 Scatterplot of Height and Gender 188
6.6 Another Scatterplot of Height and Gender 189
6.7 Fitted Values for Model with Dummy Variable and Control Variable:
Manchester City Example 192
6.8 Relation between Omitted Variable (Year) and Other Variables 199
6.9 95 Percent Confidence Intervals for Universal Male Suffrage Variable
in Table 6.8 202
6.10 Interaction Model of Salaries for Men and Women 204
6.11 Various Fitted Lines from Dummy Interaction Models (for Review
Questions) 206
12.1 Scatterplot of Law School Admissions Data and LPM Fitted Line 412
12.2 Misspecification Problem in an LPM 413
12.3 Scatterplot of Law School Admissions Data and LPM- and
Probit-Fitted Lines 415
12.4 Symmetry of Normal Distribution 419
12.5 PDFs and CDFs 420
12.6 Examples of Data and Fitted Lines Estimated by Probit 424
12.7 Varying Effect of X in Probit Model 427
12.8 Fitted Lines from LPM, Probit, and Logit Models 435
12.9 Fitted Lines from LPM and Probit Models for Civil War Data
(Holding Ethnic and Religious Variables at Their Means) 442
12.10 Figure Included for Some Respondents in Global Warming Survey
Experiment 452
LIST OF TABLES
5.1 Bivariate and Multivariate Results for Retail Sales Data 130
5.2 Bivariate and Multiple Multivariate Results for Height and Wages Data 132
9.1 Levitt (2002) Results on Effect of Police Officers on Violent Crime 297
9.2 Influence of Distance on NICU Utilization (First-Stage Results) 306
9.3 Influence of NICU Utilization on Baby Mortality 307
9.4 Regression Results for Models Relating to Drinking and Grades 308
9.5 Price and Quantity Supplied Equations for U.S. Chicken Market 321
9.6 Price and Quantity Demanded Equations for U.S. Chicken Market 322
9.7 Variables for Rainfall and Economic Growth Data 327
9.8 Variables for News Program Data 328
9.9 Variables for Fish Market Data 329
9.10 Variables for Education and Crime Data 331
9.11 Variables for Income and Democracy Data 332
13.1 Using OLS and Lagged Residual Model to Detect Autocorrelation 466
13.2 Example of ρ-Transformed Data (for ρ̂ = 0.5) 470
13.3 Global Temperature Model Estimated by Using OLS, Newey-West,
and ρ-Transformation Models 473
13.4 Dickey-Fuller Tests for Stationarity 484
13.5 Change in Temperature as a Function of Change in Carbon Dioxide
and Other Factors 485
13.6 Variables for James Bond Movie Data 492
USEFUL COMMANDS FOR STATA

USEFUL COMMANDS FOR R
Critical value for t distribution, two-sided (qt):
    qt(0.975, 120)  # For α = 0.05 and 120 degrees of freedom; divide α by 2  [Ch. 4]
Critical value for t distribution, one-sided (qt):
    qt(0.95, 120)  # For α = 0.05 and 120 degrees of freedom  [Ch. 4]
Critical value for normal distribution, two-sided (qnorm):
    qnorm(0.975)  # For α = 0.05; divide α by 2  [Ch. 4]
Critical value for normal distribution, one-sided (qnorm):
    qnorm(0.95)  # For α = 0.05  [Ch. 4]
Two-sided p values:
    [Reported in summary(Results) output]  [Ch. 4]
One-sided p values (pt):
    1 - pt(abs(1.69), 120)  # For a model with 120 degrees of freedom and a t statistic of 1.69  [Ch. 4]
Confidence intervals (confint):
    confint(Results, level = 0.95)  # For OLS object "Results"  [Ch. 4]
Produce standardized regression coefficients (scale):
    Res.std = lm(scale(Y) ~ scale(X1) + scale(X2))  [Ch. 5]
Display R squared ($r.squared):
    summary(Results)$r.squared  [Ch. 5]
Critical value for F test (qf):
    qf(0.95, df1 = 2, df2 = 120)  # Degrees of freedom equal 2 and 120; α = 0.05  [Ch. 5]
p value for F statistic (pf):
    1 - pf(7.77, df1 = 2, df2 = 1846)  # For F statistic = 7.77 and degrees of freedom equal 2 and 1846  [Ch. 5]
Include dummies for categorical variable (factor):
    lm(Y ~ factor(X1))  # Includes appropriate number of dummy variables for categorical variable X1  [Ch. 6]
Set reference category (relevel):
    X1 = relevel(X1, ref = "south")  # Sets the "south" category as reference; include before OLS model  [Ch. 6]
Difference of means test using OLS (lm):
    lm(Y ~ Dum)  # Where Dum is a dummy variable  [Ch. 6]
Create an interaction variable:
    DumX = Dum * X  # Or use <- in place of =  [Ch. 6]
Create a squared variable:
    X_sq = X^2  [Ch. 7]
Create a logged variable:
    X_log = log(X)  [Ch. 7]
LSDV model for panel data (factor):
    Results = lm(Y ~ X1 + factor(country))  # factor adds a dummy variable for every value of the variable called country  [Ch. 8]
One-way fixed-effects model, de-meaned (plm):
    library(plm)
    Results = plm(Y ~ X1 + X2 + X3, data = dta, index = c("country"), model = "within")  [Ch. 8]
Two-way fixed-effects model, de-meaned (plm):
    library(plm)
    Results = plm(Y ~ X1 + X2 + X3, data = dta, index = c("country", "year"), model = "within", effect = "twoways")  [Ch. 8]
2SLS model (ivreg):
    library(AER)
    ivreg(Y ~ X1 + X2 + X3 | Z1 + Z2 + X2 + X3)  [Ch. 9]
Probit (glm):
    glm(Y ~ X1 + X2, family = binomial(link = "probit"))  [Ch. 12]
Normal CDF (pnorm):
    pnorm(0)  # The normal CDF evaluated at 0 (which is 0.5)  [Ch. 12]
Logit (glm):
    glm(Y ~ X1 + X2, family = binomial(link = "logit"))  [Ch. 12]
Generate draws from standard normal distribution (rnorm):
    Noise = rnorm(500)  # 500 draws from the standard normal distribution  [Ch. 14]
Panel model with autocorrelation:
    [See Computing Corner in Chapter 15]
Include lagged dependent variable (plm with lag(Y)):
    Results = plm(Y ~ lag(Y) + X1 + X2, data = dta, index = c("ID", "time"), effect = "twoways")  [Ch. 15]
Random effects panel model (plm with model = "random"):
    Results = plm(Y ~ X1 + X2, data = dta, model = "random")  [Ch. 15]
PREFACE FOR STUDENTS:
HOW THIS BOOK CAN HELP YOU
LEARN ECONOMETRICS
“I wish I had had this book when I was first exposed to the material—it would
have saved a lot of time and hair-pulling . . .”—Student J.H.
“Material is easy to understand, hard to forget.”—Student M.H.
This book introduces the econometric tools necessary to answer important ques-
tions. Do antipoverty programs work? Does unemployment affect inflation? Does
campaign spending affect election outcomes? These and many more questions are
not only interesting but also important to answer correctly if we want to support
policies that are good for people, countries, and the world.
When using econometrics to answer such questions, we need always to
remember a single big idea: correlation is not causation. Just because variable
Y rises when variable X rises does not mean that variable X causes variable Y to
rise. The essential goal is to figure out when we can say that changes in variable
X will lead to changes in variable Y.
This book helps us learn how to identify causal relationships with three
features seldom found in other econometrics textbooks. First, it focuses on
the tools that economic researchers use most. These are the real econometric
techniques that help us make reasonable claims about whether X causes Y, and
by using these tools, we can produce analyses that others can respect. We’ll get
the most out of our data while recognizing the limits in what we can say or how
confident we can be.
This emphasis on real econometrics means that we skip obscure econometric
tools that could come up under certain conditions. Econometrics is too often
complicated by books and teachers trying to do too much. This book shows that
we can have a sophisticated understanding of statistical inference without having
to catalog every method that our instructor had to learn as a student.
Second, this book works with a single unifying framework. We don’t start over
with each new concept; instead, we build around a core model. That means there
is a single equation and a unifying set of assumptions that we poke, probe, and
expand throughout the book. This approach reduces the learning costs of moving
through the material and allows us to go back and revisit material. As with any
skill, we probably won’t fully understand any given technique the first time we see
it. We have to work at it; we have to work with it. We’ll get comfortable; we’ll see
connections. Then it will click. Whether the skill is jumping rope, typing, throwing
a baseball, or analyzing data, we have to do things many times to get good at it.
By sticking to a unifying framework, we have more chances to revisit what we
have already learned. You’ll also notice that I’m not afraid to repeat myself on the
important stuff. Really, I’m not afraid to repeat myself.
Third, this book uses many examples from the policy, political, and economic
worlds. So even if you do not care about “two-stage least squares” or “maximum
likelihood” in and of themselves, you will see how understanding these techniques
will affect what you think about education policy, trade policy, election outcomes,
and many other interesting issues. The examples and case studies make it clear
that the tools developed in this book are being used by contemporary applied
economists who are actually making a difference with their empirical work.
Real Econometrics is meant to serve as the primary textbook in an introduc-
tory econometrics course or as a supplemental text providing more intuition and
context in a more advanced econometric methods course. As more and more public
policy and corporate decisions are based on statistical and econometric analysis,
this book can also be used outside of course work. Econometrics has infiltrated
into every area of our lives—from entertainment to sports (I no longer spit out my
coffee when I come across an article on regression analysis of National Hockey
League players)—and a working knowledge of basic econometric techniques can
help anyone make better sense of the world around them.
The five chapters of Part One constitute the heart of the book. They introduce
ordinary least squares (OLS), also known as regression analysis. Chapter 3
introduces the most basic regression model, the bivariate OLS model. Chapter
4 shows how to use OLS to test hypotheses. Chapters 5 through 7 introduce
the multivariate OLS model and applications. By the end of Part One, you will
understand regression and be able to control for anything you can measure. You’ll
also be able to fit curves to data and assess whether the effects of some variables
differ across groups, among other skills that will impress your friends.
Part Two introduces techniques that constitute the contemporary econometric
toolkit. These are the techniques people use when they want to get published—or
paid. These techniques build on multivariate OLS to give us a better chance of
identifying causal relations between two variables. Chapter 8 covers a simple yet
powerful way to control for many factors we can’t measure directly. Chapter 9
covers instrumental variable techniques, which work if we can find a variable
that affects our independent variable but does not directly affect our dependent variable. Instrumental
variable techniques are a bit funky, but they can be very useful for isolating causal
effects. Chapter 10 covers randomized experiments. Although ideal in theory, in
practice such experiments often raise a number of challenges we need to address.
Chapter 11 covers regression discontinuity tools that can be used when we’re
studying the effect of variables that were allocated based on a fixed rule. For
example, Medicare is available to people in the United States only when they turn
65, and admission to certain private schools depends on a test score exceeding
some threshold. Focusing on policies that depend on such thresholds turns out to
be a great context for conducting credible econometric analysis.
Part Three contains a single chapter (Chapter 12) that covers dichotomous
dependent variable models. These are simply models in which the outcome we
care about takes on two possible values. Examples and case studies include high
school graduation (someone graduates or doesn’t), unemployment (someone has
a job or doesn’t), and alliances (two countries sign an alliance treaty or don’t). We
show how to apply OLS to such models and then provide more elaborate models
that address the deficiencies of OLS in this context.
Part Four supplements the book with additional useful material. Chapter 13
covers time series data. The first part of the chapter is a variation on OLS; the
second part introduces dynamic models that differ from OLS models in important
ways. Chapter 14 derives important OLS results and extends discussion on specific
topics. Chapter 15 goes into greater detail on the vast literature on panel data,
showing how the various strands fit together.
Chapter 16 concludes the book with tips on adopting the mind-set of an
econometric realist. In fact, if you are looking for an overall understanding of the
power and limits of statistics, you might want to read this chapter first—and then
read it again once you’ve learned all the statistical concepts covered in the other
chapters.
PREFACE FOR INSTRUCTORS
We econometrics teachers have high hopes for our students. We want them to
understand how econometrics can shed light on important economic and policy
questions. Sometimes they humor us with incredible insight. The heavens part;
angels sing. We want that to happen daily. Sadly, a more common experience is
seeing a furrowed brow of confusion and frustration. It’s cloudy and rainy in that
place.
It doesn’t have to be this way. If we distill the material to the most critical
concepts, we can inspire more insight and less brow-furrowing. Unfortunately,
conventional statistics and econometrics books all too often manage to be too
simple and too confusing at the same time. Many are too simple in that they
provide a semester’s worth of material that hardly gets past rudimentary ordinary
least squares (OLS). Some are too confusing in that they get to OLS by way of
going deep into the weeds of probability theory without showing students how
econometrics can be useful and interesting.
Real Econometrics is predicated on the belief that we are most effective
when we teach the tools we use. What we use are regression-based tools with an
increasing focus on experiments and causal inference. If students can understand
these fundamental concepts, they can legitimately participate in analytically sound
conversations. They can produce analysis that is interesting—and believable!
They can understand experiments and the sometimes subtle analysis required
when experimental methods meet social scientific reality. They can appreciate that
causal effects are hard to tease out with observational data and that standard errors
estimated on crap coefficients, however complex, do no one any good. They can
sniff out when others are being naive or cynical. It is only when we muck around
too long in the weeds of less useful material that statistics becomes the quagmire
students fear.
Hence this book seeks to be analytically sophisticated in a simple and relevant
way. It focuses on tools actually used by real analysts. Nothing useless. No clutter.
To do so, the book is guided by three principles: relevance, opportunity costs, and
pedagogical efficiency.
Relevance
Relevance is a crucial first principle for successfully teaching econometrics in
the social sciences. Every experienced instructor knows that most students care
more about the real world than math. How do we get such students to engage
with econometrics? One option is to cajole them to care more and work harder.
We all know how well that works. A better option is to show them how a
sophisticated understanding of statistical concepts helps them learn more about
the topics that concern them. Think of a mother trying to get a child to commit to
the training necessary to play competitive sports. She could start with a semester
of theory. . . . No, that would be cruel. And counterproductive. Much better to let
the child play and experience the joy of the sport. Then there will be time (and
motivation!) to understand nuances. Thus every chapter is built around examples
and case studies on topics students might actually care about—topics like violent
crime in the United States (Chapter 2), global warming (Chapter 7), and the
relationship between alcohol consumption and grades (Chapter 11).
Learning econometrics is not that different from learning anything else. We
need to care to truly learn. Therefore this book takes advantage of a careful
selection of material to spend more time on the real examples that students care
about.
Opportunity Costs
Opportunity costs are, as we all tell our students, what we have to give up to
do something. So, while some topic might be a perfectly respectable part of an
econometric toolkit, we should include it only if it does not knock out something
more important. The important stuff all too often gets shunted aside as we fill up
the early part of students’ analytical training with statistical knick-knacks, material
“some people still use” or that students “might see.”
Therefore this book goes quickly through descriptive statistics and doesn’t
cover χ² tests for two-way tables, weighted least squares, and other denizens of
conventional statistics books. These concepts—and many, many more—are all
perfectly legitimate. Some are covered elsewhere (descriptive statistics are covered
in elementary schools these days). Others are valuable enough to rate inclusion
here in an “advanced material” section for students and instructors who want
to pursue these topics further. And others simply don’t make the cut. Only by
focusing the material can we get to the tools used by researchers today, tools such
as panel data analysis, instrumental variables, and regression discontinuity. The
core ideas behind these tools are not particularly difficult, but we need to make
time to cover them.
Pedagogical Efficiency
Pedagogical efficiency refers to streamlining the learning process by using a single
unified framework. Everything in this book builds from the standard regression
model. Hypothesis testing, difference of means, and experiments can be—and
often are—taught independently of regression. Causal inference is sometimes
taught with potential outcomes notation. There is nothing intellectually wrong
with these approaches. But is using them pedagogically efficient? If we teach
these as stand-alone concepts we have to take time and, more important, student
brain space to set up each separate approach. For students, this is really hard.
Remember the furrowed brows? Students work incredibly hard to get their heads
around difference of means and where to put degrees of freedom corrections and
how to know if the means come from correlated groups or independent groups and
what the equation is for each of these cases. Then BAM! Suddenly the professor is
talking about residuals and squared deviations. The transition is old hat for us, but
it can overwhelm students first learning the material. It is more efficient to teach the
OLS framework and use that to cover difference of means, experiments, and the
contemporary canon of econometric analysis, including panel data, instrumental
variables, and regression discontinuity. Each tool builds from the same regression
model. Students start from a comfortable place and can see the continuity that
exists.
An important benefit of working with a single framework is that it allows
students to revisit the core model repeatedly throughout the term. Despite the
brilliance of our teaching, students rarely can put it all together with one pass
through the material. I know I didn’t when I was beginning. Students need to see
the material a few times, work with it a bit, and then it will finally click. Imagine
if sports were coached the way we do econometrics. A tennis coach who said
“This week we’ll cover forehands (and only forehands), next week backhands (and
only backhands), and the week after that serves (and only serves)” would not be a
tennis coach for long. Instead, coaches introduce material, practice, and then keep
working on the fundamentals. Working with a common framework throughout
makes it easier to build in mini-drills about fundamentals as new material is
introduced.
Course Adoption
Real Econometrics is organized to work well in three different kinds of courses.
First, it can be used in an introductory econometrics course that follows a semester
of probability and statistics. In such a course, students should probably be able to
move quickly through the early material and then pick up where they left off,
typically with multivariate OLS.
Second, this book can be used with students who have not previously (or
recently) studied statistics, either in a one-semester course covering Part One or
a year-long course covering the whole book. Using this book as a first course
avoids the “warehouse problem,” which occurs when we treat students’ statistical
education as a warehouse, filling it up with tools first and accessing them only
later. One challenge is that things rot in a warehouse. Another challenge is that
instructors tend to hoard a bit, putting things in the warehouse “just in case”
and creating clutter. And students find warehouse work achingly dull. Using this
book in a first-semester course avoids the warehouse problem by going directly
to interesting and useful material, providing students with a more just-in-time
approach. For example, they see statistical distributions, but in the context of trying
to solve a specific problem rather than as an abstract concept that will become
useful later.
Overview
The first two chapters of the book serve as introductory material and introduce
the science of statistics. Chapter 1 lays out the theme of how important—and
hard—it is to generate unbiased estimates. This is a good time to let students offer
hypotheses about questions of the day, because these questions can help bring
to life the subsequent material. Chapter 2, which introduces computer programs
and good practices, is a confidence builder that gets students who are not already
acclimated to statistical computing over the hurdle of using statistical software.
Part One covers core OLS material. Chapter 3 introduces bivariate OLS.
Chapter 4 covers hypothesis testing, and Chapter 5 moves to multivariate OLS.
Chapters 6 and 7 proceed to practical tasks such as use of dummy variables, logged
variables, interactions, and F tests.
Part Two covers essential elements of the contemporary econometric toolkit,
including panel data, instrumental variables, analysis of experiments, and regres-
sion discontinuity. Chapter 10, on experiments, uses instrumental variables.
Chapters 8, 9, and 11 can be covered in any order, however, so instructors can
pick and choose among these chapters as needed.
Part Three contains a single chapter (Chapter 12) on dichotomous dependent
variables. It develops the linear probability model in the context of OLS and
uses the probit and logit models to introduce students to maximum likelihood.
Instructors can cover this chapter any time after Part One if dichotomous
dependent variables play a major role in the course.
Data
Instructor’s Manual
PowerPoint Presentations
Presentation slides offer bullet-point summaries as well as all the tables and graphs
from the book to help guide and design lectures. A separate set of slides containing
only the text tables and graphs is also available.
The computerized test bank that accompanies this text enables instructors to
easily create quizzes and exams, using any combination of publisher-provided
questions and their own questions. Questions can be edited and easily assembled
into assessments that can then be exported for use in learning management systems
or printed for paper-based assessments.
For instructors using an online learning management system (e.g., Moodle, Sakai,
Blackboard, or others), Oxford University Press can provide all the electronic
components of the package in a format suitable for easy upload. Adopting
instructors should contact their local Oxford University Press sales representative
or OUP’s Customer Service (800-445-9714) for more information.
ACKNOWLEDGMENTS
This book has benefited from close reading and probing questions from a large
number of people, including students at the McCourt School of Public Policy
at Georgetown University and my current and former colleagues and students at
Georgetown, including Shirley Adelstein, Rachel Blum, David Buckley, Ian Gale,
Ariya Hagh, Carolyn Hill, Mark Hines, Dan Hopkins, Jeremy Horowitz, Huade
Huo, Wes Joe, Karin Kitchens, Jon Ladd, Jens Ludwig, Paasha Mahdavi, Jean
Mitchell, Paul Musgrave, Sheeva Nesva, Hans Noel, Irfan Nooruddin, Ji Yeon
Park, Parina Patel, Betsy Pearl, Lindsay Pettingill, Carlo Prato, Barbara Schone,
George Shambaugh, Dennis Quinn, Chris Schorr, Frank Vella, and Erik Voeten.
Credit (and/or blame) for the Simpsons figure goes to Paul Musgrave.
Participants at a seminar on the book at the University of Maryland, especially
Antoine Banks, Brandon Bartels, Kanisha Bond, Ernesto Calvo, Sarah Croco,
Michael Hanmer, Danny Hayes, Eric Lawrence, Irwin Morris, and John Sides,
gave excellent early feedback.
In addition, colleagues across the country have been incredibly helpful,
especially Allison Carnegie, Craig Volden, Sarah Croco, and Wendy Tam-Cho.
Reviewers for Oxford University Press and other commentators have provided
supportive yet probing feedback. These individuals include:
Steve Balla, George Washington University; Yong Bao, Purdue University;
James Bland, The University of Toledo; Kwang Soo Cheong, Johns Hopkins
University; Amanda Cook, Bowling Green State University; Renato Corbetta,
University of Alabama at Birmingham; Sarah Croco, University of Maryland;
David E. Cunningham, University of Maryland; Seyhan Erden, Columbia
University; José M. Fernández, University of Louisville; Luca Flabbi, Georgetown
University; Mark A. Gebert, University of Kentucky; Kaj Gittings, Texas Tech
University; Brad Graham, Grinnell College; Jonathan Hanson, University of
Michigan; David Harris, Benedictine College; Daniel Henderson, University of
Alabama; Matthew J. Holian, San Jose State University; Todd Idson, Boston
University; Changkuk Jung, SUNY Geneseo; Manfred Keil, Claremont McKenna
College; Subal C. Kumbhakar, State University of New York at Binghamton; Latika
Lagalo, Michigan Technological University; Matthew Lang, Xavier University;
Jing Li, Miami University; Quan Li, Texas A&M University; Drew A. Linzer,
Civiqs; Steven Livingston, Middle Tennessee State University; Aprajit Mahajan,
Stanford University; Brian McCall, University of Michigan; Phillip Mixon,
Troy University; David Peterson, Iowa State University; Leanne C. Powner,
Christopher Newport University; Zhongjun Qu, Boston University; Robi Ragan,
Stetson School of Business and Economics; Stephen Schmidt, Union College;
Markus P. A. Schneider, University of Denver; Sam Schulhofer-Wohl, Federal
Reserve Bank of Minneapolis; Christina Suthammanont, Texas A&M University,
San Antonio; Kerry Tan, Loyola University Maryland; Robert Turner, Colgate
University; Martijn van Hasselt, University of North Carolina—Greensboro;
David Vera, California State University; Christopher Way, Cornell University;
CHAPTER 1
The Quest for Causality
dependent variable: The outcome of interest, usually denoted as Y.

independent variable: A variable that possibly influences the value of the dependent variable.

The dependent variable, usually denoted as Y, is called that because its value depends on the independent variable. The independent variable, usually denoted by X, is called that because it does whatever the hell it wants. It is potentially the cause of some change in the dependent variable.

At root, social scientific theories posit that a change in one thing (the independent variable) will lead to a change in another (the dependent variable). We'll formalize this relationship in a bit, but let's start with an example. Suppose we're interested in the U.S. obesity epidemic and want to analyze the influence of snack food on health. We may wonder, for example, if donuts cause health problems. Our model is that eating donuts (variable X, our independent variable) causes some change in weight (variable Y, our dependent variable). If we can find data on how many donuts people ate and how much they weighed, we might be on the verge of a scientific breakthrough.
Let’s conjure up a small midwestern town and do a little research. Figure 1.2
plots donuts eaten and weights for 13 individuals from a randomly chosen town:
Springfield, U.S.A. Our raw data is displayed in Table 1.1. Each person has a line in
the table. Homer is observation 1. Since he ate 14 donuts per week, Donuts1 = 14.
We’ll often refer to Xi or Yi , which are the values of X and Y for person i in the
data set. The weight of the seventh person in the data set, Smithers, is 160 pounds,
meaning Weight7 = 160, and so forth.
scatterplot: A plot of data in which each observation is located at the coordinates defined by the independent and dependent variables.

Figure 1.2 is a scatterplot of data, with each observation located at the coordinates defined by the independent and dependent variables. The value of donuts per week is on the X-axis, and weights are on the Y-axis. Just by looking at this plot, we sense there is a positive relationship between donuts and weight because the more donuts eaten, the higher the weight tends to be.
Table 1.1

Observation   Name     Donuts per week   Weight (pounds)
1             Homer    14                275
2             Marge    0                 141
3             Lisa     0                 70
4             Bart     5                 75
...
12            Patty    5                 155
13            Selma    4                 145
FIGURE 1.2: Weight and Donuts in Springfield [scatterplot of donuts per week (X-axis, 0 to 20) against weight in pounds (Y-axis, 50 to 300), with each Springfield resident labeled]
FIGURE 1.3: Regression Line for Weight and Donuts in Springfield [the same scatterplot with a fitted line; the intercept β0 = 123 is marked, and β1 is the slope]
error term  The term associated with unmeasured factors in a regression model, typically denoted as ε.

• εi is the error term that captures anything else that affects weight. (ε is the Greek letter epsilon.)

This equation will help us estimate the two parameters necessary to characterize a line. Remember Y = mX + b from junior high? This is the equation for a line where Y is the value of the line on the vertical axis, X is the value on the horizontal axis, m is the slope, and b is the intercept, or the value of Y when X is zero. Equation 1.1 is essentially the same, only we refer to the "b" term as β0 and call the "m" term β1.
Figure 1.3 shows an example of a possible line from this model for our
Springfield data. The intercept (β0 ) is the value of weight when donut consumption
is zero (X = 0). The slope (β1 ) is the amount that weight increases for each donut
eaten. In this case, the intercept is about 123, which means that the expected weight
for those who eat zero donuts is around 123 pounds. The slope is around 9.1, which
means that for each donut eaten per week, weight is about 9.1 pounds higher.
More generally, our core model can be written as
Yi = β0 + β1 Xi + εi    (1.2)
where β0 is the intercept that indicates the value of Y when X = 0 and β1 is the
slope that indicates how much change in Y is expected if X increases by one unit.
We almost always care a lot about β1 , which characterizes the relationship between
X and Y. We usually don’t care a whole lot about β0 . It plays an important role in
helping us get the line in the right place, but determining the value of Y when X is
zero is seldom our core research interest.
In Figure 1.3, we see that the actual observations do not fall neatly on the
line that we’re using to characterize the relationship between donuts and weight.
The implication is that our model does not perfectly explain the data. Of course
it doesn’t! Springfield residents are much too complicated for donuts to explain
them completely (except, apparently, Comic Book Guy).
The error term, i , comes to the rescue by giving us some wiggle room. The
error term is what is left over after the variables have done their work in explaining
variation in the dependent variable. In doing this service, it plays an incredibly
important role for the entire econometric enterprise. As this book proceeds, we
will keep coming back to the importance of getting to know our error term.
The error term, i , is not simply a Greek letter. It is something real. What it
covers depends on the model. In our simple model—in which weight is a function
only of how many donuts a person eats—oodles of factors are contained in the
error term. Basically, anything else that affects weight will be in the error term:
sex, height, other eating habits, exercise patterns, genetics, and on and on. The
error term includes everything we haven’t measured in our model.
We’ll often see i referred to as random error, but be careful about that one.
Yes, for the purposes of the model we are treating the error term as something
random, but it is not random in the sense of a roll of the dice. It is random more in
the sense that we don’t know what the value of it is for any individual observation.
But as a practical matter every error term reflects, at least in part, some relationship
to real things that we have not measured or included in the model. We will come
back to this point often.
REMEMBER THIS
Our core statistical model is
Yi = β0 + β1 Xi + εi
1. β1 , the slope, indicates how much change in Y (the dependent variable) is expected if X (the
independent variable) increases by one unit.
2. β0 , the intercept, indicates where the regression line crosses the Y-axis. It is the value of Y when
X is zero.
3. β1 is usually more interesting than β0 because β1 characterizes a relationship between X and
Y.
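To see the core model at work, here is a minimal sketch in Python that fits β0 and β1 by ordinary least squares to the six Springfield rows recoverable from Table 1.1 (with only 6 of the 13 residents, the estimates will not match the full-sample line in Figure 1.3):

```python
import numpy as np

# Six rows of Table 1.1 (observations 1-4, 12, 13)
donuts = np.array([14, 0, 0, 5, 5, 4], dtype=float)           # X_i: donuts per week
weight = np.array([275, 141, 70, 75, 155, 145], dtype=float)  # Y_i: pounds

# Fit weight_i = b0 + b1 * donuts_i by ordinary least squares
X = np.column_stack([np.ones_like(donuts), donuts])
(b0, b1), *_ = np.linalg.lstsq(X, weight, rcond=None)
print(f"intercept b0 = {b0:.1f}, slope b1 = {b1:.1f}")

# The fitted errors (residuals) are what epsilon_i absorbs for these people
residuals = weight - (b0 + b1 * donuts)
```

The slope comes out positive, matching the upward pattern visible in the scatterplot.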
FIGURE 1.4: Examples of Lines Generated by Core Statistical Model (for Review Question) [four panels, (a) through (d), each showing a line in X-Y space]

Review Question
For each of the panels in Figure 1.4, determine whether β0 and β1 are greater than, equal to, or less than zero. [Be careful with β0 in panel (d)!]

1.2 Two Major Challenges: Randomness and Endogeneity

But can we be certain that donuts, and not some other factor, cause weight gain? Two core challenges in econometric analysis should make us cautious. One
is randomness. Any time we observe a relationship in data, we need to keep in mind
that some coincidence could explain it. Perhaps we happened to pick some unusual
people for our data set. Or perhaps we picked perfectly representative people, but
they happened to have had unusual measurements on the day we examined them.
In the donut example, the possibility of such randomness should worry us, at
least a little. Perhaps the people in Figure 1.3 are a bit odd. Perhaps if we had more
people, we might get more heavy folks who don’t eat donuts and skinny people
who scarf them down. Adding those folks to the data set would change the figure
and our conclusions. Or perhaps even with the set of folks we observed, we might
have gotten some of them on a bad (or a good) day, whereas if we had looked at
them another day, we might have observed a different relationship.
Every legitimate econometric analysis therefore will account for randomness
in an effort to distinguish results that could happen by chance from those that
would be unlikely to happen by chance. The bad news is that we will never escape
the possibility that the results we observe are due to randomness rather than a
causal effect. The good news, though, is that we can often do a pretty good job
characterizing our confidence that the results are not simply due to randomness.
Another major challenge arises from the possibility that an observed relation-
ship between X and Y is actually due to another variable, which causes Y and
is associated with X. In the donuts example, we should worry about scenarios in which we
wrongly attribute to our key independent variable (in this case, donut consumption)
changes in weight that were caused by other factors. What if tall people eat more
donuts? Height is in the error term as a contributing factor to weight, and if tall
people eat more donuts, we may wrongly attribute to donuts the effect of height.
There are loads of other possibilities. What if men eat more donuts? What if
exercise addicts don’t eat donuts? What if people who eat donuts are also more
likely to down a tub of Ben and Jerry’s ice cream every night? What if thin people
can’t get donuts down their throats? Being male, exercising, bingeing on ice cream,
having itty-bitty throats—all these things are probably in the error term (meaning
they affect weight), and all could be correlated with donut eating.
endogenous  An independent variable is endogenous if changes in it are related to factors in the error term.

Speaking econometrically, we highlight this major statistical challenge by saying that the donut variable is endogenous. An independent variable is endogenous if changes in it are related to factors in the error term. The prefix "endo" refers to something internal, and endogenous independent variables are "in the model" in the sense that they are related to other things that also determine Y (but are not already accounted for by X).
In the donuts example, donut consumption is likely endogenous because how
many donuts a person eats is not independent of other factors that influence weight
gain. Factors that cause weight gain (e.g., eating Ben and Jerry’s ice cream)
might be associated with donut eating; in other words, factors that influence the
dependent variable Y might also be associated with the independent variable X,
muddying the connection between correlation and causation. If we can’t be sure
that our variation in X is not associated with factors that influence Y, we need to
worry about wrongly attributing to X the causal effect of some other variable.
We might wrongly conclude that donuts cause weight gain when really donut
eaters are more likely to eat tubs of Ben and Jerry’s, with the ice cream being
the real culprit.
In all these examples, something in the error term that really causes weight
gain is related to donut consumption. When this connection exists, we risk
spuriously attributing to donut consumption the causal effect of some other factor.
Remember, anything not measured in the model is in the error term, and here,
at least, we have a wildly simple model in which only donut consumption is
measured. So Ben and Jerry’s, genetics, and everything else are in the error term.
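A short simulation makes the danger concrete (a sketch with invented numbers, not any published analysis): here donuts have no true effect on weight, but because donut eating is correlated with ice cream eating, which sits in the error term, a regression of weight on donuts alone attributes the ice cream effect to donuts.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Invented data-generating process: ice cream (unmeasured) drives weight...
ice_cream = rng.normal(size=n)
# ...and donut eating is correlated with ice cream eating
donuts = 0.8 * ice_cream + rng.normal(size=n)
# Donuts have NO true effect on weight; all the action is ice cream
weight = 150 + 5.0 * ice_cream + rng.normal(size=n)

# Regress weight on donuts alone; donuts are endogenous here
X = np.column_stack([np.ones(n), donuts])
b0, b1 = np.linalg.lstsq(X, weight, rcond=None)[0]
print(f"estimated donut effect: {b1:.2f} (true effect is 0)")
```

The estimated donut coefficient is far from zero even though donuts do nothing, which is exactly the spurious attribution described above.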
Endogeneity is everywhere; it’s endemic. Suppose we want to know if raising
teacher salaries increases test scores. It’s an important and timely question.
Answering it may seem easy enough: we could simply see if test scores (a
dependent variable) are higher in places where teacher salaries (an independent
variable) are higher. It’s not that easy, though, is it? Endogeneity lurks. Test
scores might be determined by unmeasured factors that also affect teacher salaries.
Maybe school districts with lots of really poor families don’t have very good test
scores and don’t have enough money to pay teachers high salaries. Or perhaps
the relationship is the opposite—poor school districts get extra federal funds to
pay teachers more. Either way, teacher salaries are endogenous because their
levels depend in part on factors in the error term (like family income) that affect
educational outcomes. Simply looking at the relationship of test scores to teacher
salaries risks confusing the effect of family income and teacher salaries.2
exogenous  An independent variable is exogenous if changes in it are unrelated to factors in the error term.

The opposite of endogeneity is exogeneity. An independent variable is exogenous if changes in it are not related to factors in the error term. The prefix "exo" refers to something external, and exogenous independent variables are "outside the model" in the sense that their values are unrelated to other things that also determine Y. For example, if we use an experiment to randomly set the value of X, then changes in X are not associated with factors that also determine Y. This gives us a clean view of the relationship between X and Y, unmuddied by associations between X and other factors that affect Y.
One of our central challenges is to avoid endogeneity and thereby achieve
exogeneity. If we succeed, we can be more confident that we have moved beyond
correlation and closer to understanding if X causes Y—our fundamental goal. This
process is not automatic or easy. Often we won’t be able to find purely exogenous
variation, so we’ll have to think through how close we can get. Nonetheless, the
bottom line is this: if we can find exogenous variation in X, we will be in a good
position to make reasonable inferences about what will happen to variable Y if we
change variable X.
correlation  Measures the extent to which two variables are linearly related to each other.

To formalize these ideas, we'll use the concept of correlation, which most people know, at least informally. Two variables are correlated ("co-related") if they move together. A positive correlation means that high values of one variable are associated with high values of the other; a negative correlation indicates that high values of one variable are associated with low values of the other.
Figure 1.5 shows examples of variables that have positive correlation
[panel (a)], no correlation [panel (b)], and negative correlation [panel (c)].
2. A good idea is to measure these things and put them in the model so that they are no longer in the error term. That's what we do in Chapter 5.
FIGURE 1.5: Correlation [three panels: (a) positive correlation, (b) no correlation, (c) negative correlation]
Correlations range from −1 to 1. A correlation of 1 means that the variables move perfectly together; a correlation of −1 means they move perfectly in opposite directions.
Correlations close to zero indicate weak relationships between variables.
When the correlation is zero, there is no linear relationship between two variables.3
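These properties are easy to verify numerically; a minimal sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)

y_pos = x + 0.5 * rng.normal(size=1000)    # moves with x
y_neg = -x + 0.5 * rng.normal(size=1000)   # moves against x
y_zero = rng.normal(size=1000)             # no linear relationship to x

# np.corrcoef returns a correlation matrix; entry [0, 1] is corr(x, y)
print(np.corrcoef(x, y_pos)[0, 1])    # strongly positive
print(np.corrcoef(x, y_neg)[0, 1])    # strongly negative
print(np.corrcoef(x, y_zero)[0, 1])   # near zero
```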
We use correlation in our definitions of endogeneity and exogeneity. If our
independent variable has a relationship to the error term like the one in panel (a) of
Figure 1.5 (which shows positive correlation) or in panel (c) (which shows negative
correlation), then we have endogeneity. In other words, we have endogeneity when
the unmeasured stuff that constitutes the error term is correlated with our indepen-
dent variable, and endogeneity will make it difficult to tell whether changes in the
dependent variable are caused by our independent variable or the error term.
On the other hand, if our independent variable has no relationship to the error
term as in panel (b), we have exogeneity. In this case, if we observe Y rising with
X, we can feel confident that X is causing Y.
The challenge is that the true error term is not observable. Hence, much of
what we do in econometrics attempts to get around the possibility that something
3. In Appendix E (page 541), we provide an equation for correlation and discuss how it relates to our ordinary least squares estimates from Chapter 3. Correlation measures linear relationships between variables; we'll discuss non-linear relationships in ordinary least squares on page 221.
unobserved in the error term may be correlated with the independent variable. This
quest makes econometrics challenging and interesting.
As a practical matter, we should begin every analysis by assessing endogene-
ity. First, look away from the model for a moment and list all the things that could
determine the dependent variable. Second, ask if anything on the list correlates
with the independent variable in the model and explain why it might. That’s it. Do
that, and we are on our way to identifying endogeneity.
REMEMBER THIS
1. There are two fundamental challenges in econometrics: randomness and endogeneity.
2. Randomness can produce data that suggests X causes Y even when it does not. Randomness
can also produce data that suggests X does not cause Y even when it does.
3. An independent variable is endogenous if it is correlated with the error term in the model.
(a) An independent variable is exogenous if it is not correlated with the error term in the
model.
(b) The error term is not observable, making it a challenge to know whether an independent
variable is endogenous or exogenous.
(c) It is difficult to assess causality for endogenous independent variables.
Discussion Questions
1. Each panel of Figure 1.6 on page 12 shows relationships among three variables: X is an observed independent variable, ε is a variable reflecting some unobserved characteristic, and Y is the dependent variable. (In our donut example, X corresponds to the number of donuts eaten, ε corresponds to an unobserved characteristic such as exercise, and Y corresponds to the outcome of interest, which is weight.) If an arrow connects X and Y, then X has a causal effect on Y. If an arrow connects ε and Y, then the unobserved characteristic has a causal effect on Y. If a double arrow connects X and ε, then these two variables are correlated (and we won't worry about which causes which).
For each panel, explain whether endogeneity will cause problems for an analysis of the relationship between X and Y. For concreteness, assume X is grades in college, ε is IQ, and Y is salary at age 26.
2. Come up with your own independent variable, unmeasured error variable, and dependent
variable. Decide which of the panels in Figure 1.6 best characterizes the relationship of the
variables you chose, and discuss the implications for econometric analysis.
FIGURE 1.6: Possible Relationships between X, ε, and Y (for Discussion Questions) [eight panels, (a) through (h); each shows X (observed), ε (unobserved), and Y (dependent variable) connected by different arrow patterns]
Deathi = β0 + β1 Flu shoti + εi

where Deathi is a (creepy) variable that is 1 if person i died in the time frame of the study and 0 if he or she did not. Flu shoti is 1 if person i got a flu shot and 0 if not.4
A number of studies have done essentially this analysis and found that people
who get flu shots are less likely to die. According to some estimates, those who
receive flu shots are as much as 50 percent less likely to die. This effect is enormous.
Going home with a Band-Aid that has a little bloodstain is worth it after all.
But are we convinced? Is there any chance of endogeneity? If there exists some
factor in the error term that affected whether someone died and whether he or she
got a flu shot, we would worry about endogeneity.
What is in the error term? Goodness, lots of things affect the probability
of dying: age, health status, wealth, cautiousness—the list is immense. All these
factors and more are in the error term.
How could these factors cause endogeneity? Let’s focus on overall health.
Clearly, healthier people die at a lower rate than unhealthy people. If healthy people
are also more likely to get flu shots, we might erroneously attribute life-saving
power to flu shots when perhaps all that is going on is that people who are healthy
in the first place tend to get flu shots.
It’s hard, of course, to get measures of health for people, so let’s suppose we
don’t have them. We can, however, speculate on the relationship between health
and flu shots. Figure 1.7 shows two possible states of the world. In each figure we
plot flu-shot status on the X-axis. A person who did not get a flu shot is in the 0
group; someone who got a flu shot is in the 1 group. On the Y-axis we plot health
4. We discuss dependent variables that equal only 0 or 1 in Chapter 12 and independent variables that equal 0 or 1 in Chapter 6.
FIGURE 1.7: Two Scenarios for the Relationship between Flu Shots and Health [two panels plotting health (0 to 10 scale, Y-axis) against flu-shot status (0 = no shot, 1 = shot, X-axis)]
related to everything but flu (supposing we could get an index that factors in age,
heart health, absence of disease, etc.). In panel (a) of Figure 1.7, health and flu
shots don’t seem to go together; in other words the correlation is zero. If panel (a)
represents the state of the world, then our results that flu shots are associated with
lower death rates is looking pretty good because flu shots are not reflecting overall
health. In panel (b), health and flu shots do seem to go together, with the flu shot
population being healthier. In this case, we have correlation of our main variable
(flu shots) and something in the error term (health).
Brownlee and Lenzer (2009) discuss some indirect evidence suggesting that flu
shots and health are actually correlated. A clever approach to assessing this matter
is to look at death rates of people in the summer. The flu rarely kills people in the
summer, which means that if people who get flu shots also die at lower rates in the
summer, it is because they are healthier overall. And if people who get flu shots die
at the same rates as others during the summer, it would be reasonable to suggest
that the flu-shot and non-flu-shot populations have similar health. It turns out that
people who get flu shots have an approximately 60 percent lower probability of
dying outside the flu season.
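The logic of this off-season check can be sketched in a simulation (invented numbers, not the cited study's data): if underlying health drives both flu-shot uptake and mortality, the flu-shot group will die at lower rates even in months when the flu kills no one.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Invented scenario: underlying health drives both shot uptake and mortality
health = rng.normal(size=n)
got_shot = (health + rng.normal(size=n)) > 0   # healthier people seek shots

# Summer death probability falls with health; the flu plays no role here
p_death = 1 / (1 + np.exp(2 + health))
summer_death = rng.random(n) < p_death

rate_shot = summer_death[got_shot].mean()
rate_no_shot = summer_death[~got_shot].mean()
print(rate_shot, rate_no_shot)  # shot group dies less with no flu around
```

A lower off-season death rate in the vaccinated group signals selection on health rather than any life-saving power of the shot itself.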
Other evidence backs up the idea that healthier people get flu shots. As it
happened, vaccine production faltered in 2004, and 40 percent fewer people got
vaccinated. What happened? Flu deaths did not increase. And in some years, the flu
vaccine was designed to attack a set of viruses that turned out to be different from
the viruses that actually spread; again, there was no clear change in mortality. This
data suggests that people who get flu shots may live longer because getting flu
shots is associated with other healthy behavior, such as seeking medical care and
eating better.
The point is not to put us off flu shots. We’ve discussed only mortality—whether
people die from the flu—not whether they’re more likely to contract the virus or stay
home from work because they are sick.5 The point is to highlight how hard it is to
really know if something (in this case, a vaccine) works. If something as widespread
and seemingly straightforward as a flu shot is hard to assess definitively, think about
the care we must take when trying to analyze policies that affect fewer people and
have more complicated effects.
Suicide ratesi = β0 + β1 Country musici + εi

where Suicide ratesi is the suicide rate in metropolitan area i and Country musici is the proportion of radio airtime devoted to country music in metropolitan area i.7
It turns out that suicides are indeed higher in metropolitan areas where radio
stations play more country music. But do we believe this is a causal relationship?
5. Demicheli, Jefferson, Ferroni, Rivetti, and Di Pietrantonj (2018) summarize 52 randomized controlled trials of flu vaccines and conclude that the vaccines reduce the incidence of flu in healthy adults from 2.3 to 0.9 percent. The flu vaccine also reduces the incidence of flu-like illness from 21.5 to 18.1 percent. The effect on hospitalization is not large and not statistically significant. There is no evidence of reducing days off of work. See also DiazGranados, Denis, and Plotkin (2012) as well as Osterholm, Kelley, Sommer, and Belongia (2012).
6. Really, this is an actual published paper.
7. Their analysis is based on a more complicated model, but this is the general idea.
(In other words, is country music exogenous?) If radio stations play more country
music, should we expect more suicides?
Let’s work through this example.
What does β0 mean? What does β1 mean? In this model, β0 is the expected level
of suicide in metropolitan areas that play no country music. β1 is the amount by
which suicide rates change for each one-unit increase in the proportion of country
music played in a metropolitan area. We don’t know what β1 is; it could be positive
(suicides increase), zero (no relation to suicides), or negative (suicides decrease). For
the record, we don’t know what β0 is either, but since this variable does not directly
characterize the relationship between music and suicides the way β1 does, we are
less interested in it.
What is in the error term? The error term contains factors that are associated
with higher suicide rates, such as alcohol and drug use, availability of guns, divorce
and poverty rates, lack of sunshine, lack of access to mental health care, and
probably many more.
way, it would be incorrect to conclude that we would save lives by banning country
music.
As it turns out, Snipes and Maguire (1995) account for the amount of guns and
divorce in metropolitan areas and find no relationship between country music and
metropolitan suicide rates. So there’s no reason to turn off the radio and put away
those cowboy boots.
Discussion Questions
1. Labor economists often study the returns on investment in education (see, e.g., Card 1999).
Suppose we have data on salaries of a set of people, some of whom went to college and some
of whom did not. A simple model linking education to salary is

Salaryi = β0 + β1 College graduatei + εi

where the value of Salaryi is the salary of person i and the value of College graduatei is 1 if
person i graduated from college and is 0 if person i did not.
(a) What does β0 mean? What does β1 mean?
(b) What is in the error term?
(c) What are the conditions for the independent variable X to be endogenous?
(d) Is the independent variable likely to be endogenous? Why or why not?
(e) Explain how endogeneity could lead to incorrect inferences.
2. Donuts aren’t the only food that people worry about. Consider the following model based on
Solnick and Hemenway (2011):

Violencei = β0 + β1 Soft drinksi + εi

where Violencei is the number of physical confrontations student i was in during a school year
and Soft drinksi is the average number of cans of soda student i drinks per week.
(a) What does β0 mean? What does β1 mean?
(b) What is in the error term?
(c) What are the conditions for the independent variable X to be endogenous?
(d) Is the independent variable likely to be endogenous? Why or why not?
(e) Explain how endogeneity could lead to incorrect inferences.
3. We know U.S. political candidates spend an awful lot of time raising money. And we know
they use the money to inflict mind-numbing ads on us. Do we know if the money and the ads
it buys actually work? That is, does campaign spending increase vote share? Jacobson (1978),
Erikson and Palfrey (2000), and others have grappled at length with this issue. Consider the
following model:

Vote sharei = β0 + β1 Campaign spendingi + εi

where Vote sharei is the vote share of a candidate in state i and Campaign spendingi is the
spending by candidate i.
(a) What does β0 mean? What does β1 mean?
(b) What is in the error term?
(c) What are the conditions for the independent variable X to be endogenous?
(d) Is the independent variable likely to be endogenous? Why or why not?
(e) Explain how endogeneity could lead to incorrect inferences.
4. Researchers identified every outdoor advertisement in 228 census tracts in Los Angeles and
New Orleans and then interviewed 2,881 residents of the cities about weight. Their results
suggested that a 10 percent increase in outdoor food ads in a neighborhood was associated
with a 5 percent increase in obesity.
(a) Do you think there could be endogeneity?
(b) How would you test for a relationship between food ads and obesity?
(c) Read the article “Does This Ad Make Me Fat?” by Christopher Chabris and Daniel
Simons in the March 10, 2013, issue of the New York Times and see how your answers
compare to theirs.
across groups. Both treated and untreated groups would be virtually identical and
would resemble the composition of the population.
In an experiment like this, the variation in our independent variable X is
exogenous. We have won. If we observe that donut eaters weigh more or have
other health differences from non-eaters of donuts, we can reasonably attribute
these effects to donut consumption.
randomization  The process of determining the experimental value of the key independent variable based on a random process.

Simply put, the goal of such a randomized experiment is to make sure the independent variable, which we also call the treatment, is exogenous. The key element of such experiments is randomization, a process whereby the value of the independent variable is determined by a random process. The value of the independent variable will depend on nothing but chance, meaning that the independent variable will be uncorrelated with everything, including any factor in the error term affecting the dependent variable. In other words, a randomized independent variable is exogenous; analyzing the relationship between an exogenous independent variable and the dependent variable allows us to make inferences about a causal relationship between the two variables.
This is one of those key moments when a concept that may not be very compli-
cated turns out to have enormous implications. By randomly picking some people
to get a certain treatment, we rule out the possibility that there is some other way
for the independent variable to be associated with the dependent variable. If the
randomization is successful, the treated subjects are not systematically taller, more
athletic, or more food conscious—or more left-handed or stinkier, for that matter.
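The point can be sketched in a simulation (invented numbers): assign the donut treatment by coin flip, and it ends up uncorrelated with an error-term factor like ice cream habits, so a simple treatment/control comparison recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# Something in the error term that affects weight (e.g., ice cream habits)
ice_cream = rng.normal(size=n)

# Randomize the treatment: a coin flip decides who gets the donut regimen
donuts = rng.integers(0, 2, size=n)

# True treatment effect is +6 pounds
weight = 150 + 6.0 * donuts + 5.0 * ice_cream + rng.normal(size=n)

# Randomization makes the treatment uncorrelated with the error-term factor
print(np.corrcoef(donuts, ice_cream)[0, 1])   # near zero

# So the simple treatment-minus-control difference is a clean estimate
effect = weight[donuts == 1].mean() - weight[donuts == 0].mean()
print(effect)  # near the true effect of 6
```

Contrast this with the earlier endogeneity simulation: the only change is who decides the value of X, and that change is what buys exogeneity.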
randomized controlled trial  An experiment in which the treatment of interest is randomized.

treatment group  In an experiment, the group that receives the treatment of interest.

control group  In an experiment, the group that does not receive the treatment of interest.

The basic structure of a randomized experiment, often referred to as a randomized controlled trial, is simple. Based on our research question, we identify a relevant population that we randomly split into two groups: a treatment group, which receives the policy intervention, and a control group, which does not. After the treatment, we compare the behavior of the treatment and control groups on the outcome we care about. If the treatment group differs substantially from the control group, we believe the treatment had an effect; if not, then we're inclined to think the treatment had no effect.8

For example, suppose we want to know if an ad campaign increases enrollment in ObamaCare. We would identify a sample of uninsured people and split them into a treatment group that is exposed to the campaign and a control group that is not. After the treatment, we compare the enrollment in ObamaCare of the treatment and control groups. If the treated group enrolled at a substantially higher rate, that outcome would suggest the campaign works.

Because they build exogeneity into the research, randomized experiments
are often referred to as the gold standard for causal inference. The phrase “gold
standard” usually means the best of the best. But experiments also merit the gold
standard moniker in another sense. No country in the world is actually on a gold
standard. The gold standard doesn’t work well in practice, and for many research
questions, neither do experiments. Simply put, experiments are great, but they can
be tricky when applied to real people going about their business.
8. We provide standards for making such judgments in Chapter 3 and beyond.
The human element of social scientific experiments makes them very different
from experiments in the physical sciences. My third grader’s science fair project
compared cucumber seeds planted in peanut butter and in dirt. She did not have to
worry that the cucumber seeds would get up and say, “There is NO way you are
planting me in that.” In the social sciences, though, people can object, not only to
being planted in peanut butter but also to things like watching TV commercials,
attending a charter school, changing health care plans, or pretty much anything
else we might want to study with an experiment.
Therefore, an appreciation of the virtues of experiments should come with
a recognition of their limits. We devote Chapter 10 to discussing the analytical
challenges that accompany experiments. No experiment should be designed
without thinking through these issues, and every experiment should be judged by
how well it deals with them.
Social scientific experiments can’t answer all social scientific research
questions for other reasons as well. The first is that experiments aren’t always
feasible. The financial costs of many experiments are beyond what most major
research organizations can fund, let alone what a student doing a term paper can
afford. And for many important questions, it’s not a matter of money. Do we
want to know if corruption promotes civil unrest? Good luck with our proposal to
randomly end corruption in some countries and not others. Do we want to know
if birthrates affect crime? Are we really going to randomly assign some regions
to have more babies? While the randomizing process could get interesting, we’re
unlikely to pull it off. Or do we want to know something historical? Forget about
an experiment.9
And even if an experiment is feasible, it might not be ethical. We see this
dilemma most clearly in medicine: If we believe a given treatment is better but
are not sure, how ethical is it to randomly subject some people to a procedure that
might not work? The medical community has developed standards relating to level
of risk and informed consent by patients, but such questions will never be easy to
answer.
Consider (again) flu shots. We may think that assessing the efficacy of this
public health measure is a situation made for a randomized experiment. It would
be expensive but conceptually simple. Get a bunch of people who want a flu shot,
tell them they are participating in a random experiment, and randomly give some
a flu shot and the others a placebo shot. Wait and see how the two groups do.
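The assignment step in this hypothetical trial is mechanically simple. Here is a minimal sketch in Python (the book's own examples use Stata and R; the participant names, fixed seed, and 50/50 split are illustrative assumptions, not from the text):

```python
import random

def assign_groups(participants, seed=42):
    # Shuffle a copy of the participant list, then split it in half:
    # the first half gets the flu shot, the second half the placebo.
    # The seed is fixed only so the illustration is reproducible.
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

people = ["person_%d" % i for i in range(100)]
flu_shot, placebo = assign_groups(people)
print(len(flu_shot), len(placebo))  # 50 50
```

Because assignment depends only on the shuffle, nothing about a participant influences which group she lands in; that is what makes the treatment exogenous.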
But would such a randomized trial of flu vaccine be ethical? When we say
“Wait and see how the two groups do,” we actually mean “Wait and see who dies.”
9 The range of randomized controlled trials can be astounding, though, ranging from a study of
layoffs (randomized!) (Heinz, Jeworrek, Mertins, Schumacher, and Sutter 2017) to a study of
epidural pain-relief for women in childbirth (Shen, Li, Xu, Wang, Fan, Qin, Zhou, and Hess 2017).
Here’s how I picture the randomized epidural study went down:
Doctor: About your pain relief during labor. Or should I say [makes air quote gesture] “pain
relief”. . .
Post-delivery mother: [punches doctor in nose]
Doctor: Ok, well yeah, that’s fair . . .
1.3 Randomized Experiments as the Gold Standard 21
That changes the stakes a bit, doesn’t it? The public health community strongly
believes in the efficacy of the flu vaccine and, given that belief, considers it
unethical to deny people the treatment. Brownlee and Lenzer (2009) recount in
The Atlantic how one doctor first told interviewers that a randomized trial might
be acceptable, then got cold feet and called back to say that such an experiment
would be unethical.10
generalizable: A statistical result is generalizable if it applies to populations beyond the sample in the analysis.

Finally, experimental results may not be generalizable. That is, a specific experiment may provide great insight into the effect of a given policy intervention at a given time and place, but how sure can we be that the same policy intervention will work somewhere else? Jim Manzi, the author of Uncontrolled (2012), argues
that the most honest way to describe experimental results is that treatment X was
effective in a certain time and place in which the subjects had the characteristics
they did and the policy was implemented by people with the characteristics
they had. Perhaps people in different communities respond to treatments dif-
ferently. Or perhaps the scale of an experiment could matter: a treatment that
worked when implemented on a small scale might fail if implemented more
broadly.
internal validity: A research finding is internally valid when it is based on a process that is free from systematic error.

external validity: A research finding is externally valid when it applies beyond the context in which the analysis was conducted.

Econometricians make this point by distinguishing between internal validity and external validity. Internal validity refers to whether the inference is biased; external validity refers to whether an inference applies more generally. A well-executed experiment will be internally valid, meaning that the results will on average lead us to make the correct inferences about the treatment and its outcome in the context of the experiment. In other words, with internal validity, we can say confidently that our research design will not systematically lead us astray (even as randomness could point to incorrect conclusions for any given analysis). Even with internal validity, however, an experiment may not be externally valid: the causal relationship between the treatment and the outcome could differ in other contexts. That is, even if we have internally valid evidence from an experiment that aardvarks in Alabama procreate more if they listen to Mozart, we can't really be sure aardvarks in Alaska will respond in the same way.
observational studies: Use data generated in an environment not controlled by a researcher. They are distinguished from experimental studies and are sometimes referred to as non-experimental studies.

Hence, even as experiments offer a conceptually clear approach to defeating endogeneity, they cannot always offer the final word for economic, policy, and political research. Therefore, most scholars in most fields need to grapple with non-experimental data. Observational studies use data that has been generated by non-experimental processes. In contrast to randomized experiments in which a researcher controls at least one of the variables, in observational studies the data is what it is, and we do the best we can to analyze it in a sensible way. Endogeneity will be a chronic problem, but we are not totally defenseless in the fight against it. Even if we have only observational data, the techniques explained in this book can help us achieve, or at least approximate, the exogeneity promised by randomized experiments.

10 Another flu researcher cited in the article came to the opposite conclusion, saying, "What do you do when you have uncertainty? You test . . . We have built huge, population-based policies on the flimsiest of scientific evidence. The most unethical thing to do is to carry on business as usual."
REMEMBER THIS
1. Experiments create exogeneity via randomization.
2. Social science experiments are complicated by practical challenges associated with the
difficulty of achieving randomization and full participation.
3. Experiments are not always feasible, ethical, or generalizable.
4. Observational studies use non-experimental data. They are necessary to answer many
questions.
Discussion Questions
1. Is it possible to have a non-random exogenous independent variable?
2. Think of a policy question of interest. Discuss how an experiment might work to address the
question.
3. Does foreign aid work? How should we create an experiment to assess whether aid to very poor
countries works? What might some of the challenges be?
4. Do political campaigns matter? How should we create an experiment to assess whether phone
calls, mailings, and visits by campaign workers matter? What might some of the challenges be?
5. How are health and medical spending affected when people have to pay each time they see
a doctor? How should we create an experiment to assess whether the amount of co-payments
(payments tendered at every visit to a doctor) affects health costs and quality? What might some
of the challenges be?
Conclusion
The point of econometric research is almost always to learn if X (the independent
variable) causes Y (the dependent variable). If we see high values of Y when
X is high and low values of Y when X is low, we might be tempted to think
X causes Y. We need always to be aware that the observed relationship could
have arisen by chance. Or, if X is endogenous, we need to remember that
interpreting the relationship between X and Y as causal could be wrong, possibly
completely wrong. When another factor both causes Y and is correlated with X,
any relationship we see between X and Y may be due to the effect of that other
factor.
We spend the rest of this book accounting for uncertainty and battling
endogeneity. Some approaches, like randomized experiments, seek to create
exogenous change. Other econometric approaches, like multivariate regression,
winnow down the number of other factors lurking in the background that can
cause endogeneity. These and other approaches have strengths, weaknesses,
tricks, and pitfalls. However, they all are united by a fundamental concern with
counteracting endogeneity. Therefore, if we understand the concepts in this
chapter, we understand the essential challenges of using econometrics to better
understand policy, economics, and politics.
Based on this chapter, we are on the right track if we can do the following:
• Section 1.2: Explain how randomness can make causal inference challeng-
ing, and explain how endogeneity can undermine causal inference.
Key Terms
Constant (4) External validity (21) Randomized controlled trial
Control group (19) Generalizable (21) (19)
Correlation (9) Independent variable (2) Scatterplot (3)
Dependent variable (2) Intercept (4) Slope coefficient (4)
Endogenous (8) Internal validity (21) Treatment group (19)
Error term (5) Observational studies (21)
Exogenous (9) Randomization (19)
2 Stats in the Wild: Good Data Practices
[Figure: two-panel bar chart of real GDP growth (percent) by government debt category (0−30%, 30−60%, 60−90%, above 90% of GDP), panels (a) and (b).]
Growth didn't plummet once government debt passed 90 percent of GDP. While people
can debate whether the slope in panel (b) is a bunny hill or an intermediate hill, it
clearly is nothing like the cliff in the data originally reported.1
Reinhart and Rogoff’s discomfort can be our gain when we realize that even
top scholars can make data mistakes. Hence, we need to create habits that help
us minimize mistakes and maximize the chance that others can find them if
we do.
This chapter focuses on the crucial first steps for any econometric analysis.
First, we need to understand our data. Section 2.1 introduces tools for describing
data and sniffing out possible errors or anomalies. Second, we need to be
prepared to convince others. If others can’t recreate our results, people shouldn’t
1 A deeper question is whether we should treat this observational data as having any causal force.
Government debt levels are probably related to other factors that affect economic growth, like wars
and the quality of a country’s institutions. In other words, government debt likely is endogenous,
meaning that we probably can’t draw any conclusions about the effects of debt on growth without
implementing techniques we cover later in this book.
believe them. Therefore, Section 2.2 helps us establish good habits so that our
code is understandable to ourselves and others. Finally, we sure as heck aren’t
going to do all this work by hand. Therefore, Section 2.3 introduces two major
statistical software programs, Stata and R. This chapter is short because we’ll also
be spending time getting used to our software.
2 Chris Achen (1982, 53) memorably notes, "If the information has been coded by nonprofessionals
and not cleaned at all, as often happens in policy analysis projects, it is probably filthy.”
3 Appendix C contains more details (page 539). Here's a quick refresher. The standard deviation of X is a measure of the dispersion of X. The larger the standard deviation, the more spread out the values. Standard deviation is calculated as √((1/N) Σi (Xi − X̄)²), where X̄ is the mean of X. We record how far each observation is from the mean. We then square each value because for the purposes of calculating dispersion, we don't distinguish whether a value is below the mean or above it; when squared, all these values become positive numbers. We record the average of these squared values. Finally, since they're squared values, taking the square root of the average brings the final value back to the scale of the original variable.
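The footnote's recipe can be followed step by step in code. A quick Python sketch (the data values below are made up for illustration):

```python
import math

def std_dev(x):
    # Follow the footnote's steps: distance of each observation from
    # the mean, squared, averaged, then square-rooted to bring the
    # result back to the original scale. Dividing by N gives the
    # population standard deviation, as in the formula above.
    mean = sum(x) / len(x)
    squared_devs = [(xi - mean) ** 2 for xi in x]
    return math.sqrt(sum(squared_devs) / len(x))

print(std_dev([2, 4, 4, 4, 5, 5, 7, 9]))  # 2.0
```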
2.1 Know Our Data 27
REMEMBER THIS
1. A useful first step toward understanding data is to review sample size, mean, standard deviation,
and minimum and maximum for each variable.
2. Plotting data is useful for identifying patterns and anomalies in data.
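As a sketch of that first step, the five summary statistics can be computed in a few lines of Python (the toy values stand in for a real variable; the Computing Corner shows the Stata and R equivalents):

```python
import statistics

# Toy values standing in for a variable such as donuts consumed
# (made up for illustration, not the book's Springfield data).
donuts = [0, 1, 3, 5, 8, 12, 20]

summary = {
    "n": len(donuts),
    "mean": statistics.mean(donuts),
    "std_dev": statistics.pstdev(donuts),  # population SD, dividing by N
    "min": min(donuts),
    "max": max(donuts),
}
print(summary)
```

Scanning the minimum and maximum is often the quickest way to spot an impossible value, such as a negative age or a 100 in a 0/1 variable.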
[Figure 2.2: Scatterplot of weight (in pounds) against donuts eaten, with Springfield residents labeled by name (Homer, Comic Book Guy, Chief Wiggum, Marge, Bart, Lisa, and others).]
2.2 Replication
replication: Research that meets a replication standard can be duplicated based on the information provided at the time of publication.

replication files: Files that document exactly how data is gathered and organized. When properly compiled, these files allow others to reproduce our results exactly.

At the heart of scientific knowledge is replication. Research that meets a replication standard can be duplicated based on the information provided at the time of publication. In other words, an outsider who used that information would produce identical results.

We need replication files to satisfy this standard. Replication files document exactly how data is gathered and organized. Properly constructed, these files allow others to check our work by following our steps and seeing if they get identical results.

Replication files also enable others to probe our analysis. Sometimes, often in fact, statistical results hinge on seemingly small decisions about what data to include, how to deal with missing data, and so forth. People who really care about getting the answer right will want to see what we've done to our data and, realistically, will be wary until they determine for themselves that other reasonable ways of doing the analysis produce similar results. If a certain coding or statistical
4 We analyze this data on page 74.
REMEMBER THIS
1. Analysis that cannot be replicated cannot be trusted.
2. Replication files document data sources and methods that someone could use to exactly recreate
the analysis in question from scratch.
3. Replication files also allow others to explore the robustness of results by enabling them to assess
alternative approaches to the analysis.
5 Despite the fact that more people live in Washington, DC, than in Vermont or Wyoming! Or so says
the resident of Washington, DC . . .
[Figure: three scatterplots of the violent crime rate (per 100,000 people), with states labeled by postal abbreviation; DC appears as a high outlier in each panel.]
FIGURE 2.3: Scatterplots of Violent Crime against Percent Urban, Single Parent, and Poverty
reality of the urbanization variable helps us better appreciate what the data is
telling us.
Being aware of the data can help us detect possible endogeneity. Many of the
states showing high single-parent populations and high poverty are in the South.
If this leads us to suspect that southern states are distinctive in other social and
political characteristics, we should be on high alert for potential endogeneity in any
analysis that uses the poverty or single-parent variable. These variables capture not
only poverty and single parenthood, but also “southernness.”
REMEMBER THIS
1. Stata is a powerful statistical software program. It is relatively user friendly, but it can be
expensive.
2. R is another powerful statistical software program. It is less user friendly, but it is free.
Conclusion
This chapter prepares us for analyzing real data. We begin by understanding our
data. This vital first step makes sure that we know what we’re dealing with. We
should use descriptive statistics to get an initial feel for how much data we have
and the scales of the variables. Then we should graph our data. It’s a great way to
appreciate what we’re dealing with and to spot interesting patterns or anomalies.
The second step of working with data is documenting our data and analysis.
Social science depends crucially on replication. Analyses that cannot be replicated
6 In the Further Reading section at the end of the chapter, we indicate some good sources for learning
Stata and R and mention some other statistical packages in use.
cannot be trusted. Therefore, all statistical projects should document data and
methods, ensuring that anyone (including the author!) can recreate all results.
We are on track with the key concepts in this chapter when we can do the
following:
• Section 2.2: Explain the importance of replication and the two elements of
a replication file.
• Section 2.3 (and Computing Corner that follows): Do basic data description
in Stata and R.
Further Reading
King (1995) provides an excellent discussion of the replication standard.
Data visualization is a growing field, with good reason, as analysts increas-
ingly communicate primarily via figures. Tufte (2001) is a landmark book.
Schwabish (2004) and Yau (2011) are nice guides to graphics.
Chen, Ender, Mitchell, and Wells (2003) is an excellent online resource for
learning Stata. Gaubatz (2015) is an accessible and comprehensive introduction to
R. Other resources include Verzani (2004) and online tutorials.
Other programs are widely used as well. EViews is a powerful program often
chosen by those doing forecasting models (see eviews.com). Some people use
Excel for basic statistical analysis. It’s definitely useful to have good Excel skills,
but to do serious analysis, most people will need a more specialized program.
Key Terms
Codebook (29) Replication files (28) Standard deviation (26)
Replication (28) Robust (30)
Computing Corner
Stata
• The first thing to know is what to do when we get stuck (when, not if).
In Stata, type help commandname if you have questions about a certain
command. For example, to learn about the summarize command, we can
type help summarize to get a description of that command. Probably the
most useful information comes in the form of the examples at the end of
these files. Often the best approach is to find an example that seems closest
to what we’re trying to do and apply that example to the problem. Googling
usually helps, too.
• A comment line is a line in the code that provides notes for the user. A
comment line does not actually tell Stata to do anything, but it can be
incredibly useful to clarify what is going on in the code. Comment lines
in Stata begin with an asterisk (*). Using ** makes it easier to visually
identify these crucial lines.
• One of the hardest parts of learning new statistical software is loading data
into a program. While some data sets are prepackaged and easy, many
are not, especially those we create ourselves. Be prepared for the process
of loading data to take longer than expected. And because data sets can
sometimes misbehave (columns shifting in odd ways, for example), it is
very important to use the descriptive statistics diagnostics described in this
chapter to make sure the data is exactly what we think it is.
– To load Stata data files (which have .dta at the end of the file name), use the use command:
use "C:\Users\SallyDoe\Documents\DonutData.dta"
The "path" tells the computer where to find the file. In this example, the path is C:\Users\SallyDoe\Documents\. The exact path depends on a computer's file structure.
– To load raw text data files (such as .raw or .csv files), use the insheet command:
insheet using "C:\Users\SallyDoe\Documents\DonutsData.raw"
• To see a list of variables loaded into Stata, look at the variable window that
lists all variables. We can also click on Data – Data editor to see variables.
• To make sure the data loaded correctly, display it with the list command.
To display the first 10 observations of all variables, type list in 1/10. To
display the first eight observations of only the weight variable, type list
weight in 1/8. We can also look at the data in Stata’s “Data Browser”
by going to Data/Data editor in the toolbar.
• To see descriptive statistics on the weight and donut data as in Table 2.1,
use summarize weight donuts.
• To produce a frequency table such as Table 2.2, type tabulate male. Use
this command only for variables that take on a limited number of possible
values.
• Use the if subcommand to limit the data used in Stata analyses. The
syntax list name if male == 1 will list the names of individuals who
are male. The syntax list name if male != 1 will list the names of
individuals who are not male. The syntax list name if male == 1 &
age > 18 will list the names of individuals who are male and over 18.
The syntax list name if male == 1 | age > 18 will list the names
of individuals who are male or over 18.
• To plot the weight and donut data as in Figure 2.2, type scatter weight donuts. There are many options for creating figures. For example, to plot the weight and donut data for males only with labels from a variable called "name," type scatter weight donuts if male == 1, mlabel(name).
R
• To get help in R, type ?commandname for questions about a certain
command. For questions about the mean command, type ?mean to get
a description of the command, options, and most importantly, examples.
Often the best approach is to find an example that seems closest to what
we’re trying to do and apply that example to the problem. Googling usually
helps, too.
• Comment lines in R begin with a pound sign (#). Using ## makes it easier
to visually identify these crucial lines.
• To open a syntax file where we document our analysis, click on File – New
script. It’s helpful to resize this window to be able to see both the commands
and the results. Save the syntax file as “SomethingSomething.R”; the more
informative the name, the better. Including the date in the file name aids
version control. To run any command in the syntax file, highlight the whole
line and then press ctrl-r. The results of the command will be displayed in
the R console window.
• To load R data files (which have .RData at the end of the file name), the
easiest option is to save the file to your computer and then to use the File –
Load Workspace menu option in the R console (where we see results) and
browse to the file. You will see the R code to load the data in the R console
and can paste that to your syntax file.
• Loading non-R data files (files that are in .txt or other such format) requires
more care. For example, to read in data that has commas between variables
on each line, use read.table with the sep option:
RawData = read.table("C:/Users/SallyDoe/Documents/DonutData.raw", header=TRUE, sep=",")
(In R, write file paths with forward slashes or doubled backslashes; single backslashes as in Stata paths will cause errors.) This command saves variables as RawData$VariableName (e.g., RawData$weight, RawData$donuts). It is also possible to install special commands that load in various types of data. For example, search the Web for "read.dta" to see more information on how to install a special command that reads Stata files directly into R.
• To make sure the data loaded correctly, use the following tools to display
the data in R:
1. Use the objects() command to show the variables and objects loaded
into R.
• There are many useful tools to limit the sample. The syntax donuts[male
== 1] tells R to use only values of donuts for which male equals 1. The
syntax donuts[male != 1] tells R to use only values of donuts for which
male does not equal 1. The syntax donuts[male == 1 & age > 18]
tells R to use only values of donuts for which male equals 1 and age is
7 R can load variables directly such that each variable has its own variable name. Or it can load
variables as part of data frames such that the variables are loaded together. For example, our
commands to load the .RData file loaded each variable separately, while our commands to load data
from a text file created an object called “RawData” that contains all the variables. To display a
variable in the “RawData” object called “donuts,” type RawData$donuts in the .R file, highlight it,
and press ctrl-r. This process may take some getting used to, but if you experiment freely with any
data set you load, it should become second nature.
greater than 18. The syntax donuts[male == 1 | age > 18] tells R to
use only values of donuts for which male equals 1 or age is greater than 18.
• To plot the weight and donut data as in Figure 2.2, type plot(donuts,
weight). For example, to plot the weight and donut data for males only
with labels from a variable called “name,” type
plot(donuts[male == 1], weight[male == 1])
text(donuts[male == 1], weight[male == 1], name[male == 1]).
There are many options for creating figures.8
Exercises
1. The data set DonutDataX.dta contains data from our donuts example on
page 26. There is one catch: each of the variables has an error. Use the
tools discussed in this chapter to identify the errors.
year Year
medals Total number of combined medals won
athletes Number of athletes in Olympic delegation
GDP Gross domestic product of country (per capita GDP in $10,000 U.S. dollars)
temp Average high temperature (in Fahrenheit) in January if country is in Northern
Hemisphere or July if Southern Hemisphere (for largest city)
population Population of country (in 100,000)
8 To get a flavor of plotting options, use text(donuts[male == 1], weight[male == 1],
name[male == 1], cex=0.6, pos=4) as the second line of the plot sequence of code. The cex
command controls the size of the label, and the pos=4 puts the labels to the right of the plotted point.
Refer to the help menus in R, or Google around for more ideas.
(b) List the first five observations for the country, year, medals, athletes,
and GDP data.
(e) Explain any suspicion you might have that other factors could
explain the observed relationship between the number of athletes and
medals.
(f) Create a scatterplot of medals and GDP. Briefly describe any clear
patterns.
(a) Summarize the wage, height (both height85 and height81), and
sibling variables. Discuss briefly.
TABLE 2.7 Variables for Height and Wage Data in the United States
Variable name Description
(c) Create a scatterplot of wages and adult height that excludes the
observations with wages above $500 per hour.
4. Anscombe (1973) created four data sets that had interesting properties.
Let’s use tools from this chapter to describe and understand these data
sets. The data is in a Stata data file called AnscombesQuartet.dta. There are
four possible independent variables (X1–X4) and four possible dependent
variables (Y1–Y4). Create a replication file that reads in the data and
implements the analysis necessary to answer the following questions.
Include comment lines that explain the code.
(a) Briefly note the mean and variance for each of the four X variables.
Briefly note the mean and variance for each of the four Y variables.
Based on these, would you characterize the four sets of variables as
similar or different?
(b) Create four scatterplots: one with X1 and Y1, one with X2 and Y2,
one with X3 and Y3, and one with X4 and Y4.
(c) Briefly explain any differences and similarities across the four
scatterplots.
PART I
1 The figure is an updated version of a figure in Noel (2010). The figure plots vote share as a percent
of the total votes given to Democrats and Republicans only. We use these data to avoid the
complication that in some years, third-party candidates such as Ross Perot (in 1992 and 1996) or
George Wallace (in 1968) garnered non-trivial vote share.
2 In the late nineteenth century, Francis Galton used the term regression to refer to the phenomenon
that children of very tall parents tended to be less tall than their parents. He called this phenomenon
“regression to the mean” in heights of children because children of tall parents tend to “regress”
(move back) to average heights. Somehow the term regression bled over to cover statistical methods
for analyzing relationships between dependent and independent variables. Go figure.
46 CHAPTER 3 Bivariate OLS: The Foundation of Econometric Analysis
[Figure: scatterplot of the incumbent party's vote percent (roughly 45 to 62) against income growth in percent (−1 to 6), with each point labeled by election year, 1948 through 2016.]
FIGURE 3.1: Relationship between Income Growth and Vote for the Incumbent President’s Party,
1948–2016
The OLS model allows us to quantify the relationship between two variables
and to assess whether the relationship occurred by chance or resulted from some
real cause. We build on these methods in the rest of the book in ways that help us
differentiate, as best we can, true causes from simple associations.
In this chapter, we learn how to draw a regression line and understand
the statistical properties of the OLS model. Section 3.1 shows how to estimate
coefficients in an OLS model and how those coefficients relate to the regression
line we can draw in scatterplots of our data. Section 3.2 demonstrates that the
OLS coefficient estimates are themselves random variables. Section 3.3 explains
one of the most important concepts in statistics: the OLS estimates of β1 will be
biased if X is endogenous. That is, the estimates will be systematically higher
or lower than the true values if the independent variable is correlated with the
error term. Section 3.4 shows how to characterize the precision of the OLS
estimates. Section 3.5 shows how the distribution of the OLS estimates converges to a point as the sample size gets very, very large. Section 3.6 discusses issues that complicate the calculation of the precision of our estimates. These issues have intimidating names like heteroscedasticity and autocorrelation. Their bark is worse than their bite, however, and statistical software can easily address them. Finally, Sections 3.7 and 3.8 discuss tools for assessing how well the model fits the data and whether any unusual observations could distort our conclusions.

3.1 Bivariate Regression Model

Yi = β0 + β1Xi + εi (3.1)
where Yi is the dependent variable and Xi is the independent variable. The parameter β0 is the intercept (or constant). It indicates the expected value of Y when Xi is zero. The parameter β1 is the slope. It indicates how much Y changes as X changes. The random error term εi captures everything else other than X that affects Y.
Adapting the generic bivariate equation to the presidential election example produces

Incumbent party vote sharei = β0 + β1Income changei + εi (3.2)

where Incumbent party vote sharei is the dependent variable and Income changei is
the independent variable. The parameter β0 indicates the expected vote percentage
for the incumbent when income change equals zero. The parameter β1 indicates
how much more we expect vote share to rise as income change increases by one
unit.
This model is an incredibly simplified version of the world. The data will not
fall on a completely straight line because elections are affected by many other
factors, ranging from wars to scandals to social issues and so forth. These factors
comprise our error term, εi.
For any given data set, OLS produces estimates of the β parameters that best
explain the data. We indicate estimates as β̂ 0 and β̂ 1 , where the “hats” indicate
that these are our estimates. Estimates are different from the true values, β0 and
β1 , which don’t get hats in our notation.3
How can these parameters best explain the data? The β̂’s define a line with an
intercept (β̂ 0 ) and a slope (β̂ 1 ). The task boils down to picking a β̂ 0 and β̂ 1 that
define the line that minimizes the aggregate distance of the observations from the
line. To do so, we use two concepts: the fitted value and the residual.
fitted value: A fitted value, Ŷi, is the value of Y predicted by our estimated equation. For a bivariate OLS model it is Ŷi = β̂0 + β̂1Xi. Also called predicted value.

The fitted value is the value of Y predicted by our estimated equation. The fitted value Ŷ (which we call "Y hat") from our bivariate OLS model is

Ŷi = β̂0 + β̂1Xi (3.3)

Note the differences from Equation 3.1: there are lots of hats and no εi. This is the equation for the regression line defined by the estimated β̂0 and β̂1 parameters and Xi.
regression line The
fitted line from a A fitted value tells us what we would expect the value of Y to be given the
regression. value of the X variable for that observation. To calculate a fitted value for any value
of X, use Equation 3.3. Or, if we plot the line, we can simply look for the value of
the regression line at that value of X. All observations with the same value of Xi
will have the same Ŷi , which is the fitted value of Y for observation i. Fitted values
are also called predicted values.
residual: The difference between the fitted value and the observed value.

A residual measures the distance between the fitted value and an actual observation. In the true model, the error, εi, is that part of Yi not explained by β0 + β1 Xi. The residual is the estimated counterpart to the error. It is the portion of Yi not explained by β̂0 + β̂1 Xi (notice the hats). If our coefficient estimates exactly equaled the true values, then the residual would be the error; in reality, of course, our estimates β̂0 and β̂1 will not equal the true values β0 and β1, meaning that our residuals will differ from the error in the true model.

The residual for observation i is ε̂i = Yi − Ŷi. Equivalently, we can say a residual is ε̂i = Yi − β̂0 − β̂1 Xi. We indicate residuals with ε̂ ("epsilon hat"). As with the β's, a Greek letter with a hat is an estimate of the true value. The residual ε̂i is distinct from εi, which is how we denote the true, but not directly observed, error.
Estimation
The OLS estimation strategy is to identify values of β̂ 0 and β̂ 1 that define
a line that minimizes the sum of the squared residuals. We square the resid-
uals because we want to treat a residual of +7 (as when an observed Yi is
7 units above the fitted line) as equally undesirable as a residual of −7 (as when
an observed Yi is 7 units below the fitted line). Squaring the residuals converts all
residuals to positive numbers. Our +7 residual and −7 residual observations will
both register as +49 in the sum of squared residuals.
3
Another common notation is to refer to estimates with regular letters rather than Greek letters
(e.g., b0 and b1 ). That’s perfectly fine, too, of course, but we stick with the hat notation for
consistency throughout this book.
3.1 Bivariate Regression Model 49
Specifically, the expression for the sum of squared residuals for any given estimates β̂0 and β̂1 is

Σᵢ₌₁ᴺ ε̂i² = Σᵢ₌₁ᴺ (Yi − β̂0 − β̂1 Xi)²
The OLS process finds the β̂ 1 and β̂ 0 that minimize the sum of squared
residuals. The "squares" in "ordinary least squares" come from the fact that we're squaring the residuals. The "least" bit comes from minimizing the sum of squares. The word "ordinary" indicates that we haven't progressed to anything
fancy yet.
As a practical matter, we don’t need to carry out the minimization
ourselves—we can leave that to the software. The steps are not that hard, though,
and we step through a simplified version of the minimization task in Chapter 14
(page 494). This process produces specific equations for the OLS estimates of β̂ 0
and β̂ 1 . These equations provide estimates of the slope (β̂ 1 ) and intercept (β̂ 0 )
combination that characterizes the line that best fits the data.
The OLS estimate of β̂1 is

β̂1 = [Σᵢ₌₁ᴺ (Xi − X̄)(Yi − Ȳ)] / [Σᵢ₌₁ᴺ (Xi − X̄)²]   (3.4)

where X̄ (read as "X bar") is the average value of X and Ȳ is the average value of Y.
Equation 3.4 shows that β̂1 captures how much X and Y move together. The numerator is Σᵢ₌₁ᴺ (Xi − X̄)(Yi − Ȳ). The first bit inside the sum is the difference of X from its mean for the ith observation; the second bit is the difference of Y from its mean for the ith observation. The product of these bits is summed over observations. So, if Y tends to be above its mean [meaning (Yi − Ȳ) is positive] when X is above its mean [meaning (Xi − X̄) is positive], there will be a bunch of positive elements in the sum in the numerator. If Y tends to be below its mean [meaning (Yi − Ȳ) is negative] when X is below its mean [meaning (Xi − X̄) is negative], we'll also get positive elements in the sum because a negative number times a negative number is positive. Such observations will also push β̂1 to be positive.
On the other hand, β̂1 will be negative when the signs of Xi − X̄ and Yi − Ȳ are mostly opposite. For example, if X is above its mean [meaning (Xi − X̄) is positive] when Y is below its mean [meaning (Yi − Ȳ) is negative], we'll get negative elements in the sum, and β̂1 will tend to be negative.4
4 There is a close affinity between the regression coefficient in bivariate OLS and covariance and correlation. By using the equations for variance and covariance from Appendices C and D (pages 539 and 540), we see that Equation 3.4 can be rewritten as cov(X, Y)/var(X). The relationship between covariance and correlation can be used to show that Equation 3.4 can equivalently be written as corr(X, Y) × σY/σX, which indicates that the bivariate regression coefficient is simply a rescaled correlation coefficient.
The OLS estimate of the intercept is

β̂0 = Ȳ − β̂1 X̄   (3.5)

We focus on the equation for β̂1 because this is the parameter that defines the relationship between X and Y, which is what we usually care most about.
For the presidential election data, OLS produces the estimated equation

Incumbent party vote sharei = β̂0 + β̂1 × Income changei
                            = 46.1 + 2.2 × Income changei
Figure 3.2 shows what these coefficient estimates mean. The β̂ 1 estimate
implies that the incumbent party’s vote percentage went up by 2.2 percentage
points for each one-percent increase in income. The β̂ 0 estimate implies that the
expected election vote share for the incumbent president’s party for a year with
zero income growth was 46.1 percent.
Table 3.1 and Figure 3.3 show predicted values and residuals for specific
presidential elections. In 2016, income growth was low (at 0.69 percent). The
value of the dependent variable for 2016 was the vote share of Hillary Clinton,
who, as a Democrat, was in the same party as the incumbent president, Barack
Obama. Hillary Clinton received 51.1 percent of the vote. The fitted value, denoted
by a triangle in Figure 3.3, is 46.1 + 2.2 × 0.69 = 47.6. The residual, which
is the difference between the actual and fitted, is 51.1 − 47.6 = 3.5 percent.
In other words, Hillary Clinton did 3.5 percentage points better than would
be expected based on the regression line in 2016. Think of that as her
“Trump bump.”
We can go through the same process to understand the fitted values and residuals displayed in Figure 3.3 and Table 3.1. In 2000, the fitted value based
on the regression line is 46.1 + 2.2 × 3.87 = 54.6. The residual, which is the
difference between the actual and the fitted, is 50.2 − 54.6 = −4.4 percent. The
negative residual means that Al Gore, who, as a Democrat, was the candidate
of the incumbent president’s party, did 4.4 percentage points worse than would
be expected based on the regression line. In 1964, the Democrats controlled the
presidency at the time of the election, and they received 61.3 percent of the
vote when Democrat Lyndon Johnson trounced Republican Barry Goldwater.
The correlation coefficient indicates the strength of the association, while the bivariate regression
coefficient indicates the effect of a one-unit increase in X on Y. It’s a good lesson to remember. We all
know “correlation does not imply causation”; this little nugget tells us that bivariate regression (also!)
does not imply causation. Appendix E provides additional details (page 541).
FIGURE 3.2: Elections and Income Growth with Model Parameters Indicated. The scatterplot shows the incumbent party's vote percent against the percent change in income for elections from 1948 to 2016, with the regression line's intercept (β0 = 46.1) and slope (β1 = 2.2) labeled.
The fitted value based on the regression line is 46.1 + 2.2 × 5.63 = 58.5. The residual, which is the difference between the actual and the fitted, is 61.3 − 58.5 = 2.8 percent. In other words, in 1964 the incumbent president's party did 2.8 percentage points better than would be expected based on the regression line.
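The fitted-value and residual arithmetic above is easy to check by hand. Here is a short Python sketch (the chapter's Computing Corner uses Stata and R; Python is used here only to keep the example self-contained) that reproduces the numbers in the text:

```python
# Recompute the fitted values and residuals discussed in the text,
# using the estimated intercept (46.1) and slope (2.2) from the chapter.
b0, b1 = 46.1, 2.2

elections = {
    # year: (income growth, incumbent-party vote share), from the text
    2016: (0.69, 51.1),
    2000: (3.87, 50.2),
    1964: (5.63, 61.3),
}

for year, (x, y) in elections.items():
    fitted = b0 + b1 * x   # Y-hat = b0 + b1 * X
    residual = y - fitted  # residual = actual minus fitted
    print(year, round(fitted, 1), round(residual, 1))
```

Running this recovers the fitted values 47.6, 54.6, and 58.5 and the residuals 3.5, −4.4, and 2.8 discussed above.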
FIGURE 3.3: Fitted Values and Residuals for Observations in Table 3.1. The plot shows the incumbent party's vote percent against the percent change in income, with the fitted value and residual for 1964 and the residual for 2016 marked.
REMEMBER THIS
1. The bivariate regression model is
   Yi = β0 + β1 Xi + εi
2. OLS chooses the estimates β̂0 and β̂1 that minimize the sum of squared residuals,
   Σᵢ₌₁ᴺ ε̂i² = Σᵢ₌₁ᴺ (Yi − β̂0 − β̂1 Xi)²
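Equations 3.4 and 3.5 can be applied directly. The Python sketch below uses made-up data (not the book's election data set) to compute β̂1 and β̂0 by hand and checks them against a library routine that minimizes the same sum of squared residuals:

```python
import numpy as np

# Hypothetical data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Equation 3.4: slope = sum((Xi - Xbar)(Yi - Ybar)) / sum((Xi - Xbar)^2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# Equation 3.5: intercept = Ybar - b1 * Xbar
b0 = y.mean() - b1 * x.mean()

residuals = y - (b0 + b1 * x)
print(b1, b0, np.sum(residuals ** 2))  # slope, intercept, sum of squared residuals

# Sanity check: np.polyfit performs the same least-squares minimization.
assert np.allclose(np.polyfit(x, y, 1), [b1, b0])
```

A side effect of the minimization is that the residuals automatically average to zero, a fact used later in footnote 16.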
modeled randomness: Variation attributable to inherent variation in the data-generation process. This source of randomness exists even when we observe data for an entire population.

Second, our estimates will have modeled randomness. Think again of the population of ferrets. Even if we were to get data on every last one of them, our model has random elements. The ferret sleep patterns (the dependent variable) are subject to randomness that goes into the error term. Maybe one ferret had a little too much celery, another got stuck in a drawer, and yet another broke up with his girlferret. Unmeasured factors denoted by εi affect ferret sleep, and having data on every single ferret would not change that fact.
In other words, there is inherent randomness in the data-generation process
even when data is measured for an entire population. So, even if we observe a
complete population at any given time, thus eliminating any sampling variation,
we will have randomness due to the data-generation process. Put another way,
virtually every model has some unmeasured component that explains some of
the variation in our dependent variable, and the modeled-randomness perspective
highlights this.
random variable: A variable that takes on values in a range and with the probabilities defined by a distribution.

An OLS estimate β̂1 inherits randomness, whether from sampling or modeled randomness. The estimate β̂1 is therefore a random variable—that is, a variable that takes on a set of possible different values, each with some probability. An easy way to see why β̂1 is random is to note that the equation for β̂1 (Equation 3.4) depends on the values of the Yi's, which in turn depend on the εi values, which themselves are random.
Distributions of β̂ estimates

distribution: The range of possible values for a random variable and the associated relative probabilities for each value.

probability distribution: A graph or formula that gives the probability for each possible value of a random variable.

To understand these random β̂1's, it is best to think of the distribution of β̂1. That is, we want to think about the various values we expect β̂1 to take and the relative likelihood of these values.
Let's start with random variables more generally. A random variable with discrete outcomes can take on one of a finite set of specific outcomes. The flip of a coin or roll of a die yields a random variable with discrete outcomes. These random variables have probability distributions. A probability distribution is a graph or formula that identifies the probability for each possible value of a random variable.
Many probability distributions of random variables are intuitive. We all know the distribution of a coin toss: heads with 50 percent probability and tails with 50 percent probability. Panel (a) of Figure 3.4 plots this distribution, with the outcome on the horizontal axis and the probability on the vertical axis. We also know the distribution of the roll of a six-sided die. There is a 1/6 probability of seeing each of the six numbers on it, as panel (b) of Figure 3.4 shows. These are examples of random variables with a specific number of possible outcomes: two (as with a coin toss) or six (as with a roll of a die).
continuous variable: A variable that takes on any possible value over some range.

This logic of distributions extends to continuous variables. A continuous variable is a variable that can take on any value in some range. Weight in our donut example from Chapter 1 is essentially a continuous variable. Because weight can be measured to a very fine degree of precision, we can't simply say there is some specific number of possible outcomes. We don't identify a probability
3.2 Random Variation in Coefficient Estimates 55
for each possible outcome for continuous variables because there is an unlimited number of possible outcomes. Instead we identify a probability density, which is a graph or formula that describes the relative probability that a random variable is near a specified value for the range of possible outcomes for the random variable.

probability density: A graph or formula that describes the relative probability that a random variable is near a specified value.

FIGURE 3.4: Four Distributions. Panel (a) shows the coin-toss distribution, panel (b) the six-sided die distribution, panel (c) a normal probability density, and panel (d) an irregular probability density with two peaks.
normal distribution: A bell-shaped probability density that characterizes the probability of observing outcomes for normally distributed random variables.

Probability densities run the gamut from familiar to weird. On the familiar end of things is a normal distribution, which is the classic bell curve in panel (c) of Figure 3.4. This plot indicates the probability of observing realizations of the random variable in any given range. For example, since half of the area of the density shown in panel (c) is less than zero, we know that there is a 50 percent chance that this particular normally distributed random variable will be less than zero. Because the probability density is high in the middle and low on the ends, we can say, for example, that the normal random variable plotted in panel (c) is more likely to take on values around zero than values around −4. The odds of observing values around +1 or −1 are still reasonably high, but the odds of observing values near +3 or −3 are small.
Probability densities for random variables can have odd shapes, as in panel
(d) of Figure 3.4, which shows a probability density for a random variable that has
its most likely outcomes near 64 and 69.5 The point of panel (d) is to make it clear
that not all continuous random variables follow the bell-shaped distribution. We
could draw a squiggly line, and if it satisfied a few conditions, it, too, would be a
valid probability distribution.
If the concept of probability densities is new to you (or you are rusty on the
idea), read more on probability densities in Appendix F starting on page 541. The
normal density in particular will be important for us. Appendix G explains how
to work with the normal distribution, something that we will see again in the next
chapter.
5
The distribution of adult heights measured in inches looks something like this. What explains the
two bumps in the distribution?
6
If the errors in the model (the ’s) are normally distributed, then the β̂ 1 values will be normally
distributed no matter what the sample size is. Therefore, in small samples, if we could make
ourselves believe the errors are normally distributed, that belief would be a basis for treating the β̂ 1
values as coming from a normal distribution. Unfortunately, many people doubt that errors are
normally distributed in most empirical models. Some statisticians therefore pour a great deal of
energy into assessing whether errors are normally distributed (just Google “normality of errors”). But
we don’t need to worry about this debate as long as we have a large sample.
7
Some technical assumptions are necessary. For example, the “distribution” of the values of the error
term cannot consist solely of a single number.
3.3 Endogeneity and Bias 57
more 1s than usual, and those averages will tend to be closer to 3. Crucially, the
shape of the distribution will look more and more like a normal distribution the
larger our sample of averages gets.
Even though the central limit theorem is about averages, it is relevant for OLS.
Econometricians deriving the distribution of β̂ 1 invoke the central limit theorem
to prove that β̂ 1 will be normally distributed for a sufficiently large sample size.8
What sample size is big enough for the central limit theorem and, therefore,
normality to kick in? There is no hard-and-fast rule, but the general expectation is
that around 100 observations is enough. If we have data with some really extreme
outliers or other pathological cases, we may need a larger sample size. Happily,
though, the normality of the β̂ 1 distribution is a reasonable approximation even
for data sets with as few as 100 observations. Exercise 2 at the end of this chapter
provides a chance to see distributions of coefficients for ourselves.
REMEMBER THIS
1. Randomness in coefficient estimates can be the result of
• Sampling variation, which arises due to variation in the observations selected into the
sample. Each time a different random sample is analyzed, a different estimate of β̂ 1 will be
produced even though the population (or “true”) relationship is fixed.
• Modeled variation, which arises because of inherent uncertainty in outcomes. Virtually
any data set has unmeasured randomness, whether the data set covers all observations in a
population or some subsample (random or not).
2. The central limit theorem implies the β̂ 0 and β̂ 1 coefficients will be normally distributed random
variables if the sample size is sufficiently large.
8 One way to see why is to think of the OLS equation for β̂1 as a weighted average of the dependent variable. That's not super obvious, but if we squint our eyes and look at Equation 3.4, we see that we could rewrite it as β̂1 = Σᵢ₌₁ᴺ wi (Yi − Ȳ), where wi = (Xi − X̄) / Σᵢ₌₁ᴺ (Xi − X̄)². (We have to squint really hard!) In other words, we can think of the β̂1's as a weighted sum of the Yi's, where wi is the weight (and we happen to subtract the mean of Y from each Yi). It's not too hard to get from a weighted sum to an average. Doing so opens the door for the central limit theorem (which is, after all, about averages) to work its magic and establish that β̂1 will be normally distributed for large samples.
FIGURE 3.5: Distribution of β̂1. The probability density of β̂1 is centered at the true value, β1.

If the distribution of β̂1 happens to be quite wide, even though the average is the true value, we might still observe values of β̂1 that are far from the true value, β1.
Think of the figure skating judges at the Olympics. Some are biased—perhaps
blinded by nationalism or wads of cash—and they systematically give certain
skaters higher or lower scores than the skaters deserve. Other judges (most?) are
not biased. Still, these judges do not get the right answer every time.9 Sometimes
an unbiased judge will give a score that is higher than it should be, and sometimes
a score that is lower. Similarly, an OLS regression coefficient β̂ 1 that qualifies as
an unbiased estimate of β1 can be too high or too low in a given application.
Here are two thought experiments that shed light on unbiasedness. First, let’s
approach the issue from the sampling-randomness framework from Section 3.2.
Suppose we select a sample of people, measure some dependent variable Yi and
independent variable Xi for each, and use those to estimate the OLS β̂ 1 . We write
that down and then select another sample of people, get the data, estimate the
OLS model again, and write down the new estimate of β̂ 1 . The new estimate will
be different because we’ll have different people in our data set. Repeat the process
again and again, write down all the different β̂ 1 ’s, and then calculate the average
of the estimated β̂ 1 ’s. While any given realization of β̂ 1 could be far from the true
value, we will call the estimates unbiased if the average of the β̂ 1 ’s is the true
value, β1 .
We can also approach the issue from the modeled-randomness framework
from Section 3.2. Suppose we generate our data. We set the true β1 and β0 values
as some specific values. We also fix the value of Xi for each observation. Then we
draw the εi for each observation from some random distribution. These values will
come together in our standard equation to produce values of Y that we then use in
the OLS equation for β̂ 1 . Then we repeat the process of generating random error
terms (while keeping the true β and X values the same). Doing so produces another
set of Yi values and a different OLS estimate for β̂ 1 . We keep running this process
a bunch of times, writing down the β̂ 1 estimates from each run. If the average of
the β̂ 1 ’s we have recorded is equal to the true value, β1 , then we say that β̂ 1 is an
unbiased estimator of β1 .
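The modeled-randomness thought experiment translates directly into a simulation. In this Python sketch (all numbers invented), the true β's and the X values stay fixed while the errors are redrawn each run, exactly as described above:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fix the true parameters and the X values, redraw only the errors,
# and average the resulting OLS slope estimates.
beta0, beta1 = 1.0, 3.0
x = rng.uniform(0, 5, 50)  # X held fixed across repetitions

draws = []
for _ in range(4000):
    eps = rng.normal(0, 2, 50)      # fresh errors each run
    y = beta0 + beta1 * x + eps
    draws.append(np.polyfit(x, y, 1)[0])  # OLS slope estimate

# Any single draw can miss badly, but the average of the draws
# sits at the true value, which is what unbiasedness means.
print(np.mean(draws))
```

The exogeneity condition holds here by construction, since the errors are drawn independently of X; the next passage shows what goes wrong when it fails.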
OLS does not automatically produce unbiased coefficient estimates. A crucial
condition must be satisfied for OLS estimates to be unbiased: the error term cannot
be correlated with the independent variable. The exogeneity condition, which we
discussed in Chapter 1, is at the heart of everything. If this condition is violated,
then something in the error term is correlated with our independent variable and
will contaminate the observed relationship between X and Y. In other words, while
observing large values of Y associated with large values of X naturally inclines us
to think X pushes Y higher, we worry that something in the error term that is big
when X is big is actually what is causing Y to be high. In that case, the relationship
between X and Y is spurious, and the real causal influence is that unidentified factor
in the error term.
9
We’ll set aside for now the debate about whether a right answer even exists. Let’s imagine there is a
score that judges would on average give to a performance if the skater’s identity were unknown.
Violent crimet = β0 + β1 Ice cream salest + εt

where violent crime in period t is the dependent variable and ice cream sales
in period t is the independent variable. We’d find that β̂ 1 is greater than zero,
suggesting crime is indeed higher when ice cream sales go up.
Does this relationship mean that ice cream is causing crime? Maybe. But
probably not. OK, no, it doesn’t. So what’s going on? There are a lot of factors
in the error term, and one of them is probably truly associated with crime and
correlated with ice cream sales. Any guesses?
Heat. Heat makes people want ice cream and, it turns out, makes them cranky
(or gets them out of doors) such that crime goes up. Hence, a bivariate OLS model
with just ice cream sales will show a relationship, but because of endogeneity, this
relationship is really just correlation, not causation.
Characterizing bias
As a general matter, we can say that as the sample size gets large, the estimated
coefficient will on average be off by some function of the correlation between the
included variable and the error term. We show in Chapter 14 (page 495) that the
expected value of our bivariate OLS estimate is

E[β̂1] = β1 + corr(X, ε) × (σε / σX)   (3.8)

where E[β̂1] is short for the expectation of β̂1,11 corr(X, ε) is the correlation of X and ε, σε (the lowercase Greek letter sigma) is the standard deviation of ε, and σX is the standard deviation of X. The fraction at the end of the equation is more of a normalizing factor, so we don't need to worry too much about it.12
The key thing is the correlation of X and ε. The bigger this correlation, the further the expected value of β̂1 will be from the true value. Or, in other words, the more the independent variable and the error are correlated, the more biased OLS will be.
Much of the rest of this book centers on what to do if the correlation of X and ε is not zero. The ideal solution is to use randomized experiments
10 Why would we ever wonder that? Work with me here . . .
11 Expectation is a statistical term that essentially refers to the average value over many realizations of a random variable. We discuss the concept in Appendix C on page 539.
12 If we use corr(X, ε) = cov(X, ε)/(σX σε), we can write Equation 3.8 as E[β̂1] = β1 + cov(X, ε)/σX², where cov is short for covariance.
3.4 Precision of Estimates 61
for which corr(X1, ε) is zero by design. But in the real world, experiments often fall prey to challenges discussed in Chapter 10. For observational studies, which are more common than experiments, we'll discuss lots of tricks in the rest of this book that help us generate unbiased estimates even when corr(X1, ε) is non-zero.
REMEMBER THIS
1. The distribution of an unbiased estimator is centered at the true value, β1.
2. The OLS estimator β̂1 is a biased estimator of β1 if X and ε are correlated.
3. If X and ε are correlated, the expected value of β̂1 is β1 + corr(X, ε) × (σε / σX).
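Equation 3.8 can also be checked by simulation. In this Python sketch (an invented data-generation process, purely illustrative), X and the error share a common component, so OLS overstates the true slope by almost exactly corr(X, ε) × σε/σX:

```python
import numpy as np

rng = np.random.default_rng(2)

# X and the error both depend on a common factor z, violating exogeneity.
beta1, n = 2.0, 10_000
z = rng.normal(size=n)
x = z + rng.normal(size=n)     # X partly driven by z
eps = z + rng.normal(size=n)   # ...and so is the error
y = beta1 * x + eps

b1_hat = np.polyfit(x, y, 1)[0]  # OLS slope estimate

# Bias predicted by Equation 3.8: corr(X, eps) * sd(eps) / sd(X)
predicted_bias = np.corrcoef(x, eps)[0, 1] * eps.std() / x.std()
print(b1_hat - beta1, predicted_bias)  # both near 0.5 with this setup
```

In the sample, the gap between β̂1 and the true β1 matches the predicted bias to numerical precision, because both reduce to cov(X, ε)/var(X) (footnote 12).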
FIGURE 3.6: Two Distributions with Different Variances of β̂1. Both probability densities are centered at β1; one has a smaller variance and the other a larger variance.
The variance and standard error of an estimate contain the same information, just in different form, as the variance is simply the standard deviation squared. We'll see later that it is often more convenient to use standard errors to characterize the precision of estimates because they are on the same scale as the independent variable (meaning, for example, that if X is measured in feet, we can interpret the standard error in terms of feet as well).13
We prefer β̂ 1 to have a smaller variance. With a smaller variance, values close
to the true value are more likely, meaning we’re less likely to be far off when we
generate the β̂ 1 . In other words, our bowl of estimates will be less likely to have
wacky stuff in it.
Under the right conditions, we can characterize the variance (and, by
extension, the standard error) of β̂ 1 with a simple equation. We discuss the
conditions on page 67. If they are satisfied, the estimated variance of β̂ 1 for a
bivariate regression is
var(β̂1) = σ̂² / (N × var(X))   (3.9)
This equation tells us how wide our distribution of β̂ 1 is.14 We don’t need to
calculate the variance of β̂ 1 by hand. That is, after all, why we have computers.
13
The difference between standard errors and standard deviations can sometimes be confusing. The
standard error of a parameter estimate is the standard deviation of the sampling distribution of the
parameter estimate.
14
We derive a simplified version of the equation on page 499 in Chapter 14.
The σ̂² in the numerator of Equation 3.9 is estimated as

σ̂² = Σᵢ₌₁ᴺ (Yi − Ŷi)² / (N − k) = Σᵢ₌₁ᴺ ε̂i² / (N − k)   (3.10)
which is (essentially) the average squared deviation of fitted values of Y from the actual values. It's not quite an average because the denominator is N − k rather than N. The N − k in the denominator is the degrees of freedom, where k is the number of variables (including the constant) in the model.15

degrees of freedom: The sample size minus the number of parameters. It refers to the amount of information we have available to use in the estimation process.

The numerator of Equation 3.10 indicates that the more each individual observation deviates from its fitted value, the higher σ̂² will be. The estimated σ̂² is also an estimate of the variance of ε in our core model, Equation 3.1.16
Next, look at the denominator of the variance of β̂1 (Equation 3.9). It is N × var(X). Yawn. There are, however, two important substantive facts in there. First, the bigger the sample size (all else equal), the smaller the variance of β̂1. In other words, more data means lower variance. More data is a good thing.
Second, we see that the variance of X reduces the variance of β̂1. The variance of X is calculated as Σᵢ₌₁ᴺ (Xi − X̄)² / N. This puts the variance of β̂1 on the same scale as the variance of the X variable. It is also the case that the more our X variable varies, the more precisely we will be able to learn about β1.17
15
For bivariate regression, k = 2 because we estimate two parameters (β̂ 0 and β̂ 1 ). We can think of
the degrees of freedom correction as a penalty for each parameter we estimate; it’s as if we use up
some information in the data with each parameter we estimate and cannot, for example, estimate
more parameters than the number of observations we have. If N is large enough, the k in the
denominator will have only a small effect on the estimate of σ̂ 2 . For small samples, the degrees of
freedom issue can matter more. Every statistical package will get this right, and the core intuition is
that σ̂ 2 measures the average squared distance between actual and fitted values.
16 Recall that the variance of ε̂ will be Σᵢ₌₁ᴺ (ε̂i − mean(ε̂))² / N. The OLS minimization process automatically creates residuals with an average of zero (meaning mean(ε̂) = 0). Hence, the variance of the residuals reduces to Equation 3.10.
17 Here we're assuming a large sample. If we had a small sample, we would calculate the variance of X with a degrees of freedom correction such that it would be Σᵢ₌₁ᴺ (Xi − X̄)² / (N − 1).
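Equations 3.9 and 3.10 involve only sums, so the variance and standard error of β̂1 can be computed by hand. A Python sketch with made-up data (the particular numbers are invented; any small data set works for the arithmetic):

```python
import numpy as np

# Hypothetical data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.9, 3.1, 4.8, 5.2, 6.9])
n, k = len(x), 2  # k = 2 parameters: the intercept and the slope

# OLS estimates via Equations 3.4 and 3.5.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

sigma2_hat = np.sum(resid ** 2) / (n - k)  # Equation 3.10
var_x = np.mean((x - x.mean()) ** 2)       # large-sample variance of X
var_b1 = sigma2_hat / (n * var_x)          # Equation 3.9
se_b1 = np.sqrt(var_b1)                    # standard error of the slope
print(b1, se_b1)
```

Note that N × var(X) equals Σᵢ (Xi − X̄)², so Equation 3.9 is the familiar σ̂²/Σᵢ (Xi − X̄)² in disguise.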
FIGURE 3.7: Four Scatterplots (for Review Questions). Panels (a) through (d) each plot a dependent variable against an independent variable, for use with the review questions below.
Review Questions
1. Will the variance of β̂ 1 be smaller in panel (a) or panel (b) of Figure 3.7? Why?
2. Will the variance of β̂ 1 be smaller in panel (c) or panel (d) of Figure 3.7? Why?
3.5 Probability Limits and Consistency 65
REMEMBER THIS
1. The variance of β̂ 1 measures the width of the β̂ 1 distribution. If the conditions discussed later
in Section 3.6 are satisfied, then the estimated variance of β̂ 1 is
var(β̂1) = σ̂² / (N × var(X))
FIGURE 3.8: Distributions of β̂1 for Different Sample Sizes. The probability density of β̂1 is plotted for N = 10, N = 100, and N = 1,000; the density narrows around β1 as the sample size grows.
consistency: A consistent estimator is one for which the distribution of the estimate gets closer and closer to the true value as the sample size increases. The OLS estimate β̂1 consistently estimates β1 if X is uncorrelated with ε.

As the sample size grows, the distribution of β̂1 narrows toward a vertical line at the true value. If we had an infinite number of observations, we would get the right answer every time. That may be cold comfort if we're stuck with a sad little data set of 37 observations, but it's awesome when we have 100,000 observations.
Consistency is an important property of OLS estimates. An estimator, such as OLS, is a consistent estimator if the distribution of β̂1 estimates shrinks to be closer and closer to the true value, β1, as we get more data. If the exogeneity condition is true, then β̂1 is a consistent estimator of β1.18 Formally, we say

plim β̂1 = β1   (3.11)
REMEMBER THIS
1. The probability limit of an estimator is the value to which the estimator converges as the sample
size gets very, very large.
2. When the error term and X are uncorrelated, OLS estimates of β are consistent, meaning that
plim β̂ = β.
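Consistency can be seen in a simulation like the one behind Figure 3.8: as N grows, the spread of the β̂1 estimates shrinks toward zero. A Python sketch (invented parameter values, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Draw many samples of size n from a fixed true model (beta0 = 1, beta1 = 0.5)
# and record the OLS slope from each sample.
def slope_draws(n, reps=2000):
    out = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(0, 10, n)
        y = 1.0 + 0.5 * x + rng.normal(0, 3, n)
        out[r] = np.polyfit(x, y, 1)[0]  # OLS slope estimate
    return out

results = {}
for n in (10, 100, 1000):
    d = slope_draws(n)
    results[n] = (d.mean(), d.std())
    print(n, round(d.mean(), 3), round(d.std(), 4))

# The mean stays near the true value 0.5 at every sample size, while the
# standard deviation shrinks roughly with the square root of N: the
# distribution collapses toward beta1, which is what consistency means.
```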
19
The two best things you can say about an estimator are that it is unbiased and that it is consistent.
OLS estimators are both unbiased and consistent when the error is uncorrelated with the independent
variable and there are no post-treatment variables in the model (something we discuss in Chapter 7).
These properties seem pretty similar, but they can be rather different. These differences are typically
only relevant in advanced statistical work. For reference, we discuss in the citations and notes section
on page 556 examples of estimators that are unbiased but not consistent, and vice versa.
Homoscedasticity
The first condition for Equation 3.9 to be appropriate is that the variance of εi must be the same for every observation. That is, once we have taken into account the effect of our measured variable (X), the expected degree of uncertainty in the model must be the same for all observations. If this condition holds, the variance of the error term is the same for low values of X as for high values of X. This condition gets a fancy name, homoscedasticity. "Homo" means same. "Scedastic" (yes, that's a word) means variance. Hence, errors are homoscedastic when they all have the same variance.

homoscedastic: Describing a random variable having the same variance for all observations.

heteroscedastic: A random variable is heteroscedastic if the variance differs for some observations.

When errors violate this condition, they are heteroscedastic, meaning that the variance of εi is different for at least some observations. That is, some observations are on average closer to the predicted value than others. Imagine, for example, that we have data on how much people weigh from two sources: some people weighed themselves with a state-of-the-art scale, and others had a guy at a state fair guess their weight. Definite heteroscedasticity there, as the weight estimates on the scale would be very close to the truth (small errors), and the weight estimates from the fair dude will be further from the truth (large errors).
heteroscedasticity-consistent standard errors: Standard errors for the coefficients in OLS that are appropriate even when errors are heteroscedastic.

Violating the homoscedasticity condition doesn’t cause OLS β̂₁ estimates to be biased. It simply means we shouldn’t use Equation 3.9 to calculate the variance of β̂₁. Happily for us, the intuitions we have discussed so far about what causes var(β̂₁) to be big or small still hold, and there are relatively simple procedures for this case. We show how to generate these heteroscedasticity-consistent standard errors in Stata and R in the Computing Corner of this chapter (pages 83 and 86). This approach to accounting for heteroscedasticity does not affect the values of the β̂ estimates.²⁰
²⁰ The equation for heteroscedasticity-consistent standard errors is ugly. If you must know, it is

var(β̂₁) = [ Σᵢ (Xᵢ − X̄)² ε̂ᵢ² ] / [ Σᵢ (Xᵢ − X̄)² ]²    (3.12)

This is less intuitive than Equation 3.9, so we do not emphasize it. As it turns out, we derive heteroscedasticity-consistent standard errors in the course of deriving the standard errors that assume homoscedasticity (see Chapter 14, page 499). Heteroscedasticity-consistent standard errors are also
referred to as robust standard errors (because they are robust to heteroscedasticity) or as
Huber-White standard errors. Another approach to dealing with heteroscedasticity is to use
“weighted least squares.” This approach is more statistically efficient, meaning that the variance of
the estimate will theoretically be lower. The technique produces β̂ 1 estimates that differ from the
OLS β̂ 1 estimates. We point out references with more details on weighted least squares in the Further
Reading section at the end of this chapter.
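Equation 3.12 is easy to compute directly. The sketch below is ours (in Python rather than the chapter's Stata and R): it computes the classical standard error (σ̂² divided by Σ(Xᵢ − X̄)², one common way to write Equation 3.9) alongside the heteroscedasticity-consistent version. This is the basic White form; Stata and R apply small-sample adjustments such as HC1, so packaged numbers will differ slightly.

```python
import math

def ols_fit(x, y):
    """Return (intercept, slope) from bivariate OLS."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

def standard_errors(x, y):
    """Return (classical SE, heteroscedasticity-consistent SE) for the slope."""
    n = len(x)
    b0, b1 = ols_fit(x, y)
    xbar = sum(x) / n
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sxx = sum((xi - xbar) ** 2 for xi in x)
    # Classical: sigma^2-hat / sum((xi - xbar)^2), with sigma^2-hat = SSR/(n - 2)
    sigma2 = sum(e ** 2 for e in resid) / (n - 2)
    se_classical = math.sqrt(sigma2 / sxx)
    # Equation 3.12: sum((xi - xbar)^2 e_i^2) / [sum((xi - xbar)^2)]^2
    var_robust = sum((xi - xbar) ** 2 * e ** 2
                     for xi, e in zip(x, resid)) / sxx ** 2
    return se_classical, math.sqrt(var_robust)

# Made-up data where the spread of y around the line grows with x
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.1, 2.0, 3.4, 3.6, 5.9, 5.0, 8.5, 6.9]
print(standard_errors(x, y))
```

Note that only the standard errors change between the two columns of such a comparison; the slope itself comes from the same ols_fit either way, matching the text's point that robust standard errors do not affect the β̂ estimates.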
There are two fairly common situations in which errors are correlated. The
first involves clustered errors. Suppose, for example, we’re looking at test scores
of all eighth graders in California. It is possible that the unmeasured factors in the
error term cluster by school. Maybe one school attracts science nerds and another
attracts jocks. If such patterns exist, then knowing the error term for a kid in a
school gives some information about the error terms of other kids in the same
school, which means errors are correlated. In this case, the school is the “cluster,”
and errors are correlated within the cluster. It’s inappropriate to use Equation 3.9
when errors are correlated.
This sounds worrisome. And it is, but not terribly so. As with heteroscedasticity, violating the condition that errors must not be correlated doesn’t cause an OLS β̂₁ estimate to be biased. Correlated errors only render Equation 3.9 inappropriate.
So what should we do if errors are correlated? Get a better equation for the
variance of β̂ 1 ! It’s actually a bit more complicated than that, but it is possible to
derive the variance of β̂ 1 when errors are correlated within cluster. We simply note
the issue here and use the computational procedures discussed in the Computing
Corner to deal with clustered standard errors.

time series data: Consists of observations for a single unit over time. Time series data is typically contrasted with cross-sectional and panel data.

Correlated errors are also common in time series data—that is, data on a specific unit over time. Examples include U.S. growth rates since 1945 or data on annual attendance at New York Yankees games since 1913. Errors in time series data are frequently correlated in a pattern we call autocorrelation. Autocorrelation occurs when the error in one time period is correlated with the error in the previous time period.
autocorrelation: Errors are autocorrelated if the error in one time period is correlated with the error in the previous time period. Autocorrelation is common in time series data.

Correlated errors can occur in time series when an unmeasured variable in the error term is sticky, such that a high value in one year implies a high value in the next year. Suppose, for example, we are modeling annual U.S. economic growth since 1945 and we lack a variable for technological innovation (which is very hard to measure). If technological innovation was in the error term boosting the economy in one year, it probably did some boosting to the error term the next year. Similar autocorrelation is likely in many time series data sets, ranging from average temperature in Tampa over time to monthly Frisbee sales in Frankfurt.
As with the other issues raised in this section, autocorrelation does not
cause bias. Autocorrelation only renders Equation 3.9 inappropriate. Chapter 13
discusses how to generate appropriate estimates of the variance of β̂ 1 when errors
are autocorrelated.
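Autocorrelated errors are easy to generate and to spot. In this sketch (our own illustration, in Python rather than the chapter's Stata and R), the error follows εₜ = ρεₜ₋₁ + νₜ with ρ = 0.8, so the sample correlation between the error and its one-period lag comes out close to ρ.

```python
import random

def lag1_corr(e):
    """Sample correlation between e_t and e_{t-1}."""
    a, b = e[1:], e[:-1]
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5

random.seed(7)
rho = 0.8
errors = [random.gauss(0, 1)]
for _ in range(4999):
    # This period's error carries over 0.8 of last period's error.
    errors.append(rho * errors[-1] + random.gauss(0, 1))
print(round(lag1_corr(errors), 2))  # prints a value close to rho = 0.8
```

Setting rho = 0 would make the lag-1 correlation hover near zero, which is the no-autocorrelation case Equation 3.9 assumes.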
It is important to keep these conditions in perspective. Unlike the exogeneity
condition (that X and the errors are uncorrelated), we do not need the homoscedas-
ticity and uncorrelated-errors conditions for unbiased estimates. When these
conditions fail, we simply do some additional steps to get back to a correct
equation for the variance of β̂ 1 . Violations of these conditions may seem to
be especially important because they have fancy labels like “heteroscedasticity”
and “autocorrelation.” They are not. The exogeneity condition matters much
more.
REMEMBER THIS
1. The standard equation for the variance of β̂ 1 (Equation 3.9) requires errors to be homoscedastic
and uncorrelated with each other.
• Errors are homoscedastic if their variance is constant. When errors are heteroscedastic, the
variance of errors is different across observations.
• Correlated errors commonly occur in clustered data in which the error for one observation
is correlated with the error for another observation from the same cluster (e.g., a
school).
• Correlated errors are also common in time series data where errors are autocorrelated,
meaning the error in one period is correlated with the error in the previous period.
2. Violating the homoscedasticity or uncorrelated-error conditions does not bias OLS coefficients.
Discussion Questions
Come up with an example of an interesting relationship you would like to test.
We shouldn’t worry too much about goodness of fit, however, as we can have useful, interesting
results from models with poor fit and biased, useless results from models with
great fit.
Standard error of the regression (σ̂)
We’ve already seen one goodness of fit measure, the variance of the regression
(denoted as σ̂ 2 ). One limitation with this measure is that the scale is not intuitive.
For example, if our dependent variable is salary, the variance of the regression will
be measured in dollars squared (which is odd).
standard error of the regression: A measure of how well the model fits the data. It is the square root of the variance of the regression.

Therefore, the standard error of the regression is commonly used as a measure of goodness of fit. It is simply the square root of the variance of the regression and is denoted as σ̂. It corresponds, roughly, to the average distance of observations from fitted values. The scale of this measure will be the same units as the dependent variable, making it much easier to relate to.

The trickiest thing about the standard error of the regression may be that it goes by so many different names. Stata refers to σ̂ as the root mean squared error (or root MSE for short); root refers to the square root and MSE to mean squared error, which is how we calculate σ̂², or the mean of the squared residuals. R refers to σ̂ as the residual standard error because it is the estimated standard error for the errors in the model based on the residuals.
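The relationship among these names can be checked in a few lines. This sketch is ours (in Python rather than the chapter's Stata and R). One wrinkle: the text describes σ̂² as the mean of the squared residuals, while Stata's root MSE and R's residual standard error divide the sum of squared residuals by the degrees of freedom, N − k (here N − 2), rather than by N; the code below uses the N − 2 divisor.

```python
import math

def regression_sigma_hat(x, y):
    """sigma-hat from a bivariate fit, using the N - 2 degrees-of-freedom divisor."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    b0 = ybar - b1 * xbar
    ssr = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(ssr / (n - 2))

# Made-up data; sigma-hat is in the same units as y (dollars if y is salary)
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
print(round(regression_sigma_hat(x, y), 3))
```

Because σ̂ shares the dependent variable's units, this number can be read directly as a rough "typical miss" of the fitted values.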
R2
Finally, a very common measure of goodness of fit is R2 , so named because it
is a measure of the squared correlation of the fitted values and actual values.21
Correlation is often indicated with an “r,” so R2 is simply the square of this
value. (Why one is lowercase and the other is uppercase is one of life’s little
mysteries.) The value of R2 also represents the percent of the variation in the
dependent variable explained by the included independent variables in the linear
model.
If the model explains the data well, the fitted values will be highly correlated
with the actual values and R2 will be high. If the model does not explain the data
well, the fitted values will not correlate very highly with the actual values and R2
will be near zero. Possible values of R2 range from 0 to 1.
²¹ This interpretation works only if an intercept is included in the model, which it usually is.
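Both readings of R² described above, the squared correlation of fitted and actual values and the share of variance explained, can be checked numerically. In this sketch (ours, in Python rather than the chapter's Stata and R), the two computations agree, as they must for OLS with an intercept.

```python
def r_squared(x, y):
    """Return R-squared two ways: (1 - SSR/SST, squared corr of fitted and actual)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    fitted = [b0 + b1 * xi for xi in x]
    # Version 1: share of the variation in y explained by the model
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - ybar) ** 2 for yi in y)
    r2_variance = 1 - ssr / sst
    # Version 2: squared correlation between fitted and actual y
    fbar = sum(fitted) / n
    cov = sum((fi - fbar) * (yi - ybar) for fi, yi in zip(fitted, y))
    var_f = sum((fi - fbar) ** 2 for fi in fitted)
    r2_corr = cov ** 2 / (var_f * sst)
    return r2_variance, r2_corr

x = [1, 2, 3, 4, 5, 6]
y = [1.2, 2.1, 2.8, 4.5, 4.9, 6.3]
v1, v2 = r_squared(x, y)
print(round(v1, 4), round(v2, 4))  # the two versions match
```

Dropping the intercept breaks the equivalence, which is exactly the caveat in the footnote above.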
R2 values often help us understand how well our model predicts the dependent
variable, but the measure may be less useful than it seems. A high R2 is neither
necessary nor sufficient for an analysis to be useful. A high R2 means the predicted
values are close to the actual values. It says nothing more. We can have a
model loaded with endogeneity that generates a high R2 . The high R2 in this
case means nothing; the model is junk, the high R2 notwithstanding. And to
make matters worse, some people have the intuition that a good fit is necessary
for believing regression results. This intuition isn’t correct, either. There is no
minimum value we need for a good regression. In fact, it is very common for
experiments (the gold standard of statistical analyses) to have low R2 values.
There can be all kinds of reasons for low R2 —the world could be messy, such
that σ 2 is high, for example—but the model could nonetheless yield valuable
insight.
Figure 3.9 shows various goodness of fit measures for OLS estimates of
two different hypothetical data sets of salary at age 30 (measured in thousands
of dollars) and years of education. In panel (a), the observations are pretty
closely clustered around the regression line. That’s a good fit. The variance of
the regression is 91.62; it’s not really clear what to make of that, however, until
we look at its square root, σ̂ (also known as the standard error of the regression,
among other terms), which is 9.57. Roughly speaking, this value of the standard
error of the regression means that the observations are on average within 9.57 units
of their fitted values.22 From this definition, therefore, on average the fitted values
are within $9,570 of actual salary. The R2 is 0.89. That’s pretty high. Is that value
high enough? We can’t answer that question because it is not a sensible question
for R2 values.
In panel (b) of Figure 3.9, the observations are more widely dispersed and so
not as good a fit. The variance of the regression is 444.2. As with panel (a), it’s
not really clear what to make of the variance of the regression until we look at
its square root, σ̂ , which is 21.1. This value means that the observations are on
average within $21,100 of actual salary. The R2 is 0.6. Is that good enough? Silly
question.
REMEMBER THIS
There are four ways to assess goodness of fit.
1. The variance of the regression (σ̂ 2 ) is used in the equation for var(β̂ 1 ). It is hard to interpret
directly.
²² We say “roughly speaking” because this value is actually the square root of the average of the squared residuals. The intuition for that value is the same, but it’s quite a mouthful.
[Figure 3.9: Plots with Different Goodness of Fit. Two scatterplots of salary at age 30 (in $1,000s) against years of education. Panel (a): σ̂² = 91.62, σ̂ = 9.57, R² = 0.89. Panel (b): σ̂² = 444.2, σ̂ = 21.1, R² = 0.6.]
2. The standard error of the regression (σ̂ ) is measured on the same scale as the dependent
variable and roughly corresponds to the average distance between fitted values and actual
values.
3. Scatterplots can be quite informative, not only about goodness of fit but also about possible
anomalies and outliers.
4. R2 is a widely used measure of goodness of fit.
• It is the square of the correlation between the fitted and observed values of the dependent
variable.
• R2 ranges from 0 to 1.
• A high R2 is neither necessary nor sufficient for an analysis to be useful.
The results reported in Table 3.2 look pretty much like the results any
statistical software will burp out. The estimated coefficient on adult height
(β̂ 1 ) is 0.412. The standard error estimate will vary depending on whether
we assume errors are or are not homoscedastic. The column on the left
shows that if we assume homoscedasticity (and therefore use Equation 3.9), the
estimated standard error of β̂ 1 is 0.0975. The column on the right shows that if we
allow for heteroscedasticity, the estimated standard error for β̂ 1 is 0.0953. This isn’t
much of a difference, but the two approaches to estimating standard errors can differ more substantially in other examples. The estimated constant (β̂₀) is −13.093, with standard error estimates of 6.897 and 6.691, depending on whether or not we use heteroscedasticity-consistent standard errors.
Notice that the β̂ 0 and β̂ 1 coefficients are identical across the columns, as
the heteroscedasticity-consistent standard error estimate has no effect on the
coefficient.
²³ The data is adjusted in two ways for the figure. First, we jitter the data to deal with the problem that
many observations overlap perfectly because they have the same values of X and Y. Jittering adds a
small random number to the height, causing each observation to be at a slightly different point. If
there are only two observations with the same specific combination of X and Y values, the jittered
data will show two circles, probably overlapping a bit. If there are many observations with some
specific combination of X and Y values, the jittered data will show many circles, overlapping a bit,
but creating a cloud of data that indicates lots of data near that point. We don’t use jittered data in the
statistical analysis; we use jittered data only for plotting data. Second, six outliers who made a ton of
money ($750 per hour for one of them!) are excluded. If they were included, the scatterplot would be
so tall that most observations would get scrunched up at the bottom.
[Figure 3.10: Height and Wages. Scatterplot of hourly wages (in $, roughly 0 to 80) against height in inches (roughly 60 to 80).]

[Table 3.2, bottom row: R² = 0.009 in both columns.]
What, exactly, do these numbers mean? First, let’s interpret the slope coef-
ficient, β̂ 1 . A coefficient of 0.412 on height implies that a one-inch increase in
height is associated with an increase in wages of 41.2 cents per hour. That’s
a lot!24
The interpretation of the constant, β̂₀, is that someone who is zero inches tall would earn negative $13.09 an hour. Hmmm. Not the most helpful piece of information. What’s going on is that most observations of height (the X variable) are far from zero (they are mostly between 60 and 75 inches). For the regression line to go through this data, it must cross the Y-axis at −13.09 for people who are zero inches tall. This example explains why we don’t spend a lot of time on β̂₀. It’s kind of weird to want to know—or believe—the extrapolation of our results to people who are zero inches tall.
If we don’t care about β̂₀, why do we have it in the model? Because it still plays
a very important role. Remember that we’re fitting a line, and the value of β̂ 0 pins
down where the line starts when X is zero. Failing to estimate the parameter is
the same as setting β̂ 0 to zero (because the fitted value would be Ŷi = β̂ 1 Xi , which
is zero when Xi = 0). Forcing β̂ 0 to be zero will typically lead to a much worse
model fit than letting the data tell us where the line should cross the Y-axis when X
is zero.
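To see how much forcing β̂₀ = 0 can hurt, compare a fit with an intercept to a through-the-origin fit on data whose X values sit far from zero, as with heights. This sketch is ours (in Python rather than the chapter's Stata and R), and the numbers are made up for illustration.

```python
def fit_with_intercept(x, y):
    """Standard bivariate OLS: return (intercept, slope)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    b0 = ybar - b1 * xbar
    return b0, b1

def fit_through_origin(x, y):
    """Forcing b0 = 0 gives b1 = sum(x*y) / sum(x^2)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)

def ssr(x, y, b0, b1):
    """Sum of squared residuals around the line b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Heights cluster far from zero, like the wage example (made-up numbers)
x = [60, 63, 66, 69, 72, 75]
y = [12.0, 13.5, 14.2, 16.0, 17.1, 18.4]
b0, b1 = fit_with_intercept(x, y)
b1_zero = fit_through_origin(x, y)
print(round(ssr(x, y, b0, b1), 2), "<", round(ssr(x, y, 0.0, b1_zero), 2))
```

Because the model with an intercept nests the through-the-origin model, its sum of squared residuals can never be larger; when the data sit far from zero, it is usually much smaller.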
The results are not only about the estimated coefficients. They also include
standard errors, which are quite important as they give us a sense of how accurate
our estimates are. The standard error estimates come from the data and tell us how
wide the distribution of β̂ 1 is. If the standard error of β̂ 1 is huge, then we should
not have much confidence that our β̂ 1 is necessarily close to the true value. If the
standard error of β̂ 1 is small, then we should have more confidence that our β̂ 1 is
close to the true value.
Are these results the final word on the relationship between height and wages?
(Hint: NO!) As with most observational data, a bivariate analysis may not be sufficient.
We should worry about endogeneity. In other words, there could be elements in the
error term (factors that influence wages but have not been included in the model)
that could be correlated with adult height, and if so, then the result that height
causes wages to go up may be incorrect. Can you think of anything in the error
term that is correlated with height? We come back to this question in Chapter 5
(page 131), where we revisit this data set.
Table 3.2 also shows several goodness of fit measures. The σ̂ 2 is 142.4; this
number is pretty hard to get our heads around. Much more useful is the standard
error of the regression, σ̂, which is 11.93, meaning roughly that the average distance between fitted and actual wages is almost $12 per hour. In other words, the fitted
values really aren’t particularly accurate. The R2 is close to 0.01. This value is low, but
as we said earlier, there is no set standard for R2 .
²⁴ To put that estimate in perspective, we can calculate how much being an inch taller is worth per year for someone who works 40 hours a week for 50 weeks per year: 0.41 × 1 × 40 × 50 = $820 per year. Being three inches taller is associated with earning 0.41 × 3 × 40 × 50 = $2,460 more per year. Being tall has its costs, though: tall people live shorter lives (Palmer 2013).
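The footnote's arithmetic, which rounds the coefficient to 0.41, can be checked directly; the factor of 2,000 is just 40 hours times 50 weeks. A tiny check (ours, in Python):

```python
hours_per_year = 40 * 50  # 40 hours/week for 50 weeks = 2,000 hours
extra_per_inch = round(0.41 * 1 * hours_per_year)      # dollars per year
extra_three_inches = round(0.41 * 3 * hours_per_year)  # dollars per year
print(extra_per_inch, extra_three_inches)  # 820 2460
```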
One reasonable concern might be that we should be wary of the OLS results
because the model fit seems pretty poor. That’s not how it works, though. The
coefficients provide the best estimates, given the data. The standard errors of the
coefficients incorporate the poor fit (via the σ̂ 2 ). So, yes, the poor fit matters, but it’s
incorporated into the OLS estimation process.
3.8 Outliers
outliers: Observations that are extremely different from those in the rest of the sample.

One practical concern we have in statistics is dealing with outliers, or observations that are extremely different from the rest of the sample. The concern is that a single goofy observation can skew the analysis.

We saw on page 32 that Washington, DC, is quite an outlier in a plot of crime data for the United States. Figure 3.11 shows a scatterplot of violent crime and percent urban.

[Figure 3.11: Scatterplot of Violent Crime and Percent Urban. Violent crime rate (per 100,000 people) against percent urban for the 50 states and DC; DC sits far above every state.]

Imagine drawing an OLS line by hand when the nation’s capital
is included. Then imagine drawing an OLS line by hand when it’s excluded.
The line with Washington, DC, will be steeper in order to get close to the
observation for Washington, DC; the other line will be flatter because it can
stay in the mass of the data without worrying about Washington, DC. Hence, a
reasonable person may worry that the DC data point could substantially influence
the estimate. On the other hand, if we were to remove an observation in the
middle of the mass of the data, such as Oklahoma, the estimated line would move
little.
We can see the effect of including and excluding DC in Table 3.3, which
shows bivariate OLS results in which violent crime rate is the dependent variable.
In the first column, percent urban is the independent variable and all states
plus DC are included (therefore the N is 51). The coefficient is 5.61 with a
standard error of 1.80. The results in the second column are based on data without
Washington, DC (dropping the N to 50). The coefficient is quite a bit smaller,
coming in at 3.58, which is consistent with our intuition from our imaginary line
drawing.
The table also shows bivariate OLS coefficients for a model with single-parent
percent as the independent variable. The coefficient when we include DC is 23.17.
When we exclude DC, the estimated relationship weakens to 16.91. We see a
similar pattern with crime and poverty percent in the last two columns.
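The mechanics behind Table 3.3 are easy to replicate. We do not have the state data here, so the sketch below (ours, in Python rather than the chapter's Stata and R) uses made-up numbers: a small cloud of "states" with a mild relationship, plus one DC-like observation that is very urban with very high crime. Adding that one point pulls the slope up sharply.

```python
def ols_slope(x, y):
    """Bivariate OLS slope."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
           sum((xi - xbar) ** 2 for xi in x)

# Made-up "states": percent urban vs. violent crime rate
x = [45, 55, 60, 65, 70, 75, 80, 90]
y = [300, 350, 320, 400, 380, 450, 430, 500]
# One DC-like outlier: very urban, very high crime
x_all, y_all = x + [100], y + [1200]
print(round(ols_slope(x_all, y_all), 2), ">", round(ols_slope(x, y), 2))
```

With only nine observations, one extreme point dominates the sum of squared residuals; with thousands of observations the same point would move the line far less, which is the sample-size point made below.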
Figure 3.12 shows scatterplots of the data with the fitted lines included. The
fitted lines based on all data are the solid lines, and the fitted lines when DC is
excluded are the dashed lines. In every case, the fitted lines including DC are
steeper than the fitted lines when DC is excluded.
So what are we to conclude here? Which results are correct? There may be
no clear answer. The important thing is to appreciate that the results in these
cases depend on a single observation. In such cases, we need to let the world
know. We should show results with and without the excluded observation and
FIGURE 3.12: Scatterplots of Crime against Percent Urban, Single Parent, and Poverty with OLS Fitted Lines

[Three panels plot violent crime rate (per 100,000 people) against percent urban, percent single parent, and poverty percent. Each panel shows the fitted line with DC (solid) and without DC (dashed); DC sits far above the other observations in all three panels.]
justify substantively why an observation might merit exclusion. In the case of the
crime data, for example, we could exclude DC on the grounds that it is not (yet!)
a state.
Outlier observations are more likely to influence OLS results when the
number of observations is small. Given that OLS will minimize the sum of squared
residuals from the fitted line, a single observation is more likely to play a big role
when only a few residuals must be summed. When data sets are very large, a single
observation is less likely to move the fitted line substantially.
An excellent way to identify potentially influential observations is to plot the
data and look for unusual observations. If an observation looks out of whack,
it’s a good idea to run the analysis without it to see if the results change. If
they do, explain the situation to readers and justify including or excluding the
outlier.25
²⁵ Most statistical packages provide tools to assess the influence of each observation. For a sample
size N, these commands essentially run N separate OLS models, each one excluding a different
observation. For each of these N regressions, the command stores a value indicating how much the
coefficient changes when that particular observation is excluded. The resulting output reflects how
much the coefficients change with the deletion of each observation. In Stata, the command is
dfbeta, where df refers to difference and beta refers to β̂. In other words, the command will tell us
for each observation the difference in estimated β̂’s when that observation is deleted. In R, the
command is also called dfbeta. Google these command names to find more information on how to
use them.
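The leave-one-out logic behind dfbeta is just N refits. The sketch below is ours (in Python; Stata and R users can rely on the built-in dfbeta commands described in the footnote): for each observation it records how much the slope changes when that observation is dropped, and the made-up outlier shows far more influence than any other point.

```python
def ols_slope(x, y):
    """Bivariate OLS slope."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
           sum((xi - xbar) ** 2 for xi in x)

def slope_dfbetas(x, y):
    """For each observation, the slope change when that observation is dropped."""
    full = ols_slope(x, y)
    out = []
    for i in range(len(x)):
        x_i = x[:i] + x[i + 1:]
        y_i = y[:i] + y[i + 1:]
        out.append(full - ols_slope(x_i, y_i))
    return out

x = [1, 2, 3, 4, 10]          # made-up data; the last point is far from the rest
y = [1.1, 1.9, 3.2, 3.9, 30.0]
changes = slope_dfbetas(x, y)
# The outlier has by far the largest influence on the slope.
print([round(c, 2) for c in changes])
```

This is exactly what the packaged commands do under the hood, minus their computational shortcuts for large N.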
REMEMBER THIS
Outliers are observations that are very different from other observations.
1. When sample sizes are small, a single outlier can exert considerable influence on OLS
coefficient estimates.
2. Scatterplots are useful in identifying outliers.
3. When a single observation substantially influences coefficient estimates, we should
• Inform readers of the issue.
• Report results with and without the influential observation.
• Justify including or excluding that observation.
Conclusion
Ordinary least squares is an odd name that refers to the way in which the β̂
estimates are produced. That’s fine to know, but the real key to understanding OLS
is appreciating the properties of the estimates produced.
The most important property of OLS estimates is that they are unbiased if X is uncorrelated with the error. We’ve all heard “correlation does not imply
causation,” but “regression does not imply causation” is every bit as true. If there
is endogeneity, we may observe a big regression coefficient even in the absence of
causation or a tiny regression coefficient even when there is causation.
OLS estimates have many other useful properties. With a large sample size,
β̂ 1 is a normally distributed random variable. The variance of β̂ 1 reflects the width
of the β̂ 1 distribution and is determined by the fit of the model (the better the
fit, the thinner the distribution), the sample size (the more data, the thinner the
distribution), and the variance of X (the more variance, the thinner the distribution).
If the errors satisfy the homoscedasticity and no-correlation conditions, the
variance of β̂ 1 is defined by Equation 3.9. If the errors are heteroscedastic or
correlated with each other, OLS still produces unbiased coefficients, but we will
need other tools, covered here and in Chapter 13, to get appropriate standard errors
for our β̂ 1 estimates.
We’ll have mastered bivariate OLS when we can accomplish the
following:
• Section 3.1: Write out the bivariate regression equation, and explain all its
elements (dependent variable, independent variable, slope, intercept, error
term). Draw a hypothetical scatterplot with a small number of observations,
and show how bivariate OLS is estimated, identifying residuals and fitted values.
• Section 3.2: Explain why β̂ 1 is a random variable, and sketch its dis-
tribution. Explain two ways to think about randomness in coefficient
estimates.
• Section 3.4: Write out the standard equation for the variance of β̂ 1 in
bivariate OLS, and explain three factors that affect this variance.
• Section 3.6: Identify the conditions required for the standard variance
equation of β̂ 1 to be accurate. Explain why these two conditions are less
important than the exogeneity condition.
• Section 3.7: Explain four ways to assess goodness of fit. Explain why R2
alone does not measure whether or not a regression was successful.
• Section 3.8: Explain what outliers are, how they can affect results, and what
to do about them.
Further Reading
Beck (2010) provides an excellent discussion of what to report from a regression
analysis.
Weighted least squares is a type of generalized least squares that can be
used when dealing with heteroscedastic data. Chapter 8 of Kennedy (2008)
discusses weighted least squares and other issues associated with errors that are
heteroscedastic or correlated with each other. These issues are often referred to
as violations of a “spherical errors” condition. Spherical errors is fancy statistical
jargon meaning that errors are both homoscedastic and not correlated with each
other.
Murray (2006b, 500) provides a good discussion of probability limits and
consistency for OLS estimates.
We discuss what to do with autocorrelated errors in Chapter 13. The Further
Reading section at the end of that chapter provides links to the very large literature
on time series data analysis.
Key Terms
Autocorrelation (69)
Bias (58)
Central limit theorem (56)
Consistency (66)
Continuous variable (54)
Degrees of freedom (63)
Distribution (54)
Fitted value (48)
Goodness of fit (70)
Heteroscedastic (68)
Heteroscedasticity-consistent standard errors (68)
Homoscedastic (68)
Modeled randomness (54)
Normal distribution (55)
Outliers (77)
plim (66)
Probability density (55)
Probability distribution (54)
Probability limit (65)
Random variable (54)
Regression line (48)
Residual (48)
Sampling randomness (53)
Standard error (61)
Standard error of the regression (71)
Time series data (69)
Unbiased estimator (58)
Variance (61)
Variance of the regression (63)
Computing Corner
Stata

1. To estimate bivariate OLS with the donut data from Chapter 1, use Stata’s regress command, listing the dependent variable first: regress weight donuts. The output includes
------------------------------------------------------------------------------
weight | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
donuts | 9.103799 1.919976 4.74 0.001 4.877961 13.32964
_cons | 122.6156 16.36114 7.49 0.000 86.60499 158.6262
------------------------------------------------------------------------------
There is a lot of information here, not all of which is useful. The vital
information is in the bottom table that shows β̂ 1 is 9.10 with a standard
error of 1.92 and β̂ 0 is 122.62 with a standard error of 16.36. We cover t,
P>|t|, and 95% confidence intervals in Chapter 4.
The column on the upper right has some useful information, too, indicating
the number of observations, R2 , and Root MSE. (As we noted in the
chapter, Stata refers to the standard error of the regression, σ̂ , as root MSE,
which is Stata’s shorthand for the square root of the mean squared error.)
We discuss the adjusted R2 later (page 150). The F and Prob > F to the right of the output relate to a test we also cover later (page 159); it’s generally not particularly useful.
The table in the upper left is pretty useless. Contemporary researchers seldom use the information in the Source, SS, df, and MS columns.
We can display the actual values, fitted values, and residuals with the list
command: list weight Fitted Residuals.
²⁶ We jittered the data in Figure 3.10 to make it a bit easier to see more data points. Stata’s jitter
subcommand jitters data [e.g., scatter weight donuts, jitter(3)]. The bigger the number in
parentheses, the more the data will be jittered.
R

1. The following commands use the donut data from Chapter 1 (page 3).
Since R is an object-oriented language, our regression commands create
objects containing information, which we ask R to display as needed. To
estimate an OLS regression, we create an object called “OLSResults” (we
could choose a different name) by typing OLSResults = lm(weight
~ donuts). This command stores information about the regression
results in the object called OLSResults. The lm command stands for
“linear model” and is the R command for OLS. The general format
is lm(Y ~ X) for a dependent variable Y and independent variable X.
To display these regression results, type summary(OLSResults), which
produces
lm(formula = weight ~ donuts)
Residuals:
Min 1Q Median 3Q Max
-93.135 -9.479 0.757 35.108 55.073
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.616 16.361 7.494 0.0000121
donuts 9.104 1.920 4.742 0.000608
Residual standard error: 45.59 on 11 degrees of freedom
Multiple R-squared: 0.6715, Adjusted R-squared: 0.6416
F-statistic: 22.48 on 1 and 11 DF, p-value: 0.0006078
The vital information is in the bottom table that shows that β̂ 1 is 9.104
with a standard error of 1.920 and β̂ 0 is 122.616 with a standard error of
16.361. We cover t value and Pr(>|t|) in Chapter 4.
R refers to the standard error of the regression (σ̂ ) as the residual standard
error and lists it below the regression results. Next to that is the degrees of
freedom. To calculate the number of observations in the data set analyzed,
recall that degrees of freedom equals N − k. Since we know k (the number
of estimated coefficients) is 2 for this model, we can infer the sample
size is 13. (Yes, this is probably more work than it should be to display
sample size.)
The multiple R2 (which is just the R2 ) is below the residual standard error.
We discuss the adjusted R2 later (page 150). The F statistic at the bottom
refers to a test we cover on page 159. It’s usually not a center of attention.
Computing Corner 85
27
Figure 3.10 jittered the data to make it a bit easier to see more data points. To jitter data in an R
plot, type plot(jitter(donuts), jitter(weight)).
28
There are more efficient ways to exclude data when we are using data frames. For example, if the
variables are all included in a data frame called dta, we could type OLSResultsNoHomer =
lm(weight ~ donuts, data = dta[name != "Homer", ]).
The useful AER package must be installed once and loaded at each use, as
follows:
• Tell R to load the package every time we open R and want to use the
commands in the AER (or other) package. We do this with the library
command. We have to use the library command in every session we use
a package.
Assuming the AER package has been installed, we can run OLS
with heteroscedasticity-consistent standard errors via the following
code:
library(AER)
OLSResults = lm(weight ~ donuts)
coeftest(OLSResults, vcov = vcovHC(OLSResults,
type = "HC1"))
The last line is elaborate. The command coeftest is asking for information
on the variance of the estimates (among other things) and the vcov =
vcovHC part of the command is asking for heteroscedasticity-consistent
standard errors. There are multiple ways to estimate such standard errors,
and the HC1 asks for the most commonly used form of these standard
errors.29
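The AER route above is the standard one in R. To make the mechanics concrete, here is a minimal sketch in Python (illustrative only, not the book's code) that computes bivariate OLS estimates and then the HC1 heteroscedasticity-consistent standard error of the slope by hand: the sandwich formula with the n/(n − k) small-sample correction that the "HC1" option applies.

```python
import math

def ols_hc1(x, y):
    """Bivariate OLS with an HC1 (heteroscedasticity-consistent) standard
    error for the slope. The sandwich variance for the slope is
    sum(e_i^2 (x_i - xbar)^2) / Sxx^2, scaled by n/(n - k) with k = 2."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    # "Meat" of the sandwich: squared residuals weighted by (x_i - xbar)^2.
    meat = sum(e ** 2 * (xi - xbar) ** 2 for e, xi in zip(resid, x))
    var_b1 = (n / (n - 2)) * meat / sxx ** 2
    return b0, b1, math.sqrt(var_b1)

# Toy data: y = 3 + 2x plus an error whose spread grows with x
# (i.e., heteroscedasticity), so robust standard errors are appropriate.
x = [1, 2, 3, 4, 5, 6, 7, 8]
e = [0.1, -0.1, 0.3, -0.3, 0.6, -0.6, 1.0, -1.0]
y = [3 + 2 * xi + ei for xi, ei in zip(x, e)]
b0, b1, se_hc1 = ols_hc1(x, y)
```

The estimates land near the true intercept of 3 and slope of 2; the se_hc1 value is what coeftest with vcovHC(..., type = "HC1") reports for the slope in R.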
Exercises
1. Use the data in PresVote.dta to answer the following questions about the
relationship between changes in real disposable income and presidential
election results. Table 3.4 describes the variables.
(b) Estimate an OLS regression in which the vote share of the incumbent
party is regressed on change in real disposable income. Report the
estimated regression equation, and interpret the coefficients.
29
The “vcov” terminology is short for variance-covariance, and “vcovHC” is short for
heteroscedasticity-consistent standard errors.
Exercises 87
TABLE 3.4 Variables for Questions on Presidential Elections and the Economy
Variable name Description
For this problem, we are going to assume that the true model is
Salaryi = β0 + β1 Educationi + εi
The model indicates that the salary for each person is $10,000 plus
$1,000 times the number of years of education plus the error term for the
individual. Our goal is to explore how much our estimate of β̂ 1 varies.
The book’s website provides code that will simulate a data set with
100 observations. (Stata code is in Ch3_SimulateBeta_StataCode.do; R
code is in Ch3_SimulateBeta_StataCode.R.) Values of education for each
observation are between 0 and 16 years. The error term will be a normally
distributed error term with a standard deviation of 10,000.
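The book's website supplies the Stata and R simulation code; as an illustrative stand-in (not the book's code), the Python sketch below runs the same experiment: draw education between 0 and 16, generate salaries from the assumed true model with a 10,000-standard-deviation error, and collect the OLS slope estimate from each simulated data set.

```python
import random

def simulate_beta1(n_obs, n_sims, error_sd, seed=7):
    """Repeatedly simulate Salary = 10,000 + 1,000*Education + error
    and return the OLS slope estimate from each simulated data set."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_sims):
        educ = [rng.uniform(0, 16) for _ in range(n_obs)]
        salary = [10_000 + 1_000 * e + rng.gauss(0, error_sd) for e in educ]
        xbar = sum(educ) / n_obs
        ybar = sum(salary) / n_obs
        sxx = sum((x - xbar) ** 2 for x in educ)
        sxy = sum((x - xbar) * (y - ybar) for x, y in zip(educ, salary))
        slopes.append(sxy / sxx)  # OLS slope: Sxy / Sxx
    return slopes

slopes = simulate_beta1(n_obs=100, n_sims=200, error_sd=10_000)
mean_slope = sum(slopes) / len(slopes)  # should land near the true 1,000
```

Because OLS is unbiased, the average of the slope estimates sits near 1,000 even though individual estimates scatter well above and below it, which previews the answers to parts (a) and (b).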
(a) Explain why the means of the estimated coefficients across the
multiple simulations are what they are.
(b) What are the minimum and maximum values of the estimated
coefficients on education? Explain whether these values are
inconsistent with our statement in the chapter that OLS estimates are
unbiased.
(c) Rerun the simulation with a larger sample size in each simulation.
Specifically, set the sample size to 1,000 in each simulation. Com-
pare the mean, minimum, and maximum of the estimated coefficients
on education to the original results above.
(d) Rerun the simulation with a smaller sample size in each simulation.
Specifically, set the sample size to 20 in each simulation. Compare
the mean, minimum, and maximum of the estimated coefficients on
education to the original results above.
(e) Reset the sample size to 100 for each simulation, and rerun the
simulation with a smaller standard deviation (equal to 500) for each
simulation. Compare the mean, minimum, and maximum of the
estimated coefficients on education to the original results above.
(f) Keeping the sample size at 100 for each simulation, rerun the
simulation with a larger standard deviation for each simulation.
Specifically, set the standard deviation to 50,000 for each simulation.
Compare the mean, minimum, and maximum of the estimated
coefficients on education to the original results above.
(g) Revert to the original model (sample size at 100 and standard deviation
at 10,000). Now run 500 simulations. Summarize the distribution of
the β̂ Education estimates as you’ve done so far, but now also plot the
distribution of these coefficients using code provided. Describe the
density plot in your own words.
(a) Estimate a model where height at age 33 explains income at age 33.
Explain β̂ 1 and β̂ 0 .
(b) Create a scatterplot of height and income at age 33. Identify outliers.
(c) Create a scatterplot of height and income at age 33, but exclude
observations with wages per hour more than 400 British pounds and
height less than 40 inches. Describe the difference from the earlier
plot. Which plot seems the more reasonable basis for statistical
analysis? Why?
(d) Reestimate the bivariate OLS model from part (a), but exclude four
outliers with very high wages and outliers with height below 40
inches. Briefly compare results to earlier results.
(e) What happens when the sample size is smaller? To answer this ques-
tion, reestimate the bivariate OLS model from above (that excludes
outliers), but limit the analysis to the first 800 observations.30 Which
changes more from the results with the full sample: the estimated
coefficient on height or the estimated standard error of the coefficient
on height? Explain.
4. Table 3.6 lists the variables in the WorkWomen.dta and WorkMen.dta data
sets, which are based on Chakraborty, Holter, and Stepanchuk (2012).
Answer the following questions about the relationship between hours
worked and divorce rates:
(a) For each data set (for women and for men), create a scatterplot of
hours worked on the Y-axis and divorce rates on the X-axis.
hours Average yearly labor (in hours) for gender specified in data set
divorcerate Divorce rate per thousand
taxrate Average effective tax rate
30
To do this in Stata, include if _n < 800 at the end of the Stata regress command. Because some
observations have missing data and others are omitted as outliers, the actual sample size in the
regression will fall a bit lower than 800. The _n notation is Stata’s way of indicating the observation
number, which is the row number of the observation in the data set. In R, create and use a new data
set with the first 800 observations (e.g., dataSmall = data[1:800,]).
(b) For each data set, estimate an OLS regression in which hours
worked is regressed on divorce rates. Report the estimated regression
equation, and interpret the coefficients. Explain any differences in
coefficients.
(c) What are the fitted value and residual for men in Germany?
(d) What are the fitted value and residual for women in Spain?
5. Use the data described in Table 3.6 to answer the following questions about
the relationship between hours worked and tax rates:
(a) For each data set (for women and for men), create a scatterplot of
hours worked on the Y-axis and tax rates on the X-axis.
(b) For each data set, estimate an OLS regression in which hours worked
is regressed on tax rates. Report the estimated regression equation,
and interpret the coefficients. Explain any differences in coefficients.
(c) What are the fitted value and residual for men in the United States?
(d) What are the fitted value and residual for women in Italy?
CHAPTER 4
Hypothesis Testing and Interval Estimation: Answering Research Questions
The standard null hypothesis is that height has no effect on wages. Or, more
formally,
H0 : β1 = 0
where the subscript zero after the H indicates that this is the null hypothesis.
4.1 Hypothesis Testing
1
That’s why there is a t-shirt that says “Being a statistician means never having to say you are
certain.”
situations, we must take the threat of Type II error seriously; we consider some
when we discuss statistical power in Section 4.4.
If we reject the null hypothesis, we accept the alternative hypothesis. We do
not prove the alternative hypothesis is true. Rather, the alternative hypothesis is
the idea we hang onto when we have evidence that is inconsistent with the null
hypothesis.
alternative hypothesis: What we accept if we reject the null hypothesis.
An alternative hypothesis is either one sided or two sided. A one-sided
alternative hypothesis has a direction. For example, if we have theoretical reasons
to believe that being taller increases wages, then the alternative hypothesis for the
model

Wagei = β0 + β1 Adult heighti + εi (4.2)

would be written as HA : β1 > 0.
one-sided alternative hypothesis: An alternative to the null hypothesis that has
a direction, for example, HA : β1 > 0 or HA : β1 < 0.
A two-sided alternative hypothesis has no direction. For example, if we
think height affects wages but we're not sure whether tall people get paid
more or less, the alternative hypothesis would be HA : β1 ≠ 0. If we've done
enough thinking to run a statistical model, it seems reasonable to believe that we
should have at least an idea of the direction of the coefficient on our variable
of interest, implying that two-sided alternatives might be rare. They are not,
however, in part because they are more statistically cautious, as we will discuss
shortly.
two-sided alternative hypothesis: An alternative to the null hypothesis that
indicates the coefficient is not equal to 0 (or some other specified value), for
example, HA : β1 ≠ 0.
Formulating appropriate null and alternative hypotheses allows us to translate
substantive ideas into statistical tests. For published work, it is generally a breeze
to identify null hypotheses: just find the β̂ that the authors jabber on about most.
The main null hypothesis is almost certainly that that coefficient is zero.
Consider the presidential election model

Vote sharet = β0 + β1 Change in incomet + εt

where Vote sharet is percent of the vote received by the incumbent president's party
in year t and the independent variable, Change in incomet , is the percent change
in real disposable income in the United States in the year before the presidential
election. The null hypothesis is that there is no effect, or H0 : β1 = 0.
What is the distribution of β̂1 under the null hypothesis? Pretty simple: if the
correlation of change in income and ε is zero (which we assume for this example),
then β̂1 is a normally distributed random variable centered on zero. This is because
OLS produces unbiased estimates, and if the true value of β1 is zero, then an
unbiased distribution of β̂1 will be centered on zero.
How wide is the distribution of β̂1 under the null hypothesis? Unlike the mean
of the distribution, which we know under the null, the width of the β̂1 distribution
depends on the data. In other words, we allow the data to tell us the variance and
standard error of the β̂1 estimate under the null hypothesis.
Table 4.2 shows the results for the presidential election model. Of particular
interest for us at this point is that the standard error of the β̂1 estimate is 0.55.
This number tells us how wide the distribution of the β̂1 will be under the
null.
With this information, we can depict the distribution of β̂1 under the null.
Specifically, Figure 4.1 shows the probability density function of β̂1 under the null
hypothesis, which is a normal probability density centered at zero with a standard
deviation of 0.55. We also refer to this as the distribution of β̂1 under the null
hypothesis. We introduced probability density functions in Section 3.2 and discuss
them in further detail in Appendix F starting on page 541.
Figure 4.1 illustrates the key idea of hypothesis testing. The actual value of β̂1
that we estimated is 2.2. That number seems pretty unlikely, doesn’t it? Under the
null hypothesis, most of the distribution of β̂1 is to the left of the β̂1 observed. We
formalize things in the next section, but intuitively, it’s reasonable to think that the
observed value β̂1 is so unlikely if the null is true that, well, the null hypothesis is
probably not true.
Now name a value of β̂1 that would lead us not to reject the null hypothesis. In
other words, name a value of β̂1 that is perfectly likely under the null hypothesis.
We show one such example in Figure 4.1: the line at β̂1 = −0.3. A value like this
would be completely unsurprising if the null hypothesis were true. Hence, if we
observed such a value for β̂1 , we would deem it to be consistent with the null
hypothesis, and we would not reject the null hypothesis.
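To see why 2.2 is "unlikely" and −0.3 is not, divide each by the standard error of 0.55 to express it in standard-deviation units under the null. This back-of-the-envelope check is easy to script (Python here, purely for illustration):

```python
# Under the null, beta1-hat is normally distributed with mean 0 and sd 0.55.
se = 0.55

z_actual = 2.2 / se    # the estimate we actually observed
z_example = -0.3 / se  # a value consistent with the null

# 2.2 sits 4 standard deviations from zero: wildly unlikely under the null.
# -0.3 sits about half a standard deviation from zero: entirely unsurprising.
```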
Significance level
Given that our strategy is to reject the null hypothesis when we observe a β̂1 that
is quite unlikely under the null, the natural question is: Just how unlikely does
β̂1 have to be? We get to choose the answer to this question. In other words, we
get to decide our standard for what we deem to be sufficiently unlikely to reject
the null hypothesis. We'll call this probability the significance level and denote
it with α (the Greek letter alpha). A significance level determines how unlikely a
result has to be under the null hypothesis for us to reject the null. A very common
significance level is 5 percent (meaning α = 0.05).
significance level: The probability of committing a Type I error for a hypothesis
test (i.e., how unlikely a result has to be under the null hypothesis for us to reject
the null).
[Figure: a normal probability density centered at zero with standard deviation 0.55. The actual value of β̂1 (2.2) lies far in the right tail; β̂1 = −0.3 marks an example value for which we would fail to reject the null.]
FIGURE 4.1: Distribution of β̂1 under the Null Hypothesis for Presidential Election Example
If we set α = 0.05, then we reject the null when we observe a β̂1 so large that
we would expect a 5 percent chance of seeing the observed value or higher under
only the null hypothesis. Setting α = 0.05 means that there is a 5 percent chance
that we would see a value high enough to reject the null hypothesis even when the
null hypothesis is true, meaning that α is the probability of making a Type I error.
If we want to be more cautious (in the sense of requiring a more extreme result
to reject the null hypothesis), we can choose α = 0.01, in which case we will reject
the null if we have a one percent or lower chance of observing a β̂1 as large as we
actually did if the null hypothesis were true.
Reducing α is not completely costless, however. As the probability of making
a Type I error decreases, the probability of making a Type II error increases. In
other words, the more we say we’re going to need really strong evidence to reject
the null hypothesis (which is what we say when we make α small), the more likely
it is that we’ll fail to reject the null hypothesis when the null hypothesis is wrong
(which is the Type II error).
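This trade-off is easy to see by simulation. The sketch below (illustrative Python, using a simple two-sided z-test on a sample mean with known standard deviation rather than a regression) rejects the null at α = 0.05 and at α = 0.01; under the null the rejection rate matches α, and under a true effect the stricter level rejects less often, which is exactly the extra Type II risk.

```python
import math
import random

def rejection_rate(true_mean, alpha, n=30, n_sims=2000, seed=11):
    """Share of simulated samples in which a two-sided z-test of
    H0: mean = 0 (known sd = 1) rejects at significance level alpha."""
    crit = {0.05: 1.96, 0.01: 2.58}[alpha]  # large-sample critical values
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        sample = [rng.gauss(true_mean, 1) for _ in range(n)]
        z = (sum(sample) / n) / (1 / math.sqrt(n))
        if abs(z) > crit:
            rejections += 1
    return rejections / n_sims

type1_rate = rejection_rate(true_mean=0.0, alpha=0.05)  # ~0.05 by design
power_05 = rejection_rate(true_mean=0.4, alpha=0.05)
power_01 = rejection_rate(true_mean=0.4, alpha=0.01)    # smaller: more Type II
```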
REMEMBER THIS
1. A null hypothesis is typically a hypothesis of no effect, written as H0 : β1 = 0.
• We reject a null hypothesis when the statistical evidence is inconsistent with the null
hypothesis. A coefficient estimate is statistically significant if we reject the null hypothesis
that the coefficient is zero.
• We fail to reject a null hypothesis when the statistical evidence is consistent with the null
hypothesis.
• Type I error occurs when we wrongly reject a null hypothesis.
• Type II error occurs when we wrongly fail to reject a null hypothesis.
2. An alternative hypothesis is the hypothesis we accept if we reject the null hypothesis.
• We choose a one-sided alternative hypothesis if theory suggests either β1 > 0 or β1 < 0.
• We choose a two-sided alternative hypothesis if theory does not provide guidance as to
whether β1 is greater than or less than zero.
3. The significance level (α) refers to the probability of a Type I error for our hypothesis test. We
choose the value of the significance level, typically 0.01 or 0.05.
4. There is a trade-off between Type I and Type II errors. If we lower α, we decrease the probability
of making a Type I error but increase the probability of making a Type II error.
Discussion Questions
1. Translate each of the following questions into a bivariate model with a null hypothesis that
could be tested. There is no single answer for each.
(a) “What causes test scores to rise?”
(b) “How can Republicans increase support among young voters?”
(c) “Why did unemployment spike in 2008?”
2. For each of the following, identify the null hypothesis, draw a picture of the distribution of β̂1
under the null, identify values of β̂1 that would lead you to reject or fail to reject the null, and
explain what it would mean to commit Type I and Type II errors in each case.
(a) We want to know if height increases wages.
(b) We want to know if gasoline prices affect the sales of SUVs.
(c) We want to know if handgun sales affect murder rates.
4.2 t Tests
The most common tool we use for hypothesis testing in OLS is the t test. There's
a quick rule of thumb for t tests: if the absolute value of β̂1 /se( β̂1 ) is bigger than 2,
reject the null hypothesis. (Recall that se( β̂1 ) is the standard error of our coefficient
estimate.) If not, don't. This section provides the logic and tools of t testing, which
will enable us to be more precise, but this rule of thumb is pretty much all there is
to it.
t test: A hypothesis test for hypotheses about a normal random variable with an
estimated standard error.
The t distribution
Dividing β̂1 by its standard error solves the scale problem but introduces
another challenge. We know β̂1 is normally distributed, but what is the distribution
of β̂1 /se( β̂1 )? The se( β̂1 ) term is also a random variable because it depends on the
estimated β̂1 . It's a tricky question, and now is a good time to turn to our friends
at Guinness Brewery for help. Really. Not for what you might think, but for work
they did in the early twentieth century demonstrating that the distribution of β̂1 /se( β̂1 )
[Figure: a wider null density centered at zero, with the actual value of β̂1 (2.2) marked.]
FIGURE 4.2: Distribution of β̂1 under the Null Hypothesis with Larger Standard Error for Presidential Election Example
follows a distribution we call the t distribution.2 The t distribution is bell shaped
like a normal distribution but has "fatter tails."3 We say it has fat tails because the
values on the far left and far right have higher probabilities than what we find for
the normal distribution. The extent of these chubby tails depends on the sample
size: as the sample size gets bigger, the tails melt down to become the same as the
normal distribution. What's going on is that we need to be more cautious about
rejecting the null because it is possible that by chance our estimate of se( β̂1 ) will
be too small, which will make β̂1 /se( β̂1 ) appear to be really big. When we have small
amounts of data, the issue is serious because we will be quite uncertain about
se( β̂1 ); when we have lots of data, we'll be more confident about our estimate
of se( β̂1 ) and, as we'll see, the fat tails of the t distribution fade away and the t
distribution and normal distribution become virtually indistinguishable.
t distribution: A distribution that looks like a normal distribution, but with fatter
tails. The exact shape of the distribution depends on the degrees of freedom. This
distribution converges to a normal distribution for large sample sizes.
2
Like many statistical terms, t distribution and t test have quirky origins. William Sealy Gosset
devised the test in 1908 when he was working for Guinness Brewery in Dublin. His pen name was
"Student." There already was an s test (now long forgotten), so Gosset named his test and distribution
after the second letter of his pen name. Technically, the standard error of β̂1 follows a statistical
distribution called a χ2 distribution, and the ratio of a normally distributed random variable and a χ2
random variable follows a t distribution. More details are in Appendix H on page 549. For now, just
note that the Greek letter χ (chi) is pronounced like "ky," as in Kyle.
3
That's a statistical term. Seriously.
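Why small samples need fatter tails can be checked directly by simulation. The hypothetical Python sketch below (a t statistic for a sample mean, which follows a t distribution with n − 1 degrees of freedom) shows that with only three observations, the noisy standard error pushes the t statistic past the large-sample cutoff of 1.96 far more than 5 percent of the time.

```python
import math
import random
import statistics

def share_beyond_196(n, n_sims=4000, seed=3):
    """Simulate t = mean / (s / sqrt(n)) under a true mean of zero and
    report how often |t| exceeds the large-sample cutoff of 1.96."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_sims):
        sample = [rng.gauss(0, 1) for _ in range(n)]
        t = statistics.mean(sample) / (statistics.stdev(sample) / math.sqrt(n))
        if abs(t) > 1.96:
            exceed += 1
    return exceed / n_sims

small = share_beyond_196(n=3)    # t with 2 d.f.: roughly 19%, not 5%
large = share_beyond_196(n=100)  # close to the normal distribution's 5%
```

Using a larger, t-based critical value for small samples restores the intended 5 percent Type I error rate.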
The specific shape of a t distribution depends on the degrees of freedom,
which is sample size minus the number of parameters. A bivariate OLS model
estimates two parameters (β̂ 0 and β̂1 ), which means, for example, that the degrees
of freedom for a bivariate OLS model with a sample size of 50 is 50 − 2 = 48.
Figure 4.3 displays three different t distributions; a normal distribution is
plotted in the background of each panel as a dotted line. Panel (a) shows a t
distribution with degrees of freedom (d.f.) equal to 2. The probability of observing
a value as high as 3 is higher for the t distribution than for the normal distribution.
The same thing goes for the probability of observing a value as low as –3. Panel
(b) shows a t distribution with degrees of freedom equal to 5. If we look closely,
we can see some chubbiness in the tails because the t distribution has higher
probabilities at, for example, values greater than 2. We have to look pretty closely
to see that, though. Panel (c) shows a t distribution with degrees of freedom equal
to 50. It is visually indistinguishable from a normal distribution and, in fact, covers
up the normal distribution so we cannot see it.
Critical values
Once we know the distribution of β̂1 /se( β̂1 ), we can come up with a critical value.
A critical value is the threshold for our test statistic. Loosely speaking, we reject
the null hypothesis if β̂1 /se( β̂1 ) (the test statistic) is greater than the critical value; if
β̂1 /se( β̂1 ) is below the critical value, we fail to reject the null hypothesis.
critical value: In hypothesis testing, a value above which a β̂1 would be so
unlikely that we reject the null.
More precisely, our specific decision rule depends on the nature of the
alternative hypothesis. Table 4.3 displays the specific rules. Rather than trying to
memorize these rules, it is better to concentrate on the logic behind them. If the
alternative hypothesis is two sided, then big values of β̂1 relative to the standard
error incline us to reject the null. We don’t particularly care if they are very positive
or very negative. If the alternative hypothesis is that β > 0, then only large, positive
values of β̂1 will incline us to reject the null hypothesis in favor of the alternative
hypothesis. Observing a very negative β̂1 would be odd, but certainly it would
not incline us to believe the alternative hypothesis that the true value of β is
greater than zero. Similarly, if the alternative hypothesis is that β < 0, then only
very negative values of β̂1 will incline us to reject the null hypothesis in favor of
the alternative hypothesis. We refer to the appropriate critical value in the table
because the actual value of the critical value will depend on whether the test is one
sided or two sided, as we discuss shortly.
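The logic behind these decision rules fits in a few lines. In this illustrative Python sketch (the function and argument names are mine, not the book's), `decide` applies the rule that matches the alternative hypothesis: compare |t| to the critical value for a two-sided test, but only the signed t statistic for a one-sided test.

```python
def decide(t_stat, critical_value, alternative):
    """Return True (reject H0) or False (fail to reject), following the
    decision rules for two-sided and one-sided alternative hypotheses."""
    if alternative == "two-sided":   # HA: beta1 != 0
        return abs(t_stat) > critical_value
    if alternative == "greater":     # HA: beta1 > 0
        return t_stat > critical_value
    if alternative == "less":        # HA: beta1 < 0
        return t_stat < -critical_value
    raise ValueError("unknown alternative")

# A very negative estimate rejects a two-sided null but lends no support
# to the alternative that beta1 > 0.
decide(-3.0, 1.96, "two-sided")  # True
decide(-3.0, 1.64, "greater")    # False
```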
The critical value for t tests depends on the t distribution and identifies the
point at which we decide the observed β̂1 /se( β̂1 ) is unlikely enough under the null
hypothesis to justify rejecting the null hypothesis.
Critical values depend on the significance level (α) we choose, our degrees of
freedom, and whether the alternative is one sided or two sided. Figure 4.4 depicts
critical values for various scenarios. We assume the sample size is large in each,
allowing us to use the normal approximation to the t distribution. Appendix G
explains the normal distribution in more detail. If you have not seen or do not
remember how to work with the normal distribution, it is important to review this
material.
Panel (a) of Figure 4.4 shows critical values for α = 0.05 and a two-sided
alternative hypothesis. The distribution of the t statistic is centered at zero under
the null hypothesis that β1 = 0. For a two-sided alternative hypothesis, we want to
FIGURE 4.4: Critical Values for Large-Sample t Tests, Using Normal Approximation to t Distribution
identify ranges that are far from zero and unlikely under the null hypothesis. For
α = 0.05, we want to find the range that constitutes the least-likely 5 percent of the
distribution under the null. This 5 percent is the sum of the 2.5 percent on the far
left and the 2.5 percent on the far right. Values in these ranges are not impossible,
but they are unlikely. For a large sample size, the critical values that mark off the
least-likely 2.5 percent regions of the distribution are −1.96 and 1.96.
Panel (b) of Figure 4.4 depicts another two-sided alternative hypothesis, this
time α = 0.01. Now we’re saying that to reject the null hypothesis, we’re going to
need to observe an even more unlikely β̂1 under the null hypothesis. The critical
value for a large sample size is 2.58. This number defines the point at which there
is a 0.005 probability (which is half of α) of being higher than the critical
value and a 0.005 probability of being less than the negative of it.
The picture and critical values differ a bit for a one-tailed test in which we look
only at one side of the distribution. In panel (c) of Figure 4.4, α = 0.05 and HA :
β1 > 0. Here 5 percent of the distribution is to the right of 1.64, meaning that we
will reject the null hypothesis in favor of the alternative that β1 > 0 if β̂1 /se( β̂1 ) > 1.64.
Note that the one-sided critical value for α = 0.05 is lower than the two-sided
critical value. One-sided critical values will always be lower for any given value
of α, meaning that it is easier to reject the null hypothesis for a one-sided
alternative hypothesis than for a two-sided alternative hypothesis. Hence, using
critical values based on a two-sided alternative is statistically cautious insofar as
we are less likely to appear overeager to reject the null if we use a two-sided
alternative.
Table 4.4 displays critical values of the t distribution for one-sided and
two-sided alternative hypotheses for common values of α. When the degrees of
freedom are very small (typically owing to a small sample size), the critical values
are relatively large. For example, with 2 degrees of freedom and α = 0.05, we
need to see a t stat above 2.92 to reject the null.4 With 10 degrees of freedom
and α = 0.05, we need to see a t stat above 1.81 to reject the null. With 100 degrees
of freedom and α = 0.05, we need a t stat above 1.66 to reject the null. As the
degrees of freedom get higher, the t distribution looks more and more like a
normal distribution; for infinite degrees of freedom, it is exactly like a normal
distribution, producing identical critical values. For degrees of freedom above
100, it is reasonable to use critical values from the normal distribution as a good
approximation.
We compare β̂1 /se( β̂1 ) to our critical value and reject if the magnitude is larger
than the critical value. We refer to the ratio β̂1 /se( β̂1 ) as the t statistic (or "t stat,"
as the kids say). The t statistic is so named because that ratio will be compared
to a critical value that depends on the t distribution in the manner just outlined.
Tests based on two-sided alternatives with α = 0.05 are very common. When
the sample size is large, the critical value for such a test is 1.96; hence the rule of
thumb is that a t statistic bigger than 2 is statistically significant at conventional
levels.
t statistic: The test statistic used in a t test. It is equal to ( β̂1 − β Null )/se( β̂1 ).
4
It’s unlikely that we would seriously estimate a model with 2 degrees of freedom. For a bivariate
OLS model, that would mean estimating a model with just four observations, which is a silly idea.
The column on the left shows that the t statistic from the homoscedastic
model for the coefficient on adult height is 4.23, meaning that β̂1 is 4.23 standard
deviations away from zero. The t statistic from the heteroscedastic model for
the coefficient on adult height is 4.33, which is essentially the same as in the
homoscedastic model. For simplicity, we’ll focus on the homoscedastic model
results.
Is this coefficient on adult height statistically significant? To answer that
question, we’ll need a critical value. To pick a critical value, we need to choose a
one-sided or two-sided alternative hypothesis and a significance level. Let’s start
with a two-sided test and α = 0.05.
For a t distribution, we also need to know the degrees of freedom. Recall that
to find the degrees of freedom, we take the sample size and subtract the number
of parameters estimated. The smaller the sample size, the more uncertainty we
have about our standard error estimate, hence the larger we make our critical
value. Here the sample size is 1,910 and we estimate two parameters, so the
degrees of freedom are 1,908. For a sample this large, we can reasonably use
the critical values from the last row of Table 4.4. The critical value for a two-sided
test with α = 0.05 and a high number for degrees of freedom is 1.96. Because
our t statistic of 4.23 is higher than 1.96, we reject the null hypothesis. It's
that easy.
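The same worked test takes three lines. With nearly 2,000 degrees of freedom the normal approximation is fine, so this illustrative Python sketch pulls the cutoff from the standard normal distribution rather than from a t table:

```python
from statistics import NormalDist

alpha = 0.05
t_stat = 4.23  # coefficient on adult height divided by its standard error

# Two-sided cutoff: the point leaving alpha/2 in each tail, about 1.96.
crit = NormalDist().inv_cdf(1 - alpha / 2)

reject = abs(t_stat) > crit  # True: reject H0 that beta1 = 0
```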
REMEMBER THIS
1. We use a t test to test a null hypothesis such as H0 : β1 = 0. The steps are as follows:
(a) Choose a one-sided or two-sided alternative hypothesis.
(b) Set a significance level, α, usually equal to 0.01 or 0.05.
(c) Find a critical value based on the t distribution. This value depends on α, whether the
alternative hypothesis is one sided or two sided, and the degrees of freedom (equal to
sample size minus number of parameters estimated).
• For a one-sided alternative hypothesis that β1 < 0, we reject the null hypothesis if
β̂1 /se( β̂1 ) < −1 times the critical value.
2. We can test any hypothesis of the form H0 : β1 = β Null by using ( β̂1 − β Null )/se( β̂1 ) as the
test statistic for a t test.
Review Questions
1. Refer to the results in Table 4.2 on page 95.
(a) What is the t statistic for the coefficient on change in income?
(b) What are the degrees of freedom?
(c) What is the critical value for a two-sided alternative hypothesis and α = 0.01? Do we
accept or reject the null?
(d) What is the critical value for a one-sided alternative hypothesis and α = 0.05? Do we
accept or reject the null?
2. Which is bigger: the critical value from one-sided tests or two-sided tests? Why?
3. Which is bigger: the critical value from a large sample or a small sample? Why?
4.3 p Values
The p value is a useful by-product of the hypothesis testing framework. It indicates the probability of observing a coefficient as extreme as we actually did if the null hypothesis were true. In this section, we explain how to calculate p values and why they're useful.

p value The probability of observing a coefficient as extreme as we actually observed if the null hypothesis were true.

As a practical matter, the thing to remember is that we reject the null if the p value is less than α. Our rule of thumb here is “small p value means reject”: low p values are associated with rejecting the null, and high p values are associated with failing to reject the null hypothesis.
Although p values can be calculated for any null hypothesis, we focus on the
most common null hypotheses in which β1 = 0. Most statistical software reports
a two-sided p value, which indicates the probability that a coefficient is larger in
magnitude (either positively or negatively) than the coefficient we observe.
Panel (a) of Figure 4.5 shows the p value calculation for the β̂1 estimate in the wage and height example we discussed on page 104. There the t statistic is 4.23 and the two-sided p value is 0.0000244; panel (b) shows a case in which the t statistic is 1.73 and the p value is 0.084.

[Figure 4.5: Two-sided p values as tail areas of the probability density of β̂1/se(β̂1). Panel (a): t statistic 4.23, p value 0.0000244. Panel (b): t statistic 1.73, p value 0.084.]
5 Here we are calculating two-sided p values, which are the output most commonly reported by statistical software. If β̂1/se(β̂1) is greater than zero, the two-sided p value is twice the probability of observing a value greater than that; if β̂1/se(β̂1) is less than zero, the two-sided p value is twice the probability of observing a value less than that. A one-sided p value is simply half the two-sided p value.
6 For a two-sided p value, we want to know the probability of observing a t statistic higher in magnitude than the absolute value of the t statistic we actually observe under the null hypothesis. This is 2 × [1 − Φ(|β̂1/se(β̂1)|)], where Φ is the capital Greek letter phi (pronounced like the “fi” in “Wi-Fi”) and Φ() indicates the normal cumulative density function (CDF). (We see the normal CDF in our discussion of statistical power in Section 4.4; Appendix G on page 543 supplies more details.) If the alternative hypothesis is HA : β1 > 0, the p value is the probability of observing a t statistic higher than the observed t statistic under the null hypothesis: 1 − Φ(β̂1/se(β̂1)). If the alternative hypothesis is HA : β1 < 0, the p value is the probability of observing a t statistic less than the observed t statistic under the null hypothesis: Φ(β̂1/se(β̂1)).
REMEMBER THIS
The p value is the probability of observing a coefficient as large in magnitude as actually observed if
the null hypothesis is true.
1. The lower the p value, the less consistent the estimated β̂1 is with the null hypothesis.
2. We reject the null hypothesis if the p value is less than α.
3. A p value can be useful to indicate the weight of evidence against a null hypothesis.
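The calculation in footnote 6 can be sketched in Python (the helper names are ours; `NormalDist().cdf` plays the role of Φ, using the normal approximation appropriate for large samples):

```python
from statistics import NormalDist

def two_sided_p(t_stat):
    # 2 * (1 - Phi(|t|)): twice the upper-tail area beyond the observed |t|
    return 2 * (1 - NormalDist().cdf(abs(t_stat)))

def one_sided_p(t_stat):
    # a one-sided p value is half the two-sided p value
    return two_sided_p(t_stat) / 2
```

A t statistic of 1.73 gives a two-sided p value of about 0.084, as in panel (b) of Figure 4.5, and a t statistic of 4.23 gives a p value well below 0.0001.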
4.4 Power
The hypothesis testing infrastructure we’ve discussed so far is designed to deal
with the possibility of Type I error, which occurs when we reject a null hypothesis
that is actually true. When we set the significance level, we are setting the
probability of making a Type I error. Obviously, we’d really rather not believe
the null is false when it is true.
Type II errors aren’t so hot either, though. We make a Type II error when
β is really something other than zero and we fail to reject the null hypothesis
that β is zero. In this section, we explain statistical power, the statistical concept
associated with Type II errors. We discuss the importance and meaning of Type II
error and how power and power curves help us understand our ability to avoid such
error.
[Figure: three sampling distributions of β̂1, centered on 1, 2, and 3, each with the critical value 2.32 marked. Panel (a): probability of rejecting the null is 0.09; probability of Type II error is 0.91. Panel (b): 0.37 and 0.63. Panel (c): 0.75 and 0.25.]
FIGURE 4.6: Statistical Power for Three Values of β1 Given α = 0.01 and a One-Sided Alternative Hypothesis
Suppose we test H0 : β1 = 0 against the one-sided alternative hypothesis HA : β1 > 0, with α = 0.01. In this case, the critical value is 2.32, which means that we reject the null hypothesis if we observe β̂1/se(β̂1) greater than 2.32. For simplicity, we'll suppose se(β̂1) is 1.
Panel (a) of Figure 4.6 displays the probability of Type II error if the true
value of β equals 1. In this case, the distribution of β̂1 will be centered at 1. Only
9 percent of this distribution is to the right of 2.32, meaning that we have only a
9 percent chance of rejecting the null hypothesis and a 91 percent chance of failing
to reject the null hypothesis. In other words, the probability of Type II error is
0.91.7 This means that even though the null hypothesis actually is false—remember,
β1 = 1 in this example, not 0—we have a roughly 9 in 10 chance of committing a
Type II error. In this example, our hypothesis test is not particularly able to provide
statistically significant results when the true value of β is 1.
Panel (b) of Figure 4.6 displays the probability of a Type II error if the true
value of β equals 2. In this case, the distribution of β̂1 will be centered at 2.
Here 37 percent of the distribution is to the right of 2.32, and therefore, we have
a 63 percent chance of committing a Type II error. Better, but not by much:
even though β1 > 0, we have a roughly 2 in 3 chance of committing a Type
II error.
Panel (c) of Figure 4.6 displays the probability of a Type II error if the true
value of β equals 3. In this case, the distribution of β̂1 will be centered at 3.
Here 75 percent of the distribution is to the right of 2.32, meaning there is a
25 percent probability of committing a Type II error. We’re making progress,
but still far from perfection. In other words, the true value of β must be near or
above 3 before we have a 75 percent chance of rejecting the null hypothesis when
we should.
These examples illustrate why we use the somewhat convoluted “fail to reject
the null” terminology. That is, when we observe a β̂1 less than the critical value,
it is still quite possible that the true value is not zero. Failure to find an effect is
not the same as finding no effect.
power The ability of our data to reject the null. A high-powered statistical test will reject the null with a very high probability when the null is false; a low-powered statistical test will reject the null with a low probability when the null is false.

An important statistical concept related to Type II error is power. The statistical definition of power differs from how we use the word in ordinary conversation. Power in the statistical sense refers to the ability of our data to reject the null hypothesis. A high-powered statistical test will reject the null with a very high probability when the null is false; a low-powered statistical test will reject the null with a low probability when the null is false. Think of statistical power like the power of a microscope. Using a high-powered microscope allows us to distinguish small differences in an object, differences that are there but invisible to us when we look through a low-powered microscope.

The logic of (statistical) power is pretty simple: power is 1 − Pr(Type II error)
for a given true value of β. A key characteristic of power is that it varies with the
true value of β. In the example in Figure 4.6, panel (a) shows that the power of the
test is 0.09 when β = 1. Panel (b) shows that the power rises to 0.37 when β = 2,
and panel (c) shows that the power is 0.75 when β = 3. Calculating power can be
a bit clunky; we leave the details to Section 14.3.
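The power values above can be reproduced directly from the normal CDF. A Python sketch (the function name is ours; it assumes se(β̂1) = 1 and the one-sided critical value of 2.32 from the example):

```python
from statistics import NormalDist

def power(beta1, se=1.0, critical=2.32):
    # Power = Pr(beta1-hat / se > critical) when beta1-hat is centered on the
    # true beta1, i.e., 1 - Phi(critical - beta1/se)
    return 1 - NormalDist().cdf(critical - beta1 / se)
```

Here power(1), power(2), and power(3) return roughly 0.09, 0.37, and 0.75, the values shown in panels (a) through (c) of Figure 4.6.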
Since we don’t know the true value of β (if we did, we would not need
hypothesis testing!), it is common to think about power for a range of possible true
power curve values. We can do this with a power curve, which characterizes the probability
Characterizes the
probability of rejecting 7
the null for each possible Calculating the probability of a Type II error follows naturally from the properties of the normal
distribution described in Appendix G. Using the notation from that appendix, Pr(Type II error) =
value of the parameter.
Φ(2.32 − 1) = 0.09. See also Section 14.3 for more detail on these kinds of calculations.
[Figure 4.7: Power curves: the probability of rejecting the null (vertical axis) for possible true values of β1 from 0 to 10 (horizontal axis), for se(β̂1) = 1.0 (solid line) and se(β̂1) = 2.0 (dashed line), with α = 0.01 and a one-sided alternative hypothesis.]
of rejecting the null for a range of possible values of the parameter of interest
(which is, in our case, β1 ). Figure 4.7 displays two power curves. The solid line on
top is the power curve for when se( β̂1 ) = 1.0 and α = 0.01. On the horizontal
axis are hypothetical values of β1 . The line shows the probability of rejecting
the null for a one-tailed test of H0 : β1 = 0 versus HA : β1 > 0 for α = 0.01
and a sample large enough to permit us to use the normal approximation to the
t distribution. To reject the null under these conditions requires a t stat greater
than 2.32 (see Table 4.4). This power curve plots for each possible value of β1
the probability that β̂1/se(β̂1) (which in this case is β̂1/1.0) is greater than 2.32. This
curve includes the values we calculated in Figure 4.6 but now also covers all
values of β1 between 0 and 10. We can see, for example, that the probability
of rejecting the null when β = 2 is 0.37, which is what we saw in panel (b) of
Figure 4.6.
Look first at the values of β1 that are above zero, but still small. For these
values, the probability of rejecting the null is quite small. In other words, even
though the null hypothesis is false for these values (since β1 > 0), we’re unlikely
to reject the null hypothesis that β1 = 0. As β1 increases, this probability increases,
and by around β1 = 4, the probability of rejecting the null approaches 1.0. That
is, if the true value of β1 is 4 or bigger, we will reject the null with almost
certainty.
The dashed line in Figure 4.7 displays a second power curve for which the
standard error is bigger, here equal to 2.0. The significance level is the same
as for the first power curve, α = 0.01. We immediately see that the statistical
power is lower. For every possible value of β1 , the probability of rejecting the null
hypothesis is lower than when se( β̂1 ) = 1.0 because there is more uncertainty with
the higher standard error for the estimate. For this standard error, the probability
of rejecting the null when β1 equals 2 is 0.09. So even though the null is false, we
will have a very low probability of rejecting it.8
Figure 4.7 illustrates an important feature of statistical power: the higher the
standard error of β̂1 , the lower the power. This implies that anything that increases
se( β̂1 ) (see page 65) will lower power. Since a major determinant of standard
errors is sample size, a useful rule of thumb is that hypothesis tests based on large
samples are usually high in power and hypothesis tests based on small samples
are usually low in power. In Figure 4.7, we can think of the solid line as the power
curve for a large sample and the dashed line as the power curve for a smaller
sample. More generally, though, statistical power is a function of the variance of
β̂1 and all the factors that affect it.
null result A finding in which the null hypothesis is not rejected.

Power is particularly relevant when someone presents a null result, or a finding in which the null hypothesis is not rejected. For example, someone may say class size is not related to test scores or that an experimental treatment does not work. In this case, we need to ask what the power of the test was. It could be, for example, that the sample size is very small, such that the probability of rejecting the null is small even for substantively large values of β1.
What can we do to increase power? If we can lower the standard errors of
our coefficients, we should do that, of course, but usually that’s not an option.
We could also choose a higher value of α, which determines our statistical
significance level. Doing so would make it easier to reject a null hypothesis. The
catch, though, is that doing so would also increase the probability of a Type I
error. In other words, there is an inherent trade-off between Type I and Type
II error.
Figure 4.8 illustrates this trade-off. Panel (a) shows the distribution of β̂1
when the null hypothesis that β1 = 0 is true. The distribution is centered at 0 (for
8 What happens when β1 actually is zero? In this case, the null hypothesis is true and power isn't the
right concept. Instead, the probability of rejecting the null here is the probability of rejecting the null
when it is true. In other words, the probability of rejecting the null when β1 = 0 is the probability of
committing a Type I error, which is the α level we set.
[Figure 4.8: Panel (a): the distribution of β̂1 when the null hypothesis is true, with the critical value of 2.32 for α = 0.01 marked and the Type I error region shaded. Panel (b): the distribution of β̂1 when β1 = 1, with the same critical value of 2.32 marked.]
simplicity, we use an example with the standard error of β̂1 = 1). We use α = 0.01
and a one-sided test, so the critical value is 2.32. The probability of a Type I error is
one percent, as highlighted in the figure. Panel (b) shows an example when the null
hypothesis is false—in this case, β1 = 1. We still use α = 0.01 and a one-sided test,
so the critical value remains 2.32. Here every realization of β̂1 to the left of 2.32
will produce a Type II error because for those realizations we will fail to reject the
null even though the null hypothesis is false. If we wanted to lower the probability
of a Type II error in panel (b), we could chose a higher value of α, which would
shift the critical value to the left (see Table 4.4 on page 103). A higher α would
also move the critical value in panel (a) to the left, increasing the probability of
a Type I error. If we wanted to lower the probability of a Type I error, we could
chose lower value of α, which would shift the critical value to the right in both
panels, lowering the probability of a Type I error in panel (a) but increasing the
probability of a Type II error in panel (b).
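The trade-off can be made concrete with a small calculation. This Python sketch (the helper `error_probs` is ours) uses a one-sided test, se(β̂1) = 1, and a true β1 = 1 as in panel (b):

```python
from statistics import NormalDist

nd = NormalDist()

def error_probs(alpha, true_beta=1.0, se=1.0):
    """Type I and Type II error probabilities for a one-sided test of H0: beta1 = 0."""
    critical = nd.inv_cdf(1 - alpha)           # critical value shifts left as alpha rises
    type2 = nd.cdf(critical - true_beta / se)  # Pr(fail to reject) when the null is false
    return alpha, type2                        # Type I error probability equals alpha

# Raising alpha from 0.01 to 0.05 raises Type I error but lowers Type II error:
strict, lenient = error_probs(0.01), error_probs(0.05)
```

Here the stricter test has Type I error 0.01 and Type II error of about 0.91, while the more lenient test has Type I error 0.05 and Type II error of about 0.74.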
4.5 Straight Talk about Hypothesis Testing 115
REMEMBER THIS
Statistical power refers to the probability of rejecting a null hypothesis for a given value of β1 .
1. A power curve shows the probability of rejecting the null for a range of possible values
of β1 .
2. Large samples typically produce high-power statistical tests. Small samples typically produce
low-power statistical tests.
3. It is particularly important to discuss power in the presentation of null results that fail to reject
the null hypothesis.
4. There is an inherent trade-off between Type I and Type II errors.
The p values we discussed previously are helpful in this, as are the confidence
intervals we’ll discuss shortly.
Fourth, hypothesis tests and their focus on statistical significance can distract us from substantive significance. A substantively significant coefficient is one that, well, matters; it indicates that the independent variable has a meaningful effect on the dependent variable. Deciding how big a coefficient must be for us to believe it matters can be a bit subjective. However, this is a conversation we need to have. And statistical significance is not always a good guide. Remember that t stats depend a lot on se(β̂1), and se(β̂1) in turn depends on sample size and other factors (see page 65). If we have a really big sample, and these days it is increasingly common to have sample sizes in the millions, the standard error will be tiny and our t stat might be huge even for a substantively trivial β̂1 estimate. In these cases, we may reject the null even when the β̂1 coefficient suggests a minor effect.

substantive significance If a reasonable change in the independent variable is associated with a meaningful change in the dependent variable, the effect is substantively significant. Some statistically significant effects are not substantively significant, especially for large data sets.

For example, suppose that in our height and wages example we last discussed on page 104 we had 20 million observations (instead of roughly 2,000 observations). The standard error on β̂1 would be one one-hundredth as big. So while a coefficient of 0.41 was statistically significant in the data we had, a coefficient of 0.004 would be statistically significant in the larger data set. Our results would suggest that an additional inch in height is associated with 0.4 cents per hour, which, while statistically significant, does not really matter that much. In other words, we could describe such an effect as statistically, but not substantively, significant. This is more likely to happen when we have large data sets, something that has become increasingly likely in an era of big data.
Or, conversely, we could have a small sample size that would lead to a large
standard error on β̂1 and, say, to a failure to reject the null. But the coefficient could
be quite big, suggesting a perhaps meaningful relationship. Of course we wouldn’t
want to rush to conclude that the effect is really big, but it’s worth appreciating
that the data in such a case is indicating the possibility of a substantively
significant relationship. In this instance, getting more data would be particularly
valuable.
REMEMBER THIS
Statistical significance is not the same as substantive significance.
[Figure 4.9: Probability densities illustrating the bounds of a confidence interval. Panel (a): the value of the upper bound of a 95% confidence interval is the value of β1 such that we would see the observed β̂1 or lower 2.5% of the time. Panel (b): the value of the lower bound is the value of β1 such that we would see the observed β̂1 or higher 2.5% of the time; if the true value of β1 is 0.214, we would see a β̂1 equal to or greater than 0.41 only 2.5% of the time.]
The 95 percent confidence interval in this example ranges from 0.214 to 0.606 and includes the values of β1 that plausibly generate
the β̂1 we actually observed.9
Figure 4.9 does not tell us how to calculate the upper and lower bounds of
a confidence interval. A confidence interval is β̂1 − critical value × se( β̂1 ) to
β̂1 + critical value × se( β̂1 ). For large samples and α = 0.05, the critical value
is 1.96, giving rise to the rule of thumb that a 95 percent confidence interval is
approximately β̂1 ± 2 × the standard error of β̂1 . In our example, where β̂1 = 0.41
and se( β̂1 ) = 0.1, we can be 95 percent confident that the true value is between
0.214 and 0.606.
Table 4.6 shows some commonly used confidence intervals for large sample
sizes. The large sample size allows us to use the normal distribution to calculate
9 Confidence intervals can also be defined with reference to random sampling. Just as an OLS coefficient estimate is random, so is a confidence interval. And just as a coefficient may randomly be far from the true value, so may a confidence interval fail to cover the true value. The point of confidence intervals is that it is unlikely that a confidence interval will fail to include the true value. For example, if we draw many samples from some population, 95 percent of the confidence intervals generated from these samples will include the true coefficient.
4.6 Confidence Intervals 119
TABLE 4.6 Commonly Used Confidence Intervals for Large Samples

Confidence level  Critical value  Confidence interval     Example (β̂1 = 0.41, se(β̂1) = 0.1)
90%               1.64            β̂1 ± 1.64 × se(β̂1)    0.41 ± 1.64 × 0.1 = 0.246 to 0.574
95%               1.96            β̂1 ± 1.96 × se(β̂1)    0.41 ± 1.96 × 0.1 = 0.214 to 0.606
99%               2.58            β̂1 ± 2.58 × se(β̂1)    0.41 ± 2.58 × 0.1 = 0.152 to 0.668
critical values. A 90 percent confidence interval for our example is 0.246 to 0.574.
The 99 percent confidence interval for a β̂1 = 0.41 and se( β̂1 ) = 0.1 is 0.152
to 0.668. Notice that the higher the confidence level, the wider the confidence
interval.
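The intervals in Table 4.6 follow directly from the formula. A Python sketch (the function name is ours; it uses normal critical values, appropriate for large samples):

```python
from statistics import NormalDist

def conf_interval(beta_hat, se, level=0.95):
    # beta1-hat +/- critical value * se(beta1-hat)
    critical = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return beta_hat - critical * se, beta_hat + critical * se
```

Here conf_interval(0.41, 0.1) returns roughly (0.214, 0.606), and raising level to 0.99 widens the interval to roughly (0.152, 0.668).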
Confidence intervals are closely related to hypothesis tests. Because confi-
dence intervals tell us the range of possible true values that are consistent with
what we’ve seen, we simply need to note whether the confidence interval on our
estimate includes zero. If it does not, zero was not a value that would be likely to
produce the data and estimates we observe; we can therefore reject H0 : β1 = 0.
Confidence intervals do more than hypothesis tests, though, because they
provide information on the likely location of the true value. If the confidence
interval is mostly positive but just barely covers zero, we would fail to reject the
null hypothesis; we would also recognize that the evidence suggests the true value
is likely positive. If the confidence interval does not cover zero but is restricted to a
region of substantively unimpressive values of β1 , we can conclude that while the
coefficient is statistically different from zero, it seems unlikely that the true value
is substantively important. Baicker and Chandra (2017) provide a useful summary:
“There is also a key difference between ‘no evidence of effect’ and ‘evidence of no
effect.’ The first is consistent with wide confidence intervals that include zero as
well as some meaningful effects, whereas the latter refers to a precisely estimated
zero that can rule out effects of meaningful magnitude.”
REMEMBER THIS
1. A confidence interval indicates a range of values in which the true value is likely to be, given
the data.
• The lower bound of a 95 percent confidence interval will be a value of β1 such that there is
less than a 2.5 percent probability of observing a β̂1 as high as the β̂1 actually observed.
• The upper bound of a 95 percent confidence interval will be a value of β1 such that there is
less than a 2.5 percent probability of observing a β̂1 as low as the β̂1 actually observed.
2. A confidence interval is calculated as β̂1 ± t critical value × se( β̂1 ), where the t critical value
is the critical value from the t table. It depends on the sample size and α, the significance level.
For large samples and α = 0.05, the t critical value is 1.96.
Conclusion
“Statistical inference” refers to the process of reaching conclusions based on the
data. Hypothesis tests, particularly t tests, are central to inference. They’re pretty
easy. Honestly, a well-trained parrot could probably do simple t tests. Look at the
damn t statistic! Is it bigger than 2? Then squawk “reject”; if not, squawk “fail to
reject.”
We can do much more, though. With p values and confidence intervals, we
can characterize our findings with some nuance. With power calculations, we
can recognize the likelihood of failing to see effects that are there. Taken as
a whole, then, these tools help us make inferences from our data in a sensible
way.
After reading and discussing this chapter, we should be able to do the
following:
• Section 4.6: Explain confidence intervals and the rule of thumb for
approximating a 95 percent confidence interval.
Further Reading
Ziliak and McCloskey (2008) provide a book-length attack on the hypothesis
testing framework. Theirs is hardly the first such critique, but it may be the
most fun.
An important, and growing, school of thought in statistics called Bayesian
statistics produces estimates of the following form: “There is an 8.2 percent
probability that β is less than zero.” Happily, there are huge commonalities across
Bayesian statistics and the approach used in this (and most other) introductory
books. Simon Jackman’s Bayesian Analysis for the Social Sciences (2009) is an
excellent guide to Bayesian statistics.
Key Terms
Alternative hypothesis (94)
Confidence interval (117)
Confidence levels (117)
Critical value (101)
Hypothesis testing (91)
Null hypothesis (92)
Null result (113)
One-sided alternative hypothesis (94)
p Value (106)
Point estimate (117)
Power (111)
Power curve (111)
Significance level (95)
Statistically significant (93)
Substantive significance (116)
t Distribution (99)
t Statistic (104)
t Test (98)
Two-sided alternative hypothesis (94)
Type I error (93)
Type II error (93)
Computing Corner
Stata
2. To find the critical value from a normal distribution for a given α, use the
inverse normal function in Stata. For a two-sided test with α = 0.05, type
display invnormal(1-0.05/2). For a one-sided test with α = 0.01,
type display invnormal(1-0.01).
10 This is referred to as an inverse t function because we provide a percent (the α) and it returns a
value of the t distribution for which α percent of the distribution is larger in magnitude. For a
non-inverse t function, we typically provide some value for t and the function tells us how much of
the distribution is larger in magnitude. The tail part of the function command indicates that we’re
dealing with the far ends of the distribution.
11 The ttail function in Stata reports the probability of a t distributed random variable being higher
than a t statistic we provide (which we denote here as TSTAT). This syntax contrasts to the convention
4. Use the following code to create a power curve for α = 0.01 and a
one-sided alternative hypothesis covering 71 possible values of the true
β1 from 0 to 7. We discuss calculation of power in Section 14.3.
set obs 71
gen BetaRange = (_n-1)/10 /* Sequence of possible betas from 0 to 7 */
scalar stderrorBeta = 1.0 /* Standard error of beta-hat */
gen PowerCurve = normal(BetaRange/stderrorBeta - 2.32)
/* Probability t statistic is greater than critical value */
/* for each value in BetaRange/stderrorBeta */
graph twoway (line PowerCurve BetaRange)
R

2. To find the critical value from a normal distribution for a given α, use the inverse normal function in R. For a two-sided test, type qnorm(1-a/2). For a one-sided test, type qnorm(1-a).
for normal distribution functions, which typically report the probability of being less than the t
statistic we provide.
2.5% 97.5%
(Intercept) 86.605 158.626
donuts 4.878 13.329
5. Use the following code to create a power curve for α = 0.01 and a
one-sided alternative hypothesis covering 71 possible values of the true
β1 from 0 to 7. We discuss calculation of power in Section 14.3.
Exercises
1. Persico, Postlewaite, and Silverman (2004) analyzed data from the
National Longitudinal Survey of Youth (NLSY) 1979 cohort to assess the
relationship between height and wages for white men. Here we explore the
relationship between height and wages for the full sample, which includes
men and women and all races. The NLSY is a nationally representative
sample of 12,686 young men and women who were 14 to 22 years old
when first surveyed in 1979. These individuals were interviewed annually
through 1994 and biannually after that. Table 4.7 describes the variables
from heightwage.dta we’ll use for this question.
(a) Create a scatterplot of adult wages against adult height. What does
this plot suggest about the relationship between height and wages?
TABLE 4.7 Variables for Height and Wage Data in the United States
Variable name Description
(c) Assess whether the null hypothesis that the coefficient on height81
equals 0 is rejected at the 0.05 significance level for one-sided and
for two-sided hypothesis tests.
(a) Suppose, as in the example, that only one sheep in the treatment
group died and all sheep in the control group died. Is the treatment
coefficient statistically significant? What is the (two-sided) p value?
What is the confidence interval?
(b) Suppose now that only one sheep in the treatment group died and
only 10 sheep in the control group died. Is the treatment coefficient
statistically significant? What is the (two-sided) p value? What is the
confidence interval?
(c) Continue supposing that only one sheep in the treatment group died.
What is the minimal number of sheep in the control group that need
to die for the treatment effect to be statistically significant? (Solve
by trial and error.)
3. Voters care about the economy, often more than any other issue. It is
not surprising, then, that politicians invariably argue that their party is
best for the economy. Who is right? In this exercise, we’ll look at the
U.S. economic and presidential party data in PresPartyEconGrowth.dta to
test if there is any difference in economic performance between Repub-
lican and Democratic presidents. We will use two different dependent
variables:
• ChangeGDPpc is the change in real per capita GDP in each year from
1962 to 2013 (in inflation-adjusted U.S. dollars, available from the
World Bank).
previous year was a Republican. The idea is that the president’s policies do
not take effect immediately, so the economic growth in a given year may
be influenced by who was president the year before.12
(c) Choose an α level and alternative hypothesis, and indicate for each
model above whether you accept or reject the null hypothesis.
(d) Explain in your own words what the p value means for the
LagDemPres variable in each model.
(e) Create a power curve for the model with ChangeGDPpc as the
dependent variable for α = 0.01 and a one-sided alternative hypoth-
esis. Explain what the power curve means by indicating what the
curve means for true β1 = 200, 400, and 800. Use the code in the
Computing Corner, but with the actual standard error of β̂1 from the
regression output.13
(f) Discuss the implications of the power curve for the interpretation of
the results for the model in which ChangeGDPpc was the dependent
variable.
4. Run the simulation code in the initial part of the education and salary
question from the Exercises in Chapter 3 (page 87).
12 Other ways of considering the question are addressed in the large academic literature on presidents
and the economy. See, among others, Bartels (2008), Campbell (2011), Comiskey and Marsh (2012),
and Blinder and Watson (2013).
13 In Stata, start with the following lines to create a list of possible true values of β1 and then set the
“stderrorBeta” variable to be equal to the actual standard error of β̂1 :
clear
set obs 201
gen BetaRange = 4*(_n-1) /* Sequence of true beta values from 0 to 800 */
Note: The first line clears all data; you will need to reload the data set if you wish to run additional
analyses. If you have created a syntax file, it will be easy to reload and re-run what you have done
so far.
In R, start with the code in the Computing Corner and set BetaRange = seq(0, 800, 4).
(b) Generate two-sided p values for the coefficient on education for each
simulation. What are the minimal and maximal values of these p
values?
(d) Re-run the simulations, but set the true value of βEducation to zero. Do
this for 500 simulations, and report what percent of time we reject
the null at the α = 0.05 level with a two-sided alternative hypothesis.
The code provided in Chapter 3 provides tips on how to do this.
5. We will continue the analysis of height and wages in Britain from the
Exercises in Chapter 3 (page 88).
(a) Estimate the model with income at age 33 as the dependent variable
and height at age 33 as the independent variable. (Exclude observa-
tions with wages above 400 British pounds per hour and height less
than 40 inches.) Interpret the t statistics on the coefficients.
(c) Show how to calculate the 95 percent confidence interval for the
coefficient on height.
(f) Limit the sample size to the first 800 observations. Do we accept or
reject the null hypothesis that β1 = 0 for α = 0.01 and a two-sided
alternative? Explain if/how/why this answer differs from the earlier
hypothesis test about β1 .14
14 In Stata, do this by adding & _n < 800 to the end of the “if” statement at the end of the “regress”
command. In R, create and use a new data set with the first 800 observations (e.g., dataSmall =
data[1:800,]).
5 Multivariate OLS: Where the Action Is
Salest = β0 + β1 Temperaturet + εt
where Salest is sales in billions of dollars during month t and Temperaturet is the
average temperature in month t. Figure 5.1 shows monthly data for New Jersey for
about 20 years. We’ve also added the fitted line from a bivariate regression. It’s
negative, implying that people shop less as temperatures rise.
Is that the full story? Could there be endogeneity? Is something correlated
with temperature and associated with more shopping? Think about shopping in the
United States. When is it at its most frenzied? Right before Christmas. Something
that happens in December . . . when it’s cold. In other words, we think there is
something in the error term (Christmas shopping season) that is correlated with
temperature. That’s a recipe for endogeneity.
In this chapter, we learn how to control for other variables so that we can avoid
(or at least reduce) endogeneity and thereby see causal associations more clearly.
Multivariate OLS is the tool that makes this possible. In our shopping example,
multivariate OLS helps us see that once we account for the December effect, higher temperatures are associated with higher sales.

multivariate OLS: OLS with multiple independent variables.

Multivariate OLS refers to OLS with multiple independent variables. We’re simply going to add variables to the OLS model developed in the previous
FIGURE 5.1: Monthly Retail Sales and Temperature in New Jersey from 1992 to 2013 [scatterplot; vertical axis: monthly retail sales (billions of $); horizontal axis: temperature]
chapters. What do we gain? Two things: bias reduction and precision. When we
reduce bias, we get more accurate parameter estimates because the coefficient
estimates are on average closer to the true value. When we increase precision, we
reduce uncertainty because the distribution of coefficient estimates is more closely
clustered toward the true value.
In this chapter, we explain how to use multivariate OLS to fight endogeneity.
Section 5.1 introduces the model and shows how controlling for multiple variables
can lead to better estimates. Section 5.2 discusses omitted variable bias, which
occurs when we fail to control for variables that affect Y and are correlated with
included variables. Section 5.3 shows how the omitted variable bias framework
can be used to understand what happens when we use poorly measured variables.
Section 5.4 explains the precision of our estimates in multivariate OLS. Section 5.5
demonstrates how standardizing variables can make OLS coefficients more
comparable. Section 5.6 shows formal tests of whether coefficients differ from
each other. The technique illustrated can be used for any hypothesis involving
multiple coefficients.
5.1 Using Multivariate OLS to Fight Endogeneity
FIGURE 5.2: Monthly Retail Sales and Temperature in New Jersey with December Indicated [two scatterplots; vertical axes: monthly retail sales (billions of $); horizontal axes: temperature; panel (a) marks December sales against other months; panel (b) shows December sales minus $5 billion against other months]
considered the relationship between temperature and sales. That is what we’ve
done in panel (b) of Figure 5.2, where each December observation is now $5
billion lower than before. When we look at the data this way, the negative
relationship between temperature and sales seems to go away; it may even be that
the relationship is now positive.
In essence, multivariate OLS nets out the effects of other variables when it
controls for additional variables. When we actually implement multivariate OLS,
we (or, really, computers) do everything at once, controlling for the December
effect while estimating the effect of temperature even as we are simultaneously
controlling for temperature while estimating the December effect.
Table 5.1 shows the results for both a bivariate and a multivariate model for
our sales data. In the bivariate model, the coefficient on temperature is –0.019. The
estimate is statistically significant because the t statistic is above 2. The implication
is that people shop less as it gets warmer, or in other words, folks like to shop in
the cold. When we use multivariate OLS to control for December (by including
the December variable that equals 1 for observations from the month of December
and 0 for all other observations), the coefficient on temperature becomes positive
and statistically significant. Our conclusion has flipped! Heat brings out the cash.
Whether this relationship exists because people like shopping when it’s warm or
are going out to buy swimsuits and sunscreen, we can’t say. We can say, though,
that our initial bivariate finding that people shop less as the temperature rises is
not robust to controlling for holiday shopping in December.
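The logic of this sign flip can be sketched in a short simulation. This is a hypothetical Python illustration with made-up parameters, not the book’s actual New Jersey data (the book’s own Computing Corner uses Stata and R):

```python
# Hypothetical simulation of the December effect (made-up numbers).
# Sales truly rise with temperature, but a large December boost occurs in a
# cold month, so the bivariate slope on temperature comes out negative.
import numpy as np

rng = np.random.default_rng(0)
n = 264                                    # 22 years of monthly observations
month = np.arange(n) % 12 + 1              # calendar month, 1..12
december = (month == 12).astype(float)     # dummy: 1 in December, else 0
# Seasonal temperature: coldest around January, warmest around July
temperature = 55 + 25 * np.sin((month - 4) * np.pi / 6) + rng.normal(0, 3, n)
sales = 5 + 0.02 * temperature + 6 * december + rng.normal(0, 0.5, n)

# Bivariate OLS: sales on temperature only
X_biv = np.column_stack([np.ones(n), temperature])
b_biv = np.linalg.lstsq(X_biv, sales, rcond=None)[0]

# Multivariate OLS: control for the December dummy
X_multi = np.column_stack([np.ones(n), temperature, december])
b_multi = np.linalg.lstsq(X_multi, sales, rcond=None)[0]

print(b_biv[1])    # negative: December shopping masquerades as a cold effect
print(b_multi[1])  # positive: near the true 0.02 once December is netted out
```

Because the December dummy is correlated with temperature (December is cold), leaving it out pushes the temperature coefficient negative; including it recovers the positive relationship.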
The way we interpret multivariate OLS regression coefficients is slightly
different from how we interpret bivariate OLS regression coefficients. We still
say that a one-unit increase in X is associated with a β̂1 increase in Y, but now
we need to add the phrase “holding constant the other factors in the model.”
TABLE 5.1 Bivariate and Multivariate Results for Retail Sales Data

       Bivariate   Multivariate
σ̂      1.82        1.07
R2     0.026       0.661
where Wagesi was the wages of men in the sample in 1996 and the adult height
was measured in 1985.
This is observational data, and the reality is that with such data, the bivariate
model is suspect. There are many ways something in the error term could be
correlated with the independent variable.
The authors of the height and wages study identified several additional
variables to include in the model, focusing in particular on one: adolescent height.
They reasoned that people who were tall as teenagers might have developed more
confidence and participated in more high school activities, and that this experience
may have laid the groundwork for higher wages later.
If teen height is actually boosting adult wages in the way the researchers sus-
pected, it is possible that the bivariate model with only adult height (Equation 5.1)
will suggest a relationship even though the real action was to be found between
adolescent height and wages. How can we tell what the real story is?
Multivariate OLS comes to the rescue. It allows us to simply “pull” adolescent
height out of the error term and into the model by including it as an additional
variable in the model. The model then becomes

Wagesi = β0 + β1 AdultHeighti + β2 AdolescentHeighti + εi
where β1 reflects the effect on wages of being one inch taller as an adult when
including adolescent height in the model and β2 reflects the effect on wages of
being one inch taller as an adolescent when adult height is included in the model.
The coefficients are estimated by using logic similar to that for bivariate
OLS. We’ll discuss estimation momentarily. For now, though, let’s concentrate
on the differences between bivariate and multivariate results. Both are presented
in Table 5.2. The first column shows the coefficient and standard error on β̂1
for the bivariate model with only adult height in the model; these are identical
to the results presented in Chapter 3 (page 75). The coefficient of 0.412 implies
that each inch of height is associated with an additional 41.2 cents per hour
in wages.
FIGURE 5.3: 95 Percent Confidence Intervals for Coefficients in Adult Height, Adolescent Height, and Wage Models [dot-and-whisker plot; rows for the adult height and adolescent height coefficients; horizontal axis runs from −1.0 to 1.0]
The second column shows results from the multivariate analysis; they tell
quite a different story. The coefficient on adult height is, at 0.003, essentially zero.
In contrast, the coefficient on adolescent height is 0.48, implying that controlling
for adult height, adult wages were 48 cents higher per hour for each inch taller
someone was when younger. The standard error on this coefficient is 0.19 with a
t statistic that is higher than 2, implying a statistically significant effect.
Figure 5.3 displays the confidence intervals implied by the coefficients and
their standard errors. The dots are placed at the coefficient estimate (e.g., 0.41 for
the coefficient on adult height in the bivariate model and 0.003 for the coefficient
on adult height in the multivariate model). The solid lines indicate the range of the
95 percent confidence interval. As discussed in Chapter 4 (page 119), confidence
intervals indicate the range of true values of β most consistent with the observed
estimate; they are calculated as β̂ ± 1.96 × se(β̂).
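As a quick Python check of this arithmetic, using the adolescent-height coefficient (0.48) and standard error (0.19) from the multivariate model:

```python
# 95 percent confidence interval computed as beta_hat ± 1.96 × se(beta_hat),
# using the adolescent-height coefficient and standard error from the text.
beta_hat = 0.48
se = 0.19
lower = beta_hat - 1.96 * se
upper = beta_hat + 1.96 * se
print(round(lower, 3), round(upper, 3))  # interval excludes zero
</ ```

Because the lower bound is above zero, the interval excludes zero, matching the conclusion in the text that the adolescent-height effect is unlikely to have arisen by chance.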
The confidence interval for the coefficient on adult height in the bivariate
model is clearly positive and relatively narrow, and it does not include zero.
However, the confidence interval for the coefficient on adult height includes zero
in the multivariate model. In other words, the multivariate model suggests that
the effect of adult height on wages is small or even zero when we control for
adolescent height. In contrast, the confidence interval for adolescent height is
positive, reasonably wide, and far from zero when we control for adult height.
These results suggest that the effect of adolescent height on wages is large and the
relationship we see is unlikely to have arisen simply by chance.
In this head-to-head battle of the two height variables, adolescent height wins:
the coefficient on it is large, and its confidence interval is far from zero. The
coefficient on adult height, however, is puny and has a confidence interval that
clearly covers zero. In other words, the multivariate model we have estimated is
telling us that being tall as a kid matters more than being tall as a grown-up. This
conclusion is quite thought provoking. It appears that the height premium in wages
does not reflect a height fetish by bosses. We’ll explore what’s going on in a bit
more detail shortly.
Yi = β0 + β1 X1i + β2 X2i + · · · + βk Xki + εi

where each X is another variable and k is the total number of independent variables.
Often a single variable, or perhaps a subset of variables, is of primary interest. We refer to the other independent variables as control variables, as these are included to control for factors that could affect the dependent variable and also could be correlated with the independent variables of primary interest. We should note here that control variables and control groups are different: a control variable is an additional variable we include in a model, while a control group is the group to which we compare the treatment group in an experiment.1

control variable: An independent variable included in a statistical model to control for some factor that is not the primary factor of interest.
The authors of the height and wage study argue that adolescent height in and
of itself was not causing increased wages. Their view is that adolescent height
translated into opportunities that provided skills and experience that increased
ability to get high wages later. They view increased participation in clubs and
sports activities as a channel for adolescent height to improve wage-increasing
human capital. In statistical terms, the claim is that participation in clubs and
athletics was a factor in the error term of a model with only adult height and
adolescent height. If either height variable turns out to be correlated with any of
the factors in the error term, we could have endogeneity.
With the right data, we can check the claim that the effect of adolescent
height on adult wages is due, at least in part, to the effect of adolescent height on
participation in developmentally helpful activities. In this case, the researchers had
measures of the number of clubs each person participated in (excluding athletics
and academic/honor society clubs), as well as a dummy variable that indicated
whether each person participated in high school athletics.
1 The control variable and control group concepts are related. In an experiment, a control variable is
set to be the same for all subjects of the experiment to ensure that the only difference between treated
and untreated groups is the experimental treatment. If we were experimenting on samples in petri
dishes, for example, we could treat temperature as a control variable. We would make sure that the
temperature is the same for all petri dishes used in the experiment. Hence, the control group has
everything similar to the treatment group except the treatment. In observational studies, we cannot
determine the values of other factors; we can, however, try to net out these other factors, such that
once we have taken them into account, the treated and untreated groups should be the same. In the
Christmas shopping example, the dummy variable for December is our control variable. The idea is
that once we net out the effect of Christmas on shopping patterns in the United States, retail sales
should differ based only on differences in the temperature. If we worry (as we should) that factors in
addition to temperature still matter, we should include other control variables until we feel confident
that the only remaining difference is due to the variable of interest.
ε̂i = Yi − Ŷi
   = Yi − (β̂0 + β̂1 X1i + β̂2 X2i + · · · + β̂k Xki )
Second, square the residuals (for the same reasons given on page 48):

ε̂i2 = (Yi − (β̂0 + β̂1 X1i + β̂2 X2i + · · · + β̂k Xki ))2
Multivariate OLS then finds the β̂’s that minimize the sum of the squared residuals
over all observations. We let computers do that work for us.
The name “ordinary least squares” (OLS) describes the process: ordinary
because we haven’t gotten to the fancy stuff yet, least because we’re minimizing
the deviations between fitted and actual values, and squares because there was
a squared thing going on in there. Again, it’s an absurd name. It’s like calling
a hamburger a “kill-with-stun-gun-then-grill-and-put-on-a-bun.” OLS is what
people call it, though, so we have to get used to it.
REMEMBER THIS
1. Multivariate OLS is used to estimate a model with multiple independent variables.
2. Multivariate OLS fights endogeneity by pulling variables from the error term into the estimated
equation.
3. As with bivariate OLS, the multivariate OLS estimation process selects β̂’s in a way that
minimizes the sum of squared residuals.
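The minimization in point 3 can be sketched numerically. This is a hypothetical Python illustration with made-up data (the book’s Computing Corner uses Stata and R): solve for the β̂’s, then verify that nudging any coefficient only increases the sum of squared residuals.

```python
# Sketch: multivariate OLS chooses the beta-hats that minimize the sum of
# squared residuals (SSR). Solve with np.linalg.lstsq, then check that a
# perturbed coefficient vector yields a larger SSR. Data are made up.
import numpy as np

rng = np.random.default_rng(1)
n = 200
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
Y = 1.0 + 2.0 * X1 - 0.5 * X2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), X1, X2])          # design matrix
beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]    # OLS estimates

def ssr(b):
    resid = Y - X @ b
    return float(resid @ resid)

best = ssr(beta_hat)
perturbed = ssr(beta_hat + np.array([0.1, 0.0, 0.0]))
print(best < perturbed)  # True: deviating from the OLS solution raises SSR
```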
Discussion Questions
1. Mother Jones magazine blogger Kevin Drum (2013a, b, c) offers the following scenario.
Suppose we gathered records of a thousand school children aged 7 to 12, used a bivariate
model, and found that heavier kids scored better on standardized math tests.
(a) Based on these results, should we recommend that kids eat lots of potato chips and french
fries if they want to grow up to be scientists?
(b) Write down a model that embodies Drum’s scenario.
(c) Propose additional variables for this model.
(d) Would inclusion of additional controls bolster the evidence? Would doing so provide
definitive proof?
2. Researchers from the National Center for Addiction and Substance Abuse at Columbia
University (2011) suggest that time spent on Facebook and Twitter increases risks of smoking,
drinking, and drug use. They found that compared with kids who spent no time on social
networking sites, kids who visited the sites each day were five times likelier to smoke cigarettes,
three times more likely to drink alcohol, and twice as likely to smoke pot. The researchers
argue that kids who use social media regularly see others engaged in such behaviors and then
emulate them.
(a) Write down the model implied by the description of the Columbia study, and discuss the
factors in the error term.
5.2 Omitted Variable Bias
(b) What specifically has to be true about these factors for their omission to cause bias?
Discuss whether these conditions will be true for the factors you identify.
(c) Discuss which factors could be measured and controlled for and which would be difficult
to measure and control for.
3. Suppose we are interested in knowing the relationship between hours studied and scores on a
Spanish exam.
(a) Suppose some kids don’t study at all but ace the exam, leading to a bivariate OLS result
that studying has little or no effect on the score. Would you be convinced by these results?
(b) Write down a model, and discuss your answer to (a) in terms of the error term.
(c) What if some kids speak Spanish at home? Discuss implications for a bivariate model
that does not include this factor and a multivariate model that controls for this factor.
Yi = β0 + β1 X1i + β2 X2i + νi     (5.4)

We assume (for now) that the error in this true model, νi , is uncorrelated with X1i
and X2i . (The Greek letter ν, or nu, is pronounced “new”—even though it looks
like a v.) As usual with multivariate OLS, the β1 parameter reflects how much
higher Yi would be if we increased X1i by one; β2 reflects how much higher Yi
would be if we increased X2i by one.
What happens if we omit X2 and estimate the following model?
Yi = β0^OmitX2 + β1^OmitX2 X1i + εi     (5.5)

where β1^OmitX2 indicates the coefficient on X1i we get when we omit variable X2 from the model. While we used νi to refer to the error term in Equation 5.4, we use a different letter (which happens to be εi ) in Equation 5.5 because the error now includes νi and β2 X2i .
How close will β̂1^OmitX2 be to β1 in Equation 5.4? In other words, will β̂1^OmitX2 be an unbiased estimator of β1 ? Or, in English, will our estimate of the effect of X1
suck if we omit X2 ? We ask questions like this every time we analyze observational
data.
It’s useful to first characterize the relationship between the two independent variables, X1 and X2 . To do this, we use an auxiliary regression equation. An auxiliary regression is a regression that is not directly the one of interest but yields information helpful in analyzing the equation we really care about. In this case, we can assess how strongly X1 and X2 are related by means of the equation

X2i = δ0 + δ1 X1i + τi     (5.6)

auxiliary regression: A regression that is not directly the one of interest but yields information helpful in analyzing the equation we really care about.
where δ0 (“delta”) and δ1 are coefficients for this auxiliary regression and τi (“tau,”
rhymes with what you say when you stub your toe) is how we denote the error term
(which acts just like the error term in our other equations, but we want to make it
clear that we’re dealing with a different equation). We assume τi is uncorrelated
with νi and X1 .
This equation for X2i is not based on a causal model. Instead, we are using a
regression model to indicate the relationship between the included variable (X1 )
and the excluded variable (X2 ). If δ1 = 0, then X1 and X2 are not related. If δ1 ≠ 0, then X1 and X2 are related.
If we substitute the equation for X2i (Equation 5.6) into the main equation (Equation 5.4), then do some rearranging and a bit of relabeling, we get

Yi = (β0 + β2 δ0 ) + (β1 + β2 δ1 )X1i + εi
where β1 and β2 come from the main equation (Equation 5.4) and δ1 comes from
the equation for X2i (Equation 5.6).2
Given our assumption that τ and ν are not correlated with any independent variable, we can use our bivariate OLS results to know that β̂1^OmitX2 will be distributed normally with a mean of β1 + β2 δ1 . In other words, when we omit X2 , the distribution of the estimated coefficient on X1 will be skewed away from β1 by β2 δ1 . This is omitted variable bias.
In other words, when we omit X2 , the coefficient on X1 , which is β1^OmitX2 , will pick up not only β1 , which is the effect of X1 on Y, but also β2 , which is the effect of the omitted variable X2 on Y. The extent to which β1^OmitX2 picks up the effect of X2 depends on δ1 , which characterizes how strongly X2 and X1 are related.

omitted variable bias: Bias that results from leaving out a variable that affects the dependent variable and is correlated with the included independent variable.
2 Note that in the derivation, we replace β2 τi + νi with εi . If, as we’re assuming here, τi and νi are
uncorrelated with each other and uncorrelated with X1 , then errors of the form β2 τi + νi will also be
uncorrelated with each other and uncorrelated with X1 .
3 We derive this result more formally on page 502.
always) goes down when we add variables that explain the dependent variable.
We’ll discuss a major exception in Chapter 7: bias can increase when we add a
so-called post-treatment variable.
REMEMBER THIS
1. Two conditions must both be true for omitted variable bias to occur:
(a) The omitted variable affects the dependent variable.
• Mathematically: β2 ≠ 0 in Equation 5.4.
• An equivalent way to state this condition is that X2i really should have been in
Equation 5.4 in the first place.
(b) The omitted variable is correlated with the included independent variable.
• Mathematically: δ1 ≠ 0 in Equation 5.6.
• An equivalent way to state this condition is that X2i needs to be correlated with X1i .
2. Omitted variable bias is more complicated in models with more independent variables, but the
main intuition applies.
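A short simulation can confirm that omitting X2 centers β̂1^OmitX2 on β1 + β2 δ1 rather than on β1. This is a hypothetical Python sketch with made-up parameter values (the book’s own code examples are in Stata and R):

```python
# Sketch of omitted variable bias. True model: Y = b0 + b1*X1 + b2*X2 + nu,
# with auxiliary relationship X2 = d0 + d1*X1 + tau. Regressing Y on X1
# alone should recover roughly b1 + b2*d1, not b1.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
b0, b1, b2 = 1.0, 2.0, 3.0      # true model coefficients (made up)
d0, d1 = 0.5, 0.8               # auxiliary regression linking X2 to X1

X1 = rng.normal(size=n)
X2 = d0 + d1 * X1 + rng.normal(size=n)            # tau is the last term
Y = b0 + b1 * X1 + b2 * X2 + rng.normal(size=n)   # nu is the last term

# Omit X2: bivariate regression of Y on X1 only
X = np.column_stack([np.ones(n), X1])
b_omit = np.linalg.lstsq(X, Y, rcond=None)[0]

print(b_omit[1])   # close to b1 + b2*d1 = 4.4, far from the true b1 = 2.0
```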
The data is structured such that even though information on the economic
growth in these countries for each year is available, we are looking only at the
average growth rate across the 40 years from 1960 to 2000. Thus, each country
gets only a single observation. We control for GDP per capita in 1960 because of
a well-established phenomenon in which countries that were wealthier in 1960
have a slower growth rate. The poor countries simply have more capacity to grow
N     50     50
σ̂     1.13   0.72
R2    0.36   0.74
[Figure 5.4: scatterplots; panels (a) and (b) plot average economic growth (in %); panel (c) plots average test scores against average years of schooling]
Could the real story be that test scores, not years in school, explain growth? If
so, why is there a significant coefficient on average years of schooling in the first column
of Table 5.3? We know the answer: omitted variable bias. As discussed on page 137,
if we omit a variable that matters (and we suspect that test scores matter), the
estimate of the effect of the variable that is included will be biased if the omitted
variable is correlated with the included variable. To address this issue, look at panel
(c) of Figure 5.4, a scatterplot of average test scores and average years of schooling.
5.3 Measurement Error 143
Yes, indeed, these variables look quite correlated, as observations with high years
of schooling also tend to be accompanied by high test scores. Hence, the omission
of test scores could be problematic.
It therefore makes sense to add test scores to the model, as in the right-hand
column of Table 5.3. The coefficient on average years of schooling here differs
markedly from before. It is now very close to zero. The coefficient on average
test scores, on the other hand, is 1.97 and statistically significant, with a t statistic
of 8.28.
Because the scale of the test score variable is not immediately obvious, we
need to do a bit of work to interpret the substantive significance of the coefficient
estimate. Based on descriptive statistics (not reported), the standard deviation of
the test score variable is 0.61. The results therefore imply that increasing average
test scores by a standard deviation is associated with an increase of 0.61 × 1.97 =
1.20 percentage points in the average annual growth rate over these
40 years. This increase is large when we are talking about growth compounding
over 40 years.4
Notice the very different story we have across the two columns. In the first
one, years of schooling is enough for economic growth. In the second specification,
quality of education, as measured with math and science test scores, matters
more. The second specification is better because it shows that a theoretically
sensible variable matters a lot. Excluding this variable, as the first specification does,
risks omitted variable bias. In short, these results suggest that education is about
quality, not quantity. High test scores explain economic growth better than years
in school. Crappy schools do little; good ones do a lot. These results don’t end the
conversation about education and economic growth, but they do move it ahead a
few more steps.
4 Since the scale of the test score variable is different from the years in school variable, we cannot
directly compare the two coefficients. Sections 5.5 and 5.6 show how to make such comparisons.
some overreport and some underreport). And many, perhaps even most, variables
could have error. Just think how hard it would be to accurately measure spending
on education or life expectancy or attitudes toward Justin Bieber in an entire
country.
We keep things simple here by assuming that the measurement error (νi ) has a
mean of zero and is uncorrelated with the true value.
Notice that we can rewrite X1i∗ as the observed value (X1i ) minus the measurement error:

X1i∗ = X1i − νi

Substitute for X1i∗ in the true model, do a bit of rearranging, and we get

Yi = β0 + β1 (X1i − νi ) + εi
   = β0 + β1 X1i − β1 νi + εi     (5.9)
right? If we could observe it, we would fix our darn measure of X1 . So what we
do is treat the measurement error as an unobserved variable that by definition
we must omit; then we can see how this particular form of omitted variable bias
plays out. Unlike the case of a generic omitted variable bias problem, we know
two things that allow us to be more specific than in the general omitted variable case: the coefficient on the omitted term (νi ) is −β1 , and νi relates to X1 as in Equation 5.8.
We go step by step through the logic and math in Chapter 14 (page 508). The
upshot is that as the sample size gets very large, the estimated coefficient when
the independent variable is measured with error is
plim β̂1 = β1 · σ2X1∗ / (σ2ν + σ2X1∗ )
REMEMBER THIS
1. Measurement error in the dependent variable does not bias β̂ coefficients but does increase the
variance of the estimates.
2. Measurement error in an independent variable causes attenuation bias. That is, when X1 is
measured with error, β̂1 will generally be closer to zero than it should be.
• The attenuation bias is a consequence of the omission of the measurement error from the
estimated model.
• The larger the measurement error, the larger the attenuation bias.
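The attenuation result can be checked in a short simulation. This is a hypothetical Python sketch with made-up variances (not code from the book):

```python
# Sketch of attenuation bias. X1_star is the true regressor; we observe it
# with error nu. The estimated slope shrinks by roughly
# var(X1_star) / (var(nu) + var(X1_star)), matching the plim formula.
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
beta0, beta1 = 1.0, 2.0
var_xstar, var_nu = 1.0, 1.0    # equal signal and noise variances (made up)

X_star = rng.normal(0, np.sqrt(var_xstar), n)        # true values
X_obs = X_star + rng.normal(0, np.sqrt(var_nu), n)   # mismeasured regressor
Y = beta0 + beta1 * X_star + rng.normal(size=n)

X = np.column_stack([np.ones(n), X_obs])
b_hat = np.linalg.lstsq(X, Y, rcond=None)[0]

shrink = var_xstar / (var_nu + var_xstar)   # = 0.5 here
print(b_hat[1])   # close to beta1 * shrink = 1.0, well below the true 2.0
```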
var(β̂j ) = σ̂2 / (N · var(Xj ) · (1 − R2j ))     (5.10)
This equation is similar to the equation for variance of β̂1 in bivariate OLS
(Equation 3.9, page 62). The new bit relates to the (1 − R2j ) in the denominator.
Before elaborating on R2j , let’s note the parts from the bivariate variance equation
that carry through to the multivariate context.
• In the numerator, we see σ̂2 , which means that the higher the variance of the regression, the higher the variance of the coefficient estimate.
5.4 Precision and Goodness of Fit
• In the denominator, we see the sample size, N. As for the bivariate model,
more data makes the denominator bigger, which makes var(β̂j ) smaller. In other words, more data means more precise estimates.
• The greater the variation of Xj (as measured by Σi=1..N (Xij − X̄j )2 / N, which is var(Xj ) for large samples), the bigger the denominator will be. The bigger the denominator, the smaller var(β̂j ) will be.
Multicollinearity
The new element in Equation 5.10 compared to the earlier variance equation is the
(1 − R2j ). Notice the j subscript. We use the subscript to indicate that R2j is the
R2 from an auxiliary regression in which Xj is the dependent variable and all
the other independent variables in the full model are the independent variables
in the auxiliary model. The R2 without the j is still the R2 for the main equation,
as discussed on page 72.
There is a different R2j for each independent variable. For example, if our
model is
5 We discuss experiments in their real-world form in Chapter 10.
These R2j tell us how much the other variables explain Xj . If the other
variables explain Xj very well, the R2j will be high and—here’s the key insight—the
denominator will be smaller. Notice that the denominator of the equation for
var(β̂ j ) has (1 − R2j ). Remember that R2 is always between 0 and 1, so as R2j gets
bigger, 1−R2j gets smaller, which in turn makes var(β̂ j ) bigger. The intuition is that
if variable Xj is virtually indistinguishable from the other independent variables,
it should in fact be hard to tell how much that variable affects Y, and we will
therefore have a larger var(β̂ j ).
In other words, when an independent variable is highly related to other independent variables, the variance of the coefficient we estimate for that variable will be high. We use a fancy term, multicollinearity, to refer to situations in which independent variables have strong linear relationships. The term comes from “multi” for multiple variables and “co-linear” because they vary together in a linear fashion. The polysyllabic jargon should not hide a simple fact: The variance of our estimates increases when an independent variable is closely related to other independent variables.
The term 1/(1 − R2j ) is referred to as the variance inflation factor (VIF). It measures how much variance is inflated owing to multicollinearity relative to a case in which there is no multicollinearity.
It’s really important to understand what multicollinearity does and does not do. It does not cause bias. It doesn’t even cause the standard errors of β̂1 to be incorrect. It simply causes the standard errors to be bigger than they would be if there were no multicollinearity. In other words, OLS is on top of the whole multicollinearity thing, producing estimates that are unbiased with appropriately calculated uncertainty. It’s just that when variables are strongly related to each other, we’re going to have more uncertainty—that is, the distributions of β̂1 will be wider, meaning that it will be harder to learn from the data.

multicollinearity: Variables are multicollinear if they are correlated. The consequence of multicollinearity is that the variance of β̂1 will be higher than it would have been in the absence of multicollinearity. Multicollinearity does not cause bias.

variance inflation factor: A measure of how much variance is inflated owing to multicollinearity.

What, then, should we do about multicollinearity? If we have a lot of data,
our standard errors may be small enough to allow reasonable inferences about
the coefficients on the collinear variables. In that case, we do not have to do
anything. OLS is fine, and we’re perfectly happy. Both our empirical examples in
this chapter are consistent with this scenario. In the height and wages analysis in
Table 5.2, adult height and adolescent height are highly correlated (we don’t report
it in the table, but the two variables are correlated at 0.86, which is a very strong
correlation). And yet, the actual effects of these two variables are so different that
we can parse out their differential effects with the amount of data we have. In
the education and economic growth analysis in Table 5.3, the years of school and
test score variables are correlated at 0.81 (not reported in the table). And yet, the
effects are different enough to let us parse out the differential effects of these two
variables with the data we have.
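A sketch of computing R2j and the VIF for a pair of collinear variables may help fix ideas. This is a hypothetical Python example; the variable names and the strength of the relationship are made up:

```python
# Sketch: the variance inflation factor for X_j is 1 / (1 - R2_j), where
# R2_j comes from the auxiliary regression of X_j on the other independent
# variables. Here X1 and X2 are strongly (but not perfectly) related.
import numpy as np

rng = np.random.default_rng(4)
n = 5_000
X1 = rng.normal(size=n)
X2 = 0.9 * X1 + rng.normal(0, 0.5, n)   # X2 strongly related to X1

# Auxiliary regression of X1 on X2 to get R2_1
A = np.column_stack([np.ones(n), X2])
coef = np.linalg.lstsq(A, X1, rcond=None)[0]
resid = X1 - A @ coef
r2_j = 1 - (resid @ resid) / ((X1 - X1.mean()) @ (X1 - X1.mean()))
vif = 1 / (1 - r2_j)
print(vif)   # well above 1: variance inflated by multicollinearity
```

With these made-up numbers the auxiliary R2j is around 0.76, so the variance of β̂1 is inflated several-fold relative to a world with no multicollinearity.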
If we have substantial multicollinearity, however, we may get very large
standard errors on the collinear variables, preventing us from saying much about
any one variable. Some are tempted in such cases to drop one or more of the highly
multicollinear variables and focus only on the results for the remaining variables.
This isn’t quite fair, however, since we may not have solid evidence to indicate
which variables we should drop and which we should keep. A better approach is
to be honest: we should just say that the collinear variables taken as a group seem
to matter or not and that we can’t parse out the individual effects of these variables.
For example, suppose we are interested in predicting undergraduate grades as
a function of two variables: scores from a standardized math test and scores from a
standardized verbal reasoning test. Suppose also that these test score variables are
highly correlated and that when we run a model with both variables as independent
variables, both are statistically insignificant in part because the standard errors
will be very high owing to the high R2j values. If we drop one of the test scores,
the remaining test score variable may be statistically significant, but it would be
poor form to believe, then, that only that test score affected undergraduate grades.
Instead, we should use the tools we present later (Section 5.6, page 158), which
allow us to assess whether both variables taken together explain grades. At that
point, we may be able to say that we know standardized test scores matter, but
we cannot say much about the relative effect of math versus verbal test scores. So
even though it would be more fun to say which test score matters, the statistical
evidence to justify the statement may simply not be there.
perfect multicollinearity Occurs when an independent variable is completely explained by other independent variables.

A lethal dose of multicollinearity, called perfect multicollinearity, occurs when an independent variable is completely explained by other independent variables. If this happens, R2j = 1, and var(β̂1) blows up because (1 − R2j) is in the denominator (in the sense that the denominator becomes zero, which is a big no-no). In this case, statistical software either will refuse to estimate the model or will automatically delete enough independent variables to extinguish perfect multicollinearity. A silly example of perfect multicollinearity is including the same variable twice in a model.
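The "software refuses to estimate" behavior comes straight from the algebra: with a duplicated regressor, the cross-product matrix that OLS must invert is singular. A minimal sketch (the chapter's Computing Corner uses Stata and R; Python is used here only to make the arithmetic concrete, with made-up data):

```python
# Toy illustration of perfect multicollinearity: include the same variable twice.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = x1[:]  # "second" regressor is an exact copy of the first

# Cross-products that enter the (2x2) X'X matrix for these two regressors.
sxx = sum(a * a for a in x1)
sxy = sum(a * b for a, b in zip(x1, x2))
syy = sum(b * b for b in x2)

det = sxx * syy - sxy * sxy  # determinant of the X'X block
print(det)  # 0.0
```

A determinant of zero means the normal equations have no unique solution, which is why statistical packages drop one of the offending variables rather than report coefficients.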
Goodness of fit
Let’s talk about the regular old R2 , the one without a j subscript. As with the R2 for
a bivariate OLS model, the R2 for a multivariate OLS model measures goodness
of fit and is the square of the correlation of the fitted values and actual values (see
Section 3.7).6 As before, it can be interesting to know how well the model explains
the dependent variable, but this information is often not particularly useful. A good
model can have a low R2 , and a biased model can have a high R2 .
There is one additional wrinkle for R2 in the multivariate context. Adding
a variable to a model necessarily makes the R2 go up, at least by a tiny bit.
To see why, notice that OLS minimizes the sum of squared errors. If we add a
new variable, the fit cannot be worse than before because we can simply set the
coefficient on this new variable to be zero, which is equivalent to not having the
variable in the model in the first place. In other words, every time we add a variable
to a model, we do no worse and, as a practical matter, do at least a little better
even if the variable doesn’t truly affect the dependent variable. Just by chance,
estimating a non-zero coefficient on this variable will typically improve the fit
for a couple of observations. Hence, R2 always is the same or larger as we add
variables.
6 The model needs to have a constant term for this interpretation to work—and for R2 to be sensible.
150 CHAPTER 5 Multivariate OLS: Where the Action Is
Review Questions
1. How much will other variables explain Xj when Xj is a randomly assigned treatment?
Approximately what will R2j be?
2. Suppose we are designing an experiment in which we can determine the value of all independent
variables for all observations. Do we want the independent variables to be highly correlated or
not? Why or why not?
7 Our earlier discussion was about the regular R2 , but it also applies to any R2 (from the main
equation or an auxiliary equation). R2 goes up as the number of variables increases.
5.4 Precision and Goodness of Fit 151
REMEMBER THIS
1. If errors are not correlated with each other and are homoscedastic, the variance of the β̂ j
estimate is
var(β̂ j ) = σ̂ 2 / (N × var(Xj )(1 − R2j ))
(a) Model fit: The better the model fits, the lower σ̂ 2 and var(β̂ j ) will be.
(b) Sample size: The more observations, the lower var(β̂ j ) will be.
(c) Variation in X: The more the Xj variable varies, the lower var(β̂ j ) will be.
(d) Multicollinearity: The less the other independent variables explain Xj , the lower R2j and
var(β̂ j ) will be.
3. Independent variables are multicollinear if they are correlated.
(a) The variance of β̂1 is higher when there is multicollinearity than when there is no
multicollinearity.
(b) Multicollinearity does not bias β̂1 estimates.
(c) The se( β̂1 ) produced by OLS accounts for multicollinearity.
(d) An OLS model cannot be estimated when there is perfect multicollinearity—that is,
when an independent variable is perfectly explained by one or more of the other
independent variables.
4. Inclusion of irrelevant variables occurs when variables that do not affect Y are included in a
model.
(a) Inclusion of irrelevant variables causes the variance of β̂1 to be higher than if the
variables were not included.
(b) Inclusion of irrelevant variables does not cause bias.
5. The variance of β̂ j is more complicated when errors are correlated or heteroscedastic,
but the intuitions about model fit, sample size, variance of X, and multicollinearity still
apply.
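The trade-offs in the variance formula above can be made concrete with a small plug-in calculation. A sketch in Python (the book's Computing Corner uses Stata and R; every number below is hypothetical, chosen only for illustration):

```python
# Plug-in illustration of var(beta_j) = sigma^2 / (N * var(X_j) * (1 - R^2_j)).
def var_beta(sigma2_hat, n, var_x, r2_j):
    """Variance of an OLS slope estimate under homoscedastic, uncorrelated errors."""
    return sigma2_hat / (n * var_x * (1 - r2_j))

# Same hypothetical fit, sample size, and var(X_j); only multicollinearity differs.
base = var_beta(sigma2_hat=4.0, n=100, var_x=2.0, r2_j=0.0)
collinear = var_beta(sigma2_hat=4.0, n=100, var_x=2.0, r2_j=0.75)

print(round(collinear / base, 2))  # 4.0
```

Only the (1 − R2j) term changes between the two calls, so the ratio is exactly the variance inflation factor, 1/(1 − 0.75) = 4.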
8 This example is based on La Porta, Lopez-de-Silanes, Pop-Eleches, and Shleifer (2004).
Measurement of abstract concepts like human rights and judicial independence is not simple. See
Harvey (2011) for more details.
[Table 5.4, remaining rows — two specifications: without and with Democracy]
Democracy              —        24.93∗ (2.77) [t = 9.01]
N                      63       63
σ̂                     17.6     11.5
R2                     0.47     0.78
R2Judicial ind.        —        0.153
R2Log GDP              —        0.553
R2Democracy            —        0.552
Before we discuss what Harvey found, let’s think about what would have to be
true if omitting a measure of democracy is indeed causing bias under our conditions
given on page 140. First, the level of democracy in a country actually needs to affect
the dependent variable, human rights (this is the β2 ≠ 0 condition). Is that true here?
Very plausibly. We don’t know beforehand, of course, but it certainly seems possible
that torture tends not to be a great vote-getter. Second, democracy needs to be
correlated with the independent variable of interest, which in this case is judicial
independence. This we know is almost certainly true: democracy and judicial
independence definitely seem to go together in the modern world. In Harvey’s data,
democracy and judicial independence correlate at 0.26: not huge, but not nuthin’.
Therefore, we have a legitimate candidate for omitted variable bias.
The right-hand column of Table 5.4 shows that Harvey’s intuition was right.
When the democracy measure is added, the coefficients on both judicial indepen-
dence and GDP per capita fall precipitously. The coefficient on democracy, however,
is 24.93, with a t statistic of 9.01, a highly statistically significant estimate.
Statistical significance is not the same as substantive significance, though. So
let’s try to interpret our results in a more meaningful way. If we generate descriptive
statistics for our dependent variable, the human rights measure, we see that it ranges from
17 to 99, with a mean of 67 and a standard deviation of 24. Doing the same for the
democracy variable indicates a range of 0 to 2 with a mean of 1.07 and a standard
deviation of 0.79. A coefficient of 24.93 implies that a change in the democracy
measures of one standard deviation is associated with a 24.93 × 0.79 = 19.7 unit
increase on the human rights scale. Given that the standard deviation of
the dependent variable is 24, this is a pretty sizable association between democracy
and human rights.9
This is a textbook example of omitted variable bias.10 When democracy is not
accounted for, judicial independence is strongly associated with human rights.
When democracy is accounted for, however, the effect of judicial independence
fades to virtually nothing. And this is not just about statistics. How we view the
world is at stake, too. The conclusion from the initial model was that courts protect
human rights. The additional analysis suggests that democracy protects human
rights.
The example also highlights the somewhat provisional nature of social scien-
tific conclusions. Someone may come along with a variable to add or another way
to analyze the same data that will change our conclusions. That is the nature of the
social scientific process. We do the best we can, but we leave room (sometimes a
little, sometimes a lot) for a better way to understand what is going on.
Table 5.4 also includes some diagnostics to help us think about multicollinear-
ity, for surely such factors as judicial independence, democracy, and wealth are
correlated. Before looking at specific diagnostics, though, we should note that
collinearity of independent variables does not cause bias. It doesn’t even cause
the variance equation to be wrong. Instead, multicollinearity simply causes the
variance to be higher than it would be without collinearity among the independent
variables.
Toward the bottom of the table we see that R2Judicialind. is 0.153. This
value is the R2 from an auxiliary regression in which judicial independence is the
dependent variable and the GDP and democracy variables are the independent
variables. This value isn’t particularly high, and if we plug it into the equation for
the VIF, which is just the part of the variance of β̂ j associated with multicollinearity,
we see that the VIF for the judicial independence variable is 1/(1 − R2j ) = 1/(1 − 0.153) = 1.18. In other words, the variance of the coefficient on the judicial independence variable is 1.18 times larger than it would have been if the judicial independence variable were completely uncorrelated with the other independent variables in the model.
9 Determining exactly what is a substantively large effect can be subjective. There’s no rule book on
what is “large.” Those who have worked in a substantive area for a long time often get a good sense
of what effects qualify as “large.” An effect might be considered large if it is larger than the effect of
other variables that people think are important. Or an effect might be considered large if we know
that the benefit is estimated to be much higher than the cost. In the human rights case, we can get a
sense of what a 19.7 unit change in the human rights scale means by looking at pairs of countries that
differed by around 20 points on that scale. For example, Pakistan was 22 points higher than North
Korea. Decide if it would make a difference to vacation in North Korea or Pakistan. If it would make
a difference, then 19.7 is a large difference; if not, then it’s not.
10 Or, it is now . . .
That’s pretty small. The R2LogGDP is 0.553. This value corresponds to a VIF of 2.24,
which is higher but still not in a range people get too worried about. And just to
reiterate, this is not a problem to be corrected. Rather, we are simply noting that one
source of variance of the coefficient estimate on GDP is multicollinearity. Another
source is the sample size and another is the fit of the model (indicated by σ̂ , which
indicates that the fitted values are on average, roughly 11.5 units away from the
actual values).
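The VIF arithmetic from Table 5.4's diagnostics is easy to check directly. A quick sketch in Python (illustrative only; the auxiliary R2 values are the ones reported in the text):

```python
def vif(r2_aux):
    """Variance inflation factor implied by an auxiliary-regression R^2."""
    return 1.0 / (1.0 - r2_aux)

print(round(vif(0.153), 2))  # judicial independence: 1.18
print(round(vif(0.553), 2))  # log GDP per capita: 2.24
```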
Standardizing coefficients

standardize Standardizing a variable converts it to a measure of standard deviations from its mean.

A convenient trick is to standardize the variables. To do so, we convert variables to standard deviations from their means. That is, instead of having a variable that indicates a baseball player’s batting average, we have a variable that indicates how many standard deviations above or below the average batting average a player was. Instead of having a variable that indicates home runs, we have a variable that indicates how many standard deviations above or below the average number of home runs a player hit. The attraction of standardizing variables is that a one-unit increase in any standardized independent variable is a one standard deviation increase.
We often (but not always) standardize the dependent variable as well. If we
do so, the coefficient on a standardized independent variable can be interpreted as
“Controlling for the other variables in the model, a one standard deviation increase
in X is associated with a β̂1 standard deviation increase in the dependent variable.”
We standardize variables using the following equation:

VariableStandardized = (Variable − Mean(Variable)) / sd(Variable)    (5.12)

where Mean(Variable) is the mean of the variable for all units in the sample and sd(Variable) is the standard deviation of the variable.

[Table residue: Constant −2,869,439.40∗ (244,241.12) [t = 11.75]; N = 6,762; R2 = 0.30]

TABLE 5.7 Means and Standard Deviations of Baseball Variables for Three Players
(columns: Player ID; Salary, Batting average, and Home runs, each unstandardized and standardized; data rows not captured)
Table 5.6 reports the means and standard deviations of the variables for our
baseball salary example. Table 5.7 then uses these means and standard deviations
to report the unstandardized and standardized values of salary, batting average,
and home runs for three selected players. Player 1 earned $5.85 million. Given that
the standard deviation of salaries in the data set was $2,764,512, the standardized value of this player’s salary is (5,850,000 − 2,024,616)/2,764,512 = 1.38. In other words, player 1
earned 1.38 standard deviations more than the average salary. This player’s batting
average was 0.267, which is exactly the average. Hence, his standardized batting
average is zero. He hit 43 home runs, which is 2.99 standard deviations above the
mean number of home runs.
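Player 1's standardized salary can be reproduced directly from Equation 5.12. A short sketch in Python (not the book's Stata/R code; the mean and standard deviation are the figures reported in the text):

```python
def standardize(x, mean, sd):
    """Equation 5.12: convert a value to standard deviations from the mean."""
    return (x - mean) / sd

# Player 1's salary, with the sample mean and sd of salary from the text.
z_salary = standardize(5_850_000, 2_024_616, 2_764_512)
print(round(z_salary, 2))  # 1.38: player 1 earned 1.38 sd above the mean
```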
Table 5.8 displays standardized OLS results along with the unstandardized
results from Table 5.5. The results with the standardized dependent variable are on the right.

standardized coefficient The coefficient on an independent variable that has been standardized.

The standardized results allow us to reasonably compare the effects on salary of batting average and home runs. We see in Table 5.6 that a standard deviation of batting average is 0.031. The standardized coefficients tell us that an increase of one standard deviation in batting average is associated with an increase in salary of 0.14 standard deviations. So, for example, a player raising his batting average by 0.031, from 0.267 to 0.298, can expect an increase in salary of 0.14 × $2,764,512 = $387,032. A player who increases his home runs by one standard deviation (which Table 5.6 tells us is 10.31 home runs) can expect a 0.48 standard deviation increase in salary (which is 0.48 × $2,764,512 = $1,326,966).
In other words, home runs have a bigger bang for the buck. Eat your steroid-laced
Wheaties, kids.11
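The dollar figures above are just the standardized coefficients rescaled by the standard deviation of salary. A sketch of that conversion in Python (coefficients and the salary sd are the values reported in Tables 5.6 and 5.8):

```python
sd_salary = 2_764_512   # sd of salary, from Table 5.6
beta_avg_std = 0.14     # standardized coefficient on batting average
beta_hr_std = 0.48      # standardized coefficient on home runs

# One-sd improvement in each skill, converted back into dollars of salary.
print(round(beta_avg_std * sd_salary))  # 387032
print(round(beta_hr_std * sd_salary))   # 1326966
```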
While results from OLS models with standardized variables seem quite
different, all they really do is rescale the original results. The model fit is the
same whether standardized or unstandardized variables are used. Notice that
the R2 is identical. Also, the conclusions about statistical significance are the
same in the unstandardized and standardized regressions; we can see that by
comparing the t statistics in the unstandardized and standardized results. Think
of the standardization as something like international currency conversion. In
unstandardized form, the coefficients are reported in different currencies, but
in standardized form, the coefficients are reported in a common currency. The
11 That’s a joke! Wheaties are gross.
underlying real prices, however, are the same whether they are reported in dollars,
euros, or baht.
REMEMBER THIS
Standardized coefficients allow the effects of two independent variables to be compared.
1. When the independent variable, Xk , and dependent variable are standardized, an increase of one
standard deviation in Xk is associated with a β̂ k standard deviation increase in the dependent
variable.
2. Statistical significance and model fit are the same for unstandardized and standardized results.
that we need to take into account. In this section, we discuss F tests as a solution
to this challenge, explain two different types of commonly used hypotheses about
multiple coefficients, and then show how to use R2 results to implement these tests,
including an example for our baseball data.12
F tests
F test A type of hypothesis test commonly used to test hypotheses involving multiple coefficients.

There are several ways to test hypotheses involving multiple coefficients. We focus on an F test. This test shares features with hypothesis tests discussed earlier (page 97). When using an F test, we define null and alternative hypotheses, set a significance level, and compare a test statistic to a critical value. The F test is different in that we use a new test statistic and compare it to a critical value derived from an F distribution rather than a t distribution or a normal distribution. We provide more information on the F distribution in Appendix H (page 549).

F statistic The test statistic used in conducting an F test.

The new test statistic is an F statistic. It is based on R2 values from two separate OLS specifications. We’ll first discuss these OLS models and then describe the F statistic in more detail.

unrestricted model The model in an F test that imposes no restrictions on the coefficients.

The first specification is the unrestricted model, which is simply the full model. For example, if we have three independent variables, our full model might be

Yi = β0 + β1X1i + β2X2i + β3X3i + εi    (5.13)

The model is called unrestricted because we are imposing no restrictions on what the values of β̂1, β̂2, and β̂3 will be.

restricted model The model in an F test that imposes the restriction that the null hypothesis is true.

The second specification is the so-called restricted model in which we force the computer to give us results that comport with the null hypothesis. It’s called restricted because we are restricting the estimated values of β̂1, β̂2, and β̂3 to be consistent with the null
How do we do that? Sounds hard, but actually, it isn’t. We simply take the
relationship implied by the null hypothesis and impose it on the unrestricted
model. We can divide hypotheses involving multiple coefficients into two general
cases.
12 It is also possible to use t tests to compare multiple coefficients, but F tests are more widely used
for this purpose.
both the multicollinear variables equal zero, we can at least learn if one (or both)
of them is non-zero, even as we can’t say which one it is because the two are so
closely related.
In this case, imposing the null hypothesis means making sure that our
estimates of β1 and β2 are both zero. The process is actually easy-schmeasy: just
set the coefficients to zero and see that the resulting model is simply a model
without variables X1 and X2 . Specifically, the restricted model for H0 : β1 = β2 =
0 is

Yi = β0 + β3X3i + εi
The R2Unrestricted will always be higher because the model without restrictions can
generate a better model fit than the same model subject to some restrictions. This
conclusion is a little counterintuitive at first, but note that R2Unrestricted will be higher
than R2Restricted even when the null hypothesis is true. This is because in estimating
the unrestricted equation, the software not only has the option of estimating both
coefficients to be whatever the value is under the null (hence assuring the same
fit as in the restricted model), but also any other deviation, large or small, that
improves the fit.
The extent of difference between R2Unrestricted and R2Restricted depends on whether
the null hypothesis is or is not true. If we are testing H0 : β1 = β2 = 0 and β1 and
β2 really are zero, then restricting them to be zero won’t cause the R2Restricted to be
too far from R2Unrestricted because the optimal values of β̂1 and β̂ 2 really are around
zero. If the null is false and β1 and β2 are much different from zero, there will be a
huge difference between R2Unrestricted and R2Restricted because setting them to non-zero
values, as happens only in the unrestricted model, improves fit substantially.
Hence, the heart of an F test is the difference between R2Unrestricted and R2Restricted . When the difference is small, imposing the null doesn’t do too much
damage to the model fit. When the difference is large, imposing the null damages
the model fit a lot.
An F test is based on the F statistic:

F = ((R2Unrestricted − R2Restricted )/q) / ((1 − R2Unrestricted )/(N − k))    (5.14)
The q term refers to how many constraints are in the null hypothesis. That’s just
a fancy way of saying how many equal signs are in the null hypothesis. So for
H0 : β1 = β2 , the value of q is 1. For H0 : β1 = β2 = 0, the value of q is 2. The N − k
term is a degrees of freedom term, like what we saw with the t distribution. This
is the sample size minus the number of parameters estimated in the unrestricted
model. (For example, k for Equation 5.14 will be 4 because we estimate β̂0, β̂1, β̂2, and β̂3.) We need to know these terms because the shape of the F distribution
depends on the sample size and the number of constraints in the null, just as the t
distribution shifted based on the number of observations.
The F statistic has the difference of R2Unrestricted and R2Restricted in it and also
includes some other bits to ensure that the F statistic is distributed according to an
F distribution. The F distribution describes the relative probability of observing
different values of the F statistic under the null hypothesis. It allows us to know the
probability that the F statistic will be bigger than any given number when the null
is true. We can use this knowledge to identify critical values for our hypothesis
tests; we’ll describe how shortly.
How we approach the alternative hypotheses depends on the type of null
hypothesis. For case 1 null hypotheses (in which multiple coefficients are zero),
the alternative hypothesis is that at least one coefficient is not zero. In other words,
the null hypothesis is that they all are zero, and the alternative is the negation of
that, which is that one or more of the coefficients is not zero.
For case 2 null hypotheses (in which two or more coefficients are equal), it is
possible to have a directional alternative hypothesis that one coefficient is larger
than the other. The critical value remains the same, but we add a requirement that
the coefficients actually go in the direction of the specified alternative hypothesis.
For example, if we are testing H0 : β1 = β2 versus HA : β1 > β2 , we reject the null in
favor of the alternative hypothesis if the F statistic is bigger than the critical value
and β̂1 is actually bigger than β̂ 2 .
This all may sound complicated, but the process isn’t that hard, really. (And,
as we show in the Computing Corner at the end of the chapter, statistical software
makes it easy.) The crucial step is formulating a null hypothesis and using it to
create a restricted equation. This process is not very hard. If we’re dealing with
a case 1 null hypothesis (that multiple coefficients are zero), we simply drop
the variables listed in the null in the restricted equation. If we’re dealing with a
case 2 null hypothesis (that two or more coefficients are equal to each other), we
simply create a new variable that is the sum of the variables referenced in the
null hypothesis and use that new variable in the restricted equation instead of the
individual variables.
The R2Unrestricted is 0.2992. (It’s usually necessary to be more precise than the
0.30 reported in Table 5.8.)
For the restricted model, we simply drop the variables listed in the null
hypothesis, yielding
Salaryi = β0 + εi
The critical value (which we show how to identify in the Computing Corner,
pages 170 and 172) is 3.00. Since the F statistic is (way!) higher than the critical
value, we reject the null handily.
We can also easily test whether the standardized effect of home runs is bigger
than the standardized effect of batting average. The unrestricted equation is, as
before,
The critical value (which we show how to identify in the Computing Corner,
pages 170 and 172) is 3.84. Here, too, the F statistic is vastly higher than the
critical value, and we also reject the null hypothesis that β1 = β2 .
REMEMBER THIS
1. F tests are useful to test hypotheses involving multiple coefficients. To implement an F test for
the following model
(c) Estimate a restricted model by using the conditions in the null hypothesis to restrict the
full model.
• Case 1: When the null hypothesis is that multiple coefficients equal zero, we create a
restricted model by simply dropping the variables listed in the null hypothesis.
• Case 2: When the null hypothesis is that two or more coefficients are equal, we create
a restricted model by replacing the variables listed in the null hypothesis with a single
variable that is the sum of the listed variables.
(d) Use the R2 values from the unrestricted and restricted models to generate an F statistic
using Equation 5.14, and compare the F statistic to the critical value from the F
distribution.
2. The bigger the difference between R2Unrestricted and R2Restricted , the more the null hypothesis is
reducing fit and, therefore, the more likely we are to reject the null.
Table 5.9 presents results necessary to test this null. We use an F test that
requires R2 values from two specifications. The first column presents the unre-
stricted model; at the bottom is the R2Unrestricted , which is 0.06086. The second
column presents the restricted model; at the bottom is the R2Restricted , which is
0.05295. There are two restrictions in this null, meaning q = 2. The sample size
is 1,851, and the number of parameters in the unrestricted model is 5, meaning
N − k = 1,846.
Hence, for H0 : β1 = β2 = 0,
We have to use software (or tables) to find the critical value. We’ll discuss that
process in the Computing Corner (pages 170 and 171). For q = 2 and N − k = 1,846,
the critical value for α = 0.05 is 3.00. Because our F statistic as just calculated is bigger
than that, we can reject the null. In other words, the data is telling us that if the null
were true, we would be very unlikely to see such a big difference in fit between the
unrestricted and restricted models.13
13 The specific value of the F statistic provided by automated software F tests will differ from our
presentation because the automated software tests do not round to three digits, as we have done.
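The F statistic just computed follows mechanically from Equation 5.14. A sketch in Python (illustrative only; the R2 values, q, and N − k are the figures reported for Table 5.9):

```python
def f_stat(r2_unrestricted, r2_restricted, q, df):
    """Equation 5.14: F statistic from unrestricted and restricted R^2 values."""
    return ((r2_unrestricted - r2_restricted) / q) / ((1 - r2_unrestricted) / df)

# Case 1 null H0: beta1 = beta2 = 0, so q = 2; N - k = 1,846.
F = f_stat(0.06086, 0.05295, q=2, df=1846)
print(round(F, 2))  # 7.77, far above the critical value of 3.00
```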
Second, let’s test the following case 2 null, H0 : β1 = β2 . Again, the first column in
Table 5.9 presents the unrestricted model; at the bottom is the R2Unrestricted , which is
0.06086. However, the restricted model is different for this null. Following the logic
discussed on page 160, it is
The third column in Table 5.9 presents the results for this restricted model; at the
bottom is the R2Restricted , which is 0.0605. There is one restriction in this null, meaning
q = 1. The sample size is still 1,851, and the number of parameters in the unrestricted model is still 5, meaning N − k = 1,846.
Hence, for H0 : β1 = β2 ,
We again have to use software (or tables) to find the critical value. For q = 1 and N − k = 1,846, the critical value for α = 0.05 is 3.85. Because our F statistic as
calculated here is less than the critical value, we fail to reject the null that the two
coefficients are equal. The coefficients are quite different in the unrestricted model
(0.03 and 0.35), but notice that the standard errors are large enough to prevent us
from rejecting the null that either coefficient is zero. In other words, we have a lot of
uncertainty in our estimates. The F test formalizes this uncertainty by forcing OLS
to give us the same coefficient on both height variables, and when we do this, the
overall model fit is pretty close to the model fit achieved when the coefficients are
allowed to vary across the two variables. If the null is true, this result is what we
would expect because imposing the null would not lower R2 by very much. If the
null is false, then imposing the null probably would have caused a more substantial
reduction in R2Restricted .
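The same Equation 5.14 arithmetic applies to this case 2 null, now with q = 1 and the restricted R2 from the third column of Table 5.9 (again a Python sketch, using the values reported in the text):

```python
def f_stat(r2_unrestricted, r2_restricted, q, df):
    """Equation 5.14: F statistic from unrestricted and restricted R^2 values."""
    return ((r2_unrestricted - r2_restricted) / q) / ((1 - r2_unrestricted) / df)

# Case 2 null H0: beta1 = beta2, so q = 1; N - k = 1,846.
F = f_stat(0.06086, 0.0605, q=1, df=1846)
print(round(F, 2))  # 0.71, below the critical value of 3.85: fail to reject
```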
Conclusion
Multivariate OLS is a huge help in our fight against endogeneity because it allows
us to add variables to our models. Doing so cuts off at least part of the correlation
between an independent variable and the error term because the included variables
are no longer in the error term. For observational data, multivariate OLS is
very necessary, although we seldom can wholly defeat endogeneity simply by
including variables. For experimental data not suffering from attrition, balance,
or compliance problems, we can beat endogeneity without multivariate OLS.
However, multivariate OLS makes our estimates more precise.
• Section 5.1: Write down the multivariate regression equation and explain
all its elements (dependent variable, independent variables, coefficients,
intercept, and error term). Explain how adding a variable to a multivariate
OLS model can help fight endogeneity.
• Section 5.2: Explain omitted variable bias, including the two conditions
necessary for omitted variable bias to exist.
Yi = β0 + β1X1i + β2X2i + εi
• H0 : β1 = β2 = 0
• H0 : β1 = β2
Further Reading
King, Keohane, and Verba (1994) provide an intuitive and useful discussion of
omitted variable bias.
Goldberger (1991) has a terrific discussion of multicollinearity. His point
is that the real problem with multicollinear data is that the estimates will
be imprecise. We defeat imprecise data with more data; hence, the problem
of multicollinearity is not having enough data, a state of affairs Goldberger
tongue-in-cheekily calls “micronumerosity.”
Morgan and Winship (2014) provide an excellent framework for thinking
about various approaches to controlling for multiple variables. They spend a fair
bit of time discussing the strengths and weaknesses of multivariate OLS and
alternatives.
Statistical results can often be more effectively presented as figures instead of
tables. Kastellec and Leoni (2007) provide a nice overview of the advantages and
options for such an approach.
Achen (1982, 77) critiques standardized variables, in part because they
depend on the standard errors of independent variables in the sample.
Key Terms
Adjusted R2 (150) · Attenuation bias (145) · Auxiliary regression (138) · Ceteris paribus (131) · Control variable (134) · F statistic (159) · F test (159) · Irrelevant variable (150) · Measurement error (143) · Multicollinearity (148) · Multivariate OLS (127) · Omitted variable bias (138) · Perfect multicollinearity (149) · Restricted model (159) · Standardize (156) · Standardized coefficient (157) · Unrestricted model (159) · Variance inflation factor (148)
Computing Corner
Stata
• Calculating the R2j for each variable. For example, calculate the R21 via
reg X1 X2 X3
and calculate the R22 via
reg X2 X1 X3
• Stata also provides a vif command that estimates 1/(1 − R2j ) for each variable. This command needs to be run immediately after the main
model of interest. For example,
reg Y X1 X2 X3
vif
would provide the VIF for all variables from the main model. A VIF
of 5, for example, indicates that the variance is five times higher than
it would be if there were no multicollinearity.
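The Computing Corner uses Stata; as a language-neutral check of the same logic, here is a short Python sketch (the data and variable names are made up) that runs the auxiliary regression itself and computes one variable's VIF from its R²:

```python
import numpy as np

# Simulated data; variable names are hypothetical, not from the chapter's data sets.
rng = np.random.default_rng(0)
n = 500
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = 0.8 * x2 + rng.normal(size=n)  # x1 is deliberately correlated with x2

# Auxiliary regression: x1 on x2 and x3 (plus a constant)
Z = np.column_stack([np.ones(n), x2, x3])
coef, *_ = np.linalg.lstsq(Z, x1, rcond=None)
resid = x1 - Z @ coef
r2_1 = 1 - resid.var() / x1.var()  # R-squared of the auxiliary regression

vif_1 = 1 / (1 - r2_1)  # variance inflation factor for x1
```

With the simulated correlation above, the VIF comes out modestly above 1, matching the intuition that mild multicollinearity inflates variance only mildly.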
The hard way isn’t very hard. Use Stata’s egen command to create
standardized versions of every variable in the model:
egen BattingAverage_std = std(BattingAverage)
egen Homeruns_std = std(Homeruns)
egen Salary_std = std(Salary)
Then run a regression with these standardized variables:
reg Salary_std BattingAverage_std Homeruns_std
The standardized coefficients are listed, as usual, under “Coef.” Notice that
they are identical to the results from using the , beta command.
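The same identity can be verified outside Stata. This Python sketch (with simulated data loosely echoing the baseball example; the numbers are made up) standardizes both variables by hand and confirms that the standardized slope equals the raw slope times sd(x)/sd(y):

```python
import numpy as np

# Simulated data; variable names echo the chapter's baseball example but are invented.
rng = np.random.default_rng(1)
n = 400
batting = rng.normal(0.26, 0.03, size=n)
salary = 2.0 + 50.0 * batting + rng.normal(size=n)

def standardize(v):
    """Subtract the mean and divide by the standard deviation."""
    return (v - v.mean()) / v.std()

def slope(x, y):
    """Bivariate OLS slope: cov(x, y) / var(x)."""
    return np.cov(x, y)[0, 1] / x.var(ddof=1)

b_raw = slope(batting, salary)
b_std = slope(standardize(batting), standardize(salary))

# The standardized coefficient equals the raw slope times sd(x)/sd(y)
same = np.isclose(b_std, b_raw * batting.std() / salary.std())
```

In the bivariate case the standardized coefficient is just the correlation between the two variables, so it always lies between −1 and 1.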
4. Stata has a very convenient way to conduct F tests for hypotheses involving
multiple coefficients. Simply estimate the unrestricted model, then type
test, and then key in the coefficients involved and restriction implied by
the null. For example, to test the null hypothesis that the coefficients on
Height81 and Height85 are both equal to zero, type the following:
reg Wage Height81 Height85 Clubs Athletics
test Height81 = Height85 = 0
To test the null hypothesis that the coefficients on Height81 and Height85
are equal to each other, type the following:
reg Wage Height81 Height85 Clubs Athletics
test Height81 = Height85
Rounding will cause this code to produce F statistics slightly different
from those on page 165.
7. To find the p value from an F distribution for a given F statistic, use disp
Ftail(df1, df2, F), where df1 and df2 are the degrees of freedom
and F is the F statistic. For example, to calculate the p value for the F
statistic on page 165 for H0 : β1 = β2 = 0, type disp Ftail(2, 1846, 7.77).
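Behind Stata's test command is the standard restricted-versus-unrestricted F statistic. As a language-neutral illustration (simulated data, made-up coefficients), this Python sketch computes F = [(SSR_restricted − SSR_unrestricted)/q] / [SSR_unrestricted/(n − k − 1)] directly:

```python
import numpy as np

def ssr(X, y):
    """Sum of squared residuals from an OLS fit of y on X plus a constant."""
    Z = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return float(((y - Z @ coef) ** 2).sum())

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 + 0.5 * x2 + rng.normal(size=n)

ssr_unrestricted = ssr(np.column_stack([x1, x2]), y)  # both regressors included
ssr_restricted = ssr(np.empty((n, 0)), y)             # under H0: beta1 = beta2 = 0
q = 2  # number of restrictions implied by the null
k = 2  # regressors in the unrestricted model
F = ((ssr_restricted - ssr_unrestricted) / q) / (ssr_unrestricted / (n - k - 1))
```

Because the restricted model can never fit better than the unrestricted one, the numerator is non-negative; a large F means the restrictions cost a lot of fit.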
R
2. To assess multicollinearity, calculate the R²j for each variable. For example,
calculate the R²1 via
AuxReg1 = lm(X1 ~ X2 + X3)
and calculate the R²2 via
AuxReg2 = lm(X2 ~ X1 + X3)
Display the R² from an auxiliary regression via, for example,
summary(AuxReg2)$r.squared
To see the degrees of freedom for the unrestricted model, type
summary(Unrestricted1)$df[2]
We’ll have to keep track of q on our own.
Exercises
1. Table 5.10 describes variables from heightwage.dta we will use in this
problem. We have seen this data in Chapter 3 (page 74) and in Chapter 4
(page 123).
(a) Estimate two OLS regression models: one in which adult wages is
regressed on adult height for all respondents, and another in which
adult wages is regressed on adult height and adolescent height for
all respondents. Discuss differences across the two models. Explain
why the coefficient on adult height changed.
TABLE 5.10 Variables for Height and Wages Data in the United States
Variable name Description
Run the plot once without a jitter subcommand and once with it, and
choose the more informative of the two plots.14
(c) Notice that IQ is omitted from the model. Is this a problem? Why or
why not?
(d) Notice that eye color is omitted from the model. Is this a problem?
Why or why not?
(e) You’re the boss! Use the data in this file to estimate a model that
you think sheds light on an interesting relationship. The specification
decisions include whether to limit the sample and what variables to
include. Report only a single additional specification. Describe in no
more than two paragraphs why this is an interesting way to assess the
data.
14 In Stata, add jittering to a scatter plot via scatter X1 X2, jitter(3). In R, add jittering to a plot via plot(jitter(X1), jitter(X2)). Note that in the auxiliary regression, it’s useful to limit the sample to observations for which wage96 is not missing to ensure that the R² from the auxiliary regression is based on the same number of observations as the original regression. In Stata, add if wage96 != . to the end of a regression statement, where the exclamation mark means “not” and the period is how Stata marks missing values. In R, we could limit the sample via data = data[is.na(data$wage96) == 0, ] in the regression command, where the is.na function returns TRUE (treated as 1) for missing observations and FALSE (treated as 0) for non-missing observations.
(b) Suppose someone argues that we need to take into account the
growth of the U.S. population between 1970 and 2000. This
particular data set does not have a population variable, but it does
have a variable called Season, which indicates what season the data
is from (e.g., Season equals 1969 for observations from 1969 and
Season equals 1981 for observations from 1981, etc.). What are the
conditions that need to be true for omission of the season variable to
bias other coefficients? Do you think they hold in this case?
(e) Which matters more for attendance: winning or runs scored? [To
keep us on the same page, use home_attend as the dependent variable
and control for wins, runs_scored, runs_allowed, and season.]
3. Do cell phones distract drivers and cause accidents? Worried that this
is happening, many states recently have passed legislation to reduce
distracted driving. Fourteen states now have laws making handheld cell
phone use while driving illegal, and 44 states have banned texting while
driving. This problem looks more closely at the relationship between cell
phones and traffic fatalities. Table 5.11 describes the variables in the data
set Cellphone_2012_homework.dta.
(a) While we don’t know how many people are using their phones while
driving, we can find the number of cell phone subscriptions in a
state (in thousands). Estimate a bivariate model with traffic deaths
as the dependent variable and number of cell phone subscriptions as
the independent variable. Briefly discuss the results. Do you suspect
endogeneity? If so, why?
(b) Add population to the model. What happens to the coefficient on cell
phone subscriptions? Why?
TABLE 5.11 Variables for Cell Phones and Traffic Deaths Data
Variable name Description
year Year
(c) Add total miles driven to the model. What happens to the coefficient
on cell phone subscriptions? Why?
(d) Based on the model in part (c), calculate the variance inflation factor
for population and total miles driven. Why are they different? Discuss
implications of this level of multicollinearity for the coefficient
estimates and the precision of the coefficient estimates.
4. What determines how much drivers are fined if they are stopped for
speeding? Do demographics like age, gender, and race matter? To answer
this question, we’ll investigate traffic stops and citations in Massachusetts
using data from Makowsky and Stratmann (2009). Even though state law
sets a formula for tickets based on how fast a person was driving, police
officers in practice often deviate from the formula. Table 5.12 describes
data in speeding_tickets_text.dta that includes information on all traffic
stops. An amount for the fine is given only for observations in which the
police officer decided to assess a fine.
(b) Estimate the model from part (a), also controlling for miles per hour
over the speed limit. Explain what happens to the coefficient on age
and why.
(c) Suppose we had only the first thousand observations in the data set.
Estimate the model from part (b), and report on what happens to the
standard errors and t statistics when we have fewer observations.15
(b) Let’s keep going. Add height at age 7 to the above model, and discuss
the results. Be sure to note changes in sample size (and its possible
15 In Stata, use if _n < 1001 at the end of the regression command to limit the sample to the first thousand observations. In R, create and use a new data set with the first 1,000 observations (e.g., dataSmall = data[1:1000,]). Because the ticket amount is missing for drivers who were not fined, the sample size of the regression model will be smaller than 1,000.
16 For the reasons discussed in the exercise in Chapter 3 on page 89, we limit the data set to observations with height greater than 40 inches and self-reported income less than 400 British pounds per hour. We also exclude observations of individuals who grew shorter from age 16 to age 33. Excluding these observations doesn’t substantially affect the results we see here, but since it’s reasonable to believe there is some kind of non-trivial measurement error for these cases, we exclude them from the analysis for this question.
(c) Is there multicollinearity in the model from part (b)? If so, qualify the
degree of multicollinearity, and indicate its consequences. Specify
whether the multicollinearity will bias coefficients or have some
other effect.
(d) Perhaps characteristics of parents affect height (some force kids to eat
veggies, while others give them only french fries and Fanta). Add the
two parental education variables to the model, and discuss the results.
Include only height at age 16 (meaning we do not include the height
at ages 33 and 7 for this question—although feel free to include them
on your own; the results are interesting).
(e) Perhaps kids had their food stolen by greedy siblings. Add the
number of siblings to the model, and discuss the results.
6. Use globaled.dta, the data set on education and growth from Hanushek and
Woessmann (2012) for this question. The variables are given in Table 5.14.
region Region
open Openness of the economy scale
proprts Security of property rights scale
(a) Use standardized variables to assess whether the effect of test scores
on economic growth is larger than the effect of years in school. At
this point, simply compare the different effects in a meaningful way.
We’ll do statistical tests next. The dependent variable is average
annual GDP growth per year. For all parts of this exercise, control
for average test scores, average years of schooling between 1960 and
2000, and GDP per capita in 1960.
(c) Now add controls for openness of economy and security of property
rights. Which matters more: test scores or property rights? Use
appropriate statistical evidence in your answer.
6 Dummy Variables: Smarter than You Think
FIGURE 6.1: Goal Differentials for Home and Away Games for Manchester City and Manchester United [two panels, (a) Manchester City and (b) Manchester United, plotting goal differential (−2 to 5) for away (0) and home (1) games]
In other words, β̂0 is the predicted value of Y for individuals in the control group.
It is not surprising that the value of β̂0 that best fits the data is simply the average
of Yi for individuals in the control group.1
The fitted value for the treatment group (for whom Treatmenti = 1) is
Ŷi = β̂0 + β̂1 × 1 = β̂0 + β̂1
In other words, β̂0 + β̂1 is the predicted value of Y for individuals in the treatment
group. The best predictor of this value is simply the average of Y for individuals
1 The proof is a bit laborious. We show it in the Citations and Additional Notes section on page 557.
in the treatment group. Because β̂0 is the average of individuals in the control
group, β̂1 is the difference in averages between the treatment and control groups.
If β̂1 > 0, then the average Y for those in the treatment group is higher than for
those in the control group. If β̂1 < 0, then the average Y for those in the treatment
group is lower than for those in the control group. If β̂1 = 0, then the average Y
for those in the treatment group is no different from the average Y for those in the
control group.
In other words, our slope coefficient ( β̂1 ) is, in the case of a bivariate OLS
model with a dummy independent variable, a measure of the difference in means
across the two groups. The standard error on this coefficient tells us how much
uncertainty we have and determines the confidence interval for our estimate of β̂1 .
Figure 6.2 graphically displays the difference of means test in bivariate OLS
with a scatterplot of data. It looks a bit different from our previous scatterplots
(e.g., Figure 3.1 on page 46) because here the independent variable takes on only
two values: 0 or 1. Hence, the observations are stacked at 0 and 1.
FIGURE 6.2: [scatterplot with the treatment variable on the horizontal axis (0 = control group, 1 = treatment group) and the dependent variable on the vertical axis; β̂0 marks the average for the control group, β̂0 + β̂1 the average for the treatment group, and β̂1 (the slope) is the difference between them]
In our example,
the values of Y when X = 0 are generally lower than the values of Y when X = 1.
The parameter β̂0 corresponds to the average of Y for all observations for which
X = 0. The average for the treatment group (for whom X = 1) is β̂0 + β̂1 . The
difference in averages across the groups is β̂1 . A key point is that the standard
interpretation of coefficients in bivariate OLS still applies: a one-unit change in X
(e.g., going from X = 0 to X = 1) is associated with a β̂1 change in Y.
This is excellent news. Whenever our independent variable is a dummy
variable—as it typically is for experiments and often is for observational data—we
can simply run bivariate OLS and the β̂1 coefficient tells us the difference of
means. The standard error on this coefficient tells us how precisely we have
measured this difference and allows us to conduct a hypothesis test and determine
a confidence interval.
OLS produces difference of means tests for observational data as well. The
model and interpretation are the same; the difference is how much we worry
about whether the exogeneity assumption is satisfied. Typically, exogeneity will
be seriously in doubt for observational data. And sometimes OLS can be useful
in estimating the difference of means as a descriptive statistic without a causal
interpretation.
Difference of means tests can be conducted without using OLS. Doing so
is totally fine, of course; in fact, OLS and non-OLS difference of means tests
assuming the same variances across groups produce identical estimates and
standard errors. The advantage of the OLS approach is that we can use it within a
framework that also does all the other things OLS does, such as adding multiple
variables to the model.
2 A standard OLS regression model produces a standard error and a t statistic that are equivalent to the standard error and t statistic produced by a difference of means test in which variance is assumed to be the same across both groups. An OLS model with heteroscedasticity-consistent standard errors (as discussed in Section 3.6) produces a standard error and t statistic that are equivalent to a difference of means test in which variance differs across groups. The Computing Corner at the end of the chapter shows how to estimate these models.
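The equivalence between the bivariate OLS slope and the difference of group means can be verified numerically. This Python sketch (simulated data, made-up coefficients) computes OLS by hand and checks both identities:

```python
import numpy as np

# Simulated data: a 0/1 treatment dummy and an outcome.
rng = np.random.default_rng(3)
n = 300
treat = rng.integers(0, 2, size=n).astype(float)
y = 5.0 + 2.0 * treat + rng.normal(size=n)

# Bivariate OLS by hand: slope = cov(x, y) / var(x), intercept from the means
beta1 = np.cov(treat, y)[0, 1] / treat.var(ddof=1)
beta0 = y.mean() - beta1 * treat.mean()

diff_of_means = y[treat == 1].mean() - y[treat == 0].mean()

slope_matches = np.isclose(beta1, diff_of_means)             # beta1-hat = difference of means
intercept_matches = np.isclose(beta0, y[treat == 0].mean())  # beta0-hat = control-group mean
```

The match is exact (up to floating point), not approximate: with a dummy regressor, the OLS algebra reduces to group means.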
Difference of means tests convey the same essential information when the
coding of the dummy variable is flipped. The column on the right in Table 6.1
shows results from a model in which NotRepublican was the independent
variable. This variable is the opposite of the Republican variable, equaling 1
for non-Republicans and 0 for Republicans. The numerical results are different,
but they nonetheless contain the same information. The constant is the mean
evaluation of Trump by Republicans. In the first specification, this mean is
β̂0 + β̂1 = 14.95 + 36.06 = 51.01. In the second specification it is simply β̂0
because this is the mean value for the reference category. In the first specification,
the coefficient on Republican is 36.06, indicating that Republicans evaluated
Trump 36.06 points higher than non-Republicans. In the second specification the
coefficient on NotRepublican is negative, −36.06, indicating that non-Republicans
evaluated Trump 36.06 points lower than Republicans.
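The effect of flipping the coding can be checked directly. In this Python sketch the data is simulated (the means are patterned on Table 6.1's numbers purely for illustration), and the flipped dummy negates the slope while the constant becomes the other group's mean:

```python
import numpy as np

# Simulated data loosely patterned on the Republican/NotRepublican example.
rng = np.random.default_rng(6)
n = 500
republican = rng.integers(0, 2, size=n).astype(float)
therm = 14.95 + 36.06 * republican + rng.normal(scale=5.0, size=n)

def ols(x, y):
    """Bivariate OLS intercept and slope."""
    b1 = np.cov(x, y)[0, 1] / x.var(ddof=1)
    return y.mean() - b1 * x.mean(), b1

b0_rep, b1_rep = ols(republican, therm)
b0_not, b1_not = ols(1.0 - republican, therm)  # flipped coding: NotRepublican

sign_flips = np.isclose(b1_not, -b1_rep)              # slope changes sign only
constant_shifts = np.isclose(b0_not, b0_rep + b1_rep)  # new constant = other group's mean
```

Both identities hold exactly, which is why the two specifications in Table 6.1 carry the same information.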
Figure 6.3 scatterplots the data and highlights the estimated differences in means between non-Republicans and Republicans. Dummy variables can be a bit tricky to plot because the values of the independent variable are only zero or one, causing the data to overlap such that we can’t tell whether a given dot in the scatterplot indicates 2 or 200 observations. A trick of the trade is to jitter each observation by adding a small, random number to the independent and dependent variables. The jittered data gives the cloudlike images in the figure that help us get a decent sense of the data. We jitter only the data that is plotted; we do not jitter the data when running the statistical analysis. The Computing Corner at the end of this chapter shows how to jitter data for plots.3

jitter A process used in scatterplotting data. A small, random number is added to each observation for purposes of plotting only. This procedure produces cloudlike images, which overlap less than the unjittered data, hence providing a better sense of the data.

3 We discussed jittering data earlier, on page 74.
6.1 Using Bivariate OLS to Assess Difference of Means 185
FIGURE 6.3: [jittered scatterplot with partisan identification on the horizontal axis (0 = non-Republicans, 1 = Republicans) and the feeling thermometer toward Trump (0 to 100) on the vertical axis; β̂0 marks the average for non-Republicans, β̂0 + β̂1 the average for Republicans, and β̂1 is the slope]
Non-Republicans’ feelings toward Trump clearly run lower: that group shows
many more observations at the low end of the feeling thermometer scale. The
non-Republicans’ average feeling thermometer rating is 14.95. Feelings toward
Trump among Republicans are higher, with an average of 51.01. When interpreted
correctly, both the specifications in Table 6.1 tell this same story.
REMEMBER THIS
A difference of means test assesses whether the average value of the dependent variable differs
between two groups.
1. We often are interested in the difference of means between treatment and control groups,
between women and men, or between other groupings.
2. Difference of means tests can be implemented in bivariate OLS by using a dummy independent
variable:
Yi = β0 + β1 Treatmenti + εi
(a) The estimate of the mean for the control group is β̂0 .
(b) The estimate of the mean for the treatment group is β̂0 + β̂1 .
(c) The estimate for differences in means between groups is β̂1 .
Review Questions
1. Approximately what are the averages of Y for the treatment and control groups in each panel
of Figure 6.4? Approximately what is the estimated difference of means in each panel?
2. Approximately what are the values of β̂0 and β̂1 in each panel of Figure 6.4?
FIGURE 6.4: [two panels of jittered data with the treatment variable (0 or 1) on the horizontal axis and the dependent variable on the vertical axis]
Heighti = β0 + β1 Malei + εi
4 Sometimes people will name a variable like this “gender.” That’s annoying! Readers will then have to dig through the paper to figure out whether 1 indicates males or females.
[Figure: scatterplot of height in inches (50 to 80) by gender, with women at 0 and men at 1]
Constant 64.23∗ (0.04) [t = 1,633.6]
Male 5.79∗ (0.06) [t = 103.4]
N 10,863
[Figure: average height in inches by gender, with β̂0 marking the average height for women and β̂0 + β̂1 the average height for men]
Now the estimated coefficient β̂0 will tell us the average height for men
(the group for which Female = 0). The estimated coefficients β̂0 + β̂1 will tell us
the average height for women, and the difference between the two groups is
estimated as β̂1 .
The results with the female dummy variable are in the right-hand column of
Table 6.3. The numbers should look familiar because we are learning the same
information from the data. It is just that the accounting is a bit different. What is the
estimate of the average height for men? It is β̂0 in the right-hand column, which is
70.02. Sound familiar? That was the number we got from our initial results (reported
again in the left-hand column of Table 6.3); in that case, we had to add β̂0 + β̂1
because when the dummy variable indicated men, we needed both coefficients to
get the average height for men. What is the difference between males and females
estimated in the right-hand column? It is –5.79, which is the same as before, only
negative. The underlying fact is that women are estimated to be 5.79 inches shorter
on average. If we have coded our dummy variable as Female = 1, then going from
TABLE 6.3
              (Male dummy)      (Female dummy)
Male          5.79∗
              (0.06)
              [t = 103.4]
Female                          −5.79∗
                                (0.06)
                                [t = 103.4]
Constant      64.23∗            70.02∗
              (0.04)            (0.04)
              [t = 1,633.6]     [t = 1,755.9]
N             10,863            10,863
where Opponent qualityi measures the opponent’s overall goal differential in all
other games. The β̂1 estimate will tell us, controlling for opponent quality, whether
the goal differential was higher for Manchester City for home games. The results
are in Table 6.4.
The generic form of such a model is
Yi = β0 + β1 Dummyi + β2 Xi + εi (6.3)
It is useful to think graphically about the fitted lines from this kind of
model. Figure 6.7 shows the data for Manchester City’s results in 2012–2013.
6.2 Dummy Independent Variables in Multivariate OLS 191
The observations for home games (for which the Home dummy variable is 1) are
dots; the observations for away games (for which the Home dummy variable is 0)
are squares.
As discussed on page 181, the intercept for the Homei = 0 observations (the
away games) will be β̂0 , and the intercept for the Homei = 1 observations (the
home games) will be β̂0 + β̂1 , which equals the intercept for away games plus
the bump (up or down) for home games. Note that the coefficient indicating
the difference of means is the coefficient on the dummy variable. (Note also
that the β we should look at depends on how we write the model. For this
model, β1 indicates the difference of means controlling for the other variable,
but it would be β2 if we wrote the model to have β2 multiplied by the dummy
variable.)
The innovation is that our difference of means test here also controls for
another variable—in this case, opponent quality. Here the effect of a one-unit
increase in opponent quality is β̂ 2 ; this effect is the same for the Homei = 1
and Homei = 0 groups. Hence, the fitted lines are parallel, one for each group
separated by β̂1 , the differential bump associated with being in the Homei = 1
group. In Figure 6.7, β̂1 is greater than zero, but it could be less than zero (in
which case the dashed line for β̂0 + β̂1 for the Homei = 1 group would be below
the β̂0 line) or equal to zero (in which case the two dashed lines would overlap
exactly).
We can add independent variables to our heart’s content, allowing us to
assess the difference of means between the Homei = 1 and Homei = 0 groups
in a manner that controls for the additional variables. Such models are incredibly
common.
FIGURE 6.7: Fitted Values for Model with Dummy Variable and Control Variable: Manchester City Example [goal differential plotted against opponent quality; two parallel fitted lines with slope β̂2: the line for home games (Home = 1) with intercept β̂0 + β̂1 and the line for away games (Home = 0) with intercept β̂0]
REMEMBER THIS
1. Including a dummy variable in a multivariate regression allows us to conduct a difference of
means test while controlling for other factors with a model such as
Yi = β0 + β1 Dummyi + β2 Xi + εi
2. The fitted values from this model will be two parallel lines, each with a slope of β̂ 2 and separated
by β̂1 for all values of X.
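The parallel-lines property can be confirmed numerically. In this Python sketch (simulated data; the opponent-quality scale and coefficients are invented), the fitted line for each group shares the slope β̂2 and the gap between the lines equals β̂1 at every value of X:

```python
import numpy as np

# Simulated data; opponent-quality values and coefficients are made up.
rng = np.random.default_rng(5)
n = 38
home = rng.integers(0, 2, size=n).astype(float)
quality = rng.normal(size=n)
goal_diff = 0.5 + 1.0 * home - 0.8 * quality + rng.normal(size=n)

X = np.column_stack([np.ones(n), home, quality])
b0, b1, b2 = np.linalg.lstsq(X, goal_diff, rcond=None)[0]

def fitted_home(q):
    """Fitted line for the Home = 1 group: intercept b0 + b1, slope b2."""
    return (b0 + b1) + b2 * q

def fitted_away(q):
    """Fitted line for the Home = 0 group: intercept b0, slope b2."""
    return b0 + b2 * q

gap_is_b1 = np.isclose(fitted_home(0.7) - fitted_away(0.7), b1)
gap_same = np.isclose(fitted_home(-2.0) - fitted_away(-2.0),
                      fitted_home(3.0) - fitted_away(3.0))
```

Because the model contains no interaction term, the vertical gap between the two fitted lines is constant; Chapter 7 of the book relaxes exactly this restriction.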
6.3 Transforming Categorical Variables to Multiple Dummy Variables 193
Discussion Questions
Come up with an example of an interesting relationship involving a dummy independent variable and
one other independent variable that you would like to test.
5 It is possible to treat ordinal independent variables in the same way as categorical variables in the manner we describe here. Or, it is common to simply include ordinal independent variables directly in a regression model and interpret a one-unit increase as movement from one category to another.
(Here Wagei is the wages of person i and Regioni is the region person i lives in, as
just defined.)
No, no, and no. Though the categorical variable may be coded numerically, it
has no inherent order, which means the units are not meaningful. The Midwest
is not “1” more than the Northeast; the South is not “1” more than the
Midwest.
So what do we do with categorical variables? Dummy variables save the
day. We simply convert categorical variables into a series of dummy variables,
a different one for each category. If region is the categorical variable, we simply
create a Northeast dummy variable (1 for people from the Northeast, 0 otherwise),
a Midwest dummy variable (1 for people from the Midwest, 0 otherwise),
and so on.
The catch is that we cannot include dummy variables for every category—if we did, we would have perfect multicollinearity (as we discussed on page 149). Hence, we exclude one of the dummy variables and treat that category as the reference category (also called the excluded category), which means that coefficients on the included dummy variables indicate the difference between the category designated by the dummy variable and the reference category.

reference category When a model includes dummy variables indicating the multiple categories of a categorical variable, we need to exclude a dummy variable for one of the groups, which we refer to as the reference category. Also referred to as the excluded category.

We’ve already been doing something like this with dichotomous dummy variables. When we used the male dummy variable in our height and wages example on page 187, we did not include a female dummy variable, meaning that females were the reference category and the coefficient on the male dummy variable indicated how much taller men were. When we used the female dummy variable, men were the reference category and the coefficient on the female dummy variable indicated how much shorter females were on average.
TABLE 6.5 Using Different Reference Categories for Women’s Wages and Region
(a) West as reference . (b) South as reference . (c) Midwest as reference . (d) Northeast as reference
The results for this regression are in column (a) of Table 6.5. The β̂0 result
(indicated in the “Constant” line in the table) tells us that the average wage per
hour for women in the West (the reference category) was $12.50. Women in the
Northeast are estimated to receive $2.02 more per hour than those in the West, or
$14.52 per hour. Women in the Midwest earn $1.59 less than women in the West,
which works out to $10.91 per hour. And women in the South receive $2.13 less
than women in the West, or $10.37 per hour.
Column (b) of Table 6.5 shows the results from the same data, but with South
as the reference category instead of West. The β̂0 result tells us that the average
wage per hour for women in the South (the reference category) was $10.37.
Women in the Northeast get $14.52 per hour, which is $4.15 per hour more than
women in the South. Women in the Midwest receive $0.54 per hour more than
women in the South (which works out to $10.91 per hour), and women in the West
get $2.13 per hour more than women in the South (which works out to $12.50 per
hour). The key pattern is that the estimated amount that women in each region get is
the same in columns (a) and (b). Columns (c) and (d) have Midwest and Northeast,
respectively, as the reference categories, and with calculations like those we just
did, we can see that the estimated average wages for each region are the same in
all specifications.
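This invariance is easy to demonstrate numerically. In the Python sketch below, the data is simulated (the regional means are borrowed from the text's dollar figures purely to generate illustrative data), and the fitted average wage for each region comes out the same no matter which category is excluded:

```python
import numpy as np

# Simulated wages by region; the group means ($12.50 West, $10.37 South,
# $10.91 Midwest, $14.52 Northeast) are borrowed from the text's example
# only to generate illustrative data.
rng = np.random.default_rng(4)
n = 400
means = {"Northeast": 14.52, "Midwest": 10.91, "South": 10.37, "West": 12.50}
region = rng.choice(list(means), size=n)
wage = np.array([means[r] for r in region]) + rng.normal(size=n)

def fitted_wages(reference):
    """OLS of wage on region dummies with `reference` excluded; returns the
    fitted average wage for every region."""
    included = [r for r in means if r != reference]
    X = np.column_stack([np.ones(n)] +
                        [(region == r).astype(float) for r in included])
    coef, *_ = np.linalg.lstsq(X, wage, rcond=None)
    fitted = {reference: coef[0]}
    fitted.update({r: coef[0] + b for r, b in zip(included, coef[1:])})
    return fitted

west_ref = fitted_wages("West")
south_ref = fitted_wages("South")
same_fitted = all(np.isclose(west_ref[r], south_ref[r]) for r in means)
```

The individual coefficients differ across the two runs, but once each is added to its constant, the implied average for every region is identical.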
REMEMBER THIS
To use dummy variables to control for categorical variables, we include dummy variables for every
category except one.
1. Coefficients on the included dummy variables indicate how much higher or lower each group
is than the reference category.
2. Coefficients differ depending on which reference category is used, but when interpreted
appropriately, the fitted values for each category do not change across specifications.
Review Questions
1. Suppose we wanted to conduct a cross-national study of opinion in North America and have
a variable named “Country” that is coded 1 for respondents from the United States, 2 for
respondents from Mexico, and 3 for respondents from Canada. Write a model, and explain
how to interpret the coefficients.
2. For the results in Table 6.6 on page 197, indicate what the coefficients are in boxes (a)
through (j).
Hence, at least for earlier times, universal male suffrage was a policy that broadened
the electorate from a narrow slice of property holders to a larger group of
non-property holders—and, thus, less wealthy citizens.
To assess if universal male suffrage led to increases in inheritance taxes, we can
begin with the following model:
Inheritance taxi = β0 + β1 Universal male suffragei + εi
The data is measured every five years. The dependent variable is the top
inheritance tax rate, and the independent variable is a dummy variable for whether
all men were eligible to vote in at least half of the previous five years.6
Table 6.7 shows initial results that corroborate our suspicion. The coefficient on
our universal male suffrage dummy variable β̂1 is 19.33, with a t statistic of 10.66,
indicating strong statistical significance. The results mean that countries without
universal male suffrage had an average inheritance tax of 4.75 (β̂0 ) percent and
that countries with universal male suffrage had an average inheritance tax of 24.08
(β̂0 + β̂1 ) percent.
These results are from a bivariate OLS analysis of observational data. It is likely
that unmeasured factors lurking in the error term are correlated with the universal
suffrage dummy variable, which would induce endogeneity.
One possible source of endogeneity could be that major advances in universal
male suffrage happened at the same time inheritance taxes were rising throughout
the world, whatever the state of voting was. Universal male suffrage wasn’t really
a thing until around 1900 but then took off quickly, and by 1921, a majority of the
6 Measuring these things can get tricky; see the original paper for details. Most countries had an ignominious history of denying women the right to vote until the late nineteenth or early twentieth century (New Zealand was one of the first to extend the right to vote to women, in 1893) and of denying or restricting voting by minorities until even later. Scheve and Stasavage used additional statistical tools we will cover later, including fixed effects (introduced in Chapter 8) and lagged dependent variables (explained in Chapter 13).
FIGURE 6.8: Relation between Omitted Variable (Year) and Other Variables [panel (a): inheritance tax (%) plotted against year (1850 to 2000) with an OLS fitted line; panel (b): universal male suffrage plotted against year]
countries had universal male suffrage (at least in theory). In other words, it seems
quite possible that something in the error term (a time trend) is correlated both with
inheritance taxes and with universal suffrage. So what appears to be a relationship
between suffrage and taxes may be due to the fact that suffrage increased at
a time when inheritance taxes were going up rather than to a causal effect of
suffrage.
Figure 6.8 presents evidence consistent with these suspicions. Panel (a) shows
the relationship between year and the inheritance tax. The line is the fitted line
from a bivariate OLS regression model in which inheritance tax was the dependent
variable and year was the independent variable. Clearly, the inheritance tax was
higher as time went on.
Panel (b) of Figure 6.8 shows the relationship between year and universal male
suffrage. The data is jittered for ease of viewing, and the line is from a bivariate
model. Obviously, this is not a causal model; it instead shows that the mean value
for the year variable was much higher when universal male suffrage equaled 1
than when universal male suffrage equaled 0. Taken together with panel (a), we
have evidence that the two conditions for omitted variable bias are satisfied: the
year variable is associated with the dependent variable and with the independent
variable.
What to do next is simple enough—include a year variable with the following model:
Inheritance taxi = β0 + β1 Universal male suffragei + β2 Yeari + εi
where Year equals the value of the year of the observation. This model allows us to
assess whether a difference exists between countries with universal male suffrage
and countries without universal male suffrage even after we control for a year trend
that may have affected all countries.
Table 6.8 shows the results. The bivariate column is the same as in Table 6.7.
The multivariate (a) column adds the year variable. Whoa! Huge difference. Now
the coefficient on universal male suffrage is –0.38, with a tiny t statistic. In terms
of difference of means testing, we can now say that controlling for a year trend,
the average inheritance tax in countries with universal male suffrage was not
statistically different from that in countries without universal male suffrage.
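The omitted variable logic here can be sketched with a small simulation. The Python snippet below is illustrative only: the data are randomly generated, not Scheve and Stasavage's. We build a year trend that drives both suffrage and the inheritance tax (with, by construction, no true suffrage effect), then show that the bivariate suffrage coefficient is spuriously large while adding year as a control pushes it toward zero. The ols() helper solves the normal equations directly:

```python
import random

def ols(X, y):
    """OLS coefficients via the normal equations (X'X) b = X'y, solved by Gauss-Jordan."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    a = [xtx[i] + [xty[i]] for i in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * p for x, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

random.seed(42)
n = 500
year = [random.uniform(1850, 2000) for _ in range(n)]
# Suffrage is more likely in later years; the tax follows the time trend but,
# by construction, suffrage itself has NO effect on the tax.
suffrage = [1.0 if (yr - 1850) / 150 > random.random() else 0.0 for yr in year]
tax = [0.2 * (yr - 1850) + random.gauss(0, 5) for yr in year]

bivar = ols([[1.0, s] for s in suffrage], tax)
multi = ols([[1.0, s, yr] for s, yr in zip(suffrage, year)], tax)
print(f"bivariate suffrage coefficient:            {bivar[1]:6.2f}")  # spuriously large
print(f"suffrage coefficient controlling for year: {multi[1]:6.2f}")  # near zero
```

With this setup, the bivariate coefficient reflects only the shared time trend, mirroring the Table 6.8 pattern.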
Scheve and Stasavage argue that war was a more important factor behind
increased inheritance taxes. When a country mobilizes to fight, leaders not only
need money to fund the war, they also need a societal consensus in favor of it.
Ordinary people may feel stretched thin, with their sons conscripted and their
taxes increased. An inheritance tax could be a natural outlet that provides the
government with more money while creating a sense of fairness within society.
Column (b) in the multivariate results includes a dummy variable indicating
that the country was mobilized for war for more than half of the preceding
five years. The coefficient on the war variable is 14.05, with a t statistic of 4.68,
meaning that there is a strong connection between war and inheritance taxes. The
coefficient on universal suffrage is negative but not quite statistically significant
(with a t statistic of 1.51). The coefficient on year continues to be highly statistically
significant, indicating that the year trend persists even when we control for war.
Many other factors could affect the dependent variable and be correlated with
one or more of the independent variables. There could, for example, be regional
variation, as perhaps Europe tended to have more universal male suffrage and
higher inheritance taxes. Therefore, we include dummy variables for Europe, Asia,
and Australia/New Zealand in column (c). North America is the reference category,
which means, for example, that European inheritance taxes were 5.65 percentage
points lower than in North America once we control for the other variables.
The coefficient on the war variable in column (c) is a bit lower than in column
(b) but still very significant. The universal male suffrage variable is close to zero and
statistically insignificant. These results therefore suggest that the results in column
(b) are robust to controlling for continent.
Column (d) shows what happens when we use Australia/New Zealand as our
reference category instead of North America. The coefficients on the war and
6.3 Transforming Categorical Variables to Multiple Dummy Variables 201
suffrage variables are identical to those in column (c). Remember that changing
the reference category affects only how we interpret the coefficients on the dummy
variables associated with the categorical variable in question.
The coefficients on the region variables, however, do change with the new
reference category. The coefficient on Europe in column (d) is 2.19 and statistically
insignificant. Wait a minute! Wasn’t the coefficient on Europe –5.65 and statistically
significant in column (c)? Yes, but in column (c), Europe was being compared to
North America, and Europe’s average inheritance taxes were (controlling for the
other variables) 5.65 percentage points lower than North American inheritance
taxes. In column (d), Europe is being compared to Australia/New Zealand, and the
coefficient indicates that European inheritance taxes were 2.19 percentage points
higher than in Australia/New Zealand.
The relative relationship between Europe and North America is the same in
both specifications as the coefficient on the North America dummy variable is 7.84
in column (d), which is 5.65 higher than the coefficient on Europe in column (d).
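We can verify this reference-category arithmetic numerically. The sketch below uses made-up region effects, not the Table 6.8 estimates: it fits the same model twice with different omitted categories and confirms that the slope on the continuous variable is unchanged while the dummy coefficients shift by a common constant:

```python
import random

def ols(X, y):
    """OLS coefficients via the normal equations (X'X) b = X'y, solved by Gauss-Jordan."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    a = [xtx[i] + [xty[i]] for i in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * p for x, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

random.seed(7)
regions = ["NorthAmerica", "Europe", "ANZ"]
rows = [(random.choice(regions), random.uniform(0, 10)) for _ in range(300)]
intercepts = {"NorthAmerica": 8.0, "Europe": 2.0, "ANZ": 0.0}  # made-up region effects
y = [intercepts[r] + 1.5 * x + random.gauss(0, 1) for r, x in rows]

def fit(reference):
    """Fit y on a constant, dummies for every region except `reference`, and x."""
    dummies = [r for r in regions if r != reference]
    X = [[1.0] + [1.0 if r == d else 0.0 for d in dummies] + [x] for r, x in rows]
    b = ols(X, y)
    out = {d: b[1 + i] for i, d in enumerate(dummies)}
    out["intercept"], out["x"] = b[0], b[-1]
    return out

vs_na = fit("NorthAmerica")  # dummies: Europe, ANZ
vs_anz = fit("ANZ")          # dummies: NorthAmerica, Europe
print("slope on x:", round(vs_na["x"], 3), "vs", round(vs_anz["x"], 3))  # identical
# Europe relative to North America is recoverable from either fit:
print(round(vs_na["Europe"], 3), "==", round(vs_anz["Europe"] - vs_anz["NorthAmerica"], 3))
```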
FIGURE 6.9: 95 Percent Confidence Intervals for Universal Male Suffrage Variable in Table 6.8. The plot shows intervals for the bivariate model and multivariate models (a) through (c) on an estimated-coefficient axis running from −5 to 20.
We can go through such a thought process for each of the coefficients and see
the bottom line: as long as we know how to use dummy variables for categorical
variables, the substantive results are exactly the same in multivariate columns (c)
and (d).
Figure 6.9 shows the 95 percent confidence intervals for the coefficient on the
universal suffrage variable for the bivariate and multivariate models. As discussed
in Section 4.6, confidence intervals indicate the range of possible true values
most consistent with the data. In the bivariate model, the confidence interval
ranges from 15.8 to 22.9. This confidence interval does not cover zero, which is
another way of saying that the coefficient is statistically significant. When we move
to the multivariate models, however, the 95 percent confidence intervals shift
dramatically downward and cover zero, indicating that the estimated effect is no
longer statistically significant. We don’t need to plot the results from column (d)
because the coefficient on the suffrage variable is identical to that in column (c).
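The arithmetic linking a confidence interval to statistical significance is easy to check. In the sketch below, the first coefficient and standard error pair (roughly 19.35 and 1.81) is backed out from the bivariate interval reported above; the second pair uses the −0.38 suffrage coefficient with a purely hypothetical standard error:

```python
def ci95(coef, se):
    """95 percent confidence interval under a large-sample normal approximation."""
    return coef - 1.96 * se, coef + 1.96 * se

# First pair implied by the bivariate interval (15.8 to 22.9); second pair
# pairs the multivariate suffrage coefficient with an assumed standard error.
for coef, se in [(19.35, 1.81), (-0.38, 2.50)]:
    lo, hi = ci95(coef, se)
    t = coef / se
    print(f"coef = {coef:6.2f}, 95% CI = ({lo:6.2f}, {hi:6.2f}), "
          f"|t| = {abs(t):5.2f}, covers zero: {lo <= 0 <= hi}")
```

The interval excludes zero exactly when |t| exceeds 1.96, which is why the two statements are interchangeable.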
A dummy variable for gender by itself assumes that all men get paid more by the same amount. It could be that work experience for men is more highly rewarded than work experience for women. We address this possibility with models in which a dummy independent variable interacts with (meaning "is multiplied by") a continuous independent variable.7
The following OLS model allows the effect of X to differ across groups:

Yi = β0 + β1 Xi + β2 Dummyi + β3 Dummyi × Xi + εi
The third variable is produced by multiplying the Dummyi variable times the Xi
variable. In a spreadsheet, we would simply create a new column that is the product
of the Dummy and X columns. In statistical software, we generate a new variable,
as described in the Computing Corner of this chapter.
For the Dummyi = 0 group, the fitted value equation simplifies to

Ŷi = β̂0 + β̂1 Xi

In other words, the estimated intercept for the Dummyi = 0 group is β̂0 and the estimated slope is β̂1.
For the Dummyi = 1 group, the fitted value equation simplifies to

Ŷi = (β̂0 + β̂2) + (β̂1 + β̂3)Xi

In other words, the estimated intercept for the Dummyi = 1 group is β̂0 + β̂2 and the estimated slope is β̂1 + β̂3.
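These two fitted lines can be verified with a quick numerical check; the coefficient values below are invented purely for illustration:

```python
# Hypothetical estimates: b0 (intercept), b1 (slope on X),
# b2 (dummy intercept shift), b3 (dummy slope shift).
b0, b1, b2, b3 = 30.0, 1.0, 5.0, 2.0

def fitted(x, dummy):
    """Fitted value from Y-hat = b0 + b1*X + b2*Dummy + b3*Dummy*X."""
    return b0 + b1 * x + b2 * dummy + b3 * dummy * x

for x in [0, 5, 10]:
    # Dummy = 0 observations lie on the line b0 + b1*X ...
    assert fitted(x, 0) == b0 + b1 * x
    # ... and Dummy = 1 observations on the line (b0 + b2) + (b1 + b3)*X.
    assert fitted(x, 1) == (b0 + b2) + (b1 + b3) * x

print("Dummy = 0 line: intercept", b0, "slope", b1)
print("Dummy = 1 line: intercept", b0 + b2, "slope", b1 + b3)
```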
Figure 6.10 shows a hypothetical example for the following model of salary as a function of experience for men and women:

Salaryi = β0 + β1 Experiencei + β2 Malei + β3 Malei × Experiencei + εi
The dummy variable here is an indicator for men, and the continuous variable
is a measure of years of experience. The intercept for women (the Dummyi = 0
group) is β̂0 , and the intercept for men (the Dummyi = 1 group) is β̂0 + β̂ 2 . The β̂2
coefficient indicates the salary bump that men get even at 0 years of experience.
The slope for women is β̂1 , and the slope for men is β̂1 + β̂ 3 . The β̂ 3 coefficient
indicates the extra salary men get for each year of experience over and above the
salary increase women get for another year of experience. In this figure, the initial
gap between the salaries of men and women is modest (equal to β̂ 2 ), but due to
a positive β̂ 3 , the salary gap becomes quite large for people with many years of
experience.
7 Interactions between continuous variables are created by multiplying two continuous variables together. The general logic is the same. Kam and Franzese (2007) provide an in-depth discussion of all kinds of interactions.
FIGURE 6.10: Salary (in $1,000s) and years of experience (0 to 10), with fitted lines for men (Dummyi = 1 group; intercept β̂0 + β̂2, slope β̂1 + β̂3) and women (Dummyi = 0 group; intercept β̂0, slope β̂1).
How the two slopes relate depends on the signs of β̂1 and β̂3:

β̂1 < 0: The slope for the Di = 0 group is negative. The slope for the Di = 1 group is more negative if β̂3 < 0; the same if β̂3 = 0; and less negative if β̂3 > 0 (it will be positive if β̂1 + β̂3 > 0).

β̂1 = 0: The slope for the Di = 0 group is zero. The slope for the Di = 1 group is negative if β̂3 < 0; zero if β̂3 = 0; and positive if β̂3 > 0.

β̂1 > 0: The slope for the Di = 0 group is positive. The slope for the Di = 1 group is less positive if β̂3 < 0 (it will be negative if β̂1 + β̂3 < 0); the same if β̂3 = 0; and more positive if β̂3 > 0.
The standard error of β̂ 3 is useful for calculating confidence intervals for the
difference in slope coefficients across the two groups. Standard errors for some
quantities of interest are tricky, though. To generate confidence intervals for the
effect of X on Y, we need to be alert. For the Dummyi = 0 group, the effect is
simply β̂1 , and we can simply use the standard error of β̂1 . For the Dummyi = 1
group, the effect is β̂1 + β̂ 3 ; the standard error of the effect is more complicated
because we must account for the standard error of both β̂1 and β̂ 3 in addition to
any correlation between β̂1 and β̂ 3 (which is associated with the correlation of X1
and X3 ). The Citations and Additional Notes section provides more details on how
to do this on page 559.
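The variance arithmetic behind that standard error is var(β̂1 + β̂3) = var(β̂1) + var(β̂3) + 2cov(β̂1, β̂3). The numbers below are invented purely to illustrate the formula and why ignoring the covariance term gives the wrong answer:

```python
import math

# Hypothetical pieces of an estimated coefficient covariance matrix.
var_b1, var_b3, cov_b1_b3 = 0.25, 0.16, -0.10

# Correct standard error of the Dummy = 1 group's slope (b1 + b3).
se_sum = math.sqrt(var_b1 + var_b3 + 2 * cov_b1_b3)
# Naive version that wrongly ignores the covariance term.
naive = math.sqrt(var_b1 + var_b3)

print(f"correct SE of (b1 + b3):  {se_sum:.3f}")
print(f"naive SE (no covariance): {naive:.3f}")
```

Because β̂1 and β̂3 are typically negatively correlated (X and Dummy × X overlap), the naive calculation usually overstates the uncertainty.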
REMEMBER THIS
Interaction variables allow us to estimate effects that depend on more than one variable.
Yi = β0 + β1 Xi + β2 Dummyi + β3 Dummyi × Xi + εi
3. The fitted values from this model will be two lines. For the model as written, the slope for the
group for which Dummyi = 0 will be β̂1 . The slope for the group for which Dummyi = 1 will
be β̂1 + β̂ 3 .
4. The coefficient on a dummy interaction variable indicates the estimated difference in slopes
between two groups.
FIGURE 6.11: Various Fitted Lines from Dummy Interaction Models (for Review Questions). Each of the six panels, (a) through (f), plots Y against X (both 0 to 10) with a fitted line for the Dummyi = 1 group and a fitted line for the Dummyi = 0 group.
Review Questions
1. For each panel in Figure 6.11, indicate whether each of β0 , β1 , β2 , and β3 is less than, equal to,
or greater than zero for the following model:
Yi = β0 + β1 Xi + β2 Dummyi + β3 Dummyi × Xi + εi
6.4 Interaction Variables 207
The results for this model, in column (a) of Table 6.10, indicate that the home-
owner used 13.02 fewer therms of energy in months of using the programmable
thermostat than in months before he acquired it. Therms cost about $1.59 at this
8 For each day, the HDD is measured as the number of degrees that a day's average temperature is below 65 degrees Fahrenheit, the temperature below which buildings may need to be heated. The monthly measure adds up the daily measures and provides a rough measure of the amount of heating needed in the month. If the temperature is above 65 degrees, the HDD measure will be zero.
FIGURE 6.12: Heating Used and Heating Degree-Days for Homeowner Who Installed a Programmable Thermostat
time, so the homeowner saved roughly $20.70 per month on average. That’s not
bad. However, the effect is not statistically significant (not even close, really, as
the t statistic is only 0.54), so based on this result, we should be skeptical that the
thermostat saved money.
The difference of means model does not control for anything else, and we know
that the coefficient on the programmable thermostat variable will be biased if some
other variable matters and is correlated with the programmable thermostat vari-
able. In this case, we know unambiguously that HDD matters, and it is plausible that
the HDD differed in the months with and without the programmable thermostat.
Hence, a better model is clearly

Thermsi = β0 + β1 Programmable thermostati + β2 HDDi + εi
The results for this model are in column (b) of Table 6.10. The HDD variable is
hugely (massively, superlatively) statistically significant. Including it also leads to a
TABLE 6.10 Data from Programmable Thermostat and Home Heating Bills

                              (a)          (b)          (c)
Programmable thermostat    −13.02      −20.05∗       −0.48
                           (23.94)      (4.49)       (4.15)
                          [t = 0.54]  [t = 4.46]   [t = 0.11]
N                              45           45           45
σ̂                           80.12        15.00        10.25
The results for a model that also interacts the thermostat dummy with HDD,

Thermsi = β0 + β1 Programmable thermostati + β2 HDDi + β3 Programmable thermostati × HDDi + εi

are in column (c) of Table 6.10, where the coefficient on Programmable thermostat indicates the difference in therms when the other variables are zero. Because the other variables involve HDD, the coefficient on Programmable thermostat indicates the effect of the thermostat when HDD is zero
(meaning the weather is warm for the whole month). The coefficient of −0.48 with a
t statistic of 0.11 indicates there is no significant bump down in energy usage across
all months. This might seem to be bad news, but is it, given that we have figured out that the programmable thermostat shouldn't reduce heating costs when the furnace isn't running?
Not quite. The overall effect of the thermostat is β̂1 + β̂ 3 × HDD. Although
we have already seen that β̂1 is insignificant, the coefficient on Programmable
thermostat × HDD, −0.062, is highly statistically significant, with a t statistic of
7.00. For every one-unit increase in HDD, the programmable thermostat lowered
the therms used by 0.062. In a month with the HDD variable equal to 500, we estimate that the homeowner changed energy use by β̂1 + β̂3 × 500 = −0.48 + (−0.062 × 500) = −31.48 therms after the programmable thermostat was installed (lowering the bill by $50.05, at $1.59 per therm). In a month with the HDD variable equal to 1,000, we estimate that the homeowner changed energy use by −0.48 + (−0.062 × 1000) = −62.48 therms, lowering the bill by $99.34 at $1.59 per
therm. Suddenly we’re talking real money. And we’re doing so from a model that
makes intuitive sense because the savings should indeed differ depending on how
cold it is.9
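The savings arithmetic can be reproduced directly from the column (c) estimates reported above (β̂1 = −0.48, β̂3 = −0.062, and $1.59 per therm):

```python
b1 = -0.48    # Programmable thermostat coefficient, column (c)
b3 = -0.062   # Programmable thermostat x HDD coefficient
price = 1.59  # dollars per therm

def therm_change(hdd):
    """Estimated change in therms from the thermostat in a month with a given HDD."""
    return b1 + b3 * hdd

for hdd in [0, 500, 1000]:
    d = therm_change(hdd)
    print(f"HDD = {hdd:5d}: change of {d:7.2f} therms, about ${-d * price:.2f} saved")
```

At HDD = 0 the estimated effect is essentially zero, and it grows roughly in proportion to how cold the month is, which is exactly the logic of the interactive model.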
This case provides an excellent example of how useful—and distinctive—the
dummy variable models we’ve presented in this chapter can be. In panel (a) of
Figure 6.13, we show the fitted values based on model (b) in Table 6.10, which
controls for HDD but models the effect of the thermostat as a constant difference
across all values of HDD. The effect of the programmable thermostat is statistically
significant and rather substantial, but it doesn’t ring true because it suggests that
savings from reduced use of gas for the furnace are the same in a sweltering
summer month and in a frigid winter month. Panel (b) of Figure 6.13 shows
the fitted values based on model (c) in Table 6.10, which allows the effect of
the thermostat to vary depending on the HDD. This is an interactive model that
yields fitted lines with different slopes. Just by inspection, we can see the fitted
lines for model (c) fit the data better. The effects are statistically significant and
substantial and, perhaps most important, make more sense because the effect of
the programmable thermostat on heating gas used increases as the month gets
colder.
9 We might be worried about correlated errors given that this is time series data. As discussed on page 68, the coefficient estimates are not biased if the errors are correlated, but standard OLS standard errors might not be appropriate. In Chapter 13, we show how to estimate models with correlated errors. For this data set, the results get a bit stronger.
FIGURE 6.13: Heating Used and Heating Degree-Days with Fitted Values for Different Models. Both panels plot therms (0 to 300), distinguishing months with and without the programmable thermostat.
Conclusion
Dummy variables are incredibly useful. Despite a less-than-flattering name, they
do some of the most important work in all of statistics. Experiments almost
always are analyzed with treatment group dummy variables. A huge proportion
of observational studies care about or control for dummy variables such as gender
or race. And when we interact dummy variables with continuous variables, we can
investigate whether the effects of certain variables differ by group.
We have mastered the core points of this chapter when we can do the
following:
• Section 6.1: Write down a model for a difference of means test using
bivariate OLS. Which parameter measures the estimated difference? Sketch
a diagram that illustrates the meaning of this parameter.
• Section 6.2: Write down a model for a difference of means test using
multivariate OLS. Which parameter measures the estimated difference?
Sketch a diagram that illustrates the meaning of this parameter.
• Section 6.4: Write down a model that has a dummy variable (D) interaction
with a continuous variable (X). How do we explain the effect of X on Y?
Sketch the relationship for Di = 0 observations and Di = 1 observations.
Further Reading
Brambor, Clark, and Golder (2006) as well as Kam and Franzese (2007) provide
excellent discussions of interactions, including the appropriate interpretation of
models with two continuous variables interacted. Braumoeller (2004) does a good
job of injecting caution into the interpretation of coefficients on lower-order terms
in models that include interaction variables.
Key Terms
Categorical variables (193)
Dichotomous variable (181)
Difference of means test (180)
Dummy variable (181)
Jitter (184)
Ordinal variables (193)
Reference category (194)
Computing Corner
Stata
3. Page 559 in the citations and additional notes section discusses how to
generate a standard error in Stata for the effect of X on Y for the Dummyi =
1 group.
R

3. Page 559 in the citations and additional notes section discusses how to generate a standard error in R for the effect of X on Y for the Dummyi = 1 group.
category and three dummy variables will be included. If the data type
of our categorical variable is factor, running lm(Y ~ X1) (notice we
do not need the factor command) will produce an OLS model with the
appropriate dummy variables included. To change the reference value for
a factor variable, use the relevel() command. For example, if we include
X1 = relevel(X1, ref = "south") before our regression model, the
reference category will be south.
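The same reference-category bookkeeping can be sketched outside Stata and R. The Python function below is illustrative only, not part of the book's Computing Corner: it builds dummy columns from a categorical variable, omitting whichever category we choose as the reference:

```python
def dummy_columns(values, reference):
    """One-hot encode a categorical variable, omitting the reference category."""
    levels = sorted(set(values))
    keep = [lev for lev in levels if lev != reference]
    # One row of 0/1 indicators per observation, one column per kept category.
    return keep, [[1.0 if v == lev else 0.0 for lev in keep] for v in values]

region = ["south", "west", "north", "south", "east", "west"]
names, cols = dummy_columns(region, reference="south")
print(names)  # the included dummy columns; "south" is the omitted category
for r, row in zip(region, cols):
    print(r, row)  # every "south" row is all zeros
```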
Exercises
1. Use data from heightwage.dta that we used in Exercise 1 in Chapter 5
(page 172).
(a) Estimate an OLS regression model with adult wages as the dependent
variable and adult height, adolescent height, and a dummy variable
for males as the independent variables. Does controlling for gender
affect the results?
(c) Reestimate the model from part (a) separately for males and females.
Do these results differ from the model in which male was included
as a dummy variable? Why or why not?
(d) Estimate a model in which adult wages is the dependent variable and
there are controls for adult height and adolescent height in addition
to dummy variable interactions of male times each of the two height
variables. Compare the results to the results from part (c).
(a) Create two scatterplots, one for years in which a Democrat was
president and one for years in which a Republican was president,
showing the relationship between the FFR and the quarters since the
previous election. Comment on the differences in the relationships.
The variable Quarters is coded 0 to 15, representing each quarter
from one election to the next. For each presidential term, the value
of Quarters is 0 in the first quarter containing the election and 15 in
the quarter before the next election.
(d) Graph two fitted lines for the relationship between Quarters and
interest rates, one for Republicans and one for Democrats. (In Stata,
use the twoway and lfit commands with appropriate if statements;
label by hand. In R, use the abline command.) Briefly describe the
relationship.
(e) Rerun the model from part (b) controlling for both the interest rate
in the previous quarter (lag_FEDFUND) and inflation, and discuss
the results, focusing on (i) effect of Quarters for Republicans, (ii) the
differential effect of Quarters for Democrats, (iii) impact of lagged
FFR, and (iv) inflation. Simply report the statistical significance of
the coefficient estimates; don’t go through the entire analysis from
part (c).
3. This problem uses the cell phone and traffic data set described in Chapter 5
(page 174) to analyze the relationship between cell phone and texting bans
and traffic fatalities. We add two variables: cell_ban is coded 1 if it is illegal
to operate a handheld cell phone while driving and 0 otherwise; text_ban
is coded 1 if it is illegal to text while driving and 0 otherwise.
(a) Add the dummy variables for cell phone bans and texting bans to the
model from Question 3, part (c) in Chapter 5 (page 175). Interpret
the coefficients on these dummy variables.
(b) Explain whether the results from part (a) allow the possibility that
a cell phone ban saves more lives in a state with a large population
compared to a state with a small population. Discuss the implications
for the proper specification of the model.
(c) Estimate a model in which total miles is interacted with both the
cell phone ban and the prohibition of texting variables. What is the
estimated effect of a cell phone ban for California? For Wyoming?
What is the effect of a texting ban for California? For Wyoming?
What is the effect of total miles?
(d) This question uses material from page 559 in the citations and
additional notes section. Figure 6.14 displays the effect of the cell
phone ban as a function of total miles. The dashed lines depict
confidence intervals. Identify the points on the fitted lines for the
estimated effects for California and Wyoming from the results in
part (c). Explain the conditions under which the cell phone ban has
a statistically significant effect.10
10 Brambor, Clark, and Golder (2006) provide Stata code to create a plot like this for models with interaction variables.
(a) Implement a simple difference of means test that uses OLS to assess
whether the fines for men and women are different. Do we have any
reason to expect endogeneity? Explain.
(b) Implement a difference of means test for men and women that
controls for age and miles per hour. Do we have any reason to expect
endogeneity? Explain.
(c) Building from the model just described, also assess whether fines are
higher for African-Americans and Hispanics compared to everyone
else (non-Hispanic whites, Asians and others). Explain what the
coefficients on these variables mean.
(d) Look at the standard errors on the coefficients for the Female, Black, and
Hispanic variables. Why are they different?
(e) Within a single OLS model, assess whether miles over the speed limit
has a differential effect on the fines for women, African-Americans,
and Hispanics.
(c) The effect of party may go beyond simply giving all Republicans
a bump up or down in their answers. It could be that political
knowledge interacts with being Republican such that knowledge has
different effects on Republicans and non-Republicans. To test this,
estimate a model that includes a dummy interaction term:
11 We could use tools for categorical variables discussed in Section 6.3 to separate non-Republicans into Democrats and Independents. Our conclusions would be generally similar in this particular example.
7 Specifying Models
1 We have used multivariate OLS to net out the effect of income, religiosity, and children from the life satisfaction scores.

2 Or smile shaped, if you will. To my knowledge, there is no study of chocolate and happiness, but I'm pretty sure it would be an upside-down U: people might get happier the more they eat for a while, but at some point, more chocolate has to lead to unhappiness, as it did for the kid in Willy Wonka.
7.1 Quadratic and Polynomial Models 221
FIGURE 7.1: Life satisfaction (4.5 to 8.0) plotted against age (20 to 70).
Yi = β0 + β1² X1i + εi

Yi = β0 + β1 X1i + β2² X1i + εi
The X's, though, are fair game: we can square, cube, log, or otherwise transform X's to produce fitted curves instead of fitted lines. Therefore, both of the following models are OK in OLS because each β simply multiplies itself times some independent variable that may or may not be non-linear:

Yi = β0 + β1 X1i + β2 X1i² + εi

Yi = β0 + β1 X1i + β2 X1i⁷ + εi
Non-linear relationships are common in the real world. Figure 7.2 shows
data on life expectancy and GDP per capita for all countries in the world. We
immediately sense that there is a positive relationship: the wealthier countries
definitely have higher life expectancy. But we also see that the relationship is a
curve rather than a line because life expectancy rises rapidly at the lower levels of
GDP per capita but then flattens out. Based on this data, it’s pretty reasonable to
expect an annual increase of $1,000 in per capita GDP to have a fairly substantial
effect on life expectancy in a country with low GDP per capita, while an increase of
$1,000 in per capita GDP for a very wealthy country would have only a negligible
effect on life expectancy. Therefore, we want to get beyond estimating straight
lines alone.
Figure 7.3 shows the life expectancy data with two different kinds of fitted lines. Panel (a) shows a fitted line from a standard OLS model:

Life expectancyi = β0 + β1 GDP per capitai + εi
As we can see, the fit isn’t great. The fitted line is lower than the data for many
of the observations with low GDP values. For observations with high GDP levels,
the fitted line dramatically overestimates life expectancy. As bad as it is, though,
this is the best possible straight line in terms of minimizing squared error.
3 The world doesn't end if we really want to estimate a model that is non-linear in the β's. We just need something other than OLS to estimate the model. In Chapter 12, we discuss probit and logit models, which are non-linear in the β's.
FIGURE 7.2: Life Expectancy and Per Capita GDP in 2011 for All Countries in the World (y-axis: life expectancy in years, 50 to 80)
Polynomial models

We can generate a better fit by using a polynomial model. Polynomial models include not only an independent variable but also the independent variable raised to some power. By using a polynomial model, we can produce fitted value lines that curve. (polynomial model: A model that includes values of X raised to powers greater than one.)

The simplest example of a polynomial model is a quadratic model that includes X and X². (quadratic model: A model that includes X and X² as independent variables.) The model looks like this:

Yi = β0 + β1 X1i + β2 X1i² + εi        (7.2)
For our life expectancy example, a quadratic model is

Life expectancyi = β0 + β1 GDP per capitai + β2 GDP per capitai² + εi
Panel (b) of Figure 7.3 plots this fitted curve, which better captures the
non-linearity in the data as life expectancy rises rapidly at low levels of GDP and
then levels off. The fitted curve is not perfect. The predicted life expectancy is
FIGURE 7.3: Linear and Quadratic Fitted Lines for Life Expectancy Data (both panels: life expectancy in years, 50 to 90, versus GDP per capita in $1,000s, 0 to 100)
FIGURE 7.3: Linear and Quadratic Fitted Lines for Life Expectancy Data
still a bit low for low values of GDP, and the turn to negative effects seems more
dramatic than the data warrant. We’ll see how to generate fitted lines that flatten
out without turning down when we cover logged models later in this chapter.
Interpreting coefficients in a polynomial model differs from interpreting them in a standard OLS model. Note that the effect of X changes depending on
the value of X. In panel (b) of Figure 7.3, the effect of GDP on life expectancy
is large for low values of GDP. That is, when GDP goes from $0 to $20,000, the
fitted value for life expectancy increases relatively rapidly. The effect of GDP on
life expectancy is smaller as GDP gets higher: the change in fitted life expectancy
when GDP goes from $40,000 to $60,000 is much smaller than the change in fitted
life expectancy when GDP goes from $0 to $20,000. The predicted effect of GDP
even turns negative when GDP goes above $60,000.
We need some calculus to get the specific equation for the effect of X on Y. We refer to the effect of X1 on Y as ∂Y/∂X1:

∂Y/∂X1 = β1 + 2β2 X1        (7.4)
This equation means that when we interpret results from a polynomial regression, we can't look at individual coefficients in isolation; instead, we need to know how the coefficients on X1 and X1² come together to produce the estimated curve.4
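Equation 7.4 is easy to check numerically. With hypothetical coefficients chosen so the curve peaks at X = 60 (echoing the shape of the life expectancy example, not its actual estimates), the marginal effect shrinks as X grows and eventually turns negative:

```python
# Hypothetical quadratic coefficients (not estimates from the text):
b1, b2 = 1.2, -0.01

def marginal_effect(x):
    """dY/dX for Y = b0 + b1*X + b2*X^2 (Equation 7.4): b1 + 2*b2*X."""
    return b1 + 2 * b2 * x

for x in [0, 20, 40, 60, 80]:
    print(f"X = {x:3d}: effect of one more unit of X = {marginal_effect(x):6.2f}")

# The curve peaks where the marginal effect is zero: X = -b1 / (2 * b2).
print("turning point at X =", round(-b1 / (2 * b2), 1))
```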
FIGURE 7.4: Six panels, (a) through (f), each plotting a different quadratic relationship between X (0 to 100) and Y; panel (b), for example, plots Y = 20X − 0.1X².
4 Equation 7.4 is the result of using standard calculus tools to take the derivative of Y in Equation 7.2 with respect to X1. The derivative is the slope evaluated at a given value of X1. For a linear model, the slope is always the same and is β̂1. The ∂Y in the numerator refers to the change in Y; the ∂X1 in the denominator refers to the change in X1. The fraction ∂Y/∂X1 therefore refers to the change in Y divided by the change in X1, which is the slope.

Figure 7.4 illustrates more generally the kinds of relationships that a quadratic model can account for. Each panel illustrates a different quadratic function. In panel (a), the effect of X is getting bigger as X gets bigger. In panel (b), the effect of X on Y is getting smaller. In both panels, Y gets bigger as X gets bigger, but the relationships have a quite different feel.

In panels (c) and (d) of Figure 7.4, there are negative relationships between X and Y: the more X, the less Y. Again, though, we see very different types of relationships. In panel (c), there is a leveling out, while in panel (d), the negative effect of X on Y accelerates as X gets bigger.

A quadratic OLS model can even estimate relationships that change directions. In panel (e) of Figure 7.4, Y initially gets bigger as X increases, but then it levels out. Eventually, increases in X decrease Y. In panel (f), we see the opposite pattern, with Y getting smaller as X rises for small values of X and, eventually, Y rising with X.

One of the nice things about using a quadratic specification in OLS is that we don't have to know ahead of time whether the relationship is curving down or up, flattening out, or getting steeper. The data will tell us. We can simply estimate a quadratic model and, if the relationship is like that in panel (a) of Figure 7.4, the estimated OLS coefficients will yield a curve like the one in the panel; if the relationship is like that in panel (f), OLS will produce coefficients that best fit the data. So if we have data that looks like any of the patterns in Figure 7.4, we can get fitted lines that reflect the data simply by estimating a quadratic OLS model.

Polynomial models with cubed or higher-order terms can account for patterns that wiggle and bounce even more than those in the quadratic model. It's relatively rare, however, to use higher-order polynomial models, which often simply aren't supported by the data. In addition, using higher-order terms without strong theoretical reasons can be a bit fishy, as in raising the specter of the model fishing we warn about in Section 7.4. A control variable with a high order can be more defensible, but ideally, our main results do not depend on untheorized high-order polynomial control variables.

REMEMBER THIS

OLS can estimate non-linear effects via polynomial models.

1. A polynomial model includes X raised to powers greater than one. The general form is

Yi = β0 + β1 Xi + β2 Xi² + εi
Discussion Questions
For each of the following, discuss whether you expect the relationship to be linear or non-linear.
Sketch the relationship you expect with a couple of points on the X-axis, labeled to identify the
nature of any non-linearity you anticipate.
(a) Age and income in France
(b) Height and speed in the Boston Marathon
(c) Height and rebounds in the National Basketball Association
(d) IQ and score on a college admissions test in Japan
(e) IQ and salary in Japan
(f) Gas prices and oil company profits
(g) Sleep and your score on your econometrics final exam
[Figure 7.5: Temperature (deviation from average pre-industrial temperature, in Fahrenheit) plotted against Year, in panels (a), (b), and (c).]
Panel (b) of Figure 7.5 includes the fitted line from a bivariate OLS model with
Year as the independent variable:
Temperaturei = β0 + β1Yeari + εi
Table 7.1 (model fit): σ̂ = 0.12 (linear), 0.11 (quadratic); R² = 0.73 (linear), 0.78 (quadratic).
Column (a) of Table 7.1 shows the coefficient estimates for the linear model.
The estimated β̂1 is 0.006, with a standard error of 0.0003. The t statistic of 18.74
indicates a highly statistically significant coefficient. The result suggests that the
earth has been getting 0.006 degree warmer each year since 1879 (when the data
series begins).
The data looks pretty non-linear, so we also estimate the following quadratic
OLS model:
Temperaturei = β0 + β1Yeari + β2Yeari² + εi
in which Year and Year2 are independent variables. This model allows us to assess
whether the temperature change has been speeding up or slowing down by
enabling us to estimate a curve in which the change per year in recent years is,
depending on the data, larger or smaller than the change per year in earlier years.
We have plotted the fitted line in panel (c) of Figure 7.5; notice it is a curve that gets
steeper over time. It fits the data even better, with less underestimation in recent
years and less overestimation in the 1970s.
Column (b) of Table 7.1 reports results from the quadratic model. The coef-
ficients on Year and Year2 have t stats greater than 5, indicating clear statistical
significance. The coefficient on Year is −0.166, and the coefficient on Year2 is
0.000044. What the heck do those numbers mean? At a glance, not much. Recall,
however, that in a quadratic model, an increase in Year by one unit will be associated
with a β̂1 + 2 β̂2 Yeari increase in estimated average global temperature. This means
the predicted change from an increase in Year by one unit in 1900 is

−0.166 + 2 × 0.000044 × 1900 = 0.0012 degree

The predicted change in temperature from an increase in Year by one unit in 2000 is

−0.166 + 2 × 0.000044 × 2000 = 0.01 degree
230 CHAPTER 7 Specifying Models
In the quadratic model, in other words, the predicted effect of Year changes
over time. In particular, the estimated rate of warming in 2000 (0.01 degree per year)
is around eight times the estimated rate of warming in 1900 (0.0012 degree per
year).
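These back-of-the-envelope calculations are easy to script. The sketch below (ours, not the book's code) plugs the quadratic estimates reported in the text into the marginal-effect formula β̂1 + 2β̂2·Year:

```python
# Marginal effect of Year in the quadratic model, beta1_hat + 2 * beta2_hat * Year,
# using the coefficient estimates reported in the text (Table 7.1, column (b)).
def marginal_effect(year, b1=-0.166, b2=0.000044):
    """Predicted change in temperature from a one-unit increase in Year."""
    return b1 + 2 * b2 * year

print(round(marginal_effect(1900), 4))  # 0.0012 degree per year
print(round(marginal_effect(2000), 4))  # 0.01 degree per year
```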
We won’t pay much attention at this point to the standard errors because errors
are almost surely autocorrelated (as discussed in Section 3.6), which would make
the standard errors reported by OLS incorrect (probably too small). We address
autocorrelation and other time series aspects of this data in Chapter 13.
Review Questions
Figure 7.6 contains hypothetical data on investment by consumer electronics companies as a function
of their profit margins.
1. For each panel, describe the model you think best explains the data.
2. Sketch a fitted line for each panel.
3. For each panel, approximate the predicted effect on R&D investment of changing profits from
0 to 1 percent and of changing profits from 3 to 4 percent.
For our purposes, we won’t be using the mathematical properties of logs too
much.5 We instead note that using logged variables in OLS equations can allow
us to characterize non-linear relationships that are broadly similar to panels (b)
and (c) of Figure 7.4. In that sense, these models don’t differ dramatically from
polynomial models.
Models with logged variables also have an attractive feature that quadratic models
lack. The estimated coefficients can be interpreted
[Figure 7.6: R&D investment plotted against profit margin (0 to 4 percent) in four panels (for the Review Questions).]

5 We derive the marginal effects in log models in the Citations and Additional Notes section (page 560).
directly in percentage terms. That is, with the correct logged model, we can
produce results that tell us how much a one percent increase in X affects Y. Often
this is a good way to think about empirical questions.
Consider the model of GDP and life expectancy we looked at on page 222. If
we estimate a basic OLS model such as

Life expectancyi = β0 + β1GDP per capitai + εi
the estimated β̂1 in this model would tell us the increase in life expectancy that
would be associated with a one-unit increase in GDP per capita (measured in
thousands of dollars in this example). At first glance, this might seem like an OK
model. On second glance, we might get nervous. Suppose the model produces
β̂1 = 0.25; that result would say that every country—whatever its GDP—would
get another 0.25 years of life expectancy for every thousand-dollar increase in GDP
per capita. That means that the effect of a dollar (or, given that we're measuring
GDP in thousands of dollars, a thousand dollars) is the same in a rich country like
the United States and a poor country like Cambodia. One could easily imagine
that the money in a poor country could go to life-extending medicine and nutrition;
in the United States, it seems likely the money would go to iPhone apps and maybe
triple bacon cheeseburgers, neither of which is particularly likely to increase
life expectancy.
It may be better to think of GDP changes in percentage terms rather than
in absolute values. A $1,000 increase in GDP per capita in Cambodia is a large
percentage increase, while in the United States, a $1,000 increase in GDP per
capita is not very large in percentage terms.
linear-log model: A model in which the dependent variable is not logged, but the independent variable is.

Logged models are extremely useful when we want to model relationships in
percentage terms. For example, we could estimate a linear-log model in which the
independent variable is logged (and the dependent variable is not logged). Such a
model would look like

Yi = β0 + β1 ln Xi + εi        (7.6)

where β1 indicates the effect of a one percent increase in X on Y.
We need to divide the estimated coefficient by 100 to convert it to units of Y.
This is one of the odd hiccups in models with logged variables: the units can be
a bit tricky. While we can memorize the way units work in these various models,
the safe course of action here is to simply accept that each time we use logged
models, we’ll probably have to look up how units in logged models work in the
summary on page 236.
Figure 7.7 shows a fitted line from a linear-log model using the GDP and
life expectancy data we saw earlier in Figure 7.3. One nice feature of the fitted
line from this model is that the fitted values keep rising by smaller and smaller
amounts as GDP per capita increases. This pattern contrasts to the fitted values
in the quadratic model, which declined for high values of GDP per capita. The
estimated coefficient in the linear-log model on GDP per capita (measured in
thousands of dollars) is 5.0. This implies that a one percent increase in GDP per
capita is associated with an increase in life expectancy of 0.05 years. For a country
with a GDP per capita of $100,000, then, an increase of GDP per capita of $1,000
7.2 Logged Variables 233

[Figure 7.7: Fitted line from a linear-log model of life expectancy (in years, roughly 50 to 80) on GDP per capita (in thousands of dollars, 0 to 100).]
is an increase of one percent and will increase life expectancy by 0.05 of a year.
For a country with a GDP per capita of $10,000, however, an increase of GDP
per capita of $1,000 is a 10 percent increase, implying that the estimated effect
is to increase life expectancy by about 0.5 of a year. A $1,000 increase in GDP
per capita for a country with GDP per capita of $1,000 would be a 100 percent
increase, implying that the fitted value of life expectancy would rise by about
5 years.
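The arithmetic in this paragraph can be checked with a short script (ours; the coefficient of 5.0 comes from the text, while the exact log formula replaces the text's rule-of-thumb approximation):

```python
import math

B1 = 5.0  # estimated linear-log coefficient on ln(GDP per capita), from the text

def predicted_change(x0, x1, b1=B1):
    """Exact change in fitted life expectancy when GDP per capita moves x0 -> x1.

    The text's rule of thumb (b1/100 years per one percent increase) is the
    first-order approximation of b1 * ln(x1/x0)."""
    return b1 * math.log(x1 / x0)

# A $1,000 increase is 1% at $100,000 but 10% at $10,000 (GDP in thousands):
print(round(predicted_change(100, 101), 2))  # 0.05 years
print(round(predicted_change(10, 11), 2))    # 0.48 years (rule of thumb: ~0.5)
```

Note that for small percentage changes the rule of thumb and the exact formula nearly coincide, which is why the text's approximations work well here.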
log-linear model: A model in which the dependent variable is transformed by taking its natural log.

Logged models come in several flavors. We can also estimate a log-linear
model, in which the dependent variable is transformed by taking its natural log
and the independent variable is not logged. For example, suppose we are
interested in testing whether women get paid less than men. We could run a simple linear
model with wages as the dependent variable and a dummy variable for women.
That's odd, though, because it would say that all women get β̂1 dollars less. It
might be more reasonable to think that discrimination works in percentage terms,
as women may get some percent less than men. The following log-linear model
captures this idea:

ln Wagei = β0 + β1 Femalei + εi
Because of the magic of calculus (shown on page 560), the β̂1 in this model
can be interpreted as the percentage change in Y associated with a one-unit
increase in X. In other words, the model would provide us with an estimate that
the difference in wages women get is β̂1 percent.
log-log model: A model in which the dependent variable and the independent variable are logged.

At the pinnacle of loggy-ness is the so-called log-log model. Log-log models
do a lot of work in economic models. Among other uses, they allow us to
estimate elasticity, which is the percent change in Y associated with a one percent
change in X. For example, if we want to know the elasticity of demand for airline
tickets, we can get data on sales and prices and estimate the following model:6

ln Ticket salesi = β0 + β1 ln Pricei + εi
6 A complete analysis would account for the fact that prices are also a function of the quantity of tickets sold. We address these types of models in Section 9.6.
7 Recall that the (natural) log of k is the exponent to which we have to raise e to obtain k. There is no number that we can raise e to and get zero. We can get close by raising e to minus a huge number; for example, e^−100 = 1/e^100, which is very close to zero, but not quite zero.
8 Some people recode these numbers as something very close to zero (e.g., 0.0000001) on the
reasoning that the log function is defined for low positive values and the essential information (that
the variable is near zero) in such observations is not lost. However, it’s always a bit sketchy to be
changing values (even from zero to a small number), so tread carefully.
TABLE 7.2 Different Logged Models of Relationship between Height and Wages

                          No log        Linear-log    Log-linear    Log-log
Adolescent height         0.412∗                      0.033∗
                          (0.098)                     (0.015)
                          [t = 4.23]                  [t = 2.23]
Log adolescent height                   29.316∗                     2.362∗
                                        (6.834)                     (1.021)
                                        [t = 4.29]                  [t = 2.31]
Constant                  −13.093       −108.778∗     0.001         −7.754
                          (6.897)       (29.092)      (1.031)       (4.348)
                          [t = 1.90]    [t = 3.74]    [t = 0.01]    [t = 1.78]
N                         1,910         1,910         1,910         1,910
R²                        0.009         0.010         0.003         0.003
The second column reports results from a linear-log model in which the
dependent variable is not logged and the independent variable is logged. The
interpretation of β̂1 is that a one percent increase in X (which is adolescent height
in this case) is associated with a 29.316/100 = $0.293 increase in hourly wages. The
dividing by 100 is a bit unusual, but no big deal once we get used to it.
The third column reports results from a model in which the dependent variable
has been logged but the independent variable has not been logged. In such a
log-linear model, the coefficient indicates the percent change in the dependent
variable associated with a one-unit change in the independent variable. The
interpretation of β̂1 here is that a one-inch increase in height is associated with
a 3.3 percent increase in wages.
The fourth column reports a log-log model in which both the dependent
variable and the independent variable have been logged. The interpretation of β̂1
here is that a one percent increase in height is associated with a 2.362 percent
increase in wages. Note that in the log-linear column, the percentage is on a scale
of 0 to 1, and in the log-log column, the percentage is on a 0 to 100 scale. Yeah,
that’s a pain; it’s just how the math works out.
So which model is best? Sadly, there is no magic bullet that will always hit the
perfect model here, another hiccup when we work with logged models. We can’t
simply look at the R2 because those values are not comparable: in the first two
models the dependent variable is Y, and in the last two, the dependent variable is
ln(Y). As is often the case, some judgment will be necessary. If we’re dealing with
an economic problem of estimating price elasticity, a log-log model is natural. In
other contexts, we have to decide whether the causal mechanism makes more sense
in percentage terms and whether it applies to the dependent and/or independent
variables.
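The unit conventions across the columns of Table 7.2 can be condensed into a few lines of code (ours; the model labels follow the text's terminology, and the printed numbers reproduce the interpretations above):

```python
# A quick-reference sketch for converting a logged-model coefficient into a
# substantive statement about Y, following the conventions in the text.
def effect_of(model, beta1):
    if model == "linear-log":  # 1% increase in X -> beta1/100 units of Y
        return beta1 / 100
    if model == "log-linear":  # 1-unit increase in X -> 100*beta1 percent change in Y
        return beta1 * 100
    if model == "log-log":     # 1% increase in X -> beta1 percent change in Y
        return beta1
    raise ValueError(model)

print(round(effect_of("linear-log", 29.316), 5))  # 0.29316 ($0.293 more per hour)
print(round(effect_of("log-linear", 0.033), 1))   # 3.3 percent higher wages per inch
print(round(effect_of("log-log", 2.362), 3))      # 2.362 percent per 1% of height
```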
REMEMBER THIS
1. How to interpret logged models:
(a) Linear-log (Yi = β0 + β1 ln Xi + εi): a one percent increase in X is associated with a β̂1/100 unit change in Y.
(b) Log-linear (ln Yi = β0 + β1Xi + εi): a one-unit increase in X is associated with a 100 × β̂1 percent change in Y.
(c) Log-log (ln Yi = β0 + β1 ln Xi + εi): a one percent increase in X is associated with a β̂1 percent change in Y.
2. Logged models have some challenges not found in other models (the Three Hiccups):
(a) The scale of the β̂ coefficients varies depending on whether the model is log-linear,
linear-log, or log-log.
(b) We cannot log variables that have values less than or equal to zero.
(c) There is no simple test for choosing among log-linear, linear-log, and log-log
models.
[Figure 7.8: Causal paths for a post-treatment variable. X1, the independent variable (example: 9th grade tutoring treatment), affects Y, the dependent variable (example: age 26 earnings), directly with effect γ1; X1 also affects X2, the post-treatment variable (example: 12th grade reading scores), which in turn affects Y with effect γ2.]
A post-treatment variable comes after an independent variable of interest and could be caused by it. Our concern
with post-treatment variables is definitely not limited to experiments, though, as
post-treatment variables can screw up observational studies as well.
mediator bias: Bias that occurs when a post-treatment variable is added and absorbs some of the causal effect of the treatment variable.

Two problems can arise when we include post-treatment variables. The first
is called mediator bias, a type of bias that occurs when a post-treatment
variable is added and absorbs some of the causal effect of the treatment variable.
For example, suppose we provided extra tutoring for a randomly selected group
of ninth graders and then assessed their earnings at age 26. The mechanism for
the tutoring to work had two parts, as shown in Figure 7.8. The arrows indicate a
causal effect, and the Greek letters next to the arrows indicate the magnitude of
the variable's effect.
We see that the tutoring had a direct effect on earnings of γ1 . Tutoring also
increased test scores by α, and reading scores increased earnings by γ2 . In other
words, Figure 7.8 shows that if we plunk a kid in this tutoring program, he or she
will make γ1 + αγ2 more at age 26.
Suppose we estimate a simple model with only the treatment variable:

Earningsi = β0 + β1Tutoringi + εi
While this model doesn’t capture the complexity of the process by which tutoring
increased earnings, it does capture the overall effect of being in the tutoring
program. Simply put, β̂1 will provide an unbiased estimate of the effect of the
tutoring program because the tutoring treatment was randomly assigned and is
therefore not correlated with anything, including . In terms of Figure 7.8, a kid in
the tutoring program will earn γ1 +αγ2 more at age 26, and E[ β̂1 ] will be γ1 +αγ2 .
It might seem that adding reading scores to the model would be useful. Maybe.
But we need to be careful. If we estimate

Earningsi = β0 + β1Tutoringi + β2Reading scoresi + εi        (7.10)
the estimated coefficient on the tutoring treatment will only capture the direct
effect of tutoring and will not capture the indirect effect of the tutoring via
improving reading scores; that is, E[ β̂1 ] = γ1 . That means that if we naively focus
on β̂1 as the effect of the tutoring treatment, we’ll miss that portion of the effect
associated with the tutoring increasing reading scores.9
In a case like this, two steps are most appropriate. First, we should estimate the
simpler model without the post-treatment variable in order to estimate the overall
effect of the treatment. Second, if we want to understand the process by which
the treatment variable affects the outcome we can estimate two equations: one
that looks like Equation 7.10 in order to estimate the direct effect of 12th grade
reading on earnings, and another equation to understand the effect of the tutoring
treatment on 12th grade reading.
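A minimal simulation (ours, simpler than the book's Ch7_PostTreatmentSimulation code) illustrates the first step: with γ1 = α = γ2 = 1, regressing the outcome on the randomly assigned treatment alone recovers the total effect γ1 + αγ2 = 2.

```python
import random

random.seed(0)
n = 100_000
x1 = [random.gauss(0, 1) for _ in range(n)]               # randomized treatment
x2 = [x + random.gauss(0, 1) for x in x1]                 # post-treatment: alpha = 1
y = [a + b + random.gauss(0, 1) for a, b in zip(x1, x2)]  # gamma1 = gamma2 = 1

def slope(x, y):
    """Bivariate OLS slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return num / sum((xi - mx) ** 2 for xi in x)

# Without the post-treatment variable, the slope recovers the total effect
# gamma1 + alpha * gamma2 = 2; adding x2 as a control would instead push the
# estimated treatment coefficient toward the direct effect gamma1 = 1.
print(round(slope(x1, y), 1))  # ~2.0
```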
collider bias: Bias that occurs when a post-treatment variable creates a pathway for spurious effects to appear in our estimation.

The second problem that post-treatment variables can cause is collider bias,
a type of bias that occurs when a post-treatment variable creates a pathway for
spurious effects to appear in our estimation. This bias is more subtle and therefore
more insidious than mediator bias. In particular, if we include a post-treatment
variable that is affected by an unobserved confounder that also affects the
dependent variable, the estimated effect of a variable of interest may look large
when it is zero, look small when it is large, look positive when it is negative, and
so on.10
Here we’ll focus on a case in which including a post-treatment variable can
lead to an appearance of a causal relationship when there is in fact no relationship;
we’re building on an example from Acharya, Blackwell, and Sen (2016). Suppose
we want to know if car accidents cause the flu. It’s a silly question: we don’t really
think that car accidents cause the flu, but let’s see if a post-treatment variable could
lead us to think car accidents do cause (or prevent) the flu. Suppose we have data
9 We provide references to the recent statistical literature on this issue in the Further Reading section at the end of this chapter.
10 The name "collider bias" is not particularly intuitive. It comes from a literature that uses diagrams (like Figure 7.9) to assess causal relations. The two arrows from X1 and U "collide" at X2, hence the name.
7.3 Post-Treatment Variables 239
[FIGURE 7.9: Example in which a Post-Treatment Variable Creates a Spurious Relationship between X1 and Y. X1, the independent variable (example: car accident), affects X2 with effect α; U, the unobserved confounder variable (example: high fever), affects X2 with effect ρ1 and Y with effect ρ2.]
on 100,000 people and whether they were in a car accident (our independent
variable of interest, which we label X1 ), whether they were hospitalized (our
post-treatment variable, which we label X2 ), and whether they had the flu (our
dependent variable, Y). Compared to our discussion of mediator bias, we’ll add
a confounder variable, which is something that is unmeasured but affects both
the post-treatment variable (X2 ) and the dependent variable (Y). We’ll label the
confounder as U to emphasize that it is unobserved.
Figure 7.9 depicts the true state of relationships among variables in our exam-
ple. Car accidents increase hospitalization by α, fever increases hospitalization by
ρ1 , and fever increases the probability of having the flu by ρ2 . In our example, car
accidents have no direct effect on having the flu, and being hospitalized itself does
not increase the probability of having the flu. (We allow for these direct effects in
our more general discussion of collider bias in Section 14.8.)
If we estimate a simple model

Flui = β0 + β1Car accidenti + εi
we will be fine because the car accident variable is uncorrelated with the
unobserved factor, fever (which we can see by noting there is no direct connection
between the car accidents and fever in Figure 7.9). The expected value of β̂1 for
such a model will be the true effect, which is zero in our example depicted in the
figure.
It might seem pretty harmless to also add a variable for hospitalization to the
model, so that our model now looks like

Flui = β0 + β1Car accidenti + β2Hospitalizedi + εi

[Figure 7.10: General causal paths for collider bias. X1, the independent variable, affects Y, the dependent variable, directly with effect γ1 and affects X2, the post-treatment variable, with effect α; X2 affects Y with effect γ2; U, the unobserved confounder variable, affects X2 with effect ρ1 and Y with effect ρ2.]
Here we'll examine how collider bias distorts our estimate of the direct effect
of X1 on Y. The true direct effect of X1 on Y is γ1 (see Figure 7.10); we'll consider
bias to be any deviation of the expected value of the estimated coefficient from
γ1. This bias factor is αρ2/ρ1, meaning that three conditions are necessary for a
post-treatment variable to create bias: α ≠ 0, ρ1 ≠ 0, and ρ2 ≠ 0. The condition that
α ≠ 0 is simply the condition that X2 is in fact a post-treatment variable affected
by X1. If α = 0, then X1 has no effect on X2. The conditions that ρ1 ≠ 0 and ρ2 ≠ 0
are the conditions that make the unobserved variable a confounder: it affects both
the post-treatment variable X2 and Y. If U does not affect both X2 and Y, then there
is no hidden relationship that is picked up by the estimation.
What should we do if we suspect collider bias? One option has a very
multivariate OLS feel: simply add the confounder. If we do this, the bias goes
away. But the thing about confounders is that the reason we’re thinking about them
as confounders in the first place is that they are something we probably haven’t
measured, so this approach is often infeasible.
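A simulation (ours, in the spirit of the book's post-treatment code) makes the danger concrete: accidents have no effect on flu, yet "controlling" for hospitalization manufactures a sizable negative estimate. The effect sizes below are illustrative choices, not the book's.

```python
import random

random.seed(1)
n = 100_000
accident = [random.gauss(0, 1) for _ in range(n)]                         # X1
fever = [random.gauss(0, 1) for _ in range(n)]                            # U, unobserved
hospital = [a + f + random.gauss(0, 1) for a, f in zip(accident, fever)]  # X2: alpha = rho1 = 1
flu = [f + random.gauss(0, 1) for f in fever]                             # Y: rho2 = 1, no X1 effect

def slope(x, y):
    """Bivariate OLS slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)

def residuals(y, x):
    """Residuals from regressing y on x (used to partial x out)."""
    b, mx, my = slope(x, y), sum(x) / len(x), sum(y) / len(y)
    return [yi - (my + b * (xi - mx)) for xi, yi in zip(x, y)]

# Simple model: the accident coefficient is (correctly) near zero.
print(round(slope(accident, flu), 2))
# Controlling for hospitalization (via the Frisch-Waugh two-step): the accident
# coefficient turns spuriously negative, about -0.5 in this setup.
print(round(slope(residuals(accident, hospital), residuals(flu, hospital)), 1))  # ~ -0.5
```

The Frisch-Waugh two-step used here reproduces the multivariate OLS coefficient on the accident variable while keeping the code to bivariate slopes.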
Discussion Questions
1. Suppose we are interested in assessing whether there is gender bias in wages. Our main variable
of interest is X1 , which is a dummy variable for women. Our dependent variable, Y, is wages.
We also know the occupation for each person in our sample. For simplicity, assume that our
occupation variable is simply a dummy variable X2 indicating whether someone is an engineer
or not. Do not introduce other variables into your discussion (at least until you are done with
the following questions!).
(a) Create a figure like Figure 7.8 that indicates potential causal relations.
(b) What is E[β̂1] for Yi = β0 + β1X1i + εi?
(c) What are the signs of E[β̂1] and E[β̂2] for Yi = β0 + β1X1i + β2X2i + εi?
(d) What model specification do you recommend?
2. Suppose we are interested in assessing whether having a parent who was in jail is more likely
to increase the probability that a person will be arrested as an adult. Our main variable of
interest is X1 , which is a dummy variable indicating a person’s parent served time in jail. Our
dependent variable, Y, is an indicator for whether that person was arrested as an adult. We also
have a variable X2 that indicates whether the person was suspended in high school. We do not
observe childhood lead exposure, which we label as U. Do not introduce other factors into your
discussion (at least until you are done with the following questions!).
(a) Create a figure that indicates potential causal relations.
(b) What is E[β̂1] for Yi = β0 + β1X1i + εi?
(c) What are the signs of E[β̂1] and E[β̂2] for Yi = β0 + β1X1i + β2X2i + εi?
(d) What model specification do you recommend?
7.4 Model Specification 243
REMEMBER THIS
1. Post-treatment variables are variables that are affected by the independent variable of interest.
2. Including post-treatment variables in a model can create two types of bias.
(a) Mediator bias: Including post-treatment variables in a model can cause the
post-treatment variable to soak up some of the causal effect of our variable of interest.
(b) Collider bias: Including post-treatment variables in a model can bias the coefficient on
our variable of interest if there is an unmeasured confounder variable that affects both
the post-treatment variable and the dependent variable.
3. It is best to avoid including post-treatment variables in models.
And sometimes the changes in results can be subtle. Sometimes we’re missing
observations for some variables. For example, in survey data it is quite common
for a pretty good chunk of people to decline to answer questions about their
annual income. If we include an income variable in a model, OLS will include
only observations for people who fessed up about how much money they make.
If only half of the survey respondents answered, including income as a control
variable will cut our sample size in half. This change in the sample can cause
coefficient estimates to jump around because, as we talked about with regard to
sampling distributions (on page 53), coefficients will differ for each sample. In
some instances, the effects on a coefficient estimate can be large.11
Two good practices mitigate the dangers inherent in model specification. The
first is to adhere to the replication standard. Some people see how coefficient
estimates can change dramatically depending on specification and become
statistical cynics. They believe that statistics can be manipulated to give any
answer. Such thinking lies behind the aphorism “There are three kinds of lies:
lies, damned lies, and statistics.” A better response is skepticism, a belief that
statistical analysis should be transparent to be believed. In this view, the saying
should be “There are three kinds of lies: lies, damned lies, and statistics that can’t
be replicated.”
A second good practice is to present results from multiple specifications in
a way that allows readers to understand which steps of the specification are the
crucial ones for the conclusion being offered. Begin by presenting a minimal
specification, which is a specification with only the variable of interest and perhaps
some small number of can’t-exclude variables as well (see Lenz and Sahn 2017).
Then explain the addition of additional variables (or other specification changes
such as including non-linearities or limiting the sample). Coefficients may change
when variables are added or excluded—that is, after all, the point of multivariate
analysis. When a specification choice makes a big difference, the researcher owes
the reader a big explanation for why this is a sensible modeling choice. And
because it often happens that two different specifications are reasonable, the reader
should see (or have access to in an appendix) both specifications. This will inform
readers that the results either are robust across reasonable specification choices
or depend narrowly on particular specification choices. The results on height and
wages reported in Table 5.2 offer one example, and we’ll see more throughout the
book.
11 And it is possible that the effects of a variable differ throughout the population. If we limit the
sample to only those who report income (people who tend to make less money, as it happens), we
may be estimating a different effect (the effect of X1 in a lower-income subset) than when we
estimate the model with all the data (the effect of X1 for the full population). Aronow and Samii
(2016) provide an excellent discussion of these and other nuances in OLS estimation.
REMEMBER THIS
1. An important part of model specification is choosing what variables to include in the model.
2. Researchers should provide convincing evidence that they are not model fishing by including
replication materials and by reporting results from multiple specifications, beginning with a
minimal specification.
Conclusion
This chapter has focused on the opportunities and challenges inherent in
model specification. First, the world is not necessarily linear, and the multivariate
model can accommodate a vast array of non-linear relationships. Polynomial mod-
els, of which quadratic models are the most common, can produce fitted lines with
increasing returns, diminishing returns, and U-shaped and upside-down U-shaped
relationships. Logged models allow effects to be interpreted in percentage terms.
Post-treatment variables provide an example in which we can have too many
variables in a model, as post-treatment variables can soak up causal effects or,
more subtly, create pathways for spurious causal effects to appear.
We have mastered the core points of this chapter when we can do the
following:
• Section 7.1: Explain polynomial models and quadratic models. Sketch the
various kinds of relationships that a quadratic model can estimate. Show
how to interpret coefficients from a quadratic model.
• Section 7.2: Explain three different kinds of logged models. Show how to
interpret coefficients in each.
Further Reading
Empirical papers using logged variables are very common; see, for example, Card
(1990). Zakir Hossain (2011) discusses the use of Box-Cox tests to help decide
which functional form (linear, log-linear, linear-log, or log-log) is best.
Key Terms
Collider bias (238) · Elasticity (234) · Linear-log model (232) · Log-linear model (233) · Log-log model (234) · Mediator bias (237) · Model fishing (243) · Model specification (220) · p-hacking (243) · Polynomial model (223) · Post-treatment variable (236) · Quadratic model (223)
Computing Corner
Stata
Exercises
1. The relationship between political instability and democracy is important
and likely to be quite complicated. Do democracies manage conflict in
a way that reduces instability, or do they stir up conflict? Use the data
set called Instability_PS data.dta from Zaryab Iqbal and Christopher Zorn
(2008) to answer the following questions. The data set covers 157 countries
between 1946 and 1997. The unit of observation is the country-year. The
variables are listed in Table 7.3.
Instab Index of instability (revolutions, crises, coups, etc.); ranges from −4.65 to +10.07
Coldwar Cold War year (1 = yes, 0 = no)
12 For the reasons discussed in the homework exercise in Chapter 3 on page 89, we limit the data set
to observations with height greater than 40 inches and self-reported income less than 400 British
pounds per hour. We also exclude observations of individuals who grew shorter from age 16 to age
33. Excluding these observations doesn’t really affect the results, but the observations themselves are
just odd enough to make us think that these cases may suffer from non-trivial measurement error.
(c) Now do the same test, but with log of wages at age 33 as the
dependent variable. Use female as the dummy variable. Interpret the
coefficient on the female dummy variable.
(d) How much does height explain salary differences across genders?
Estimate a difference of means test across genders, using logged
wages as the dependent variable and controlling for height at age
33 and at age 16. Explain the results.
(e) Does the effect of height vary across genders? Use logged wages at
age 33 as the dependent variable, and control for height at age 16
and the number of siblings. Explain the estimated effect of height at
age 16 for men and for women using an interaction with the female
variable. Use an F test to assess whether height affects wages for
women.
(b) Sketch the relationship between age and ticket amount from the
foregoing quadratic model: calculate the fitted value for a white
male with MPHover equals 0 (probably not many people going
zero miles over the speed limit got a ticket, but this simplifies
calculations a lot) for ages equal to 20, 25, 30, 35, 40, and 70.
(In Stata, the following displays the fitted value for a 20-year-old,
(c) Use Equation 7.4 to calculate the marginal effect of age at ages 20,
35, and 70. Describe how these marginal effects relate to your sketch.
(d) Calculate the age that is associated with the lowest predicted fine
based on the quadratic OLS model results given earlier.
(e) Do drivers from out of town and out of state get treated differ-
ently? Do state police and local police treat non-locals differently?
Estimate a model that allows us to assess whether out-of-towners
and out-of-staters are treated differently and whether state police
respond differently to out-of-towners and out-of-staters. Interpret the
coefficients on the relevant variables.
(f) Test whether the two state police interaction terms are jointly
significant. Briefly explain the results.
4. The book’s website provides code that will simulate a data set we can use to
explore the effects of including post-treatment variables. (Stata code is in
Ch7_PostTreatmentSimulation.do; R code is in Ch7_PostTreatmentSim-
ulation.R).
The first section of code simulates what happens when X1 (the
independent variable of interest) affects X2 , a post-treatment variable as
in Figure 7.8 on page 237. Initially, we set γ1 (the direct effect of X1 on Y),
α (the effect of X1 on X2 ), and γ2 (the effect of X2 on Y) all equal to 1.
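The logic of that first simulation can be sketched in a few lines. This is a minimal stand-in for the book's simulation files (not their actual code), assuming only the setup described above, with γ1 = α = γ2 = 1: regressing Y on X1 alone recovers the total effect γ1 + αγ2, while controlling for the post-treatment X2 leaves only the direct effect γ1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
gamma1, alpha, gamma2 = 1.0, 1.0, 1.0   # parameter values used in the text

x1 = rng.normal(size=n)                  # independent variable of interest
x2 = alpha * x1 + rng.normal(size=n)     # post-treatment variable affected by x1
y = gamma1 * x1 + gamma2 * x2 + rng.normal(size=n)

# OLS of y on x1 alone recovers the TOTAL effect: gamma1 + alpha * gamma2 = 2
b_total = np.linalg.lstsq(np.column_stack([np.ones(n), x1]), y, rcond=None)[0][1]

# Controlling for the post-treatment x2 yields only the direct effect: gamma1 = 1
b_direct = np.linalg.lstsq(np.column_stack([np.ones(n), x1, x2]), y, rcond=None)[0][1]

print(round(b_total, 2), round(b_direct, 2))  # approximately 2.0 and 1.0
```

The point of the exercise follows directly: neither regression is "wrong," but they answer different questions, and including a post-treatment variable changes what the coefficient on X1 means.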
256 CHAPTER 8 Using Fixed Effects Models to Fight Endogeneity in Panel Data and Difference-in-Difference Models
The logic behind the fixed effect approach also is important when we conduct
difference-in-difference analysis, which is particularly helpful in the evaluation
of policy changes. We use this model to compare changes in units affected by
some policy change to changes in units not affected by the policy. We show how
difference-in-difference methods rely on the logic of fixed effects models and, in some
cases, use the same tools as panel data analysis.
In this chapter, we show the power and ease of implementing fixed effects
models. Section 8.1 uses a panel data example to illustrate how basic OLS can fail
when the error term is correlated with the independent variable. Section 8.2 shows
how fixed effects can come to the rescue in this case (and others). It describes how
to estimate fixed effects models by using dummy variables or so-called de-meaned
data. Section 8.3 explains the mildly miraculous ability of fixed effects models
to control for variables even as the models are unable to estimate coefficients
associated with these variables. This ability is a blessing in that we control for these
variables; it is a curse in that we sometimes are curious about such coefficients.
Section 8.4 extends fixed effect logic to so-called two-way fixed effects models
that control for both unit- and time-related fixed effects. Section 8.5 discusses
difference-in-difference methods that rely on the fixed effect logic and are widely
used in policy analysis.
city and others from another city is ignored. For all the computer knew when
running that model, there were N separate cities producing the data.
Table 8.1 shows the results. The coefficient on the police variable is positive
and very statistically significant. Yikes. More cops, more crime. Weird. In fact,
for every additional police officer per capita, there were 2.37 more robberies per
capita. Were we to take these results at face value, we would believe that cities
could eliminate more than two robberies per capita for every police officer per
capita they fired.
Of course we don’t believe the pooled results. We worry that there are
unmeasured factors lurking in the error term that could be correlated with the
number of police, thereby causing bias. The error term in Equation 8.1 contains
gangs, drugs, economic hopelessness, broken families, and many more conditions.
If any of those factors is correlated with the number of police in a given city,
we have endogeneity. Given that police are more likely to be deployed when and
where there are gangs, drugs, and economic desolation, endogeneity in our model
seems inevitable.
In this chapter, we try to eliminate some of this endogeneity by focusing on
aspects of the error associated with each city. To keep our discussion relatively
simple, we’ll turn our attention to five California cities: Los Angeles, San
Francisco, Oakland, Fresno, and Sacramento. Figure 8.1 plots their per capita
robbery and police data from 1971 to 1992.
Consistent with the OLS results on all cities, the message seems clear that
robberies are more common when there are more police. However, we actually
have more information than Figure 8.1 displays. We know which city each
observation comes from. Figure 8.2 replots the data from Table 8.1, but in a
way that differentiates by city. The underlying data is exactly the same, but the
observations for each city have different shapes. The observations for Fresno are
the circles in the lower left, the observations for Oakland are the triangles in the
top middle, and so forth. What does the relationship between police and crime
look like now?
It’s still a bit hard to see, so Figure 8.3 adds a fitted line for each city. These
are OLS regression lines estimated on a city-by-city basis. All are negative, some
dramatically so (Los Angeles and San Francisco).

[Figure here: robberies per 1,000 people (vertical axis) plotted against police per
1,000 people (horizontal axis, roughly 2.0 to 3.5) for Oakland, San Francisco,
Sacramento, Los Angeles, and Fresno, with a fitted regression line for each city.]

FIGURE 8.3: Robberies and Police for Specified Cities in California with City-Specific Regression
Lines

The claim that police reduce crime is looking much better. Within each individual city,
robberies tend to decline as police increase.
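The pattern (a positive pooled slope but negative within-city slopes) can be reproduced with a few made-up numbers; the two cities and all values below are hypothetical, not the chapter's data. The high-crime city also has more police, so pooling flips the sign:

```python
import numpy as np

# Hypothetical panel: a low-crime city with few police, a high-crime city with many
police = np.array([1.0, 2.0, 3.0, 5.0, 6.0, 7.0])   # police per 1,000
robbery = np.array([4.0, 3.0, 2.0, 10.0, 9.0, 8.0])  # robberies per 1,000
city = np.array([0, 0, 0, 1, 1, 1])

def slope(x, y):
    """OLS slope from a regression of y on x with an intercept."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

pooled = slope(police, robbery)   # positive: pooling ignores which city is which
within = [slope(police[city == c], robbery[city == c]) for c in (0, 1)]

print(pooled, within)  # pooled slope > 0, each within-city slope = -1
```

The pooled regression sees only that high-police observations are high-robbery observations; the city-by-city regressions see that, within each city, more police go with fewer robberies.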
The difference between the pooled OLS results and these city-specific
regression lines presents a puzzle. How can the pooled OLS estimates suggest
a conclusion so radically different from Figure 8.3? The reason is the villain of
this book—endogeneity.
Here’s how it happens. Think about what’s in the error term εit in Equation 8.1:
gangs, drugs, and all that. These factors almost certainly affect the crime across
cities and are plausibly correlated with the number of police because cities with
bigger gang or drug problems hire more police officers. Many of these elements in
the error term are also stable within each city, at least in our 20-year time frame. A
city that has a culture or history of crime in year 1 probably has a culture or history
of crime in year 20 as well. This is the case in our selected cities: San Francisco
has lots of police and many robberies, while Fresno has not so many police and
not so many robberies.
And here’s what creates endogeneity: these city-specific baseline levels of
crime are correlated with the independent variable. The cities with the most
robberies (Oakland, Los Angeles, and San Francisco) have the most police. The
cities with fewest robberies (Fresno and Sacramento) have the fewest police. If
we are not able to find another variable to control for whatever is causing these
differential levels of baselines—and if it is something hard to measure like history
or culture or gangs or drugs, we may not be able to—then standard OLS will have
endogeneity-induced bias and lead us to the spurious inference we highlighted at
the start of the chapter.
      Test scoresit = β0 + β1 Private schoolit + εit

where Test scoresit is test scores of student i at time t and Private schoolit is a
dummy variable that is 1 if student i is in a private school at time t and 0 if not.
This model is for a (hypothetical) data set in which we observe test scores for
specific children over a number of years.
The following three simple questions help us identify possibly troublesome
endogeneity.
What is in the error term? Test performance potentially depends not only on
whether a child went to a private school (a variable in the model) but also on his or
her intelligence and diligence, the teacher’s ability, family support, and many other
factors in the error term. While we can hope to measure some of these factors, it
is a virtual certainty that we will not be able to measure them all.
Are there any stable unit-specific elements in the error term? Intelligence,
diligence, and family support are likely to be quite stable for individual students
across time.
Are the stable unit-specific elements in the error term likely to be correlated
with the independent variable? It is quite likely that family support, at least, is
correlated with attendance at private schools, since families with the wealth and/or
interest in private schools are likely to provide other kinds of educational support
to their children. This tendency is by no means set in stone, however: countless
kids with good family support go to public schools, and there are certainly kids
with no family support who end up in private schools. On average, though, it is
reasonable to suspect that kids in private schools have more family support. If
this is the case, then what may seem to be a causal effect of private schools on
test scores may be little more than an indirect effect of family support on test
scores.
8.2 Fixed Effects Models 261
REMEMBER THIS
1. A pooled model with panel data ignores the panel nature of the data. The equation is

      Yit = β0 + β1 X1it + εit

2. A common source of endogeneity in the use of a pooled model to analyze panel data is that
the specific units have different baseline levels of Y, and these levels are correlated with X. For
example, cities with higher crime (meaning high unit-specific error terms) also tend to have
more police, creating a correlation in a pooled model between the error term and the police
independent variable.

fixed effects model   A model that controls for unit-specific effects. These fixed effects capture
differences in the dependent variable associated with each unit.

For our city crime data, the fixed effects model is

      Robberiesit = β0 + β1 Policei,t−1 + αi + νit

More generally, fixed effects models look like

      Yit = β0 + β1 X1it + αi + νit        (8.3)

A fixed effects model is simply a model that contains a parameter like αi that captures
differences in the dependent variable associated with each unit and/or period.
The fixed effect αi is the part of the unobserved error that has the same
value for every observation for unit i. It basically reflects the average value of
the dependent variable for unit i, after we have controlled for the independent
variables. The unit is the unit of observation. In our city crime example, the unit
of observation is the city.
Even though we write down only a single parameter (αi ), we’re actually
representing a different value for each unit. That is, this parameter takes on a
potentially different value for each unit. In the city crime model, therefore, the
value of αi will be different for each city. If Pittsburgh has a higher average number
of robberies than Portland, the αi for Pittsburgh will be higher than the αi for
Portland.
The amazing thing about the fixed effects parameter is that it allows us to
control for a vast array of unmeasured attributes of units in the data set. These
could correspond to historical, geographical, or institutional factors. Or these
attributes could relate to things we haven’t even thought of. The key is that the fixed
effect term allows different units to have different baseline levels of the dependent
variable.
Why is it useful to model fixed effects in this way? When fixed effects are in
the error term, as in the pooled OLS model, they can cause endogeneity and bias.
But if we can pull them out of the error term, we will have overcome this source of
endogeneity. We do this by controlling for the fixed effects, which will take them
out of the error term so that they no longer can be a source for the correlation of
the error term and an independent variable. This strategy is similar to the one we
pursued with multivariate OLS: we identified a factor in the error term that could
cause endogeneity and pulled it out of the error term by controlling for the variable
in the regression.
How do we pull the fixed effects out of the error term? Easy! We simply
estimate a different intercept for each unit. This will work as long as we have
multiple observations for each unit. In other words, we can pull fixed effects out
of the error term when we have panel data.
2. It doesn’t really matter which unit we exclude. We exclude the Pth unit for convenience; plus, it is
fun to try to pronounce (P − 1)th.
TABLE 8.2 Example of Robbery and Police Data for Cities in California

City | Year | Robberies per 1,000 | Police per 1,000 (lagged) | D1 (Fresno dummy) | D2 (Oakland dummy) | D3 (San Francisco dummy)
[data rows not preserved in this excerpt]
We are really just running OLS with loads of dummy variables. In other
words, we’ve seen this before. Specifically, on page 193, we showed how to
use multiple dummy variables to account for categorical variables. Here the
categorical variable is whatever the unit of observation denotes (in our city crime
data, it’s city).
De-meaned approach
de-meaned approach   An approach to estimating fixed effects models for panel data involving
subtracting average values within units from all variables.

We shouldn’t let the old-news feel of the LSDV approach lead us to underestimate
fixed effects models. They’re actually doing a lot of work, and work that we can
better appreciate when we consider a second way to estimate fixed effects models, the
de-meaned approach. It’s an odd term—it sounds like we’re trying to humiliate
data—but it describes well what we’re doing. (Data is pretty shameless anyway.)
When using the de-meaned approach, we subtract the unit-specific averages from
both independent and dependent variables. This approach allows us to control
for the fixed effects (the αi terms) without estimating coefficients associated with
dummy variables for each unit.

Why might we want to do this? Two reasons. First, it can be a bit of a
hassle creating dummy variables for every unit and then wading through results
with so many variables. For example, using the LSDV approach to estimate a
country-specific fixed effects model describing voting in the United Nations, we
might need roughly 200 dummy variables.
Second, the inner workings of the de-meaned estimator reveal the intuition
behind fixed effects models. This reason is more important. The de-meaned model
looks like

      Yit − Y i· = β1 (Xit − X i· ) + ν̃it        (8.5)

where Y i· is the average of Y for unit i over all time periods in the data set and
X i· is the average of X for unit i over all time periods in the data set. The dot
notation indicates when an average is calculated. So Y i· is the average for unit i
averaged over all time periods (values of t). In our crime data, Y Fresno· is the average
crime in Fresno over the time frame of our data, and X Fresno· is the average police
per capita in Fresno over the time frame of our data.3 Estimating a model using
this transformed data will produce exactly the same coefficient and standard error
estimates for β̂1 as produced by the LSDV approach.
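This equivalence is easy to check numerically. The sketch below, on simulated data rather than the chapter's, estimates β̂1 both ways: once by LSDV with dummy variables and once on de-meaned data.

```python
import numpy as np

rng = np.random.default_rng(1)
n_units, n_periods = 5, 20
unit = np.repeat(np.arange(n_units), n_periods)
alpha = rng.normal(scale=3, size=n_units)[unit]   # unit fixed effects
x = alpha + rng.normal(size=unit.size)            # x correlated with the fixed effect
y = 2 + (-1.5) * x + alpha + rng.normal(size=unit.size)

# LSDV: intercept, x, and dummies for units 1..4 (unit 0 is the excluded reference)
dummies = (unit[:, None] == np.arange(1, n_units)).astype(float)
X_lsdv = np.column_stack([np.ones(unit.size), x, dummies])
b_lsdv = np.linalg.lstsq(X_lsdv, y, rcond=None)[0][1]

# De-meaned: subtract each unit's average from x and y, then regress
x_dm = x - np.bincount(unit, x)[unit] / n_periods
y_dm = y - np.bincount(unit, y)[unit] / n_periods
b_dm = np.linalg.lstsq(x_dm[:, None], y_dm, rcond=None)[0][0]

print(b_lsdv, b_dm)  # identical beta_1 estimates, near the true value of -1.5
```

Note that the pooled OLS slope on this data would be badly biased (x is built to be correlated with the fixed effects); both fixed effects estimators remove that bias, and they agree to machine precision.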
The de-meaned approach allows us to see that fixed effects models convert
data to deviations from mean levels for each unit and variable. In other words, fixed
effects models are about differences within units, not differences across units. In
the pooled model for our city crime data, the variables reflect differences in police
and robberies in Los Angeles relative to police and robberies in Fresno. In the
fixed effects model, the variables are transformed to reflect how much robberies
in Los Angeles at a specific time differ from average levels in Los Angeles as a
function of how much police in Los Angeles at a specific time differ from average
levels of police in Los Angeles.
An example shows how this works. Recall the data on crime earlier, where we
saw that estimating the model with a pooled model led to very different coefficients
than with the fixed effects model. The reason for the difference was, of course, that
the pooled model was plagued by endogeneity and the fixed effects model was
not. How does the fixed effects model fix things? Figure 8.4 presents illustrative
data for two made-up cities, Fresnomento and Los Frangelese. In panel (a), the
pooled data is plotted as in Figure 8.1, with each observation number indicated.
The relationship between police and robberies looks positive, and indeed, the OLS
β̂1 is positive.
In panel (b) of Figure 8.4, we plot the same data after it has been de-meaned.
Table 8.3 shows how we generated the de-meaned data. Notice, for example, that
observation 1 is from Los Frangelese in 2010. The number of police (the value
of Xit ) was 4, which is one of the bigger numbers in the Xit column. When we
compare this number to the average number of police per thousand people in Los
Frangelese (which was 5.33), though, it is low. In fact, the de-meaned value of the
police variable for Los Frangelese in 2010 is −1.33, indicating that the police per
thousand people was actually 1.33 lower than the average for Los Frangelese in
the time period of the data.
Although the raw values of Y get bigger as the raw values of X get bigger,
the relationship between Yit − Y i· and Xit − X i· is quite different. Panel (b) of
Figure 8.4 shows a clear negative relationship between the de-meaned X and the
de-meaned Y.4
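Using just the Fresnomento rows shown in Table 8.3, the de-meaning arithmetic can be reproduced directly:

```python
import numpy as np

# Fresnomento observations from Table 8.3 (years 2010-2012)
x = np.array([1.0, 2.0, 3.0])   # police per 1,000 (X_it)
y = np.array([4.0, 3.0, 2.0])   # robberies per 1,000 (Y_it)

x_dm = x - x.mean()             # X_it - mean for the unit: [-1, 0, 1]
y_dm = y - y.mean()             # Y_it - mean for the unit: [1, 0, -1]

# Slope through the de-meaned points: one more officer, one fewer robbery
beta1 = np.linalg.lstsq(x_dm[:, None], y_dm, rcond=None)[0][0]
print(x_dm, y_dm, beta1)  # [-1. 0. 1.] [ 1.  0. -1.] -1.0
```

The chapter's full de-meaned regression pools both cities' de-meaned observations; this fragment shows only the within-Fresnomento piece of that calculation.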
3. The de-meaned equation is derived by subtracting the same thing from both sides of Equation 8.3.
Specifically, note that the average dependent variable for unit i over time is Y i· = β0 + β1 X i· + α i + ν i· .
If we subtract the left-hand side of this equation from the left-hand side of Equation 8.3 and the
right-hand side of this equation from the right-hand side of Equation 8.3, we get
Yit − Y i· = β0 + β1 Xit + αi + νit − β0 − β1 X i· − α i· − ν i· . The α terms cancel because α i equals αi (the
average of fixed effects for each unit are by definition the same for all observations of a given unit in
all time periods). Rearranging terms yields something that is almost Equation 8.5. For simplicity, we
let ν̃it = νit − ν i· ; this new error term will inherit the properties of νit (e.g., being uncorrelated with the
independent variable and having a mean of zero).
4. One issue that can seem confusing at first—but really isn’t—is how to interpret the coefficients.
Because the LSDV and de-meaned approaches produce identical estimates, we can stick with our
relatively straightforward way of explaining LSDV results even when we’re describing results from a
de-meaned model. Specifically, we can simply say that a one-unit change in X1 is associated with a
β̂1 increase in Y when we control for unit fixed effects. This interpretation is similar to how we
interpret multivariate OLS coefficients, which makes sense because the fixed effects model is really
just an OLS model with lots of dummy variables.

[Figure 8.4 here: panel (a) plots robberies per 1,000 people against police per 1,000
people for six observations from the hypothetical cities Los Frangelese and
Fresnomento; the pooled regression line slopes upward. Panel (b) plots the same
data de-meaned by city; the regression line for the de-meaned (fixed effects) model
slopes downward.]

TABLE 8.3 Robberies and Police Data for Hypothetical Cities in California

Observation number | City | Year | Xit | X i· | Xit − X i· | Yit | Y i· | Yit − Y i·
4 | Fresnomento | 2010 | 1 | 2 | −1 | 4 | 3 | 1
5 | Fresnomento | 2011 | 2 | 2 | 0 | 3 | 3 | 0
6 | Fresnomento | 2012 | 3 | 2 | 1 | 2 | 3 | −1
[Results table here; only the bottom rows survive in this excerpt: N = 1,232 and
number of cities = 59 in both columns.]
REMEMBER THIS
1. A fixed effects model includes an αi term for every unit:

      Yit = β0 + β1 X1it + αi + νit
2. The fixed effects approach allows us to control for any factor that is fixed within unit for the
entire panel, regardless of whether we observe this factor.
8.3 Working with Fixed Effects Models 267
3. There are two ways to produce identical fixed effects coefficient estimates for the model.
(a) In the LSDV approach, we simply include dummy variables for each unit except an
excluded reference category.
(b) In the de-meaned approach, we transform the data such that the dependent and
independent variables indicate deviations from the unit mean.
Discussion Question
What factors influence student evaluations of professors in college courses? Are instructors who teach
large classes evaluated less favorably? Consider using the following model to assess the question
based on a data set of evaluations of instructors across multiple classes and multiple years:
general matter, however, including extra variables does not cause errors to be
correlated with independent variables.5
If the fixed effects are non-zero, we want to control for them. We should
note, however, that just because some (or many!) αi are non-zero, our fixed
effects model and our pooled model will not necessarily produce different results.
Recall that bias occurs when errors are correlated with an independent variable.
The fixed effects could exist, but they are not necessarily correlated with the
independent variables. To cause bias, in other words, fixed effects must not only
exist, they must be correlated with the independent variables. It’s not unusual
to observe instances in real data where fixed effects exist but don’t cause bias.
In such cases, the coefficients from the pooled and fixed effects models are
similar.6
The prudent approach to analyzing panel data is therefore to control for
fixed effects. If the fixed effects are zero, we’ll get unbiased results even with
the controls for fixed effects. If the fixed effects are non-zero, we’ll get unbiased
results that will differ or not from pooled results depending on whether the fixed
effects are correlated with the independent variable.
A downside to fixed effects models is that they make it impossible to estimate effects
for certain variables that might be of interest. As is often the case, there is no free
lunch (although it’s a pretty cheap lunch).
Specifically, fixed effects models cannot estimate coefficients on any variables
that are fixed for all individuals over the entire time frame. Suppose, for example,
that in the process of analyzing our city crime data we wonder if northern cities are
more crime prone. We studiously create a dummy variable Northi that equals 1 if
a city is in a northern state and 0 otherwise and set about estimating the following
model:

      Robberiesit = β0 + β1 Policei,t−1 + β2 Northi + αi + νit
Sadly, this approach won’t work. The reason is easiest to see by considering
the fixed effects model in de-meaned terms. The North variable will be converted
to Northit − Northi· . What is the value of this de-meaned variable for a city in the
North? The Northit part will equal 1 for all time periods for such a city. But wait,
this means that Northi· will also be 1 because that is the average of this variable
for this northern city. And that means the value of the de-meaned North variable
will be 0 for any city in the North. What is the value for the de-meaned North
5. Controlling for fixed effects when all αi = 0 will lead to larger standard errors, though. So if we can
establish that there is no sign of a non-zero αi for any unit, we may wish to also estimate a model
without fixed effects. To test for unit-specific fixed effects, we can implement an F test following the
process discussed in Chapter 5 (page 158). The null hypothesis is H0 : α1 = α2 = α3 = · · · = 0. The
alternative hypothesis is that at least one of the fixed effects is non-zero. The unrestricted model is a
model with fixed effects (most easily thought of as the LSDV model that has dummy variables for
each specific unit). The restricted model is a model without any fixed effects, which is simply the
pooled OLS model. We provide computer code on pages 285 and 286.
6. A so-called Hausman test can be used to test whether fixed effects are causing bias. If the results
indicate no sign of bias when fixed effects are not controlled for, we can use a random effects model
as discussed in Chapter 15 on page 524.
variable for a non-northern city? Similar logic applies: the Northit part will equal
0 for all time periods, and so will Northi· for a non-northern city. The de-meaned
North variable will therefore also be 0 for non-northern cities. In other words, the
de-meaned North variable will be 0 for all cities in all years. The first job of a
variable is to vary. If it doesn’t, well, that ain’t no variable! Hence, it will not be
possible to estimate a coefficient on this variable.7
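The mechanics are easy to verify: de-meaning any variable that is constant within units produces a column of zeros. The north/south coding below is hypothetical.

```python
import numpy as np

n_periods = 3
unit = np.repeat(np.arange(4), n_periods)    # 4 cities, 3 years each
north = np.array([1, 0, 1, 0])[unit]         # time-invariant within each city

# Subtract each city's average: constant - its own mean = 0 everywhere
north_dm = north - np.bincount(unit, north)[unit] / n_periods
print(north_dm)  # all zeros: no within-city variation left to estimate
```

A regressor that is identically zero carries no information, which is exactly why no coefficient on North can be estimated.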
More generally, a fixed effects model (estimated with either LSDV or the
de-meaned approach) cannot estimate a coefficient on a variable if the variable
does not change within units for all units. So even though the variable varies across
cities (e.g., the Northi variable is 1 for some cities and 0 for other cities), we can’t
estimate a coefficient on it because it does not vary within cities. This issue arises in
many other contexts. In panel data where individuals are the unit of observation,
fixed effects models cannot estimate coefficients on variables such as gender or
race that do not vary within individuals. In panel data on countries, the effect of
variables such as area or being landlocked cannot be estimated when there is no
variation within country for any country in the data set.
Not being able to include such a variable does not mean fixed effects models
do not control for it. The unit-specific fixed effect is controlling for all factors that
are fixed within a unit for the span of the data set. The model cannot parse out
which of these unchanging factors have which effect, but it does control for them
via the fixed effects parameters.
Some variables might be fixed within some units but variable within other
units. Those we can estimate. For example, a dummy variable that indicates
whether a city has more than a million people will not vary for many cities that
have been above or below one million in population for the entire span of the
panel data. However, if at least some cities have risen above or declined below
one million during the period covered in the panel data, then the variable can be
used in a fixed effects model.
Panel data models need not be completely silent with regard to variables that
do not vary. We can investigate how unchanging variables interact with variables
that do change. For example, we can estimate β2 in the following model:

      Robberiesit = β0 + β1 Policei,t−1 + β2 Policei,t−1 × Northi + αi + νit
The β̂2 will tell us how different the coefficient on the police variable is for
northern cities.
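A sketch of such an interaction model on simulated data (all values below are made up): the North dummy itself is absorbed by the unit fixed effects, but Police × North varies within northern cities, so β2 is estimable.

```python
import numpy as np

rng = np.random.default_rng(3)
n_units, n_periods = 6, 10
unit = np.repeat(np.arange(n_units), n_periods)
north = (unit < 3).astype(float)                 # time-invariant city trait
police = rng.normal(size=unit.size) + 2
alpha = rng.normal(size=n_units)[unit]
# True model: the police slope is -1 in the south and -1 + 0.6 in the north
y = -1.0 * police + 0.6 * police * north + alpha + rng.normal(scale=0.5, size=unit.size)

# Full set of unit dummies (no separate intercept); North alone would be collinear
dummies = (unit[:, None] == np.arange(n_units)).astype(float)
X = np.column_stack([police, police * north, dummies])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta[0], beta[1])  # close to -1.0 (police slope) and 0.6 (northern shift)
```

The fixed effects soak up everything constant within cities, including North; the interaction survives because it moves whenever police levels move.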
Sometimes people are tempted to abandon fixed effects because they care
about variables that do not vary within unit. That’s cheating. The point of choosing
a fixed effects model is to avoid the risk of bias, which could creep in if something
fixed within individuals across the panel happened to be correlated with an
independent variable. Bias is bad, and we can’t just close our eyes to it to get
7. Because we know that LSDV and de-meaned approaches produce identical results, we know that we
will not be able to estimate a coefficient on the North variable in an LSDV model as well. This is the
result of perfect multicollinearity: the North variable is perfectly explained as the sum of the dummy
variables for the northern cities.
REMEMBER THIS
1. Fixed effects models do not cause bias when implemented in situations in which αi = 0 for all
units.
2. Pooled OLS models are biased only when fixed effects are correlated with the independent
variable.
3. Fixed effects models cannot estimate coefficients on variables that do not vary within at least
some units. Fixed effects models do control for these factors, though, as they are subsumed
within the unit-specific fixed effect.
Discussion Questions
1. Suppose we have panel data on voter opinions toward government spending in 2010, 2012,
and 2014. Explain why we can or cannot estimate the effect of each of the following in a fixed
effects model.
(a) Gender
(b) Income
(c) Race
(d) Party identification
2. Suppose we have panel data on the annual economic performance of 100 countries from 1960
to 2015. Explain why we can or cannot estimate the effect of each of the following in a fixed
effects model.
(a) Average years of education
(b) Democracy, which is coded 1 if political control is determined by competitive elections
and 0 otherwise
(c) Country size
(d) Proximity to the equator
8.4 Two-Way Fixed Effects Model 271
3. Suppose we have panel data on the annual economic performance of the 50 U.S. states from
1960 to 2015. Explain why we can or cannot estimate the effect of each of the following in a
fixed effects model.
(a) Average years of education
(b) Democracy, which is coded 1 if political control is determined by competitive elections
and 0 otherwise
(c) State size
(d) Proximity to Canada
      Yit = β0 + β1 X1it + αi + τt + νit

where we’ve taken Equation 8.3 from page 261 and added τt (the Greek letter
tau—rhymes with “wow”), which accounts for differences in crime for all units
in year t. This notation provides a shorthand way to indicate that each separate
time period gets its own τt effect on the dependent variable (in addition to the
αi effect on the dependent variable for each individual unit of observation in the
data set).
Similar to our one-way fixed effects model, the single parameter for a time
fixed effect indicates the average difference for all observations in a given year,
after we have controlled for the other variables in the model. A positive fixed
effect for the year 2008 (τ2008 ) would indicate that controlling for all other
factors, the dependent variable was higher for all units in the data set in 2008.
A negative fixed effect for the year 2014 (τ2014 ) would indicate that controlling
for all other factors, the dependent variable was lower for all units in the data set
in 2014.
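As with the one-way model, we can verify on simulated data that LSDV with both unit and year dummies matches the two-way de-meaned transformation (subtracting unit means and time means and adding back the grand mean):

```python
import numpy as np

rng = np.random.default_rng(4)
n_units, n_periods = 6, 12
unit = np.repeat(np.arange(n_units), n_periods)
year = np.tile(np.arange(n_periods), n_units)
alpha = rng.normal(size=n_units)[unit]          # unit effects
tau = rng.normal(size=n_periods)[year]          # time effects
x = alpha + tau + rng.normal(size=unit.size)
y = 0.8 * x + alpha + tau + rng.normal(size=unit.size)

def demean2(v):
    """Two-way de-meaning for a balanced panel: v_it - vbar_i. - vbar_.t + vbar_.."""
    vi = np.bincount(unit, v)[unit] / n_periods
    vt = np.bincount(year, v)[year] / n_units
    return v - vi - vt + v.mean()

b_dm = np.linalg.lstsq(demean2(x)[:, None], demean2(y), rcond=None)[0][0]

# LSDV: intercept, x, unit dummies, and year dummies (one category excluded each)
u_d = (unit[:, None] == np.arange(1, n_units)).astype(float)
t_d = (year[:, None] == np.arange(1, n_periods)).astype(float)
X = np.column_stack([np.ones(unit.size), x, u_d, t_d])
b_lsdv = np.linalg.lstsq(X, y, rcond=None)[0][1]

print(b_dm, b_lsdv)  # identical estimates of beta_1
```

The double de-meaning formula here is the balanced-panel case; with unbalanced panels the dummy-variable approach is the safer default.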
There are lots of situations in which we suspect that a time fixed effect might
be appropriate:
8. The algebra is a bit more involved than for a one-way model, but the result has a similar feel:

      Yit − Y i· − Y ·t + Y ·· = β1 (Xit − X i· − X ·t + X ·· ) + ν̃it

where the dot notation indicates what is averaged over. Thus, Y i· is the average value of Y for unit i
over time, Y ·t is the average value of Y for all units at time t, and Y ·· is the average over all units and
all time periods. Don’t worry, we almost certainly won’t have to create these variables ourselves;
we’re including the dot convention just to provide a sense of how a one-way fixed effects model
extends to a two-way fixed effects model.
9. The additional control variable is called a lagged dependent variable. Inclusion of such a variable is
common in analysis of panel data. These variables often are highly statistically significant, as is the
case here. Control variables of these types raise some complications, which we address in Chapter 15
on advanced panel data models.
It is useful to take a moment to appreciate that not all models are created
equal. A cynic might look at the results in Table 8.5 and conclude that statistics
can be made to say anything. But this is not the right way to think about the
results. The models do indeed produce different results, but there are reasons for
the differences. One of the models is better. A good statistical analyst will know
this. We can use statistical logic to explain why the pooled results are suspect. We
know pretty much what is going on: certain fixed effects in the error term of the
pooled model are correlated with the police variable, thereby biasing the pooled
OLS coefficients. So although there is indeed output from statistical software that
could be taken to imply that police cause crime, we know better. Treating all results
as equivalent is not serious statistics; it’s just pressing buttons on a computer.
Instead of supporting statistical cynicism, this example testifies to the benefits of
appropriate analysis.
REMEMBER THIS
1. A two-way fixed effects model accounts for both unit- and time-specific errors.
2. A two-way fixed effects model is written as

      Yit = β0 + β1 X1it + αi + τt + νit
3. A two-way fixed effects model can be estimated with an LSDV approach (which has dummy
variables for each unit and each period in the data set), with a de-meaned approach, or with a
combination of the two.
dyad   An entity that consists of two elements.

where Bilateral tradeit is total trade volume between countries in dyad i at time t. A
dyad is a unit that consists of two elements. Here, a dyad indicates a pair of countries,
and the data indicates how much trade flows between them. For example, the
United States and Canada form one dyad, the United States and Japan form another
dyad, and so on. Allianceit is a dummy variable that is 1 if countries in the dyad are
entered into a security alliance at time t and 0 otherwise. The αi term captures the
amount by which trade in dyad i is higher or lower over the entire course of the
panel.
Because the unit of observation is a country-pair dyad, fixed effects here entail
factors related to a pair of countries. For example, the fixed effect for the United
States–New Zealand dyad in the trade model may be higher because of the shared
language. The fixed effect for the China-India dyad might be negative because the
countries are separated by mountains (which they happen to fight over, too).
As we consider whether a fixed effects model is necessary, we need to
think about whether the dyad-specific fixed effects could be correlated with the
independent variables. Dyad-specific fixed effects could exist because of a history
of commerce between two countries, a favorable trading geography (not divided
by mountains, for example), economic complementarities of some sort, and so on.
These factors could also make it easier or harder to form alliances.
Table 8.6 reports results from Green, Kim, and Yoon (2001) based on data
covering trade and alliances from 1951 to 1992. The dependent variable is the
amount of trade between the two countries in a given dyad in a given year. In
addition to the alliance measure, the independent variables are GDP (total gross
domestic product of the two countries in the dyad), Population (total population of
the two countries in the dyad), Distance (distance between the capitals of the two
countries), and Democracy (the minimum value of a democracy ranking for the two
countries in the dyad: the higher the value, the more democracy).
The dependent and continuous independent variables are logged. Logging
variables is a common practice in this literature; the interpretation is that a one
percent change in an independent variable is associated with a β percent change in
the dependent variable.
8.4 Two-Way Fixed Effects Model 275
Distance does not vary within a dyad, so its effect cannot be estimated in the fixed effects model; instead, it is controlled for via the fixed effect. And even better, not only is the effect of distance controlled
for, so are hard-to-measure factors such as being on a trade route or having cultural
affinities. That’s what the fixed effect is—a big ball of all the effects that are the same
within units for the period of the panel.
Not all coefficients flip. The coefficient on GDP is relatively stable, indicating
that unlike the variables that do flip signs from the pooled to fixed effects specifica-
tions, GDP does not seem to be correlated with the unmeasured fixed effects that
influence trade between countries.
8.5 Difference-in-Difference
difference-in-difference model A model that looks at differences in changes in treated units compared to untreated units.

The logic of fixed effects plays a major role in difference-in-difference models,
which look at differences in changes in treated units compared to untreated units
and are particularly useful in policy evaluation. In this section, we explain the
logic of this approach, show how to use OLS to estimate these models, and then
link the approach to the two-way fixed effects models we developed for panel
data.
Difference-in-difference logic
To understand difference-in-difference logic, let’s consider a policy evaluation
of “stand your ground” laws, which have the effect of allowing individuals to
use lethal force when they reasonably believe they are threatened.10 Does a law
that removes the duty to retreat when life or property is being threatened prevent
homicides by making would-be aggressors reconsider? Or do such laws increase
homicides by escalating violence?
Naturally, we would start by looking at the change in homicides in a state
that passed a stand your ground law. This approach is what every policy maker in
the history of time uses to assess the impact of a policy change. Suppose we find
homicides rising in the states that passed the law. Is that fact enough to lead us to
conclude that the law increases crime?
It doesn’t take a ton of thinking to realize that such evidence is pretty weak.
Homicides could rise or fall for a lot of reasons, many of them completely
unrelated to stand your ground laws. If homicides went up not only in the state
that passed the law but in all states—even states that made no policy change—we
can’t seriously blame the law for the rise in homicides. Or, if homicides
declined everywhere, we shouldn’t attribute the decline in a particular state to
the law.
What we really want to do is to look at differences in the state that passed
the policy in comparison to differences in similar states that did not pass such a law.
10 See McClellan and Tekin (2012) as well as Cheng and Hoekstra (2013).
A basic difference-in-difference estimator is

ΔYT − ΔYC
where ΔYT is the change in the dependent variable in treated states (those that
passed a stand your ground law) and ΔYC is the change in the dependent variable
in the untreated states (those that did not pass such a law). We call this approach
the difference-in-difference approach because we look at the difference between
differences in treated and control states.
We can generate difference-in-difference estimates from the following OLS model:

Yit = β0 + β1 Treatedi + β2 Aftert + β3 Treatedi × Aftert + εit

where Treatedi equals 1 for a treated state and 0 for a control state, Aftert equals 1
for all after observations (from both control and treated units) and 0 otherwise, and
Treatedi × Aftert is an interaction of Treatedi and Aftert . This interaction variable
will equal 1 for treated states in the post-treatment period and 0 for all other
observations.
The control states have some mean level of homicides, which we denote with
β0; the treated states also have some mean level of homicides, which we denote with
β0 + β1 Treatedi . If β1 is positive, the mean level for the treated states is higher
than in control states. If β1 is negative, the mean level for the treated states is
lower. If β1 is zero, the mean level for the treated states is the same as in control
states. Since this preexisting difference of mean levels was by definition there
before the law was passed, the law can’t be the cause of differences. Instead, these
differences represented by β1 are simply the preexisting differences in the treated
and untreated states. This parameter is analogous to a unit fixed effect, although
here it is for the entire group of treated states rather than individual units.
The model captures national trends with the β2 Aftert term. The dependent
variable for all states, treated and not, changes by β2 in the after period. This
parameter is analogous to a time fixed effect, but it’s for the entire post-treatment
period rather than individual time periods.
The key coefficient is β3 . This is the coefficient on the interaction between
Treatedi and Aftert . This variable equals 1 only for treated units in the after period
and 0 otherwise. The coefficient tells us whether there is an additional change in the treated
states after the policy went into effect, once we have controlled for preexisting
differences between the treated and control states (β1) and differences in the before
and after periods for all states (β2).
If we work out the fitted values for changes in treated and control states, we
can see how this regression model produces a difference-in-difference estimate.
First, note that the fitted value for treated states in the after period is β0 + β1 + β2 +
β3 (because Treatedi , Aftert , and Treatedi × Aftert all equal 1 for treated states in
the after period). Second, note that the fitted value for treated states in the before
period is β0 + β1 , so the change for treated states is β2 + β3 . The fitted value for
control states in the after period is β0 + β2 (because Treatedi and Treatedi × Aftert
equal 0 for control states). The fitted value for control states in the before period is
β0 , so the change for control states is β2 . The difference in differences of treated
and control states will therefore be β3 . Presto!
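The algebra above can be verified numerically. This sketch (hypothetical data and parameter values, not from the book) fits the interaction model by OLS and confirms that β̂3 exactly equals the difference in differences of the four cell means.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 250  # observations per treatment-period cell (invented)
treated = np.repeat([0.0, 0.0, 1.0, 1.0], n)
after = np.tile(np.repeat([0.0, 1.0], n), 2)
# hypothetical truth: beta0=5, beta1=2 (preexisting gap),
# beta2=-1 (common trend), beta3=-3 (treatment effect)
y = 5 + 2*treated - 1*after - 3*treated*after + rng.normal(0, 1, 4*n)

X = np.column_stack([np.ones(4*n), treated, after, treated*after])
b0, b1, b2, b3 = np.linalg.lstsq(X, y, rcond=None)[0]

# difference in differences of cell means
dYT = y[(treated == 1) & (after == 1)].mean() - y[(treated == 1) & (after == 0)].mean()
dYC = y[(treated == 0) & (after == 1)].mean() - y[(treated == 0) & (after == 0)].mean()
```

Because the four dummies saturate the design, OLS fits the four cell means exactly, so b3 and dYT − dYC agree to machine precision.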
Figure 8.5 displays two examples that illustrate the logic of difference-in-
difference models. In panel (a), there is no treatment effect. The dependent
[Figure 8.5: Difference-in-difference examples. Both panels plot Y (0 to 4) against time (before and after) for treated and control groups, with β0, β1, β2, and β3 labeled; panel (a) shows no treatment effect, and panel (b) shows a treatment effect of β3.]
variables for the treated and control states differ in the before period by β1 . Then
the dependent variable for both the treated and control units rose by β2 in the
after period. In other words, Y was bigger for the treated unit than for the control
by the same amount before and after the treatment. The implication is that the
treatment had no effect, even though Y went up in treatment states after they passed
the law.
Panel (b) in Figure 8.5 shows an example with a treatment effect. The
dependent variables for the treated and control states differ in the before period
by β1 . The dependent variable for both the treated and control units rose by β2
in the after period, but the value of Y for the treated unit rose yet another β3 . In
other words, the treated group was β1 bigger than the control before the treatment
and β1 + β3 bigger than the control after the treatment. The implication is that the
treatment caused a β3 bump over and above the differences across unit and time
that are accounted for in the model.
Consider how the difference-in-difference approach would assess outcomes
in our stand your ground law example. If homicides declined in states with such
laws more than in states without them, the evidence supports the claim that the
law prevented homicides. Such an outcome could happen if homicides went down
by 10 in states with the law but decreased by only 2 in other states. Such an
outcome could also happen if homicides actually went up by 2 in states with
stand your ground laws but went up by 10 in other states. In both instances, the
difference-in-difference estimate is −8.
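As arithmetic, the two scenarios look like this (numbers from the text):

```python
# difference-in-difference estimate: change in treated minus change in control
def diff_in_diff(change_treated, change_control):
    return change_treated - change_control

print(diff_in_diff(-10, -2))  # homicides fell by 10 with the law, by 2 without
print(diff_in_diff(2, 10))    # homicides rose by 2 with the law, by 10 without
```

Both calls return −8: relative to the states without the law, homicides in the treated states fell by 8.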
One great thing about using OLS to estimate difference-in-difference models
is that it is easy to control for other variables with this method. Simply include
them as covariates, and do what we’ve been doing. In other words, simply add
a β4 Xit term (and additional variables, if appropriate), yielding the following
difference-in-difference model:

Yit = β3 Treatedi × Aftert + β4 Xit + αi + τt + εit

where
• The αi terms (the unit-specific fixed effects) capture differences that exist
across units both before and after the treatment.
• The τt terms (the time-specific fixed effects) capture differences that exist
across all units in every period. If homicide rates are higher in 2007 than in
2003, then the τt for 2007 will be higher than the τt for 2003.
Controlling for national trends in homicide rates (via time fixed effects) and additional variables related to race, age, and percent of residents living in urban areas, McClellan and Tekin (2012) found that homicide rates went up by 0.033 after states implemented these laws.11
REMEMBER THIS
A difference-in-difference model estimates the effect of a change in policy by comparing changes in
treated units to changes in control units.
1. A basic difference-in-difference estimator is ΔYT − ΔYC , where ΔYT is the change in the
dependent variable for the treated unit and ΔYC is the change in the dependent variable for
a control unit.
2. Difference-in-difference estimates can be generated from the following OLS model:

Yit = β0 + β1 Treatedi + β2 Aftert + β3 Treatedi × Aftert + εit
3. For panel data, we can use a two-way fixed effects model to estimate difference-in-difference
effects:

Yit = β3 Treatedi × Aftert + β4 Xit + αi + τt + εit
where the αi fixed effects capture differences in units that existed both before and after treatment
and τt captures differences common to all units in each time period.
Discussion Question
For each of the following examples, explain how to create (i) a simple difference-in-difference
estimate of policy effects and (ii) a fixed effects difference-in-difference model.
(a) California implemented a first-in-the-nation program of paid family leave in 2004. Did this
policy increase use of maternity leave?a
(b) Fourteen countries engaged in “expansionary austerity” policies in response to the 2008
financial crisis. Did these austerity policies work? (For simplicity, treat austerity as a dummy
variable equal to 1 for countries that engaged in it and 0 for others.)
(c) Some neighborhoods in Los Angeles changed zoning laws to make it easier to mix commercial
and residential buildings. Did these changes reduce crime?b
a See Rossin-Slater, Ruhm, and Waldfogel (2013).
b See Anderson, Macdonald, Bluthenthal, and Ashwood (2013).
11 Cheng and Hoekstra (2013) found similar results.
[Figure 8.6: Four panels (for the Review Question), each plotting Y (0 to 4) against time (before and after) for treated and control groups.]
Review Question
For each of the four panels in Figure 8.6, indicate the values of β0 , β1 , β2 , and β3 for the basic
difference-in-difference OLS model:

Yit = β0 + β1 Treatedi + β2 Aftert + β3 Treatedi × Aftert + εit
Conclusion
Again and again, we’ve emphasized the importance of exogeneity. If X is uncorrelated
with ε, we get unbiased estimates and are happy. Experiments are sought after
because the randomization in them ensures—or at least aids—exogeneity. With
OLS we can sometimes, maybe, almost, sort of, kind of approximate exogeneity
by soaking up so much of the error term with measured variables that what remains
correlates little or not at all with X.
Realistically, though, we know that we will not be able to measure everything.
Real variables with real causal force will almost certainly lurk in the error term.
Are we stuck? Turns out, no (or at least not yet). We’ve got a few more tricks up our
sleeve. One of the best tricks is to use fixed effects tools. Although uncomplicated,
the fixed effects approach can knock out a whole class of unmeasured (and even
unknown) variables that lurk in the error term. Simply put, any factor that is fixed
across time periods for each unit or fixed across units for each time period can be
knocked out of the error term. Fixed effects tools are powerful, and as we have
seen in real examples, they can produce results that differ dramatically from those
produced by basic OLS models.
We will have mastered the material in this chapter when we can do the
following:
• Section 8.1: Explain how a pooled model can be problematic in the analysis
of panel data.
• Section 8.2: Write down a fixed effects model, and explain the fixed
effect. Give examples of the kinds of factors subsumed in a fixed effect.
Explain how to estimate a fixed effects model with LSDV and de-meaned
approaches.
• Section 8.3: Explain why coefficients on variables that do not vary within
a unit cannot be estimated in fixed effects models. Explain how these
variables are nonetheless controlled for in fixed effects models.
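The first point can be seen in a two-line sketch (hypothetical data, not from the book): de-meaning turns any variable that is constant within each unit into all zeros, so no coefficient on it can be estimated; the variable is absorbed into the fixed effect.

```python
import numpy as np

# an invented "South" dummy for a three-unit, four-period panel:
# constant within each unit (a state is either in the South or not)
south = np.array([[1.0] * 4, [0.0] * 4, [1.0] * 4])
south_demeaned = south - south.mean(axis=1, keepdims=True)
print(south_demeaned)  # every entry is zero after de-meaning
```

A regressor that is identically zero carries no variation, which is why fixed effects software drops time-invariant variables.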
Further Reading
Chapter 15 discusses advanced panel data models. Baltagi (2005) is a more
technical survey of panel data methods.
Green, Kim, and Yoon (2001) provide a nice discussion of panel data methods
in international relations. Wilson and Butler (2007) reanalyze articles that did not
use fixed effects and find results changed, sometimes dramatically.
If we use pooled OLS to analyze panel data sets, we are quite likely to
have errors that are correlated within unit in the manner discussed on page 69.
This correlation of errors will not cause OLS β̂1 estimates to be biased, but
it will make the standard OLS equation for the variance of β̂1 inappropriate.
While fixed effects models typically account for a substantial portion of the
Key Terms
De-meaned approach (263)
Difference-in-difference model (276)
Dyad (274)
Fixed effect (261)
Fixed effects model (261)
Least squares dummy variable (LSDV) approach (262)
One-way fixed effects model (271)
Panel data (255)
Pooled model (256)
Rolling cross-sectional data (279)
Two-way fixed effects model (271)
Computing Corner
Stata
1. To use the LSDV approach to estimate a panel data model, we run an OLS
model with dummy variables for each unit.
(c) To use an F test to examine whether fixed effects are all zero, the
unrestricted model is the model with the dummy variables we just
estimated. The restricted model is a regression model without the
dummy variables (also known as the pooled model):
regress Y X1 X2 X3.
(b) Run Stata’s built-in, one-way fixed effects model and include the
dummies for the years:
xtreg Y X1 X2 X3 Yr2-Yr10, fe i(City)
where Yr2-Yr10 is a shortcut way of including every Yr variable
from Yr2 to Yr10.
R
1. To use the LSDV approach to estimate a panel data model, we run an OLS
model with dummy variables for each unit.
(a) It’s possible to name and include dummy variables for every unit,
but doing this can be a colossal pain when we have lots of units.
It is usually easiest to use the factor command, which will
automatically include dummy variables for each unit. The code
is lm(Y ~ X1 + factor(unit)). This command will estimate a
model in which there is a dummy variable for every unique value
indicated in the unit variable. For example, if our data looked
like Table 8.2, including a factor(city) term in the regression
code would lead to the inclusion of dummy variables for each
city.
(b) To implement an F test on the hypothesis that all fixed effects (both
unit and time) are zero, the unrestricted equation is the full model
and the restricted equation is the model with no fixed effects.
Unrestricted = lm(Y ~ X1 + factor(unit)+ factor
(time))
Restricted = lm(Y ~ X1)
Refer to page 171 for more details on how to implement an F test
in R.
and then refer to that data frame in the plm command.12 For a
one-way fixed effects model, include model="within".
library(plm)
All.data = data.frame(Y, X1, X2, city, time)
plm(Y ~ X1 + X2, data=All.data, index=c("city"),
model="within")
(b) We can use the plm command and indicate the unit and time variables
with the index=c("city", "year") command. These are the
variable names that indicate your units and time variables, which
will vary depending on your data set. We also need to include the
subcommand effect="twoways".
plm(Y ~ X1 + X2, data=All.data, index=c("city",
"year"), model="within", effect="twoways")
12 A data frame is a convenient way to package data in R. Not only can you put variables together in one named object, but you can also include text variables like names of countries.
Exercises
1. Researchers have long been interested in the relationship between eco-
nomic factors and presidential elections. The PresApproval.dta data set
includes data on presidential approval polls and unemployment rates by
state over a number of years. Table 8.8 lists the variables.
(a) Use pooled data for all years to estimate a pooled OLS regression
explaining presidential approval as a function of state unemployment
rate. Report the estimated regression equation, and interpret the
results.
(b) Many political observers believe politics in the South are different.
Add South as an additional independent variable, and reestimate the
model from part (a). Report the estimated regression equation. Do
the results change?
(c) Reestimate the model from part (b), controlling for state fixed effects
by using the de-meaned approach. How does this approach affect the
results? What happens to the South variable in this model? Why?
Does this model control for differences between southern and other
states?
(d) Reestimate the model from part (c) controlling for state fixed effects
using the LSDV approach. (Do not include a South dummy variable).
Compare the coefficients and standard errors for the unemployment
variable.
(e) Estimate a two-way fixed effects model. How does this model affect
the results?
Year Year
PresApprov Percent positive presidential approval
UnemPct State unemployment rate

year Year
stateshort First three letters of state name (for labeling scatterplot)
appspc Applications to the Peace Corps from each state per capita
unemployrate State unemployment rate
(b) Run a pooled regression of Peace Corps applicants per capita on the
state unemployment rate and year dummies. Describe and critique
the results.
(c) Plot the relationship between the state economy and Peace Corps
applications. Does any single state stick out? How may this outlier
affect the estimate on unemployment rate in the pooled regression in
part (b)? Create a scatterplot without the unusual state, and comment
briefly on the difference from the scatterplot with all observations.
(d) Run the pooled model from part (b) without the outlier. Comment
briefly on the results.
(e) Use the LSDV approach to run a two-way fixed effects model
without the outlier. Do your results change from the pooled analysis?
Which results are preferable?
(f) Run a two-way fixed effects model without the outlier; use the fixed
effects command in Stata or R. Compare to the LSDV results.
(a) Estimate a model ignoring the panel structure of the data. Use overall
evaluation of the instructor as the dependent variable and the class
(b) Explain what a fixed effect for each of the following would control
for: instructor, course, and year.
(c) Use the equation from part (a) to estimate a model that includes
a fixed effect for instructor. Report your results, and explain any
differences from part (a).
(b) Calculate the percent of people in the sample in college from the fol-
lowing four groups: (i) Before 1993/non-Georgia, (ii) Before 1993/
Georgia, (iii) After 1992/non-Georgia, and (iv) After 1992/Georgia.
First, use the mean function (e.g., in Stata use mean Y if X1 == 0
& X2 == 0 and in R use mean(Y[X1 == 0 & X2 == 0])). Second,
use the coefficients from the OLS output in part (a).
13 For simplicity, we will not use the sample weights used by Dynarski. The results are stronger, however, when these sample weights are used.
(c) Graph the fitted lines for the Georgia group and non-Georgia
samples.
(f) The way the program was designed, Georgia high school graduates
with a B or higher average and annual family income over $50,000
could qualify for HOPE by filling out a simple one-page form. Those
with lower income were required to apply for federal aid with a
complex four-page form and had any federal aid deducted from
their HOPE scholarship. Run separate basic difference-in-difference
models for these two groups, and comment on the substantive
implication of the results.
LnAvgSalary The average salary of teachers in the district, adjusted for inflation and logged
OnCycle A dummy variable that equals 1 for districts where school boards were elected
“on-cycle” (i.e., they were elected at the same time people were voting on other offices)
and 0 if the school board was elected “off-cycle” (i.e., school board members were
elected in a separate election)
CycleSwitch A dummy variable indicating that the district switched from off-cycle to on-cycle
elections starting in 2007
on-cycle elections, and teachers and teachers unions will have relatively
less influence.
From 2003 to 2006, all districts in the sample elected their school board
members off-cycle. A change in state policies in 2006 led some, but not all,
districts to elect their school board members on-cycle from 2007 onward.
The districts that switched then stayed switched for the period 2007–2009,
and no other district switched.
(c) Run a one-way fixed effects model in which the fixed effect relates to
individual school districts. Interpret the results, and explain whether
this model accounts for time trends that could affect all districts.
(e) Suppose that we tried to estimate the two-way fixed effects model on
only the last three years of the data (2007, 2008, and 2009). Would
we be able to estimate the effect of OnCycle for this subset of the
data? Why or why not?
6. This problem uses a panel version of the data set described in Chapter 5
(page 174) to analyze the effect of cell phone and texting bans on traffic
fatalities. Use deaths per mile as the dependent variable because this
variable accounts for the pattern we saw earlier that miles driven is a strong
predictor of the number of fatalities. Table 8.13 describes the variables
in the data set Cellphone_panel_homework.dta; it covers all states plus
Washington, DC, from 2006 to 2012.
(a) Estimate a pooled OLS model with deaths per mile as the dependent
variable and cell phone ban and text ban as the two independent
variables. Briefly interpret the results.
(b) Describe a possible state-level fixed effect that could cause endo-
geneity and bias in the model from part (a).
(c) Estimate a one-way fixed effects model that controls for state-level
fixed effects. Include deaths per mile as the dependent variable and
cell phone ban and text ban as the two independent variables. Does
TABLE 8.13 Variables for the Cell Phones and Traffic Deaths Data
Variable name Description
year Year
cell_ban Coded 1 if a ban on handheld cell phone use while driving is in effect; 0 otherwise
text_ban Coded 1 if a ban on texting while driving is in effect; 0 otherwise
the coefficient on cell phone ban change in the manner you would
expect based on your answer from part (a)?
(d) Describe a possible year fixed effect that could cause endogeneity
and bias in the fixed effects model in part (c).
(f) The model in part (e) is somewhat sparse with regard to control
variables. Estimate a two-way fixed effects model that includes
control variables for cell phones per 10,000 people and percent
urban. Briefly describe changes in inference about the effect of cell
phone and text bans.
(g) Estimate the same two-way fixed effects model by using the
LSDV approach. Compare the coefficient and t statistic on the cell
phone variable to the results from part (f).
(h) Based on the LSDV results, identify states with large positive and
negative fixed effects. Explain what these mean (being sure to note
the reference category), and speculate about how the positive and
negative fixed effect states differ. (It is helpful to connect the state
number to state name; in Stata, do this with the command list
state state_numeric if year ==2012.)
9 Instrumental Variables: Using Exogenous Variation to Fight Endogeneity
Like many powerful tools, 2SLS can be a bit dangerous. We won’t cut off a
finger using it, but if we aren’t careful, we could end up with worse estimates than
we would have produced with OLS. And like many powerful tools, the approach is
not cheap. In this case, the cost is that the estimates produced by 2SLS are typically
quite a bit less precise than OLS estimates.
In this chapter, we provide the instruction manual for this tool. Section 9.1
presents an example in which an instrumental variables approach proves useful.
Section 9.2 gives the basics for the 2SLS model. Section 9.3 discusses what
to do when we have multiple instruments. Section 9.4 reveals what happens to
2SLS estimates when the instruments are flawed. Section 9.5 explains why 2SLS
estimates tend to be less precise than OLS estimates. And Section 9.6 applies 2SLS
tools to so-called simultaneous equation models in which X causes Y but Y also
causes X.
Levitt’s (2002) idea is that while some police are hired for endogenous reasons
(city leaders expect more crime and so hire more police), other police are hired
for exogenous reasons (the city simply has more money to spend). In particular,
Levitt argues that the number of firefighters in a city reflects voters’ tastes for
public services, union power, and perhaps political patronage. These factors also
partially predict the size of the police force and are not directly related to crime.
In other words, to the extent that changes in the number of firefighters predict
changes in police numbers, those changes in the numerical strength of a police
force are exogenous because they have nothing to do with crime. The idea, then,
is to isolate the portion of changes in the police force associated with changes in
the number of firefighters and see if crime went down (or up) in relation to those
changes.
We’ll work through the exact steps of the process soon. For now, we can get
a sense of how instrumental variables can matter by looking at Levitt’s results.
The left column of results in Table 9.1 shows the coefficient on police estimated
9.1 2SLS Example 297
TABLE 9.1 Levitt (2002) Results on Effect of Police Officers on Violent Crime
OLS with year dummies only    OLS with year and city dummies    2SLS
All models include controls for prison population, per capita income, abortion, city size, and racial demographics.
via a standard OLS estimation of Equation 9.1 based on an OLS analysis with
covariates and year dummy variables but no city fixed effects. The coefficient is
positive and significant, implying that police cause crime. Yikes!
We’re pretty sure, however, that endogeneity distorts simple OLS results
in this context. The second column in Table 9.1 shows that the results change
dramatically when city fixed effects are included. As discussed in Chapter 8,
fixed effects account for the tendency of cities with chronically high crime to also
have larger police forces. The estimated effect of police is negative, but small and
statistically insignificant at usual levels.
The third column in Table 9.1 shows the results obtained when the instru-
mental variables technique is used. The coefficient on police is negative and
almost statistically significant. This result differs dramatically from the OLS result
without city fixed effects and non-trivially from the fixed effects results.
Levitt’s analysis essentially treats changes in firefighters as a kind of experi-
ment. He estimates the number of police that cities add when they add firefighters
and assesses whether crime changed in conjunction with these particular changes
in police.

instrumental variable A variable that explains the endogenous independent variable of interest but does not directly explain the dependent variable.

Levitt is using the firefighter variable as an instrumental variable, a
variable that explains the endogenous independent variable of interest (which in
this case is the log of the number of police per capita) but does not directly explain
the dependent variable (which in this case is violent crimes per capita).
The example also highlights some limits to instrumental variables methods.
First, the increase in police associated with changes in firefighters may not
really be exogenous. That is, can we be sure that the firefighter variable is
truly independent of the error term in Equation 9.1? It is possible, for example,
that reelection-minded political leaders provide other public services when they
boost the number of firefighters—goodies such as tax cuts, roads, and new
stadiums—and that these policy choices may affect crime (perhaps by improving
economic growth). In that case, we worry that our exogenous bump in police
is actually associated with factors that also affect crime, and that those factors
may be in the error term. Therefore, as we develop the logic of instrumental
variables, we also spend a lot of time worrying about the exogeneity of our
instruments.
REMEMBER THIS
1. An instrumental variable is a variable that explains the endogenous independent variable of
interest but does not directly explain the dependent variable.
2. When we use the instrumental variables approach, we focus on changes in Y due to the changes
in X that are attributable to changes in the instrumental variable.
3. Major challenges associated with using instrumental variables include the following:
(a) It is often hard to find an appropriate instrumental variable that is exogenous.
(b) Estimates based on instrumental variables are often imprecise.
Our model is

Yi = β0 + β1 X1i + β2 X2i + εi    (Equation 9.2)

where Yi is our dependent variable, X1i is our main variable of interest, and X2i is
a control variable (and we could easily add additional control variables).
The difference is that X1i is an endogenous variable, which means that it is
correlated with the error term. Our goal with 2SLS is to replace the endogenous
9.2 Two-Stage Least Squares (2SLS) 299
X1i with a different variable that measures only the portion of X1i that is not related
to the error term in the main equation.
We model X1i as

X1i = γ0 + γ1 Zi + γ2 X2i + νi    (9.3)
where Zi is a new variable we are adding to the analysis, X2i is the control variable
in Equation 9.2, the γ’s are coefficients that determine how well Zi and X2i explain
X1i , and νi is an error term. (Recall that γ is the Greek letter gamma and ν is the
Greek letter nu.) We call Z our instrumental variable; this variable is the star of
this chapter, hands down. The variable Z is the source of our exogenous variation
in X1i .
In Levitt’s police and crime example, “police officers per capita” is the
endogenous variable (X1 in our notation) and “firefighters” is the instrumental
variable (Z in our notation). The instrumental variable is the variable that causes
the endogenous variable to change for reasons unrelated to the error term. In other
words, in Levitt’s model, Z (firefighters) explains X1i (police per capita) but is not
correlated with the error term in the equation explaining Y (crime).
The first stage produces fitted values of the endogenous variable:

X̂1i = γ̂0 + γ̂1 Zi + γ̂2 X2i

Notice that X̂1i is a function only of Z, X2 , and the γ’s. That fact has important
implications for what we are trying to do. The error term when X1i is the dependent variable is νi ; it is almost certainly correlated with εi , the error term in the Yi equation. That is, drug use and criminal history are likely to affect both the number of police (X1 ) and crime (Y). This means the actual value of X1 is correlated with ε; the fitted value X̂1i , on the other hand, is only a function of Z, X2 , and the γ’s. So even though police forces in reality may be ebbing and flowing as related to drug use and other factors in the error term of Equation 9.2, the fitted value X̂1i will not change. Our X̂1i will ebb and flow only with changes in Z and X2 , which means our fitted value of X has been purged of the association between X and ε.
All control variables from the second-stage model must be included in the
first stage. We want our instrument to explain variation in X1 over and above any
variation that can be explained by the other independent variables.
In the second stage, we estimate our outcome equation, but (key point here)
we use X̂1i —the fitted value of X1i —rather than the actual value of X1i . In other
words, instead of using X1i , which we suspect is endogenous (correlated with εi ), we use the measure of X̂1i , which has been purged of X1i ’s association with the error.
Specifically, the second stage of the 2SLS model is

Yi = β0 + β1 X̂1i + β2 X2i + εi    (9.4)
The little hat on X̂1i is a big deal. Once we appreciate why we’re using it and
how to generate it, 2SLS becomes easy. We are now estimating how much the
exogenous variation in X1i affects Y. Notice also that there is no Z in Equation 9.4.
By the logic of 2SLS, Z affects Y only indirectly, by affecting X.
Control variables play an important role, just as in OLS. If a factor that
affects Y is correlated with Z, we need to include it in the second-stage regression.
Otherwise, the instrument may soak up some of the effect of this omitted factor
rather than merely exogenous variation in X1 . For example, suppose that cities in
the South started facing more arson and hence hired more firefighters. In that case,
Levitt’s firefighter instrument for police officers will also contain variation due to
region. If we do not control for region in the second-stage regression, some of the
region effect may work its way through the instrument, potentially creating a bias.
Actual estimation via 2SLS is a bit more involved than simply running OLS
with X̂1 because X̂1i is itself an estimate, and the standard errors need to be adjusted
to account for this. In practice, though, statistical packages do this adjustment
automatically with their 2SLS commands.1
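The mechanics of the two stages can be sketched with a small simulation (shown here in Python with NumPy; every variable name and parameter value is invented for illustration). Note that the standard errors a naive second-stage regression would report are not corrected the way packaged 2SLS commands correct them; this sketch recovers only the coefficient:

```python
import numpy as np

# Hypothetical simulation of 2SLS; all names and numbers here are invented.
rng = np.random.default_rng(0)
n = 100_000

z = rng.normal(size=n)         # instrument: exogenous by construction
x2 = rng.normal(size=n)        # control variable
confound = rng.normal(size=n)  # unmeasured factor that ends up in the error term
nu = rng.normal(size=n)

# X1 is endogenous: the confound drives both X1 and Y.
x1 = 0.5 + 1.0 * z + 0.7 * x2 + 2.0 * confound + nu
eps = confound + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + eps   # true beta1 = 2.0

def ols(X, y):
    """OLS coefficients with an intercept prepended to the regressor list."""
    X = np.column_stack([np.ones(len(y)), *X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Naive OLS: biased upward because corr(x1, eps) > 0.
beta_ols = ols([x1, x2], y)[1]

# Stage 1: regress x1 on the instrument AND all second-stage controls.
g = ols([z, x2], x1)
x1_hat = g[0] + g[1] * z + g[2] * x2

# Stage 2: replace x1 with its fitted value.
beta_2sls = ols([x1_hat, x2], y)[1]

print(beta_ols, beta_2sls)  # OLS is biased upward; 2SLS lands near 2.0
```

Because the first stage includes the second-stage control x2, the fitted values capture only variation in x1 over and above what the control explains.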
1 When there is a single endogenous independent variable and a single instrument, the 2SLS estimator reduces to β̂1 = cov(Z, Y)/cov(Z, X) (Murnane and Willett 2011, 229). While it may be computationally
simpler to use this ratio of covariances to estimate β̂1 , it becomes harder to see the intuition about
exogenous variation if we do so. In addition, the 2SLS estimator is more general: it allows for
multiple independent variables and instruments.
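In the single-instrument case without controls, the equivalence in the footnote is easy to verify numerically; a hypothetical check on simulated data (all names invented):

```python
import numpy as np

# Hypothetical data: one instrument, one endogenous regressor, no controls.
rng = np.random.default_rng(1)
n = 50_000
z = rng.normal(size=n)
confound = rng.normal(size=n)
x = 1.5 * z + confound + rng.normal(size=n)
y = 2.0 * x + confound + rng.normal(size=n)   # true slope = 2.0

# Ratio-of-covariances form of the estimator
beta_ratio = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

# Two-stage form: slope of y on the fitted values from regressing x on z
g1 = np.cov(z, x)[0, 1] / np.var(z, ddof=1)
g0 = x.mean() - g1 * z.mean()
x_hat = g0 + g1 * z
beta_2sls = np.cov(x_hat, y)[0, 1] / np.var(x_hat, ddof=1)

print(beta_ratio, beta_2sls)  # numerically identical
```

The two forms agree algebraically because the fitted value is just a linear function of Z, so the slope on x_hat reduces to cov(Z, Y)/cov(Z, X).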
an argument that the number of firefighters in a city was uncorrelated with these
elements of the error term.
Unfortunately, there is no direct test of whether Z is uncorrelated with ε. The
whole point of the error term is that it covers unmeasured factors. We simply
cannot directly observe whether Z is correlated with these unmeasured factors.
A natural instinct is to try to test the exclusion condition by including Z
directly in the second stage, but this won’t work. If Z is a good instrument, it
will explain X1i , which in turn will affect Y. We will observe some effect of Z
on Y, which will be the effect of Z on X1i , which in turn can have an effect on
Y. Instead, the discussion of the exclusion condition will need to be primarily
conceptual rather than statistical. We will need to justify our assertion, without
statistical analysis, that Z does not affect Y directly. Yes, that’s a bummer and,
frankly, a pretty weird position to be in for a statistical analyst. Life is like that
sometimes.2
Figure 9.1 illustrates the two conditions necessary for Z to be an appropriate
instrument. The inclusion condition is that Z explains X. We test this simply by
regressing X on Z. The exclusion restriction is that Z does not cause Y. The
exclusion condition is tricky to test because if the inclusion condition holds, Z
causes X, which in turn may cause Y. In this case, there would be an observed
relationship between Z and Y but only via Z’s effect on X. Hence, we can’t test
the exclusion restriction statistically and must make substantive arguments about
why we believe Z has no direct effect on Y.
2 A test called the Hausman test (or the Durbin-Wu-Hausman test) is sometimes referred to as a test
of endogeneity. We should be careful to recognize that this is not a test of the exclusion restriction.
Instead, the Hausman test assesses whether X is endogenous. It is not a test of whether Z is
exogenous. Hausman derived the test by noting that if Z is exogenous and X is endogenous, then OLS
and 2SLS should produce very different β̂ estimates. If Z is exogenous and X is exogenous, then OLS
and 2SLS should produce similar β̂ estimates. The test involves assessing how different the β̂
estimates are from OLS and 2SLS. Crucially, we need to assume that Z is exogenous for this test.
That’s the claim we usually want to test, so the Hausman test of endogeneity is often less valuable
than it sounds.
[Figure 9.1. Two conditions for an instrument Z (the instrumental variable): the inclusion condition, that Z must explain X (the independent variable), and the exclusion restriction, that Z must not explain Y (the dependent variable).]
have laws that say that young people have to stay in school until they are 16. For
a school district that starts kids in school based on their age on September 1, kids
born in July would be in eleventh grade when they turn 16, whereas kids born in
October (who started a year later) would be only in tenth grade when they turn
16. Hence, kids born in July can’t legally drop out until they are in the eleventh
grade, but kids born in October can drop out in the tenth grade. The effect is not
huge, but with a lot of data (and Angrist and Krueger had a lot of data), this effect
is statistically significant.
Quarter of birth also seems to satisfy the exclusion condition because birth
month doesn’t seem to be related to such unmeasured factors that affect salary as
smarts, diligence, and family wealth. (Astrologers disagree, by the way.)
Bound, Jaeger, and Baker (1995), however, showed that quarter of birth
has been associated with school attendance rates, behavioral difficulties, mental
health, performance on tests, schizophrenia, autism, dyslexia, multiple sclerosis,
region, and income. [Wealthy families, for example, have fewer babies in the
winter (Buckles and Hungerman 2013). Go figure.] That this example may fail
the exclusion condition is disappointing: if quarter of birth doesn’t satisfy the
exclusion condition, it’s fair to say a lot of less clever instruments may be in trouble
as well. Hence, we should exercise due caution in using instruments, being sure
both to implement the diagnostics discussed next and to test theories with multiple
instruments or analytical strategies.
REMEMBER THIS
Two-stage least squares uses exogenous variation in X to estimate the effect of X on Y.
1. In the first stage, the endogenous independent variable is the dependent variable and the
instrument, Z, is an independent variable:
X1i = γ0 + γ1 Zi + γ2 X2i + νi
2. In the second stage, X̂1i (the fitted values from the first stage) is an independent variable:
Yi = β0 + β1 X̂1i + β2 X2i + εi
Discussion Questions
1. Some people believe cell phones and platforms like Twitter, which use related technology, have
increased social unrest by making it easier to organize protests or acts of violence. Pierskalla
and Hollenbach (2013) used data from Africa to test this view. In its most basic form, the model
was

Violencei = β0 + β1 Cell phone coveragei + εi

where Violencei is data on organized violence in city i and Cell phone coveragei measures availability of mobile coverage in city i.
where Republican votei is the vote for the Republican candidate for Congress in district i in
2010 and Tea Party protest turnouti measures the number of people who showed up at Tea Party
protests in district i on April 15, 2009, a day of planned protests across the United States.
(a) Explain why endogeneity may be a concern.
(b) Consider local rainfall on April 15, 2009, as an instrument for Tea Party protest turnout.
Explain how to test whether the rain variable satisfies the inclusion condition.
(c) Does the local rainfall variable satisfy the exclusion condition? Can we test whether this
condition holds?
3. Do economies grow more when their political institutions are better? Consider the following
simple model:

Economic growthi = β0 + β1 Institutional qualityi + εi

where Economic growthi is the growth of country i and Institutional qualityi is a measure of the quality of governance of country i.
(a) Explain why endogeneity may be a concern.
(b) Acemoglu, Johnson, and Robinson (2001) proposed country-specific mortality rates
faced by European soldiers, bishops, and sailors in their countries’ colonies in the seventeenth, eighteenth, and nineteenth centuries as an instrument for current institutions.
The logic is that European powers were more likely to set up worse institutions in
places where the people they sent over kept dying. In these places, the institutions were
oriented more toward extracting resources than toward creating a stable, prosperous
society. Explain how to test whether the settler mortality variable satisfies the inclusion
condition.
(c) Does the settler mortality variable satisfy the exclusion condition? Can we test whether
this condition holds?
Deathi = β0 + β1 NICUi + εi

where Death equals 1 if the baby passed away (and 0 otherwise) and NICU equals 1 if the delivery occurred in a high-level NICU facility (and 0 otherwise).
It is highly likely that the coefficient in this case would be positive. It is beyond
doubt that the riskiest births go to the NICU, so clearly, the key independent variable
(NICU) will be correlated with factors associated with a higher risk of death. In other
words, we are quite certain endogeneity will bias the coefficient upward. We could,
of course, add covariates that indicate risk factors in the pregnancy. Doing so would
reduce the endogeneity by taking factors correlated with NICU out of the error term
and putting them in the equation. Nonetheless, we would still worry that cases that
are riskier than usual in reality, but perhaps in ways that are difficult to measure,
would still be more likely to end up in NICUs, with the result that endogeneity would
be hard to fully purge with multivariate OLS.
Perhaps experiments could be helpful. They are, after all, designed to ensure
exogeneity. They are also completely out of bounds in this context. It is shocking to
even consider randomly assigning mothers and newborns to NICU and non-NICU
facilities. It won’t and shouldn’t happen.
So are we done? Do we have to accept multivariate OLS as the best we can do?
Not quite. Instrumental variables, and 2SLS in particular, give us hope for producing
more accurate estimates. What we need is something that explains exogenous
variation in use of the NICU. That is, can we identify a variable that explains usage
of NICUs but is not correlated with pregnancy risk factors?
Lorch, Baiocchi, Ahlberg, and Small (2012) identified a good prospect: distance
to a NICU. Specifically, they created a dummy variable we’ll call Near NICU, which
equals 1 for mothers who could get to NICU in at most 10 minutes more than it took
to get to a regular hospital (and 0 otherwise). The idea is that mothers who lived
closer to a NICU-equipped hospital would be more likely to deliver there. At the
same time, distance to a NICU should not directly affect birth outcomes; it should
affect birth outcomes only to the extent that it affects utilization of NICUs.
Does this variable satisfy the conditions necessary for an instrument? The first
condition is that the instrumental variable explains the endogenous variable, which
in this case is whether the mother delivered at a NICU. Table 9.2 shows the results
from a multivariate analysis in which the dependent variable was a dummy variable
indicating delivery at a NICU and the main independent variable was the variable
indicating that the mother lived near a NICU.
Clearly, mothers who live close to a NICU hospital are more likely to deliver
at such a hospital. The estimated coefficient on Near NICU is highly statistically
significant, with a t statistic exceeding 178. Distance does a very good job explaining NICU usage. Table 9.2 shows coefficients for two other variables as well (the
actual analysis has 60 control variables). Gestational age indicates how far along
the pregnancy was at the time of delivery. ZIP code poverty indicates the percent
of people in a ZIP code living below the poverty line. Both these control variables
are significant, with babies that are gestationally older less likely to be delivered in
NICU hospitals and women from high-poverty ZIP codes more likely to deliver in
NICU hospitals.
The second condition that a good instrument must satisfy is that its variable not
be correlated with the error term in the second stage. This is the exclusion condition,
which holds that we can justifiably exclude the instrument from the second stage.
Certainly, it seems highly unlikely that the mere fact of living near a NICU would
help a baby unless the mother used that facility. However, living near a NICU might
be correlated with a risk factor. What if NICUs tended to be in large urban hospitals
in poor areas? In that case, living near one could be correlated with poverty, which
in turn might itself be a pregnancy risk factor. Hence, it is crucial in this analysis that
poverty be a control variable in both the first and second stages. In the first stage,
controlling for poverty allows us to identify how much more likely women are to go
The multivariate OLS and 2SLS models include many controls for pregnancy
risk and demographic factors. Results based on Lorch, Baiocchi, Ahlberg,
and Small (2012).
Review Questions
Table 9.4 provides results on regressions used in a 2SLS analysis of the effect of alcohol consumption
on grades. This is from hypothetical data on grades, standardized test scores, and average weekly
alcohol consumption from 1,000 undergraduate students at universities in multiple states. The beer
tax variable measures the amount of tax on beer in the state in which the student attends university.
The test score is the composite SAT score from high school. Grades are measured as grade point
average in the student’s most recent semester.
1. Identify the first-stage model and the second-stage model. What is the instrument?
2. Is the instrument a good instrument? Why or why not?
3. Is there evidence about the exogeneity of the instrument in the table? Why or why not?
4. What would happen if we included the beer tax variable in the grades model?
5. Do the (hypothetical!) results here present sufficient evidence to argue that alcohol has no effect
on grades?
If these are all valid instruments, we have multiple sources of exogeneity that could
improve the fit in the first stage.
When we have multiple instruments, the best way to assess whether the
instruments adequately predict the endogenous variable is to use an F test for the
null hypothesis that the coefficients on all instruments in the first stage are zero.
For our example, the F test would test H0 : γ1 = γ2 = γ3 = 0. We presented the F
test in Chapter 5 (page 159). In this case, rejecting the null leads us to conclude that at least one of the instruments helps explain X1i . We discuss a rule of thumb
for this test shortly on page 312.
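The first-stage F test amounts to comparing a restricted regression (controls only) with an unrestricted one (controls plus instruments); a hypothetical sketch with three invented instruments:

```python
import numpy as np

# Hypothetical first stage with three invented instruments and one control.
rng = np.random.default_rng(2)
n = 5_000
z1, z2, z3 = rng.normal(size=(3, n))
x2 = rng.normal(size=n)
x1 = 0.3 * z1 + 0.2 * z2 + 0.1 * z3 + 0.5 * x2 + rng.normal(size=n)

def rss(X, y):
    """Residual sum of squares from an OLS fit with intercept."""
    X = np.column_stack([np.ones(len(y)), *X])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return resid @ resid

rss_r = rss([x2], x1)              # restricted: controls only
rss_u = rss([x2, z1, z2, z3], x1)  # unrestricted: controls + instruments

q = 3  # restrictions tested (H0: all three instrument coefficients are zero)
k = 5  # parameters in the unrestricted model (intercept, x2, z1, z2, z3)
F = ((rss_r - rss_u) / q) / (rss_u / (n - k))
print(F)  # far above the rule-of-thumb threshold of 10
```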
Overidentification tests
overidentification test A test used for 2SLS models having more than one instrument. The logic of the test is that the estimated coefficient on the endogenous variable in the second-stage equation should be roughly the same when each individual instrument is used alone.

Having multiple instruments also allows us to implement an overidentification test. The name of the test comes from the fact that we say an instrumental variable model is identified if we have an instrument that can explain X without directly influencing Y. When we have more than one instrument, the equation is overidentified; that sounds a bit ominous, like something will explode.3 Overidentification is actually a good thing. Having multiple instruments allows us to do some additional analysis that will shed light on the performance of the instruments.

The references in this chapter’s Further Reading section point to a number of formal tests regarding multiple instruments. These tests can get a bit involved, but the core intuition is rather simple. If each instrument is valid—that is, if each satisfies the two conditions for instruments—then using each one alone should
produce an unbiased estimate of β1 . Therefore, as an overidentification test, we
can simply estimate the 2SLS model with each individual instrument alone. The
coefficient estimates should look pretty much the same given that each instrument
alone under these circumstances produces an unbiased estimator. Hence, if each
of these models produces coefficients that are similar, we can feel pretty confident
3 Everyone out now! The model is going to blow any minute . . . it’s way overidentified!
that each is a decent instrument (or that they all are equally bad, which is the skunk
at the garden party for overidentification tests).
If the instruments produce vastly different β̂1 coefficient estimates, we have
to rethink our instruments. This can happen if one of the instruments violates the
exclusion condition. The catch is that we don’t know which instrument is the bad
one. Suppose that β̂1 found by using Z1 as an instrument is very different from
β̂1 found by using Z2 as an instrument. Is Z1 a bad instrument? Or is the problem
with Z2 ? Overidentification tests can’t say.
An overidentification test is like having two clocks. If the clocks show
different times, we know at least one is wrong, and possibly both. If both clocks
show the same time, we know they’re either both right or both wrong in the same exact way.
Overidentification tests are relatively uncommon, not because they aren’t
useful but because it’s hard to find one good instrument, let alone two
or more.
REMEMBER THIS
An instrumental variable model is overidentified when there are multiple instruments for a single endogenous variable.
1. To estimate a 2SLS model with multiple valid instruments, simply include all of them in the
first stage.
2. To use overidentification tests to assess instruments, run 2SLS models separately with each
instrumental variable. If the second-stage coefficients on the endogenous variable in question
are similar across models, this result is evidence that all the instruments are valid.
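The informal version of this check can be sketched on simulated data (invented names and values); a third, deliberately invalid "instrument" is included to show what a violation of the exclusion condition looks like:

```python
import numpy as np

# Hypothetical data with two valid instruments and one invalid one.
rng = np.random.default_rng(3)
n = 200_000
z1, z2 = rng.normal(size=(2, n))
confound = rng.normal(size=n)
x = 0.8 * z1 + 0.6 * z2 + confound + rng.normal(size=n)
y = 1.5 * x + confound + rng.normal(size=n)   # true beta = 1.5

# An invented "instrument" that violates exclusion: it is the confound plus
# noise, so it is correlated with the error term in the y equation.
z_bad = confound + rng.normal(size=n)

def iv_slope(z, x, y):
    """Single-instrument 2SLS slope via the covariance ratio."""
    return np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

b1 = iv_slope(z1, x, y)
b2 = iv_slope(z2, x, y)
b_bad = iv_slope(z_bad, x, y)
print(b1, b2, b_bad)  # b1 and b2 agree near 1.5; b_bad is far off
```

As the text notes, agreement between b1 and b2 is also consistent with both instruments being wrong in exactly the same way; the test cannot rule that out.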
bit, or at least a lot less than X1 correlates with ε. Such an instrument is called a quasi-instrument.

quasi-instrument An instrumental variable that is not strictly exogenous.

It can sometimes be useful to estimate a 2SLS model with a quasi-instrument because a bit of correlation between Z and ε does not necessarily render 2SLS useless. To see why, let’s consider a simple case: one independent variable and
one instrument. We examine the probability limit of β̂1 because the properties
of probability limits are easier to work with than expectations in this context.4
For reference, we first note that the probability limit for the OLS estimate of β̂1 is

plim β̂1^OLS = β1 + corr(X1 , ε) (σε / σX1 )    (9.7)

where plim refers to the probability limit and corr indicates the correlation of the two variables in parentheses. If corr(X1 , ε) is zero, then the probability limit of β̂1^OLS is β1 . That’s a good thing! If corr(X1 , ε) is non-zero, the OLS estimate of β̂1 will converge to something other than β1 as the sample size gets very large. That’s not good.
If we use a quasi-instrument to estimate a 2SLS, the probability limit for the 2SLS estimate of β̂1 is

plim β̂1^2SLS = β1 + [corr(Z, ε) / corr(Z, X1 )] (σε / σX1 )    (9.8)

If corr(Z, ε) is zero, then the probability limit of β̂1^2SLS is β1 .5 Another good thing! Otherwise, the 2SLS estimate of β̂1 will converge to something other than β1 as the sample size gets very large.
Equation 9.8 has two very different implications. On the one hand, the
equation can be grounds for optimism about 2SLS. Comparing the probability
limits from the OLS and 2SLS models shows that if there is only a small
correlation between Z and ε and a high correlation between Z and X1 , then 2SLS will perform better than OLS when the correlation of X and ε is large. This can happen when an instrument does a great job predicting X but
has a wee bit of correlation with the error in the main equation. In other
words, quasi-instruments may help us get estimates that are closer to the true
value.
On the other hand, the correlation of Z and X1 in the denominator of Equation 9.8 implies that when the instrument does a poor job of explaining X1 , even a small amount of correlation between Z and ε can become magnified
by virtue of being divided by a very small number. In the education and wages
4 Section 3.5 introduces probability limits.
5 The form of this equation is from Wooldridge (2009), based on Bound, Jaeger, and Baker (1995).
example, the month of birth explained so little of the variation in education that the
danger was substantial distortion of the 2SLS estimate if even a dash of correlation
existed between month of birth and ε.
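A simulation can illustrate Equation 9.8; in this hypothetical setup (all values invented), the predicted large-sample bias reduces to cov(Z, ε)/cov(Z, X1 ), and the 2SLS estimate converges to the true coefficient plus that bias:

```python
import numpy as np

# Hypothetical quasi-instrument: corr(z, eps) is small but not zero.
rng = np.random.default_rng(4)
n = 1_000_000
z = rng.normal(size=n)

a, c = 0.2, 0.05     # weak-ish first stage (a), small exclusion violation (c)
x1 = a * z + rng.normal(size=n)
eps = c * z + rng.normal(size=n)
beta1 = 2.0
y = beta1 * x1 + eps

beta_2sls = np.cov(z, y)[0, 1] / np.cov(z, x1)[0, 1]

# Equation 9.8's predicted large-sample value:
ratio = np.corrcoef(z, eps)[0, 1] / np.corrcoef(z, x1)[0, 1]
predicted = beta1 + ratio * (eps.std() / x1.std())

print(beta_2sls, predicted)  # both near 2.0 + c/a = 2.25
```

Shrinking the first-stage coefficient a while holding c fixed magnifies the bias, which is exactly the weak-instrument danger described above.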
6 The rule of thumb is from Staiger and Stock (1997). We can, of course, run an F test even when we have only a single instrument. A cool curiosity is that the F statistic in this case will be the square of the t statistic. This means that when we have only a single instrument, we can simply look for a t
9.5 Precision of 2SLS 313
REMEMBER THIS
1. A quasi-instrument is an instrument that is correlated with the error term in the main equation.
If the correlation of the quasi-instrument (Z) and the error term (ε) is small relative to the
correlation of the quasi-instrument and the endogenous variable (X), then as the sample size
gets very large, the 2SLS estimate based on Z will converge to something closer to the true
value than the OLS estimate.
2. A weak instrument does a poor job of explaining the endogenous variable (X). Weak
instruments magnify the problems associated with quasi-instruments and also can cause bias
in small samples.
3. All 2SLS analyses should report tests of the independent explanatory power of the instrumental variable or variables in the first-stage regression. A rule of thumb is that the F statistic should be
at least 10 for the hypothesis that the coefficients on all instruments in the first-stage regression
are zero.
statistic that is bigger than √10, which we approximate (roughly!) by saying the t statistic should be bigger than 3. Appendix H provides more information on the F distribution on page 549.
X̂1i = π0 + π1 X2i + ηi

where we use π , the Greek letter pi, as coefficients and η, the Greek letter eta (which rhymes with β), to emphasize that this is a new model, different from earlier models. Notice that Z is not in this regression, meaning that the R2 from it measures the extent to which X̂1 is a function of the other independent variables. If this R2 is high, X̂1 is explained by X2 but not by Z, which will push up var(β̂X̂1^2SLS).
The point here is not to learn how to calculate standard error estimates by
hand. Computer programs do the chore perfectly well. The point is to understand
the sources of variance in 2SLS. In particular, it is useful to see the importance of the ability of Z to explain X1 . If Z lacks this ability, our β̂1^2SLS estimates will be imprecise.
As for goodness of fit, the conventional R2 for 2SLS is basically broken. It is
possible for it to be negative. If we really need a measure of goodness of fit, the
square of the correlation of the fitted values and actual values will do. However, as
we discussed when we introduced R2 on page 71, the validity of the results does
not depend on the overall goodness of fit.
9.6 Simultaneous Equation Models 315
REMEMBER THIS
1. Four factors influence the variance of 2SLS β̂j estimates.

(a) Model fit: The better the model fits, the lower σ̂2 and var(β̂j^2SLS) will be.

(b) Sample size: The more observations, the lower var(β̂j^2SLS) will be.

(c) The overall fit of the first-stage regression: The better the fit of the first-stage model, the higher var(X̂1 ) and the lower var(β̂1^2SLS) will be.

(d) The explanatory power of the instrument in explaining X:

• If Z is a weak instrument (i.e., if it does a poor job of explaining X1 when we control for the other X variables), then R2X̂1,NoZ will be high because X̂1 will depend almost completely on the other independent variables. The result will be a high var(β̂1^2SLS).

• If Z explains X1 when we control for the other X variables, then R2X̂1,NoZ will be low, which will lower var(β̂1^2SLS).
2. R2 is not meaningful for 2SLS models.
The labels X and Y don’t really work anymore when the variables cause each
other because no variable is only an independent variable or only a dependent
variable. Therefore, we use the following equations to characterize the basic model of simultaneous causality:
[Figure: simultaneous causality, in which Y1 and Y2 are endogenous variables that cause each other, an independent variable appears in both equations, and Z1 and Z2 serve as instruments.]
where Ŷ2i is the fitted value from the first-stage regression (Equation 9.14).
REMEMBER THIS
We can use instrumental variables to estimate coefficients for the following simultaneous equation
model:
1. Use the following steps to estimate the coefficients in the first equation:
• In the first stage, we estimate a model in which the endogenous variable is the dependent
variable and all W and Z variables are the independent variables. Importantly, the other
endogenous variable (Y1 ) is not included in this first stage:
• In the second stage, we estimate a model in which the fitted values from the first stage, Ŷ2i ,
are an independent variable:
2. We proceed in a similar way to estimate coefficients for the second equation in the model:
• First, estimate a model with Y1i as the dependent variable and the W and Z variables (but
not Y2 !) as independent variables.
• Estimate the final model by using Ŷ1i instead of Y1i as an independent variable.
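The recipe above can be sketched end to end on simulated data (Python/NumPy; coefficients and names invented; for brevity the W control variables are omitted, so each equation contains only the other endogenous variable and its own instrument):

```python
import numpy as np

# Hypothetical simultaneous system (W controls omitted for brevity):
#   y1 = a1*y2 + d1*z1 + e1,   y2 = a2*y1 + d2*z2 + e2
rng = np.random.default_rng(5)
n = 200_000
z1, z2 = rng.normal(size=(2, n))
e1, e2 = rng.normal(size=(2, n))

a1, a2, d1, d2 = 0.5, -0.4, 1.0, 1.0   # invented structural coefficients
det = 1 - a1 * a2                      # 1.2, so the system can be solved

# Reduced form: solve the two structural equations for y1 and y2.
y1 = (d1 * z1 + a1 * d2 * z2 + e1 + a1 * e2) / det
y2 = (a2 * d1 * z1 + d2 * z2 + a2 * e1 + e2) / det

def ols(X, y):
    """OLS coefficients with an intercept prepended to the regressor list."""
    X = np.column_stack([np.ones(len(y)), *X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# First stage for the y1 equation: y2 on both instruments, y1 excluded.
g = ols([z1, z2], y2)
y2_hat = g[0] + g[1] * z1 + g[2] * z2

# Second stage: y1 on the fitted y2 and its own instrument z1.
a1_hat = ols([y2_hat, z1], y1)[1]
print(a1_hat)  # close to the true value a1 = 0.5
```

Estimating the second equation proceeds symmetrically, regressing y1 on the instruments in the first stage and using its fitted values in place of y1.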
CASE STUDY Supply and Demand Curves for the Chicken Market
Even though nothing defines the field of economics like supply and demand, estimating supply
and demand curves can be tricky. We can’t simply
estimate an equation in which quantity supplied is
the dependent variable and price is an independent variable because price itself is a function of
how much is supplied. In other words, quantity and
price are simultaneously determined.
Our simultaneous equation framework can
help us navigate this challenge. First, though, let’s
be clear about what we’re trying to do. We want to
estimate two relationships: a supply function and a
demand function. Each of these characterizes the relationship between price and
amount, but they do so in pretty much opposite ways. We expect the quantity
supplied to increase as the price increases. After all, we suspect a producer will
say, “You’ll pay more? I’ll make more!” On the other hand, we expect the quantity
demanded to decrease when the price increases, as consumers will say, “It costs
more? I’ll buy less!”
As we pose the question, we can see this won’t be easy as we typically observe
one price and one quantity for each period. How are we going to get two different
slopes out of this same information?
If we only had information on price and quantity, we could not, in fact, estimate
the supply and demand functions. In that case, we should shut the computer off and
go to bed. If we have other information, however, that satisfies our conditions for
instrumental variables, then we can potentially estimate both supply and demand
functions. Here’s how.
Let’s start with the supply side and write down equations for quantity and price
supplied:
where Pricet is instrumented with change in income, change in the price of beef,
and the lagged price.
We can do a similar exercise when estimating the demand function. We still
work with quantity and price equations. However, now we’re looking for factors
7 We simplify things a fair bit; see the original article as well as Brumm, Epple, and McCallum (2008) for a more detailed discussion.
TABLE 9.5 Price and Quantity Supplied Equations for U.S. Chicken Market
[Columns: Price equation (first stage); Quantity supplied equation (second stage)]
that affect the price via the supply side but do not directly affect how much
chicken people will want to consume. Epple and McCallum proposed the price of
chicken feed, the amount of meat (non-chicken) demanded for export, and the lagged amount produced as instrumental variables that satisfy these
conditions. For example, the price of feed will affect how much it costs to produce
chicken, but it should not affect the amount consumed except by affecting the
price. This leads to the following model:
where Pricet is instrumented with price of chicken feed, the change in meat exports,
and the lagged production.
There are two additional challenges. First, we will log variables in order to
generate price elasticities. We discussed reasons why in Section 7.2. Hence, every
variable except the time trend will be logged. Second, we’re dealing with time
series data. We saw a bit about time series data when we covered autocorrelation in
TABLE 9.6 Price and Quantity Demanded Equations for U.S. Chicken Market
[Columns: Price equation (first stage); Quantity demanded equation (second stage)]
Section 3.6, and we’ll discuss time series data in much greater detail in Chapter 13.
For now, we simply note that a concern with strong time dependence led Epple and
McCallum to conclude the best approach was to use differenced variables for the
demand equation. Differenced variables measure the change in a variable rather
than the level. Hence, the value of a differenced variable for year 2 of the data is the
change from period 1 to period 2 rather than the amount in period 2.
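Differencing itself is a one-line transformation. The sketch below uses Python rather than the book's Stata/R, with made-up production levels:

```python
# Hypothetical yearly production levels (units arbitrary).
levels = [100, 104, 103, 110, 118]

# The differenced series starts in year 2: each entry is this year's level
# minus last year's level.
diffs = [levels[t] - levels[t - 1] for t in range(1, len(levels))]

print(diffs)  # [4, -1, 7, 8]
```

The differenced series is one observation shorter than the original, which is why the first usable year is year 2.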
Table 9.5 on page 321 shows the results for the supply equation. The first-stage
results are from a reduced form model in which the price is the dependent variable
and all the control variables and instruments are the independent variables.
Notably, we do not include the quantity as a control variable in this first-stage
regression. Each of the instruments is statistically significant, and the F statistic for
the null hypothesis that all coefficients on the instruments equal zero is 11.16, which
satisfies the rule of thumb that the F statistic for the test regarding all instruments
should be over 10.
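The first-stage logic can be sketched by hand. The Python snippet below (invented data, not the chicken-market series) regresses an endogenous variable X on a single instrument Z and computes the t statistic on the instrument; with exactly one instrument, the first-stage F statistic is just that t statistic squared, which ties the F > 10 rule of thumb to the t > 3 rule used in the Computing Corner:

```python
import statistics

# Hypothetical first stage: regress the endogenous X on a single instrument Z.
Z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
X = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8]  # strongly related to Z

n = len(Z)
zbar, xbar = statistics.mean(Z), statistics.mean(X)
szz = sum((z - zbar) ** 2 for z in Z)
beta1 = sum((z - zbar) * (x - xbar) for z, x in zip(Z, X)) / szz
beta0 = xbar - beta1 * zbar

# Standard error of the slope, then the t statistic on the instrument.
resid = [x - (beta0 + beta1 * z) for z, x in zip(Z, X)]
s2 = sum(e ** 2 for e in resid) / (n - 2)
t_stat = beta1 / (s2 / szz) ** 0.5

# With one instrument, the first-stage F equals t squared,
# so F > 10 corresponds roughly to t > 3.
print(t_stat, t_stat ** 2)
```

With multiple instruments the F test jointly restricts all instrument coefficients to zero, which is what the Stata `test` command below does.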
The second-stage supply equation uses the fitted value of the price of chicken.
We see that the elasticity is 0.203, meaning that a one percent increase in price
is associated with a 0.203 percent increase in production. We also see that a one
percent increase in the price of chicken feed, a major input, is associated with a
0.141 percent reduction in quantity of chicken produced.
Table 9.6 on page 322 shows the results for the demand equation. The
first-stage price equation uses the price of chicken feed, meat exports, and the
lagged price of chicken as instruments. Chicken feed prices should affect suppliers
but not directly affect the demand side. The volume of meat exports should
affect suppliers’ output but not what consumers in the United States want. Our
dependent variable in the second stage is the amount of chicken consumed by
people in the United States.
Each instrument performs reasonably well, with the t statistics above 2. The F
statistic for the null hypothesis that all coefficients are zero is 10.86, which satisfies
our first-stage inclusion condition.
The second-stage demand equation reported in Table 9.6 is quite sensible. A
one percent increase in price is associated with a 0.257 percent decline in amount
demanded. This is pretty neat. Whereas Table 9.5 showed an increase in quantity
supplied as price rises, Table 9.6 shows a decrease in quantity demanded as price
rises. This is precisely what economic theory says should happen.
The other coefficients in Table 9.6 make sense as well. A one percent increase in
incomes is associated with a 0.408 percent increase in consumption, although this
is not quite statistically significant. In addition, the amount of chicken demanded
increases as the price of beef rises. In particular, if beef prices go up by one percent,
people in the U.S. eat 0.232 percent more chicken. Think of that coefficient as
basically a Chick-fil-A commercial, but with math.
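Because the model is log-log, the 0.232 coefficient reads directly as an elasticity. A quick arithmetic check (Python; the base quantity of 1.0 is arbitrary):

```python
# Log-log coefficients are elasticities: log(Quantity) = ... + b*log(BeefPrice) + ...
b = 0.232   # cross-price elasticity of chicken demand with respect to beef prices

q_before = 1.0                  # arbitrary base quantity
q_after = q_before * 1.01 ** b  # beef price rises by one percent

pct_change = 100 * (q_after / q_before - 1)
print(round(pct_change, 3))  # roughly a 0.23 percent rise in chicken demanded
```

For small percentage changes the exact answer is essentially b itself, which is why we read log-log coefficients as elasticities.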
Conclusion
• Section 9.2: Explain the first- and second-stage regressions in 2SLS. What
two conditions are necessary for an instrument to be valid?
• Section 9.5: Explain how the first-stage results affect the precision of the
second-stage results.
Further Reading
Murray (2006a) summarizes the instrumental variables approach and is particularly
good at discussing finite sample bias and many statistical tests that are useful in
diagnosing whether instrumental variables conditions are met. Baiocchi, Cheng, and
Small (2014) provide an intuitive discussion of instrumental variables in health
research.

One topic that has generated considerable academic interest is the possibility
that the effect of X differs within a population. In this case, 2SLS estimates the
local average treatment effect, which is the causal effect only for those affected
by the instrument. This effect is considered "local" in the sense of describing the
effect for the specific class of individuals for whom the endogenous X1 variable
was influenced by the exogenous Z variable.8

local average treatment effect: The causal effect for those people affected by the
instrument only. Relevant if the effect of X on Y varies within the population.

In addition, scholars who study instrumental variables methods discuss the
importance of monotonicity, which is a condition under which the effect of the
instrument on the endogenous variable goes in the same direction for everyone in
a population. This condition rules out the possibility that an increase in Z causes
some units to increase X and other units to decrease X.

monotonicity: Monotonicity requires that the effect of the instrument on the
endogenous variable go in the same direction for everyone in a population.

Finally, scholars also discuss the stable unit treatment value assumption, the
condition under which the treatment doesn't vary in unmeasured ways across
individuals and there are no spillover effects that might be anticipated: for
example, if an untreated neighbor of someone in the treatment group somehow
benefits from the treatment via the neighbor who is in the group.

stable unit treatment value assumption: The condition that an instrument has no
spillover effect.

8 Suppose, for example, that the effect of education on future wages differs for
students who like school (they learn a lot in school, so more school leads to higher
wages) and students who hate school (they learn little in school, so more school
does not lead to higher wages for them). If we use month of birth as an instrument,
then the variation in years of schooling we are looking at is only the variation
among people who would or would not drop out of high school after their
sophomore year, depending on when they turned 16. The effect of schooling for
those folks might be pretty small, but that's what the 2SLS approach will estimate.
Imbens (2014) and Chapter 4 of Angrist and Pischke (2009) discuss these
points in detail and provide mathematical derivations. Sovey and Green (2011)
discuss these and related points, with a focus on instrumental variables in
political science.
Key Terms
Exclusion condition (300)
Identified (318)
Inclusion condition (300)
Instrumental variable (297)
Local average treatment effect (324)
Monotonicity (324)
Overidentification test (309)
Quasi-instrument (311)
Reduced form equation (317)
Simultaneous equation model (315)
Stable unit treatment value assumption (324)
Two-stage least squares (295)
Weak instrument (312)
Computing Corner
Stata
• The rule of thumb when there is only one instrument is that the t
statistic on the instrument in the first stage should be greater than
3. The higher, the better.
• When there are multiple instruments, run an F test using the test
command. The rule of thumb is that the F statistic should be larger
than 10.
reg X1 Z1 Z2 X2 X3
test Z1=Z2=0
R

1. To estimate a 2SLS model in R, we can use the ivreg command from the
AER package.
• See page 85 on how to install the AER package. Recall that we need
to tell R to use the package with the library command below for
each R session in which we use the package.
• If there is only one instrument, the rule of thumb is that the t statistic
on the instrument in the first stage should be greater than 3. The
higher, the better.
lm(X1 ~ Z1 + X2 + X3)
• For a simultaneous equation model, indicate after the | symbol the
reduced form variables that will be included (which is all variables but the
other dependent variable):
library(AER)
ivreg(Y1 ~ Y2 + W1 + Z1 | Z1 + W1 + Z2)
ivreg(Y2 ~ Y1 + W1 + Z2 | Z1 + W1 + Z2)
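To see the two stages that ivreg automates, here is a hand-rolled sketch in Python (simulated data; the variable names are illustrative, not from the book). It fits the endogenous X from the instrument Z, regresses Y on the fitted values, and compares the result to naive OLS:

```python
import random
import statistics

random.seed(2)

def slope(x, y):
    """OLS slope of y on x (a regression with an intercept)."""
    xbar, ybar = statistics.mean(x), statistics.mean(y)
    num = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    den = sum((a - xbar) ** 2 for a in x)
    return num / den

n = 5000
Z = [random.gauss(0, 1) for _ in range(n)]                 # instrument
U = [random.gauss(0, 1) for _ in range(n)]                 # unobserved confounder
X = [z + u + random.gauss(0, 1) for z, u in zip(Z, U)]     # endogenous regressor
Y = [2.0 * x + 3.0 * u + random.gauss(0, 1) for x, u in zip(X, U)]  # true effect is 2

naive = slope(X, Y)  # biased upward: U sits in the error and is correlated with X

# First stage: fit X from Z. Second stage: regress Y on the fitted values.
gamma1 = slope(Z, X)
zbar, xbar = statistics.mean(Z), statistics.mean(X)
X_hat = [xbar + gamma1 * (z - zbar) for z in Z]
two_sls = slope(X_hat, Y)

print(round(naive, 2), round(two_sls, 2))  # naive well above 2; 2SLS near 2
```

In practice we still use ivreg (or Stata's ivregress) rather than this two-step shortcut, because the canned routines compute the correct second-stage standard errors.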
Exercises
1. Does economic growth reduce the odds of civil conflict? Miguel,
Satyanath, and Sergenti (2004) used an instrumental variables approach
to assess the relationship between economic growth and civil war. They
provided data (available in RainIV.dta) on 41 African countries from 1981
to 1999, including the variables listed in Table 9.7.
(b) Add control variables for initial GDP, democracy, mountains, and
ethnic and religious fractionalization to the model in part (a). Do
these results establish a causal relationship between the economy
and civil conflict?
InternalConflict Coded 1 if civil war with greater than 25 deaths and 0 otherwise
LaggedGDPGrowth Lagged GDP growth
InitialGDPpercap GDP per capita at the beginning of the period of analysis, 1979
Democracy A measure of democracy (called a “polity” score); values range from −10 to 10
(d) Explain in your own words how instrumenting for GDP with rain
could help us identify the causal effect of the economy on civil conflict.
(e) Use the dependent and independent variables from part (b), but
now instrument for lagged GDP growth with lagged rainfall growth.
Comment on the results.
(f) Redo the 2SLS model in part (e), but this time, use dummy
variables to add country fixed effects. Comment on the quality of the
instrument in the first stage and the results for the effect of lagged
economic growth in the second stage.
(g) (funky) Estimate the first stage from the 2SLS model in part (f), and
save the residuals. Then estimate a regular OLS model that includes
the same independent variables from part (f) and country dummies.
Use lagged GDP growth (do not use fitted values), and now include
the residuals from the first stage you just saved. Compare the
coefficient on lagged GDP growth you get here to the coefficient on
that variable in the 2SLS. Discuss how endogeneity is being handled
in this specification.
2. Can television inform people about public affairs? It’s a tricky question
because the nerds (like us) who watch public-affairs-oriented TV are
pretty well informed to begin with. Therefore, political scientists Bethany
Albertson and Adria Lawrence (2009) conducted a field experiment in
which they randomly assigned people to treatment and control conditions.
Those assigned to the treatment condition were told to watch a specific
television broadcast about affirmative action and that they would be
interviewed about what they had seen. Those in the control group were not
told about the program but were told that they would be interviewed again
later. The program they studied aired in California prior to the vote on
Proposition 209, a controversial proposition relating to affirmative action.
Their data (available in NewsStudy.dta) includes the variables listed in
Table 9.8.
Education Education level (eighth grade or less = 1 to advanced graduate degree = 13)
TreatmentGroup Assigned to watch program (treatment = 1; control = 0)
WatchProgram Actually watched program (watched = 1, did not watch = 0)
InformationLevel Information about Proposition 209 prior to election (none = 1 to great deal = 4)
(b) Estimate the model in part (a), but now include measures of political
interest, newspaper reading, and education. Are the results different?
Have we defeated endogeneity?
(e) What do the 2SLS results suggest about the effect of watching the
program on information levels? Compare the results to those in part
(b). Have we defeated endogeneity?
3. Suppose we want to understand the demand curve for fish. We’ll use the
following demand curve equation:
QuantityDt = β0 + β1 Pricet + εDt
(a) To see that prices and quantities are endogenous, draw supply and
demand curves and discuss what happens when the demand curve
shifts out (which corresponds to a change in the error term of the
demand function). Note also what happens to price in equilibrium
and discuss how this event creates endogeneity.
(b) The data set fishdata.dta (from Angrist, Graddy, and Imbens 2000)
provides data on prices and quantities of a certain kind of fish (called
whiting) over 111 days at the Fulton Street Fish Market, which then
existed in Lower Manhattan. The variables are indicated in Table 9.9.
The price and quantity variables are logged. Estimate a naive OLS
model of demand in which quantity is the dependent variable and
price is the independent variable. Briefly interpret results, and then
discuss whether this analysis is useful.
(c) Angrist, Graddy, and Imbens suggest that a dummy variable indicat-
ing a storm at sea is a good instrumental variable that should affect
the supply equation but not the demand equation. Stormy is a dummy
variable that indicates a wave height greater than 4.5 feet and wind
speed greater than 18 knots. Use 2SLS to estimate a demand function
in which Stormy is an instrument for Price. Discuss first-stage and
second-stage results, interpreting the most relevant portions.
(d) Reestimate the demand equation but with additional controls. Con-
tinue to use Stormy as an instrument for price, but now also include
covariates that account for the days of the week and the weather on
shore. Discuss first-stage and second-stage results, interpreting the
most relevant portions.
(a) Estimate a model with prison as the dependent variable and educa-
tion, age, and African-American as independent variables. Make this
a fixed effects model by including dummies for state of residence
(state) and year of census data (year). Report and briefly describe
the results.
(b) Based on the OLS results, can we causally conclude that increasing
education will reduce crime? Why is it difficult to estimate the effect
of education on criminal activity?
(c) Lochner and Moretti used 2SLS to improve upon their OLS esti-
mates. They used changes in compulsory attendance laws (set by
ca11 Dummy equals 1 if state compulsory schooling is 11 or more years and 0 otherwise
FIPS codes are Federal Information Processing Standards codes for states (and also countries).
(d) Estimate a 2SLS model using the instruments just described and the
control variables from the OLS model above (including state and
year dummy variables). Briefly explain the results.
(e) 2SLS is known for being less precise than OLS. Is that true here? Is
this a problem for the analysis in this case? Why or why not?
(a) Are countries with higher income per capita more democratic? Run
a pooled regression model with democracy (democracy_fh) as the
dependent variable and logged GDP per capita (log_gdp) as the
year Year
YearCode Order of years of data set (1955 = 1, 1960 = 2, 1965 = 3, etc.)
(b) Rerun the model from part (a), but now include fixed effects for year
and country. Describe the model. How does including these fixed
effects change the results?
(c) To better establish causality, the authors use 2SLS. One of the
instruments that they use is changes in the income of trading partners
(worldincome). They theorize that the income of a given country’s
trading partners should predict its own GDP but should not directly
affect the level of democracy in the country. Discuss the viability
of this instrument with specific reference to the conditions that
instruments need to satisfy. Provide evidence as appropriate.
CHAPTER 10 Experiments: Dealing with Real-World Challenges
Yi = β0 + β1 Treatmenti + εi (10.1)
where Yi is the outcome we care about and Treatmenti equals one for subjects in
the treatment group. In reality, randomized experiments face a host of challenges.
Not only are they costly, potentially infeasible, and sometimes unethical, as
discussed in Section 1.3, they run into several challenges that can undo the desired
exogeneity of randomized experiments. This chapter focuses on these challenges.
Section 10.1 discusses the challenges raised by possible dissimilarity of the
treatment and control groups. If the treatment group differs from the control group
in ways other than the treatment, we can’t be sure whether it’s the treatment or
other differences that explain differences across these groups. Section 10.2 moves
on to the challenges raised by non-compliance with assignment to an experimental
group. Section 10.3 shows how to use the 2SLS tools from Chapter 9 to deal
with non-compliance. Section 10.4 discusses the challenge posed to experiments
by attrition, a common problem that arises when people leave an experiment.
Section 10.5 changes gears to discuss natural experiments, which occur without
intervention by researchers.
We refer to the attrition, balance, and compliance challenges facing experiments
as ABC issues.2 Every analysis of experiments should discuss these ABC issues
explicitly.

ABC issues: Three issues that every experiment needs to address: attrition,
balance, and compliance.

1 Often the control group is given a placebo treatment of some sort. In medicine, this is the
well-known sugar pill. In social science, a placebo treatment may be an experience that shares the
form of the treatment but not the content. For example, in a study of advertising efficacy, a placebo
group might be shown a public service ad. The idea is that the mere act of viewing an ad, any ad,
could affect respondents and that ad designers want their ad to cause changes over and above that
baseline effect.
2 We actually discuss balance first, followed by compliance and then attrition, because this order
follows the standard sequence of experimental analysis. We’ll stick with calling them ABC issues,
though, because BCA doesn’t sound as cool as ABC.
10.1 Randomization and Balance
group for their own reasons. Or maybe the folks doing the randomization
screwed up.
In other cases, the treatment and control groups may differ simply due to
chance. Suppose we want to conduct a random experiment on a four-person family
of mom, dad, big sister, and little brother. Even if we pick the two-person treatment
and control groups randomly, we’ll likely get groups that differ in important ways.
Maybe the treatment group will be dad and little brother—too many guys there. Or
maybe the treatment group will be mom and dad—too many middle-aged people
there. In these cases, any outcome differences between the treatment and control
groups would be due not only to the treatment but also possibly to the sex or
age differences. Of course the odds that the treatment and control groups differ
substantially fall rapidly as the sample size increases (a good reason to have a big
sample!). The chance that such differences occur never completely disappears,
however.
Xi = γ0 + γ1 TreatmentAssignedi + νi (10.2)
where TreatmentAssignedi is 1 for those assigned to the treatment group and 0 for
those assigned to the control group. We use γ (gamma) to indicate the coefficients
and ν (nu) to indicate the error term. We do not use β and ε here, to emphasize that
the model differs from the main model (Equation 10.1). We estimate Equation 10.2
for each potential independent variable; each equation will produce a different γ̂1
estimate. A statistically significant γ̂1 estimate indicates that the X variable differed
across those assigned to the treatment and control groups.3
Ideally, we won’t see any statistically significant γ̂1 estimates; this outcome
would indicate that the treatment and control groups are balanced. If the γ̂1
estimates are statistically significant for many X variables, we do not have balance
in our experimentally assigned groups, which suggests systematic interference
with the planned random assignments.
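With a binary TreatmentAssigned variable, Equation 10.2 is just a difference of means test: the intercept is the control-group mean and the slope is the treatment-minus-control gap. A minimal sketch in Python with invented data:

```python
import statistics

# Hypothetical pre-treatment ages, with assignment indicators (1 = treatment).
assigned = [1, 1, 1, 0, 0, 0]
age      = [34, 29, 42, 36, 30, 38]

treat_mean   = statistics.mean(a for a, d in zip(age, assigned) if d == 1)
control_mean = statistics.mean(a for a, d in zip(age, assigned) if d == 0)

# OLS of age on assigned recovers exactly these two quantities:
gamma0_hat = control_mean               # intercept = control-group mean
gamma1_hat = treat_mean - control_mean  # slope = difference of means

print(gamma0_hat, gamma1_hat)
```

Running the regression instead of computing the means directly gives the same estimates plus a standard error on γ̂1, which is what we use to judge statistical significance.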
We should keep statistical power in mind when we evaluate balance tests. As
discussed in Section 4.4, statistical power relates to the probability of rejecting
3 More advanced balance tests also allow us to assess whether the variance of a variable is the same
across treatment and control groups. See, for example, Imai (2005).
the null hypothesis when we should. Power is low in small data sets, since
when there are few observations, we are unlikely to find statistically significant
differences in treatment and control groups even when there really are differences.
In contrast, power is high for large data sets; that is, we may observe statistically
significant differences even when the actual differences are substantively small.
Hence, balance tests are sensitive not only to whether there are differences across
treatment and control groups but also to the factors that affect power. We should
therefore be cautious in believing we have achieved balance in a small sample set,
and we should be sure to assess the substantive importance of any differences we
see in large samples.
What if the treatment and control groups differ for only one or two variables?
Such an outcome is not enough to indicate that randomization failed. Recall that
even when there is no difference between treatment and control groups, we will
reject the null hypothesis of no difference 5 percent of the time when α = 0.05.
Thus, for example, if we look at 20 variables, it would be perfectly natural for
the means of the treatment and control groups to differ statistically significantly
for one of those variables.
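The arithmetic behind this point: at α = 0.05, running 20 balance tests yields one false rejection on average, and (treating the tests as independent for illustration) the chance of at least one is well over half. In Python:

```python
alpha = 0.05  # significance level for each balance test
k = 20        # number of variables tested

expected_false_rejections = k * alpha        # one false rejection on average
p_at_least_one = 1 - (1 - alpha) ** k        # chance of at least one

print(expected_false_rejections, round(p_at_least_one, 2))  # 1.0 and about 0.64
```

In a real sample the tests are not independent, but the basic message stands: an isolated significant difference is expected, not alarming.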
Good results on balancing tests also suggest (without proving) that balance
has been achieved even on the variables we can’t measure. Remember, the key to
experiments is that no unmeasured factor in the error term is correlated with the
independent variable. Given that we cannot see the darn things in the error term,
it seems a bit unfair to expect us to have any confidence about what’s going on in
there. However, if balance has been achieved for everything we can observe, we
can reasonably (albeit cautiously) speculate that the treatment and control groups
are also balanced for factors we cannot observe.
Yi = β0 + β1 Treatmenti + β2 Agei + εi
For example, suppose we are analyzing an experiment in which job training was
randomly assigned within a certain population. In assessing whether the training
helped people get jobs, we would not want to control for test scores measured after
the treatment because the scores could have been affected by the training. Since
part of the effect of treatment may be captured by this post-treatment variable,
including such a post-treatment variable will muddy the analysis.
REMEMBER THIS
1. Experimental treatment and control groups are balanced if the average values of independent
variables are not substantially different for people assigned to treatment and control groups.
2. We check for balance by conducting difference of means tests for all possible independent
variables.
3. When we assess the effect of a treatment, it is a good idea to control for imbalanced variables.
Healthit = β0 + β1 Aidit + β2 Xit + εit

where Healthit is the health of person i at time t, Aidit is the amount of foreign aid
going to person i’s village at time t, and Xit represents one or more variables that
affect the health of person i at time t. The problem is that the error may be correlated
with aid. Aid may flow to places where people are truly needy, with economic and
social problems that go beyond any simple measure of poverty. Or resources may
flow to places that are actually better off and better able to attract attention than
simple poverty statistics would suggest.
In other words, aid is probably endogenous. And because we cannot know if
aid is positively or negatively correlated with the error term, we have to admit that
we don’t know whether the actual effects are larger or smaller than what we observe
with the observational analysis. That’s not a particularly satisfying study.
If the government resources flowed exogenously, however, we could analyze
health and other outcomes and be much more confident that we are measuring
the effect of the aid. One example of a confidence-inspiring study is the Progresa
experiment in Mexico, described in Gertler (2004). In the late 1990s the Mexican
government wanted to run a village-based health care program but realized it
did not have enough resources to cover all villages at once. The government
decided the fairest way to pick villages was to pick them randomly, and voila! an
experiment was born. Government authorities randomly selected 320 villages as
treatment cases and implemented the program there. The Mexican government
also monitored 185 control villages, where no new program was implemented. In
the program, eligible families received a cash transfer worth about 20 to 30 percent
of household income if they participated in health screening and education
activities, including immunizations, prenatal visits, and annual health checkups.
Before assessing whether the treatment worked, analysts needed to assess
whether randomization worked. Were villages indeed selected randomly, and if
so, were they similar with regard to factors that could influence health? Table 10.1
provides results for balancing tests for the Progresa program. The first column has
the γ̂0 estimates from Equation 10.2 for various X variables. These are the averages
of the variable in question for the young children in the control villages. The second
column displays the γ̂1 estimates, which indicate how much higher or lower the
TABLE 10.1 Balancing Tests for the Progresa Experiment: Difference of Means Tests
Using OLS
Dependent variable                         γ̂0      γ̂1      t stat (γ̂1)   p value (γ̂1)
11. Male daily wage rate (pesos) 31.22 −0.74 0.90 0.37
12. Female daily wage rate (pesos) 27.84 −0.58 0.69 0.49
Results from 12 different OLS regressions in which the dependent variable is as listed at left. The coefficients are
from the model Xi = γ0 + γ1 Treatmenti + νi (see Equation 10.2).
average of the variable in question is for children in the treatment villages. For
example, the first line indicates that the children in the treatment village were 0.01
year older than the children in the control village. The t statistic is very small for
this coefficient and the p value is high, indicating that this difference is not at all
statistically significant. For the second row, the male variable equals 1 for boys and
0 for girls. The average of this variable indicates the percent of the sample that
were boys. In the control villages, 49 percent of the children were males; 51 percent
(γ̂0 + γ̂1 ) of the children in the treatment villages were male. This 2 percent difference
is statistically significant at the 0.10 level (given that p < 0.10). The most statistically
significant difference we see is in mother’s years of education, for which the p value
is 0.06. In addition, houses in the treatment group were less likely to have electricity
(p = 0.09).
The study author took the results to indicate that balance had been achieved.
We see, though, that achieving balance is an art, rather than a science, because
for 12 variables, only one or perhaps two would be expected to be statis-
tically significant at the α = 0.10 level if there were, in fact, no differences
across the groups. These imbalances should not be forgotten; in this case, the
analysts controlled for all the listed variables when they estimated treatment
effects.
And by the way, did the Progresa program work? In a word, yes. Results from
difference of means tests revealed that kids in the treatment villages were sick less
often, taller, and less likely to be anemic.
4 Researchers in this area are careful to analyze only students who actually applied for the vouchers.
This is because the students (and parents) who apply for vouchers for private schools almost certainly
differ systematically from students (and parents) who do not.
FIGURE 10.1 Treatment assignment and compliance. Random assignment splits
subjects into Zi = 1 and Zi = 0. For Zi = 1, compliance is non-random: Ti = 1 for
compliers and Ti = 0 for non-compliers. For Zi = 0, compliance is unobserved and
Ti = 0.
lines in Figure 10.1 indicate that we can’t know who among the control group are
would-be compliers and would-be non-compliers.5
We can see the mischief caused by non-compliance when we think about
how to compare treatment and control groups in this context. We could compare
the students who actually went to the private school (Ti = 1) to those who didn’t
(Ti = 0). Note, however, that the Ti = 1 group includes only compliers—students
who, when given the chance to go to a private school, took it. These students
are likely to be more academically ambitious than the non-compliers. The Ti =
0 group includes non-compliers (for whom Zi = 1) and those not assigned
to treatment (for whom Zi = 0). This comparison likely stacks the deck in
favor of finding that the private schools improve test scores because this Ti =
1 group has a disproportionately high proportion of educationally ambitious
students.
Another option is to compare the compliers (the Zi = 1 and Ti = 1 students)
to the whole control group (the Zi = 0 students). This method, too, is problematic.
The control group has two types of students—would-be compliers and would-be
non-compliers—while the treatment group in this approach only has compliers.
Any differences found with this comparison could be attributed either to the effect
of the private school or to the absence of non-compliers from the complier group,
whereas the control group includes both complier types and non-complier types.
5 An additional wrinkle in the real world is that people from the control group may find a way to
receive the treatment without being assigned to treatment. For example, in the New York voucher
experiment just discussed, 5 percent of the control group ended up in private schools without having
received a voucher.
10.2 Compliance and Intention-to-Treat Models
Intention-to-treat models

A better approach is to conduct an intention-to-treat (ITT) analysis. To conduct an
ITT analysis, we compare the means of those assigned treatment (the whole Zi = 1
group, which consists of those who complied and those who did not comply with
the treatment) to those not assigned treatment (the Zi = 0 group, which consists
of would-be compliers and would-be non-compliers). The ITT approach sidesteps
non-compliance endogeneity at the cost of producing estimates that are statistically
conservative (meaning that we expect the estimated coefficients to be smaller
than the actual effect of the treatment).

intention-to-treat (ITT) analysis: ITT analysis addresses potential endogeneity
that arises in experiments owing to non-compliance. We compare the means of
those assigned treatment and those not assigned treatment, irrespective of whether
the subjects did or did not actually receive the treatment.

To understand ITT, let's start with the non-ITT model we really care about:

Yi = β0 + β1 Treatmenti + εi (10.4)

Yi = δ0 + δ1 Zi + νi (10.5)
the individuals who, being more academically ambitious kids, may have been
more likely to use the private school vouchers. ITT avoids this problem by
comparing all kids given a chance to use the vouchers to all kids not given that
chance.
ITT is not costless, however. When there is non-compliance, ITT will under-
estimate the treatment effect. This means the ITT estimate, δ̂1, is a lower-bound
estimate of β1, the effect of the treatment itself from Equation 10.4. In other
words, we expect the magnitude of the δ̂1 estimate from Equation 10.5 to be
smaller than or equal to the β1 parameter in Equation 10.4.
To see why, consider the two extreme possibilities: zero compliance and full
compliance. If there is zero compliance, such that no one assigned treatment
complied (Ti = 0 for all Zi = 1), then δ1 = 0 because there is no difference
between the treatment and control groups. (No one took the treatment!) At the
other extreme, if everyone assigned treatment (Zi = 1) also complied (Ti = 1),
then the Treatmenti variable in Equation 10.4 will be identical to Zi (treatment
assignment) in Equation 10.5. In this instance, β̂1 will be an unbiased estimator
of β1 because there are no non-compliers messing up the exogeneity of the
random experiment. In this case, β̂1 = δˆ1 because the variables in the models are
identical.
Hence, we know that the ITT estimate δ̂1 is going to be somewhere
between zero and an unbiased estimator of the true treatment effect. The lower
the compliance, the more the ITT estimate will be biased toward zero. The
ITT estimator is still preferable to β̂1 from a model with treatment received
when there are non-compliance problems; this is because β̂1 can be biased
when compliers differ from non-compliers, causing endogeneity to enter the
model.
The ITT approach is a cop-out, but in a good way. When we use it, we’re
being conservative in the sense that the estimate will be prone to underestimate
the magnitude of the treatment effect. If the ITT approach reveals an effect, it will
be due to treatment, not to endogenous non-compliance issues.
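To see this attenuation concretely, here is a small simulation sketch (all numbers are hypothetical, not from any study in the chapter): the true treatment effect is 10, but only 40 percent of those assigned treatment comply, so the ITT difference in means by assignment lands near 0.4 × 10 = 4, well below the true effect.

```python
import random

random.seed(7)
n = 100_000
true_effect = 10.0    # hypothetical beta_1 from Equation 10.4
compliance = 0.4      # hypothetical share of the assigned group that complies

y_assigned, y_control = [], []
for _ in range(n):
    z = random.random() < 0.5                   # random treatment assignment
    t = z and (random.random() < compliance)    # treatment actually received
    y = 50 + true_effect * t + random.gauss(0, 5)
    (y_assigned if z else y_control).append(y)

# ITT estimate: difference in means by *assignment* (delta_1 in Equation 10.5)
itt = sum(y_assigned) / len(y_assigned) - sum(y_control) / len(y_control)
# itt lands near compliance * true_effect = 4, not near true_effect = 10
```

The simulation also illustrates the two extremes discussed above: set `compliance = 0` and the ITT estimate goes to zero; set `compliance = 1` and it recovers the full treatment effect.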
Researchers regularly estimate ITT effects. Sometimes whether someone did
or did not comply with a treatment is not known. For example, if the experimenter
mailed advertisements to randomly selected households, it will be very hard, if
not impossible, to know who actually read the ads (Bailey, Hopkins, and Rogers
2015).
Or sometimes the ITT effect is the most relevant quantity of interest.
Suppose, for example, we know that compliance will be spotty and we want
to build non-compliance into our estimate of a program’s effectiveness. Miguel
and Kremer (2004) analyzed an experiment in Kenya that provided medical
treatment for intestinal worms to children at randomly selected schools. Some
children in the treated schools, however, missed school the day the medicine
was administered. An ITT analysis in this case compares kids assigned to
treatment (whether or not they were in school on that day) to kids not assigned
to treatment. Because some kids will always miss school for a treatment like this,
policy makers may care more about the ITT estimated effect of the treatment
10.2 Compliance and Intention-to-Treat Models 345
because ITT takes into account both the treatment effect and the less-than-perfect
compliance.
REMEMBER THIS
1. In an experimental context, a person assigned to receive a treatment who actually receives the
treatment is said to comply with the treatment.
2. When compliers differ from non-compliers, non-compliance creates endogeneity.
3. ITT analysis compares people assigned to treatment (whether they complied or not) to people
in the control group.
• ITT is not vulnerable to endogeneity due to non-compliance.
• ITT estimates will be smaller in magnitude than the true treatment effect. The more
numerous the non-compliers, the closer to zero the ITT estimates will be.
Discussion Questions
1. Will there be balance problems if there is non-compliance? Why or why not?
2. Suppose there is non-compliance but no signs of balance problems. Does this mean the
non-compliance must be harmless? Why or why not?
3. For each of the following scenarios, discuss (i) whether non-compliance is likely to be an issue,
(ii) the likely implication of non-compliance for comparing those who received treatment to the
control group, and (iii) what exactly an ITT variable would consist of.
(a) Suppose an international aid group working in a country with low literacy rates randomly
assigned children to a treatment group that received one hour of extra reading help each
day and a control group that experienced only the standard curriculum. The dependent
variable is a reading test score after one year.
(b) Suppose an airline randomly upgraded some economy class passengers to business class.
The dependent variable is satisfaction with the flight.
(c) Suppose the federal government randomly selected a group of school districts that could
receive millions of dollars of aid for revamping their curriculum. The control group
receives nothing from the program. The dependent variable is test scores after three years.
346 CHAPTER 10 Experiments: Dealing with Real-World Challenges
where Turnouti equals 1 for people who voted and 0 for those who did
not.6 The independent variable is whether or not someone was contacted by a
campaign.
What is in the error term? Certainly, political interest will be in it, because more
politically attuned people are more likely to vote. We'll have endogeneity if
political interest (incorporated in the error term) is correlated with contact by a
campaign (the independent variable). We will probably have endogeneity because
campaigns do not want to waste time contacting people who won’t vote. Hence,
we’ll have endogeneity unless the campaign is incompetent (or, ironically, run by
experimentalists).
Such endogeneity could corrupt the results easily. Suppose we find a positive
association between campaign contact and turnout. We should worry that the
relationship is due not to the campaign contact but to the kind of people who
were contacted—namely, those who were more likely to vote before they were
contacted. Such concerns make it very hard to analyze campaign effects with
observational data.
6 The dependent variable is a dichotomous variable. We discuss such dependent variables in more detail in Chapter 12.
10.3 Using 2SLS to Deal with Non-compliance 347
Professors Alan Gerber and Don Green (2000, 2005) were struck by these
problems with observational studies and have almost single-handedly built an
empire of experimental studies in American politics.7 As part of their signature
study, they randomly assigned citizens to receive in-person visits from a get-
out-the-vote campaign. In their study, all the factors that affect turnout would be
uncorrelated with assignment to receive the treatment.8
Compliance is a challenge in such studies. When campaign volunteers
knocked on doors, not everyone answered. Some people weren’t home. Some were
in the middle of dinner. Maybe a few ran out the back door screaming when they
saw a hippie volunteer ringing their doorbell.
Non-compliance, of course, could affect the results. If the more socially
outgoing types answered the door (hence receiving the treatment) and the more
reclusive types did not (hence not receiving the treatment even though they were
assigned to it), the treatment variable as delivered would depend not only on the
random assignment but also on how outgoing a person was. If more outgoing
people are more likely to vote, then treatment as delivered will be correlated with
the sociability of the experimental subject, and we will have endogeneity.
To get around this problem, Gerber and Green used treatment assignment as
an instrument. This variable, which we’ve been calling Zi , indicates whether a
person was randomly selected to receive a treatment. This variable is well suited
to satisfy the requisite conditions for a good instrument discussed in Section 9.2.
First, Zi should be included in the first stage because being randomly assigned to
be contacted by the campaign does indeed increase campaign contact. Table 10.2
shows the results from the first stage of Gerber and Green’s turnout experiment.
The dependent variable, treatment delivered, is 1 if the person actually talked to
the volunteer canvasser and 0 otherwise. The independent variable is whether the
person was or was not assigned to treatment.
TABLE 10.2 First Stage of the Turnout Experiment (excerpt; the coefficient on treatment assignment, 0.279, is discussed below)

Constant    0.000
            (0.000)
            [t = 0.00]
N           29,380
7 Or should we say double-handedly? Or, really, quadruple-handedly?
8 The study also looked at other campaign tactics, such as phone calls and mailing postcards. These didn't work as well as the personal visits; for simplicity, we focus on the in-person visits.
These results suggest that 27.9 percent of those assigned to be visited were
actually visited. In other words, 27.9 percent of the treatment group complied
with the treatment. This estimate is hugely statistically significant, in part owing
to the large sample size. The intercept is 0.0, implying that no one in the
non-contact-assigned group was contacted by this particular get-out-the-vote
campaign.
The treatment assignment variable Zi is also highly likely to satisfy the 2SLS
exclusion condition because it affects Y only through people actually getting
campaign contact. Being assigned
to be contacted by the campaign in and of itself does not affect turnout. Note that
we are not saying that the people who actually complied (received a campaign
contact) are random; all the concerns about compliance just discussed come into
play here. We are simply saying that when we put a check
next to randomly selected names indicating that they should be visited, these folks
were indeed randomly selected. That means that Z is uncorrelated with the error term and can
therefore be excluded from the main equation.
In the second-stage regression, we use the fitted values from the first-stage
regression as the independent variable. Table 10.3 shows that the effect of a
personal visit is to increase probability of turning out to vote by 8.7 percentage
points. This estimate is statistically significant, as we can see from the t stat, 3.34.
We could improve the precision of the estimates by adding covariates, but doing
so is not necessary to avoid bias.
          Assigned contact (Z)   Actual contact (T)   Fitted contact (T̂)
Laura     1                      1                    0.279
Bryce     1                      0                    0.279
Gio       0                      0                    0.000
This selection was randomly determined. In the second column is actual contact,
which is observed contact by the campaign. Laura answered the door when the
campaign volunteer knocked, but Bryce did not. (No one went to poor Gio’s door.)
The third column displays the fitted value from the first-stage equation for the
treatment variable. These fitted values depend only on contact assignment. Laura
and Bryce were randomly assigned to be visited (Z = 1), so both their fitted values
were T̂ = 0.0 + 0.279 × 1 = 0.279 even though Laura was actually contacted and
Bryce wasn't. Gio was not assigned to be visited (Z = 0), so his fitted contact
value was T̂ = 0.0 + 0.279 × 0 = 0.0.
2SLS uses the “contact-fitted” (T̂) variable. It is worth taking the time to really
understand T̂, which might be the weirdest thing in the whole book.9 Even though
Bryce was not contacted, his T̂i is 0.279, just the same as Laura, who was in fact
visited. Clearly, this variable looks very different from actual observed campaign
contact. Yes, this is odd, but it’s a feature, not a bug. The core inferential problem,
as we’ve noted, is endogeneity in actual observed contact. Bryce might be avoiding
contact because he loathes politics. That’s why we don’t want to use observed
contact as a variable—it would capture not only the effect of contact but also the
fact that the type of people who get contact in observational data are different. The
fitted value, however, varies only according to Z—something that is exogenous.
In other words, by looking at the bump up in expected contact associated with
being in the randomly assembled contact-assigned group, we have isolated the
exogenous bump up in contact associated with the exogenous factor and can assess
whether it is associated with a corresponding bump up in voting turnout.
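The logic of the last few paragraphs can be sketched in a short simulation (all numbers hypothetical, not Gerber and Green's actual data). Unobserved sociability drives both answering the door and voting, so a raw treated-versus-untreated comparison would be biased; dividing the ITT difference by the first-stage compliance rate, which is what 2SLS amounts to with a single binary instrument, recovers the assumed effect.

```python
import random

random.seed(0)
n = 50_000
true_effect = 0.087   # hypothetical effect of an actual visit on turnout

rows = []
for _ in range(n):
    z = 1 if random.random() < 0.5 else 0        # randomly assigned a visit
    social = 1 if random.random() < 0.5 else 0   # unobserved sociability
    # outgoing people answer the door more often: endogenous compliance
    t = 1 if z == 1 and random.random() < 0.28 + 0.2 * social else 0
    vote = 1 if random.random() < 0.45 + true_effect * t + 0.10 * social else 0
    rows.append((z, t, vote))

assigned = [r for r in rows if r[0] == 1]
control = [r for r in rows if r[0] == 0]

# First stage: compliance rate among the assigned (the intercept is zero here
# because no one in the control group was visited)
first_stage = sum(r[1] for r in assigned) / len(assigned)

# ITT: difference in turnout by *assignment*
itt = (sum(r[2] for r in assigned) / len(assigned)
       - sum(r[2] for r in control) / len(control))

# IV estimate = ITT / first stage; with one binary instrument and no
# covariates, this ratio equals the 2SLS estimate
iv_estimate = itt / first_stage
```

In practice we would use canned 2SLS routines (as in the Computing Corner) to get correct standard errors; this sketch only shows why the fitted-value logic removes the bias from endogenous door answering.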
REMEMBER THIS
1. 2SLS is useful for analyzing experiments when there is imperfect compliance with the
experimental treatment.
2. Assignment to treatment typically satisfies the inclusion and exclusion conditions necessary
for instruments in 2SLS analysis.
9 Other than the ferret thing in Chapter 3—also weird.
where Arrested later is 1 if the person is arrested at some later date for domestic
violence and 0 otherwise, Arrested initially is 1 if the suspect was arrested at the
time of the initial domestic violence report and 0 otherwise, and X refers to other
variables, such as whether a weapon or drugs were involved in the first incident.
Why might there be endogeneity? (That is, why might we suspect a cor-
relation between Arrested initially and the error term?) Elements in the error
term include person-specific characteristics. Some people who have police called
on them are indeed nasty; let’s call them the bad eggs. Others are involved
in a once-in-a-lifetime incident; in the overall population of people who have
police called on them, they are the (relatively) good eggs. Such personality
traits are in the error term of the equation predicting domestic violence in the
future.
We could also easily imagine that people’s good or bad eggness will affect
whether they are arrested initially. Police who arrive at the scene of a domestic
violence incident involving a bad egg will, on average, find more threat; police who
arrive at the scene of an incident involving a (relatively) good egg will likely find
the environment less threatening. We would expect police to arrest the bad egg
types more often, and we would expect these folks to have more problems in the
future. Observational data could therefore suggest that arrests make things worse
because those arrested are more likely to be bad eggs and therefore more likely to
be rearrested.
N 314

               Assigned arrest (Z)   Actual arrest (T)   Fitted arrest (T̂)
Observation 1  1                     1                   0.989
Observation 2  1                     0                   0.989
Observation 3  0                     1                   0.216
Observation 4  0                     0                   0.216
The first column shows that OLS estimates a decrease of 7 percentage points in
probability of a rearrest later. The independent variable was whether someone was
actually arrested. This group includes people who were randomly assigned to be
arrested and people in the no-arrest-assigned treatment group who were arrested
anyway. We worry about bias when we use this variable because we suspect that
the bad eggs were more likely to get arrested.10
The second column shows that ITT estimates being assigned to the arrest
treatment lowers the probability of being arrested later by 10.8 percentage points.
This result is more negative than the OLS estimate and is statistically significant. The
ITT model avoids endogeneity because treatment assignment cannot be correlated
with the error term. The approach will understate the true effect when there was
non-compliance, either because some people not assigned to the treatment got it
or because not everyone who was assigned to the treatment actually received it.
The third column shows the 2SLS results. In this model, the independent
variable is the fitted value of the treatment. The estimated coefficient on arrest is
even more negative than the ITT estimate, indicating that the probability of rearrest
for individuals who were arrested is 14 percentage points lower than for individuals
who were not initially arrested. The magnitude is double the effect estimated by
OLS. This result implies that Minneapolis can on average reduce the probability of
another incident by 14 percentage points by arresting individuals on the initial call.
2SLS is the best model because it accounts for non-compliance and provides an
unbiased estimate of the effect that arresting someone initially has on likelihood of
a future arrest.

10 The OLS model reported here is still based on partially randomized data because many people were arrested owing to the randomization in the police protocol. If we had purely observational data with no randomization, the bias of OLS would be worse, as it's likely that only bad eggs would have been arrested.
This study was quite influential and spawned similar investigations elsewhere;
see Berk, Campbell, Klap, and Western (1992) for more details.
10.4 Attrition
where Attritioni equals 1 for observations for which we do not observe the
dependent variable and equals 0 when we observe the dependent variable. A
statistically significant δˆ1 would indicate differential attrition across treatment and
control groups.
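A minimal sketch of this diagnostic, with made-up data in which weaker students drop out of the treatment group more often: the difference in attrition rates across groups plays the role of δ̂1, and a rough z statistic for a difference in proportions flags differential attrition.

```python
import math
import random

random.seed(1)
n = 4_000
data = []
for _ in range(n):
    treat = random.random() < 0.5
    weak = random.random() < 0.3
    # hypothetical: weak students in the treatment group drop out more often
    attrit = random.random() < (0.15 if treat and weak else 0.05)
    data.append((treat, attrit))

treat_attrit = [a for tr, a in data if tr]
control_attrit = [a for tr, a in data if not tr]
p_t = sum(treat_attrit) / len(treat_attrit)
p_c = sum(control_attrit) / len(control_attrit)
delta1_hat = p_t - p_c    # difference in attrition rates across groups

# rough z statistic for a difference in proportions
p_pool = (sum(treat_attrit) + sum(control_attrit)) / n
se = math.sqrt(p_pool * (1 - p_pool)
               * (1 / len(treat_attrit) + 1 / len(control_attrit)))
z = delta1_hat / se    # a large z flags differential attrition
```

Running the attrition regression described above (attrition on treatment assignment) gives the same difference in rates as its slope coefficient; the hand calculation here just makes the diagnostic transparent.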
We can add some nuance to our evaluation of attrition by looking for
differential attrition patterns in the treatment and control groups. Specifically,
we can investigate whether the treatment variable interacted with one or more
covariates in a model explaining attrition. In our analysis of a randomized charter
school experiment, we might explore whether high test scores in earlier years were
associated with differential attrition in the treatment group. If we use the tools for
interaction variables discussed in Section 6.4, the model would be
One approach here would be to trim the control group by removing another 5 percent
of the weakest students before doing our analysis so that both groups in the data
now have 10 percent attrition rates. This practice is statistically conservative in the
sense that it makes it harder to observe a statistically significant treatment effect
because it is unlikely that literally all of those who dropped out from the treatment
group were the worst students.
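The trimming step can be sketched as follows (scores and attrition rates are hypothetical): if the treatment group has lost 10 percent of its students but the control group has lost only 5 percent, we drop the weakest remaining 5 percent of the control group so both groups have equal attrition rates before comparing means.

```python
import random

random.seed(2)
# hypothetical post-attrition control-group test scores
control_scores = sorted(random.gauss(70, 10) for _ in range(1_000))

extra_attrition = 0.05                      # gap in attrition rates
drop = int(extra_attrition * len(control_scores))
trimmed_control = control_scores[drop:]     # remove the weakest 5 percent

# conservative comparison: treatment-group mean vs. trimmed control mean
trimmed_mean = sum(trimmed_control) / len(trimmed_control)
```

Because dropping the weakest control students raises the control mean, any remaining treatment effect estimated against the trimmed group is, as the text notes, statistically conservative.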
selection model Simultaneously accounts for whether we observe the dependent variable and what the dependent variable is.

A third approach to attrition is to use a selection model. The most famous
selection model is the Heckman selection model (Heckman 1979). In this approach, we
would model both the process of being observed (a dichotomous
variable equaling 1 for those for whom we observe the dependent variable and
0 for others) and the outcome (the model with the dependent variable of interest,
such as test scores). These models build on the probit model we shall discuss in
Chapter 12. More details are in the Further Reading section at the end of this
chapter.
REMEMBER THIS
1. Attrition occurs when individuals drop out of an experiment, causing us to lack outcome data
for them.
2. Non-random attrition can cause endogeneity even when treatment is randomly assigned.
3. We can detect problematic attrition by looking for differences in attrition rates across treated
and control groups.
4. Attrition can be addressed by using multivariate OLS, trimmed data sets, or selection
models.
Discussion Question
Suppose each of the following experimental populations suffered from attrition. Speculate on
the likely implications of not accounting for attrition in the analysis.
(a) Researchers were interested in the effectiveness of a new drug designed to lower
cholesterol. They gave a random set of patients the drug; the rest got a placebo pill.
(b) Researchers interested in rehabilitating former prisoners randomly assigned some
newly released individuals to an intensive support group. The rest received no such
access. The dependent variable was an indicator for returning to prison within five
years.
plans, which would mean the people in the generous plans would be healthier than
others. Here, too, the diabetes in the error term would be correlated with the type
of health plan, although in the other direction.
Thus, we have a good candidate for a randomized experiment, which is exactly
what ambitious researchers at RAND Corporation designed in the 1970s. They
randomly assigned people to various health plans, including a free plan that
covered medical care at no cost and various cost-sharing plans that had different
levels of co-payments. With randomization, the type of people assigned to a free
plan should be expected to be the same as the type of people assigned to a
cost-sharing plan. The only expected difference between the groups should be their
health plans; hence, to the extent that the groups differed in utilization or health
outcomes, the differences could be attributed to differences in the health plans.
The RAND researchers found that medical expenses were 45 percent higher for
people in plans with no out-of-pocket medical expenses than for those who had
stingy insurance plans (which required people to pay 95 percent of costs, up to a
$1,000 yearly maximum). In general, health outcomes were no worse for those in
the stingy plans.11 This experiment has been incredibly influential—it is the reason
we pay $10 or whatever when we check out of the doctor’s office.
Attrition is a crucial issue in evaluating the RAND experiment. Not everyone
stayed in the experiment. Inevitably in such a large study, some people moved,
some died, and others opted out of the experiment because they were unhappy
with the plan in which they were randomly placed. The threat to the validity of this
experiment is that this attrition may have been non-random. If the type of people
who stayed with one plan differed systematically from the type of people who
stayed with another plan, comparing health outcomes or utilization rates across
these groups may be inappropriate, given that the groups differ both in their health
plans and in the type of people who remain in the wake of attrition.
Aron-Dine, Einav, and Finkelstein (2013) reexamined the RAND data in light of
attrition and other concerns. They showed that 1,894 people had been randomly
assigned to the free plan. Of those, 114 (6 percent) were non-compliers who
declined to participate. Of the remainder who participated, 89 (5 percent) left the
experiment. These low numbers for non-compliance and attrition are not very sur-
prising. The free plan was gold plated, covering everything. The cost-sharing plan
requiring the highest out-of-pocket expenditures had 1,121 assigned participants.
Of these, 269 (24 percent) declined the opportunity to participate, and another
145 (13 percent) left the experiment. These patterns contrast markedly with the
non-compliance and attrition patterns for the free plan.
What kind of people would we expect to leave a cost-sharing plan? Probably
people who ended up paying a lot of money under the plan. And what kind
of people would end up paying a lot of money under a cost-sharing plan? Sick
people, most likely. So that means we have reason to worry that the free plan
had all kinds of people, but that the cost-sharing plans had a sizable hunk of
sick people who pulled out. So any finding that the cost-sharing plans yielded
the same health outcomes could have one of two causes: the plans did not
have different health impacts, or the free plan was better but had a sicker
population.

11 Outcomes for people in the stingy plans were worse for some subgroups and some conditions, however, leading the researchers to suggest programs targeted at specific conditions rather than providing fee-free service for all health care.
Aron-Dine, Einav, and Finkelstein (2013) therefore conducted an analysis on
a trimmed data set based on techniques from Lee (2009). They dropped the
highest spenders in the free-care plan until they had a data set with the same
proportion of observations from those assigned to the free plan and to the costly
plan. Comparing these two groups is equivalent to assuming that those who
left the costly plan were the patients requiring the most expensive care; since
this is unlikely to be completely true, the results from such a comparison are
considered a lower bound—actual differences between the groups would be
larger if some of the people who dropped out from the costly plan were not
among the most expensive patients. The results indicated that the effect of the
cost-sharing plan was still negative, meaning that it lowered expenditures. However,
the magnitude of the effect was less than the magnitude reported in the initial
study, which did little to account for differential attrition across the various types
of plans.
Review Questions
Consider a hypothetical experiment in which researchers evaluated a program that paid teachers a
substantial bonus if their students’ test scores rose. The researchers implemented the program in 50
villages and also sought test score data in 50 randomly selected villages.
Table 10.8 on the next page provides results from regressions using data available to the
researchers. Each column shows a bivariate regression in which Treatment was the independent
variable. This variable equaled 1 for villages where teachers were paid for student test scores and
0 for the control villages.
Researchers also had data on average village income, village population, and whether or not test
scores were available (a variable that equals 1 for villages that reported test scores and 0 for villages
that did not report test scores.)
1. Is there a balance problem? Use specific results in the table to justify your answer.
2. Is there an attrition problem? Use specific results in the table to justify your answer.
3. Did the treatment work? Justify your answer based on results here, and discuss what, if any,
additional information you would like to see.
TABLE 10.8 Regression Results for Models Relating Teacher Payment Experiment
(for Review Questions)

                               Dependent Variable
            Test scores   Village population   Village income   Test score availability
Treatment   24.0∗         −20.00               500.0∗           0.20∗
            (8.00)        (100.0)              (200.0)          (0.08)
            [t = 3.00]    [t = 0.20]           [t = 2.50]       [t = 2.50]
12 Greene went on to get only 28 percent of the vote in the general election but vowed to run for president anyway.
10.5 Natural Experiments 361
Ideally, we would randomize which candidates are listed first to see whether those at the
top of the ballot do better. Conceptually, that's not too hard, but practically, it is a lot to
ask, given that election officials are pretty protective of how they run elections.
In the 1998 Democratic primary in New York City, however, election officials
decided on their own to rotate the order of candidates’ names by precinct. Political
scientists Jonathan Koppell and Jennifer Steen got wind of this decision and
analyzed the election as a natural experiment. Their 2004 paper found that in
71 of 79 races, candidates received more votes in precincts where they were
listed first. In seven of those races, the differences were enough to determine the
election outcome. That’s pretty good work for an experiment the researchers didn’t
even set up.
Researchers have found other clever opportunities for natural experiments.
An important question is whether economic stimulus packages of tax cuts and
government spending increases that were implemented in response to the 2008
recession boosted growth. At a first glance, such analysis should be easy. We know
how much the federal government cut taxes and increased spending. We also know
how the economy performed. Of course things are not so simple because, as former
chair of the Council of Economic Advisers Christina Romer (2011) noted, “Fiscal
actions are often taken in response to other things happening in the economy.”
When we look at the relationship between two variables, like consumer spending
and the tax rebate, we “need to worry that a third variable, like the fall in wealth,
is influencing both of them. Failing to take account of this omitted variable leads
to a biased estimate of the relationship of interest.”
One way to deal with this challenge is to find exogenous variation in stimulus
spending that is not correlated with any of the omitted variables we worry about.
This is typically very hard, but sometimes natural experiments pop up. For
example, Parker, Souleles, Johnson, and McClelland (2013) noted that the 2008
stimulus consisted of tax rebate checks that were sent out in stages according to
the last two digits of recipients’ Social Security numbers. Thus, the timing was
effectively random for each family. After all, the last two digits are essentially
randomly assigned to people when they are born. This means that the timing of
the government spending by family was exogenous. An analyst’s dream come
true! The researchers found that family spending among those that got a check
was almost $500 higher than among those who did not, bolstering the case that the fiscal
stimulus boosted consumer spending.
REMEMBER THIS
1. In a natural experiment, the values of the independent variable have been determined by a
random, or at least exogenous, process.
2. Natural experiments are widely used and can be analyzed with OLS, 2SLS, or other
tools.
Conclusion
Experiments are incredibly promising for statistical inference. To find out if X
causes Y, do an experiment. Change X for a random subset of people. Compare
what happens to Y for the treatment and control groups. The approach is simple,
elegant, and has been used productively countless times.
For all their promise, though, experiments are like movie stars—idealized by
many but tending to lose some luster in real life. Movie stars’ teeth are a bit yellow,
and they aren’t particularly witty without a script. By the same token, experiments
don’t always achieve balance; they sometimes suffer from non-compliance and
attrition; and in many circumstances they aren’t feasible, ethical, or generalizable.
For these reasons, we need to take particular care when examining experi-
ments. We need to diagnose and, if necessary, respond to ABC issues (attrition,
balance, and compliance). Every experiment needs to assess balance to ensure
that the treatment and control groups do not differ systematically except for the
treatment. Many social science experiments also have potential non-compliance
problems since people can choose not to experience the randomly assigned
treatment. Non-compliance can induce endogeneity if we use Treatment delivered
as the independent variable, but we can get back to unbiased inference if we use
ITT or 2SLS to analyze the experiment. Finally, at least some people invariably
leave the experiment, which can be a problem if the attrition is related to the
treatment. Attrition is hard to overcome but must be diagnosed, and if it is a
problem, we should at least use multivariate OLS or trimmed data to lessen the
validity-degrading effects.
The following steps provide a general guide to implementing and analyzing
a randomized experiment:
2. Randomly pick a subset of the population and give them the treatment.
The rest are the control group.
(a) Assess balance with difference of means tests for all possible
independent variables.
(b) If there are imbalances, use multivariate OLS, controlling for vari-
ables that are unbalanced across treatment and control groups.
(d) If there is attrition, use multivariate OLS, trim the data, or use a
selection model.
• Section 10.3: Explain how 2SLS can be useful for experiments with
imperfect compliance.
• Section 10.4: Explain how attrition can create endogeneity, and describe
some steps we can take to diagnose and deal with attrition.
Further Reading
Experiments are booming in the social sciences. Gerber and Green (2012) provide
a comprehensive guide to field experiments. Banerjee and Duflo (2011) give
an excellent introduction to experiments in the developing world, and Duflo,
Glennerster, and Kremer (2008) provide an experimental toolkit that’s useful for
experiments in the developing world and beyond. Dunning (2012) has published
a detailed guide to natural experiments. Manzi (2012) provides a readable critique of
randomized experiments in social science and business; Manzi
(2012, 190) refers to a 2008 report to Congress that identified policies that
demonstrated significant results in randomized field trials.
Attrition is one of the harder things to deal with, and different analysts take
different approaches. Gerber and Green (2012, 214) discuss their approaches
to dealing with attrition. The large literature on selection models includes, for
example, Das, Newey, and Vella (2003). Some experimentalists resist using
selection models because those models rely heavily on assumptions about the
distributions of error terms and functional form.
Imai, King, and Stuart (2008) discuss how to use blocking to get more
efficiency and less potential for bias in randomized experiments.
Key Terms
ABC issues (334)
Attrition (354)
Balance (336)
Blocking (335)
Compliance (340)
Intention-to-treat analysis (343)
Natural experiment (360)
Selection model (356)
Trimmed data set (355)
Computing Corner
Stata
1. To assess balance, estimate a series of bivariate regression models with
all X variables as dependent variables and treatment assignment as
independent variables:
reg X1 TreatmentAssignment
reg X2 TreatmentAssignment
R
1. To assess balance, estimate a series of bivariate regression models with
all “X” variables as dependent variables and treatment assignment as
independent variables:
lm(X1 ~ TreatmentAssignment)
lm(X2 ~ TreatmentAssignment)
Exercises
1. In an effort to better understand the effects of get-out-the-vote messages
on voter turnout, Gerber and Green (2005) conducted a randomized field
experiment involving approximately 30,000 individuals in New Haven,
Connecticut, in 1998. One of the experimental treatments was randomly
assigned in-person visits where a volunteer visited the person’s home and
encouraged him or her to vote. The file GerberGreenData.dta contains the
variables described in Table 10.10.
(c) Use ITT to estimate the effect of being assigned treatment on whether
someone turned out to vote. Is this estimate likely to be higher
or lower than the actual effect of being contacted? Is it subject to
endogeneity?
(d) Use 2SLS to estimate the effect of contact on voting. Compare the
results to the ITT results. Justify your choice of instrument.
(e) We can use ITT results and compliance rates to generate a Wald
estimator, which is an estimate of the treatment effects calculated by
dividing the ITT effect by the coefficient on the treatment assignment
variable in the first-stage model of the 2SLS model. (If no one in the
non-treatment-assignment group gets the treatment, this coefficient
will indicate the compliance rate; more generally, this coefficient
indicates the net effect of treatment assignment on probability of
treatment observed.) Calculate this quantity by using the results in
parts (b) and (c), and compare it to the 2SLS results. It helps to be as
precise as possible. Are they different? Discuss.
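To make the Wald logic concrete, here is an illustrative simulation (Python, with invented numbers; this is not the Gerber–Green data). Assignment is random, compliance depends on an unobserved trait, and dividing the ITT effect by the first-stage coefficient recovers the effect of contact even though a naive regression of turnout on contact is biased:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
assigned = rng.integers(0, 2, size=n)             # random treatment assignment
ability = rng.normal(size=n)                      # unobserved confounder
# One-sided noncompliance: more able people are more likely to be contacted.
contacted = assigned * (rng.random(n) < 1 / (1 + np.exp(-ability)))
turnout = 0.4 * contacted + 0.5 * ability + rng.normal(size=n)

def diff_in_means(y, d):
    """Coefficient on d from a bivariate regression of y on a dummy d."""
    return y[d == 1].mean() - y[d == 0].mean()

itt = diff_in_means(turnout, assigned)            # intention-to-treat effect
first_stage = diff_in_means(contacted, assigned)  # net effect of assignment on contact
wald = itt / first_stage                          # Wald estimator of the contact effect
naive = diff_in_means(turnout, contacted)         # biased: contact tracks ability
```

Here the Wald ratio lands near the true effect (0.4), while the naive estimate is inflated because compliers differ systematically from non-compliers.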
(g) Estimate a 2SLS model including controls for Ward 2 and Ward 3
residence and the number of people in the household. Do you expect
the results to differ substantially? Why or why not? Explain how the
first-stage results differ from the balance tests described earlier.
(a) Check balance in treatment versus control for all possible independent variables.
(c) Are the compliers different from the non-compliers? Provide evidence to support your answer.
(d) In the first round of the experiment, 805 participants were interviewed and assigned to either the treatment or the control condition.
After the program aired, 507 participants were re-interviewed about
the program. With only 63 percent of the participants re-interviewed,
what problems are created for the experiment?
(e) In this case, data (even pretreatment data) is available only for the
507 people who did not leave the sample. Is there anything we
can do?
3. In their 2004 paper “Are Emily and Greg More Employable than
Lakisha and Jamal? A Field Experiment on Labor Market Discrimination,”
Marianne Bertrand and Sendhil Mullainathan discuss the results
of their field experiment on randomizing names on job resumes. To
assess whether employers treated African-American and white applicants
similarly, they created fictitious resumes and randomly assigned
white-sounding names (e.g., Emily and Greg) to half of the resumes
and African-American-sounding names (e.g., Lakisha and Jamal) to
the other half. They sent these resumes in response to help-wanted
ads in Chicago and Boston and collected data on the number of
callbacks received. Table 10.11 describes the variables in the data set
resume_HW.dta.
education 0 = not reported; 1 = some high school; 2 = high school graduate; 3 = some college;
4 = college graduate or more
yearsexp Number of years of work experience
(a) What issues are associated with studying the effects of new schools
in Afghanistan that are not randomly assigned?
(d) On page 68, we noted that if errors are correlated, the standard
OLS estimates for the standard error of β̂ are incorrect. In this
(h) Calculate the effect on test scores of being in a treatment village,
controlling for age of child, sex, number of sheep family owns, length of
time family lived in village, farmer, years of education for household
head, number of people in household, and distance to nearest school.
Use the standard errors that account for within-village correlation of
errors. Is the coefficient on treatment substantially different from the
bivariate OLS results? Why or why not? Briefly note any control
variables that are significantly associated with higher test scores.
(i) Compare the sample size for the enrollment and test score data. What
concern does this comparison raise?
(j) Assess whether attrition was associated with treatment. Use the
standard errors that account for within-village correlation of errors.
11 Regression Discontinuity: Looking for Jumps in Data
[Figure: normalized grade plotted against age relative to the 21st-birthday cutoff.]
We’ve included fit lines to help make the pattern clear. Those who had not
yet turned 21 scored higher. There is a discontinuity at the zero point in the figure
(corresponding to students taking a test on their 21st birthday). If we can’t come
up with another explanation for test scores to change at this point, we have pretty
good evidence that drinking hurts grades.
regression discontinuity (RD) analysis  Techniques that use regression analysis to identify possible discontinuities at the point at which some treatment applies.

Regression discontinuity (RD) analysis formalizes this logic by using
regression analysis to identify possible discontinuities at the point of application
of the treatment. For the drinking age case, RD analysis involves fitting an OLS
model that allows us to see if there is a discontinuity at the point students become
legally able to drink.
RD analysis has been used in a variety of contexts in which a treatment of
interest is determined by a strict cutoff. Card, Dobkin, and Maestas (2009) used RD
analysis to examine the effect of Medicare on health because Medicare eligibility
kicks in the day someone turns 65. Lee (2008) used RD analysis to study the
effect of incumbency on reelection to Congress because incumbents are decided
by whoever gets more votes. Lerman (2009) used RD analysis to assess the effect
11.1 Basic RD Model
Yi = β0 + β1 Ti + β2 (X1i − C) + εi    (11.1)
where
Ti = 1 if X1i ≥ C
Ti = 0 if X1i < C
[Figure 11.2: The dependent variable (Y) plotted against the assignment variable (X1). The fitted line has slope β2; at the cutoff C, the intercept jumps from β0 to β0 + β1, so the jump at the cutoff is β1.]
Figure 11.3 displays more examples of results from RD models. In panel (a),
β1 is positive, just as in Figure 11.2, but β2 is negative, creating a downward slope
for the assignment variable. In panel (b), the treatment has no effect, meaning that
β1 = 0. Even though everyone above the cutoff received the treatment, there is no
discernible discontinuity in the dependent variable at the cutoff point. In panel (c),
β1 is negative because there is a jump downward at the cutoff, implying that the
treatment lowered the dependent variable.
[Figure 11.3: Examples of results from RD models, panels (a)–(c).]
One of the cool things about RD analysis is that even if the error term is
correlated with the assignment variable, the estimated effect of the treatment is still
valid. To see why, suppose C = 0, the error and assignment variable are correlated,
and we characterize the correlation as follows:
εi = ρX1i + νi    (11.2)
where the Greek letter rho (ρ, pronounced “row”) captures how strongly the
error and X1i are related and νi is a random term that is uncorrelated with X1i .
In the Medicare example, mortality is the dependent variable, the treatment T is
Medicare (which kicks in the second someone turns 65), age is the assignment
variable, and health is in the error term. It is totally reasonable to believe that health
is related to age, and we use Equation 11.2 to characterize such a relationship.
If we estimate the following model, which does not control for the assignment
variable (X1i) and thus treats the outcome as a function of Medicare only,

Yi = β0 + β1 Ti + εi

the Medicare variable will pick up not only the effect
of the program but also the effect of health, which is in the error term, which is
correlated with age, which is in turn correlated with Medicare.
If we control for X1i, however, the correlation between T and ε disappears. To
see why, we begin with the basic RD model (Equation 11.1). For simplicity, we
assume C = 0. Using Equation 11.2 to substitute for εi yields

Yi = β0 + β1 Ti + β2 X1i + εi
   = β0 + β1 Ti + β2 X1i + ρX1i + νi
   = β0 + β1 Ti + (β2 + ρ)X1i + νi
   = β0 + β1 Ti + β̃2 X1i + νi
Notice that we have an equation in which the error term is now νi (the part of
Equation 11.2 that is uncorrelated with anything). Hence, the treatment variable,
T, in the RD model is uncorrelated with the error term even though the assignment
variable is correlated with the error term. This means that OLS will provide an
unbiased estimate of β1 , the coefficient on Ti .
Meanwhile, the coefficient we estimate on the X1i assignment variable is β̃ 2
(notice the squiggly on top), a combination of β2 (with no squiggly on top and
the actual effect of X1i on Y) and ρ (the degree of correlation between X1i and the
error term in the original model, ε).
Thus, we do not put a lot of stock in the estimated coefficient on the
assignment variable, because it combines the actual effect of the
assignment variable and the correlation of the assignment variable with the error.
That’s OK, though, because our main interest is in the effect of the treatment, β1 .
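This algebra can be checked with a quick simulation (an illustrative sketch, not from the text; all parameter values are invented). The error is built to be correlated with the assignment variable. Omitting the assignment variable badly biases the treatment estimate, while the RD model recovers the true jump; the coefficient on the assignment variable estimates β2 + ρ, not β2:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
x1 = rng.uniform(-1, 1, size=n)        # assignment variable, cutoff C = 0
T = (x1 >= 0).astype(float)            # treatment indicator
rho = 2.0
eps = rho * x1 + rng.normal(size=n)    # error correlated with the assignment variable
y = 1.0 + 3.0 * T + 1.5 * x1 + eps     # true treatment effect is 3, true beta2 is 1.5

def ols(y, *cols):
    """OLS coefficients with an intercept prepended."""
    X = np.column_stack([np.ones_like(y)] + list(cols))
    return np.linalg.lstsq(X, y, rcond=None)[0]

b1_no_control = ols(y, T)[1]           # omits x1: badly biased (picks up rho too)
coefs_rd = ols(y, T, x1)
b1_rd = coefs_rd[1]                    # close to the true treatment effect of 3
slope_rd = coefs_rd[2]                 # estimates beta2 + rho = 3.5, not beta2
```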
REMEMBER THIS
An RD analysis can be used when treatment depends on an assignment variable being above some
cutoff C.
Yi = β0 + β1 Ti + β2 (X1i − C) + εi
where
Ti = 1 if X1i ≥ C
Ti = 0 if X1i < C
2. RD models require that the error term be continuous at the cutoff. That is, the value of the error
term must not jump up or down at the cutoff.
3. RD analysis identifies a causal effect of treatment because the assignment variable soaks up
the correlation of error and treatment.
Discussion Questions
1. Many school districts pay for new school buildings with bond issues that must be approved by
voters. Supporters of these bond issues typically argue that new buildings improve schools and
thereby boost housing values. Cellini, Ferreira, and Rothstein (2010) used RD analysis to test
whether passage of school bonds caused housing values to rise.
(a) What is the assignment variable?
(b) Explain how to use a basic RD approach to estimate the effect of school bond passage
on housing values.
(c) Provide a specific equation for the model.
2. U.S. citizens are eligible for Medicare the day they turn 65 years old. Many believe that people
with health insurance are less likely to die prematurely because they will be more likely to seek
treatment and doctors will be more willing to conduct tests and procedures for them. Card,
Dobkin, and Maestas (2009) used RD analysis to address this question.
(a) What is the assignment variable?
(b) Explain how to use a basic RD approach to estimate the effect of Medicare coverage on
the probability of dying prematurely.
(c) Provide a specific equation for the model. (Don’t worry that the dependent variable is a
dummy variable; we’ll deal with that issue later on in Chapter 12.)
Yi = β0 + β1 Ti + β2 (X1i − C) + β3 Ti × (X1i − C) + εi

where
Ti = 1 if X1i ≥ C
Ti = 0 if X1i < C
The new term at the end of the equation is an interaction between T and
X1 − C. The coefficient on that interaction, β3 , captures how different the slope is
for observations where X1 is greater than C. The slope for untreated observations
(for which Ti = 0) will simply be β2 , which is the slope for observations to
the left of the cutoff. The slope for the treated observations (for which Ti = 1)
will be β2 + β3 , which is the slope for observations to the right of the cutoff.
(Recall our discussion in Chapter 6, page 202, regarding the proper interpretation
of coefficients on interactions.)
Figure 11.4 displays examples in which the slopes differ above and below the
cutoff. In panel (a), β2 = 1 and β3 = 2. Because β3 is greater than zero, the slope
is steeper for observations to the right of the cutoff. The slope for observations to
the left of the cutoff is 1 (the value of β2 ), and the slope for observations to the
right of the cutoff is β2 + β3 = 3.
In panel (b) of Figure 11.4, β3 is zero, meaning that the slope is the same (and
equal to β2 ) on both sides of the cutoff. In panel (c), β3 is less than zero, meaning
that the slope is less steep for observations for which X1 is greater than C. Note
that just because β3 is negative, the slope for observations to the right of the cutoff
need not be negative (although it may be). A negative value of β3 simply means
that the slope is less steep for observations to the right of the cutoff. In panel (c),
β3 = −β2 , which is why the slope is zero to the right of the cutoff.
In estimating an RD model with varying slopes, it is important to use X1i − C
instead of X1i for the assignment variable. In this model, we’re estimating two
separate lines. The intercept for the line for the untreated group is β̂0 , and the
intercept for the line for the treated group is β̂0 + β̂1 . If we used X1i as the
assignment variable (instead of X1i − C), the β̂1 estimate would indicate the
differences in treated and control when X1i is zero even though we care about
the difference between treated and control when X1i equals the cutoff. By using
X1i − C instead of X1i for the assignment variable, we have ensured that β̂1 will
indicate the difference between treated and control when X1i − C is zero, which
occurs, of course, when X1i = C.
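As an illustrative sketch (not from the text; the parameters are invented), the varying slopes model can be estimated by OLS on a treatment dummy, the centered assignment variable, and their interaction. Centering is what makes the coefficient on the dummy the jump at the cutoff:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000
C = 100.0
x1 = rng.uniform(50, 150, size=n)      # assignment variable with cutoff at 100
T = (x1 >= C).astype(float)
xc = x1 - C                            # centered assignment variable
# True model: jump of 4 at the cutoff; slope 1 below, slope 1 + 2 = 3 above.
y = 2.0 + 4.0 * T + 1.0 * xc + 2.0 * T * xc + rng.normal(size=n)

X = np.column_stack([np.ones(n), T, xc, T * xc])
b0, b1, b2, b3 = np.linalg.lstsq(X, y, rcond=None)[0]
# b1 is the jump at the cutoff only because xc = 0 there;
# the slope below the cutoff is b2, and the slope above is b2 + b3.
```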
Polynomial model
Once we start thinking about how the slope could vary across different values of
X1 , it is easy to start thinking about other possibilities. Hence, more technical RD
analyses spend a lot of effort estimating relationships that are even more flexible
than the varying slopes model. One way to estimate more flexible relationships
between the assignment variable and outcome is to use our polynomial regression
model from Chapter 7 (page 221) to allow the relationship between X1 and Y to
wiggle and curve. The RD insight is that however wiggly that line gets, we’re still
looking for a jump (a discontinuity) at the point where the treatment kicks in.
11.2 More Flexible RD Models
For example, we can use polynomial models to allow the estimated lines to
curve differently above and below the treatment threshold with a model like the
following:
Yi = β0 + β1 Ti + β2 (X1i − C) + β3 (X1i − C)² + β4 Ti × (X1i − C) + β5 Ti × (X1i − C)² + εi

where
Ti = 1 if X1i ≥ C
Ti = 0 if X1i < C
Figure 11.5 shows two relationships that can be estimated with such a
polynomial model. In panel (a), the value of Y accelerates as X1 approaches the
cutoff, dips at the point of treatment, and accelerates again from that lower point. In
panel (b), the relationship appears relatively flat for values of X1 below the cutoff,
but there is a fairly substantial jump up in Y at the cutoff. After that, Y rises sharply
with X1 and then falls sharply.
It is virtually impossible to predict funky non-linear relationships like these
ahead of time. The goal is to find a functional form for the relationship between
X1 − C and outcomes that soaks up any relation between X1 − C and outcomes to
ensure that any jump at the cutoff reflects only the causal effect of the treatment.
This means we can estimate the polynomial models and see what happens even
without a full theory about how the line should wiggle.
With this flexibility comes danger, though. Polynomial models are quite
sensitive and sometimes can produce jumps at the cutoff that are bigger than they
should be. Therefore, we should always report simple linear models as well to
avoid seeming to be fishing around for a non-linear model that gives us the answer
we’d like.
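Here is a hedged sketch of one such polynomial fit (simulated data with invented parameters; a quadratic on each side of the cutoff is only one of many possible specifications):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5_000
xc = rng.uniform(-4, 6, size=n)        # assignment variable minus the cutoff
T = (xc >= 0).astype(float)
# Curvature differs on each side of the cutoff; the true jump at the cutoff is 2.
y = (1.0 + 2.0 * T + 0.3 * xc + 0.2 * xc**2
     - 0.15 * T * xc**2 + rng.normal(0, 0.5, size=n))

# Quadratic in xc, fully interacted with the treatment dummy.
X = np.column_stack([np.ones(n), T, xc, xc**2, T * xc, T * xc**2])
coefs = np.linalg.lstsq(X, y, rcond=None)[0]
jump = coefs[1]                        # estimated discontinuity at the cutoff
```

However wiggly the fitted curves, the quantity of interest remains the jump at the cutoff; reporting the simple linear fit alongside guards against over-fitting.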
REMEMBER THIS
When we conduct RD analysis, it is useful to allow for a more flexible relationship between
assignment variable and outcome.
• A varying slopes model allows the slope to vary on different sides of the treatment cutoff:
Yi = β0 + β1 Ti + β2 (X1i − C) + β3 Ti × (X1i − C) + εi
• We can also use polynomial models to allow for non-linear relationships between the
assignment and outcome variables.
Review Question
For each panel in Figure 11.6, indicate whether each of β1 , β2 , and β3 is less than, equal to, or greater
than zero for the varying slopes RD model:
[Figure 11.6: Six panels, (a)–(f), each plotting Y against X with the cutoff marked.]
Binned graphs
A convenient trick that helps us understand non-linearities and discontinuities in
our RD data is to create binned graphs. Binned graphs look like scatterplots
but are a bit different. To construct a bin plot, we divide the X1 variable into
[FIGURE 11.7: Smaller windows for fitted lines for polynomial RD model in Figure 11.5.]
multiple regions (or “bins”) above and below the cutoff; we then calculate the
average value of Y within each of those regions. When we plot the data, we get
something that looks like panel (a) of Figure 11.8. Notice that there is a single
observation for each bin, producing a graph that’s cleaner than a scatterplot of all
observations.
The bin plot provides guidance for selecting the right RD model. If the
relationship is highly non-linear or seems dramatically different above and below
the cutpoint, the bin plot will let us know. In panel (a) of Figure 11.8, we
see a bit of non-linearity because there is a U-shaped relationship between
X1 and Y for values of X1 below the cutoff. This relationship suggests that a
quadratic could be appropriate, or even simpler, the window could be narrowed
to focus only on the range of X1 where the relationship is more linear. Panel
(b) of Figure 11.8 shows the fitted lines based on an analysis that used only
observations for which X1 is between 900 and 2,200. The implied treatment
effect is the jump in the data indicated by β1 in the figure. We do not actually
use the binned data to estimate the model; we use the original data in our
regressions.
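A binned-means calculation of this sort can be sketched as follows (an illustrative Python version with invented data; the helper and its arguments are hypothetical). The bin edges are aligned to the cutoff so that no bin straddles it:

```python
import numpy as np

def binned_means(x, y, cutoff, width):
    """Average y within equal-width bins of x, with bin edges aligned
    to the cutoff so that no bin straddles it."""
    lo = np.arange(cutoff, x.min() - width, -width)[::-1]  # edges below, ascending
    hi = np.arange(cutoff, x.max() + width, width)         # edges at and above
    edges = np.concatenate([lo[:-1], hi])                  # cutoff included once
    idx = np.digitize(x, edges)
    centers, means = [], []
    for b in np.unique(idx):
        in_bin = idx == b
        centers.append(x[in_bin].mean())
        means.append(y[in_bin].mean())
    return np.array(centers), np.array(means)

rng = np.random.default_rng(5)
x = rng.uniform(-1000, 1000, size=8000)                    # assignment minus cutoff
y = 1500 + 500 * (x >= 0) + 0.4 * x + rng.normal(0, 300, size=8000)
centers, means = binned_means(x, y, cutoff=0.0, width=100.0)
# 'means' has one point per bin: far cleaner to plot than 8,000 raw points.
```

Plotting `means` against `centers` gives the kind of bin plot described above; the estimation itself still uses the raw data.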
11.3 Windows and Bins
REMEMBER THIS
1. It is useful to look at smaller window sizes when possible by considering only data close to the
treatment cutoff.
2. Binned graphs help us visualize the discontinuity and the possibly non-linear relationship
between assignment variable and outcome.
[Figure 11.9: Test score plotted against age relative to the pre-K eligibility cutoff.]
the kids who didn’t go to pre-K would score lower than the kids who did, simply
because they were younger. But unless the program boosted test scores, there is
no obvious reason for a discontinuity to be located right at the cutoff.
Table 11.1 shows results for the basic and varying slopes RD models.
For the basic model, the coefficient on the variable for pre-K is 3.492 and highly
significant, with a t statistic of 10.31. The coefficient indicates the jump that we see
in Figure 11.9. The age variable is also highly significant. No surprise there, as older
children did better on the test.
In the varying slopes model, the coefficient on the treatment is virtually
unchanged from the basic model, indicating a jump of 3.479 in test scores for
the kids who went to pre-K. The effect is again highly statistically significant, with
a t statistic of 10.23. The coefficient on the interaction is insignificant, however,
indicating that the slope on age is the same for kids who had pre-K and those who
didn’t.
11.4 Limitations and Diagnostics
[Table 11.1, bottom rows: N = 2,785 and R² = 0.323 for both models.]
Imperfect assignment
One drawback to the RD approach is that it’s pretty rare to have an assignment
variable that decisively determines treatment. If we’re looking at the effect of going
to a certain college, for example, we probably cannot use RD analysis because
admission was based on multiple factors, none of them cut and dried. Or if we're
trying to assess the effectiveness of a political advertising campaign, it’s unlikely
that the campaign simply advertised in cities where its poll results were less than
some threshold; instead, the managers probably selected certain criteria to identify
where they might advertise and then decided exactly where to run ads on the basis
of a number of factors (including gut feel).
fuzzy RD models  RD models in which the assignment variable imperfectly predicts treatment.

In the Further Reading section at the end of the chapter, we point to readings
on so-called fuzzy RD models, which can be used when the assignment variable
imperfectly predicts treatment. Fuzzy RD models can be useful when there
is a point at which treatment becomes much more likely but isn't necessarily
guaranteed. For example, a college might look only at people with test scores on an
admission exam of 160 or higher. Being above 160 may not guarantee admission,
but there is a huge leap in probability of admission for those who score 160 instead
of 159.
assignment variable itself acts peculiar at the cutoff. If the values of the assignment
variable cluster just above the cutoff, we should worry that people know about
the cutoff and are able to manipulate things to get over it. In such a situation,
it’s quite plausible that the people who are able to just get over the cutoff differ
from those who do not, perhaps because the former have more ambition (as in our
GPA example), or better contacts, or better information, or other advantages. To
the extent that these factors also affect the dependent variable, we’ll violate the
assumption that the error term does not have a discrete jump at the cutoff.
The best way to assess whether there is clustering on one side of the
cutoff is to create a histogram of the assignment variable and see if it shows
unusual activity at the cutoff point. Panel (a) in Figure 11.10 is a histogram of
assignment values in a case with no obvious clustering. The frequency of values
in each bin for the assignment variable bounces around a bit here and there,
but it’s mostly smooth. There is no clear jump up or down at the cutoff. In
contrast, the histogram in panel (b) shows clear clustering just above the cutoff.
When faced with data like panel (b), it’s pretty reasonable to suspect that the
word is out about the cutoff and people have figured out how to get over the
threshold.1
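As a rough numeric companion to the visual check (this is not a formal test, just an illustrative sketch with invented data), one can compare the counts in the bins immediately on either side of the cutoff:

```python
import numpy as np

def density_jump(x, cutoff, width):
    """Crude clustering check: ratio of the count of observations in the
    bin just above the cutoff to the count in the bin just below it."""
    above = np.sum((x >= cutoff) & (x < cutoff + width))
    below = np.sum((x >= cutoff - width) & (x < cutoff))
    return above / below

rng = np.random.default_rng(6)
clean = rng.uniform(-5, 5, size=4000)              # no manipulation
ratio_clean = density_jump(clean, 0.0, 1.0)        # should be near 1

# Manipulation: half the observations just below the cutoff get pushed above it.
manipulated = clean.copy()
push = (manipulated > -1) & (manipulated < 0) & (rng.random(4000) < 0.5)
manipulated[push] = rng.uniform(0, 1, size=push.sum())
ratio_manip = density_jump(manipulated, 0.0, 1.0)  # well above 1
```

A ratio far above 1, like the histogram spike it summarizes, is a warning sign that units are sorting across the cutoff.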
[Figure 11.10: Histograms of the assignment variable. Panel (a): no clustering at the cutoff. Panel (b): clear clustering just above the cutoff.]
¹ Formally testing for discontinuity of the assignment variable at the cutoff is a bit tricky. McCrary
(2008) has more details. Usually, a visual assessment provides a good sense of what is going on,
although it’s a good idea to try different bin sizes to make sure that what you’re seeing is not an
artifact of one particular choice for bin size.
The second diagnostic test involves assessing whether other variables act
weird at the discontinuity. For RD analysis to be valid, we want only Y, nothing
else, to jump at the point where T = 1. If some other variable jumps at the
discontinuity, we may wonder if people are somehow self-selecting (or being
selected) based on unknown additional factors. If so, it could be that the jump
at Y is being caused by these other factors jumping at the discontinuity, not the
treatment. A basic diagnostic test of this sort looks like
X2i = γ0 + γ1 Ti + γ2 (X1i − C) + νi
where
Ti = 1 if X1i ≥ C
Ti = 0 if X1i < C
A statistically significant γ̂1 coefficient from this model means that X2 jumps at
the treatment discontinuity, which casts doubt on the main assumption of the RD
model—namely, that the only thing happening at the discontinuity is movement
from the untreated to the treated category.
A significant γ̂1 from this diagnostic test doesn’t necessarily kill the RD
analysis, but we would need to control for X2 in the RD model and explain
why this additional variable jumps at the discontinuity. It also makes sense to
use varying slopes models, polynomial models, and smaller window sizes in
conducting balance tests.
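The diagnostic regression above can be sketched numerically (an illustrative simulation; the covariate and all numbers are invented). A covariate that varies smoothly through the cutoff should produce a small t statistic on the treatment dummy:

```python
import numpy as np

def covariate_jump_test(x2, T, xc):
    """Regress covariate x2 on the treatment dummy T and the centered
    assignment variable xc; return (gamma1_hat, t statistic) for the
    coefficient on T."""
    X = np.column_stack([np.ones_like(x2), T, xc])
    coefs, *_ = np.linalg.lstsq(X, x2, rcond=None)
    resid = x2 - X @ coefs
    sigma2 = resid @ resid / (len(x2) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coefs[1], coefs[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(7)
n = 4_000
xc = rng.uniform(-50, 50, size=n)                   # assignment minus cutoff
T = (xc >= 0).astype(float)
sat = 600.0 + 0.5 * xc + rng.normal(0, 40, size=n)  # smooth covariate: no jump
g1, t_stat = covariate_jump_test(sat, T, xc)
# A large |t_stat| would cast doubt on the RD assumption; here it should be small.
```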
Including any variable that jumps at the discontinuity is only a partial fix,
though, because if we observe a difference at the cutoff in a variable we can
measure, it's plausible that there is also a difference at the cutoff in a variable we can't
measure. We can measure education reasonably well. It’s a lot harder to measure
intelligence, however. And it’s extremely hard to measure conscientiousness. If
we see that people are more educated at the cutoff, we’ll worry that they are also
more intelligent and conscientious—that is, we’ll worry that at the discontinuity,
our treated group may differ from the untreated group in ways we can’t
measure.
Generalizability of RD results
An additional limitation of RD is that it estimates a very specific treatment effect,
also known as the local average treatment effect. This concept comes up for
instrumental variables models as well (as discussed in the Further Reading Section
of Chapter 9 on page 324). The idea is that the effects of the treatment may differ
within the population: a training program might work great for some types of
people but do nothing for others. The treatment effect estimated by RD analysis is
the effect of the treatment on folks who have X1 equal to the threshold. Perhaps the
treatment would have no effect on people with very low values of the assignment
variable. Or perhaps the treatment effect grows as the assignment variable grows.
RD analysis will not be able to speak to these possibilities because we observe only
the treatment happening at one cutoff. Hence, it is possible that the RD results will
not generalize to the whole population.
REMEMBER THIS
To assess the appropriateness of RD analysis:
1. Qualitatively assess whether people have control over the assignment variable.
2. Conduct diagnostic tests.
• Assess the distribution of the assignment variable by using a histogram to see if there is
clustering on one side of the cutoff.
• Run RD models, and use other covariates as dependent variables. The treatment should not
be associated with any discontinuity in any covariate.
FIGURE 11.11: Histogram of Age Observations for Drinking Age Case Study
We can also run diagnostic tests. Figure 11.11 shows the frequency of observa-
tions for students above and below the age cutoff. There is no sign of manipulation
of the assignment variable: the distribution of ages is mostly constant, with some
apparently random jumps up and down.
We can also assess whether other covariates showed discontinuities at the
21st birthday. Since, as discussed earlier, the defining RD assumption is that the
only discontinuity at the cutoff is in the dependent variable, we hope to see no
All three specifications control for age, allowing the slope to vary on either side of the cutoff. The second and third
specifications control for semester, SAT scores, and other demographic factors.
* indicates significance at p < 0.05, two-tailed.
Covariatei = γ0 + γ1 Ti + γ2 (Agei − C) + νi
where
Ti = 1 if X1i ≥ C
Ti = 0 if X1i < C
Table 11.3 shows results for three covariates: SAT math scores, SAT verbal scores, and
physical fitness. For none of these covariates is γ̂1 statistically significant, suggesting
that there is no jump in covariates at the point of the discontinuity, a conclusion that
is consistent with the idea that the only thing changing at the discontinuity is the
treatment.
Conclusion
RD analysis is a powerful statistical tool. It works even when the treatment we
are trying to analyze is correlated with the error. It works because the assignment
variable—a variable that determines whether a unit gets the treatment—soaks up
endogeneity. The only assumption we need is that there is no discontinuity in the
error term at the cutoff in the assignment variable X1 .
If we have such a situation, the basic RD model is super simple. It is just an
OLS model with a dummy variable (indicating treatment) and a variable indicating
distance to the cutoff. More complicated RD models allow more complicated
relationships between the assignment variable and the dependent variable. No
matter the model, however, the heart of RD analysis remains looking for a jump
in the value of Y at the cutoff point for assignment to treatment. As long as there
is no discontinuity in the relationship between the error and the outcome at the cutoff, we
can attribute any jump in the dependent variable to the effect of the treatment.
• Section 11.1: Write down a basic RD model, and explain all terms,
including treatment variable, assignment variable, and cutoff, as well as
how RD models overcome endogeneity.
• Section 11.2: Write down and explain RD models with varying slopes and
non-linear relationships.
Further Reading
Imbens and Lemieux (2008) and Lee and Lemieux (2010) go into additional detail
on RD designs, including discussions of fuzzy RD models. Bloom (2012) gives
another useful overview of RD methods. Cook (2008) provides a history of RD
applications. Buddelmeyer and Skoufias (2003) compare the performance of RD and
experiments and find that RD analysis works well as long as the discontinuity is
rigorously enforced.
See Grimmer, Hersh, Feinstein, and Carpenter (2010) for an example of using
diagnostics to critique RD studies with election outcomes as an RD assignment
variable.
Key Terms
Assignment variable (375)
Binned graphs (386)
Discontinuity (373)
Fuzzy RD models (392)
Regression discontinuity analysis (374)
Window (386)
Computing Corner
Stata
To estimate an RD model in Stata, create a dummy treatment variable and an
X1 − C variable and use the syntax for multivariate OLS.
4. To create a scatterplot with the fitted lines from a varying slopes RD model,
use the following:
graph twoway (scatter Y Assign) (lfit Y Assign if T == 0) /*
*/ (lfit Y Assign if T == 1)
R
To estimate an RD model in R, we create a dummy treatment variable and an
X1 − C variable and use the syntax for multivariate OLS.
4. There are many different ways to use R to create a scatterplot with the
fitted lines from a varying slopes RD model. Here is one example for a
model in which the assignment variable ranges from −1,000 to 1,000 with
a cutoff at zero. This example uses the results from the OLS regression
model RDResults:
plot(Assign, Y)
lines(-1000:0, RDResults$coef[1] + RDResults$coef[3] * (-1000:0))
lines(0:1000, RDResults$coef[1] + RDResults$coef[2] +
  (RDResults$coef[3] + RDResults$coef[4]) * (0:1000))
Exercises
1. As discussed on page 389, Gormley, Phillips, and Gayer (2008) used
RD analysis to evaluate the impact of pre-K on test scores in Tulsa.
Children born on or before September 1, 2001, were eligible to enroll
in the program during the 2005–2006 school year, while children born
after this date had to wait to enroll until the 2006–2007 school year.
Table 11.4 lists the variables. The pre-K data set covers 1,943 children
just beginning the program in 2006–2007 (preschool entrants) and 1,568
children who had just finished the program and began kindergarten in
2006–2007 (preschool alumni).
(a) Why should there be a jump in the dependent variable right at the
point where a child’s birthday renders him or her eligible to have
participated in preschool the previous year (2005–2006) rather than
the current year (2006–2007)? Should we see jumps at other points
as well?
(b) Assess whether there is a discontinuity at the cutoff for the free-lunch
status, gender, and race/ethnicity covariates.
[TABLE 11.4]
age: Days from the birthday cutoff. The cutoff value is coded as 0; negative values indicate days born after the cutoff; positive values indicate days born before the cutoff.
cutoff: Treatment indicator (1 = born before cutoff, 0 = born after cutoff).
wjtest01: Woodcock-Johnson letter-word identification test score.
freelunch: Eligible for free lunch based on low income in 2006–07 (1 = yes, 0 = no).
(c) Repeat the tests for covariate discontinuities, restricting the sample
to a one-month (30-day) window on either side of the cutoff. Do the
results change? Why or why not?
(f) Add controls for lunch status, gender, and race/ethnicity to the
model. Does adding these controls change the results? Why or why
not?
(g) Reestimate the model from part (f), limiting the window to one
month (30 days) on either side of the cutoff. Do the results change?
How do the standard errors in this model compare to those from the
model using the full data set?
(a) Assess whether there is a discontinuity at the cutoff for the free-lunch
status, gender, and race/ethnicity covariates.
(b) Repeat the tests for covariate discontinuities, restricting the sample
to a one-month (30-day) window on either side of the cutoff. Do the
results change? Why or why not?
(e) Add controls for lunch status, gender, and race/ethnicity to the
model. Do the results change? Why or why not?
(f) Reestimate the model from part (e), limiting the window to one
month (30 days) on either side of the cutoff. Do the results change?
How do the standard errors in this model compare to those from the
model using the full data set?
3. Congressional elections are decided by a clear rule: whoever gets the most
votes in November wins. Because virtually every congressional race in the
United States is between two parties, whoever gets more than 50 percent
of the vote wins.2 We can use this fact to estimate the effect of political
party on ideology. Some argue that Republicans and Democrats are very
distinctive; others argue that members of Congress have strong incentives
to respond to the median voter in their districts, regardless of party. We can
assess how much party matters by looking at the ideology of members of
Congress in the 112th Congress (which covered the years 2011 and 2012).
Table 11.5 lists the variables.
2 We'll look only at votes going to the two major parties, Democrats and Republicans, to ensure a nice 50 percent cutoff.
(b) How can an RD model fight endogeneity when we are trying to assess
if and how party affects congressional ideology?
(d) Write down a basic RD model for this question, and explain the
terms.
(g) Reestimate the varying slopes model, but use the unadjusted variable
(and unadjusted interaction). Compare the coefficient estimates
to your results in part (f). Calculate the fitted values for four
observations: a Democrat with GOP2party2010 = 0, a Democrat
with GOP2party2010 = 0.5, a Republican with GOP2party2010 =
0.5, and a Republican with GOP2party2010 = 1.0. Compare to the
fitted values in part (f).
[TABLE 11.5]
GOP2party2010: The percent of the vote received by the Republican congressional candidate in the district in 2010. Ranges from 0 to 1.
GOPwin2010: Dummy variable indicating the Republican won; equals 1 if GOP2party2010 > 0.5 and equals 0 otherwise.
Ideology: The conservatism of the member of Congress as measured by Carroll, Lewis, Lo, Poole, and Rosenthal (2009, 2014). Ranges from −0.779 to 1.293. Higher values indicate more conservative voting in Congress.
ChildPoverty: Percentage of district children living in poverty. Ranges from 0.03 to 0.49.
WhitePct: Percent of the district that is non-Hispanic white. Ranges from 0.03 to 0.97.
CHAPTER 11 Regression Discontinuity: Looking for Jumps in Data
Mortality: County mortality rate for children aged 5 to 9 from 1973 to 1983, limited to causes plausibly affected by Head Start.
Poverty: Poverty rate in 1960, transformed by subtracting the cutoff and divided by 10 for easier interpretation.
HeadStart: Dummy variable indicating counties that received Head Start assistance; counties with poverty greater than 59.2 are coded as 1, and counties with poverty less than 59.2 are coded as 0.
Bin: The "bin" label for each observation, based on dividing the poverty variable into 50 bins.
(l) Estimate a varying slopes model with a window of GOP vote share
from 0.4 to 0.6. Discuss any meaningful differences in coefficients
and standard errors from the earlier varying slopes model.
(a) Write out an equation for a basic RD design to assess the effect
of Head Start assistance on child mortality rates. Draw a picture
of what you expect the relationship to look like. Note that in
this example, treatment occurs for low values of the assignment
variable.
(b) Explain how RD analysis can identify a causal effect of Head Start
assistance on mortality.
(c) Estimate the effect of Head Start on mortality rate by using a basic
RD design.
(d) Estimate the effect of Head Start on mortality rate by using a varying
slopes RD design.
(e) Estimate a basic RD model with (adjusted) poverty values that are
between –0.8 and 0.8. Comment on your findings.
(g) Create a scatterplot of the mortality and poverty data. What do you
see?
(h) Use the following code to create a binned graph of the mortality and
poverty data. What do you see?3
egen BinMean = mean(Mortality), by(Bin)
graph twoway (scatter BinMean Bin, ytitle("Mortality") /*
*/ xtitle("Poverty") msize(large) xline(0.0) )/*
*/ (lfit BinMean Bin if HeadStart == 0, clcolor(blue)) /*
*/ (lfit BinMean Bin if HeadStart == 1, clcolor(red))
3 The trick to creating a binned graph is associating each observation with a bin label that is in the middle of the bin. The Stata code that created the Bin variable is:

scalar BinNum = 50
scalar BinMin = -6
scalar BinMax = 3
scalar BinLength = (BinMax-BinMin)/BinNum
gen Bin = BinMin + BinLength*(0.5+(floor((Poverty-BinMin)/BinLength)))

This sets the value for each observation to the middle of its bin; there are likely to be other ways to do it.
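The footnote's binning formula can also be sketched in Python (an illustration, not the book's code; the names mirror the Stata scalars):

```python
import math

def bin_midpoint(poverty, bin_min=-6.0, bin_max=3.0, bin_num=50):
    """Assign an observation to the midpoint of its bin,
    mirroring the Stata formula in the footnote."""
    bin_length = (bin_max - bin_min) / bin_num
    return bin_min + bin_length * (0.5 + math.floor((poverty - bin_min) / bin_length))

# With 50 bins on [-6, 3], each bin is 0.18 wide; an observation with
# Poverty = 0 falls in the bin centered at 0.03.
print(bin_midpoint(0.0))
```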
PART III

CHAPTER 12 Dummy Dependent Variables
where E[Yi | X1 , X2 ] is the expected value of Yi given the values of X1i and X2i . This term is also referred to as the conditional value of Y.2

When the dependent variable is dichotomous, the expected value of Y is equal to the probability that the variable equals 1. For example, consider a dependent variable that is 1 if it rains and 0 if it doesn't. If there is a 40 percent chance of rain, the expected value of this variable is 0.40. If there is an 85 percent chance of rain, the expected value of this variable is 0.85. In other words, because E[Y | X] = Probability(Y = 1 | X), OLS with a dichotomous dependent variable provides estimates of the probability that Y equals 1.
1 We discussed dichotomous independent variables in Chapter 7.

2 The terms linear and non-linear can get confusing. A linear model is one of the form Yi = β0 + β1 X1i + β2 X2i + · · · , where none of the parameters to be estimated is multiplied, divided, or raised to powers of other parameters. In other words, all the parameters enter in their own little plus term. In a non-linear model, some of the parameters are multiplied, divided, or raised to powers of other parameters. Linear models can estimate some non-linear relationships (by creating terms that are functions of the independent variables, not the parameters). We described this process in Section 7.1. Such polynomial models will not, however, solve the deficiencies of OLS for dichotomous dependent variables. The models that do address the problems, the probit and logit models we cover later in this chapter, are complex functions of other parameters and are therefore necessarily non-linear models.
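The equivalence between the expected value of a 0/1 variable and the probability it equals 1 is easy to check with a quick simulation; a Python sketch (not from the book):

```python
import random

random.seed(7)
# Simulate a 0/1 "rain" variable with a 40 percent chance of equaling 1.
draws = [1 if random.random() < 0.40 else 0 for _ in range(100_000)]
mean = sum(draws) / len(draws)
print(round(mean, 2))  # close to 0.40, the probability that the variable is 1
```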
12.1 Linear Probability Model

[TABLE 12.1]
GPA          0.032* (0.003) [t = 9.68]
Constant    −2.28* (0.256) [t = 8.91]
N            514
R²           0.23
Minimum Ŷi  −0.995
Table 12.1 displays the results from an LPM of the probability of admission
into a competitive Canadian law school (see Bailey, Rosenthal, and Yoon 2014).
The independent variable is college GPA (measured on a 100-point scale, as is
common in Canada). The coefficient on GPA is 0.032, meaning that an increase
in one point on the 100-point GPA scale is associated with a 3.2 percentage point
increase in the probability of admission into this law school.
Figure 12.1 is a scatterplot of the law school admissions data. It includes the
fitted line from the LPM. The scatterplot looks different from a typical regression
model scatterplot because the dependent variable is either 0 or 1, creating two
horizontal lines of observations. Each point is a light vertical line, and when there
are many observations, the scatterplot appears as a dark bar. We can see that folks
with GPAs under 80 mostly do not get admitted, while people with GPAs above
85 tend to get admitted.
The expected value of Y based on the LPM is a straight line with a slope
of 0.032. Clearly, as GPAs rise, the probability of admission rises as well.
The difference from OLS is that instead of interpreting β̂1 as the increase
in the value of Y associated with a one-unit increase in X, we now interpret
β̂1 as the increase in the probability Y equals 1 associated with a one-unit
increase in X.
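With the rounded coefficients from Table 12.1, the LPM fitted line can be evaluated directly. A Python sketch (the text's −0.995 for a GPA of 40 uses unrounded estimates, so these values are approximate):

```python
def lpm_fitted(gpa, b0=-2.28, b1=0.032):
    """Fitted 'probability' from the LPM in Table 12.1 (rounded coefficients)."""
    return b0 + b1 * gpa

print(lpm_fitted(85))  # around 0.44: a plausible probability
print(lpm_fitted(40))  # around -1.0: an impossible negative "probability"
```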
Limits to LPM
While Figure 12.1 is generally sensible, it also has a glaring flaw. The fitted line
goes below zero. In fact, the fitted line goes far below zero. The poor soul with
a GPA of 40 has a fitted value of −0.995. This is nonsensical (and a bit sad).
Probabilities must lie between 0 and 1. For a low enough value of X, the predicted value falls below zero; for a high enough value of X, the predicted value exceeds one.3

3 In this particular figure, the fitted probabilities do not exceed 1 because GPAs can't go higher than 100. In other cases, though, the independent variable may not have such a clear upper bound. Even so, it is extremely common for LPM fitted values to be less than 0 for some observations and greater than 1 for other observations.

[FIGURE 12.1: Scatterplot of Law School Admissions Data and LPM Fitted Line. Probability of admission plotted against GPA (on a 100-point scale).]

That LPM sometimes provides fitted values that make no sense isn't the only problem. We could, after all, simply say that any time we see a fitted value below 0, we'll call that a 0, and any time we see a fitted value above 1, we'll call that a 1. The deeper problem is that fitting a straight line to data with a dichotomous dependent variable runs the risk of misspecifying the relationship between the independent variables and the dichotomous dependent variable.

Figure 12.2 illustrates an example of LPM's problem. Panel (a) depicts a fitted line from an LPM that uses law school admissions data based on the six hypothetical observations indicated. The line is reasonably steep, implying a clear relationship.

[FIGURE 12.2: Hypothetical admissions data with LPM fitted lines; both panels plot probability of admission against GPA (on a 100-point scale).]

Now suppose that we add three observations from applicants
with very high GPAs, all of whom were admitted. These observations are the
triangles in the upper right of panel (b). Common sense suggests these observations
should strengthen our belief that GPAs predict admission into law school. Sadly,
LPM lacks common sense. The figure shows that the LPM fitted line with the new
observations (the dashed line) is flatter than the original estimate, implying that the
estimated relationship is weaker than the relationship we estimated in the original
model with less data.
What’s that all about? Once we come to appreciate that the LPM needs to fit
a linear relationship, it’s pretty easy to understand. If these three new applicants
had higher GPAs, then from an LPM perspective, we should expect them to have
a higher probability of admission than the applicants in the initial sample. But
the dependent variable can’t be higher than 1, so the LPM interprets the new data
as suggesting a weaker relationship. In other words, because these applicants had
higher independent variables but not higher dependent variables, the LPM suggests
that the independent variable is not driving the dependent variable higher.
What really is going on is that once GPAs are high enough, students are
pretty much certain to be admitted. In other words, we expect a non-linear
relationship—the probability of admission rises with GPAs up to a certain level,
then levels off as most applicants whose GPAs are above that level are admitted.
The probit and logit models we develop next allow us to capture precisely this
possibility.4
In LPM’s defense, it won’t systematically estimate positive slopes when
the actual slope is negative. And we should not underestimate its convenience
and practicality. Nonetheless, we should worry that LPM may leave us with an
incomplete view of the relationship between the independent and dichotomous
dependent variables.
REMEMBER THIS
The LPM uses OLS to estimate a model with a dichotomous dependent variable.
1. The coefficients are easy to interpret: a one-unit increase in Xj is associated with a βj increase
in the probability that Y equals 1.
2. Limitations of the LPM include the following:
• Fitted values of Ŷi may be greater than 1 or less than 0.
• Coefficients from an LPM may mischaracterize the relationship between X and Y.
4 LPM also has a heteroscedasticity problem. As discussed earlier, heteroscedasticity is a less serious problem than endogeneity, but heteroscedasticity forces us to cast a skeptical eye toward standard errors estimated by LPM. A simple fix is to use the heteroscedasticity-robust standard errors we discussed on page 68 in Chapter 3; for more details, see Long (1997, 39). Rather than get too in-the-weeds solving heteroscedasticity in LPMs, however, we might as well run the probit or logit models described shortly.
12.2 Using Latent Variables to Explain Observed Variables

[FIGURE 12.3: Scatterplot of Law School Admissions Data and LPM- and Probit-Fitted Lines. Probability of admission plotted against GPA (on a 100-point scale).]
S-curves
Figure 12.3 shows the law school admissions data. The LPM fitted line, in all
its negative probability glory, is still there, but we have also added a fitted curve
from a probit model. The probit-fitted line looks like a tilted letter “S,” and so the
relationship between X and the dichotomous dependent variable is non-linear. We
explain how to generate such a curve over the course of this chapter, but for now,
let’s note some of its nice features.
For applicants with GPAs below 70 or so, the probit-fitted line has flattened
out. This means that no matter how low students’ GPAs go, their fitted probability
of admission will not go below zero. For applicants with very high GPAs,
increasing scores lead to only small increases in the probability of admission. Even
if GPAs were to go very, very high, the probit-fitted line flattens out, and no one
will have a predicted probability of admission greater than one.
Not only does the S-shaped curve of the probit-fitted line avoid nonsensical
probability estimates, it also reflects the data better in several respects. First, there
is a range of GPAs in which the effect on admissions is quite high. Look in the
range from around 80 to around 90. As GPA rises in this range, the effect on
probability of admission is quite high, much higher than implied by the LPM fitted
line. Second, even though the LPM fitted values for the high GPAs are logically
possible (because they are between 0 and 1), they don’t reflect the data particularly
well. The person with the highest GPA in the entire sample (a GPA of 92) is
predicted by the LPM model to have only a 68 percent probability of admission.
The probit model, in contrast, predicts a 96 percent probability of admission for
this GPA star.
Latent variables
latent variable: For a probit or logit model, an unobserved continuous variable reflecting the propensity of an individual observation of Yi to equal 1.

To generate such non-linear fitted lines, we're going to think in terms of a latent variable. Something is latent if you don't see it, and a latent variable is something we don't see, at least not directly. We'll think of the observed dummy dependent variable (which is zero or one) as reflecting an underlying continuous latent variable. If the value of an observation's latent variable is high, then the dependent variable for that observation is likely to be one; if the value of an observation's latent variable is low, then the dependent variable for that observation is likely to be zero. In short, we're interested in a latent variable that is an unobserved continuous variable reflecting the propensity of an individual observation of Yi to equal 1.
Here’s an example. Pundits and politicians obsess over presidential approval.
They know that a president’s reelection and policy choices are often tied to the
state of his approval. Presidential approval is typically measured with a yes-or-no
question: Do you approve of the way the president is handling his job? That’s
our dichotomous dependent variable, but we know full well that the range of
responses to the president covers far more than two choices. Some people froth
at the mouth in anger at the mention of the president. Others think “Meh.” Others
giddily support the president.
It’s useful to think of these different views as different latent attitudes toward
the president. We can think of the people who hate the president as having very
negative values of a latent presidential approval variable. People who are so-so
about the president have values of a latent presidential approval variable near zero.
People who love the president have very positive values of a latent presidential
approval variable.
We think in terms of a latent variable because it is easy to write down a
model of the propensity to approve of the president. It looks like an OLS model.
Specifically, Yi∗ (pronounced "Y-star") is the latent propensity to be a 1 (an ugly phrase, but that's really what it is). It depends on some independent variable X and the β's:

Yi∗ = β0 + β1 X1i + εi

We observe Yi = 1 for people whose latent feelings are above zero.5 If the latent variable is less than zero, we observe Yi = 0. (We ignore non-answers to keep things simple.)
This latent variable approach is consistent with how the world works. There
are folks who approve of the president but differ in the degree to which they
approve; they are all ones in the observed variable (Y) but vary in the latent variable
(Y ∗ ). There are folks who disapprove of the president but differ in the degree of
their disapproval; they are all zeros in the observed variable (Y) but vary in the
latent variable (Y ∗ ).
Formally, we connect the latent and observed variables as follows. The observed variable is

Yi = 0 if Yi∗ < 0
Yi = 1 if Yi∗ ≥ 0

Because Yi∗ = β0 + β1 X1i + εi , we observe Yi = 1 when

β0 + β1 X1i + εi ≥ 0
εi ≥ −β0 − β1 X1i

In other words, if the random error term is greater than or equal to −β0 − β1 X1i , we'll observe Yi = 1. This implies

Pr(Yi = 1 | X1i ) = Pr(εi ≥ −β0 − β1 X1i )

With this characterization, the probability that the dependent variable is one is necessarily bounded between 0 and 1 because it is expressed in terms of the probability that the error term is greater or less than some number. Our task in the next section is to characterize the distribution of the error term as a function of the β parameters.
REMEMBER THIS
Latent variable models are helpful to analyze dichotomous dependent variables.
Yi∗ = β0 + β1 X1i + εi

5 Because the latent variable is unobserved, we have the luxury of using zero to label the point in the latent variable space at which folks become ones.
Probit model

probit model: A way to analyze data with a dichotomous dependent variable. The key assumption is that the error term is normally distributed.

The key assumption in a probit model is that the error term (εi ) is itself normally distributed. We've worked with the normal distribution a lot because the central limit theorem (from page 56) implies that with enough data, OLS coefficient estimates are normally distributed no matter how εi is distributed. For the probit model, we're saying that εi itself is normally distributed. So while normality of β̂1 is a proven result for OLS, normality of εi is an assumption in the probit model.
Before we explain the equation for the probit model, it is useful to do a bit of bookkeeping. We have shown that Pr(Yi = 1 | X1 ) = Pr(εi ≥ −β0 − β1 X1i ), but this equation can be hard to work with given the widespread convention in probability of characterizing the distribution of a random variable in terms of the probability that it is less than some value. Therefore, we're going to do a quick trick based on the symmetry of the normal distribution: because the distribution is symmetrical (it has the same shape on each side of the mean), the probability of seeing something larger than some number is the same as the probability of seeing something less than the negative of that number. Figure 12.4 illustrates this property. In panel (a), we shade the probability of being greater than −1.5. In panel (b), we shade the probability of being less than 1.5. The symmetry of the normal distribution backs up what our eyes suggest: the shaded areas are equal in size, indicating equal probabilities. In other words, Pr(εi > −1.5) = Pr(εi < 1.5). This fact allows us to rewrite Pr(Yi = 1 | X1 ) = Pr(εi ≥ −β0 − β1 X1i ) as

Pr(Yi = 1 | X1 ) = Pr(εi ≤ β0 + β1 X1i )
There isn’t a huge conceptual issue here, but now it’s much easier to characterize
the model with conventional tools for working with normal distributions. In
cumulative particular, stating the condition in this way simplifies our use of the cumulative
distribution function distribution function (CDF) of a standard normal distribution. The CDF tells us
(CDF) Indicates how how much of normal distribution is to the left of any given point. Feed the CDF a
much of normal
number, and it will tell us the probability that a standard normal random variable
distribution is to the left
of any given point.
is less than that number.
[FIGURE 12.4: Two standard normal densities. Panel (a) shades the probability of being greater than −1.5; panel (b) shades the probability of being less than 1.5.]

Figure 12.5 on page 420 shows examples for several values of β0 + β1 X1i . In panel (a), the portion of a standard normal probability density to the left of −0.7 is shaded. Below that, in panel (d), the CDF function with the value of the CDF at −0.7 is highlighted. The value is roughly 0.25, which is the area of the normal curve that is to the left of −0.7 in panel (a).
Panel (b) in Figure 12.5 shows a standard normal density curve with the
portion to the left of +0.7 shaded. Clearly, this is more than half the distribution.
The CDF below it, in panel (e), shows that in fact roughly 0.75 of a standard
normal density is to the left of +0.7. Panel (c) shows a standard normal probability
density function (PDF) with the portion to the left of 2.3 shaded. Panel (f), below
that, shows a CDF and highlights its CDF value at 2.3, which is about 0.99. Notice
that the CDF can’t be less than 0 or more than 1 because it is impossible to have
less than 0 percent or more than 100 percent of the area of the normal density to
the left of any number.
Since we know Yi = 1 if εi ≤ β0 + β1 X1i , the probability Yi = 1 will be the CDF evaluated at the point β0 + β1 X1i .
[FIGURE 12.5: Standard normal PDFs (top row) and CDFs (bottom row) evaluated at −0.7, 0.7, and 2.3.]
The notation we'll use for the normal CDF is Φ() (the Greek letter Φ is pronounced "fi," as in Wi-Fi), which indicates the probability that a normally distributed random variable (εi in this case) is less than the number in parentheses. In other words,

Pr(Yi = 1 | X1 ) = Φ(β0 + β1 X1i )
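The symmetry trick and the CDF values above can be verified numerically. A Python sketch using a standard normal CDF built from math.erf (an illustration, not the book's code):

```python
import math

def norm_cdf(z):
    """Standard normal CDF: Pr(epsilon < z)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Symmetry: Pr(eps > -1.5) equals Pr(eps < 1.5), as in Figure 12.4.
print(1 - norm_cdf(-1.5), norm_cdf(1.5))

# CDF values for the points in Figure 12.5:
for z in (-0.7, 0.7, 2.3):
    print(z, round(norm_cdf(z), 2))  # roughly 0.24, 0.76, 0.99
```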
The probit model produces estimates of β that best fit the data. That is, to
the extent possible, probit estimates will produce β̂’s that lead to high predicted
probabilities for observations that actually were ones. Likewise, to the extent
possible, probit estimates will produce β̂’s that lead to low predicted probabilities
for observations that actually were zeros. We discuss estimation after we introduce
the logit model.
Logit model

logit model: A way to analyze data with a dichotomous dependent variable. The error term in a logit model is logistically distributed. Pronounced "low-jit."

A logit model also allows us to estimate parameters for a model with a dichotomous dependent variable in a way that forces the fitted values to lie between 0 and 1. Logit models are functionally very similar to probit models. The difference from a probit model is the equation that characterizes the error term. The equation differs dramatically from the probit equation, but it turns out that this difference has little practical import. In a logit model,

Pr(Yi = 1) = e^(β0 + β1 X1i) / (1 + e^(β0 + β1 X1i))

To get a feel for the logit equation, consider what happens when β0 + β1 X1i
is humongous. In the numerator, e is raised to that big number, which leads
to a super big number. In the denominator will be that same number plus 1,
which is pretty much the same number. Hence, the probability will be very, very
close to 1. But no matter how big β0 + β1 X1i gets, the probability will never
exceed 1.
If β0 + β1 X1i is super negative, the numerator of the logit function will have e raised to a huge negative number, which is the same as one over e raised to a big number, which is essentially zero. The denominator will have that number plus one, meaning that the fraction is very close to 0/1, and therefore the probability that Yi = 1 will be very, very close to 0. No matter how negative β0 + β1 X1i gets, the probability will never go below 0.6
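The boundedness argument can be checked directly; a Python sketch of the logit function (illustration only):

```python
import math

def logit_prob(z):
    """Logit probability Pr(Y=1) for z = beta0 + beta1*X1."""
    return math.exp(z) / (1 + math.exp(z))

print(logit_prob(20))   # extremely close to 1, but never above it
print(logit_prob(-20))  # extremely close to 0, but never below it
print(logit_prob(0))    # exactly 0.5
```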
The probit and logit models are rivals, but friendly rivals. When properly
interpreted, they yield virtually identical results. Do not sweat the difference.
Simply pick probit or logit and get on with life. Back in the early days of
computers, the logit model was often preferred because it is computation-
ally easier than the probit model. Now powerful computers make the issue
moot.
6 If β0 + β1 X1i is zero, then Pr(Yi = 1) = 0.5. It's a good exercise to work out why. The logit function can also be written as Pr(Yi = 1) = 1 / (1 + e^(−(β0 + β1 X1i))).
REMEMBER THIS
The probit and logit models are very similar. Both estimate S-shaped fitted lines that are always above
0 and below 1.
1. In a probit model,

Pr(Yi = 1) = Φ(β0 + β1 X1i )

where Φ() is the standard normal CDF indicating the probability that a standard normal random variable is less than the number in parentheses.

2. In a logit model,

Pr(Yi = 1) = e^(β0 + β1 X1i) / (1 + e^(β0 + β1 X1i))
Discussion Questions
1. Come up with an example of a dichotomous dependent variable of interest, and then do the
following:
(a) Describe the latent variable underlying the observed dichotomous variable.
(b) Identify a continuous independent variable that may explain this dichotomous dependent
variable. Create a scatterplot of what you expect observations of the independent and
dependent variables to be.
(c) Sketch and explain the relationship you expect between your independent variable and
the probability of observing the dichotomous dependent variable equal to 1.
2. Come up with another example of a dichotomous dependent variable of interest. This
time, identify a dichotomous independent variable as well, and finish up by doing the
following:
(a) Create a scatterplot of what you expect observations of the independent and dependent
variables to be.
(b) Sketch and explain the relationship you expect between your independent variable and
the probability of observing the dichotomous dependent variable equal to 1.
12.4 Estimation

maximum likelihood estimation (MLE): The estimation process used to generate coefficient estimates for probit and logit models, among others.

So how do we select the best β̂ for the data given? The estimation process for the probit and logit models is called maximum likelihood estimation (MLE). It is more complicated than estimating coefficients using OLS. Understanding the inner workings of MLE is not necessary to implement or understand probit and logit models. Such an understanding can be helpful for more advanced work, however, and we discuss the technique in more detail in the citations and additional notes section on page 561.

In this section, we explain the properties of MLE estimates, describe the fitted values produced by probit and logit models, and show how goodness of fit is measured in MLE models.
Ŷi = Pr(Yi = 1) = Φ(β̂0 + β̂1 Xi ) = Φ(−3 + 2 × 0) = Φ(−3) = 0.001
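The Φ(−3) value can be reproduced with the standard normal CDF; a quick Python check (illustration only, using math.erf):

```python
import math

def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Fitted probability for beta0 = -3, beta1 = 2 at X = 0, as in the text:
print(round(norm_cdf(-3 + 2 * 0), 3))  # 0.001
```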
[FIGURE 12.6: Probit fitted lines for four patterns of hypothetical data. Each panel plots the probability that Y = 1 against X (from 0 to 3). Panel (a): β0 = −3, β1 = 2. Panel (b): β0 = −4, β1 = 6. Panel (c): β0 = −1, β1 = 1. Panel (d): β0 = 3, β1 = −2.]
Panel (b) of Figure 12.6 shows a somewhat similar relationship, but the
transition between the Y = 0 and Y = 1 observations is starker. When X is less
than about 0.5, the Y's are all zero; when X is greater than about 1.0, the Y's are all
one. This pattern of data indicates a strong relationship between X and Y, and β̂1 is,
not surprisingly, larger in panel (b) than in panel (a). The fitted line is quite steep.
Panel (c) of Figure 12.6 shows a common situation in which the relationship
between X and Y is rather weak. The estimated coefficients produce a fitted line
that is pretty flat; we don’t even see the full S-shape emblematic of probit models.
If we were to display the fitted line for a much broader range of X values, we
would see the S-shape because the fitted probabilities would flatten out at zero
for sufficiently negative values of X and would flatten out at one for sufficiently
positive values of X. Sometimes, as in this case, the flattening of a probit-fitted
line occurs outside the range of observed values of X.
Panel (d) of Figure 12.6 shows a case of a positive β̂ 0 coefficient and negative
β̂1 . This case best fits the pattern of the data in which Y = 1 for low values of X
and Y = 0 for high values of X.
REMEMBER THIS
1. Probit and logit models are estimated via MLE instead of OLS.
2. We can assess the statistical significance of MLE estimates of β̂ by using z tests, which closely
resemble t tests in large samples for OLS models.
[TABLE 12.2]
          (a)       (b)
X1        0.5       1.0
         (0.1)     (1.0)
X2       −0.5      −3.0
         (0.1)     (1.0)
N         500       500
log L  −1,000    −1,200
Review Questions
1. For each panel in Figure 12.6, identify the value of X that produces Ŷi = 0.5. Use the probit
equation.
2. Based on Table 12.2, indicate whether the following statements are true, false, or indetermi-
nate.
(a) The coefficient on X1 in column (a) is statistically significant.
(b) The coefficient on X1 in column (b) is statistically significant.
(c) The results in column (a) imply that a one-unit increase in X1 is associated with a
50-percentage-point increase in the probability that Y = 1.
(d) The fitted probability found by using the estimate in column (a) for X1i = 0 and
X2i = 0 is 0.
(e) The fitted probability found by using the estimate in column (b) for X1i = 0 and X2i = 0
is approximately 1.
3. Based on Table 12.2, indicate the fitted probability for the following:
(a) Column (a) and X1i = 4 and X2i = 0.
(b) Column (a) and X1i = 0 and X2i = 4.
(c) Column (b) and X1i = 0 and X2i = 1.
Probit and logit models have their strengths, but being easy to interpret is
not one of them. This is because the β̂’s feed into the complicated equations
defining the probability of observing Y = 1. These complicated equations keep
the predicted values above zero and less than one, but they can do so only by
allowing the effect of X to vary across values of X.
In this section, we explain how the estimated effect of X1 on Y in probit and
logit models depends not only on the value of X1 , but also on the value of the other
independent variables. We then describe approaches to interpreting the coefficient
estimates from these models.
[Figure: Probability of admission (vertical axis, 0 to 1) plotted against GPA on a
100-point scale (horizontal axis, 65 to 100). Along the S-shaped curve, the probability
rises by 0.03 when GPA goes from 70 to 75, by 0.30 when GPA goes from 85 to 90, but
by only 0.01 when GPA goes from 95 to 100.]
Then we want to know, for each observation, the estimated effect of increasing
X1 by one standard deviation. We could create a variable called P2 that is the fitted
value given the estimated β̂ coefficients and the actual values of the independent
variables for each observation with one important exception: for each observation,
we use the true value of X1 plus a standard deviation of X1 :

P2i = Φ(β̂0 + β̂1 (X1i + σX1 ) + β̂2 X2i )
The average difference in these two fitted values across all observations is
the simulated effect of increasing X1 by one standard deviation. The difference
for each observation will be driven by the magnitude of β̂1 because the difference
in these fitted values is all happening in the term multiplied by β̂1 . In short, this
means that the bigger β̂1 , the bigger the simulated effect of X1 will be.
It is not set in stone that we add one standard deviation. Sometimes it may
make sense to calculate these quantities by simply using an increase of one or
some other amount.
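To make the mechanics concrete, here is a hypothetical Python sketch of the calculation. The data, variable names, and "estimated" coefficient values are all invented stand-ins; in practice the b_hat values would come from a probit fit.

```python
# Hypothetical sketch of the observed-value, discrete-differences
# calculation for a continuous X1. Data and coefficients are invented.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 500
X1 = rng.normal(size=n)           # continuous independent variable
X2 = rng.normal(size=n)           # another independent variable

b0_hat, b1_hat, b2_hat = -0.2, 0.5, -0.3   # assumed probit estimates

# P1: fitted probability at each observation's actual values
P1 = norm.cdf(b0_hat + b1_hat * X1 + b2_hat * X2)

# P2: fitted probability with X1 raised by one standard deviation
P2 = norm.cdf(b0_hat + b1_hat * (X1 + X1.std()) + b2_hat * X2)

# Simulated effect: the average difference across all observations
effect = (P2 - P1).mean()
```

The Computing Corner gives Stata and R versions of this same difference-in-fitted-probabilities logic.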
These simulations make the coefficients interpretable in a commonsense
way. We can say things like, “The estimates imply that increasing GPA by one
standard deviation is associated with an average increase of 15 percentage points
in predicted probability of being admitted to law school.” That’s a mouthful, but
much more meaningful than the β̂ itself.
If X1 is a dummy variable, we summarize the effect of X1 slightly differently.
We calculate what the average increase in fitted probabilities would be if the value
of X1 for every observation were to go from zero to one. We’ll need to estimate
two quantities for each observation. First, we’ll need to calculate the estimated
probability that Y = 1 if X1 (the dummy variable) were equal to 0, given our β̂
estimates and the actual values of the other independent variables. For this purpose,
we could, for example, create a new variable called P0 :

P0i = Φ(β̂0 + β̂1 · 0 + β̂2 X2i )
Then we want to know, for each observation, what the estimated probability
that Y = 1 is if the dummy variable were to equal 1. For this purpose, we could,
for example, create a new variable called P1 that is the estimated probability that
Y = 1 if X1 (the dummy variable) were equal to 1 given our β̂ estimates and the
actual values of the other independent variables:

P1i = Φ(β̂0 + β̂1 · 1 + β̂2 X2i )
Notice that the only difference between P0 and P1 is that in P0 , X1 = 0 for all
observations (no matter what the actual value of X1 is) and in P1 , X1 = 1 for all
observations (no matter what the actual value of X1 is). The larger the value of β̂1 ,
the larger the difference between P0 and P1 will be for each observation. If β̂1 = 0,
then P0 = P1 for all observations and the estimated effect of X1 is clearly zero.
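The dummy-variable version differs only in how the two fitted probabilities are constructed. A hypothetical Python sketch, with invented data and coefficients:

```python
# Hypothetical sketch for a dummy X1: force X1 to 0 for everyone, then
# to 1 for everyone, holding the other variables at observed values.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 500
X2 = rng.normal(size=n)                    # other independent variable
b0_hat, b1_hat, b2_hat = -0.2, 0.8, -0.3   # assumed probit estimates

P0 = norm.cdf(b0_hat + b1_hat * 0 + b2_hat * X2)  # X1 set to 0
P1 = norm.cdf(b0_hat + b1_hat * 1 + b2_hat * X2)  # X1 set to 1

# Simulated effect of going from 0 to 1 on the dummy variable
effect = (P1 - P0).mean()
```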
The approach we have just described is called the observed-value,
discrete-differences approach to estimating the effect of an independent variable
on the probability Y = 1. “Observed value” comes from our use of these
observed values in the calculation of simulated probabilities. The alternative to
the observed-value approach is the average-case approach, which creates a single
composite observation whose independent variables equal sample averages. We
discuss the average-case approach in the citations and additional notes section on
page 562.
The “discrete-differences” part of our approach involves the use of specific
differences in the value of X1 when simulating probabilities. The alternative to the
discrete-differences approach is a marginal-effects approach based on slopes, discussed
on page 563.
REMEMBER THIS
1. Use the observed-value, discrete-differences method to interpret probit coefficients as
follows:
• If X1 is continuous:
(a) For each observation, calculate P1i as the standard fitted probability from the probit
results:

P1i = Φ(β̂0 + β̂1 X1i + β̂2 X2i )
(b) For each observation, calculate P2i as the fitted probability when the value of X1i is
increased by one standard deviation (σX1 ) for each observation:

P2i = Φ(β̂0 + β̂1 (X1i + σX1 ) + β̂2 X2i )
(c) The simulated effect of increasing X1 by one standard deviation is the average
difference P2i − P1i across all observations.
• If X1 is a dummy variable:
(a) For each observation, calculate P0i as the fitted probability but with X1i set to 0 for
all observations:

P0i = Φ(β̂0 + β̂1 · 0 + β̂2 X2i )
(b) For each observation, calculate P1i as the fitted probability but with X1i set to 1 for
all observations:

P1i = Φ(β̂0 + β̂1 · 1 + β̂2 X2i )
(c) The simulated effect of going from 0 to 1 for the dummy variable X1 is the average
difference P1i − P0i across all observations.
2. To interpret logit coefficients by using the observed-value, discrete-differences method,
proceed as with the probit model, but use the logit equation to generate fitted
values.
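For the logit version, only the fitted-probability formula changes. A hypothetical Python sketch with invented data and coefficients:

```python
# Hypothetical sketch of the logit version: identical logic, but fitted
# values come from the logistic function rather than the normal CDF.
import numpy as np

def logistic(z):
    # exp(z) / (1 + exp(z)), the logit fitted-probability formula
    return np.exp(z) / (1 + np.exp(z))

rng = np.random.default_rng(3)
n = 500
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
b0_hat, b1_hat, b2_hat = -0.3, 0.9, -0.5   # assumed logit estimates

P1 = logistic(b0_hat + b1_hat * X1 + b2_hat * X2)
P2 = logistic(b0_hat + b1_hat * (X1 + X1.std()) + b2_hat * X2)
effect = (P2 - P1).mean()
```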
Review Questions
Suppose we have data on restaurants in Los Angeles and want to understand what causes them to
go out of business. Our dependent variable is a dummy variable indicating bankruptcy in the year
of the study. One independent variable is the owner’s years of experience running a restaurant.
Another independent variable is a dummy variable indicating whether or not the restaurant had a
liquor license.
1. Explain how to calculate the effects of the owner’s years of experience on the probability a
restaurant goes bankrupt.
2. Explain how to calculate the effects of having a liquor license on the probability a restaurant
goes bankrupt.
where the price difference is the average price of the brand-name products
(e.g., Heinz and Hunt’s) minus the store-brand product, the display variables are
dummy variables indicating whether brand-name and store-brand products were
displayed, and the featured variables are dummy variables indicating whether
brand-name and store-brand products were featured in advertisements.7
A probit model of the purchase decision is
Table 12.3 presents the results. As always, the LPM results are easy to interpret.
The coefficient on price difference indicates that consumers are 13.4 percentage
points more likely to purchase the store-brand ketchup if the average price of the
brand-name products is $1 more expensive than the store brand, holding all else
equal. (The average price difference in this data set is only $0.13, so $1 is a big
difference in prices.) If the brand-name ketchup is displayed, consumers are 4.2
percentage points less likely to buy the store-brand product, while consumers
are 21.4 percentage points more likely to buy the store-brand ketchup when it is
displayed. These, and all the other coefficients, are statistically significant.
The fitted probabilities of buying store-brand ketchup from the LPM range
from –13 percent to +77 percent. Yeah, the negative fitted probability is weird.
Probabilities below zero do not make sense, and that’s one of the reasons why the
LPM makes people a little squeamish.
The second and third columns of Table 12.3 display probit and logit results.
These models are, as we know, designed to avoid nonsensical fitted values and to
better capture the relationship between the dependent and independent variables.
7 The income variable ranges from 1 to 14, with each value corresponding to a specified income
range. This approach to measuring income is pretty common, even though it is not super precise.
Sometimes people break an income variable coded this way into dummy variables; doing so does not
affect our conclusions in this particular case.
12.5 Interpreting Probit and Logit Coefficients 433
Interpreting the coefficients in the probit and logit models is not straightforward,
however. Does the fact that the coefficient on the price difference variable
in the probit model is 0.685 mean that consumers are 68.5 percentage points more
likely to buy store-brand ketchup when brand-name ketchup is $1 more expensive?
Does the coefficient on the store brand display variable imply that consumers
are 68.3 percentage points more likely to buy store-brand ketchup when it is on
display?
No. No. (No!) The coefficient estimates from the probit and logit models feed
into the complicated probit and logit equations on pages 418 and 421. We need
extra steps to understand what they mean. Table 12.4 shows the results when we
use our simulation technique to understand the substantive implications of our
estimates. The estimated effect of a $1 price increase from the probit model is
calculated by comparing the average fitted value for all individuals at their actual
values of their independent variables to the average fitted value for all individuals
when the price difference variable is increased by one for every observation (and all
other variables remain at their actual values). This value is 0.191, a bit higher than
the LPM estimate of 0.134 we see in Table 12.3 but still in the same ballpark.
Our simulations are slightly different for dummy independent variables. For
example, to calculate the estimated effect of displaying the brand-name ketchup
from the probit model, we first calculate fitted values from the probit model
assuming the value of this variable is equal to 0 for every consumer while using
the actual values of the other variables. Then, we calculate fitted values from the
probit model assuming the value of the brand name displayed variable is equal to
1 for everyone, again using the actual values of the other variables. The average
difference in these fitted probabilities is –0.049, indicating our probit estimates
imply that displaying the brand-name ketchup lowers the probability of buying the
store brand by 4.9 percentage points, on average.
The logit-estimated effects in Table 12.4 are generated via a similar process,
using the logit equation instead of the probit equation. The logit-estimated effects
for each variable track the probit-estimated effects pretty closely. This pattern is
not surprising because the two models are doing the same work, just with different
assumptions about the error term.
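A quick numerical check shows why the two sets of fitted values line up so closely: over a range of index values, the normal CDF and the logistic function trace nearly the same S-curve. The grid of index values below is arbitrary.

```python
# Sketch comparing probit and logit fitted probabilities over an
# arbitrary grid of index values (beta0 + beta1*X1 + ...).
import numpy as np
from scipy.stats import norm

z = np.linspace(-3, 3, 121)              # arbitrary index values
probit_p = norm.cdf(z)                   # probit probability
logit_p = np.exp(z) / (1 + np.exp(z))    # logit probability

# The two S-curves are nearly perfectly correlated
corr = np.corrcoef(probit_p, logit_p)[0, 1]
```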
Figure 12.8 on page 435 helps us visualize the results by displaying the fitted
values from the LPM, probit, and logit estimates. We’ll display fitted values as a
function of the price difference variable.
FIGURE 12.8: Fitted Lines from LPM, Probit, and Logit Models
One of the LPM lines dips below zero. That’s what LPMs do. It’s screwy. On the
whole, however, the LPM lines are pretty similar to the probit and logit lines. The
probit and logit lines are quite similar to each other as well. In fact, the fitted values
from the probit and logit models are very similar, as is common. Their correlation is
0.998. The probit and logit fitted values don’t quite show the full S-shaped curve;
they would, however, if we were to extend the graph to include even higher (and
less realistic) price differences.
8 It may seem odd that this is called a likelihood ratio test when the statistic is the difference in log
likelihoods. The test can also be considered as the log of the ratio of the two likelihoods. Because
12.6 Hypothesis Testing about Multiple Coefficients 437
An example makes this process clear. It’s not hard. Suppose we want to
know if displaying the store-brand ketchup is more effective than featuring it in
advertisements. This is the kind of thing people get big bucks for when they do
marketing studies.
Using our LR test framework, we first want to characterize the unrestricted
version of the model, which is simply the model with all the covariates in it:
This is considered unrestricted because we are letting the coefficients on the store
brand display and store brand featured variables be whatever best fit the data.
The null hypothesis is that the effect of displaying and featuring store-brand
ketchup is the same—that is, H 0 : β3 = β5 . We impose this null hypothesis on the
model by forcing the computer to give us results in which the coefficients on the
display and featured variables for the store brand are equal. We do this by replacing
β5 with β3 in the model (which we can do because under the null hypothesis they
are equal), yielding a restricted model of
Look carefully, and notice that the β3 is multiplied by (Store brand displayi +
Store brand featuredi ) in this restricted equation.
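In code, imposing the restriction amounts to replacing the two variables with their sum before estimating. The following is a hypothetical Python sketch that fits both probit models by maximum likelihood on simulated data; the variable names, sample, and coefficient values are invented, not the book's ketchup data.

```python
# Sketch of imposing H0: beta3 = beta5 by summing the two dummies in
# the restricted model, then forming the LR statistic (Equation 12.3).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def probit_loglik(beta, y, X):
    # Sum of y*log(Phi(Xb)) + (1 - y)*log(1 - Phi(Xb))
    p = np.clip(norm.cdf(X @ beta), 1e-12, 1 - 1e-12)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def max_loglik(y, X):
    # Maximize the probit log likelihood; return log L at the optimum
    res = minimize(lambda b: -probit_loglik(b, y, X),
                   np.zeros(X.shape[1]), method="BFGS")
    return -res.fun

rng = np.random.default_rng(2)
n = 1000
display = rng.integers(0, 2, n).astype(float)
featured = rng.integers(0, 2, n).astype(float)
y = ((-0.5 + 0.4 * display + 0.9 * featured
      + rng.normal(size=n)) > 0).astype(float)

ones = np.ones(n)
X_ur = np.column_stack([ones, display, featured])   # unrestricted
X_r = np.column_stack([ones, display + featured])   # forces equal betas

LR = 2 * (max_loglik(y, X_ur) - max_loglik(y, X_r))
```

Because the restricted model is nested in the unrestricted one, the LR statistic is never negative.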
log(LUR /LR ) = log LUR − log LR , however, we can use the form given. Most software reports the log
likelihood, not the (unlogged) likelihood, so it’s more convenient to use the difference of log
likelihoods than the ratio of likelihoods. The 2 in Equation 12.3 is there to make things work; don’t
ask.
column in Table 12.3. At the bottom is the unrestricted log likelihood that will
feed into the LR test.
This is a good time to do a bit of commonsense approximating. The
coefficients on the store brand display and store brand featured variables in the
unrestricted model in Table 12.5 are both positive and statistically significant, but
the coefficient on the store brand featured variable is quite a bit higher than the
coefficient on the store brand displayed variable. Both coefficients have relatively
small standard errors, so it is reasonable to expect that there’s a difference,
suggesting that H 0 is false.
From Table 12.5, it is easy to calculate the LR test statistic:

LR = 2(log LUR − log LR ) = 8.57
Using the tools described in this chapter’s Computing Corner, we can calculate that
the p value associated with an LR value of 8.57 is 0.003, well below a conventional
significance level of 0.05.
Or, equivalently, we can reject the null hypothesis if the LR statistic is greater
than the critical value for our significance level. The critical value for a significance
level of 0.05 is 3.84, and our LR test statistic of 8.57 exceeds that. This means we
can reject the null that the coefficients on the display and featured variables are
the same. In other words, we have good evidence that consumers responded more
strongly when the product was featured in ads than when it was displayed.
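The arithmetic of the test can be checked directly. The sketch below uses the LR value of 8.57 from the text and the χ² distribution with one degree of freedom (one equal sign in the null).

```python
# Check the LR test arithmetic: p value and 5 percent critical value
# for an LR statistic of 8.57 with 1 degree of freedom.
from scipy.stats import chi2

LR = 8.57
df = 1

p_value = chi2.sf(LR, df)        # upper-tail probability, about 0.003
critical = chi2.ppf(0.95, df)    # 5 percent critical value, about 3.84
```

Both routes give the same verdict: the p value is below 0.05, and the statistic exceeds the critical value, so we reject the null.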
REMEMBER THIS
Use the LR test to examine hypotheses involving multiple coefficients for probit and logit models.
4. Use the log likelihood values from the unrestricted and restricted models to calculate the LR
test statistic:

LR = 2(log LUR − log LR )
5. The larger the difference between the log likelihoods, the more the null hypothesis is reducing
fit and, therefore, the more likely we are to reject the null.
• The test statistic is distributed according to a χ 2 distribution with degrees of freedom equal
to the number of equal signs in the null hypothesis.
• Code for generating critical values and p values for this distribution is in the Computing
Corner on pages 446 and 448.
[Bottom rows of a results table: σ̂ = 0.128, 0.128; log L = −549.092, −508.545]
• GDP is lagged GDP per capita. The GDP measure is lagged to avoid any
taint from the civil war itself, which almost surely had an effect on the
economy. It is measured in thousands of inflation-adjusted U.S. dollars. The
variable ranges from 0.05 to 66.7, with a mean of 3.65 and a standard
deviation of 4.53.
Table 12.6 shows results for LPM and probit models. For each method, we
present results with and without GDP. We see a similar pattern when GDP is omitted.
In the LPM (a) specification, ethnic fractionalization is statistically significant and
religious fractionalization is not. The same is true for the probit (a) specification that
does not have GDP.
Fearon and Laitin’s suspicion, however, was supported by both LPM and probit
analyses. When GDP is included, the ethnic fractionalization variable becomes
insignificant in both LPM and probit (although it is close to significant in the LPM).
The GDP variable is highly statistically significant in both LPM and probit models.
So the general conclusion that GDP seems to matter more than ethnic fraction-
alization does not depend on which model we use to estimate this dichotomous
dependent variable model.
[Figure: Probability of civil war (vertical axis, −0.08 to 0.04) plotted against GDP per
capita in $1,000 U.S. (horizontal axis, 0 to 70), showing fitted values from the LPM and
probit models; the LPM line falls below zero at high levels of GDP.]
FIGURE 12.9: Fitted Lines from LPM and Probit Models for Civil War Data (Holding Ethnic and
Religious Variables at Their Means)
Yet, the two models do tell slightly different stories. Figure 12.9 shows the fitted
lines from the LPM and probit models for the specifications that include the GDP
variable. When calculating these lines, we held the ethnic and religious variables
at their mean values. The LPM model has its characteristic brutally straight fitted
line. It suggests that whatever its wealth, a country sees its probability of civil war
decline as it gets even wealthier. It does this to the point of not making sense—the
fitted probabilities are negative (hence meaningless) for countries with per capita
GDP above about $20,000 per year.
In contrast, the probit model has a curve. We’re seeing only a hint of the S
curve because even the poorest countries have less than a 4 percent probability of
experiencing civil war. But we do see that the effect of GDP is concentrated among
the poorest countries. For them, the effect of income is relatively higher, certainly
higher than the LPM suggests. But for countries with about $10,000 per capita GDP
per year, income shows basically no effect on the probability of a civil war. So even
as the broad conclusion that GDP matters is similar in the LPM and probit models,
the way in which GDP matters is quite different across the models.
Conclusion
Things we care about are often dichotomous. Think of unemployment, vote choice,
graduation, war, or countless other phenomena. We can use OLS to analyze
such data via LPM, but we risk producing models that do not fully reflect the
relationships in the data.
The solution is to fit an S-shaped relationship via probit or logit models. Probit
and logit models are, as a practical matter, interchangeable as long as sufficient
care is taken in the interpretation of coefficients. The cost of these models is that
they are more complicated, especially with regard to interpreting the coefficients.
We’re in good shape when we can:
• Section 12.2: Describe what a latent variable is and how it relates to the
observed dichotomous variable.
• Section 12.3: Describe the probit and logit models. What is the equation for
the probability that Yi = 1 for a probit model? What is the equation for the
probability that Yi = 1 for a logit model?
• Section 12.4: Discuss the estimation procedure used for probit and logit
models and how to generate fitted values.
Further Reading
There is no settled consensus on the best way to interpret probit and logit
coefficients. Substantive conclusions rarely depend on the mode of presentation,
so any of the methods is legitimate. Hanmer and Kalkan (2013) argue for the
observed-value approach and against the average-case approach.
MLE models do not inherit all properties of OLS models. In OLS, het-
eroscedasticity does not bias coefficient estimates; it only makes the conventional
equation for the standard error of β̂1 inappropriate. In probit and logit models,
heteroscedasticity can induce bias (Alvarez and Brehm 1995), but correcting for
heteroscedasticity may not always be feasible or desirable (Keele and Park 2006).
King and Zeng (2001) discuss small-sample properties of logistic models,
noting in particular that small-sample bias can be large when the dependent
variable is a rare event, with only a few observations falling in the less frequent
category.
Probit and logit models are examples of limited dependent variable models.
In these models, the dependent variable is restricted in some way. As we have
seen, the dependent variable in probit models is limited to two values, 1 and 0.
MLE can be used for many other types of limited dependent variable models. If
the dependent variable is ordinal with more than two categories (e.g., answers to
a survey question for which answers are very satisfied, satisfied, dissatisfied, and
very dissatisfied), an ordered probit model is useful. It is based on MLE methods
and is a modest extension of the probit model. Some dependent variables are
categorical. For example, we may be analyzing the mode of transportation to work
(with walking, biking, driving, and taking public transportation as options). In
such a case, multinomial logit, another MLE technique, is useful. Other dependent
variables are counts (number of people on a bus) or lengths of time (how long
between buses or how long someone survives after a disease diagnosis). Models
with these dependent variables also can be estimated with MLE methods, such as
count models and duration models. Long (1997) introduces maximum likelihood
and covers a broad variety of MLE techniques. King (1989) explains the general
approach. Box-Steffensmeier and Jones (2004) provide an excellent guide to
duration models.
Key Terms
Cumulative distribution function (418)
Dichotomous (409)
Latent variable (416)
Likelihood ratio test (436)
Linear probability model (410)
Log likelihood (425)
Logit model (421)
Maximum likelihood estimation (423)
Probit model (418)
z test (423)
Computing Corner
Stata
To implement the observed-value, discrete-differences approach to interpreting
estimated effects for probit in Stata, do the following.
• If X1 is continuous:
** Estimate probit model
probit Y X1 X2

** "normal" refers to normal CDF function
** _b[_cons] is beta0 hat, _b[X1] is beta1 hat, etc.
gen P1 = normal(_b[_cons] + _b[X1]*X1 + _b[X2]*X2)

** Increase X1 by one standard deviation
egen sdX1 = sd(X1)
gen X1Plus = X1 + sdX1
gen P2 = normal(_b[_cons] + _b[X1]*X1Plus + _b[X2]*X2)
gen PDiff = P2 - P1

** Display results
** "e(sample)" tells Stata to only use observations
** used in probit analysis
sum PDiff if e(sample)
• If X1 is a dummy variable:
** Estimate probit model
probit Y X1 X2
** Fitted probabilities with dummy X1 set to 0, then to 1
gen P0 = normal(_b[_cons] + _b[X1]*0 + _b[X2]*X2)
gen P1 = normal(_b[_cons] + _b[X1]*1 + _b[X2]*X2)
gen PDiff = P1 - P0
** Display results
sum PDiff if e(sample)
• The margins command produces average marginal effects, which are the
average of the slopes with respect to each independent variable evaluated
at observed values of the independent variables. See the discussion on
marginal effects on page 563 for more details. These are easy to implement
in Stata, with similar syntax for both probit and logit models.
probit Y X1 X2
margins, dydx(X1)
To ascertain the p value for an LR test with d.f. = 1, substituting log
likelihood values in for logLunrestricted and logLrestricted, type
display 1-chi2(1, 2*(logLunrestricted - logLrestricted))
Even easier, we can use Stata’s test command to conduct a Wald test,
which is asymptotically equivalent to the LR test (which is a fancy way of
saying the test statistics get really close to each other as the sample size
goes to infinity). For example,
probit Y X1 X2 X3
test X2 = X3 = 0
• To estimate a logit model in Stata, use logic and structure similar to those
for a probit model. Here are the key differences for the continuous variable
example:
logit Y X1 X2
gen LogitP1 = exp(_b[_cons]+_b[X1]*X1+_b[X2]*X2)/*
*/(1+exp(_b[_cons]+_b[X1]*X1+_b[X2]*X2))
gen LogitP2 = exp(_b[_cons]+_b[X1]*X1Plus+_b[X2]*X2)/*
*/(1+exp(_b[_cons]+_b[X1]*X1Plus+_b[X2]*X2))
Computing Corner 447
• To graph fitted lines from a probit or logit model that has only one
independent variable, first estimate the model and save the fitted values.
Then use the following command:
graph twoway (scatter ProbitFit X)
R
To implement a probit or logit analysis in R, we use the glm function, which stands
for “generalized linear model” (as opposed to the lm function, which stands for
“linear model”).
• If X1 is continuous:
## Estimate probit model and name it Result
Result = glm(Y ~ X1 + X2, family = binomial(link = "probit"))
P1 = pnorm(Result$coef[1] + Result$coef[2]*X1 + Result$coef[3]*X2)
X1Plus = X1 + sd(X1)
P2 = pnorm(Result$coef[1] + Result$coef[2]*X1Plus + Result$coef[3]*X2)
mean(P2 - P1)  ## simulated effect of a one-standard-deviation increase
• If X1 is a dummy variable:
## Estimate probit model and name it Result
Result = glm(Y ~ X1 + X2, family = binomial(link = "probit"))
P0 = pnorm(Result$coef[1] + Result$coef[2]*0 + Result$coef[3]*X2)
P1 = pnorm(Result$coef[1] + Result$coef[2]*1 + Result$coef[3]*X2)
mean(P1 - P0)  ## simulated effect of going from 0 to 1
BushVote04 Dummy variable = 1 if person voted for President Bush in 2004 and 0 otherwise
ProIraqWar02 Position on Iraq War, ranges from 0 (opposed war) to 3 (favored war)
Party02 Partisan affiliation, ranges from 0 for strong Democrats to 6 for strong Republicans
BushVote00 Dummy variable = 1 if person voted for President Bush in 2000 and 0 otherwise
CutRichTaxes02 Views on cutting taxes for wealthy, ranges from 0 (oppose) to 2 (favor)
To conduct the LR test, estimate the restricted model by using the restricted
equation:
## Restricted probit model
RModel = glm(Y ~ X3, family = binomial(link =
"probit"))
and proceed with the rest of the test.
• To estimate a logit model in R, use logic and structure similar to those for
a probit model. Here are the key differences for the continuous variable
example:
Result = glm(Y ~ X1+X2, family = binomial(link ="logit"))
P1 = exp(Result$coef[1]+Result$coef[2]*X1+Result$coef[3]*X2)/
(1+exp(Result$coef[1]+Result$coef[2]*X1+Result$coef[3]*X2))
P2 = exp(Result$coef[1]+Result$coef[2]*X1Plus+Result$coef[3]*X2)/
(1+exp(Result$coef[1]+Result$coef[2]*X1Plus+Result$coef[3]*X2))
• To graph fitted lines from a probit or logit model that has only one
independent variable, first estimate the model and save it. In this case, we’ll
save a probit model as ProbResults. Create a new variable that spans the
range of the independent variable. In this case, we create a variable called
Xsequence that ranges from 1 to 7 in steps of 0.05 (the first value is 1,
the next is 1.05, etc.). We then use the coefficients from the ProbResults
model and this Xsequence variable to plot fitted lines:
Xsequence = seq(1, 7, 0.05)
plot(Xsequence, pnorm(ProbResults$coef[1] +
ProbResults$coef[2]*Xsequence), type = "l")
Exercises
1. In this question, we use the data set BushIraq.dta to explore the effect
of opinion about the Iraq War on the presidential election of 2004. The
variables we will focus on are listed in Table 12.7.
(a) Estimate two probit models: one with only ProIraqWar02 as the
independent variable and the other with all the independent variables listed in the table.
(b) Use the model with all the independent variables and the
observed-value, discrete-differences approach to calculate the effect
of a one standard deviation increase in ProIraqWar02 on support for
Bush.
(c) Use the model with all the independent variables listed in the table
and the observed-value, discrete-differences approach to calculate
the effect of an increase of one standard deviation in Party02 on
support for Bush. Compare to the effect of ProIraqWar02.
(f) Calculate the correlation of the fitted values from the probit and logit
models.
(ii) What are the minimum and maximum fitted values from this
model? Discuss implications briefly.
(b) Use a probit model to estimate the probability of saying that global
warming is real and caused by humans (the dependent variable
is HumanCause2). Use the independent variables from part (a),
including the age-squared variable.
(ii) What are the minimum and maximum fitted values from this
model? Discuss implications briefly.
Party7 Partisan identification, ranging from 1 for strong Republican, 2 for not-so-strong
Republican, 3 leans Republican, 4 undecided/independent, 5 leans Democrat, 6
not-so-strong Democrat, 7 strong Democrat
(c) The survey described in this item also included a survey experiment
in which respondents were randomly assigned to different question
wordings for an additional question about global warming. The idea
was to see which frames were most likely to lead people to agree that
the earth is getting warmer. The variable we analyze here is called
WarmAgree. It records whether respondents agreed that the earth’s
average temperature is rising. The experimental treatment consisted
of four different ways to phrase the question.
• The variable Treatment equals 2 for people who were given the
following information before being asked if they agreed that
the average temperature of the earth is getting warmer: “The
following figure [Figure 12.10] shows the average global tem-
perature compared to the average temperature from 1951–1980.
The temperature analysis comes from weather data from more
than 1,000 meteorological stations around the world, satellite
observations of sea surface temperature, and Antarctic research
station measurements.”
• The variable Treatment equals 3 for people who were given the
following information before being asked if they agreed that the
average temperature of the earth is getting warmer: “Scientists
working at the National Aeronautics and Space Administration
(NASA) have concluded that the average global temperature
[Figure: Global average temperature anomaly over time (vertical axis, roughly −0.4 to
0.4), showing the annual mean and the 5-year mean, with uncertainty indicated; the
series runs through 2011.]
FIGURE 12.10: Figure Included for Some Respondents in Global Warming Survey Experiment
Exercises 453
(a) Run a probit model explaining whether the coach was fired as a
function of winning percentage. Graph fitted values from this model
on the same graph with fitted values from a bivariate LPM (use
the lfit command to plot LPM results). Explain the differences in
the plots.
(b) Estimate LPM, probit, and logit models of coach firings by using
winning percentage, lagged winning percentage, a new coach
dummy, strength of schedule, and coach tenure as independent
variables. Are the coefficients substantially different? How about the
z statistics?
(c) Indicate the minimum, mean, and maximum of the fitted values for
each model, and briefly discuss each.
FiredCoach A dummy variable indicating whether the football coach was fired during or after the season
(1 = fired, 0 = otherwise)
WinPct The winning percentage of the team
Tenure The number of years the coach has coached the team
(e) It’s kind of odd to say that lagged winning percentage affects the
probability that new coaches were fired because they weren’t
coaching for the year associated with the lagged winning percentage.
Include an interaction for the new coach dummy variable and lagged
winning percentage. The effect of lagged winning percentage on
probability of being fired is the sum of the coefficients on lagged
winning percentage and the interaction. Test the null hypothesis that
lagged winning percentage has no effect on new coaches (meaning
coaches for whom NewCoach = 1). Use a Wald test (which is most
convenient) and an LR test.
4. Are members of Congress more likely to meet with donors than with
mere constituents? To answer this question, Kalla and Broockman (2015)
conducted a field experiment in which they had political activists attempt
to schedule meetings with 191 congressional offices regarding efforts to
ban a potentially harmful chemical. The messages the activists sent out
were randomized. Some messages described the people requesting the
meeting as “local constituents,” and others described the people requesting
the meeting as “local campaign donors.” Table 12.10 describes two key
variables from the experiment.
(b) Use a probit model to estimate the effect of the donor treatment
condition on probability of meeting with a member of Congress.
Interpret the results.
(c) What factors are missing from the model? What does this omission
mean for our results?
donor_treat Dummy variable indicating that activists seeking meeting were identified as donors
(1 = donors, 0 = otherwise)
staffrank Highest-ranking person attending the meeting: 0 for no one attended meeting, 1 for
non-policy staff, 2 for legislative assistant, 3 for legislative director, 4 for chief of
staff, 5 for member of Congress
(d) Use an LPM to make your estimate. Interpret the results. Assess the
correlation of the fitted values from the probit model and LPM.
(e) Use an LPM to assess the probability of meeting with a senior staffer
(defined as staffrank > 2).
(g) Table 12.11 shows results for balance tests (covered in Section 10.1)
for two variables: Obama vote share in the congressional district
and the overall campaign contributions received by the member
of Congress contacted. Discuss the implication of these results for
balance.
[Table 12.11, bottom row: N = 191 in both columns]
Advanced Material
13 Time Series: Dealing with Stickiness over Time
the bump in year 1 will percolate through the entire data series. Such a dynamic
model includes a lagged dependent variable as an independent variable. Dynamic
models might seem pretty similar to other OLS models, but they actually differ in
important and funky ways.
This chapter covers both approaches to dealing with time series data.
Section 13.1 introduces a model for autocorrelation. Section 13.2 shows how
to use this model to detect autocorrelation, and Section 13.3 presents two ways
to properly account for autocorrelated errors. Section 13.4 introduces dynamic
models, and Section 13.5 discusses an important but complicated aspect of
dynamic models called stationarity.
13.1 Modeling Autocorrelation

Yt = β0 + β1 Xt + εt (13.1)
The notation for this differs from the notation for our standard OLS model. Instead
of using i to indicate each individual observation, we use t to indicate each time
period. Yt therefore indicates the dependent variable at time t; Xt indicates the
independent variable at time t.
This model helps us appreciate the potential that errors may be correlated in
time series data. To get a sense of how this happens, first let us consider a seemingly
random fact: sunspots are a solar phenomenon that may affect temperature and that
strengthen and weaken somewhat predictably over a roughly 11-year cycle. Now
suppose that we’re trying to assess if carbon emissions affect global temperature
with a data set that does not have a variable for sunspot activity. The fact that we
haven’t measured sunspots means that they will be in the error term, and the fact
that they cycle up and down over an 11-year period means that the errors in the
model will be correlated.1

1 We show how the OLS equation for the variance of β̂1 depends on the errors being
uncorrelated on page 499.

autoregressive process: A process in which the value of a variable depends directly
on the value from the previous period.

Here we will model autocorrelated errors by assuming the errors follow an
autoregressive process. In an autoregressive process, the value of a variable
depends directly on the value from the previous period. The equation for an
autoregressive error process is
εt = ρεt−1 + νt (13.2)

This equation says that the error term for time period t equals ρ times the error
in the previous period plus a random error, νt. We assume that νt is uncorrelated
with the independent variable and other error terms. We call εt−1 the lagged error
because it is the error from the previous period. We indicate a lagged variable
with the subscript t − 1 instead of t. A lagged variable is a variable with the values
from the previous period.2

lagged variable: A variable with the values from the previous period.
The absolute value of ρ must be less than one in autoregressive models. If ρ
were greater than one, the errors would tend to grow larger in each time period
and would spiral out of control.
We often refer to autoregressive models as AR models. In AR models, the
errors are a function of errors in previous periods. If errors are a function of only
the errors from the previous period, the model is referred to as an AR(1) model
(pronounced A-R-1). If the errors are a function of the errors from the two previous
periods, the model is referred to as an AR(2) model, and so on. We’ll focus on
AR(1) models in this book.

AR(1) model: A model in which the errors are assumed to depend on their value
from the previous period.
2 Some important terms here sound similar but have different meanings. Autocorrelation refers to
errors being correlated with each other. An autoregressive model is the most common, but not the
exclusive, way to model autocorrelation. It is possible to model correlated errors differently. In a
moving average error process, for example, errors can be the average of errors from some number of
previous periods. In Section 13.4 we’ll use an autoregressive model for the dependent variable rather
than for the error.
[Figure 13.1: Three simulated error series plotted over time, 1950–2000: (a) positive autocorrelation, (b) no autocorrelation, (c) negative autocorrelation (ρ = −0.8).]

Panel (a) of Figure 13.1 shows positive autocorrelation. The telltale signature of
positive autocorrelation is smoothness: the error tends to linger
above zero for a few periods, then below zero for a few periods, and so on. This
graph is telling us that if we know the error in one period, we have some sense of
what it will be in the following period. That is, if the error is positive in period t,
then it’s likely (but not certain) to be positive in period t + 1.
Panel (b) of Figure 13.1 shows a case of no autocorrelation. The error in time
t is not a function of the error in the previous period. The telltale signature of no
autocorrelation is the randomness: the plot is generally spiky, but here and there
the error might linger above or below zero, without a strong pattern.
Panel (c) of Figure 13.1 shows negative serial correlation with ρ = −0.8. The
signature of negative serial correlation is extreme spikiness because a positive error
is more likely to be followed by a negative error, and vice versa.
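The three panels can be reproduced with a small simulation. Here is a minimal sketch in Python with numpy, an illustration of the general idea rather than the book's code (the book's Computing Corner uses Stata and R):

```python
import numpy as np

# Simulate AR(1) errors: e_t = rho * e_{t-1} + nu_t.
def simulate_ar1(rho: float, n: int = 500, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    nu = rng.normal(size=n)
    e = np.zeros(n)
    for t in range(1, n):
        e[t] = rho * e[t - 1] + nu[t]
    return e

def lag1_corr(e: np.ndarray) -> float:
    """Sample correlation between e_t and e_{t-1}."""
    return float(np.corrcoef(e[1:], e[:-1])[0, 1])

pos = lag1_corr(simulate_ar1(0.8))     # smooth, lingering series: strongly positive
none = lag1_corr(simulate_ar1(0.0))    # spiky, patternless series: near zero
neg = lag1_corr(simulate_ar1(-0.8))    # extremely spiky series: strongly negative
```

The lag-1 sample correlations mirror the visual signatures described above: smooth series have strongly positive lag-1 correlation, spiky series strongly negative.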
REMEMBER THIS
1. Autocorrelation refers to the correlation of errors with each other.
2. A standard way to model autocorrelated errors is to assume they come from an autoregressive
process in which the error term in period t is a function of the error in previous periods.
3. The equation for error in an AR(1) model is
εt = ρεt−1 + νt

13.2 Detecting Autocorrelation

Discussion Question
The temperature hovers above the trend line for periods (such as around World
War II and now) and below the line for other periods (such as 1950–1980). This
hovering is a sign that the error in one period is correlated with the error in the
next period. Panel (b) of Figure 13.2 shows the residuals from this regression. For
each observation, the residual is the distance from the fitted line; the residual plot
is essentially panel (a) tilted so that the fitted line in panel (a) is now the horizontal
line in panel (b).
[Figure 13.2: (a) Global temperature by year with an OLS fitted line; (b) residuals from that regression by year.]
where ε̂t and ε̂t−1 are, respectively, simply the residuals and lagged residuals from
the initial OLS estimation of Yt = β0 + β1 Xt + εt. If ρ̂ is statistically significantly
different from zero, we have evidence of autocorrelation.3
3 This approach is closely related to a so-called Durbin-Watson test for autocorrelation. This test
statistic is widely reported, but it has a more complicated distribution than a t distribution and
requires use of specific tables. In general, it produces results similar to those from the auxiliary
regression process described here.
Table 13.1 shows the results of such a lagged residual model for the climate
data in Figure 13.2. The dependent variable in this model is the residual from
Equation 13.3, and the independent variable is the lagged value of that residual.
We’re using this model to estimate how closely ε̂t and ε̂t−1 are related. The answer?
They are strongly related. The coefficient on ε̂t−1 is 0.608, meaning that our ρ̂
estimate is 0.608, which is quite a strong relation. The standard error is 0.072,
implying a t statistic of 8.39, which is well beyond any conventional critical value.
We can therefore handily reject the null hypothesis that ρ = 0 and conclude that
errors are autocorrelated.
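The two-step detection logic can be sketched in a few lines of Python with numpy. This uses simulated data with a true ρ of 0.6, not the book's climate series:

```python
import numpy as np

# Simulate y = 2 + 1.5*x + e with AR(1) errors (true rho = 0.6), then run the
# two-step detection: OLS residuals, then residuals on lagged residuals.
rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=n)
nu = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.6 * e[t - 1] + nu[t]
y = 2.0 + 1.5 * x + e

# Step 1: initial OLS, save residuals
X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b

# Step 2: auxiliary regression of residuals on lagged residuals (no constant,
# matching the equation in the text)
r_t, r_lag = resid[1:], resid[:-1]
rho_hat = float(r_lag @ r_t / (r_lag @ r_lag))
nu_hat = r_t - rho_hat * r_lag
se = np.sqrt((nu_hat @ nu_hat) / (len(r_t) - 1) / (r_lag @ r_lag))
t_stat = rho_hat / se                  # large t statistic: reject rho = 0
```

With a true ρ of 0.6 and 300 observations, ρ̂ lands near 0.6 and the t statistic is far beyond conventional critical values, just as in the climate example.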
REMEMBER THIS
To detect autocorrelation in time series data:
1. Graph the residuals from a standard OLS model over time. If the plot is relatively smooth,
positive autocorrelation is likely to exist. If the plot is relatively spiky, negative autocorrelation
is likely.
2. Estimate the following OLS model:
ε̂t = ρε̂t−1 + νt
Yt = β0 + β1 Xt + ρεt−1 + νt (13.5)

This equation looks like a standard OLS equation except for a pesky ρεt−1 term.
Our goal is to zap that term. Here’s how:
1. Write an equation for the lagged value of Yt that simply requires replacing
the t subscripts with t − 1 subscripts in the original model:
3. Subtract the equation for ρYt−1 (Equation 13.7) from Equation 13.5.
That is, subtract the left side of Equation 13.7 from the left side of
Equation 13.5, and subtract the right side of Equation 13.7 from the right
side of Equation 13.5.
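Step 2, multiplying the lagged equation by ρ, appears to have dropped out of this copy. Written in full, and numbered to match the text's references to Equations 13.6 through 13.8 (the numbering is inferred from those references), the steps are:

```latex
Y_{t-1} = \beta_0 + \beta_1 X_{t-1} + \epsilon_{t-1} \qquad (13.6)
```
```latex
\rho Y_{t-1} = \rho\beta_0 + \rho\beta_1 X_{t-1} + \rho\epsilon_{t-1} \qquad (13.7)
```
```latex
Y_t - \rho Y_{t-1} = \beta_0(1-\rho) + \beta_1(X_t - \rho X_{t-1}) + \nu_t \qquad (13.8)
```

Subtracting Equation 13.7 from Equation 13.5 cancels the ρεt−1 term exactly, leaving only the well-behaved error νt.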
4 Technically, GLS is an approach that accounts for known aspects of the error structure such as
autocorrelation. Since we need to estimate the extent of autocorrelation, the approach we discuss here
is often referred to as “feasible GLS,” or FGLS, because it is the only feasible approach given
uncertainty about the true error structure.
13.3 Fixing Autocorrelation
The key thing is to look at the error term in this new equation. It is νt ,
which we said at the outset is the well-behaved (not autocorrelated) part of the
error term. Where is t , the naughty, autocorrelated part of the error term? Gone!
That’s the thing. That’s what we accomplished with these equations: we end up
with an equation that looks pretty similar to our OLS equation with a dependent
variable (Ỹt ), parameters to estimate (β̃0 and β1 ), an independent variable (X̃t ),
and an error term (νt ). The difference is that unlike our original model (based on
Equations 13.1 and 13.2), this model has no autocorrelation. By using Ỹt and X̃t ,
we have transformed the model from one that suffers from autocorrelation to one
that does not.
While the coefficients produced by OLS (and used for Newey-West) need only
that the error term be uncorrelated with Xt (our standard condition for exogeneity),
the coefficients produced by the ρ-transformed model need the error term to be
uncorrelated with Xt, Xt−1, and Xt+1 in order to be unbiased.5
The ρ-transformed model is also referred to as a Cochrane-Orcutt model or a
Prais-Winsten model.6
5 Wooldridge (2013, 424) has more detail on the differences in the two approaches. To get a sense of
why the ρ-transformed model has more conditions for exogeneity, note that the independent variable
in Equation 13.8 is composed of both Xt and Xt−1, and we know from past results that the correlation
of errors and independent variables is a problem.
6 The Prais-Winsten model approximates the values for the missing first observation in the
ρ-transformed data. These names are useful to remember when we’re looking for commands in Stata
and R to analyze time series data.
data will be missing because we don’t know the lagged value for that
observation.
Once we’ve created these transformed variables, things are easy. If we think
in terms of a spreadsheet, we’ll simply use the columns Ỹ and X̃ when we estimate
the ρ-transformed model.
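The spreadsheet logic can be sketched in Python with numpy. Only the first row of Table 13.2 survives in this copy (Year 2000, Y = 100, X = 50), so the later Y and X values below are hypothetical:

```python
import numpy as np

# rho-transform two columns with rho_hat = 0.5, as in Table 13.2. The first
# table row has Y = 100, X = 50; subsequent values here are hypothetical.
rho_hat = 0.5
Y = np.array([100.0, 110.0, 104.0, 116.0])
X = np.array([50.0, 52.0, 49.0, 55.0])

Y_tilde = np.full(len(Y), np.nan)
X_tilde = np.full(len(X), np.nan)
Y_tilde[1:] = Y[1:] - rho_hat * Y[:-1]   # Y_t - rho_hat * Y_{t-1}
X_tilde[1:] = X[1:] - rho_hat * X[:-1]   # X_t - rho_hat * X_{t-1}
# The first observation stays missing: it has no lagged value.
```

We then estimate OLS using the Y_tilde and X_tilde columns in place of Y and X.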
It is worth emphasizing that the β̂1 coefficient we estimate in the
ρ-transformed model is an estimate of β1 . Throughout all the rigmarole of the
transformation process, the value of β1 doesn’t change. The value of β1 in
the original equation is the same as the value of β1 in the transformed equation.
Hence, when we get results from ρ-transformed models, we still speak of them in
the same terms as β1 estimates from standard OLS. That is, a one-unit increase in
X is associated with a β̂1 increase in Y.7
One non-intuitive thing is that even though the underlying data is the same,
the β̂ estimates differ from the OLS estimates. Both OLS and ρ-transformed
coefficient estimates are unbiased and consistent, which means that in expectation,
the estimates equal the true value, and as we get more data they converge to the true
value. These things can be true and the models can still yield different coefficient
estimates. If we flip a coin 100 times, we are likely to get a different number of
heads each time we go through the process, even though the expected number of
heads is 50. That’s pretty much what is going on here: the two approaches
are different realizations of random processes that are correct on average but still
have random noise.
Should we use Newey-West standard errors or the ρ-transformed approach?
Each approach has its virtues. The ρ-transformed approach is more statistically
“efficient,” meaning that it will produce smaller (yet still appropriate) standard
errors than Newey-West. The downside of the ρ-transformed approach is that it
requires additional assumptions to produce unbiased estimates of β1.
TABLE 13.2 Example of ρ-Transformed Data (for ρ̂ = 0.5)

          Original data         ρ-Transformed data
Year      Y        X            Ỹ (= Yt − ρ̂Yt−1)     X̃ (= Xt − ρ̂Xt−1)
2000      100      50           —                     —
7 The intercept estimated in a ρ-transformed model is actually β0(1 − ρ̂). If we want to know the
fitted value for Xt = 0 (which is the meaning of the intercept in a standard OLS model), we need to
divide β̃0 by (1 − ρ̂).
REMEMBER THIS
1. Newey-West standard errors account for autocorrelation. These standard errors are used with
OLS β̂ estimates.
2. We can also correct for autocorrelation by ρ-transforming the data, a process that purges
autocorrelation from the data and produces estimates of β̂ that differ from those of OLS.
a. The model is Ỹt = β̃0 + β1 X̃t + νt, where Ỹt = Yt − ρYt−1 and X̃t = Xt − ρXt−1.
The first column of Table 13.3 shows results from a standard OLS analysis of the
model. The t statistics for β̂1 and β̂2 are greater than 5 but, as we have discussed,
are not believable due to the corruption of standard OLS standard errors by the
correlation of errors.
The first column of Table 13.3 also reports that ρ̂ = 0.514; this result was
generated by estimating an auxiliary regression with residuals as the dependent variable
and lagged residuals as the independent variable. The autocorrelation is lower
than in the model that does not include year squared as an independent variable
[Figure: Global temperature (deviation from pre-industrial average) plotted by year.]
(as reported on page 466) but is still highly statistically significant, suggesting that
we need to correct for autocorrelation.
The second column of Table 13.3 shows results with Newey-West standard
errors. The coefficient estimates do not change, but the standard errors and t
statistics do change. Note also that the standard errors are bigger and the t statistics
are smaller than in the OLS model.
The third column of Table 13.3 shows results from a ρ-transformed model.
β̂1 and β̂2 haven’t changed much from the first column. This outcome isn’t too
surprising given that both OLS and ρ-transformed models produce unbiased
estimates of β1 and β2. The difference is in the standard errors. The standard error on
each of the Year and Year² variables has almost doubled, which has almost halved
the t statistics for β̂1 and β̂2 to near 3. In this particular instance, the relationship
between year and temperature is so strong that even with these larger standard
errors, we reject the null hypotheses of no relationship at conventional significance
levels (such as α = 0.05 or α = 0.01). What we see, though, is the large effect on the
standard errors of addressing autocorrelation.
Several aspects of the results from the ρ-transformed model are worth noting.
First, the ρ̂ from the auxiliary regression is now very small (−0.021) and statistically
insignificant, indicating that we have indeed purged the model of first-order
autocorrelation. Well done! Second, the R² is lower in the ρ-transformed model. It
reports the traditional goodness-of-fit statistic for the transformed model, but it
is not directly meaningful or comparable to the R² in the original OLS model. Third,
the constant changes quite a bit, from 155.68 to 79.97. Recall from the footnote on
page 470 that the constant in the ρ-transformed model is actually β0(1 − ρ̂),
where ρ̂ is the estimate of autocorrelation in the untransformed model. This means
that the estimate of β0 is 79.97/(1 − 0.514) = 164.5, which is reasonably close to the
estimate of β̂0 in the OLS model.
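The back-transformation of the constant is just arithmetic; a quick check in Python using the numbers reported in the text:

```python
# Back out beta0 from the rho-transformed intercept, using the values from the
# text: reported constant 79.97, rho_hat = 0.514.
beta0_hat = 79.97 / (1 - 0.514)
print(round(beta0_hat, 1))  # prints 164.5
```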
13.4 Dynamic Models

In this section, we introduce the dynamic model and discuss three ways in which
the model differs from OLS models.

Yt = γYt−1 + β0 + β1 Xt + εt (13.9)
where the new term is γ times the value of the lagged dependent variable, Yt−1 .
The coefficient γ indicates the extent to which the dependent variable depends on
its lagged value. The higher it is, the more the dependence across time. If the data
is really generated according to a dynamic process, omitting the lagged dependent
variable would put us at risk for omitted variable bias, and given that the coefficient
on the lagged dependent variable is often very large, that means we risk large
omitted variable bias if we omit the lagged dependent variable when γ ≠ 0.
As a practical matter, a dynamic model with a lagged dependent variable is
super easy to implement: just add the lagged dependent variable as an independent
variable.
If γ is near one, the dependent variable has a lot of memory. A change in one
period strongly affects the value of the dependent
variable in the next period. In this case, the long-term effect of X will be much
bigger than β̂1 because the estimated long-term effect will be β̂1 divided by a
small number, 1 − γ̂. If γ is near zero, on the other hand, then the dependent variable has
little memory, meaning that the dependent variable depends little on its value in
the previous period. In this case, the long-term effect of X will be pretty much β̂1
because the estimated long-term effect will be β̂1 divided by a number close to 1.
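A tiny numerical illustration of the long-term effect, using hypothetical estimates rather than values from the chapter:

```python
# Long-term effect of X in a dynamic model: beta1_hat / (1 - gamma_hat).
# The values below are hypothetical, chosen for illustration only.
def long_term_effect(beta1_hat: float, gamma_hat: float) -> float:
    return beta1_hat / (1.0 - gamma_hat)

high_memory = long_term_effect(2.0, 0.8)    # roughly 10: small denominator
low_memory = long_term_effect(2.0, 0.05)    # roughly 2.1: denominator near 1
```

With a lot of memory (γ̂ = 0.8), a one-unit permanent increase in X eventually moves Y by five times the one-period effect; with little memory (γ̂ = 0.05), the long-term and short-term effects are nearly the same.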
8 The condition that the absolute value of γ is less than 1 rules out certain kinds of explosive
processes where Y gets increasingly bigger or smaller every period. This condition is related to a
requirement that data be “stationary,” as discussed on page 476.
REMEMBER THIS
1. A dynamic time series model includes a lagged dependent variable as a control variable. For
example,
Yt = γYt−1 + β0 + β1 Xt + εt
13.5 Stationarity
stationarity: A time series term indicating that a variable has the same distribution
throughout the entire time series. Statistical analysis of non-stationary variables
can yield spurious regression results.

We also need to think about stationarity when we analyze time series data. A
stationary variable has the same distribution throughout the entire time series. This
is a complicated topic, and we’ll only scratch the surface here. The upshot is that
stationarity is good and its opposite, non-stationarity, is bad. When working with
time series data, we want to make sure our data is stationary.
In this section, we define non-stationarity as a so-called unit root problem and
then explain how spurious regression results are a huge danger with non-stationary
data. Spurious regression results are less likely with stationary data. We also show
how to detect non-stationarity and what to do if we find it.
Yt = γYt−1 + εt (13.10)
Y3 = γY2 + ε3
   = γ(γY1 + ε2) + ε3
   = γ(γ(γY0 + ε1) + ε2) + ε3
   = γ³Y0 + γ²ε1 + γε2 + ε3

When γ < 1, the effect of any given value of Y will decay over time. In this case,
the effect of Y0 on Y3 is γ³Y0; because γ < 1, γ³ will be less than one. We could
extend the foregoing logic to show that the effect of Y0 on Y4 will be γ⁴Y0, which is
less than γ³Y0, when γ < 1. The effect of the error terms in a given period will also
have a similar pattern. This case presents some differences from standard OLS, but
it turns out that because the effects of previous values of Y and error fade away,
we will not face a fundamental problem when we estimate coefficients.
What if we have γ > 1? In this case, we’d see an explosive process because
the value of Y would grow by an increasing amount. Time series analysts rule out
such a possibility on theoretical grounds. Variables just don’t explode like this,
certainly not indefinitely, as implied by a model with γ > 1.
The tricky case occurs when γ = 1 exactly. In this case, the variable is said
to have a unit root. In a model with a single lag of the dependent variable, a unit
root simply means that the coefficient on the lagged dependent variable (γ for the
model as we’ve written it) is equal to one. The terminology is a bit quirky: “unit”
refers to the number 1, and “root” refers to the source of something, in this case the
lagged dependent variable that is a source for the value of the dependent variable.

unit root: A variable with a unit root has a coefficient equal to one on the lagged
variable in an autoregressive model.
9 Other problems are that the coefficient on the lagged dependent variable will be biased downward,
preventing the coefficient divided by its standard error from following a t distribution.
10 Zorro’s slashes would probably go more side to side, so maybe think of unit root variables as
slashed by an inebriated Zorro.
11 The citations and additional notes section on page 565 has code to simulate variables with unit
roots and run regressions using those variables. Using the code makes it easy to see that the
proportion of simulations with statistically significant (spurious) results is very high.
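The simulation the footnote describes can be sketched as follows in Python with numpy. The sample size and number of simulations are our choices for illustration, not the book's:

```python
import numpy as np

# Simulate the spurious regression problem: two independent random walks
# (unit root variables), regressed on each other many times.
rng = np.random.default_rng(0)
n, sims = 100, 200
significant = 0
for _ in range(sims):
    y = np.cumsum(rng.normal(size=n))   # random walk: gamma = 1
    x = np.cumsum(rng.normal(size=n))   # an independent random walk
    X = np.column_stack([np.ones(n), x])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    s2 = resid @ resid / (n - 2)
    se_b1 = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    if abs(b[1] / se_b1) > 1.96:        # "significant" at the 5 percent level
        significant += 1

share_spurious = significant / sims     # typically well over half
```

Even though y and x are generated completely independently, far more than 5 percent of the regressions look statistically significant, which is exactly the spurious regression danger.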
[Figure: Two unrelated unit root variables, Y and X, plotted over time in panels (a) and (b), and Y plotted against X with an OLS fitted line in panel (c). The regression of Y on X yields β1 = −0.81 with a t statistic for β1 of −36.1, a spurious result.]

[Figure: Two unrelated stationary variables, Y and X, plotted over time in panels (a) and (b), and Y plotted against X in panel (c). Here the regression yields β1 = −0.08 with a t statistic for β1 of −0.997, correctly suggesting no relationship.]
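The differenced equation that the following "where" clause refers to appears to have been lost in extraction; it can be reconstructed from context by subtracting Yt−1 from both sides of Equation 13.10:

```latex
\Delta Y_t = Y_t - Y_{t-1} = (\gamma - 1)Y_{t-1} + \epsilon_t = \alpha Y_{t-1} + \epsilon_t
```

where, as the text explains below, the new coefficient is labeled α = γ − 1.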
where the dependent variable ΔYt is now the change in Y in period t and the
independent variable is the lagged value of Y. We pronounce ΔYt as “delta Y.”
Here we’re using notation suggesting a unit root test for the dependent variable.
We also run unit root tests with the same approach for independent variables.
This transformation allows us to reformulate the model in terms of a new
coefficient we label as α = γ − 1. Under the null hypothesis that γ = 1, our
new parameter α equals 0. Under the alternative hypothesis that γ < 1, our new
parameter α is less than 0.
It’s standard to estimate a so-called augmented Dickey-Fuller test that
includes a time trend and a lagged value of the change of Y (ΔYt−1):

ΔYt = αYt−1 + β0 + β1 Timet + β2 ΔYt−1 + εt (13.11)

where Timet is a variable indicating which time period observation t is. Time is
equal to 1 in the first period, 2 in the second period, and so forth.

augmented Dickey-Fuller test: A test for a unit root in time series data that
includes a time trend and lagged values of the change in the variable as
independent variables.
The focus of the Dickey-Fuller approach is the estimate of α. What we do
with our estimate of α takes some getting used to. The null hypothesis is that Y is
non-stationary. That’s bad. We want to reject the null hypothesis. The alternative
is that Y is stationary. That’s good. If we reject the null hypothesis in favor of
the alternative hypothesis that α < 0, then we are rejecting the non-stationarity of
Y in favor of inferring that Y is stationary.
The catch is that if the variable actually is non-stationary, the estimated
coefficient is not normally distributed, which means the coefficient divided by
its standard error will not have a t distribution. Hence, we have to use so-called
Dickey-Fuller critical values, which are bigger than standard critical values,
making it hard to reject the null hypothesis that the variable is non-stationary. We
show how to implement Dickey-Fuller tests in the Computing Corner at the end
of this chapter; more details are in the references indicated in the Further Reading
section.
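A hand-rolled version of the augmented Dickey-Fuller regression in Equation 13.11, in Python with numpy, makes the mechanics concrete. This sketch computes only the t statistic on α; in practice we compare it to Dickey-Fuller critical values, which canned routines (such as Stata's dfuller or R's tseries::adf.test) supply:

```python
import numpy as np

# Augmented Dickey-Fuller regression (Equation 13.11):
#   dY_t = alpha*Y_{t-1} + b0 + b1*Time_t + b2*dY_{t-1} + error
def adf_t_stat(y: np.ndarray) -> float:
    dy = np.diff(y)                      # dY_t = Y_t - Y_{t-1}
    dep = dy[1:]                         # dY_t for t = 2, ..., T
    X = np.column_stack([
        y[1:-1],                         # lagged level Y_{t-1}
        np.ones(len(dep)),               # constant
        np.arange(2, len(y)),            # time trend
        dy[:-1],                         # lagged change dY_{t-1}
    ])
    b, *_ = np.linalg.lstsq(X, dep, rcond=None)
    resid = dep - X @ b
    s2 = resid @ resid / (len(dep) - X.shape[1])
    se_alpha = np.sqrt(s2 * np.linalg.inv(X.T @ X)[0, 0])
    return float(b[0] / se_alpha)

rng = np.random.default_rng(2)
t_stationary = adf_t_stat(rng.normal(size=300))       # white noise: very negative
t_walk = adf_t_stat(np.cumsum(rng.normal(size=300)))  # unit root: much closer to zero
```

For a stationary white-noise series, the t statistic on α is far into the negative tail; for a random walk, it typically is not negative enough to clear the Dickey-Fuller critical values.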
REMEMBER THIS
A variable is stationary if its distribution is the same for the entire data set. A common violation of
stationarity occurs when data has a persistent trend.
1. Non-stationary data can lead to statistically significant regression results that are spurious when
two variables have similar trends.
2. The test for stationarity is a Dickey-Fuller test. Its most widely used format is an augmented
Dickey-Fuller test:
ΔYt = αYt−1 + β0 + β1 Timet + β2 ΔYt−1 + εt
If we reject the null hypothesis that α = 0, we conclude that the data is stationary and can use
untransformed data. If we fail to reject the null hypothesis that α = 0, we conclude the data is
non-stationary and therefore should use a model with differenced data.
12 Including these variables is not a no-brainer. One might argue that the independent variables are
causing the non-linear time trend, and we don’t want the time trend in there to soak up variance.
Welcome to time series analysis. Without definitively resolving the question, we’ll include time
trends as an analytically conservative approach in the sense that it will typically make it harder, not
easier, to find statistical significance for independent variables.
[Figure: Global temperature (deviation from pre-industrial average, in Fahrenheit; left-hand scale) and carbon dioxide (parts per million; right-hand scale) plotted over time.]
The model is
Temperaturet = γTemperaturet−1 + β0 + β1 Yeart + β2 Yeart² + β3 CO2t + εt (13.12)
where CO2t is a measure of the concentration of carbon dioxide in the atmosphere
at time t. This is a much (much!) simpler model than climate scientists use; our model
simply gives us a broad-brush picture of whether the relationship between carbon
dioxide and temperature can be ascertained in macro-level data.
Our first worry is that the data might not be stationary. If that is the case, there
is a risk of spurious regression. Therefore, the first two columns of Table 13.4 show
Dickey-Fuller results for the substantive variables, temperature and carbon dioxide.
We use an augmented Dickey-Fuller test of the following form:
ΔTemperaturet = αTemperaturet−1 + β1 Yeart + β2 ΔTemperaturet−1 + εt
Recall that the null hypothesis in a Dickey-Fuller test is that the data is
non-stationary. The alternative hypothesis in a Dickey-Fuller test is that the data
is stationary; we will accept this alternative only if the coefficient is sufficiently
negative. (Yes, this way of thinking takes a bit of getting used to.)
To show that data is stationary (which is a good thing!), we need a sufficiently
negative t statistic on the estimate of α. For the temperature variable, the t statistic
in the Dickey-Fuller test is −4.22.13 As we discussed earlier, the critical values for the
Dickey-Fuller test are not the same as those for standard t tests. They are listed at
the bottom of Table 13.4. Because the t statistic on the lagged value of temperature
is more negative than the critical value, even at the one percent level, we can reject
the null hypothesis of non-stationarity. In other words, the temperature data is
stationary. We get a different answer for carbon dioxide. The t statistic is positive.
That immediately dooms a Dickey-Fuller test because we need to see t statistics
more negative than the critical values to be able to reject the null. In this case, we do
not reject the null hypothesis and therefore conclude that the carbon dioxide data
is non-stationary. This means that we should be wary of using the carbon dioxide
variable directly in a time series model.
A good way to begin to deal with non-stationarity is to use differenced data,
which we generate by creating a variable that is the change of a variable in period
t, as opposed to the level of the variable.
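Differencing itself is a one-liner; a minimal Python sketch with hypothetical values (not the chapter's data):

```python
import numpy as np

# Differenced data: the change in a variable rather than its level.
co2_levels = np.array([310.0, 312.5, 315.2, 318.0])  # hypothetical levels
co2_changes = np.diff(co2_levels)                    # level_t - level_{t-1}
# One observation is lost: the first period has no lagged level.
```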
We still need to check for stationarity with the differenced data, though, so back
we go to Table 13.4 for the Dickey-Fuller tests. This time we see that the last two
columns use the changes in the temperature and carbon dioxide variables to test
for stationarity. The t statistic on the lagged value of the change in temperature
of −12.04 allows us to easily reject the null hypothesis of non-stationarity for
temperature. For carbon dioxide, the t statistic on the lagged value of the change
in carbon dioxide is −3.31, which is more negative than the critical value at the
13 So far in this book we have been reporting the absolute value of t statistics, as the sign does not
typically matter. Here we focus on negative t statistics to emphasize the fact that the α coefficient
needs to be negative to reject the null hypothesis of non-stationarity.
[Regression table fragment: (Intercept) 0.992 (0.830) [t = 1.20]; N = 126; R² = 0.110.]
14 Dickey-Fuller tests tend to be low powered (see, e.g., Kennedy 2008, 302). This means that these
tests may fail to reject the null hypothesis when the null is false. For this reason, some people are
willing to use relatively high significance levels (e.g., α = 0.10). The costs of failing to account for
non-stationarity when it is present are high, while the costs of accounting for non-stationarity when
data is stationary are modest. Thus, many researchers are inclined to use differenced data when there
are any hints of non-stationarity (Kennedy 2008, 309).
Conclusion
Time series data is all over: prices, jobs, elections, weather, migration, and
much more. To analyze it correctly, we need to address several econometric
challenges.
One is autocorrelation. Autocorrelation does not cause coefficient estimates
from OLS to be biased and is therefore not as problematic as endogeneity.
Autocorrelation does, however, render the standard equation for the variance of
β̂ (from page 146) inaccurate. Often standard OLS will produce standard errors
that are too small when there is autocorrelation, giving us false confidence about
how precise our understanding of the relationship is.
We can correct for autocorrelation with one of two approaches. We can use
Newey-West standard errors that use OLS β̂ estimates and calculate standard
errors in a way that accounts for autocorrelated errors. Or we can ρ-transform
the data to produce unbiased estimates of β1 and correct standard errors of β̂1 .
Another, more complicated challenge associated with time series data is the
possibility that the dependent variable is dynamic, which means that the value of
the dependent variable in one period depends directly on its value in the previous
period. Dynamic models include the lagged dependent variable as an independent
variable.
Dynamic models exist in an alternative statistical universe. Coefficient
interpretation has short-term and long-term elements. Autocorrelation creates
bias. Including a lagged dependent variable when we shouldn’t creates bias, too.
As a practical matter, time series analysis can be hard. Very hard. This chapter
lays the foundations, but there is a much larger literature that gets funky fast.
In fact, sometimes the many options can feel overwhelming. Here are some
considerations to keep in mind when working with time series data:
• Deal with stationarity. It’s often an advanced topic, but it can be a serious
problem. If either a dependent or an independent variable is non-stationary,
one relatively easy fix is to use variables that measure changes (commonly
referred to as differenced data) to estimate the model.
• It’s probably a good idea to use a lagged dependent variable—and it’s then
advisable to check for autocorrelation. Autocorrelation does not cause bias
in standard OLS, but when a lagged dependent variable is included, it can
cause bias.
should probably lean toward the results from the model with the lagged dependent
variable. If not, we might lean toward the ρ-transformed result. Sometimes we may
simply have to report both and give our honest best sense of which one seems more
consistent with theory and the data.
After reading and discussing this chapter, we should be able to describe and
explain the following key points:
Further Reading
Researchers do not always agree on whether lagged dependent variables should
be included in models. Achen (2000) discusses bias that can occur when lagged
dependent variables are included. Keele and Kelly (2006) present simulation
evidence that the bias that occurs when one includes a lagged dependent variable is
small unless the autocorrelation of errors is quite large. Wilson and Butler (2007)
discuss how the bias is worse for the coefficient on the lagged dependent variable.
De Boef and Keele (2008) discuss error correction models, which can accom-
modate a broad range of time series dynamics. Grant and Lebo (2016) critique
error correction methods. Box-Steffensmeier and Helgason (2016) introduce a
symposium on the approach.
Another relatively advanced concept in time series analysis is cointegration,
a phenomenon that occurs when a linear combination of possibly non-stationary
variables is stationary. Pesaran, Shin, and Smith (2001) provide a widely used
approach that integrates unit root and cointegration tests; Philips (2018)
provides an accessible introduction to these tools.
Pickup and Kellstedt (2017) present a very useful guide to thinking about
models that may have both stationary and non-stationary variables in them.
Stock and Watson (2011) provide an extensive introduction to the use of time
series models to forecast economic variables.
For more on the Dickey-Fuller test and its critical values, see Greene (2003,
638).
488 CHAPTER 13 Time Series: Dealing with Stickiness over Time
Key Terms
AR(1) model (461)
Augmented Dickey-Fuller test (481)
Autoregressive process (460)
Cross-sectional data (459)
Dickey-Fuller test (480)
Dynamic model (474)
Generalized least squares (467)
Lagged variable (461)
Newey-West standard errors (467)
Spurious regression (477)
Stationarity (476)
Time series data (459)
Unit root (477)
Computing Corner
Stata
1. To detect autocorrelation, proceed in the following steps:
** Estimate basic regression model
regress Temp Year
** Save residuals using resid subcommand
predict Err, resid
** Plot residuals over time
scatter Err Year
** Tell Stata which variable indicates time
tsset Year
** Auxiliary regression of residuals on lagged residuals
reg Err L.Err
** "L." for lagged values requires the tsset command
R
1. To detect autocorrelation in R, first make sure that the data is ordered from
earliest to latest observation, and then proceed in the following steps:
# Estimate basic regression model
ClimateOLS = lm(Temp ~ Year)
# Save residuals
Err = resid(ClimateOLS)
# Plot residuals over time
plot(Year, Err)
# Generate lagged residual variable
LagErr = c(NA, Err[1:(length(Err)-1)])
# Auxiliary regression
LagErrOLS = lm(Err ~ LagErr)
# Display results
summary(LagErrOLS)
15. Wooldridge (2013, 425) notes that there is no clear benefit from iterating more than once.
2. To calculate Newey-West standard errors, use the NeweyWest function in the sandwich package (which we need to load with the library command and install the first time we use it; we describe how to install packages on page 86):
library(sandwich)
sqrt(diag(NeweyWest(ClimateOLS, lag = 3, prewhite = FALSE,
adjust = TRUE)))
where ClimateOLS is the OLS model estimated above. The Newey-West
command produces a variance-covariance matrix for the standard errors.
We use the diag function to pull out the relevant parts of it, and we then
take the square root of that. The prewhite and adjust subcommands
are set to produce the same results that the Stata Newey-West command
provides.
The rule of thumb is to set the number of lags equal to the fourth root of
the number of observations. (Yes, that seems a bit obscure, but that’s what
it is.) To calculate this in R use
length(X1)^(0.25)
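For intuition about what the NeweyWest function computes, here is a from-scratch sketch of the Bartlett-weighted Newey-West variance for a bivariate regression (Python for illustration; the data and function name are made up, and this simple version skips the small-sample adjustment that the adjust subcommand applies). Setting lags = 0 collapses it to the ordinary heteroscedasticity-robust (White) variance.

```python
import numpy as np

def newey_west_se(x, y, lags):
    """Newey-West (HAC) standard errors for bivariate OLS, Bartlett weights."""
    n = len(y)
    X = np.column_stack([np.ones(n), x])      # add an intercept
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta                          # residuals
    # "Meat" of the sandwich: sum of e_t^2 x_t x_t' plus Bartlett-weighted
    # cross products of residuals up to the chosen number of lags
    S = (X * e[:, None]).T @ (X * e[:, None])
    for l in range(1, lags + 1):
        w = 1 - l / (lags + 1)                # Bartlett weight
        G = (X[l:] * e[l:, None]).T @ (X[:-l] * e[:-l, None])
        S += w * (G + G.T)
    V = XtX_inv @ S @ XtX_inv                 # sandwich variance estimate
    return beta, np.sqrt(np.diag(V))

# Example with simulated AR(1) errors
rng = np.random.default_rng(1)
x = rng.normal(size=200)
u = np.zeros(200)
for t in range(1, 200):
    u[t] = 0.7 * u[t - 1] + rng.normal()
y = 1 + 2 * x + u

beta, se_nw = newey_west_se(x, y, lags=3)
_, se_white = newey_west_se(x, y, lags=0)     # lags=0 gives White SEs
print(beta, se_nw, se_white)
```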
Exercises
1. The Washington Post published data on bike share ridership (measured
in trips per day) over the month of January 2014. Bike share ridership
is what we want to explain. The Post also provided data on daily low temperatures.
(a) Use an auxiliary regression to assess whether the errors are autocorrelated.
(b) Estimate a model with Newey-West standard errors. Compare the
coefficients and standard errors to those produced by a standard OLS
model.
(c) Estimate a model that corrects for AR(1) autocorrelation using
the ρ-transformation approach.16 Are these results different from a
model in which we do not correct for AR(1) autocorrelation?
(a) Estimate a model of the federal funds rate, controlling for whether
the president was a Democrat, the number of quarters from the last
election, an interaction of the Democrat dummy variable and the
number of quarters from the last election, and inflation. Use a plot
and an auxiliary regression to assess whether there is first-order
autocorrelation.
(b) Estimate the model from part (a) with Newey-West standard errors.
Compare the coefficients and standard errors to those produced by a
standard OLS model.
(c) Estimate the model from part (a) by using the ρ-transformation
approach, and interpret the coefficients.
(d) Estimate the model from part (a), but add a variable for the lagged
value of the federal funds rate. Interpret the results, and use a plot
and an auxiliary regression to assess whether there is first-order
autocorrelation.
(e) Estimate the model from part (c) with the lagged dependent variable.
Use the ρ-transformation approach, and interpret the coefficients.
16. Stata users should use the subcommands as discussed in the Computing Corner.
17. As discussed in the Computing Corner, Stata needs us to specify a variable that indicates the
chronological order of the data. (Not all data sets are ordered sequentially from earliest to latest
observation.) The “date” variable in the data set for this exercise is not formatted to indicate order as
needed by Stata. Therefore, we need to create a variable indicating sequential order:
gen time = _n
which will be the observation number for each observation (which works in this case because the data
is sequentially ordered). Then we need to tell Stata that this new variable is our time series sequence
identifier with
tsset time
which allows us to proceed with Stata’s time series commands. In R, we can use the tools discussed
in the Computing Corner without necessarily creating the “time” variable.
3. The file BondUpdate.dta contains data on James Bond films from 1962 to
2012. We want to know how budget and ratings mattered for how well the
movies did at the box office. Table 13.6 describes the variables.

Table 13.6
GrossRev — Gross revenue, measured in millions of U.S. dollars and adjusted for inflation
Rating — Average rating by viewers on online review sites (IMDb and Rotten Tomatoes) as of April 2013
Budget — Production budget, measured in millions of U.S. dollars and adjusted for inflation
Actor — Name of main actor
Order — A variable indicating the order of the movies; we use this variable as our “time” indicator even though movies are not evenly spaced in time
(a) Estimate an OLS model in which the amount each film grossed is
the dependent variable and ratings and budgets are the independent
variables. Assess whether there is autocorrelation.
(b) Estimate the model from part (a) with Newey-West standard errors.
Compare the coefficients and standard errors to those produced by a
standard OLS model.
(c) Correct for autocorrelation using the ρ-transformation approach. Did
the results change? Did the autocorrelation go away?
(d) Now estimate a dynamic model. Find the short-term and (approximate) long-term effects of a 1-point increase in rating.
(e) Assess the stationarity of the revenue, rating, and budget variables.
(f) Estimate a differenced model and explain the results.
(g) Build from the above models to assess the worth (in terms of revenue)
of specific actors.
14 Advanced OLS
In this section, we derive the equation for β̂1 for a simplified regression model
and then show how β̂1 is unbiased if X and ε are not correlated. The simplified model is

Yi = β1 Xi + εi   (14.1)
Not having β0 in the model simplifies the derivation considerably while retaining
the essential intuition about how the assumptions matter.1
Our goal is to find the value of β̂1 that minimizes the sum of the squared
residuals; this value will produce a line that best fits the scatterplot. The residual
for a given observation is

ε̂i = Yi − β̂1 Xi

and the sum of squared residuals is Σ ε̂i² = Σ (Yi − β̂1 Xi)². We want to figure out what value of β̂1 minimizes this sum. A little simple
calculus does the trick. A function reaches a minimum or maximum at a point
where its slope is flat—that is, where the slope is zero. The derivative is the slope,
so we simply have to find the point at which the derivative is zero.2 The process is
the following:
1. We're actually just forcing β0 to be zero, which means that the fitted line goes through the origin.
In real life, we would virtually never do this; in real life, we probably would be working with a
multivariate model, too.
2. For any given “flat” spot, we have to figure out if we are at a peak or in a valley. It is very easy to do
this. Simply put, if we are at a peak, our slope should get more negative as X gets bigger (we go
downhill); if we are at a minimum, our slope should get bigger as X goes higher. The second
derivative measures changes in the derivative, so it must be negative for a flat spot to be a maximum
(and we need to be aware of things like “saddle points”—topics covered in any calculus book).
14.1 How to Derive the OLS Estimator and Prove Unbiasedness
Equation 14.3, then, is the OLS estimate for β̂1 in a model with no β0. It looks
quite similar to the equation for the OLS estimate of β̂1 in the bivariate model with
β0 (which is Equation 3.4 on page 49). The only difference is that here we do not
subtract X̄ from Xi and Ȳ from Yi. To derive Equation 3.4, we would do steps 1
through 7 using Σ ε̂i² = Σ (Yi − β̂0 − β̂1 Xi)², taking the derivative with respect
to β̂0 and with respect to β̂1 to produce two equations, which we would then solve
simultaneously.
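We can confirm numerically that the closed-form expression minimizes the sum of squared residuals: a brute-force search over candidate slopes lands on (essentially) the same value as the formula. A sketch in Python with made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(1, 5, size=50)
Y = 2.0 * X + rng.normal(size=50)

# Closed-form OLS slope for the model with no constant term
b_formula = np.sum(X * Y) / np.sum(X ** 2)

# Brute force: evaluate the sum of squared residuals on a fine grid of slopes
grid = np.linspace(0, 4, 40001)
ssr = [np.sum((Y - b * X) ** 2) for b in grid]
b_grid = grid[np.argmin(ssr)]

print(b_formula, b_grid)  # the two agree up to the grid spacing
```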
2. Use Equation 14.1 (which is the simplified model we're using here, in
which β0 = 0) to substitute for Yi:

β̂1 = Σ (β1 Xi + εi) Xi / Σ Xi²

3. Distribute the Xi and simplify, using the fact that Σ β1 Xi² = β1 Σ Xi²:

β̂1 = β1 + Σ εi Xi / Σ Xi²   (14.4)
In other words, β̂1 is β1 (the true value) plus an ugly fraction with sums of ε and
X in it.
From this point, we can show that β̂1 is unbiased. Here we need to show the
conditions under which the expected value of β̂1 = β1 . In other words, the expected
value of β̂1 is the value of β̂1 we would get if we repeatedly regenerated data
sets from the original model and calculated the average of all the β̂1 ’s estimated
from these multiple data sets. It’s not that we would ever do this—in fact, with
observational data the task is impossible. Instead, thinking of estimating β̂1 from
multiple realizations from the true model is a conceptual way for us to think about
whether the coefficient estimates on average skew too high, too low, or are just
right.
It helps the intuition to note that we could, in principle, generate the
expected value of β̂1 ’s for an experiment by running it over and over again and
calculating the average of the β̂1 ’s estimated. Or, more plausibly, we could run
a computer simulation in which we repeatedly regenerated data (which would
involve simulating a new i for each observation for each iteration) and calculating
the average of the β̂1 ’s estimated.
To show that β̂1 is unbiased, we use the formal statistical concept of expected
value. The expected value of a random variable is the average value of a large
number of realizations of the random variable—the value we expect the random
variable to be, on average. (For more discussion, see Appendix B on page 538.)

1. Take expectations of both sides of Equation 14.4:

E[β̂1] = E[β1] + E[Σ εi Xi / Σ Xi²]

2. Because β1 is a fixed number, E[β1] = β1.
3. Use the fact that E[k × g(ε)] = k × E[g(ε)] for constant k and random
function g(ε). Here 1/Σ Xi² is a constant (equaling 1 over whatever the sum
of Xi² is), and Σ εi Xi is a function of random variables (the εi's).

E[β̂1] = β1 + (1/Σ Xi²) E[Σ εi Xi]

4. We can move the expectation operator inside the summation because the
expectation of a sum is the sum of expectations:

E[β̂1] = β1 + (1/Σ Xi²) Σ E[εi Xi]   (14.5)
Equation 14.5 means that the expectation of β̂1 is the true value (β1) plus some
number 1/Σ Xi² times the sum of the E[εi Xi]'s. At this point, we use our Very Important
Condition, which is the exogeneity condition that εi and Xi be uncorrelated. We
show next that this condition is equivalent to saying that E[εi Xi] = 0, which means
Σ E[εi Xi] = 0, which will imply that E[β̂1] = β1, which is what we're trying to
show.

The correlation of Xi and εi is

correlation(Xi, εi) = cov(Xi, εi) / √(var(Xi) var(εi))

Setting this correlation (and hence the covariance) to zero and expanding the
covariance, E[(Xi − μX)(εi − με)] = 0, gives

E[Xi εi − Xi με − μX εi + μX με] = 0
4. Using the fact that the expectation of a sum is the sum of expectations, we
can rewrite the equation as

E[Xi εi] − E[Xi με] − E[μX εi] + E[μX με] = 0

5. Using the fact that με and μX are fixed numbers, we can pull them out of
the expectations. Because the error term has mean zero (με = 0), every term
but the first drops out, leaving

E[Xi εi] = 0
If E[Xi εi] = 0, Equation 14.5 tells us that the expected value of β̂1 will be β1.
In other words, if the error term and the independent variable are uncorrelated, the
OLS estimate β̂1 is an unbiased estimator of β1 . The same logic carries through
in the bivariate model that includes β0 and in multivariate OLS models as well.
Showing that β̂1 is unbiased does not say much about whether any given
estimate will be near β1 . The estimate β̂1 is a random variable after all, and it
is possible that some β̂1 will be very low and some will be very high. All that
unbiasedness says is that on average, β̂1 will not run higher or lower than the true
value.
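Unbiasedness is easy to see in a small Monte Carlo: hold X fixed, repeatedly redraw errors that are independent of X, and average the resulting estimates. A sketch in Python with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(3)
n, sims, beta1 = 50, 2000, 2.0
X = rng.uniform(1, 3, size=n)        # fixed across simulations

estimates = []
for _ in range(sims):
    eps = rng.normal(size=n)         # errors drawn independently of X
    Y = beta1 * X + eps              # true model with no constant
    estimates.append(np.sum(X * Y) / np.sum(X ** 2))

print(np.mean(estimates))            # averages out very close to beta1 = 2.0
```

Any single β̂1 can land well above or below 2, but the average across simulated data sets does not.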
REMEMBER THIS
1. We derive the β̂1 equation by setting the derivative of the sum of squared residuals equation to
zero and solving for β̂1 .
2. The key step in showing that β̂1 is unbiased depends on the condition that X and ε are
uncorrelated.
3. In a model that has a non-zero β0, the estimated constant coefficient would absorb any non-zero
mean in the error term. For example, if the mean of the error term is actually 5, the estimated constant
is 5 bigger than what it would be otherwise. Because we so seldom care about the constant term, it’s
reasonable to think of the β̂0 estimate as including the mean value of any error term.
14.2 How to Derive the Equation for the Variance of β̂1
In this section, we show how to derive an equation for the standard error of β̂1 .
This in turn reveals how we use the conditions that errors are homoscedastic and
uncorrelated with each other. Importantly, these assumptions are not necessary for
unbiasedness of OLS estimates. If these assumptions do not hold, we can still use
OLS, but we’ll have to do something different (as discussed in Chapter 13, for
example) to get the right standard error estimates.
We’ll combine two assumptions and some statistical properties of the variance
operator to produce a specific equation for the variance of β̂1 . We assume that the
Xi are fixed numbers and the ’s are random variables.
1. We start with the β̂1 equation (Equation 14.4) and take the variance of both
sides:

var[β̂1] = var[β1 + Σ εi Xi / Σ Xi²]

2. Use the fact that the variance of a sum of a constant (the true value
β1) and a function of a random variable is simply the variance of the
function of the random variable (see variance fact 1 in Appendix C
on page 539).

var[β̂1] = var[Σ εi Xi / Σ Xi²]

3. Note that 1/Σ Xi² is a constant (as we noted on page 497 too), and use
variance fact 2 (on page 540) that the variance of k times a random variable is
k² times the variance of that random variable.

var[β̂1] = (1/Σ Xi²)² var[Σ εi Xi]

4. Because the errors are uncorrelated with each other, the variance of the sum
is the sum of the variances:

var[β̂1] = (1/Σ Xi²)² Σ var[Xi εi]
5. Because the Xi are fixed numbers, var[Xi εi] = Xi² var[εi]:

var[β̂1] = (1/Σ Xi²)² Σ Xi² var[εi]

6. Homoscedasticity means that var[εi] = σ² for every observation:

var[β̂1] = (1/Σ Xi²)² Σ Xi² σ²
= σ² Σ Xi² / (Σ Xi²)²
= σ² / Σ Xi²   (14.6)

If the errors are heteroscedastic, we cannot pull a single σ² out of the sum.
Replacing var[εi] with the squared residual for each observation yields the
heteroscedasticity-consistent version of the variance:

var[β̂1] = (1/Σ Xi²)² Σ Xi² ε̂i²   (14.7)
REMEMBER THIS
1. We derive the variance of β̂1 by starting with the β̂1 equation.
2. If the errors are homoscedastic and not correlated with each other, the variance equation is in
a convenient form.
3. If the errors are heteroscedastic or correlated with each other, OLS estimates are still
unbiased, but the easy-to-use standard OLS equation for the variance of β̂1 is no longer
appropriate.
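Equation 14.6 can be checked by simulation: with fixed X's and homoscedastic, uncorrelated errors, the variance of β̂1 across repeated samples should match σ²/Σ Xi². A sketch in Python with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(4)
n, sims, beta1, sigma = 50, 5000, 2.0, 1.5
X = rng.uniform(1, 3, size=n)               # fixed regressors

theory = sigma ** 2 / np.sum(X ** 2)        # Equation 14.6

estimates = []
for _ in range(sims):
    eps = rng.normal(scale=sigma, size=n)   # homoscedastic, uncorrelated
    Y = beta1 * X + eps
    estimates.append(np.sum(X * Y) / np.sum(X ** 2))

print(np.var(estimates), theory)            # the two should be close
```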
14.3 How to Calculate Power

Pr(Type II error given β1 = β1True) = Pr( β̂1 / se(β̂1) < Critical value | β1 = β1True )   (14.8)

This probability will depend on the actual value of β1, since we know that the
distribution of β̂1 will depend on the true value of β1.

The key element of this equation is Pr( β̂1 / se(β̂1) < Critical value | β1 = β1True ).
This mathematical term seems complicated, but we actually know a fair bit about
it. For a large sample size, the t statistic, which is β̂1 / se(β̂1), will be normally
distributed with a variance of 1 around the true value divided by the standard error
of the estimated coefficient. And from the properties of the normal distribution
(see Appendix G on page 543 for a review), this means that

Pr(Type II error given β1 = β1True) = Φ( Critical value − β1True / se(β̂1) )   (14.9)

where Φ() indicates the normal cumulative distribution function (see page 420 for
more details).
Review Questions
1. For each of the following, indicate the power of the test of the null hypothesis H0 : β1 = 0 against
the alternative hypothesis of HA : β1 > 0 for a large sample size and α = 0.01 for the given true
value of β1 . We’ll assume se( β̂1 ) = 0.75. Draw a sketch to help explain your numbers.
(a) β1True = 1
(b) β1True = 2
2. Suppose the estimated se( β̂1 ) doubled. What will happen to the power of the test for the two
cases in question 1? First, answer in general terms. Then calculate specific answers.
3. Suppose se( β̂1 ) = 2.5. What is the probability of committing a Type II error for each of the
true values given for β1 in question 1?
4. And we can make the calculation a bit easier by using the fact that 1 − Φ(−Z) = Φ(Z) to write the power as Φ( β1True / se(β̂1) − Critical value ).
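The power formula in footnote 4 can be computed directly with the standard normal CDF; no tables needed. A sketch in Python (2.326 is the one-sided critical value for α = 0.01, as in the review questions):

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF, written via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def power(beta1_true, se, critical_value=2.326):
    # Power = Phi(beta1_true / se - critical value), per footnote 4
    return norm_cdf(beta1_true / se - critical_value)

print(power(1, 0.75))   # modest power when the true effect is small
print(power(2, 0.75))   # power rises with the true effect size
```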
14.4 How to Derive the Omitted Variable Bias Conditions
Suppose the true model is

Yi = β0 + β1 X1i + β2 X2i + νi   (14.10)

where Yi is the dependent variable, X1i and X2i are two independent variables, and
νi is an error term that is not correlated with any of the independent variables.
For example, suppose the dependent variable is test scores and the independent
variables are class size and family wealth. We assume (for this discussion) that νi
is uncorrelated with X1i and X2i.
What happens if we omit X2 and estimate the following model?

Yi = β0^OmitX2 + β1^OmitX2 X1i + εi^OmitX2   (14.11)

where we will use the OmitX2 superscript to indicate estimates from the model that omits
variable X2. How close will β̂1^OmitX2 (the coefficient on X1i in Equation 14.11) be to
the true value (β1 in Equation 14.10)? In other words, will β̂1^OmitX2 be an unbiased
estimator of β1? This situation is common with observational data because we
will almost always suspect that we are missing some variables that explain our
dependent variable.
The equation for β̂1^OmitX2 is the equation for a bivariate slope coefficient (see
Equation 3.4). It is

β̂1^OmitX2 = Σ (X1i − X̄1)(Yi − Ȳ) / Σ (X1i − X̄1)²   (14.12)
Will β̂1^OmitX2 be an unbiased estimator of β1? With a simple substitution and a
bit of rearranging, we can answer this question. We know from Equation 14.10 that
the true value of Yi is β0 + β1 X1i + β2 X2i + νi. Because the values of β are fixed,
the average of each is simply its value. That is, β̄0 = β0, and so forth. Therefore,
Ȳ will be β0 + β1 X̄1 + β2 X̄2 + ν̄. Substituting for Yi and Ȳ in Equation 14.12 and
doing some rearranging yields

β̂1^OmitX2 = Σ (X1i − X̄1)(β0 + β1 X1i + β2 X2i + νi − β0 − β1 X̄1 − β2 X̄2 − ν̄) / Σ (X1i − X̄1)²
= Σ (X1i − X̄1)(β1 (X1i − X̄1) + β2 (X2i − X̄2) + νi − ν̄) / Σ (X1i − X̄1)²

Gathering terms and recalling that Σ (X1i − X̄1)² / Σ (X1i − X̄1)² = 1 yields

β̂1^OmitX2 = β1 + β2 Σ (X1i − X̄1)(X2i − X̄2) / Σ (X1i − X̄1)² + Σ (X1i − X̄1)(νi − ν̄) / Σ (X1i − X̄1)²

Taking expectations—and noting that the last term has expectation zero because ν
is uncorrelated with X1—leaves us with

E[β̂1^OmitX2] = β1 + β2 Σ (X1i − X̄1)(X2i − X̄2) / Σ (X1i − X̄1)²   (14.13)
meaning that the expected value of β̂1^OmitX2 is β1 plus β2 times a messy fraction.
In other words, the estimate β̂1^OmitX2 will deviate, on average, from the true value,
β1, by β2 Σ (X1i − X̄1)(X2i − X̄2) / Σ (X1i − X̄1)².

Note that Σ (X1i − X̄1)(X2i − X̄2) / Σ (X1i − X̄1)² is simply the equation for the estimate of δ̂1 from
the following model:

X2i = δ0 + δ1 X1i + τi

See, for example, page 49, and note the use of X2i and X̄2 where we had Yi and Ȳ
in the standard bivariate OLS equation.
We can therefore conclude that our coefficient estimate β̂1^OmitX2 from the
model that omitted X2 will be an unbiased estimator of β1 if β2 δ̂1 = 0. This
condition is most easily satisfied if β2 = 0. In other words, if X2 has no effect
on Y (meaning β2 = 0), then omitting X2 does not cause our coefficient estimate
to be biased. This is excellent news. If it were not true, our model would have to
include variables that had nothing to do with Y. That would be a horrible way to
live.
The other way for β2 δ̂1 to be zero is for δ̂1 to be zero, which happens
whenever X1 would have a coefficient of zero in a regression in which X2
is the dependent variable and X1 is the independent variable. In short, if X1
and X2 are independent (such that regressing X2 on X1 yields a slope coefficient
of zero), then even though we omitted X2 from the model, β̂1^OmitX2 will
be an unbiased estimate of β1, the true effect of X1 on Y (from Equation 14.10).
No harm, no foul.
The flip side of these conditions is that when we estimate a model that
omits a variable that affects Y (meaning that β2 ≠ 0) and is correlated with
the included variable, OLS will be biased. The extent of the bias depends on
how much the omitted variable explains Y (which is determined by β2 ) and
how much the omitted variable is related to the included variable (which is
reflected in δ̂1).
What is the takeaway here? Omitted variable bias is a problem if both of the
following conditions are met: (1) the omitted variable actually matters (β2 ≠ 0)
and (2) X2 (the omitted variable) is correlated with X1 (the included variable).
This shorthand is remarkably useful in evaluating OLS models.
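The omitted variable bias algebra can be verified on simulated data: the coefficient from the model that omits X2 equals the full-model coefficient on X1 plus β̂2 times δ̂1 from the auxiliary regression, as an exact in-sample identity. A sketch in Python with made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
X1 = rng.normal(size=n)
X2 = 0.6 * X1 + rng.normal(size=n)              # X2 correlated with X1
Y = 1 + 2 * X1 + 3 * X2 + rng.normal(size=n)    # true model: beta1=2, beta2=3

def ols(y, *xs):
    """OLS coefficients (constant first) via least squares."""
    X = np.column_stack([np.ones(len(y))] + list(xs))
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_full = ols(Y, X1, X2)      # [const, beta1_hat, beta2_hat]
b_omit = ols(Y, X1)[1]       # slope from the model that omits X2
delta1 = ols(X2, X1)[1]      # auxiliary regression of X2 on X1

print(b_omit, b_full[1] + b_full[2] * delta1)   # identical up to rounding
```

Here β2 > 0 and the variables are positively correlated, so the omitting model's slope is biased upward.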
14.5 Anticipating the Sign of Omitted Variable Bias
REMEMBER THIS
The conditions for omitted variable bias can be derived by substituting the true value of Y into the β̂1
equation for the model with X2 omitted.
Suppose, for example, that we want to know the effect of education on income and estimate

Incomei = β0 + β1 Educationi + εi   (14.14)

where Incomei is the monthly salary or wages of individual i and Educationi is the
number of years of schooling individual i completed. We are worried, as usual,
that certain factors in the error term are correlated with education.
We worry, for example, that some people are more productive than others
(a factor in the error term that affects income) and that productive folks are more
likely to get more schooling (school may be easier for them). In other words, we
fear the true equation is
6. Another option is to use panel data that allows us to control for certain unmeasured factors, as we
did in Chapter 8. Or we can try to find exogenous variation in education (variation in education that is
not due to differences in productivity); that’s what we did in Chapter 9.
TABLE 14.1 Sign of Omitted Variable Bias

The bias on β̂1 is β2 times the relationship between X1 and X2:
• β2 > 0 and X1, X2 positively correlated: positive bias
• β2 > 0 and X1, X2 negatively correlated: negative bias
• β2 < 0 and X1, X2 positively correlated: negative bias
• β2 < 0 and X1, X2 negatively correlated: positive bias

Cell entries show the sign of bias for an omitted variable bias problem in which a single variable (X2) is omitted.
The true equation is Equation 14.10 and the estimated model is Equation 14.11. If β2 > 0 and X1 and X2 are
positively correlated, β̂1^OmitX2 (the expected value of the coefficient on X1 from a model that omits X2) will be
larger than the actual value of β1.
Hence, the bias will be positive: it is β2 > 0 (the effect of productivity on
income) times the positive relationship between productivity and education. A
positive bias implies that omitting productivity inflates the estimated coefficient
on education. In other words, the effect of education on income in a model that
does not control for productivity will be overstated. The magnitude of the bias
will be related to how strong these two components are.
If we think productivity has a huge effect on income and is strongly related to
education levels, then the size of the bias is large.
In this example, this bias would lead us to be skeptical of a result from a
model like Equation 14.14 that omits productivity. In particular, if we were to find
that β̂1 is greater than zero, we would worry that the omitted variable bias had
inflated the estimate. On the other hand, if the results showed that education did
not matter or had a negative coefficient, we would be more confident in our results
because the bias would on average make the results larger than the true value, not
smaller. This line of reasoning, called “signing the bias,” would lead us to treat the
estimated effects based on Equation 14.14 as an upper bound on the likely effects
of education on income.
Table 14.1 summarizes the relationship for the simple case of one omitted
variable. If X2, the omitted variable, has a positive effect on Y (meaning β2 > 0)
and X2 and X1 are positively correlated, then a model with only X1 will produce
a coefficient on X1 that is biased upward: the estimate will be too big because
some of the effect of the unmeasured X2 will be absorbed by the variable X1.
REMEMBER THIS
We can use the equation for omitted variable bias to anticipate the effect of omitting a variable on the
coefficient estimate for an included variable.
14.6 Omitted Variable Bias with Multiple Variables
Discussion Questions
1. Suppose we are interested in knowing how much social media affect people’s income. Suppose
also that Facebook provided us data on how much time each individual spent on the site during
work hours. The model is
What is the implication of not being able to measure innate productivity for our estimate
of β1 ?
2. Suppose we are interested in knowing the effect of campaign spending on election outcomes.
We believe that the personal qualities of a candidate also matter. Some are more charming
and/or hardworking than others, which may lead to better election results for them. What is the
implication of not being able to measure “candidate quality” (which captures how charming
and hardworking candidates are) for our estimate of β1 ?
Assuming that the error in the true model (ν) is not correlated with any of the
independent variables, the expected value for β̂1^OmitX3 is

E[β̂1^OmitX3] = β1 + β3 × [(r31 − r21 r32) / (1 − r21²)] × √(V3 / V1)   (14.18)

where r31 is the correlation of X3 and X1, r21 is the correlation of X2 and X1, r32
is the correlation of X3 and X2, and V3 and V1 are the variances of X3 and X1,
respectively. Clearly, there are more moving parts in this case than in the case we
discussed earlier.
REMEMBER THIS
When there are multiple variables in the true equation, the effect of omitting one of them depends in
a complicated way on the interrelations of all variables.
1. As in the simpler model, if the omitted variable does not affect Y, there is no omitted variable
bias.
2. The equation for omitted variable bias when the true equation has only two variables often
provides a reasonable approximation of the effects for cases in which there are multiple
independent variables.
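The correlation expression in Equation 14.18 can be checked on simulated data: computed from sample correlations and standard deviations, it reproduces, as an exact in-sample identity, the gap between the coefficient that omits X3 and the full-model coefficient. A sketch in Python with made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2000
X1 = rng.normal(size=n)
X2 = 0.5 * X1 + rng.normal(size=n)
X3 = 0.4 * X1 + 0.3 * X2 + rng.normal(size=n)
Y = 1 + 2 * X1 + 1 * X2 + 3 * X3 + rng.normal(size=n)

def ols(y, *xs):
    X = np.column_stack([np.ones(len(y))] + list(xs))
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_full = ols(Y, X1, X2, X3)     # [const, b1, b2, b3]
b_omit = ols(Y, X1, X2)[1]      # coefficient on X1 when X3 is omitted

# Sample versions of the ingredients in Equation 14.18
r21 = np.corrcoef(X2, X1)[0, 1]
r31 = np.corrcoef(X3, X1)[0, 1]
r32 = np.corrcoef(X3, X2)[0, 1]
sd_ratio = np.std(X3) / np.std(X1)
bias = b_full[3] * (r31 - r21 * r32) / (1 - r21 ** 2) * sd_ratio

print(b_omit, b_full[1] + bias)   # identical up to rounding
```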
14.7 Measurement Error

Suppose the true model is

Yi = β0 + β1 X1i∗ + εi   (14.19)

but we observe X1i∗ only with error:

X1i = X1i∗ + νi   (14.20)

where we assume that νi is uncorrelated with X1i∗. This little equation will do a lot
of work for us in helping us understand the effect of measurement error.

Substituting X1i∗ = X1i − νi into the true model yields

Yi = β0 + β1 (X1i − νi) + εi
= β0 + β1 X1i − β1 νi + εi   (14.21)

Let's treat ν as the omitted variable and −β1 as the coefficient on the omitted
variable. (Compare these to X2 and β2 in Equation 5.7.) Doing so allows us to
write the omitted variable bias equation as

β1^OmitX2 = β1 − β1 cov(X1, ν) / var(X1)   (14.22)

Because cov(X1, ν) = σν² and var(X1) = σX∗² + σν², this becomes

β1^OmitX2 = β1 − β1 σν² / (σX∗² + σν²)   (14.23)

which we can rewrite as

plim β̂1 = β1 (1 − σν² / (σν² + σX∗²))
7. First, note that cov(X1, ν) = cov(X1∗ + ν, ν) = cov(X1∗, ν) + cov(ν, ν) = cov(ν, ν) because ν is not correlated with X1∗. Finally, note that cov(ν, ν) = σν² by standard rules of covariance.
Finally, we use the fact that 1 − σν² / (σν² + σX∗²) = σX∗² / (σν² + σX∗²) to produce

plim β̂1 = β1 × σX∗² / (σν² + σX∗²)
REMEMBER THIS
1. We can use omitted variable logic to derive the effect of a poorly measured independent
variable.
2. A single poorly measured independent variable can cause other coefficients to be biased.
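The attenuation result is easy to see by simulation: adding measurement error with variance σν² = 0.25 to a regressor with variance σX∗² = 1 should shrink the slope by the factor 1/(1 + 0.25) = 0.8. A sketch in Python with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(7)
n, beta1 = 100_000, 2.0
X_star = rng.normal(size=n)              # true regressor, variance 1
nu = rng.normal(scale=0.5, size=n)       # measurement error, variance 0.25
X_obs = X_star + nu                      # mismeasured regressor we observe
Y = 1 + beta1 * X_star + rng.normal(size=n)

slope = np.cov(X_obs, Y)[0, 1] / np.var(X_obs, ddof=1)
attenuation = 1.0 / (1.0 + 0.25)         # sigma_X*^2 / (sigma_X*^2 + sigma_nu^2)
print(slope, beta1 * attenuation)        # both close to 1.6
```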
[Figure: causal diagram. X1, the independent variable (example: 9th-grade tutoring), affects the post-treatment variable X2 (example: 12th-grade reading score) by α and the dependent variable Y (example: age-26 earnings) directly by γ1. X2 affects Y by γ2. An unobserved confounder U (example: intelligence) affects X2 by ρ1 and Y by ρ2.]
If we estimate a model with only the tutoring variable,

Earningsi = β0 + β1 Tutori + εi

then β̂1, the estimated coefficient on X1, will in expectation equal the true effect
of X1, which is γ1 + αγ2.

Our interest here is in what happens when we include a post-treatment
variable:

Earningsi = β0 + β1 Tutori + β2 ReadingScorei + εi   (14.26)

In this case, we can work out the expected values of the estimated coefficients
β̂1 and β̂2 . First, note that in the true model, the effect on Earnings of a one-unit
increase in Tutor is γ1 + γ2 α. The direct effect is γ1 , and the indirect effect is γ2 α
(because tutoring also affects reading by α and reading affects earnings by γ2 ).
Also note that in the true model, the effect on Earnings of a one-unit increase
in Intelligence is ρ2 + γ2ρ1. The direct effect of intelligence is ρ2, and the indirect
effect of intelligence is γ2 ρ1 (because intelligence also affects reading by ρ1 and
reading affects earnings by γ2 ).
We first substitute the true equation for reading scores (Equation 14.24) into
the estimated equation for earnings (14.26), producing
TABLE 14.2 Parameter values and the resulting expected coefficient values

Row   γ1   ρ1     ρ2   α   γ2   E[β̂1]   E[β̂2]
1     1    1      1    1   1    0        2
2     1    0.5    1    1   1    −1       3
3     1    0.01   1    1   1    −99      101
4     1    1      5    1   1    −4       6
8. If you must know, do the following: (1) isolate β1 on the left-hand side of the equation: E[β̂1] = E[−β̂2 α + γ1 + γ2 α]; (2) substitute for E[β̂2]: −α(γ2 + ρ2/ρ1) + γ1 + γ2 α = γ1 − α ρ2/ρ1.
Table 14.2 shows several parameter combinations and the expected values of the coefficients from a model with the
independent variable and post-treatment variable both included. The first line has
an extremely simple case in which α, the ρ's, and the γ's all equal 1. The actual direct
effect of X1 is 1, but the expected value of the coefficient on X1 will be 0. Not
great. In row 2, we set the effect of U on X2 to 0.5, and the expected value
of the coefficient on X1 falls to −1 even though the actual direct effect is still 1.
In row 3, we set the effect of U on X2 to 0.01, and now things get really crazy:
the expected value of the effect of X1 plummets to −99 even though the true direct
effect (γ1) is still just 1. This is nuts! Row 4 shows another example, still not good.
Exercise 4 in Chapter 7 provides a chance to simulate more examples.
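The first row of the table can be reproduced by simulation. A sketch in Python; giving the reading-score equation no error term of its own matches the exact expected values in footnote 8:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 50_000
alpha, rho1, rho2, gamma1, gamma2 = 1.0, 1.0, 1.0, 1.0, 1.0   # row 1 values

Tutor = rng.normal(size=n)                  # X1, randomly assigned
U = rng.normal(size=n)                      # unobserved intelligence
Reading = alpha * Tutor + rho1 * U          # X2, the post-treatment variable
Earnings = gamma1 * Tutor + gamma2 * Reading + rho2 * U + rng.normal(size=n)

def ols(y, *xs):
    X = np.column_stack([np.ones(len(y))] + list(xs))
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_total = ols(Earnings, Tutor)[1]           # about gamma1 + alpha*gamma2 = 2
b_post = ols(Earnings, Tutor, Reading)[1]   # about gamma1 - alpha*rho2/rho1 = 0

print(b_total, b_post)
```

Conditioning on the post-treatment reading score wipes out the estimated tutoring effect even though the true direct effect is 1.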
Conclusion
OLS goes a long way with just a few assumptions about the model and the
error terms. Exogeneity gets us unbiased estimates if there are no post-treatment
variables. Homoscedasticity and non-correlated errors get us an equation for the
variance of our estimates.
How important is it to be able to know exactly how these assumptions come
together to provide all this good stuff? On a practical level, not very. We can
go about most of our statistical business without knowing how to derive these
results.
On a deeper level, though, it is useful to know how the assumptions matter.
The statistical properties of OLS are not magic. They’re not even that hard, once
we break the derivations down step by step. The assumptions we rely on play
specific roles in figuring out the properties of our estimates, as we have seen in the
derivations in this chapter. We also formalized and extended our understanding of
bias. First, we focused on omitted variable bias, deriving the omitted variable bias
conditions and exploring how omitted variable bias arises in various contexts. Then we
derived post-treatment collider bias for a reasonably general context.
We don’t need to be able to produce all the derivations from scratch. If we can
do the following, we will have a solid understanding of the statistical foundations
of OLS:
• Section 14.1: Explain the steps in deriving the equation for the OLS
estimate of β̂1 . What assumption is crucial for β̂1 to be an unbiased
estimator of β1 ?
• Section 14.3: Show how to calculate power for a given true value of β.
• Section 14.4: Show how to derive the omitted variable bias equation.
• Section 14.5: Show how to use the omitted variable bias equation to “sign
the bias.”
• Section 14.6: Explain how omitted variable bias works when the true model
contains multiple variables.
• Section 14.7: Show how to use omitted variable bias tools to characterize
the effect of measurement error.
Further Reading
See Clarke (2005) for further details on omitted variables. Greene (2003, 148)
offers a generalization that uses matrix notation.
Greene (2003, 86) also discusses the implications of measurement error when
the model contains multiple independent variables. Cragg (1994) provides an
accessible overview of problems raised by measurement error and offers strategies
for dealing with them.
Key Term
Expected value (496)
Computing Corner
Stata
1. To estimate OLS models, use the tools discussed in the Computing Corner
in Chapter 5.
R
1. To estimate OLS models, use the tools discussed in the Computing Corner
in Chapter 5.
Exercises
1. Apply the logic developed in this chapter to the model Yi = β0 + β1 Xi + i .
(There was no β0 in the simplified model we used in Section 14.1.) Derive
the OLS estimate for β̂0 and β̂1 .
2. Show that the OLS estimate β̂1 is unbiased for the model Yi = β0 +
β1 Xi + i .
(a) Run a model with medals as the dependent variable and population
as the independent variable, and briefly interpret the results.
(b) The model given omits GDP (among other things). Use tools
discussed in Section 14.5 to anticipate the sign of the omitted variable
bias for β̂1 in the results in part (a) that is due to omission of GDP
from that model.
(c) Estimate a model explaining medals with both population and GDP.
Was your prediction about omitted variable bias correct?
(d) Note that we have also omitted a variable for whether a country is
the host for the Winter Olympics. Sign the bias of the coefficient
on population in part (a) that is due to omission of the host country
variable.
Variables in the Winter Olympics data (excerpt):
time: A time variable equal to 1 for first Olympics in data set (1980), 2 for second Olympics (1984), and so forth. Useful for time series analysis.
medals: Total number of combined medals won
host: Dummy variable indicating if country hosted Olympics in that year (1 = hosted, 0 = otherwise)
temp: Average high temperature (in Fahrenheit) in January (in July for countries in the Southern Hemisphere)
elevation: Highest peak elevation in the country
(e) Estimate a model explaining medals with both population and host
(do not include GDP at this point). Was your prediction about
omitted variable bias correct?
(g) Use tips in the Computing Corner to create a new GDP variable
called NoisyGDP that is equal to the actual GDP plus a standard
normally distributed random variable. Think of this as a measure of
GDP that has been corrupted by a measurement error. (Of course,
the actual GDP variable itself is almost certainly tainted by some
measurement error already.) Estimate the model from part (f), but
use NoisyGDP instead of GDP. Explain changes in the coefficient
on GDP, if any.
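As a rough guide to what part (g) should produce, the sketch below simulates attenuation bias in Python (the Computing Corners use Stata and R; this is only an illustration). All numbers here — the true slope of 2, unit-variance regressor, and standard normal measurement error — are assumptions, not values from the exercise's data.

```python
# Sketch of attenuation bias from measurement error in a regressor.
# The setup (true slope, variances) is hypothetical, chosen so the
# attenuation factor var(x) / (var(x) + var(noise)) equals 1/2.
import random

random.seed(42)
n = 20000
beta = 2.0
x = [random.gauss(0, 1) for _ in range(n)]
y = [beta * xi + random.gauss(0, 1) for xi in x]
x_noisy = [xi + random.gauss(0, 1) for xi in x]  # regressor corrupted by noise

def ols_slope(xs, ys):
    """Bivariate OLS slope: cov(x, y) / var(x)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

# The slope on the noisy regressor shrinks toward zero by roughly
# var(x) / (var(x) + var(noise)), here about one-half.
print(ols_slope(x, y), ols_slope(x_noisy, y))
```

The same logic explains why the coefficient on NoisyGDP should move toward zero relative to the coefficient on GDP.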
(b) Use the standard error from your results to calculate the statistical
power of a test of H0 : βruns_scored = 0 versus HA : βruns_scored > 0
(c) Suppose we had much less data than we actually do, such that the
standard error on the coefficient on runs_scored were 900 (which
is much larger than what we estimated). Use the standard error
of 900 to calculate the statistical power of a test of
H0 : βruns_scored = 0 versus HA : βruns_scored > 0 with α = 0.05
(assuming a large sample for simplicity) for the three cases described
in part (b).
(d) Suppose we had much more data than we actually do, such that
the standard error on the coefficient on runs_scored were 200 (which
is much smaller than what we estimated). Use the standard error
of 200 to calculate the statistical power of a test of
H0 : βruns_scored = 0 versus HA : βruns_scored > 0 with α = 0.05
(assuming a large sample for simplicity) for the three cases described
in part (b).
(e) Discuss the differences across the power calculations for the different
standard errors.
15 Advanced Panel Data
15.1 Panel Data Models with Serially Correlated Errors
the stuff in the error term? Lots of that will stick around for a while. Unmeasured
factors in year 1 may linger to affect what is going on in year 2, and so on. In this
section, we explain how to deal with autocorrelation in panel models, first without
fixed effects and then with fixed effects.
Before we get into diagnosing and addressing the problem, let’s recall the
stakes. Autocorrelation does not cause bias in the standard OLS framework, but it
does cause OLS estimates of standard errors to be incorrect. In fact, it often causes
the OLS estimates of standard errors to be too small because we don’t really have
the number of independent observations that OLS thinks we do.
Yit = β0 + β1 X1it + · · · + εit
εit = ρ εi,t−1 + νit

where νit is a mean-zero, random error term that is not correlated with the
independent variables. There are N units and T time periods in the panel data
set. We limit ourselves to first-order autocorrelation (where the error this period is
a function of the error last period). The tools we discuss generalize pretty easily
to higher orders of autocorrelation.1
Estimation is relatively simple. First, we use standard OLS to estimate the
model. We then use the residuals from the OLS model to test for evidence of
autocorrelated errors. This works because OLS β̂ estimates are unbiased even if
errors are autocorrelated, which means that the residuals (which are functions of
the data and β̂) are unbiased estimates, too.
We test for autocorrelated errors in this context using something called a
Lagrange multiplier (LM) test. The LM test is similar to our test for autocorrelation
in Chapter 13 on page 465. It involves estimating the following:

ε̂it = ρ ε̂i,t−1 + γ1 X1it + · · · + ηit

where ηit (η is the Greek letter eta) is a mean-zero, random error term. We use
the fact that N × R² from this auxiliary regression is distributed χ²₁ under the null
hypothesis of no autocorrelation.
If the LM test indicates autocorrelation, we will use ρ-transformation
techniques we discussed in Section 13.3 to estimate an AR(1) model.
1. A second-order autocorrelated process would also have the error in period t correlated with the error in period t − 2, and so on.
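The LM procedure can be sketched in a small Python simulation (the book's Computing Corners use Stata and R; this is only an illustration, and it omits the X terms from the auxiliary regression for brevity). The true model here, with slope 0.5 and AR(1) parameter 0.6, is entirely made up.

```python
# Rough simulation of the LM test: fit OLS, regress the residuals on
# their lag, and compare N * R^2 to the chi-squared(1) critical value 3.84.
import random

random.seed(7)
T = 2000
rho = 0.6  # hypothetical AR(1) parameter for the errors
e = [random.gauss(0, 1)]
for _ in range(T - 1):
    e.append(rho * e[-1] + random.gauss(0, 1))
x = [random.gauss(0, 1) for _ in range(T)]
y = [1.0 + 0.5 * x[t] + e[t] for t in range(T)]  # made-up true model

def ols(xs, ys):
    """Return (intercept, slope) from a bivariate OLS fit."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((a - mx) * (c - my) for a, c in zip(xs, ys)) / \
        sum((a - mx) ** 2 for a in xs)
    return my - b * mx, b

a_hat, b_hat = ols(x, y)
resid = [y[t] - a_hat - b_hat * x[t] for t in range(T)]

# Auxiliary regression: residual on lagged residual; compute its R^2.
r_now, r_lag = resid[1:], resid[:-1]
a2, b2 = ols(r_lag, r_now)
fitted = [a2 + b2 * rl for rl in r_lag]
mean_r = sum(r_now) / len(r_now)
r2 = sum((f - mean_r) ** 2 for f in fitted) / \
     sum((v - mean_r) ** 2 for v in r_now)
lm_stat = len(r_now) * r2
print(lm_stat, lm_stat > 3.84)  # large statistic: reject no-autocorrelation
```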
The de-meaned error term for unit i will include the mean of the error terms for
unit i (ε̄i· ), which in turn means that 1/T of any given error term will appear in all
of unit i’s de-meaned error terms. This means that εi1 (the raw error in the first
period) is in the first de-meaned error term, the second de-meaned error term, and
so on via the ε̄i· term. The result will be at least a little autocorrelation because
the de-meaned error terms in the first and second periods, for example, will move
together at least a little bit: both contain some of the same terms.
Therefore, to test for AR(1) errors in a panel data model with fixed effects,
we need to use robust errors that account for autocorrelation.
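The mechanical point above can be checked with a small Python simulation (an illustration, not the book's code): de-meaning errors that are truly independent within a unit induces a small negative correlation between adjacent de-meaned errors, about −1/(T − 1). The panel dimensions here are arbitrary.

```python
# De-meaning i.i.d. errors within each unit induces a little negative
# autocorrelation, roughly -1/(T-1). N and T below are hypothetical.
import random

random.seed(1)
N, T = 5000, 5
pairs = []  # (demeaned e_t, demeaned e_{t-1}), pooled across units
for _ in range(N):
    e = [random.gauss(0, 1) for _ in range(T)]
    ebar = sum(e) / T
    d = [v - ebar for v in e]
    pairs.extend(zip(d[1:], d[:-1]))

def corr(xs, ys):
    """Sample correlation of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

now = [p[0] for p in pairs]
lag = [p[1] for p in pairs]
print(corr(now, lag))  # close to -1/(T-1) = -0.25 here
```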
REMEMBER THIS
To estimate panel models that account for autocorrelated errors, proceed as follows:
1. Estimate an initial model that does not address autocorrelation. This model can be either an
OLS model or a fixed effects model.
2. Use residuals from the initial model to test for autocorrelation, and apply the LM test based on
the R² from the following model:

ε̂it = ρ ε̂i,t−1 + γ1 X1it + · · · + ηit
3. If we reject the null hypothesis of no autocorrelation (which will happen when the R2 in the
equation above is high), then we should remove the autocorrelation by ρ-transforming the data
as discussed in Chapter 13.
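The ρ-transformation in step 3 can be sketched in a few lines of Python (an illustration of the Chapter 13 quasi-differencing idea, not the book's Stata/R code; the series and ρ̂ below are made up).

```python
# Sketch of the rho-transformation (quasi-differencing): given rho_hat,
# replace each observation y_t with y_t - rho_hat * y_{t-1}, then re-run
# OLS on the transformed y and X series.
def rho_transform(series, rho_hat):
    """Return y_t - rho_hat * y_{t-1}; this simple version drops the
    first observation (a Prais-Winsten step would rescale it instead)."""
    return [series[t] - rho_hat * series[t - 1] for t in range(1, len(series))]

y = [2.0, 2.5, 3.1, 2.9, 3.6]  # hypothetical series
print(rho_transform(y, 0.5))
```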
Yit = γ Yi,t−1 + β0 + β1 X1it + β2 X2it + εit (15.1)

where γ is the effect of the lagged dependent variable, the β’s are the immediate
effects of the independent variables, and εit is uncorrelated with the independent
variables and homoscedastic.
We see how tricky this model is when we try to characterize the effect of
X1it on Yit . Obviously, if X1it increases by one unit, there will be a β1 increase in
Yit that period. Notice, though, that an increase in Yit in one period affects Yit in
future periods via the γYi,t−1 term in the model. Hence, increasing X1it in the first
period, for example, will affect the value of Yit in the first period, which will then
affect Y in the next period. In other words, if we change X1it , we get not only β1
more Yit but also γ × β1 more Y in the next period, and so on. That is, change in
X1it today dribbles on to affect Y forever through the lagged dependent variable in
Equation 15.1.
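The "dribbling on" of effects can be added up explicitly: a one-period unit increase in X raises Y by β1 now, γβ1 next period, γ²β1 after that, and so on, for a cumulative effect of β1/(1 − γ). A quick Python check with made-up parameter values:

```python
# Cumulative effect of a one-period unit increase in X in a dynamic model:
# beta1 + gamma*beta1 + gamma^2*beta1 + ... = beta1 / (1 - gamma).
# The parameter values are hypothetical.
gamma, beta1 = 0.5, 2.0
effect, total = beta1, 0.0
for _ in range(200):  # each later period's effect decays by a factor gamma
    total += effect
    effect *= gamma
print(total, beta1 / (1 - gamma))  # the two agree
```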
As a practical matter, including a lagged dependent variable is a double-edged
sword. On the one hand, it is often highly significant, which is good news in that we
have a control variable that soaks up variance that’s unexplained by other variables.
On the other hand, the lagged dependent variable can be too good—so highly
significant that it sucks the significance out of the other independent variables. In
fact, if there is serial autocorrelation and trending in the independent variable,
including a lagged dependent variable causes bias. In such a case, Princeton
political scientist Chris Achen (2000, 7) has noted, the lagged dependent
variable can dominate the regression, artificially suppressing the explanatory
power of the other independent variables.
This conclusion does not mean that lagged dependent variables are evil, but
rather that we should tread carefully when we are deciding whether to include
them. In particular, we should estimate models both with them and without. If
results differ substantially, we should decide to place more weight on the model
with or without the lagged dependent variable only after we’ve run all the tests
and absorbed the logic described next.
The good news is that if the errors are not autocorrelated, using OLS for
a model with lagged dependent variables works fine. Given that the lagged
dependent variable commonly soaks up any serial dependence in the data, this
approach is reasonable and widely used.2
If the errors are autocorrelated, however, OLS will produce biased estimates
of β when a lagged dependent variable is included. In this case, autocorrelation
does more than render conventional OLS standard error estimates
inappropriate—autocorrelation in models with lagged dependent variables
actually messes up the estimates. This bias is worth mulling over a bit. It happens
because models with lagged dependent variables are outside the conventional
2. See Beck and Katz (2011).
OLS framework. Hence, even though autocorrelation does not cause bias in OLS
models, autocorrelation can cause bias in dynamic models.
Why does autocorrelation cause bias in a model when we include a lagged
dependent variable? It’s pretty easy to see: Yi,t−1 of course contains εi,t−1 . And
if εi,t−1 is correlated with εit (which is exactly what first-order autocorrelation
implies), then one of the independent variables in Equation 15.1, Yi,t−1 , will be
correlated with the error.
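The size of this bias can be startling. The Python simulation below (an illustration with made-up parameters, not the book's code) uses γ = 0.5 and error autocorrelation ρ = 0.5; OLS on the lagged dependent variable converges to roughly 0.8 rather than 0.5.

```python
# Simulation sketch: with AR(1) errors, the OLS coefficient on a lagged
# dependent variable is biased upward. True gamma and rho are both 0.5
# (hypothetical values); the OLS estimate lands near 0.8.
import random

random.seed(3)
T, gamma, rho = 50000, 0.5, 0.5
y, e = [0.0], 0.0
for _ in range(T):
    e = rho * e + random.gauss(0, 1)   # autocorrelated error
    y.append(gamma * y[-1] + e)        # dynamic model without X, for clarity

lag, now = y[:-1], y[1:]
ml, mn = sum(lag) / T, sum(now) / T
g_hat = sum((a - ml) * (b - mn) for a, b in zip(lag, now)) / \
        sum((a - ml) ** 2 for a in lag)
print(g_hat)  # well above the true gamma of 0.5
```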
This problem is not particularly hard to deal with. Suppose there is no
autocorrelation. In that case, OLS estimates are unbiased, meaning that the
residuals from the OLS model are consistent, too. We can therefore use these
residuals in an LM test like the one we described earlier (on page 519). If we
fail to reject the null hypothesis (which is quite common since lagged dependent
variables often zap autocorrelation), then OLS it is. If we reject the null hypothesis
of no autocorrelation, we can use an AR(1) model like the one discussed in Chapter
13 to rid the data of autocorrelation and thereby get us back to unbiased and
consistent estimates.
Two ways to estimate dynamic panel data models with fixed effects
What to do? One option is to follow instrumental variable (IV) logic, covered in
Chapter 9. In this context, the IV approach relies on finding some variable that
is correlated with the independent variable in question and not correlated with
the error. Most IV approaches rely on using lagged values of the independent
variables, which are typically correlated with the independent variable in question
but not correlated with the error, which happens later. The Arellano and Bond
(1991) approach, for example, uses all available lags as instruments. These models
are quite complicated and, like many IV models, imprecise.
Another option is to use OLS, accepting some bias in exchange for better
accuracy and less complexity. While we have talked a lot about bias, we have not
yet discussed the trade-off between bias and accuracy, largely because in basic
models such as OLS, unbiased models are also the most accurate, so we don’t
have to worry about the trade-off. But in more complicated models, it is possible
to have an estimator that produces coefficients that are biased but still pretty close
to the true value. It is also possible to have an estimator that is unbiased but very
imprecise. IV estimators are in the latter category—they are, on average, going to
get us the true value, but they have higher variance.
Here’s a goofy example of the trade-off between bias and accuracy. Consider
two estimators of average height in the United States. The first is the height of
a single person, randomly sampled. This estimator is unbiased—after all, the
average of this estimator will have to be the average of the whole population.
Clearly, however, this estimator isn’t very precise because it is based on a single
person. The second estimator of average height in the United States is the average
height of 500 randomly selected people, but measured with a measuring stick that
is inaccurate by 0.25 inch (making every measurement a quarter-inch too big).3
Which estimate of average height would we rather have? The second one may
well make up what it loses in bias by being more precise. That’s the situation here
because the OLS estimate is biased but more precise than the IV estimates.
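The height example can be made numeric with mean squared error, which combines variance and squared bias. The population standard deviation of roughly 4 inches below is an assumption added for illustration; it is not from the text.

```python
# Mean squared error comparison for the height example:
# MSE = variance + bias^2. Assumes a (hypothetical) population standard
# deviation of 4 inches.
sigma = 4.0
mse_one_person = sigma ** 2 / 1                   # unbiased, huge variance
mse_biased_stick = sigma ** 2 / 500 + 0.25 ** 2   # small variance + bias^2
print(mse_one_person, mse_biased_stick)
```

Under this assumption the biased 500-person estimator has a far smaller MSE, which is exactly the trade-off at work when we prefer OLS to IV here.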
Nathaniel Beck and Jonathan Katz (2011) have run a series of simulations of
several options for estimating models with lagged dependent variables and fixed
effects. They find that OLS performs better in that it’s actually more likely to
produce estimates close to the true value than the IV approach, even though OLS
estimates are a bit biased. The performance of OLS models improves relative to
the IV approach as T increases.
H. L. Mencken said that for every problem there is a solution that is simple,
neat, and wrong. Usually that’s a devastating critique. Here it is a compliment.
OLS is simple. It is neat. And yet, it is wrong in the sense of being biased when
we have a lagged dependent variable and fixed effects. But OLS is more accurate
(meaning the variance of β̂1 is smaller) than the alternatives, which nets out to a
pretty good approach.
3. Yes, yes, we could subtract the quarter-inch from all the height measurements. Work with me here. We’re trying to make a point!
REMEMBER THIS
1. Researchers often include lagged dependent variables to account for serial dependence. A
model with a lagged dependent variable is called a dynamic model.
(a) Dynamic models differ from conventional OLS models in many respects.
(b) In a dynamic model, a change in X has an immediate effect on Y, as well as an ongoing
effect on future Y’s, since any change in Y associated with a change in X will affect
future values of Y via the lagged dependent variable.
(c) If there are no fixed effects in the model and no autocorrelation, then using OLS for a
model with a lagged dependent variable will produce unbiased coefficient estimates.
(d) If there are no fixed effects in the model and there is autocorrelation, the autocorrelation
must be purged from the data before unbiased estimates can be generated.
2. OLS estimates from models with both a lagged dependent variable and fixed effects are
biased.
(a) One alternative to OLS is to use an IV approach. This approach produces unbiased
estimates, but it’s complicated and yields imprecise estimates.
(b) OLS is useful to estimate a model with a lagged dependent variable and fixed effects.
• The bias is not severe and decreases as T, the number of observations for each unit,
increases.
• OLS in this context produces relatively accurate parameter estimates.
the error term might be correlated with the independent variable; this problem
continues with random effects models, which address correlation of errors across
observations but not correlation of errors and independent variables. Hence,
random effects models fail to take advantage of a major attraction of panel data,
which is that we can deal with the possible correlation of the unit-specific effects
that might cause spurious inferences regarding the independent variables.
The Hausman test is a statistical test that pits random against fixed effects
models. Once we understand this test, we can see why the bang-for-buck payoff
with random effects models is generally pretty low. In a Hausman test, we use
the same data to estimate both a fixed effects model and a random effects model.
Under the null hypothesis that the αi ’s are uncorrelated with the X variables, the β̂
estimates should be similar. Under the alternative, the estimates should be different
because the random effects estimates should be corrupted by the correlation of the
αi ’s with the X variables and the fixed effects estimates should not.
The decision rules for a Hausman test are the following:
• If fixed effects and random effects give us pretty much the same β̂, we fail
to reject the null hypothesis and can use random effects.
• If the two approaches provide different answers, we reject the null and
should use fixed effects.
Ultimately, we believe either the fixed effects estimate (when we reject the null
hypothesis of no correlation between αi and Xi ) or pretty much the fixed effects
answer (when we fail to reject the null hypothesis of no correlation between αi
and Xi ).4
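For a common single-coefficient version of the Hausman statistic, the test compares the squared gap between the two estimates with the difference in their variances and refers the result to a χ²₁ distribution. The Python sketch below uses entirely made-up numbers to illustrate the decision rule; it is not output from any model in the book.

```python
# Sketch of a one-coefficient Hausman statistic with hypothetical numbers:
# H = (b_FE - b_RE)^2 / (var_FE - var_RE), compared to the
# chi-squared(1) critical value of 3.84.
def hausman_stat(b_fe, b_re, var_fe, var_re):
    """var_FE exceeds var_RE because random effects is more efficient
    under the null of no correlation between the alphas and the X's."""
    return (b_fe - b_re) ** 2 / (var_fe - var_re)

H = hausman_stat(b_fe=1.40, b_re=0.90, var_fe=0.04, var_re=0.01)
verdict = "reject null: use fixed effects" if H > 3.84 \
          else "fail to reject: random effects OK"
print(H, verdict)
```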
If used appropriately, random effects have some advantages. When the αi
are uncorrelated with the Xi , random effects models will generally produce
smaller standard errors on coefficients than fixed effects models. In addition, as
T gets large, the differences between fixed and random effects decline; in many
real-world data sets, however, the differences can be substantial.
REMEMBER THIS
Random effects models do not estimate fixed effects for each unit, but rather adjust standard errors
and estimates to account for unit-specific elements of the error term.
1. Random effects models produce unbiased estimates of β1 only when the αi ’s are uncorrelated
with the X variables.
2. Fixed effects models are unbiased regardless of whether the αi ’s are uncorrelated with the X
variables, making fixed effects a more generally useful approach.
4. For more details on the Hausman test, see Wooldridge (2002, 288).
Conclusion
Serial dependence in panel data models is an important and complicated
challenge. There are two major approaches to dealing with it. One is to treat
the serial dependence as autocorrelated errors. In this case, we can test for
autocorrelation and, if necessary, purge it from the data by ρ-transforming
the data.
The other approach is to estimate a dynamic model that includes a lagged
dependent variable. Dynamic models are quite different from standard OLS
models. Among other things, each independent variable has a short- and a
long-term effect on Y.
Our approach to estimating a model with a lagged dependent variable hinges
on whether there is autocorrelation and whether we include fixed effects. If there is
no autocorrelation and we do not include fixed effects, the model is easy to estimate
via OLS and produces unbiased parameter estimates. If there is autocorrelation, the
correlation of error needs to be purged via standard ρ-transformation techniques.
If we include fixed effects in a model with a lagged dependent variable,
OLS will produce biased results. However, scholars have found that the bias is
relatively small and that OLS is likely to work better than alternatives such as IV
or bias-correction approaches.
We will have a good start on understanding advanced panel data analysis when
we can answer the following questions:
• Section 15.3: What are random effects models? When are they appropriate?
Further Reading
There is a large and complicated literature on accounting for time dependence
in panel data models. Beck and Katz (2011) is an excellent guide. Among other
things, these authors discuss how to conduct an LM test for AR(1) errors in a
model without fixed effects, the bias in models with autocorrelation and lagged
dependent variables, and the bias of fixed effects models with lagged dependent
variables.
There are many other excellent resources. Wooldridge (2002) is a valuable
reference for more advanced issues in analysis of panel data. An important article
by Achen (2000) pushes for caution in the use of lagged dependent variables.
Wawro (2002) provides a nice overview of Arellano and Bond (1991) methods.
Another approach to dealing with bias in dynamic models with fixed effects
is to correct for bias directly, as suggested by Kiviet (1995). This procedure works
reasonably well in simulations, but it is quite complicated.
Key Term
Random effects model (524)
Computing Corner
Stata
1. It can be useful to figure out which variables vary within unit, as this will
determine if the variable can be included in a fixed effects model. Use
tabulate unit, summarize(X1)
which will show descriptive statistics for X1 grouped by the variable called
unit. If the standard deviation of X1 is zero for all units, there is no
within-unit variation, and for the reasons discussed in Section 8.3, this
variable cannot be included in a fixed effects model.
5. If we want to know the αi + εit portion of the error term, we type
predict ResidAE, ue
Note that Stata uses the letter u to refer to the fixed effect we denote with α in our notation.
R
1. It can be useful to figure out which variables vary within unit, as this
will determine if the variable can be included in a fixed effects model.
Use tapply(X1, unit, sd), which will show the standard deviation of
the variable X1 grouped by the variable called unit. If the standard deviation
of X1 is zero for all units, there is no within-unit variation, and for the
reasons discussed in Section 8.3, this variable cannot be included in a fixed
effects model.
(a) Make sure your data is listed by stacking units—that is, the
observations for unit 1 are first—and then ordered by time period.
Below the lines for unit 1 are observations for unit 2, ordered by
time period, and so on.6
(c) Create lag variables, one for the unit identifier and one for residuals.
Then set all the lag residuals to missing for the first observation for
each unit.
LagID = c(NA, ID[1:(length(ID)-1)])
LagResid = c(NA, Resid[1:(length(Resid)-1)])
LagResid[LagID != ID] = NA
(d) Use the variables from part (c) to estimate the model from
Chapter 13.
RhoHat = lm(Resid ~ LagResid)
6. It is possible to stack data by year. The way we’d create lagged variables would be different, though.
Exercises
1. Use the data in olympics_HW.dta on medals in the Winter Olympics from
1980 to 2014 to answer the following questions. Table 15.1 describes the
variables.
time: A time variable equal to 1 for first Olympics in data set (1980), 2 for second Olympics (1984), and so forth; useful for time series analysis.
unit for fixed effects.7 Briefly discuss the results, and explain what
is going on with the coefficients on temperature and elevation.
(b) Estimate a two-way fixed effects model with population, GDP, and
host country as independent variables. Use country and time as the
fixed effects. Explain any differences from the results in part (a).
(c) Estimate ρ̂ for the two-way fixed effects model. Is there evidence of
autocorrelation? What are the implications of your finding?
(d) Estimate a two-way fixed effects model that has population, GDP,
and host country as independent variables and accounts for autocor-
relation. Discuss any differences from results in part (b). Which is a
better statistical model? Why?
(h) Section 15.2 discusses potential bias when a fixed effects model
includes a lagged dependent variable. What is an important deter-
minant of this bias? Assess this factor for this data set.
(i) Use the concepts presented at the end of Section 13.4 to discuss
whether it is better to approach the analysis in an autocorrelation
or a lagged dependent variable framework.
(j) Use the concept of model robustness from Section 2.2 to discuss
which results are robust and which are not.
2. Answer the following questions using the Winter Olympics data described
in Table 15.1 that can be found in olympics_HW.dta
7. For simplicity, use the de-meaned approach, implemented with the xtreg command in Stata and the plm command in R.
(c) Estimate a random effects model with the same variables as in part
(b). Briefly explain the results, noting in particular what happens to
variables that have no within-unit variation.
(d) What is necessary to avoid bias for a random effects model? Do you
think this condition is satisfied in this case? Why or why not?
16 Conclusion: How to Be an Econometric Realist
The problem with this approach is that there really is no alternative to statistics and
econometrics. As baseball analyst Bill James says, the alternative to statistics is
not “no statistics.” The alternative to statistics is bad statistics. Anyone who makes
any empirical argument about the world is making a statistical argument. It might
be based on vague data that is not systematically analyzed, but that’s what people
do when they judge from experience or intuition. Hence, despite the inability of
statistics and econometrics to answer all questions or be above manipulation, a
serious effort to understand the world will involve some econometric reasoning.
A better approach is realism about econometrics. After all, in the right hands,
even chain saws are awesome. If we learn how to use the tool properly, realizing
what it can and can’t do, we can make a lot of progress.
An econometric realist is committed to robust and thoughtful evaluation of
theories. Five behaviors characterize this approach.
First, an econometric realist prioritizes. A model that explains everything is
impossible. We must simplify. And if we’re going to simplify the world, let’s do
it usefully. Statistician George Box (1976, 792) made this point wonderfully:
Since all models are wrong the scientist must be alert to what is
importantly wrong. It is inappropriate to be concerned about mice
when there are tigers abroad.
– All too often, a given theoretical claim is tested with the very data that
suggested the result. That’s not much to go on; a random or spurious
relationship in one data set does not a full-blown theory make. Hence,
we should be cautious about claims until they have been observed across
multiple contexts. With that requirement met, it is less likely that the
result is due to chance or to an analyst’s having leaned on the data to get
a desired result.
– If results are not observed across multiple contexts, are there contextual
differences? Perhaps the real finding would lie in explaining why a
relationship exists in one context and not in others.
– If other results are different, can we explain why the other results are
wrong? It is emphatically not the case that we should interpret two
competing statistical results as a draw. One result could be based on a
mistake. If that’s true, explain why (nicely, of course). If we can’t explain
why one approach is better, though, and we are left with conflicting
results, we need to be cautious about believing we have identified a real
relationship.
• Specificity: Are the patterns in the data consistent with the specific claim?
Each theory should be mined for as many specific claims as possible, not
only about direct effects but also about indirect effects and mechanisms. As
important, the theory should be mined for claims about when we won’t see
the relationship. This line of thinking allows us to conduct placebo tests in
which we should see null results. In other words, the relationship should be
observable everywhere we expect it and nowhere we don’t.
• Plausibility: Given what we know about the world, does the result make
sense? Sometimes results are implausible on their face: if someone found
that eating french fries led to weight loss, we should probably ask some
probing questions before supersizing. That doesn’t mean we should treat
implausible results as wrong. After all, the idea that the earth revolves
around the sun was pretty implausible before Copernicus. Implausible
results that happen to be true just need more evidence to overcome the
implausibility.
grammar for good analysis. It is not the story. No one reads a book and says, “Great
grammar!” A terrible book might have bad grammar, but a good book needs more
than good grammar. The material we covered in this book provides the grammar
for making convincing claims about the way the world works. The rest is up to
you. Think hard, be creative, take chances. Good luck.
Further Reading
In his 80-page paean to statistical realism, Achen (1982, 78) puts it this way: “The
uninitiated are often tempted to trust every statistical study or none. It is the task
of empirical social scientists to be wiser.” Achen followed this publication in 2002
with an often-quoted article arguing for keeping models simple.
The criteria for evaluating research discussed here are strongly influenced by
the Bradford-Hill criteria from Bradford-Hill (1965). Nevin (2013) assesses the
Bradford-Hill criteria for the theory that lead in gasoline was responsible for the
1980s crime surge in the United States (and elsewhere).
APPENDICES:
MATH AND PROBABILITY
BACKGROUND
A. Summation
• Σ_{i=1}^{N} Xi = X1 + X2 + X3 + · · · + XN
• If a variable in the summation does not have a subscript, it can be “pulled out” of
the summation. For example,
Σ_{i=1}^{N} βXi = βX1 + βX2 + βX3 + · · · + βXN
= β(X1 + X2 + X3 + · · · + XN )
= β Σ_{i=1}^{N} Xi
B. Expectation
• Expectation is the value we expect a random variable to have. The expectation is
basically the average of the random variable if we could sample from the variable’s
distribution a huge (infinite, really) number of times.
• For example, the expected value of a six-sided die is 3.5. If we roll a
die a huge number of times, we’d expect each side to come up an equal proportion
of times, so the expected average will equal the average of 1, 2, 3, 4, 5, and
6. More formally, the expected value will be Σ_{i=1}^{6} p(Xi )Xi , where Xi is 1, 2, 3, 4, 5, and
6 and p(Xi ) is the probability of each outcome, which in this example is 1/6 for each
value.
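The die calculation can be written out directly; this is just the formula from the bullet above, computed in Python for illustration:

```python
# Expected value of a fair six-sided die: sum of p(x) * x over all outcomes.
outcomes = [1, 2, 3, 4, 5, 6]
p = 1 / 6  # each outcome is equally likely
expected = sum(p * x for x in outcomes)
print(expected)  # matches the 3.5 in the text (up to floating-point rounding)
```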
C. Variance
The variance of a random variable is a measure of how spread out the distribution
is. In a large sample, the variance can be estimated as

var(X) = (1/N) Σ_{i=1}^{N} (Xi − X̄)²
Here are some useful properties of variance (the “variance facts” cited in
Chapter 14):
var(kε) = k² var(ε)
= k² σ²
3. When random variables are correlated, the variance of a sum (or differ-
ence) of random variables depends on the variances and covariance of the
variables. Letting ε and τ be random variables:

var(ε + τ ) = var(ε) + var(τ ) + 2cov(ε, τ )
var(ε − τ ) = var(ε) + var(τ ) − 2cov(ε, τ )
D. Covariance
• Covariance measures how much two random variables vary together. In large
samples, the covariance of two variables is
cov(X1 , X2 ) = [Σ_{i=1}^{N} (X1i − X̄1 )(X2i − X̄2 )] / N (A.1)
• As with variance, several useful properties apply when we are dealing with
covariance:
E. Correlation
The equation for correlation is

corr(X, Y) = cov(X, Y) / (σX σY )
where σX is the standard deviation of X and σY is the standard deviation of Y.
If X = Y for all observations, cov(X, Y) = cov(X, X) = var(X) and σX = σY ,
implying that the denominator will be σX², which is the variance of X. These
calculations therefore imply that the correlation for X = Y will be +1, which is
the upper bound for correlations.1 For perfect negative correlation, X = −Y and
correlation is −1.
The equation for covariance (Equation A.1) looks a bit like the equation for
the slope coefficient in bivariate regression on page 49 in Chapter 3. The bivariate
regression coefficient is simply a restandardized correlation:

β̂1^{BivariateOLS} = corr(X, Y) × (σY / σX )
1. We also get perfect correlation if the variables are identical once normalized. That is, X and Y are perfectly correlated if X = 10Y or if X = 5 + 3Y, and so forth. In these cases, (Xi − X̄)/σX = (Yi − Ȳ)/σY for all observations.
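The identities in sections C through E can be verified numerically. The Python sketch below (illustrative only; the data points are made up) uses population (divide-by-N) formulas and confirms that corr(X, X) = 1 and that the bivariate OLS slope equals corr(X, Y) × σY/σX.

```python
# Numeric check of the variance/covariance/correlation identities, using
# population (divide-by-N) formulas and hypothetical data.
def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def sd(a):
    return cov(a, a) ** 0.5  # standard deviation = sqrt of variance

def corr(a, b):
    return cov(a, b) / (sd(a) * sd(b))

X = [1.0, 2.0, 4.0, 7.0]   # made-up data
Y = [2.0, 3.0, 9.0, 12.0]
slope = cov(X, Y) / cov(X, X)  # bivariate OLS slope = cov(X, Y) / var(X)
print(corr(X, X), slope, corr(X, Y) * sd(Y) / sd(X))
```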
would exceed 1 because there are always more possible values very near to any
given value. Instead, we need to think in terms of probabilities that the random
variable is in some (possibly small) region of values. Hence, we need the tools
from calculus to calculate probabilities from a PDF.
Figure A.1 shows the PDF for an example of a random variable. Although
we cannot use the PDF to simply calculate the probability the random variable
equals, say, 1.5, it is possible to calculate the probability that the random
variable is between 1.5 and any other value. The figure highlights the area
under the PDF curve between 1.5 and 1.8. This area corresponds to the
probability this random variable is between 1.5 and 1.8. In Appendix G, we
show example calculations of such probabilities based on PDFs from the normal
distribution.2
[FIGURE A.1: The PDF of an example random variable, with the area under the curve between 1.5 and 1.8 shaded. Axes: value of x (horizontal) and probability density (vertical, 0 to 0.75).]
2 More formally, we can indicate a PDF as a function, f(x), that is greater than zero for all values of x. Because the total area under the curve defined by the PDF equals one, we know that ∫_{−∞}^{∞} f(x)dx = 1. The probability that the random variable x is between a and b is ∫_a^b f(x)dx = F(b) − F(a), where F() is the integral of f().
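The footnote's integral can be illustrated numerically. Since the figure's PDF is not specified, the sketch below uses the standard normal density as a stand-in and checks a midpoint Riemann sum over [1.5, 1.8] against the exact CDF difference F(b) − F(a):

```python
import math

def normal_pdf(x):
    """Standard normal density f(x)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def normal_cdf(x):
    """Standard normal CDF F(x), via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

a, b, steps = 1.5, 1.8, 10_000
width = (b - a) / steps

# Riemann sum: add up f(x) * width over many thin slices of [a, b].
area = sum(normal_pdf(a + (i + 0.5) * width) * width for i in range(steps))

exact = normal_cdf(b) - normal_cdf(a)   # F(b) - F(a)
print(round(area, 4), round(exact, 4))  # both about 0.0309
```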
G. Normal Distributions
standard normal distribution: A normal distribution with a mean of zero and a variance (and standard error) of one.

We work a lot with the standard normal distribution. (Only to us stats geeks does "standard normal" not seem repetitive.) A normal distribution is a specific (and famous) type of PDF, and a standard normal distribution is a normal distribution with a mean of zero and a variance of one. The standard deviation of a standard normal distribution is also one, because the standard deviation is the square root of the variance.
One important use of the standard normal distribution is to calculate
probabilities of observing standard normal random variables that are less than or
equal to some number. We denote by Φ(Z) = Pr(X < Z) the probability that a standard normal random variable X is less than Z. This is known as the
cumulative distribution function (CDF) because it indicates the probability of
seeing a random variable less than some value. It simply expresses the area under
a PDF curve to the left of some value.
Figure A.2 shows four examples of the use of the CDF for standard normal PDFs.

[FIGURE A.2: Probabilities that a Standard Normal Random Variable Is Less than Some Value. Four panels of the standard normal PDF with shaded left-tail areas: (a) Φ(0) = Pr(X < 0) = 0.500; (b) Φ(−2) = Pr(X < −2) = 0.023; (c) and (d) shade the areas to the left of 1.96 and 1, respectively.]

Panel (a) shows Φ(0), which is the probability that a standard normal
random variable will be less than zero. It is the area under the PDF to the left of zero. We can see that it is half the total area: the area to the left of zero is 0.500, so the probability of observing a value of a standard normal random variable that is less than zero is 0.500. Panel (b) shows Φ(−2), which is
the probability that a standard normal random variable will be less than –2. It is
the proportion of the total area that is left of –2, which is 0.023. Panel (c) shows
Φ(1.96), which is the probability that a standard normal random variable will be
less than 1.96. It is 0.975. Panel (d) shows Φ(1), which is the probability that a
standard normal random variable will be less than 1. It is 0.841.
We can also use our knowledge of the standard normal distribution to calculate
the probability that β̂1 is greater than some value. The trick here is to recall that
if the probability of something happening is P, then the probability of its not
happening is 1 − P. This property tells us that if there is a 15 percent chance of
rain, then there is an 85 percent probability of no rain.
To calculate the probability that a standard normal variable is greater than
some value Z, use 1 − Φ(Z). Figure A.3 shows four examples.

[FIGURE A.3: Probabilities that a Standard Normal Random Variable Is Greater than Some Value. Four panels of the standard normal PDF with shaded right-tail areas: (a) 1 − Φ(0) = Pr(X > 0) = 0.500; (b) 1 − Φ(−2) = Pr(X > −2) = 0.977; (c) 1 − Φ(1.96) = Pr(X > 1.96) = 0.025; (d) 1 − Φ(1.0) = Pr(X > 1.0) = 0.159.]

Panel (a) shows
1 − Φ(0), which is the probability that a standard normal random variable will be
greater than zero. This probability is 0.500. Panel (b) highlights 1 − Φ(−2), which
is the probability that a standard normal random variable will be greater than –2. It
is 0.977. Panel (c) shows 1 − Φ(1.96), which is the probability that a standard normal random variable will be greater than 1.96. It is 0.025. Panel (d) shows 1 − Φ(1), which is the probability that a standard normal random variable will be greater than 1. It is 0.159.
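The probabilities above can be computed directly rather than read off a figure. A minimal Python sketch using the identity Φ(z) = (1 + erf(z/√2))/2, which involves only the standard library:

```python
import math

def phi(z):
    """Standard normal CDF: Pr(X < z) for standard normal X."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Left-tail probabilities (Figure A.2).
print(round(phi(0), 3))      # 0.5
print(round(phi(-2), 3))     # 0.023
print(round(phi(1.96), 3))   # 0.975
print(round(phi(1), 3))      # 0.841

# Right-tail probabilities (Figure A.3): Pr(X > z) = 1 - phi(z).
print(round(1 - phi(-2), 3))    # 0.977
print(round(1 - phi(1.96), 3))  # 0.025
print(round(1 - phi(1), 3))     # 0.159
```

With scipy installed, scipy.stats.norm.cdf does the same job; the erf identity keeps the sketch dependency-free.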
Figure A.4 shows some key information about the standard normal distri-
bution. In the left-hand column of the figure’s table are some numbers, and in
the right-hand column are the corresponding probabilities that a standard normal
random variable will be less than the respective numbers. There is, for example, a
0.010 probability that a standard normal random variable will be less than –2.32.
[FIGURE A.4: Key values of the standard normal distribution. The table pairs SD, the number of standard deviations above or below the mean β1, with the probability that β̂1 ≤ SD:

SD          Probability β̂1 ≤ SD
−3.00       0.001
−2.58       0.005
−2.32       0.010
−2.00       0.023
−1.96       0.025
−1.64       0.050
…
0.00        0.500
…
1.96        0.975

Panel (a) shades the area of the standard normal PDF to the left of −2.32, which corresponds to the 0.010 probability in the table.]
We can see this graphically in panel (a). In the top bell-shaped curve, the portion
that is to the left of –2.32 is shaded. It is about one percent.
Because the standard deviation of a standard normal is 1, all the numbers in
the left-hand column can be considered as the number of standard deviations above
or below the mean. That is, the number −1 refers to a point that is a single standard
deviation below the mean, and the number +3 refers to a point that is 3 standard
deviations above the mean.
The third row of the table in Figure A.4 shows there is a probability of 0.010
that we’ll observe a value less than –2.32 standard deviations below the mean.
Going down to the shaded row SD = 0.00, we see that if β̂1 is standard normally
distributed, it has a 0.500 probability of being below zero. This probability is
intuitive: the normal distribution is symmetric, and we have the same chance of
seeing something above its mean as below it. Panel (b) shows this graphically.
In the last shaded row, where SD = 1.96, we see that there is a 0.975
probability that a standard normal random variable will be less than 1.96. Panel
(c) in Figure A.4 shows this graphically, with 97.5 percent of the standard
normal distribution shaded. We see this value a lot in statistics because twice
the probability of being greater than 1.96 is 0.05, which is a commonly used
significance level for hypothesis testing.
We can convert any normally distributed random variable to a standard
normally distributed random variable. This process, known as standardizing
values, is pretty easy. This trick is valuable because it allows us to use the intuition
and content of Figure A.4 to work with any normal distribution, whatever its mean
and standard deviation.
For example, suppose we have a normal random variable with a mean of 10
and a standard deviation of 1 and we want to know the probability of observing
a value less than 8. From common sense, we realize that in this case 8 is 2
standard deviations below the mean. Hence, we can use Figure A.4 to see that
the probability of observing a value less than 8 from a normal distribution with
mean 10 and standard deviation of 1 is 0.023; accordingly, the fourth row of the
table shows that the probability a standard normal random variable is less than –2
is 0.023.
How did we get there? First, subtract the mean from the value in question
to see how far it is from the mean. Then divide this quantity by the standard
deviation to calculate how many standard deviations away from the mean it is.
More generally, for any given number B drawn from a distribution with mean β1
and standard deviation se( β̂1 ), we can calculate the number of standard deviations
B is away from the mean via the following equation:
Standard deviations from mean = (B − β1) / se(β̂1)    (A.2)
Notice in Equation A.2 that the β1 has no hat but se( β̂1 ) does. Seems odd,
doesn’t it? There is a logic to it, though. We’ll be working a lot with hypothetical
values of β1 , asking, for example, what the probability β̂1 is greater than some
number would be if the true β1 were zero. But since we’ll want to work with the
precision implied by our actual data, we’ll use se( β̂1 ).3
To get comfortable with converting the distribution of β̂1 to the standard
normal distribution, consider the examples in Table A.1. In the first example
(the first two rows), β1 is 0 and the standard error of β̂1 is 3. Recall that the
standard error of β̂1 measures the width of the β̂1 distribution. In this case, 3
is one standard deviation above the mean, and 1 is 0.33 standard deviation above
the mean.
In the third and fourth rows of Table A.1, β1 = 4 and the standard deviation is
3. In this case, 7 is one standard deviation above the mean, and 1 is one standard
deviation below the mean. In the bottom portion of the table (the last two rows), β1
is 8 and the standard deviation of β̂1 is 2. In this case, 6 is one standard deviation
below the mean, and 1 is 3.5 standard deviations below the mean.
To calculate Φ(Z), we use a table such as the one in Figure A.4 or, more
likely, computer software as discussed in the Computing Corner at the end of the
appendices.
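Putting the two steps together, a minimal Python sketch of standardize-then-Φ (using erf in place of the table) reproduces the mean-10, standard-deviation-1 example:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def prob_below(value, mean, sd):
    """Pr(X < value) for X ~ Normal(mean, sd): standardize, then apply phi."""
    z = (value - mean) / sd  # number of standard deviations from the mean
    return phi(z)

# The worked example: Normal(mean=10, sd=1), probability of a value below 8.
print(round(prob_below(8, mean=10, sd=1), 3))  # 0.023, matching the -2 row
```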
3
Another thing that can be hard to get used to is the mixing of standard deviation and standard error.
Standard deviation measures the variability of a distribution, and in the case of the distribution of β̂1 ,
its standard deviation is the se( β̂1 ). The distinction between standard deviation and standard error
seems larger when calculating the mean of a variable. The standard deviation of X indicates the
variability of X, while the standard error of a sample mean indicates the variability of the estimate of
the mean. The standard error of the mean depends on the sample size while the standard deviation of
X is only a measure of the variability of X. Happily, this distinction tends not to be a problem in
regression.
REMEMBER THIS
1. A standard normal distribution is a normal distribution with a mean of zero and a standard
deviation of one.
(a) Any normally distributed random variable can be converted to a variable distributed
according to a standard normal distribution.
(b) If β̂1 is distributed normally with mean β1 and standard deviation se(β̂1), then (β̂1 − β1)/se(β̂1) will be distributed as a standard normal random variable.
(c) Converting random variables to standard normal random variables allows us to use
standard normal tables to discuss any normal distribution.
2. To calculate the probability β̂1 ≤ B, where B is any number of interest, do the following:
(a) Convert B to the number of standard deviations above or below the mean using (B − β1)/se(β̂1).
(b) Use the table in Figure A.4 or software to calculate the probability that β̂1 is less than B
in standardized terms.
3. To calculate the probability that β̂1 > B, use the fact that the probability β̂1 is greater than B is
1 minus the probability that β̂1 is less than or equal to B.
Review Questions
1. What is the probability that a standard normal random variable is less than or equal
to 1.64?
2. What is the probability that a standard normal random variable is less than or equal
to –1.28?
3. What is the probability that a standard normal random variable is greater than 1.28?
4. What is the probability that a normal random variable with a mean of zero and a standard
deviation of 2 is less than –4?
5. What is the probability that a normal random variable with a mean of zero and a variance of 9
is less than –3?
6. Approximately what is the probability that a normal random variable with a mean of 7.2 and a
variance of 4 is less than 9?
The χ² distribution

χ² distribution: A probability distribution that characterizes the distribution of squared standard normal random variables.

The χ² distribution (pronounced "kai-squared") describes the distribution of squared normal variables. The distribution of a squared standard normal random variable is a χ² distribution with one degree of freedom. The sum of n independent squared standard normal random variables is distributed according to a χ² distribution with n degrees of freedom.
The χ² distribution arises in many different statistical contexts. We'll show that it is a component of the all-important t distribution. The χ² distribution also arises when we conduct likelihood ratio tests for maximum likelihood estimation models.

The shape of the χ² distribution varies according to the degrees of freedom. Figure A.5 shows two examples of χ² distributions. Panel (a) shows a χ² distribution with 2 degrees of freedom. We have highlighted the most extreme 5 percent of the distribution, which demonstrates that the critical value from a χ²(2) distribution is roughly 6. Panel (b) shows a χ² distribution with 4 degrees of freedom. The critical value from a χ²(4) distribution is around 9.5.
The Computing Corner in Chapter 12 (pages 446 and 448) shows how to identify critical values from a χ² distribution. Software will often, but not always, provide critical values for us automatically.
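The critical values quoted for Figure A.5 can be recovered without a table. For even degrees of freedom the χ² CDF has a closed form, so a short bisection finds the 95th percentile; a minimal Python sketch:

```python
import math

# Chi-squared CDFs have closed forms for even degrees of freedom:
#   df = 2: F(x) = 1 - exp(-x/2)
#   df = 4: F(x) = 1 - exp(-x/2) * (1 + x/2)
def chi2_cdf_df2(x):
    return 1 - math.exp(-x / 2)

def chi2_cdf_df4(x):
    return 1 - math.exp(-x / 2) * (1 + x / 2)

def critical_value(cdf, p=0.95, lo=0.0, hi=50.0):
    """Bisection: find x with cdf(x) = p."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(critical_value(chi2_cdf_df2), 2))  # 5.99 -- "roughly 6" in the text
print(round(critical_value(chi2_cdf_df4), 2))  # 9.49 -- "around 9.5"
```

For odd degrees of freedom there is no elementary closed form; statistical software (or scipy.stats.chi2.ppf) is the practical route.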
The t distribution
The t distribution characterizes the distribution of the ratio of a normal random variable and the square root of a χ² random variable divided by its degrees of freedom. While such a ratio may seem to be a pretty obscure combination of things to worry about, we've seen in Section 4.2 that the t distribution is incredibly useful.

We know that our OLS coefficients (among other estimators) are normally distributed. We also know (although we talk about this less) that the estimates of the standard errors are distributed according to a χ² distribution. Since we need to standardize our OLS coefficients by dividing by our standard error estimates, we want to know the distribution of the ratio of the coefficient divided by the standard error.

Formally, if z is a standard normal random variable and x is a χ² variable with n degrees of freedom, the following represents a t distribution with n degrees of freedom:

t(n) = z / √(x/n)
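This construction is easy to simulate. A sketch (simulation-based and seeded, so the tail shares are approximate) that builds t(5) draws as z/√(x/5) and compares tail behavior with the standard normal:

```python
import random

random.seed(42)

def chi2_draw(df):
    """A chi-squared draw: the sum of df squared standard normal draws."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(df))

def t_draw(df):
    """A t(df) draw: z / sqrt(x / df), with z standard normal, x chi-squared."""
    z = random.gauss(0, 1)
    x = chi2_draw(df)
    return z / (x / df) ** 0.5

draws = 100_000
t_tail = sum(abs(t_draw(5)) > 1.96 for _ in range(draws)) / draws
z_tail = sum(abs(random.gauss(0, 1)) > 1.96 for _ in range(draws)) / draws

# The t(5) distribution has fatter tails than the standard normal: roughly
# 11 percent of t(5) draws exceed 1.96 in absolute value, versus about
# 5 percent of standard normal draws.
print(t_tail, z_tail)
```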
[FIGURE A.5: Two χ² distributions. Panel (a): the χ²(2) distribution over values 0 to 10, with its most extreme 5 percent shaded. Panel (b): the χ²(4) distribution, shaded beyond the 9.49 tick mark. Axes: value of x (horizontal) and probability density (vertical).]
The F distribution
F distribution: A probability distribution that characterizes the distribution of a ratio of two χ² random variables.

The F distribution characterizes the distribution of a ratio of two χ² random variables divided by their degrees of freedom. The distribution is named in honor of legendary statistician R. A. Fisher.

Formally, if x1 and x2 are independent χ² random variables with n1 and n2 degrees of freedom, respectively, the following represents an F distribution with n1 and n2 degrees of freedom:

F(n1, n2) = (x1/n1) / (x2/n2)
Since χ² variables are positive, a ratio of two of them must be positive as well, meaning that random variables following F distributions are greater than or equal to zero.
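The construction can be simulated from first principles. A sketch (simulation-based, so the mean is approximate) that builds F(9, 10) draws from sums of squared standard normal draws, confirms every draw is nonnegative, and checks the mean against the known value n2/(n2 − 2):

```python
import random

random.seed(0)

def chi2_draw(df):
    """Sum of df squared standard normal draws."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(df))

def f_draw(n1, n2):
    """An F(n1, n2) draw: the ratio of two scaled chi-squared draws."""
    return (chi2_draw(n1) / n1) / (chi2_draw(n2) / n2)

draws = [f_draw(9, 10) for _ in range(50_000)]

# Every draw is nonnegative, since it is a ratio of sums of squares.
print(min(draws) >= 0)  # True

# For n2 > 2 the mean of an F(n1, n2) variable is n2 / (n2 - 2); here 10/8 = 1.25.
print(round(sum(draws) / len(draws), 2))
```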
An interesting feature of the F distribution is that the square of a t-distributed variable with n degrees of freedom follows an F(1, n) distribution. To see this, note that a t-distributed variable is a normal random variable divided by the square root of a χ² random variable. Squaring the t-distributed variable gives us a squared normal in the numerator, which is χ², and a χ² in the denominator. In other words, this gives us the ratio of two χ² random variables, which follows an F distribution. We used this fact when noting on page 312 that in certain cases we can square a t statistic to produce an F statistic that can be compared to a rule of thumb about F statistics in the first stage of 2SLS analyses.
We use the F distribution when doing F tests, which, among other things, allow us to test hypotheses involving multiple parameters. We discussed F tests
in Section 5.6.
The F distribution depends on two degrees of freedom parameters. In the F
test examples, the degrees of freedom for the test statistic depend on the number
of restrictions on the parameters and the sample size. The order of the degrees of
freedom is important and is explained in our discussion of F tests.
The F distribution does not have an easily identifiable shape like the normal
and t distributions. Instead, its shape changes rather dramatically, depending on
the degrees of freedom. Figure A.6 plots four examples of F distributions, each
with different degrees of freedom. For each panel we highlight the extreme 5
percent of the distribution, providing a sense of the values necessary to reject the
null hypotheses for each case. Panel (a) shows an F distribution with degrees of
freedom equal to 3 and 2,000. This would be the distribution of an F statistic if
we were testing a null hypothesis that β1 = β2 = β3 = 0 based on a data set with
2,010 observations and 10 parameters to be estimated. The critical value is 2.61,
meaning that an F test statistic greater than 2.61 would lead us to reject the null
hypothesis. Panel (b) displays an F distribution with degrees of freedom equal to
18 and 300, and so on.
The Computing Corner in Chapter 5 on pages 170 and 172 shows how to
identify critical values from an F distribution. Often, but not always, software will
automatically provide critical values.
I. Sampling
Section 3.2 discussed two sources of variation in our estimates: sampling random-
ness and modeled randomness. Here we elaborate on sampling randomness.
[FIGURE A.6: Four F distributions, each with its most extreme 5 percent shaded. Panel (a): F(3, 2,000), shaded beyond 2.61. Panel (b): F(18, 300), shaded beyond 1.64. Panel (c): F(2, 100), shaded beyond 3.09. Panel (d): F(9, 10), shaded beyond 3.02. Axes: value of x (horizontal) and probability density (vertical).]
Imagine that we are trying to figure out some feature of a given population.
For example, suppose we are trying to ascertain the average age of everyone in
the world at a given time. If we had (accurate) data from every single person,
we’d be done. Obviously, that’s not going to happen, so we take a random sample.
Since this random sample will not contain every single person, the average age of
people from it probably will not exactly match the population average. And if we
were to take another random sample, it’s likely that we’d get a different average
because we’d have different people in our sample. Maybe the first time our sample
contained more babies than usual, and the second time we got the world’s oldest
living person.
The genius of the sampling perspective is that we can characterize the degree
of randomness we should observe in our random sample. The variation will depend
on the sample size we observe and on the underlying variation in the population.
A useful exercise is to take some population, say the students in your
econometrics class, and gather information about every person in the population
for some variable. Then, if we draw random samples from this population, we will
see that the mean of the variable in the sampled group will bounce around for each
random sample we draw. The amazing thing about statistics is that we will be able
to say certain things about the mean of the averages we get across the random
samples and the variance of the averages. If the sample size is large, we will be
able to approximate the distribution of these averages with a normal distribution
having a variance we can calculate based on the sample size and the underlying
variance in the overall population.
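The classroom exercise described above is easy to simulate. A sketch (with an artificial population of ages 0 through 79; the seed and sizes are arbitrary) showing that sample means bounce around the population mean with variance close to the population variance divided by the sample size:

```python
import random

random.seed(1)

# An artificial "population": one person at each age 0 through 79.
population = list(range(80))
pop_mean = sum(population) / len(population)  # 39.5
pop_var = sum((a - pop_mean) ** 2 for a in population) / len(population)

# Draw many random samples and record each sample's mean.
n, reps = 25, 2000
sample_means = [sum(random.choices(population, k=n)) / n for _ in range(reps)]

mean_of_means = sum(sample_means) / reps
var_of_means = sum((m - mean_of_means) ** 2 for m in sample_means) / reps

# The sample means center on the population mean, with variance close to
# pop_var / n (here 533.25 / 25 = 21.33).
print(round(mean_of_means, 1), round(var_of_means, 1))
```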
This logic applies to regression coefficients as well. Hence, if we want to know
the relationship between age and wealth in the whole world, we can draw a random
sample and know that we will have variation related to the fact that we observe
only a subset of the target population. And recall from Section 6.1 that OLS easily
estimates means and difference of means, so even our average-age example works
in an OLS context.
It may be tempting to think of statistical analysis only in terms of sampling
variation, but this is not very practical. First, it is not uncommon to observe an
entire population. For example, if we want to know the relationship between
education and wages in European countries from 2000 to 2014, we could probably
come up with data for each country and year in our target population. And yet, we
would be naive to believe that there is no uncertainty in our estimates. Hence,
there is almost always another source of randomness, something we referred to as
modeled randomness in Section 3.2.
Second, the sampling paradigm requires that the samples from the underlying
target population be random. If the sampling is not random, the type of
observations that make their way into our analysis may systematically differ from
the people or units that we do not observe, thus causing us to risk introducing
endogeneity. A classic example is observing the wages of women who work, but
this subsample is unlikely to be a random sample from all women. The women who
work are likely more ambitious, more financially dependent on working, or both.
Even public opinion polling data, a presumed bastion of random sampling,
seldom provides random samples from underlying populations. Commercial
polls often have response rates of less than 20 percent, and even academic
surveys struggle to get response rates near 50 percent. It is reasonable to
believe that the people who respond differ in economic, social, and personality traits, and thus simply attributing variation to sampling variation may be
problematic.
So even though sampling variation is incredibly useful as an idealized source
of randomness in our coefficient estimates, we should not limit ourselves to
Further Reading
Rice (2007) is an excellent guide to probability theory as used in statistical
analysis.
Key Terms
F distribution (550)
Probability density function (541)
Standard normal distribution (543)
χ² distribution (549)
Computing Corner
Excel
Sometimes Excel offers the quickest way to calculate quantities of interest related
to the normal distribution.
• There are several ways to find the probability a standard normal is less than
some value.
1. Use the NORMSDIST function, which assumes a standard normal and takes only the value of interest: =NORMSDIST(2) returns the probability that a standard normal variable is less than 2.
2. Use the NORMDIST function and indicate the mean and the standard deviation, which for a standard normal are 0 and 1, respectively. Use a 1 after the last comma to produce the cumulative probability, which is the percent of the distribution to the left of the number indicated: =NORMDIST(2, 0, 1, 1).
• For a non-standard normal variable, use the NORMDIST function and indicate
the mean and the standard deviation. For example, if the mean is 9 and the
standard deviation is 3.2, the probability that this distribution will yield a
random variable less than 7 is =NORMDIST(7, 9, 3.2, 1).
Stata

• To calculate the probability that a standard normal is less than some value in Stata, use the normal command. For example, display normal(2) will return the probability that a standard normal variable is less than 2.

R

• To calculate the probability that a standard normal is less than some value in R, use the pnorm command. For example, pnorm(2, mean = 0, sd = 1) (or simply pnorm(2), since those are the defaults) will return the probability that a standard normal variable is less than 2.
Chapter 1
• Page 3 Gary Burtless (1995, 65) provides the initial motivation for this
example—he used Twinkies.
Chapter 3
• Page 45 Sides and Vavreck (2013) provide a great look at how theory can help cut
through some of the overly dramatic pundit-speak on elections.
• Page 57 For a discussion of the central limit theorem and its connection to the
normality of OLS coefficient estimates, see, for example, Lumley et al. (2002).
They note that for errors that are themselves nearly normal or do not have severe
outliers, 80 or so observations are usually enough.
• Page 67 Stock and Watson (2011, 674) present examples of estimators that
highlight the differences between bias and inconsistency. The estimators are silly,
but they make the authors’ point.
– Suppose we tried to estimate the mean of a variable with the first observation
in a sample. This will be unbiased because in expectation it will be equal to
the average of the population. Recall that expectation can be thought of as the
average value we would get for an estimator if we ran an experiment over and
over again. This estimator will not be consistent, though, because no matter
how many observations we have, we’re using only the first observation,
which means that the variance of the estimator will not get smaller as the
sample size gets very large. So yes, no one in their right mind would use this
estimator; it is unbiased but nonetheless inconsistent.
CITATIONS AND ADDITIONAL NOTES 557
– Suppose we tried to estimate the mean of a variable with the sample mean plus 1/N. This will be biased because the expectation of this estimator will be the population average plus 1/N. However, this estimator will be consistent because the variance of a sample mean goes down as the sample size increases, and the 1/N bit will go to zero as the sample size goes to infinity. Again, this is a nutty estimator that no one would use in practice, but it shows how it is possible for an estimator that is biased to be consistent.
Chapter 4
• Page 91 For a report on the Pasteur example, see Manzi (2012, 73) and
http://pyramid.spd.louisville.edu/∼eri/fos/Pasteur_Pouilly-le-fort.pdf.
• Page 109 The medical example is from Wilson and Butler (2007, 105).
Chapter 5
• Page 138 In Chapter 14, we show on page 497 that the bias term in a simplified example for a model with no constant is E[Σ Xi εi / Σ Xi²]. For the more standard case that includes a constant in the model, the bias term is E[Σ (Xi − X̄)εi / Σ (Xi − X̄)²], which is the covariance of X and ε divided by the variance of X. See Greene (2003, 148) for a generalization of the omitted variable bias formula for any number of included and excluded variables.
• Page 153 Harvey’s analysis uses other variables, including a measure of how
ethnically and linguistically divided countries are and a measure of distance
from the equator (which is often used in the literature to capture a historical
pattern that countries close to the equator have tended to have weaker political
institutions).
Chapter 6
• Page 181 To formally show that the OLS β̂1 and β̂0 estimates are functions of the means of the treated and untreated groups requires a bit of a slog through some algebra.

From page 49, we know that the bivariate OLS equation for the slope is

β̂1 = Σ_{i=1}^N (Ti − T̄)(Yi − Ȳ) / Σ_{i=1}^N (Ti − T̄)²

where we use Ti to indicate that our independent variable is the treatment variable.

The (1 − p) in the numerator and denominator of the first and second terms cancel out. Note also that the sum of Ȳ for the observations where Ti = 1 equals NT Ȳ, allowing us to express the OLS estimate of β̂1 as

β̂1 = (Σ_{Ti=1} Yi)/NT − Ȳ − (p Σ_{Ti=0} (Yi − Ȳ)) / (NT (1 − p))

We're almost there. Now note that p/(NT (1 − p)) in the third term can be written as 1/NC, where NC is the number of observations in the control group (for whom Ti = 0).2

We denote the average of the treated group, (Σ_{Ti=1} Yi)/NT, as ȲT and the average of the control group, (Σ_{Ti=0} Yi)/NC, as ȲC. We can rewrite our equation as

β̂1 = ȲT − Ȳ − (Σ_{Ti=0} Yi)/NC + (Σ_{Ti=0} Ȳ)/NC

Using the fact that Σ_{Ti=0} Ȳ = NC Ȳ, we can cancel some terms and (finally!) get our result:

β̂1 = ȲT − ȲC

To show that β̂0 is ȲC, use Equation 3.5 from page 50, noting that Ȳ = (ȲT NT + ȲC NC)/N.

2 To see this, rewrite Σ_{i=1}^N (Ti − p)² as Σ_{i=1}^N Ti² − 2p Σ_{i=1}^N Ti + Σ_{i=1}^N p². Note that both Σ_{i=1}^N Ti² and Σ_{i=1}^N Ti equal NT because the squared value of a dummy variable is equal to itself and because the sum of a dummy variable is equal to the number of observations for which Ti = 1. We also use the facts that Σ_{i=1}^N p² = Np² and p = NT/N, which allow us to write the denominator as NT − 2NT²/N + NT²/N.
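The algebraic result can be checked numerically. A minimal Python sketch with a made-up six-observation data set: the page-49 slope formula reproduces ȲT − ȲC exactly, and the intercept equals ȲC:

```python
# Toy data: three control (T=0) and three treated (T=1) observations.
T = [0, 0, 0, 1, 1, 1]
Y = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
N = len(T)

t_bar = sum(T) / N
y_bar = sum(Y) / N

# Bivariate OLS slope from the page-49 formula.
slope = sum((t - t_bar) * (y - y_bar) for t, y in zip(T, Y)) / sum(
    (t - t_bar) ** 2 for t in T
)

# Group means.
y_treated = sum(y for t, y in zip(T, Y) if t == 1) / T.count(1)
y_control = sum(y for t, y in zip(T, Y) if t == 0) / T.count(0)

intercept = y_bar - slope * t_bar

print(slope, y_treated - y_control)  # 6.0 6.0
print(intercept, y_control)          # 2.0 2.0
```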
• Page 183 Discussions of non-OLS difference of means tests sometimes get bogged
down in whether the variance is the same across the treatment and control groups.
If the variance varies across treatment and control groups, we should adjust our
analysis according to the heteroscedasticity that will be present.
• Page 194 These data are from Persico, Postlewaite, and Silverman (2004). Results are broadly similar even when we exclude outliers with very high salaries.
• Page 205 See Kam and Franzese (2007, 48) for the derivation of the variance of estimated effects. The variance of β̂1 + Di β̂3 is var(β̂1) + Di² var(β̂3) + 2Di cov(β̂1, β̂3), where cov is the covariance of β̂1 and β̂3 (see variance fact 3 on page 540).
– In Stata, we can display cov( β̂1 , β̂3 ) with the following commands:
regress Y X1 D X1D
matrix V = get(VCE)
disp V[3,1]
For more details, see Kam and Franzese (2007, 136–146).
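The variance formula above is simple arithmetic once the variance–covariance estimates are in hand. A minimal Python sketch with made-up values for var(β̂1), var(β̂3), and cov(β̂1, β̂3) (hypothetical numbers, not output from any actual regression):

```python
import math

def effect_variance(var_b1, var_b3, cov_b1_b3, D):
    """var(b1_hat + D * b3_hat) = var(b1) + D^2 var(b3) + 2 D cov(b1, b3)."""
    return var_b1 + D ** 2 * var_b3 + 2 * D * cov_b1_b3

# Hypothetical regression output (made-up numbers for illustration).
var_b1, var_b3, cov_b1_b3 = 0.040, 0.010, -0.005

# Variance and standard error of the estimated effect at several values of D.
for D in [0, 1, 2]:
    v = effect_variance(var_b1, var_b3, cov_b1_b3, D)
    print(D, round(v, 3), round(math.sqrt(v), 3))
```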
Chapter 7
• Page 223 The data on life expectancy and GDP per capita are from the World
Bank’s World Development Indicators database available at http://data.worldbank
.org/indicator/.
• Page 228 Temperature data is from National Aeronautics and Space Administra-
tion (2012).
Y = e^{β0} e^{β1X} e^{ε}

If we use the fact that log(e^A e^B e^C) = A + B + C and log both sides, we get the log-linear formulation:

lnY = β0 + β1X + ε

dY/dX = e^{β0} β1 e^{β1X} e^{ε}
Chapter 8
• Page 263 See Bailey, Strezhnev, and Voeten (2015) for United Nations voting data.
Chapter 9
• Page 295 Endogeneity is a central concern of the Medicaid literature. See, for
example, Currie and Gruber (1996), Finkelstein et al. (2012), and Baicker et al.
(2013).
• Page 317 The reduced form is simply the model rewritten to be only a function
of the non-endogenous variables (which are the X and Z variables, not the Y
variables). This equation isn’t anything fancy, although it takes a bit of math to
see where it comes from. Here goes:
3. Rearrange some more by moving all Y2 terms to the left side of the equation:

5. Relabel (γ0 + γ1β0)/(1 − γ1β1) as π0, (γ1β2 + γ2)/(1 − γ1β1) as π1, γ1β3/(1 − γ1β1) as π2, and γ3/(1 − γ1β1) as π3, and combine the error terms into ε̃:
This “reduced form” equation isn’t a causal model in any way. The π coefficients
are crazy mixtures of the coefficients in Equations 9.12 and 9.13, which are the
equations that embody the story we are trying to evaluate. The reduced form
equation is simply a useful way to write down the first-stage model.
Chapter 10
• Page 358 See Newhouse (1993), Manning, Newhouse, et al. (1987), and Gerber and Green (2012, 212–214) for more on the RAND experiment.
Chapter 12
• Page 423 A good place to start a consideration of maximum likelihood estimation
(MLE) is with the name. Maximum is, well, maximum; likelihood refers to the
probability of observing the data we observe; and estimation is, well, estimation.
For most people, the new bit is the likelihood. The concept is actually quite close to
ordinary usage. Roughly 20 percent of the U.S. population is under 15 years of age.
What is the likelihood that when we pick three people randomly, we get two people
under 15 and one over 15? The likelihood is L = 0.2 × 0.2 × 0.8 = 0.032. In other
words, if we pick three people at random in the United States, there is a roughly 3 percent
chance (or, "likelihood") we will observe two people under 15 and one over 15.
We can apply this concept when we do not know the underlying probability.
Suppose that we want to figure out what proportion of the population has health
insurance. Let’s call pinsured the probability that someone is insured (which is
simply the proportion of insured in the United States). Suppose we randomly select
three people, ask them if they are insured, and find out that two are insured and
one is not. The probability (or "likelihood") of observing that combination is

L = pinsured × pinsured × (1 − pinsured)

MLE finds an estimate of pinsured that maximizes the likelihood of observing the
data we actually observed.
We can get a feel for what values lead to high or low likelihoods by trying out a few
possibilities. If our estimate were pinsured = 0, the likelihood, L, would be 0. That’s
a silly guess. If our estimate were pinsured = 0.5, then L = 0.5 × 0.5 × (1 − 0.5) =
0.125, which is better. If we chose pinsured = 0.7, then L = 0.7 × 0.7 × 0.3 = 0.147,
which is even better. But if we chose pinsured = 0.9, then L = 0.9×0.9×0.1 = 0.081,
which is not as high as some of our other guesses.
Conceivably, we could keep plugging different values of pinsured into the likelihood
equation until we found the best value. Or, calculus gives us tools to quickly find
maxima.3 When we observe two people with insurance and one without, the value
of pinsured that maximizes the likelihood is 2/3, which, by the way, is the
common-sense estimate when we know that two of three observed people are insured.
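The "keep plugging in different values" idea is easy to carry out in R (a sketch, not the book's code):

```r
# Sketch: evaluate the likelihood L = p * p * (1 - p) over a grid of
# candidate values of p and find the maximizer
p <- seq(0, 1, by = 0.001)
L <- p * p * (1 - p)
p[which.max(L)]  # maximized near 2/3
```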
To use MLE to estimate a probit model, we extend this logic. Instead of estimating
a single probability parameter (pinsured in our previous example), we estimate the
probability that Yi = 1 as a function of independent variables. In other words, we
substitute Φ(β0 + β1 Xi ) for pinsured into the likelihood equation just given. In this
case, the thing we are trying to learn about is no longer pinsured ; it’s now the β’s that
determine the probability for each individual based on their respective Xi values.
If we observe two people who are insured and one who is not, we have

L = Φ(β0 + β1X1) × Φ(β0 + β1X2) × (1 − Φ(β0 + β1X3))
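In practice this likelihood is maximized numerically; in R, probit MLE is available through glm. A sketch on simulated data (variable names and coefficient values are illustrative):

```r
# Sketch: probit estimated by MLE via glm (true b0 = 0, b1 = 0.5)
set.seed(42)
X <- rnorm(500)
Y <- rbinom(500, 1, pnorm(0 + 0.5 * X))
ProbitFit <- glm(Y ~ X, family = binomial(link = "probit"))
coef(ProbitFit)  # estimates near 0 and 0.5
```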
• Page 429 To use the average-case approach, create a single “average” person for
whom the value of each independent variable is the average of that independent
variable. We calculate a fitted probability for this person. Then we add one to the
value of X1 for this average person and calculate how much the fitted probability
goes up. The downside of the average-case approach is that in the real data, the
variables might typically cluster together, with the result that no one is average
3 Here's the formal way to do this via calculus. First, calculate the derivative of the likelihood with
respect to p: ∂L/∂p = 2pinsured − 3p²insured. Second, set the derivative to zero and solve for pinsured; this
yields pinsured = 2/3.
across all variables. It's also kind of weird because dummy variables for the
"average" person will be between 0 and 1 even though no single observation can
have any value other than 0 or 1. This means, for example, that the "average"
person will be 0.52 female, 0.85 right-handed, and so forth.
To interpret probit coefficients using the average-case approach, use the following
guide:
– If X1 is a continuous variable: the estimated effect of a one-unit increase in X1 is Φ(β̂0 + β̂1(X̄1 + 1) + β̂2X̄2) − Φ(β̂0 + β̂1X̄1 + β̂2X̄2).
– If X1 is a dummy variable: the estimated effect is Φ(β̂0 + β̂1 × 1 + β̂2X̄2) − Φ(β̂0 + β̂1 × 0 + β̂2X̄2), with the other variables held at their averages.
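The average-case calculation for a continuous variable can be sketched in R as follows (simulated data; all names and coefficient values are illustrative, not from the text):

```r
# Sketch: average-case approach for a probit with two independent variables
set.seed(42)
X1 <- rnorm(500); X2 <- rnorm(500)
Y <- rbinom(500, 1, pnorm(0.2 + 0.5 * X1 - 0.3 * X2))
Fit <- glm(Y ~ X1 + X2, family = binomial(link = "probit"))
B <- coef(Fit)
# Fitted probability for the "average" person, then after adding 1 to X1
Base <- pnorm(B[1] + B[2] * mean(X1) + B[3] * mean(X2))
PlusOne <- pnorm(B[1] + B[2] * (mean(X1) + 1) + B[3] * mean(X2))
Effect <- PlusOne - Base
Effect  # estimated average-case effect of a one-unit increase in X1
```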
• Page 430 The marginal-effects approach uses calculus to determine the slope of
the fitted line. Obviously, the slope of the probit fitted line varies, so we have
to decide where to evaluate it, for example by averaging the slope across the
observations in the sample.
Chapter 13
• Page 460 Another form of correlated errors is spatial autocorrelation, which
occurs when the error for one observation is correlated with the error for another
observation that is spatially close to it. If we polled two people per household,
there may be spatial autocorrelation because those who live close to each other
(and sleep in the same bed!) may have correlated errors. This kind of situation
can arise with geography-based data, such as state- or county-level data, because
certain unmeasured similarities (meaning stuff in the error term) may be common
within regions. The consequences of spatial autocorrelation are similar to the
consequences of serial autocorrelation. Spatial autocorrelation does not cause
bias. It does, however, cause the conventional standard error
equation for OLS coefficients to be incorrect. The easiest first step for dealing
with this situation is simply to include a dummy variable for region. Often this
step will capture any regional correlations not captured by the other independent
variables. A more technically complex way of dealing with this situation is via
spatial regression statistical models. The intuition underlying these models is
similar to that for serial correlation, but the math is typically harder. See, for
example, Tam Cho and Gimpel (2012).
• Page 465 Wooldridge (2009, 416) discusses inclusion of X variables in this test.
The so-called Breusch-Godfrey test is a more general test for autocorrelation. See,
for example, Greene (2003, 269).
• Page 469 Wooldridge (2009, 424) notes that the ρ-transformed approach also
requires that εt not be correlated with Xt−1 or Xt+1. In a ρ-transformed model,
the independent variable is Xt − ρXt−1 and the error is εt − ρεt−1. If the lagged
error term (εt−1) is correlated with Xt, then the independent variable in the
ρ-transformed model will be correlated with the error term in the ρ-transformed
model.
• Page 478 R code to generate multiple simulations with unit root (or other) time
series variables:
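A minimal sketch of such a simulation (the details are illustrative and not necessarily the book's code): regress one random-walk series on another, independent one, many times, and record how often the slope appears "significant."

```r
# Sketch: spurious regression with independent unit-root (random walk) series
set.seed(42)
Reps <- 500
TStats <- rep(NA, Reps)
for (i in 1:Reps) {
  Y <- cumsum(rnorm(100))  # random walk: Y_t = Y_(t-1) + e_t
  X <- cumsum(rnorm(100))  # an independent random walk
  TStats[i] <- summary(lm(Y ~ X))$coefficients["X", "t value"]
}
mean(abs(TStats) > 1.96)  # far above the nominal 0.05 rejection rate
```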
• Page 490 To estimate a Cochrane-Orcutt manually in R, begin with the R code for
diagnosing autocorrelation and then
# Rho is rho-hat
Rho = summary(LagErrOLS)$coefficients[2]
# Length of Temp variable
N = length(Temp)
# Lagged temperature
LagTemp = c(NA, Temp[1:(N-1)])
# Lagged year
LagYear = c(NA, Year[1:(N-1)])
# Rho-transformed temperature
TempRho = Temp - Rho*LagTemp
# Rho-transformed year
YearRho = Year - Rho*LagYear
# Rho-transformed model
ClimateRho = lm(TempRho ~ YearRho)
# Display results
summary(ClimateRho)
Chapter 14
• Page 510 The attenuation bias result was introduced in Section 5.3. We can
also derive it by using the general form of endogeneity from page 60, which
is plim β̂1 = β1 + corr(X1, ε)(σε/σX1) = β1 + cov(X1, ε)/σ²X1. Note that the error term in
Equation 14.21 (which is analogous to ε in the plim equation) actually contains
−β1νi + εi. Solving for cov(X1, −β1νi + ε) yields −β1σ²ν.
Chapter 16
• Page 534 Professor Andrew Gelman, of Columbia University, directed me to this
saying of Bill James.
GUIDE TO REVIEW QUESTIONS
Chapter 1
Review question on page 7:
Panel (d): Note that the X-axis ranges from about −6 to +6. β0 is the value of
Y when X is zero and is therefore 2, which can be seen in Figure R.1. β0 is not
the value of Y at the left-most point in the figure, as it was for the other panels
in Figure 1.4.
Chapter 3
Review questions on page 64:
1. Note that the variance of the independent variable is much smaller in panel (b).
From the equation for the variance of β̂1 , we know that higher variance of X is
associated with lower variance of β̂1 , meaning the variance of β̂1 in panel (a)
should be lower.
2. Note that the number of observations is much larger in panel (d). From the
equation for the variance of β̂1 , we know that higher sample size is associated
with lower variance, meaning the variance of β̂1 in panel (d) should be lower.
Chapter 4
Review questions on page 106:
(b) The degrees of freedom is sample size minus the number of parameters
estimated, so it is 17 − 2 = 15.
[Figure R.1 (for the Chapter 1 review question): Y plotted against the independent variable X, with X ranging from −6 to 6.]
(c) The critical value for a two-sided alternative hypothesis and α = 0.01 is
2.95. We reject the null hypothesis.
(d) The critical value for a one-sided alternative hypothesis and α = 0.05 is
1.75. We reject the null hypothesis.
2. The critical value from a two-sided test is bigger because it marks the point
beyond which α/2 (rather than α) of the distribution lies. As Table 4.4 shows, the two-sided
critical values are larger than the one-sided critical values for all values of α.
3. The critical values from a small sample are larger because the t distribution
accounts for additional uncertainty about our estimate of the standard error of
β̂1 . In other words, even when the null hypothesis is true, the data could work
out to give us an unusually small estimate of se(β̂1 ), which would push up our t
statistic. That is, the more uncertainty there is about se(β̂1 ), the more we could
expect to see higher values of the t statistic even when the null hypothesis is
true. As the sample size increases, uncertainty about se(β̂1 ) decreases, so even
when the null hypothesis is true, this source of large t statistics diminishes.
Chapter 5
Review questions on page 150:
1. Not at all. R2j will be approximately zero. In a random experiment, the treatment
is uncorrelated with anything, including the other covariates. This buys us
exogeneity, but it also buys us increased precision.
2. We’d like to have a low variance for estimates, and to get that we want the R2j to
be small. In other words, we want the independent variables to be uncorrelated
with each other.
Chapter 6
Review questions on page 186:
The estimated constant (β̂0 ) is the average value of Yi for units in the excluded
category (in this case, U.S. citizens) after we have accounted for the effect
of X1 . The coefficient on the Canada dummy variable (β̂2 ) estimates how
much more or less Canadians feel about Y compared to Americans, the
excluded reference category. The coefficient on the Mexico dummy variable
(β̂3 ) estimates how much more or less Mexicans feel about Y compared to
Americans. Using Mexico or Canada as the reference category would be equally valid.
2. (a) 25
(b) 20
(c) 30
(d) 115
(e) 5
(f) −20
(g) 120
(h) −5
(i) −25
(j) 5
Chapter 7
Review questions on page 230:
1. Panel (a) looks like a quadratic model with effect accelerating as profits rise.
Panel (b) looks like a quadratic model with effect accelerating as profits rise.
Panel (c) is a bit of a trick question, as the relationship is largely linear but
with a few unusual observations for profits around 4. A quadratic model would
estimate an upside-down U-shape, but it would also be worth exploring whether these
are outliers or whether these observations can perhaps be explained by other variables.
Panel (d) looks like a quadratic model with rising and then falling effect of
profits on investment. For all quadratic models, we would simply include a
variable with the squared value of profits and let the computer program tell us
the coefficient values that produce the appropriate curve.
2. The sketches would draw lines through the masses of data for panels (a), (b),
and (d). The sketch for panel (c) would depend on whether we stuck with a
quadratic model or treated the unusual observations as outliers to be excluded
or modeled with other variables.
Chapter 8
Review question on page 282—see Table R.1:
β0 2 3 2 3
β1 −1 −1 0 0
β2 0 −2 2 −2
β3 2 2 −1 1
Chapter 9
Review questions on page 308:
1. The first stage is the model explaining drinks per week. The second stage is
the model explaining grades. The instrument is beer tax, as we can infer based
on its inclusion in the first stage and exclusion from the second stage.
3. There is no evidence on exogeneity of the beer tax in the table because this is
not something we can assess empirically.
5. No. The first stage results do not satisfy the inclusion condition, and we
therefore cannot place any faith in the results of the second stage.
Chapter 10
Review questions on page 359:
1. There is a balance problem as the treatment villages have higher income, with
a t statistic of 2.5 on the treatment variable. Hence, we cannot be sure that the
differences in the treated and untreated villages are due to the treatment or to
the fact that the treated villages are wealthier. There is no difference in treated
and untreated villages with regard to population.
2. There is a possible attrition problem as treated villages are more likely to report
test scores. This is not surprising as teachers from treated villages have more
of an incentive to report test scores. The implication of this differential attrition
is not clear, however. It could be that the low-performing school districts tend
not to report among the control village while even low-performing school
districts report among the treated villages. Hence, the attrition is not necessarily
damning of the results. Rather, it calls for further analysis.
3. The first column reports that students in treated villages had substantially
higher test scores. However, we need to control for village income as well
because the treated villages also tended to have higher income. In addition, we
should be somewhat wary of the fact that 20 villages did not report test scores.
As discussed earlier, the direction of the bias is not clear, but it would be useful
to see additional analysis of the kinds of districts that did and did not report
test scores. Perhaps the data set could be trimmed and reanalyzed.
Chapter 11
Review question on page 384:
(a) β1 = 0, β2 = 0, β3 < 0
(f) β1 < 0, β2 < 0, β3 > 0 (here, too, β3 = −β2 , which means β3 is positive because
β2 is negative)
Chapter 12
Review questions on page 426:
Panel (b): X = 2/3
(b) False. The t statistic is 1, which is not statistically significant for any
reasonable significance level.
(a) The fitted probability is Φ(0 + 0.5 × 4 − 0.5 × 0) = Φ(2), which is 0.978.
Chapter 14
Review questions on page 502:
(a) The power when β1True = 1 is 1 − Φ(2.32 − 1/0.75) = 0.162.
(b) The power when β1True = 2 is 1 − Φ(2.32 − 2/0.75) = 0.636.
2. If the estimated se(β̂1) doubled, the power will go down because the center of
the t statistic distribution will shift toward zero (because β1True/se(β̂1) gets smaller as
the standard error increases). For this higher standard error, the power when
β1True = 1 is 1 − Φ(2.32 − 1/1.5) = 0.049, and the power when β1True = 2 is
1 − Φ(2.32 − 2/1.5) = 0.161.
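These power numbers can be checked with R's normal CDF function, pnorm:

```r
# Checking the power calculations with the standard normal CDF
1 - pnorm(2.32 - 1 / 0.75)  # about 0.162
1 - pnorm(2.32 - 2 / 0.75)  # about 0.636
1 - pnorm(2.32 - 1 / 1.5)   # about 0.049
1 - pnorm(2.32 - 2 / 1.5)   # about 0.161
```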
Appendix
Review questions on page 548:
1. The table in Figure A.4 shows that the probability a standard normal random
variable is less than or equal to 1.64 is 0.950, meaning there is a 95 percent
chance that a normal random variable will be less than or equal to whatever
value is 1.64 standard deviations above its mean.
2. The table in Figure A.4 shows that the probability a standard normal random
variable is less than or equal to −1.28 is 0.100, meaning there is a 10 percent
chance that a normal random variable will be less than or equal to whatever
value is 1.28 standard deviations below its mean.
3. The table in Figure A.4 shows that the probability that a standard normal
random variable is greater than 1.28 is 0.900. Because the probability of being
above some value is 1 minus the probability of being below some value, there
is a 10 percent chance that a normal random variable will be greater than or
equal to whatever number is 1.28 standard deviations above its mean.
5. First, convert −3 to standard deviations above or below the mean. In this case,
if the variance is 9, then the standard deviation (the square root of the variance)
is 3. Therefore, −3 is the same as one standard deviation below the mean. From
the table in Figure A.4, we see that there is a 0.16 probability a normal variable
will be more than one standard deviation below its mean. In other words, the
probability of being less than (−3 − 0)/√9 = −1 standard deviations is 0.16.
6. First, convert 9 to standard deviations above or below the mean. The standard
deviation (the square root of the variance) is 2. The value 9 is (9 − 7.2)/2 = 1.8/2 =
0.9 standard deviations above the mean. The value 0.9 does not appear in
Figure A.4. However, it is close to 1, and the probability of being less than
1 is 0.84. Therefore, a reasonable approximation is in the vicinity of 0.8. The
actual value is 0.82 and can be calculated as discussed in the Computing Corner
on page 554.
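The table lookups in these answers can be reproduced exactly with pnorm, the approach the Computing Corner on page 554 points to:

```r
# Checking the normal-distribution answers with pnorm
pnorm(1.64)                # about 0.95 (question 1)
pnorm(-1.28)               # about 0.10 (question 2)
1 - pnorm(1.28)            # about 0.10 (question 3)
pnorm((-3 - 0) / sqrt(9))  # about 0.16 (question 5)
pnorm((9 - 7.2) / 2)       # about 0.82 (question 6)
```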
BIBLIOGRAPHY
Acemoglu, Daron, Simon Johnson, and James A. Robinson. 2001. The Colonial Origins of Comparative Development: An Empirical Investigation. American Economic Review 91(5): 1369–1401.
Acemoglu, Daron, Simon Johnson, James A. Robinson, and Pierre Yared. 2008. Income and Democracy. American Economic Review 98(3): 808–842.
Acharya, Avidit, Matthew Blackwell, and Maya Sen. 2016. Explaining Causal Findings without Bias: Detecting and Assessing Direct Effects. American Political Science Review 110(3): 512–529.
Achen, Christopher H. 1982. Interpreting and Using Regression. Newbury Park, CA: Sage Publications.
Achen, Christopher H. 2000. Why Lagged Dependent Variables Can Suppress the Explanatory Power of Other Independent Variables. Manuscript, University of Michigan.
Achen, Christopher H. 2002. Toward a New Political Methodology: Microfoundations and ART. Annual Review of Political Science 5: 423–450.
Albertson, Bethany, and Adria Lawrence. 2009. After the Credits Roll: The Long-Term Effects of Educational Television on Public Knowledge and Attitudes. American Politics Research 37(2): 275–300.
Alvarez, R. Michael, and John Brehm. 1995. American Ambivalence towards Abortion Policy: Development of a Heteroskedastic Probit Model of Competing Values. American Journal of Political Science 39(4): 1055–1082.
Anderson, James M., John M. Macdonald, Ricky Bluthenthal, and J. Scott Ashwood. 2013. Reducing Crime by Shaping the Built Environment with Zoning: An Empirical Study of Los Angeles. University of Pennsylvania Law Review 161: 699–756.
Angrist, Joshua. 2006. Instrumental Variables Methods in Experimental Criminological Research: What, Why and How. Journal of Experimental Criminology 2(1): 23–44.
Angrist, Joshua, and Alan Krueger. 1991. Does Compulsory School Attendance Affect Schooling and Earnings? Quarterly Journal of Economics 106(4): 979–1014.
Angrist, Joshua, and Jörn-Steffen Pischke. 2009. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton, NJ: Princeton University Press.
Angrist, Joshua, and Jörn-Steffen Pischke. 2010. The Credibility Revolution in Empirical Economics: How Better Research Design Is Taking the Con out of Econometrics. National Bureau of Economic Research working paper. http://www.nber.org/papers/w15794
Angrist, Joshua, Kathryn Graddy, and Guido Imbens. 2000. The Interpretation of Instrumental Variables Estimators in Simultaneous Equations Models with an Application to the Demand for Fish. Review of Economic Studies 67(3): 499–527.
Anscombe, Francis J. 1973. Graphs in Statistical Analysis. American Statistician 27(1): 17–21.
Anzia, Sarah. 2012. The Election Timing Effect: Evidence from a Policy Intervention in Texas. Quarterly Journal of Political Science 7(3): 209–248.
Arellano, Manuel, and Stephen Bond. 1991. Some Tests of Specification for Panel Data. Review of Economic Studies 58(2): 277–297.
Aron-Dine, Aviva, Liran Einav, and Amy Finkelstein. 2013. The RAND Health Insurance Experiment, Three Decades Later. Journal of Economic Perspectives 27(1): 197–222.
Aronow, Peter M., and Cyrus Samii. 2016. Does Regression Produce Representative Estimates of Causal Effects? American Journal of Political Science 60(1): 250–267.
Baicker, Katherine, and Amitabh Chandra. 2017. Evidence-Based Health Policy. New England Journal of Medicine 377(25): 2413–2415.
Baicker, Katherine, Sarah Taubman, Heidi Allen, Mira Bernstein, Jonathan Gruber, Joseph P. Newhouse, Eric Schneider, Bill Wright, Alan Zaslavsky, Amy Finkelstein, and the Oregon Health Study Group. 2013. The Oregon Experiment—Medicaid's Effects on Clinical Outcomes. New England Journal of Medicine 368(18): 1713–1722.
Bailey, Michael A., and Elliott Fullmer. 2011. Balancing in the States, 1978–2009. State Politics and Policy Quarterly 11(2): 149–167.
Bailey, Michael A., Daniel J. Hopkins, and Todd Rogers. 2015. Unresponsive and Unpersuaded: The Unintended Consequences of Voter Persuasion Efforts. Manuscript, Georgetown University.
Bailey, Michael A., Jon Mummolo, and Hans Noel. 2012. Tea Party Influence: A Story of Activists and Elites. American Politics Research 40(5): 769–804.
Bailey, Michael A., Jeffrey S. Rosenthal, and Albert H. Yoon. 2014. Grades and Incentives: Assessing Competing Grade Point Average Measures and Postgraduate Outcomes. Studies in Higher Education.
Bailey, Michael A., Anton Strezhnev, and Erik Voeten. 2015. Estimating Dynamic State Preferences from United Nations Voting Data. Journal of Conflict Resolution.
Baiocchi, Michael, Jing Cheng, and Dylan S. Small. 2014. Tutorial in Biostatistics: Instrumental Variable Methods for Causal Inference. Statistics in Medicine 33(13): 2297–2340.
Baltagi, Badi H. 2005. Econometric Analysis of Panel Data, 3rd ed. Hoboken, NJ: Wiley.
Banerjee, Abhijit Vinayak, and Esther Duflo. 2011. Poor Economics: A Radical Rethinking of the Way to Fight Global Poverty. New York: Public Affairs.
Bartels, Larry M. 2008. Unequal Democracy: The Political Economy of the New Gilded Age. Princeton, NJ: Princeton University Press.
Beck, Nathaniel. 2010. Making Regression and Related Output More Helpful to Users. The Political Methodologist 18(1): 4–9.
Beck, Nathaniel, and Jonathan N. Katz. 1996. Nuisance vs. Substance: Specifying and Estimating Time-Series–Cross-Section Models. Political Analysis 6: 1–36.
Beck, Nathaniel, and Jonathan N. Katz. 2011. Modeling Dynamics in Time-Series–Cross-Section Political Economy Data. Annual Review of Political Science 14: 331–352.
Berk, Richard A., Alec Campbell, Ruth Klap, and Bruce Western. 1992. The Deterrent Effect of Arrest in Incidents of Domestic Violence: A Bayesian Analysis of Four Field Experiments. American Sociological Review 57(5): 698–708.
Bertrand, Marianne, and Sendhil Mullainathan. 2004. Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination. American Economic Review 94(4): 991–1013.
Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. 2004. How Much Should We Trust Differences-in-Differences Estimates? Quarterly Journal of Economics 119(1): 249–275.
Blinder, Alan S., and Mark W. Watson. 2013. Presidents and the Economy: A Forensic Investigation. Manuscript, Princeton University.
Bloom, Howard S. 2012. Modern Regression Discontinuity Analysis. Journal of Research on Educational Effectiveness 5(1): 43–82.
Bound, John, David Jaeger, and Regina Baker. 1995. Problems with Instrumental Variables Estimation When the Correlation Between the Instruments and the Endogenous Explanatory Variable Is Weak. Journal of the American Statistical Association 90(430): 443–450.
Box, George E. P. 1976. Science and Statistics. Journal of the American Statistical Association 71(356): 791–799.
BIBLIOGRAPHY 579
Box-Steffensmeier, Janet, and Agnar Freyr Campbell, James E. 2011. The Economic Records
Helgason. 2016. Introduction to Symposium on of the Presidents: Party Differences and Inherited
Time Series Error Correction Methods in Economic Conditions. Forum 9(1): 1–29.
Political Science. Political Analysis 24(1):1–2. Card, David. 1990. The Impact of the Mariel
Box-Steffensmeier, Janet M., and Bradford S. Jones. Boatlift on the Miami Labor Market. Industrial
2004. Event History Modeling: A Guide for and Labor Relations Review 43(2): 245–257.
Social Scientists. Cambridge, U.K.: Cambridge
Card, David. 1999. The Causal Effect of Education
University Press.
on Earnings. In Handbook of Labor Economics,
Bradford-Hill, Austin. 1965. The Environment and vol. 3, O. Ashenfelter and D. Card, eds.
Disease: Association or Causation? Proceedings Amsterdam: Elsevier Science.
of the Royal Society of Medicine 58(5): 295–300.
Card, David, Carlos Dobkin, and Nicole Maestas.
Brambor, Thomas, William Roberts Clark, and Matt 2009. Does Medicare Save Lives? Quarterly
Golder. 2006. Understanding Interaction Models: Journal of Economics 124(2): 597–636.
Improving Empirical Analyses. Political Analysis
Carrell, Scott E., Mark Hoekstra, and James E.
14: 63–82.
West. 2010. Does Drinking Impair College
Braumoeller, Bear F. 2004. Hypothesis Testing and Performance? Evidence from a Regression
Multiplicative Interaction Terms. International Discontinuity Approach. National Bureau of
Organization 58(4): 807–820. Economic Research Working Paper.
Brown, Peter C., Henry L. Roediger III, and Mark http://www.nber.org/papers/w16330
A. McDaniel. 2014. Making It Stick: The Science Carroll, Royce, Jeffrey B. Lewis, James Lo, Keith
of Successful Learning. Cambridge, MA: T. Poole, and Howard Rosenthal. 2009.
Harvard University Press. Measuring Bias and Uncertainty in
Brownlee, Shannon, and Jeanne Lenzer. 2009. Does DW-NOMINATE Ideal Point Estimates via the
the Vaccine Matter? The Atlantic, November. Parametric Bootstrap. Political Analysis 17:
www.theatlantic.com/doc/200911/brownlee-h1n1/2 261–27. Updated at http://voteview.com/
dwnominate.asp
Brumm, Harold J., Dennis Epple, and Bennett T.
McCallum. 2008. Simultaneous Equation Carroll, Royce, Jeffrey B. Lewis, James Lo, Keith
Econometrics: Some Weak-Instrument and T. Poole, and Howard Rosenthal. 2014.
Time-Series Issues. Manuscript, Carnegie DW-NOMINATE Scores with Bootstrapped
Mellon. Standard Errors. Updated February 17, 2013, at
http://voteview.com/dwnominate.asp
Buckles, Kasey, and Dan Hungerman. 2013. Season
of Birth and Later Outcomes: Old Questions, Cellini, Stephanie Riegg, Fernando Ferreira, and
New Answers. The Review of Economics and Jesse Rothstein. 2010. The Value of School
Statistics 95(3): 711–724. Facility Investments: Evidence from a Dynamic
Regression Discontinuity Design. Quarterly
Buddlemeyer, Hielke, and Emmanuel Skofias. 2003.
Journal of Economics 125(1): 215–261.
An Evaluation on the Performance of Regression
Discontinuity Design on PROGRESA. Institute Chabris, Christopher, and Daniel Simmons. Does
for Study of Labor, Discussion Paper 827. the Ad Make Me Fat? New York Times, March
10, 2013.
Burde, Dana, and Leigh L. Linden. 2013. Bringing
Education to Afghan Girls: A Randomized Chakraborty, Indraneel, Hans A. Holter, and Serhiy
Controlled Trial of Village-Based Schools. Stepanchuk. 2012. Marriage Stability, Taxation,
American Economic Journal: Applied Economics and Aggregate Labor Supply in the U.S. vs.
5(3): 27–40. Europe. Uppsala University Working Paper
Burtless, Gary, 1995. The Case for Randomized 2012: 10.
Field Trials in Economic and Policy Research. Chen, Xiao, Philip B. Ender, Michael Mitchell, and
Journal of Economic Perspectives 9(2): 63–84. Christine Wells. 2003. Regression with Stata.
Finkelstein, Amy, Sarah Taubman, Bill Wright, Mira Bernstein, Jonathan Gruber, Joseph P. Newhouse, Heidi Allen, Katherine Baicker, and the Oregon Health Study Group. 2012. The Oregon Health Insurance Experiment: Evidence from the First Year. Quarterly Journal of Economics 127(3): 1057–1106.
Gaubatz, Kurt Taylor. 2015. A Survivor's Guide to R: An Introduction for the Uninitiated and the Unnerved. Los Angeles: Sage.
Gerber, Alan S., and Donald P. Green. 2000. The Effects of Canvassing, Telephone Calls, and Direct Mail on Voter Turnout: A Field Experiment. American Political Science Review 94(3): 653–663.
Gerber, Alan S., and Donald P. Green. 2005. Correction to Gerber and Green (2000), Replication of Disputed Findings, and Reply to Imai (2005). American Political Science Review 99(2): 301–313.
Gerber, Alan S., and Donald P. Green. 2012. Field Experiments: Design, Analysis, and Interpretation. New York: Norton.
Gertler, Paul. 2004. Do Conditional Cash Transfers Improve Child Health? Evidence from PROGRESA's Control Randomized Experiment. American Economic Review 94(2): 336–341.
Goldberger, Arthur S. 1991. A Course in Econometrics. Cambridge, MA: Harvard University Press.
Gormley, William T., Jr., Deborah Phillips, and Ted Gayer. 2008. Preschool Programs Can Boost School Readiness. Science 320(5884): 1723–1724.
Grant, Taylor, and Matthew J. Lebo. 2016. Error Correction Methods with Political Time Series. Political Analysis 24(1): 3–30.
Green, Donald P., Soo Yeon Kim, and David H. Yoon. 2001. Dirty Pool. International Organization 55(2): 441–468.
Green, Joshua. 2012. The Science Behind Those Obama Campaign E-Mails. Business Week (November 29). http://www.businessweek.com/articles/2012-11-29/the-science-behind-those-obama-campaign-e-mails
Greene, William. 2003. Econometric Analysis, 5th ed. Upper Saddle River, NJ: Prentice Hall.
Greene, William. 2008. Econometric Analysis, 6th ed. Upper Saddle River, NJ: Prentice Hall.
Grimmer, Justin, Eitan Hersh, Brian Feinstein, and Daniel Carpenter. 2010. Are Close Elections Randomly Determined? Manuscript, Stanford University.
Hanmer, Michael J., and Kerem Ozan Kalkan. 2013. Behind the Curve: Clarifying the Best Approach to Calculating Predicted Probabilities and Marginal Effects from Limited Dependent Variable Models. American Journal of Political Science 57(1): 263–277.
Hanushek, Eric, and Ludger Woessmann. 2012. Do Better Schools Lead to More Growth? Cognitive Skills, Economic Outcomes, and Causation. Journal of Economic Growth 17(4): 267–321.
Harvey, Anna. 2011. What's So Great about Independent Courts? Rethinking Crossnational Studies of Judicial Independence. Manuscript, New York University.
Hausman, Jerry A., and William E. Taylor. 1981. Panel Data and Unobservable Individual Effects. Econometrica 49(6): 1377–1398.
Heckman, James J. 1979. Sample Selection Bias as a Specification Error. Econometrica 47(1): 153–161.
Heinz, Matthias, Sabrina Jeworrek, Vanessa Mertins, Heiner Schumacher, and Matthias Sutter. 2017. Measuring Indirect Effects of Unfair Employer Behavior on Worker Productivity: A Field Experiment. MPI Collective Goods Preprint, No. 2017/22.
Herndon, Thomas, Michael Ash, and Robert Pollin. 2014. Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff. Cambridge Journal of Economics 38(2): 257–279.
Howell, William G., and Paul E. Peterson. 2004. The Use of Theory in Randomized Field Trials: Lessons from School Voucher Research on Disaggregation, Missing Data, and the Generalization of Findings. American Behavioral Scientist 47(5): 634–657.
Imai, Kosuke. 2005. Do Get-Out-the-Vote Calls Reduce Turnout? The Importance of Statistical Methods for Field Experiments. American Political Science Review 99(2): 283–300.
Imai, Kosuke, Gary King, and Elizabeth A. Stuart. 2008. Misunderstandings among Experimentalists and Observationalists about Causal Inference. Journal of the Royal Statistical Society, Series A (Statistics in Society) 171(2): 481–502.
Imbens, Guido W. 2014. Instrumental Variables: An Econometrician's Perspective. IZA Discussion Paper 8048. Bonn: Forschungsinstitut zur Zukunft der Arbeit (IZA).
Imbens, Guido W., and Thomas Lemieux. 2008. Regression Discontinuity Designs: A Guide to Practice. Journal of Econometrics 142(2): 615–635.
Iqbal, Zaryab, and Christopher Zorn. 2008. The Political Consequences of Assassination. Journal of Conflict Resolution 52(3): 385–400.
Jackman, Simon. 2009. Bayesian Analysis for the Social Sciences. Hoboken, NJ: Wiley.
Jacobson, Gary C. 1978. Effects of Campaign Spending in Congressional Elections. American Political Science Review 72(2): 469–491.
Kalla, Joshua L., and David E. Broockman. 2015. Congressional Officials Grant Access due to Campaign Contributions: A Randomized Field Experiment. American Journal of Political Science 60(3): 545–558.
Kam, Cindy D., and Robert J. Franceze, Jr. 2007. Modeling and Interpreting Interactive Hypotheses in Regression Analysis. Ann Arbor: University of Michigan Press.
Kastellec, Jonathan P., and Eduardo L. Leoni. 2007. Using Graphs Instead of Tables in Political Science. Perspectives on Politics 5(4): 755–771.
Keele, Luke, and Nathan J. Kelly. 2006. Dynamic Models for Dynamic Theories: The Ins and Outs of Lagged Dependent Variables. Political Analysis 14: 186–205.
Kennedy, Peter. 2008. A Guide to Econometrics, 6th ed. Malden, MA: Blackwell Publishing.
Khimm, Suzy. 2010. Who Is Alvin Greene? Mother Jones. http://motherjones.com/mojo/2010/06/alvin-greene-south-carolina
King, Gary. 1989. Unifying Political Methodology: The Likelihood Theory of Statistical Inference. Cambridge: Cambridge University Press.
King, Gary. 1995. Replication, Replication. PS: Political Science and Politics 28(3): 444–452.
King, Gary, and Langche Zeng. 2001. Logistic Regression in Rare Events Data. Political Analysis 9: 137–163.
King, Gary, Robert Keohane, and Sidney Verba. 1994. Designing Social Inquiry: Scientific Inference in Qualitative Research. Princeton, NJ: Princeton University Press.
Kiviet, Jan F. 1995. On Bias, Inconsistency, and Efficiency of Various Estimators in Dynamic Panel Data Models. Journal of Econometrics 68(1): 53–78.
Klick, Jonathan, and Alexander Tabarrok. 2005. Using Terror Alert Levels to Estimate the Effect of Police on Crime. Journal of Law and Economics 48(1): 267–279.
Koppell, Jonathan G. S., and Jennifer A. Steen. 2004. The Effects of Ballot Position on Election Outcomes. Journal of Politics 66(1): 267–281.
La Porta, Rafael, F. Lopez-de-Silanes, C. Pop-Eleches, and A. Schliefer. 2004. Judicial Checks and Balances. Journal of Political Economy 112(2): 445–470.
Lee, David S. 2008. Randomized Experiments from Non-random Selection in U.S. House Elections. Journal of Econometrics 142(2): 675–697.
Lee, David S. 2009. Training, Wages, and Sample Selection: Estimating Sharp Bounds on Treatment Effects. Review of Economic Studies 76(3): 1071–1102.
Lee, David S., and Thomas Lemieux. 2010. Regression Discontinuity Designs in Economics. Journal of Economic Literature 48(2): 281–355.
Lenz, Gabriel, and Alexander Sahn. 2017.
Keele, Luke, and David Park. 2006. Difficult Achieving Statistical Significance with
Choices: An Evaluation of Heterogenous Choice Covariates and without Transparency.
Models. Manuscript, Ohio State University. Manuscript.
BIBLIOGRAPHY 583
Lerman, Amy E. 2009. The People Prisons Make: Manning, Willard G., Joseph P. Newhouse, Naihua
Effects of Incarceration on Criminal Psychology. Duan, Emmett B. Keeler, and Arleen Leibowitz.
In Do Prisons Make Us Safer? Steve Raphael 1987. Health Insurance and the Demand for
and Michael Stoll, eds. New York: Russell Sage Medical Care: Evidence from a Randomized
Foundation. Experiment. American Economic Review 77(3):
Levitt, Steven D. 1997. Using Electoral Cycles in 251–277.
Police Hiring to Estimate the Effect of Police on Manzi, Jim. 2012. Uncontrolled: The Surprising
Crime. American Economic Review 87(3): Payoff of Trial-and-Error for Business, Politics
270–290. and Society. New York: Basic Books.
Levitt, Steven D. 2002. Using Electoral Cycles in Marvell, Thomas B., and Carlisle E. Moody. 1996.
Police Hiring to Estimate the Effect of Police on Specification Problems, Police Levels and Crime
Crime: A Reply. American Economic Review Rates. Criminology 34(4): 609–646.
92(4): 1244–1250. McClellan, Chandler B., and Erdal Tekin. 2012.
Lochner, Lance, and Enrico Moretti. 2004. The Stand Your Ground Laws and Homicides.
Effect of Education on Crime: Evidence from National Bureau of Economic Research Working
Prison Inmates, Arrests, and Self-Reports. Paper No. 18187.
American Economic Review 94(1): 155–189. McCrary, Justin. 2002. Using Electoral Cycles in
Long, J. Scott. 1997. Regression Models for Police Hiring to Estimate the Effect of Police on
Categorical and Limited Dependent Variables. Crime: Comment. American Economic Review
London: Sage Publications. 92(4): 1236–1243.
Lorch, Scott A., Michael Baiocchi, Corinne S. McCrary, Justin. 2008. Manipulation of the Running
Ahlberg, and Dylan E. Small. 2012. The Variable in the Regression Discontinuity Design:
Differential Impact of Delivery Hospital on the A Density Test. Journal of Econometrics 142(2):
Outcomes of Premature Infants. Pediatrics 698–714.
130(2): 270–278. Miguel, Edward, and Michael Kremer. 2004.
Ludwig, Jens, and Douglass L. Miller. 2007. Does Worms: Identifying Impacts on Education and
Head Start Improve Children’s Life Chances? Health in the Presence of Treatment
Evidence from a Regression Discontinuity Externalities. Econometrica 72(1): 159–217.
Design. Quarterly Journal of Economics 122(1): Miguel, Edward, Shanker Satyanath, and Ernest
159–208. Sergenti. 2004. Economic Shocks and Civil
Lumley, Thomas, Paula Diehr, Scott Emerson, and Conflict: An Instrumental Variables Approach.
Lu Chen. 2002. The Importance of the Normality Journal of Political Economy 112(4): 725–753.
Assumption in Large Public Health Data Sets. Montgomery, Jacob M., Brendan Nyhan, and
Annual Review of Public Health 23: 151–169. Michelle Torres. 2017. How Conditioning on
Madestam, Andreas, Daniel Shoag, Stan Veuger, Post-Treatment Variables Can Ruin Your
and David Yanagizawa-Drott. 2013. Do Political Experiment and What to Do about It.
Protests Matter? Evidence from the Tea Party Manuscript, Washington University.
Movement. Quarterly Journal of Economics Morgan, Stephen L., and Christopher Winship.
128(4): 1633–1685. 2014. Counterfactuals and Causal Inference:
Makowsky, Michael, and Thomas Stratmann. 2009. Methods and Principals for Social Research,
Political Economy at Any Speed: What 2nd ed. Cambridge, U.K.: Cambridge University
Determines Traffic Citations? American Press.
Economic Review 99(1): 509–527. Murnane, Richard J., and John B. Willett. 2011.
Malkiel, Burton G. 2003. A Random Walk Down Methods Matter: Improving Causal Inference in
Wall Street: The Time-Tested Strategy for Educational and Social Science Research.
Successful Investing. New York: W.W. Norton. Oxford, U.K.: Oxford University Press.
584 BIBLIOGRAPHY
Murray, Michael P. 2006a. Avoiding Invalid Payments of 2008. American Economic Review
Instruments and Coping with Weak Instruments. 103(6): 2530–2553.
Journal of Economic Perspectives 20(4):
Persico, Nicola, Andrew Postlewaite, and Dan
111–132.
Silverman. 2004. The Effect of Adolescent
Murray, Michael P. 2006b. Econometrics: A Modern Experience on Labor Market Outcomes: The
Introduction. Boston: Pearson Addison Wesley. Case of Height. Journal of Political Economy
National Aeronautics and Space Administration. 112(5): 1019–1053.
2012. Combined Land-Surface Air and Pesaran, M. Hasehm, Yongcheol Shin, and Richard
Sea-Surface Water Temperature Anomalies J. Smith. 2001. Bounds Testing Approaches to
(Land-Ocean Temperature Index, LOTI) the Analysis of Level Relationships. Journal of
Global-Mean Monthly, Seasonal, and Annual Applied Econometrics 16(3): 289–326.
Means, 1880–Present, Updated through Most
Recent Months at https://data.giss.nasa.gov/ Philips, Andrew Q. 2018. Have Your Cake and Eat It
gistemp/ Too? Cointegration and Dynamic Inference from
Autoregressive Distributed Lag Models.
National Center for Addiction and Substance American Journal of Political Science 62(1):
Abuse at Columbia University. 2011. National 230–244.
Survey of American Attitudes on Substance
Abuse XVI: Teens and Parents (August). Pickup, Mark, and Paul M. Kellstedt. 2017.
Accessed November 10, 2011, at Equation Balance in Time Series Analysis: What
www.casacolumbia.org/download.aspx?path= It Is and How to Apply It. Manuscript, Simon
/UploadedFiles/ooc3hqnl.pdf Fraser University.
Newhouse, Joseph. 1993. Free for All? Lessons from Pierskalla, Jan H., and Florian M. Hollenbach. 2013.
the RAND Health Insurance Experiment. Technology and Collective Action: The Effect of
Cambridge, MA: Harvard University Press. Cell Phone Coverage on Political Violence in
Africa. American Political Science Review
Nevin, Rick. 2013. Lead and Crime: Why This
107(2): 207–224.
Correlation Does Mean Causation. January 26.
http://ricknevin.com/uploads/Lead_and_Crime_ Reinhart, Carmen M., and Kenneth S. Rogoff. 2010.
_Why_This_Correlation_Does_Mean_ Growth in a Time of Debt. American Economic
Causation.pdf Review: Papers & Proceedings 100(2): 573–578.
Noel, Hans. 2010. Ten Things Political Scientists Rice, John A. 2007. Mathematical Statistics and
Know that You Don’t. The Forum 8(3): article 12. Data Analysis, 3rd ed. Belmont, CA: Thomson.
Orwell, George. 1946. In Front of Your Nose. Roach, Michael A. 2013. Mean Reversion or a
Tribune. London (March 22). Breath of Fresh Air? The Effect of NFL
Osterholm, Michael T., Nicholas S. Kelley, Alfred Coaching Changes on Team Performance in the
Sommer, and Edward A. Belongia. 2012. Salary Cap Era. Applied Economics Letters
Efficacy and Effectiveness of Influenza Vaccines: 20(17): 1553–1556.
A Systematic Review and Meta-analysis. Lancet: Romer, Christina D. 2011. What Do We Know about
Infectious Diseases 12(1): 36–44. the Effects of Fiscal Policy? Separating Evidence
Palmer, Brian. 2013. I Wish I Was a Little Bit from Ideology. Talk at Hamilton College,
Shorter. Slate. July 30. http://www.slate.com/ November 7.
articles/health_and_science/science/2013/07/ Rossin-Slater, Maya, Christopher J. Ruhm, and Jane
height_and_longevity_the_research_is_clear_ Waldfogel. 2014. The Effects of California’s Paid
being_tall_is_hazardous_to_your.html Family Leave Program on Mothers’
Parker, Jonathan A., Nicholas S. Souleles, David Leave-Taking and Subsequent Labor Market
S. Johnson, and Robert McClelland. 2013. Outcomes. Journal of Policy Analysis and
Consumer Spending and the Economic Stimulus Management 32(2): 224–245.
BIBLIOGRAPHY 585
Scheve, Kenneth, and David Stasavage. 2012. Tam Cho, Wendy K., and James G. Gimpel. 2012.
Democracy, War, and Wealth: Lessons from Two Geographic Information Systems and the Spatial
Centuries of Inheritance Taxation. American Dimensions of American Politics. Annual Review
Political Science Review 106(1): 81–102. of Political Science 15: 443–460.
Schrodt, Phil. 2014. Seven Deadly Sins of Tufte, Edward R. 2001. The Visual Display of
Contemporary Quantitative Political Science. Quantitative Information, 2nd ed. Cheshire, CT:
Journal of Peace Research 51: 287–300. Graphics Press.
Schwabish, Jonathan A. 2004. An Economist’s Verzani, John. 2004. Using R for Introductory
Guide to Visualizing Data. Journal of Economic Statistics. London: Chapman and Hall.
Perspectives 28(1): 209–234.
Wawro, Greg. 2002. Estimating Dynamic Models in
Shen XiaoFeng, Yunping Li, ShiQin Xu, Nan Wang, Political Science. Political Analysis 10: 25–48.
Sheng Fan, Xiang Qin, Chunxiu Zhou and Philip
Wilson, Sven E., and Daniel M. Butler. 2007. A Lot
Hess. 2017. Epidural Analgesia During the
More to Do: The Sensitivity of Time-Series
Second Stage of Labor: A Randomized
Cross Section Analyses to Simple Alternative
Controlled Trial. Obstetrics & Gynecology
Specifications. Political Analysis 15: 101–123.
130(5): 1097–1103.
Sides, John, and Lynn Vavreck. 2013. The Gamble: Wooldridge, Jeffrey M. 2002. Econometric Analysis
Choice and Chance in the 2012 Presidential of Cross Section and Panel Data. Cambridge,
Election. Princeton, NJ: Princeton University MA: MIT Press.
Press. Wooldridge, Jeffrey M. 2009. Introductory
Snipes, Jeffrey B., and Edward R. Maguire. 1995. Econometrics, 4th ed. Mason, OH:
Country Music, Suicide, and Spuriousness. South-Western Cengage Learning.
Social Forces 74(1): 327–329. Wooldridge, Jeffrey M. 2013. Introductory
Solnick, Sara J., and David Hemenway. 2011. The Econometrics, 5th ed. Mason, OH:
“Twinkie Defense”: The Relationship between South-Western Cengage Learning.
Carbonated Non-diet Soft Drinks and Violence World Values Survey. 2008. Integrated EVS/WVS
Perpetration among Boston High School 1981–2008 Data File. http://www.world
Students. Injury Prevention. valuessurvey.org/
Sovey, Allison J., and Donald P. Green. 2011. Yau, Nathan. 2011. Visualize This: The Flowing
Instrumental Variables Estimation in Political Data Guide to Design, Visualization, and
Science: A Reader’s Guide. American Journal of Statistics. Hoboken, NJ: Wiley.
Political Science 55(1): 188–200.
Zakir Hossain, Mohammad. 2011. The Use of
Stack, Steven, and Jim Gundlach. 1992. The Effect Box-Cox Transformation Technique in Economic
of Country Music on Suicide. Social Forces and Statistical Analyses. Journal of Emerging
71(1): 211–218. Trends in Economics and Management Sciences
Staiger, Douglas, and James H. Stock. 1997. 2(1): 32–39.
Instrumental Variables Regressions with Weak Ziliak, Stephen, and Deirdre N. McCloskey. 2008.
Instruments. Econometrica 65(3): 557–586. The Cult of Statistical Significance: How the
Stock, James H, and Mark W. Watson. 2011. Standard Error Costs Us Jobs, Justice, and
Introduction to Econometrics, 3rd ed. Boston: Lives. Ann Arbor: University of Michigan
Addison-Wesley. Press.
GLOSSARY

χ2 distribution  A probability distribution that characterizes the distribution of squared standard normal random variables. Standard errors are distributed according to this distribution, which means that the χ2 plays a role in the t distribution. Also relevant for many statistical tests, including likelihood ratio tests for maximum likelihood estimations. 549

ABC issues  Three issues that every experiment needs to address: attrition, balance, and compliance. 334

adjusted R2  The R2 with a penalty for the number of variables included in the model. Widely reported, but rarely useful. 150

alternative hypothesis  An alternative hypothesis is what we accept if we reject the null hypothesis. It's not something that we are proving (given inherent statistical uncertainty), but it is the idea we hang onto if we reject the null. 94

AR(1) model  A model in which the errors are assumed to depend on their value from the previous period. 461

assignment variable  An assignment variable determines whether someone receives some treatment. People with values of the assignment variable above some cutoff receive the treatment; people with values of the assignment variable below the cutoff do not receive the treatment. 375

attenuation bias  A form of bias in which the estimated coefficient is closer to zero than it should be. Measurement error in the independent variable causes attenuation bias. 145

attrition  Occurs when people drop out of an experiment altogether such that we do not observe the dependent variable for them. 354

augmented Dickey-Fuller test  A test for a unit root in time series data that includes a time trend and lagged values of the change in the variable as independent variables. 481

autocorrelation  Errors are autocorrelated if the error in one time period is correlated with the error in the previous time period. One of the assumptions necessary to use the standard equation for variance of OLS estimates is that errors are not autocorrelated. Autocorrelation is common in time series data. 69

autoregressive process  A process in which the value of a variable depends directly on the value from the previous period. Autocorrelation is often modeled as an autoregressive process such that the error term is a function of previous error terms. A standard dynamic model is also autoregressive, as the dependent variable is modeled to depend on the lagged value of the dependent variable. 460

auxiliary regression  A regression that is not directly the one of interest but yields information helpful in analyzing the equation we really care about. 138

balance  Treatment and control groups are balanced if the distributions of control variables are the same for both groups. 336

bias  A biased coefficient estimate will systematically be higher or lower than the true value. 58

binned graphs  Used in regression discontinuity analysis. The assignment variable is divided into bins, and the average value of the dependent variable is plotted for each bin. The plots allow us to visualize a discontinuity at the treatment cutoff. Binned graphs also are useful to help us identify possible non-linearities in the relationship between the assignment variable and the dependent variable. 386

blocking  Picking treatment and control groups so that they are equal in covariates. 335

categorical variables  Variables that have two or more categories but do not have an intrinsic ordering. Also known as nominal variables. 179, 193

central limit theorem  The mean of a sufficiently large number of independent draws from any
distribution will be normally distributed. Because OLS estimates are weighted averages, the central limit theorem implies that β̂1 will be normally distributed. 56

ceteris paribus  All else being equal. A phrase used to describe multivariate regression results as a coefficient is said to account for change in the dependent variable with all other independent variables held constant. 131

codebook  A file that describes sources for variables and any adjustments made. A codebook is a necessary element of a replication file. 29

collider bias  Bias that occurs when a post-treatment variable creates a pathway for spurious effects to appear in our estimation. 238

compliance  The condition of subjects receiving the experimental treatment to which they were assigned. A compliance problem occurs when subjects assigned to an experimental treatment do not actually experience the treatment, often because they opt out in some way. 340

confidence interval  Defines the range of true values that are consistent with the observed coefficient estimate. Confidence intervals depend on the point estimate, β̂1, and the measure of uncertainty, se(β̂1). 117, 133

confidence levels  Term referring to confidence intervals and based on 1 − α. 117

consistency  A consistent estimator is one for which the distribution of the estimate gets closer and closer to the true value as the sample size increases. For example, the bivariate OLS estimate β̂1 consistently estimates β1 if X is uncorrelated with ε. 66

constant  The parameter β0 in a regression model. It is the point at which a regression line crosses the Y-axis. It is the expected value of the dependent variable when all independent variables equal 0. Also referred to as the intercept. 4

continuous variable  A variable that takes on any possible value over some range. Continuous variables are distinct from discrete variables, which can take on only a limited number of possible values. 54

control group  In an experiment, the group that does not receive the treatment of interest. 19

control variable  An independent variable included in a statistical model to control for some factor that is not the primary factor of interest. 134, 298

correlation  Measures the extent to which two variables are linearly related to each other. A correlation of 1 indicates the variables move together in a straight line. A correlation of 0 indicates the variables are not linearly related to each other. A correlation of −1 indicates the variables move in opposite directions. 9

critical value  In hypothesis testing, a value above which a β̂1 would be so unlikely that we reject the null. 101

cross-sectional data  Data having observations for multiple units for one time period. Each observation indicates the value of a variable for a given unit for the same point in time. Cross-sectional data is typically contrasted to panel and time series data. 459

cumulative distribution function  Indicates how much of a normal distribution is to the left of any given point. 418, 543

de-meaned approach  An approach to estimating fixed effects models for panel data involving subtracting average values within units from all variables. This approach saves us from having to include dummy variables for every unit and highlights the ability of fixed effects models to estimate parameters based on variation within units, not between them. 263

degrees of freedom  The sample size minus the number of parameters. It refers to the amount of information we have available to use in the estimation process. As a practical matter, degrees of freedom corrections produce more uncertainty for smaller sample sizes. The shape of a t distribution depends on the degrees of freedom. The higher the degrees of freedom, the more a t distribution looks like a normal distribution. 63, 100

dependent variable  The outcome of interest, usually denoted as Y. It is called the dependent variable because its value depends on the values of the independent variables, parameters, and error term. 2, 47

dichotomous  Divided into two parts. A dummy variable is an example of a dichotomous variable. 409
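The correlation entry maps directly onto the usual Pearson formula: the covariance of the two variables scaled by their standard deviations. As a minimal illustration (not from the book; the function name and data here are invented for this sketch), a few lines of Python reproduce the 1, −1, and in-between cases described above:

```python
from math import sqrt

def corr(x, y):
    """Pearson correlation: covariance of x and y scaled by their standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var_x = sum((xi - mx) ** 2 for xi in x)
    var_y = sum((yi - my) ** 2 for yi in y)
    return cov / sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
print(corr(x, [2, 4, 6, 8, 10]))  # exactly linear, moving together -> 1.0
print(corr(x, [9, 7, 5, 3, 1]))   # exactly linear, opposite directions -> -1.0
print(corr(x, [3, 1, 4, 1, 5]))   # weakly related -> strictly between -1 and 1
```

Because correlation measures only linear association, a strong nonlinear relationship can still produce a correlation near zero.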
difference of means test  A test that involves comparing the mean of Y for one group (e.g., the treatment group) against the mean of Y for another group (e.g., the control group). These tests can be conducted with bivariate and multivariate OLS and other statistical procedures. 180

difference-in-difference model  A model that looks at differences in changes in treated units compared to untreated units. These models are particularly useful in policy evaluation. 276

discontinuity  Occurs when the graph of a line makes a sudden jump up or down. 373

distribution  The range of possible values for a random variable and the associated relative probabilities for each value. Examples of four distributions are displayed in Figure 3.4. 54

dummy variable  A dummy variable equals either 0 or 1 for all observations. Dummy variables are sometimes referred to as dichotomous variables. 181

dyad  An entity that consists of two elements. 274

dynamic model  A time series model that includes a lagged dependent variable as an independent variable. Among other differences, the interpretation of coefficients differs in dynamic models from that in standard OLS models. Sometimes referred to as an autoregressive model. 460, 473

elasticity  The percent change in Y associated with a percent change in X. Elasticity is estimated with log-log models. 234

endogenous  An independent variable is endogenous if changes in it are related to other factors that influence the dependent variable. 8

error term  The term associated with unmeasured factors in a regression model, typically denoted as ε. 5

exclusion condition  For two-stage least squares, a condition that the instrument exert no direct effect in the second-stage equation. This condition cannot be tested empirically. 300

external validity  A research finding is externally valid when it applies beyond the context in which the analysis was conducted. 21

F distribution  A probability distribution that characterizes the distribution of a ratio of χ2 random variables. Used in tests involving multiple parameters, among other applications. 550

F statistic  The test statistic used in conducting an F test. Used in testing hypotheses about multiple coefficients, among other applications. 159

F test  A type of hypothesis test commonly used to test hypotheses involving multiple coefficients. 159

fitted value  A fitted value, Ŷi, is the value of Y predicted by our estimated equation. For a bivariate OLS model, it is Ŷi = β̂0 + β̂1 Xi. Also called predicted value. 48

fixed effect  A parameter associated with a specific unit in a panel data model. For a model Yit = β0 + β1 X1it + αi + νit, the αi parameter is the fixed effect for unit i. 261

fixed effects model  A model that controls for unit- and/or period-specific effects. These fixed effects capture differences in the dependent variable associated with each unit and/or period. Fixed effects models are used to analyze panel data and can control for both measurable and unmeasurable elements of the error term that are stable within unit. 261

fuzzy RD models  Regression discontinuity models in which the assignment variable imperfectly predicts treatment. 392

generalizable  A statistical result is generalizable if it applies to populations beyond the sample in the analysis. 21

generalized least squares (GLS)  An approach to estimating linear regression models that allows for correlation of errors. 467

goodness of fit  How well a model fits the data. 70

heteroscedastic  A random variable is heteroscedastic if the variance differs for some observations. Heteroscedasticity does not cause bias in OLS models
but does violate one of the assumptions necessary to use the standard equation for variance of OLS estimates. 68

heteroscedasticity-consistent standard errors  Standard errors for the coefficients in OLS that are appropriate even when errors are heteroscedastic. 68

homoscedastic  Describing a random variable having the same variance for all observations. Homoscedasticity is one of the assumptions necessary to use the standard equation for variance of OLS estimates. 68

hypothesis testing  A process assessing whether the observed data is or is not consistent with a claim of interest. The most widely used tools in hypothesis testing are t tests and F tests. 91

identified  A statistical model is identified on the basis of assumptions that allow us to estimate the model. 318

inclusion condition  For two-stage least squares, a condition that the instrument exert a meaningful effect in the first-stage equation in which the endogenous variable is the dependent variable. 300

independent variable  A variable that possibly influences the value of the dependent variable. It is usually denoted as X. It is called independent because its value is typically treated as independent of the value of the dependent variable. 2, 47

instrumental variable  Explains the endogenous independent variable of interest but does not directly explain the dependent variable. Two-stage least squares (2SLS) uses instrumental variables to produce unbiased estimates. 297

intention-to-treat (ITT) analysis  ITT analysis addresses potential endogeneity that arises in experiments owing to non-compliance. We compare the means of those assigned treatment and those not assigned treatment, irrespective of whether the subjects did or did not actually receive the treatment. 343

intercept  The parameter β0 in a regression model. It is the point at which a regression line crosses the Y-axis. It is the expected value of the dependent variable when all independent variables equal 0. Also referred to as the constant. 4, 47

internal validity  A research finding is internally valid when it is based on a process free from systematic error. Experimental results are often considered internally valid, but their external validity may be debatable. 21

irrelevant variable  A variable in a regression model that should not be in the model, meaning that its coefficient is zero. Including an irrelevant variable does not cause bias, but it does increase the variance of the estimates. 150

jitter  A process used in scatterplotting data. A small, random number is added to each observation for purposes of plotting only. This procedure produces cloudlike images, which overlap less than the unjittered data and therefore provide a better sense of the data. 74, 184

lagged variable  A variable with the values from the previous period. 461

latent variable  For a probit or logit model, an unobserved continuous variable reflecting the propensity of an individual observation of Yi to equal 1. 416

least squares dummy variable approach  An approach to estimating fixed effects models in the analysis of panel data. 262

likelihood ratio (LR) test  A statistical test for maximum likelihood models that is useful in testing hypotheses involving multiple coefficients. 436

linear probability model  Used when the dependent variable is dichotomous. This is an OLS model in which the coefficients are interpreted as the change in probability of observing Yi = 1 for a one-unit change in X. 410

linear-log model  A model in which the dependent variable is not logged but the independent variable is. In such a model, a one percent increase in X is associated with a β1/100 change in Y. 232

local average treatment effect  The causal effect for those people affected by the instrument only. Relevant if the effect of X on Y varies within the population. 324

log likelihood  The log of the probability of observing the Y outcomes we report, given the X data and the β̂'s. It is a by-product of the maximum likelihood estimation process. 425

log-linear model  A model in which the dependent variable is transformed by taking its natural log. A one-unit change in X in a log-linear model is associated with a β1 percent change in Y (on a 0-to-1 scale). 233

log-log model  A model in which the dependent variable and the independent variables are logged. 234
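The jitter entry describes a purely graphical transformation: a small random offset is added to each plotted point so that identical observations do not stack on top of one another, while the underlying data are left untouched. A short Python sketch (illustrative only; the function name, seed, and data are invented for this example) shows the idea:

```python
import random

def jitter(values, scale=0.1, seed=42):
    """Return a copy of values with a small uniform random offset added to
    each observation, for plotting purposes only (the data are unchanged)."""
    rng = random.Random(seed)  # fixed seed so the plot is reproducible
    return [v + rng.uniform(-scale, scale) for v in values]

# Survey-style data with heavy overlap: many identical x values.
years_of_school = [12, 12, 12, 16, 16, 16, 16, 18]
plotted = jitter(years_of_school)
print(plotted)  # each point moved by at most 0.1, so points no longer stack
```

Only the jittered copy is passed to the plotting routine; any statistics are still computed on the original, unjittered values.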
GLOSSARY 591
logit model A way to analyze data with a dichotomous dependent variable. The error term in a logit model is logistically distributed. Pronounced "low-jit". 418, 421

maximum likelihood estimation The estimation process used to generate coefficient estimates for probit and logit models, among others. 423, 549

measurement error Measurement error occurs when a variable is measured inaccurately. If the dependent variable has measurement error, OLS coefficient estimates are unbiased but less precise. If an independent variable has measurement error, OLS coefficient estimates suffer from attenuation bias, with the magnitude of the attenuation depending on how large the measurement error variance is relative to the variance of the variable. 143

mediator bias Bias that occurs when a post-treatment variable is added and absorbs some of the causal effect of the treatment variable. 237

model fishing Model fishing is a bad statistical practice that occurs when researchers add and subtract variables until they get the answers they were looking for. 243

model specification The process of specifying the equation for our model. 220

modeled randomness Variation attributable to inherent variation in the data-generation process. This source of randomness exists even when we observe data for an entire population. 54

monotonicity A condition invoked in discussions of instrumental variable models. Monotonicity requires that the effect of the instrument on the endogenous variable go in the same direction for everyone in a population. 324

multicollinearity Variables are multicollinear if they are correlated. The consequence of multicollinearity is that the variance of β̂1 will be higher than it would have been in the absence of multicollinearity. Multicollinearity does not cause bias. 148, 159

multivariate OLS OLS with multiple independent variables. 127

natural experiment Occurs when a researcher identifies a situation in which the values of the independent variable have been determined by a random, or at least exogenous, process. 334, 360

Newey-West standard errors Standard errors for the coefficients in OLS that are appropriate even when errors are autocorrelated. 467

normal distribution A bell-shaped probability density that characterizes the probability of observing outcomes for normally distributed random variables. Because of the central limit theorem, many statistical quantities are distributed normally. 55

null hypothesis A hypothesis of no effect. Statistical tests will reject or fail to reject such hypotheses. The most common null hypothesis is β1 = 0, written as H0: β1 = 0. 92

null result A finding in which the null hypothesis is not rejected. 113

observational studies Use data generated in an environment not controlled by a researcher. They are distinguished from experimental studies and are sometimes referred to as non-experimental studies. 21

omitted variable bias Bias that results from leaving out a variable that affects the dependent variable and is correlated with the independent variable. 138

one-sided alternative hypothesis An alternative to the null hypothesis that indicates whether the coefficient (or function of coefficients) is higher or lower than the value indicated in the null hypothesis. Typically written as HA: β1 > 0 or HA: β1 < 0. 94

one-way fixed effects model A panel data model that allows for fixed effects at the unit level. 271

ordinal variables Variables that express rank but not necessarily relative size. An ordinal variable, for example, is one indicating answers to a survey question that is coded 1 = strongly disagree, 2 = disagree, 3 = agree, 4 = strongly agree. 193

outliers Observations that are extremely different from those in the rest of the sample. 77

overidentification test A test used for two-stage least squares models having more than one instrument. The logic of the test is that the estimated coefficient on the endogenous variable in the second-stage equation should be roughly the same when each individual instrument is used alone. 309

p-hacking Occurs when a researcher changes the model until the p value on the coefficient of interest reaches a desired level. 243
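The attenuation bias described in the measurement error entry above can be seen in a short simulation. This is an illustrative sketch, not code from the book's Computing Corner; all variable names are invented for the example. With var(X) = var(noise) = 1, the slope is attenuated by the factor var(X)/(var(X) + var(noise)) = 0.5:

```python
import numpy as np

# Simulate attenuation bias from measurement error in an independent variable.
rng = np.random.default_rng(0)
n = 200_000
beta1 = 2.0

x = rng.normal(size=n)                      # true regressor, var(X) = 1
y = 1.0 + beta1 * x + rng.normal(size=n)    # true model

x_noisy = x + rng.normal(size=n)            # X measured with error, var(noise) = 1

def ols_slope(x, y):
    """Bivariate OLS slope: cov(X, Y) / var(X)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

b_true = ols_slope(x, y)         # close to the true value, 2.0
b_noisy = ols_slope(x_noisy, y)  # attenuated toward zero, close to 2.0 * 0.5 = 1.0
```

Note that the dependent variable is untouched: adding noise to Y instead would leave the slope estimate unbiased, consistent with the measurement error entry.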
p value The probability of observing a coefficient as extreme as we actually observed if the null hypothesis were true. 106

panel data Has observations for multiple units over time. Each observation indicates the value of a variable for a given unit at a given point in time. Panel data is typically contrasted to cross-sectional and time series data. 255

perfect multicollinearity Occurs when an independent variable is completely explained by a linear combination of the other independent variables. 149

plim A widely used abbreviation for probability limit, the value to which an estimator converges as the sample size gets very, very large. 66

point estimates Point estimates describe our best guess as to what the true value is. 117

polynomial model A model that includes values of X raised to powers greater than one. A polynomial model is an example of a non-linear model in which the effect of X on Y varies depending on the value of X. The fitted values will be defined by a curve. A quadratic model is an example of a polynomial model. 223, 226

pooled model Treats all observations as independent observations. Pooled models contrast with fixed effects models that control for unit-specific or time-specific fixed effects. 256

post-treatment variable A variable that is causally affected by an independent variable. 236

power The ability of our data to reject the null hypothesis. A high-powered statistical test will reject the null with a very high probability when the null is false; a low-powered statistical test will reject the null with a low probability when the null is false. 111

power curve Characterizes the probability of rejecting the null hypothesis for each possible value of the parameter. 111

predicted value The value of Y predicted by our estimated equation. For a bivariate OLS model, it is Ŷi = β̂0 + β̂1Xi. Also called fitted values. 48

probability density A graph or formula that describes the relative probability that a random variable is near a specified value. 55

probability density function A mathematical function that describes the relative probability for a continuous random variable to take on a given value. 541

probability distribution A graph or formula that gives the probability across the possible values of a random variable. 54

probability limit The value to which a distribution converges as the sample size gets very large. When the error is uncorrelated with the independent variables, the probability limit of β̂1 is β1 for OLS models. The probability limit of a consistent estimator is the true value of the parameter. 65, 145, 311

probit model A way to analyze data with a dichotomous dependent variable. The key assumption is that the error term is normally distributed. 418

quadratic model A model that includes X and X² as independent variables. The fitted values will be defined by a curve. A quadratic model is an example of a polynomial model. 223, 227

quasi-instrument An instrumental variable that is not strictly exogenous. Two-stage least squares with a quasi-instrument may produce a better estimate than OLS if the correlation of the quasi-instrument and the error in the main equation is small relative to the correlation of the quasi-instrument and the endogenous variable. 311

random effects model Treats unit-specific error as a random variable that is uncorrelated with the independent variable. 524

random variable A variable that takes on values in a range and with the probabilities defined by a distribution. 54

randomization The process of determining the experimental value of the key independent variable based on a random process. If successful, randomization will produce an independent variable that is uncorrelated with all other potential independent variables, including factors in the error term. 19

randomized controlled trial An experiment in which the treatment of interest is randomized. 19

reduced form equation In a reduced form equation, Y1 is only a function of the non-endogenous variables (which are the X and Z variables, not the Y variables). Used in simultaneous equation models. 317

reference category When a model includes dummy variables indicating the multiple categories of a nominal variable, we need to exclude a dummy variable for one of the groups, which we refer to as the reference category. The coefficients on all the included dummy variables indicate how much higher or lower the dependent variable is for each group relative to the reference category. Also referred to as the excluded category. 194

regression discontinuity (RD) analysis Techniques that use regression analysis to identify possible discontinuities at the point at which some treatment applies. 374

regression line The fitted line from a regression. 48

replication Research that meets a replication standard can be duplicated based on the information provided at the time of publication. 28

replication files Files that document how data is gathered and organized. When properly compiled, these files allow others to reproduce our results exactly. 28

residual The difference between the fitted value and the observed value. Graphically, it is the distance between an estimated line and an observation. Mathematically, a residual for a bivariate OLS model is ε̂i = Yi − β̂0 − β̂1Xi. An equivalent way to calculate a residual is ε̂i = Yi − Ŷi. 48

restricted model The model in an F test that imposes the restriction that the null hypothesis is true. If the fit of the restricted model is much worse than the fit of the unrestricted model, we infer that the null hypothesis is not true. 159

robust Statistical results are robust if they do not change when the model changes. 30, 130, 244, 534

rolling cross section data Repeated cross sections of data from different individuals at different points in time (e.g., an annual survey of U.S. citizens in which different citizens are chosen each year). 279

sampling randomness Variation in estimates that is seen in a subset of an entire population. If a given sample had a different selection of people, we would observe a different estimated coefficient. 53, 551

scatterplot A plot of data in which each observation is located at the coordinates defined by the independent and dependent variables. 3

selection model Simultaneously accounts for whether we observe the dependent variable and what the dependent variable is. Often used to deal with attrition problems in experiments. The most famous selection model is the Heckman selection model. 356

significance level For each hypothesis test, we set a significance level that determines how unlikely a result has to be under the null hypothesis for us to reject the null hypothesis. The significance level is the probability of committing a Type I error for a hypothesis test. 95

simultaneous equation model A model in which two variables simultaneously cause each other. 315

slope coefficient The coefficient on an independent variable. It reflects how much the dependent variable increases when the independent variable increases by one. In a plot of fitted values, the slope coefficient characterizes the slope of the fitted line. 4

spurious regression A regression that wrongly suggests X has an effect on Y. Can be caused by, for example, omitted variable bias and nonstationary data. 477

stable unit treatment value assumption The condition that an instrument has no spillover effect. This condition rules out the possibility that the value of an instrument going up by one unit will cause a neighbor to become more likely to change X as well. 324

standard deviation The standard deviation describes the spread of the data. For large samples, it is calculated as √(Σ(Xi − X̄)²/N). For probability distributions, the standard deviation refers to the width of the distribution. For example, we often refer to the standard deviation of the distribution as σ; it is the square root of the variance (which is σ²). To convert a normally distributed random variable into a standard normal variable, we subtract the mean and divide by the standard deviation of the distribution of the random variable. 26

standard error The square root of the variance. Commonly used to refer to the precision of a parameter estimate. The standard error of β̂1 from a bivariate OLS model is the square root of the variance of the estimate. It is √(σ̂²/(N × var(X))). The difference between standard errors and standard deviations can sometimes be confusing. The standard error of a parameter estimate is the standard deviation of the sampling distribution of the parameter estimate. For example, the standard deviation of the distribution of β̂1 is estimated by the standard error of β̂1. A good rule of thumb is to associate standard errors with parameter estimates and standard deviations with the spread of a variable or distribution, which may or may not be a distribution associated with a parameter estimate. 61

standard error of the regression A measure of how well the model fits the data. It is the square root of the variance of the regression. 71

standard normal distribution A normal distribution with a mean of zero and a variance (and standard deviation) of one. 543

standardize Standardizing a variable converts it to a measure of standard deviations from its mean. This is done by subtracting the mean of the variable from each observation and dividing the result by the standard deviation of the variable. 156

standardized coefficient The coefficient on an independent variable that has been standardized according to X1^Standardized = (X1 − X̄1)/sd(X1). A one-unit change in a standardized variable is a one-standard-deviation change no matter what the unit of X is (e.g., inches, dollars, years). Therefore, effects across variables can be compared because each β̂ represents the effect of a one-standard-deviation change in X on Y. 157

stationarity A time series term indicating that a variable has the same distribution throughout the entire time series. Statistical analysis of nonstationary variables can yield spurious regression results. 476

statistically significant A coefficient is statistically significant when we reject the null hypothesis that it is zero. In this case, the observed value of the coefficient is a sufficient number of standard deviations from the value posited in the null hypothesis to allow us to reject the null. 93

substantive significance If a reasonable change in the independent variable is associated with a meaningful change in the dependent variable, the effect is substantively significant. Some statistically significant effects are not substantively significant, especially for large data sets. 116

t distribution A distribution that looks like a normal distribution, but with fatter tails. The exact shape of the distribution depends on the degrees of freedom. This distribution converges to a normal distribution for large sample sizes. 99, 549

t statistic The test statistic used in a t test. It is equal to (β̂1 − βNull)/se(β̂1). If the t statistic is greater than our critical value, we reject the null hypothesis. 104

t test A test for hypotheses about a normal random variable with an estimated standard error. We compare |β̂1/se(β̂1)| to a critical value from a t distribution determined by the chosen significance level (α). For large sample sizes, a t test is closely approximated by a z test. 98

time series data Consists of observations for a single unit over time. Each observation indicates the value of a variable at a given point in time. The data proceed in order, indicating, for example, annual, monthly, or daily data. Time series data is typically contrasted to cross-sectional and panel data. 459

treatment group In an experiment, the group that receives the treatment of interest. 19

trimmed data set A set for which observations are removed in a way that offsets potential bias due to attrition. 355

two-sided alternative hypothesis An alternative to the null hypothesis that indicates the coefficient is not equal to 0 (or some other specified value). Typically written as HA: β1 ≠ 0. 94

two-stage least squares Uses exogenous variation in X to estimate the effect of X on Y. In the first stage, we estimate a model in which the endogenous independent variable is the dependent variable and the instrument, Z, is an independent variable. In the second stage, we estimate a model in which we use the fitted values from the first stage, X̂1i, as an independent variable. 295

two-way fixed effects model A panel data model that allows for fixed effects at the unit and time levels. 271

Type I error A hypothesis testing error that occurs when we reject a null hypothesis that is in fact true. 93

Type II error A hypothesis testing error that occurs when we fail to reject a null hypothesis that is in fact false. 93

unbiased estimator An estimator that produces estimates that are on average equal to the true value of the parameter of interest. 58

unit root A variable with a unit root has a coefficient equal to 1 on the lagged variable in an autoregressive model. A variable with a unit root is nonstationary and must be modeled differently than a stationary variable. 477

unrestricted model The model in an F test that imposes no restrictions on the coefficients. If the fit of the restricted model is much worse than the fit of the unrestricted model, we infer that the null hypothesis is not true. 159

variance A measure of how much a random variable varies. In graphical terms, the variance of a random variable characterizes how wide the distribution is. 61

variance inflation factor A measure of how much variance is inflated owing to multicollinearity. It can be estimated for each variable and is equal to 1/(1 − R²j), where R²j is from an auxiliary regression in which Xj is the dependent variable and all other independent variables from the main equation are included as independent variables. 148

variance of the regression The variance of the regression measures how well the model explains variation in the dependent variable. For large samples, it is estimated as σ̂² = Σ(Yi − Ŷi)²/N. 63

weak instrument An instrumental variable that adds little explanatory power to the first-stage regression in a two-stage least squares analysis. 312

window The range of observations we analyze in a regression discontinuity analysis. The smaller the window, the less we need to worry about non-linear functional forms. 386

z test A hypothesis test involving comparison of a test statistic and a critical value based on a normal distribution. 423
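The bivariate OLS formulas collected in the glossary (the slope, the variance of the regression, the standard error of β̂1, and the t statistic) fit together in a few lines. This is an illustrative sketch on simulated data, not code from the book; all names are invented for the example:

```python
import numpy as np

# Bivariate OLS by hand, following the glossary formulas.
rng = np.random.default_rng(1)
n = 1_000
x = rng.normal(size=n)
y = 0.5 + 1.5 * x + rng.normal(size=n)   # true slope is 1.5

beta1_hat = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # OLS slope
beta0_hat = y.mean() - beta1_hat * x.mean()          # OLS intercept

resid = y - beta0_hat - beta1_hat * x                # residuals: Yi - Yhat_i
sigma2_hat = (resid**2).sum() / n                    # variance of the regression
se_beta1 = np.sqrt(sigma2_hat / (n * np.var(x)))     # standard error of beta1
t_stat = beta1_hat / se_beta1                        # t statistic for H0: beta1 = 0
```

With a true slope of 1.5 and N = 1,000, the t statistic is far above any conventional critical value, so the null hypothesis β1 = 0 is rejected.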
INDEX
Entries with page numbers followed by t will be found in Tables, by an f in Figures and by an n in footnotes.
  panel data and, 521
  robustness and, 520
auxiliary regression, 138, 173n14
  for autocorrelation, 464–66
  independent variable and, 465–66n3
  for institutions and human rights, 154
averages
  central limit theorem and, 56
  de-meaned approach and, 263
  of dependent variables, 261
  of independent variables, 338
  of random variables, 56
  standard deviation and, 26n2
  for treatment group, 182
Baicker, Katherine, 119
Baiocchi, Michael, 305, 324
Baker, Regina, 302, 311n5
balance
  2SLS for, 366
  bivariate OLS and, 337
  checking for, 336–37
  for congressional members and donors, 454, 455t
  in control group, 335–40
  control variables and, 337–38
  in education and wages, 359, 360t
  foreign aid for poverty and, 338–40, 339t
  ITT for, 365, 366
  multivariate OLS and, 337
  in randomized experiments, 335–40
  R for, 366
  Stata for, 365–66
  in treatment group, 335–40
Bayesian Analysis for the Social Sciences (Jackman), 120
Beck, Nathaniel, 81, 523, 526
Berk, Richard, 354
Bertrand, Marianne, 368
bias. See also attenuation bias; omitted variable bias; unbiased estimator; unbiasedness
  2SLS and, 312
  attrition and, 355
  autocorrelation and, 69, 459, 464, 476
  in bivariate OLS, 58–61
  characterization of, 60–61
  collider, 238–43, 510–13
  from fixed effects, 268n6
  mediator, 237
  modeled randomness and, 59
  in multivariate OLS, 167
  random effects model and, 524
  sampling randomness and, 59
  weak instruments and, 313
binned graphs, RD and, 386–91, 388f, 393n1
bivariate OLS, 45–90
  balance and, 337
  bias in, 58–61
  causality and, 50–51n4
  central limit theorem for, 56–57
  coefficient estimates in, 46–50, 48n3, 53–59, 76–77, 97
  consistency in, 66–67, 66f, 66n16
  correlated errors in, 68
  d.f. in, 63, 63n13
  for difference of means test, 180–90
  distributions of, 54–56, 55f
  dummy independent variables in, 180–90, 182f
  equation for, 57n8
  exogeneity in, 57–61
  goodness of fit in, 70–77
  for height and wages, 74–77, 75f, 132, 132t, 133f
  homoscedasticity in, 68, 74, 75t, 80
  hypothesis testing and, 92
  normal distribution in, 55, 55f
  null hypothesis and, 97
  observational data for, 78, 127, 131, 198
  outliers in, 77–80
  plim in, 65, 65f
  precision in, 61–64
  for presidential elections, 46f, 50–51, 51f, 51t, 94–95, 95t, 96f
  probability density in, 55–56, 55f, 58f
  randomness of, 53–57
  random variables in, 53–57
  regression coefficient and, 50–51n4
  for retail sales and temperature, 130, 130t
  sample size and, 80
  sampling randomness in, 53
  standard error in, 61–63, 74–75
  standard error of the regression in, 71
  Stata for, 81–84
  t test for, 97–106
  unbiased estimator in, 58–60, 58f
  unbiasedness in, 57–61
  variance in, 50–51n4, 61–63, 62f, 63n14, 67
  variance of the regression in, 63
  for violent crime, 77–80, 77f, 78t, 79f
  for violent crime and ice cream, 60
Blackwell, Matthew, 238, 246
blocking, in randomized experiments, 335
Bloom, Howard, 398
Bound, John, 302, 311n5
Box, George, 534
Box-Cox tests, 245
Box-Steffensmeier, Janet, 444
Bradford-Hill, Austin, 537
Brambor, Thomas, 212
Braumoeller, Bear, 212
Broockman, David, 454
Brownlee, Shannon, 14–15, 21
Buddlemeyer, Hielke, 398
Bush, George W., 449–50
Butler, Daniel, 283
campaign contributions, for President Obama, 333
Campbell, Alec, 354
car accidents and hospitalization, 238–40
Card, David, 245, 374
Carpenter, Daniel, 398
Carrell, Scott, 374
categorical variables, 194n5
  to dummy independent variables, 193–202
  in R, 213–14
  regional wage differences and, 194–96, 195t, 197t
  in regression models, 193–94
  in Stata, 213
causality, 1–23
  bivariate OLS and, 50–51n4
  core model for, 2–7, 7f
  correlation and, 2, 2f
  with country music and suicide, 15–17
  data and, 1
  dependent variable and, 2–3, 12f
  donuts and weight and, 3–9, 3t
  endogeneity and, 7–18
  independent variable and, 2–3, 12f
  indicators of, 535–36
  observational data and, 25n1
  randomized experiments and, 18–22
  randomness and, 7–18
CDF. See cumulative distribution function
central limit theorem, for bivariate OLS, 56–57
ceteris paribus, 131
Chandra, Amitabh, 119
Chen, Xiao, 34
Cheng, Jing, 324
χ² (chi-squared) distribution. See χ² distribution
Ching, Andrew, 431–35
civil war. See economic growth and civil war
Clark, William, 212
Clarke, Kevin, 514
Cochrane-Orcutt model. See ρ-transformed model
codebooks
  for data, 29, 29t
  for height and wages, 29, 29t
coefficient estimates
  assignment variables and, 343
  attenuation bias and, 144
  bias in, 58
  in bivariate OLS, 46–50, 48n3, 53–59, 76–77
  exogeneity of, 57–59
  for logit model, 426–29, 434
  in multivariate OLS, 128, 133, 144, 146–47
  in OLS, 493–98
  outliers and, 79
  overidentification test and, 310
  for probit model, 426–29, 427f, 434
  random effects model and, 524
  random variables in, 53–57
  in simultaneous equation models, 318–19
  unbiasedness of, 57–59
  variance of, 146–47, 313–14
coefficients
  comparing, 155
  standardized, 155–58
cointegration, 487
collider bias, 238–43, 510–13
Columbia University National Center for Addiction and Substance Abuse, 136
commandname, in Stata, 34–35
comment lines
  in R, 37
  in Stata, 35
compliance. See also non-compliance
  in randomized experiments, 340–54
  in treatment group, 342, 348
confidence intervals
  autocorrelation and, 460
  equations for, 118–19, 119t
  in hypothesis testing, 117–19, 118f
  for interaction variables, 205
  for multivariate OLS, 133
  probability density and, 117, 118f
  sampling randomness and, 118n9
confint, in R, 122
congressional elections, RD for, 402–4, 403t
congressional members and donors
  balance for, 454, 455t
  LPM for, 454–55, 455t
  probit model for, 454–55, 455t
consistency
  in bivariate OLS, 66–67, 66f, 66n16
  causality and, 535–36
constant (intercept)
  in bivariate OLS, 47, 53
  fixed effects model and, 262
  in regression model, 4, 5f
continuous variables
  in bivariate OLS, 54
  dummy independent variables and, 191t, 203
  for trade and alliances, 274–75
control group
  attrition in, 354–55
  balance in, 335–40
  blocking for, 335–36
  ITT and, 343
  multivariate OLS and, 134, 134n1
  placebo to, 334n1
  in randomized experiments, 19
  treatment group and, 134, 134n1, 180, 334
  variables in, 337
control variables
  for 2SLS, 300
  balance and, 337–38
  multivariate OLS and, 134, 134n1
economic growth and civil war
  instrumental variable for, 327–29, 327t
  LPM for, 441–43, 442f
  probit model for, 441–43, 441f, 442f
economic growth and democracy, instrumental variables for, 331–32, 332t
economic growth and education, multivariate OLS for, 140–43, 141t, 142f
economic growth and elections, 45
economic growth and government debt, 24–26, 25f, 25n1
education. See also alcohol consumption and grades; crime and education; economic growth and education; law school admission
  in Afghanistan, 370–72, 371t
  vouchers for, non-compliance with, 341, 342n4
education and wages, 9, 359, 360t
  2SLS for, 301–3
Einav, Liran, 358
elasticity, 234
elections. See also presidential elections
  congressional elections, RD for, 402–4, 403t
  economic growth and, 45
  get-out-the-vote efforts, non-compliance for, 346–48, 347n8, 347t, 348t, 366–67, 367t
Ender, Philip, 34
endogeneity, 11
  attrition and, 354
  causality and, 7–18
  correlation and, 10
  for country music and suicide, 16–17
  for crime and police, 299
  data and, 24
  dependent variable and, 10
  in difference-in-difference models, 276–83
  in domestic violence in Minneapolis, 350–51
  fixed effects models and, 255–94
  flu shots and health and, 13–15, 14f
  Hausman test for, 301n2
  hypothesis testing and, 115
  independent variable and, 10
  instrumental variables and, 295–332
  multivariate OLS and, 129–37, 166
  non-compliance and, 340–41
  observational data and, 21, 127
  omitted variable bias and, 139
  overidentification test and, 310
  in panel data, 255–94
  pooled model and, 256–57
  RD and, 373–405
  simultaneous equation models and, 315–23
  unmeasured factors, 198
  for violent crime, 32
energy efficiency, dummy independent variables for, 207–10, 208f, 209t, 211f
Epple, Dennis, 320, 321
equations
  for 2SLS, 298, 299
  for AR(1) model, 463
  for attrition, 355
  for baseball players' salaries, 155
  for bivariate OLS, 50, 57n8
  for confidence interval, 118–19, 119t
  for core model, 5
  for country music and suicide, 15
  for de-meaned approach, 263, 264n3
  for difference-in-difference models, 277
  for difference of means test, 334, 336
  for fixed effect model, 261
  for flu shots and health, 13
  for F test, 166
  for heteroscedasticity-consistent standard errors, 68n18
  for independent and dependent variable relationship, 4
  for logit model, 421, 421n6
  for LR test, 436–37
  for multicollinearity, 147
  for omitted variable bias, 138, 502–4
  for polynomial models, 224–25, 225n4
  for power, 113n7
  for probit model, 420
  for p value, 108n5
  for quasi-instrumental variables, 310, 311n5
  for standard deviation, 26n3
  for simultaneous equation model, 316
  for two-way fixed effects model, 271
  for variance, 313–14
  for variance of standard error, 499–501
  for ρ-transformed model, 468–69
Erdem, Tülin, 431–35
errors. See also correlated errors; measurement error; standard error; Type I errors; Type II errors
  autocorrelated, 461–62
  autoregressive, 460–62, 461n2
  heteroscedasticity-consistent standard errors, 68–70, 68n18
  lagged, 461, 466, 466t
  MSE, 71
  random, 6, 417
  root mean squared error, in Stata, 71
  spherical, 81
errors (continued)
  standard error of the regression, 71, 83
error term
  for 2SLS, 299
  autocorrelation and, 460–62
  autoregressive error and, 460–62
  in bivariate OLS, 46, 47, 59–60, 198
  for country music and suicide, 16
  dependent variable and, 12f
  for donuts and weight, 9
  endogeneity and, 8, 198
  fixed effects models and, 262
  for flu shots and health, 13–14
  homoscedasticity of, 68
  independent variable and, 10, 12f, 16, 46, 59, 334, 465–66n3
  ITT and, 343
  in multivariate OLS, 137–39
  normal distribution of, 56n6
  observational data and, 198, 323
  in OLS, 525
  omitted variable bias and, 503
  quasi-instruments and, 310–13
  random effects model and, 524
  randomized experiments and, 337
  RD and, 377–79
  in regression model, 5–6
  for test scores, 260
  ρ-transformed model and, 469
EViews, 34
Excel, 34
excluded category, 194
exclusion condition
  for 2SLS, 300–301, 302f
  observational data and, 303
exogeneity, 9
  in bivariate OLS, 46, 57–61, 67
  of coefficient estimates, 57–59
  consistency and, 67
  correlation and, 10
  correlation errors and, 68–70
  for crime and police, 296–98
  independent variable and, 10
  in natural experiments, 362
  observational data and, 21, 182
  quasi-instrumental variables and, 310–12
  randomized experiments for, 18–19, 334
expected value, of random variables, 496–97
experiments. See randomized experiments
external validity, of randomized experiments, 21
Facebook, 333
false-negative results, 501
Fearon, James, 440–41
Feinstein, Brian, 398
Finkelstein, Amy, 358
fish market, instrumental variables for, 329–30, 329t
fitted lines
  independent variables and, 449
  latent variables and, 416–17
  logit model and, 434–35, 435f, 449
  for LPM, 411–13, 412f, 415f, 434–35, 435f
  probit model and, 423–25, 424f, 434–35, 435f, 449
  for RD, 385f, 387f
  for violent crime, 79f
fitted values
  for 2SLS, 299, 314, 348
  based on regression line, 50–51, 52f
  in bivariate OLS, 47, 53
  for difference-in-difference models, 278
  from logit model, 425
  for LPM, 412, 412n3
  for Manchester City soccer, 192f
  observations and, 428–29
  for presidential elections, 50, 52f
  from probit model, 423–25, 424f
  variance of, 314
fixed effects, 261, 268
  alternative hypothesis and, 268n5
  AR(1) model and, 521
  autocorrelation and, 519–20
  bias from, 268n6
  lagged dependent variables and, 520–23
  random effects model and, 524–25
fixed effects models
  constant and, 262
  for crime and police, 256–61, 297
  for difference-in-difference models, 255–83
  dyads and, 274–76, 275t
  endogeneity and, 255–94
  error term and, 262
  independent variable and, 268
  for instructor evaluation, 289–90, 290t
  LSDV and, 262–63, 263t
  multivariate OLS and, 262
  for panel data, 255–94
  for Peace Corps, 288–89, 289t
  for presidential elections, 288, 288t
  R for, 528–30
  Stata for, 285, 527–28
  for Texas school boards, 291–93, 292t
  for trade and alliances, 274–76, 275t
  two-way, 271–75
  for Winter Olympics, 530–32, 530t
flu shots and health, 21n9
  correlation with, 14–15
  endogeneity and, 13–15, 14f
foreign aid for poverty, balance and, 338–40, 339t
Franceze, Robert, 212
Freakonomics (Levitt), 296
frequency table
  for donuts and weight, 26–27, 26t, 27t
  in R, 38
F statistic, 159n10, 165n13
  defined, 159
  multiple instruments and, 312
F tests, 159–66
  and baseball salaries, 162–64
  defined, 159
  for multiple coefficients, 162, 436
  with multiple instruments, 309
  for null hypothesis, 162, 309
  OLS and, 436
  restricted model for, 160–62, 165t
  in Stata, 170
  t statistic and, 312n6
  unrestricted model for, 160–62, 165t
  using R2 values, 160–62
fuzzy RD models, 392
Galton, Francis, 45n2
Gaubatz, Kurt Taylor, 34
Gayer, Ted, 389, 400
GDP per capita. See life expectancy and GDP per capita
gender and wages
  assessing bias in, 242
  interaction variables for, 203–4, 204f
generalizability
  in randomized experiments, 21
  of RD, 394
generalized least squares, 467–68
generalized linear model (glm), in R, 447
Gerber, Alan, 347, 365, 366
Gertler, Paul, 339
get-out-the-vote efforts, non-compliance for, 346–48, 347n8, 347t, 348t, 366–67, 367t
glm. See generalized linear model
global education, 177–78, 177t
global warming, 227–30, 228f, 229t
  AR(1) model for, 471–73, 472f, 473t
  autocorrelation for, 471–73, 472f, 473t
  Dickey-Fuller test for, 483–84, 483t
  dynamic model for, 482–85, 483f, 485t
  LPM for, 450–53, 451t, 452f
  time series data for, 459
GLS. See generalized least squares
Goldberger, Arthur, 168
Golder, Matt, 212
gold standard, randomized experiments as, 18–22
Goldwater, Barry, 50
goodness of fit
  for 2SLS, 314
  in bivariate OLS, 70–77
  for MLE, 425
  in multivariate OLS, 149–50
  scatterplots for, 71–72, 72f, 74
  standard error of the regression and, 71
Gore, Al, 50
Gormley, William, Jr., 389, 400
Gosset, William Sealy, 99n1
governmental debt. See economic growth and government debt
Graddy, Kathryn, 330
grades. See alcohol consumption and grades
Green, Donald P., 274, 283, 325, 347, 365, 366
Greene, William, 487, 514
Grimmer, Justin, 398
Gundlach, Jim, 15
Hanmer, Michael, 443
Hanushek, Eric, 140, 141, 177
Harvey, Anna, 152–53
Hausman test, 268n6, 301n2
  random effects model and, 525
HDD. See heating degree-days
Head Start, RD for, 401–2, 404–5, 404t
health. See donuts and weight; flu shots and health
health and Medicare, 374, 375–76
health insurance, attrition and, 357–59, 358n11
heating degree-days (HDD), dummy independent variables for, 207–10, 208f, 209t, 211f
Heckman, James, 356
height and gender, difference of means test for, 187–90, 188f, 188t, 189f, 190t
height and wages
  bivariate OLS for, 74–77, 75f, 132, 132t, 133f
  codebooks for, 29, 29t
  and comparing effects of height measures, 164–66
  heteroscedasticity for, 75t
  homoscedasticity for, 75t
  hypothesis testing for, 123–24, 123t, 126
  logged variables for, 234–36, 235t
  multivariate OLS for, 131–34, 132t, 133f
  null hypothesis for, 92
  p value for, 107f
  scatterplot for, 75f
  t statistic for, 104–5, 104t
  two-sided alternative hypothesis for, 94
  variables for, 40, 40t
Herndon, Thomas, 24
Hersh, Eitan, 398
heteroscedasticity
  bivariate OLS and, 68, 75t, 80
  for height and wages, 75t
  LPM and, 414n4
  R and, 86
  weighted least squares and, 81
heteroscedasticity-consistent standard errors, 68–70, 68n18
604 INDEX
high-security prison and inmate aggression, 374
histograms
   for alcohol consumption and grades, 396f
   for RD, 393, 393f, 396f
Hoekstra, Mark, 374
homicide. See stand your ground laws and homicide
homoscedasticity
   in bivariate OLS, 68, 74, 75t, 80
   for height and wages, 74, 75t
hospitalization, car accidents and, 238–40
Howell, William, 341
Huber-White standard errors. See heteroscedasticity-consistent standard errors
human rights. See institutions and human rights
hypothesis testing, 91–126. See also alternative hypothesis; null hypothesis
   alternative hypothesis and, 94, 97, 105
   bivariate OLS and, 92
   confidence intervals in, 117–19, 118f
   critical value in, 101–4
   Dickey-Fuller test for, 480–81
   for dummy dependent variables, 434–43
   endogeneity and, 115
   for height and wages, 123–24, 123t, 126
   log likelihood for, 425, 436
   LR test for, 434–40
   MLE and, 423
   for multiple coefficients, 158–64, 171–72, 434–43
   power and, 109–11
   for presidential elections, 124–26
   p value and, 106–9, 107f
   R for, 122–23
   significance level and, 95–96, 105
   Stata for, 121–22
   statistically significant in, 93, 120
   substantive significance and, 115
   t test for, 97–106
   Type I errors and, 93, 93t
   Type II errors and, 93t

ice cream, violent crime and, 60
identification, simultaneous equation model and, 318
Imai, Kosuke, 365
Imbens, Guido, 325, 330, 398
inclusion condition, for 2SLS, 300, 302f
independent variables. See also dummy independent variables
   attenuation bias and, 144
   auxiliary regression and, 465–66n3
   averages of, 338
   in bivariate OLS, 46, 47, 59, 65f, 66n16
   causality and, 2–3, 12f
   consistency and, 66n16
   constant and, 4
   for country music and suicide, 16
   defined, 3
   as dichotomous variables, 181
   as dummy independent variables, 179–219
   dynamic models and, 476
   endogeneity and, 8, 10
   error term and, 10, 12f, 16, 46, 59, 334, 465–66n3
   exogeneity and, 9, 10
   fitted lines and, 449
   fixed effects methods and, 268
   for flu shots and health, 13
   instrumental variables and, 295–308
   logit model and, 430
   LPM and, 414
   measurement error in, 144–45
   multicollinearity and, 148
   multivariate OLS and, 127–28, 134, 144–45
   observed-value, discrete differences approach and, 429
   omitted variable bias and, 503, 508–10
   probability limits and, 65f
   probit model and, 430
   randomization of, 19, 334
   slope coefficient on, 4
   substantive significance and, 115
   for test scores, 260
   for trade and alliances, 274
   ρ-transformed model and, 469
inheritance tax, public policy and, 197–202
inmate aggression. See high-security prison and inmate aggression
institutions and human rights, multivariate OLS for, 152–55, 153t
instructor evaluation, fixed effects model for, 289–90, 290t
instrumental variables
   2SLS and, 295–308, 313
   for chicken market, 319–23
   for crime and education, 330–31, 331t
   for economic growth and civil war, 327–29, 327t
   for economic growth and democracy, 331–32, 332t
   endogeneity and, 295–332
   for fish market, 329–30, 329t
   for Medicaid enrollment, 295
   multiple instruments for, 309–10
   simultaneous equation models and, 315–23
   for television and public affairs, 328–29, 328t
   weak instruments for, 310–13
   for instrumental variables, 309–10
multiple variables
   difference of means tests and, 182
   in multivariate OLS, 128, 135, 167
   omitted variable bias with, 507–8
multivariate OLS, 127–77
   attenuation bias and, 144
   balance and, 337
   bias in, 167
   coefficient estimates in, 128, 133, 144, 146–47
   confidence interval for, 133
   control group and, 134, 134n1
   control variables in, 134, 134n1
   dependent variable and, 143–44
   dummy independent variables in, 190–93
   for economic growth and education, 140–43, 141t, 142f
   endogeneity and, 129–37, 166
   error term in, 137–39
   estimation process for, 134–36
   fixed effects models and, 262
   goodness of fit in, 149–50
   for height and wages, 131–34, 132t, 133f
   independent variables and, 127–28, 134, 144–45
   for institutions and human rights, 152–55, 153t
   irrelevant variables in, 150
   for judicial independence, 152–55, 153t
   measurement error in, 143–45
   multicollinearity in, 147–49, 154, 167
   multiple variables in, 128, 135, 167
   observational data for, 166
   omitted variable bias in, 137–39, 144, 154, 167
   precision in, 146–50
   R2 and, 149
   for retail sales and temperature, 127, 128f, 129–31, 129f, 130t
   R for, 170–71
   standard errors in, 133
   Stata for, 168
   variance in, 146–47
   for wealth and universal male suffrage, 200–201, 201t, 202f
Murnane, Richard, 300n1
Murray, Michael, 81, 324

_n, in Stata, 89n29
National Center for Addiction and Substance Abuse (Columbia University), 136
National Longitudinal Survey of Youth (NLSY), 40–41, 123
natural experiments, on crime and terror alerts, 360–62, 363t
natural logs, 230, 234n6
negative autocorrelation, 462, 462f
negative correlation, 9–10, 10f
neonatal intensive care unit (NICU), 2SLS for, 305–8, 306t, 307t
Nevin, Rick, 537
Newey, Whitney, 365
Newey-West standard errors, 467, 470, 489–90
NFL coaches, probit model for, 452t, 453–54
NICU. See neonatal intensive care unit
NLSY. See National Longitudinal Survey of Youth
nominal variables, 193
non-compliance
   2SLS for, 346–56
   for domestic violence in Minneapolis, 350–54, 353t, 354t
   with educational vouchers, 341, 342n4
   endogeneity and, 340–41
   for get-out-the-vote efforts, 346–48, 347n8, 347t, 348t, 366–67, 367t
   ITT and, 343–45
   schematic representation of, 341–43, 342f
   variables for, 348–49
non-linear models
   latent variables and, 416–17
   linear models and, 410n2
   OLS and, 220–21
normal distributions
   in bivariate OLS, 55, 55f
   CDF and, 418–21, 420f
   of error term, 56n6
   probit model and, 418, 419f
   t distribution and, 100, 100f
null hypothesis, 92–126
   alternative hypothesis and, 94, 97, 105
   augmented Dickey-Fuller test and, 481
   autocorrelation and, 460
   bivariate OLS coefficient estimates and, 97
   Dickey-Fuller test and, 480–81
   distributions for, 94, 96f
   F test for, 159–60, 309
   for height and athletics, 164–66
   log likelihood and, 436
   power and, 109–11, 336–37, 502
   for presidential elections, 94–95, 95t, 96f
   p value and, 106–9, 107f
   significance level and, 95–96, 105
   statistically significant and, 93
   t test for, 105
   Type I errors and, 93, 95, 97
   Type II errors and, 93, 95, 97
   types of, 105
null result, power and, 113
Obama, President Barack
   campaign contributions for, 333
ObamaCare, 19
   simultaneous equation models for, 316
observational data
   for 2SLS, 323, 346, 349, 350
   for bivariate OLS, 78, 127, 131, 198
   causality and, 25n1
   for crime and terror alerts, 362
   difference of means test for, 182
   dummy independent variables and, 182
   for education and wages, 301
   endogeneity and, 21, 127
   error term and, 198, 323
   exclusion condition and, 303
   exogeneity and, 21, 182
   and fitted values, 428–29
   latent variables and, 414–17
   messiness of, 24
   for multivariate OLS, 166
   in natural experiments, 362
   for NICU, 305
   RD and, 375
observed-value, discrete differences approach
   dummy dependent variables and, 429, 443
   independent variable and, 429
   for probit model, 430–31
   Stata for, 444–47
OLS. See ordinary least squares
omitted variable bias
   anticipating sign of, 505–6, 506t
   for institutions and human rights, 154
   from measurement error, 508–10
   with multiple variables, 507–8
   in multivariate OLS, 137–39, 144, 154, 167
   in OLS, 502–14
one-sided alternative hypothesis, 94
   critical value and, 101–3, 102f
one-way fixed effect models, 271
orcutt, 490
ordinal variables, 193, 194n5
ordinary least squares (OLS). See also bivariate OLS; multivariate OLS
   2SLS and, 298, 301n2
   advanced, 493–512
   autocorrelation and, 460, 464, 466, 466t, 519
   autocorrelation for, 459
   balance and, 336
   coefficient estimates in, 493–98
   for crime and police, 256–61, 257t, 258f, 259f
   for dichotomous variables, 409
   for difference-in-difference models, 277–79, 278f
   difference of means test and, 334, 336
   for domestic violence in Minneapolis, 352–53, 353n10
   dynamic models and, 474–75
   error term in, 525
   F test and, 436
   Hausman test for, 301n2
   lagged dependent variables in, 519–24
   logged variables in, 230–36
   LPM and, 410, 414
   LSDV and, 262–63, 263t
   MLE and, 423
   model specification and, 220
   for multiple coefficients, 436
   omitted variable bias in, 502–14
   for panel data, 284
   polynomial models and, 224
   probit model and, 418
   quadratic models and, 226
   quantifying relationships between variables with, 46
   quasi-instruments and, 311
   R for, 515
   se for, 499–501
   Stata for, 170–72, 514
   for television and public affairs, 368
   unbiased estimator and, 493–98
   variance for, 314, 499–501
   for Winter Olympics, 515–16
Orwell, George, 533
outcome variables
   for Medicaid, 295
   RD and, 384
outliers
   in bivariate OLS, 77–80
   coefficient estimates and, 80
   sample size and, 80
   scatterplots for, 80
overidentification test, 2SLS and, 309–10

panel data
   advanced, 518–32
   AR(1) model and, 521
   with correlated errors, 518–20
   difference-in-difference models for, 279–81, 280t
   endogeneity in, 255–94
   fixed effects models for, 255–94
   lagged dependent variable and, 520–24
   OLS for, 284
   random effects model and, 524–25
parent in jail, effect of, 242
Park, David, 487
Pasteur, Louis, 91
Peace Corps, fixed effects model for, 288–89, 289t
perfect multicollinearity, 149
Persico, Nicola, 40, 74, 123
Pesaran, Hashem, 487
Peterson, Paul E., 341
p-hacking, 243–45
Philips, Andrew, 487
Phillips, Deborah, 389, 400
Pickup, Mark, 487
Pischke, Jörn-Steffen, 325
placebo, to control group, 334n1
plausibility, causality and, 536
plim. See probability limits (plim)
point estimate, 117
police. See crime and police
Pollin, Robert, 24
polynomial models, 221–30
   dichotomous variables and, 410n2
   equations for, 224–25, 225n4
   for life expectancy and GDP per capita, 222–26, 223f, 224f
   OLS and, 224
   for RD, 383–84, 383f, 387f
pooled model
   for crime and police, 256–61, 257t, 258f, 259f
   two-way fixed effects model and, 272
positive autocorrelation, 462, 462f
positive correlation, 9–10, 10f
Postlewaite, Andrew, 40, 74, 123
post-treatment variables, 236–43
   collider bias with, 510–13
   defined, 236
pound sign (#), in R, 37
poverty. See foreign aid for poverty
power
   balance and, 336–37
   calculating, 501–2
   equations for, 113n7
   hypothesis testing and, 109–11
   null hypothesis and, 336–37, 502
   null result and, 113
   and standard error, 113
   Type II errors and, 109–11, 110f, 501–2
power curve, 111–13, 112f
   R for, 123
Prais-Winsten model. See ρ-transformed model
precision
   in 2SLS, 313–15
   in bivariate OLS, 61–64
   in multivariate OLS, 146–50
predict, in Stata, 83
predicted values
   in bivariate OLS, 47
   bivariate OLS for, 46f, 50–51, 51f, 51t, 94–95, 95t, 96f
   fixed effects model for, 288, 288t
   hypothesis testing for, 124–26
   null hypothesis for, 94–95, 95t, 96f
   for presidential elections, 50
   variables for, 87t
presidential elections
   bivariate OLS for, 46f, 50–51, 51f, 51t, 94–95, 95t, 96f
   fitted values for, 50, 52f
   fixed effects models for, 288, 288t
   hypothesis testing for, 124–26
   null hypothesis for, 94–95, 95t, 96f
   predicted values for, 45, 50
   residuals for, 50
   scatterplots for, 45, 46f
   variables for, 87t
prison. See high-security prison and inmate aggression
probability, of Type II error, 111n7
probability density
   in bivariate OLS, 55–56, 55f, 58f
   confidence interval and, 117, 118f
   critical value and, 102f
   for null hypothesis, 95
   p value and, 107f
probability distribution, in bivariate OLS, 54, 55f
probability limits (plim), in bivariate OLS, 65, 65f
probit model
   coefficient estimates for, 426–29, 427f
   for congressional members and donors, 454–55, 455t
   dependent variables in, 443
   for dummy dependent variables, 418–21, 423–25, 424f
   for economic growth and civil war, 441–43, 441f, 442f
   equation for, 420
   fitted lines and, 423–25, 424f, 434–35, 435f, 449
   fitted values from, 423–25, 424f
   independent variables and, 430
   for Iraq War and President Bush, 449–50, 449t
   ketchup econometrics, 431–34, 434t, 435f
   for law school admission, 415f, 427–28, 427f
   LR test and, 438t, 439–40
   for NFL coaches, 452t, 453–54
   normal distribution and, 418, 419f
   observed-value, discrete differences approach for, 430–31
   R for, 446–49
   Stata for, 444–47
Progresa experiment, in Mexico, 338–40, 339t
public affairs. See television and public affairs
p-value
   hypothesis testing and, 106–9, 107f
   for LR test, 446–47
   in Stata, 446–47

quadratic models, 221–30
   fitted curves for, 225f
   for global warming, 227–30, 228f, 229t
   OLS and, 226
   R for, 246
   Stata for, 246
quarter of birth, 2SLS for, 301–3
quasi-instrumental variables
   equation for, 310, 311n5
   exogeneity and, 310–12

R (software), 33, 36–39, 39n8
   for 2SLS, 326
R (software) (continued)
   AER package for, 85–86, 326
   for autocorrelation, 488–90
   for balance, 366
   data frames in, 286–87
   for dummy variables, 213–14
   for fixed effects models, 528–30
   for hypothesis testing, 122–23
   installing packages, 86
   for logit model, 446–49
   for LSDV, 286–87
   for multivariate OLS, 170–71
   for Newey-West standard errors, 489–90
   for OLS, 515
   for probit model, 446–49
   for quadratic models, 246
   residual standard error in, 71, 85
   sample limiting with, 38–39
   for scatterplots, 400
   variables in, 37–38, 38n7
R2
   for 2SLS, 314
   adjusted, 150
   F tests using, 160–62
   goodness of fit and, 71–72, 74
   multiple, 85
   multivariate OLS and, 149
racial discrimination. See job resumes and racial discrimination
RAND, 358
random effects model, panel data and, 524–25
random error, 6
   latent variables and, 417
randomization
   of independent variable, 19, 334
   in Progresa experiment, 339
randomized experiments, 333–34
   2SLS for, 308, 308t
   ABC issues in, 334, 334n2
   attrition in, 354–59
   balance in, 335–40
   blocking in, 335
   causality and, 18–22
   compliance in, 340–54
   for congressional members and donors, 454–55, 455t
   control group in, 19
   discontinuity in, 373–74
   error term and, 337
   for exogeneity, 18–19, 334
   external validity of, 21
   for flu shots and health, 13–15, 14f, 21n9
   generalizability of, 21
   as gold standard, 18–22
   internal validity of, 21
   for job resumes and racial discrimination, 368–70, 369t
   RD for, 395–97, 396f, 397t
   for television and public affairs, 328–29, 328t, 366–68
   treatment group in, 19, 334
randomness. See also modeled randomness; sampling randomness
   of bivariate OLS estimates, 53–57
   causality and, 7–18
random variables
   averages of, 56
   in bivariate OLS, 46, 53–57, 54
   central limit theorem and, 56
   χ2 distribution and, 99n1
   in coefficient estimates, 53–57
   expected value of, 496–97
   probability density for, 55–56, 55f
   probit model and, 418
random walks. See unit roots
RD. See regression discontinuity
reduced form equation, 317
reference category, 194
reg, in Stata, 325
regional wage differences, categorical variables and, 194–96, 195t, 197t
regression coefficient, bivariate OLS and, 50–51n4
regression discontinuity (RD)
   for alcohol consumption and grades, 395–97, 396f, 397t
   assignment variable in, 375–76, 384, 391–95, 393n1
   basic model for, 375–80
   binned graphs and, 386–91, 388f, 393n1
   χ2 distribution for, 394
   for congressional elections, 402–4, 403t
   covariates in, 395
   dependent variable in, 395
   diagnostics for, 393–97
   discontinuous error distribution at threshold in, 392
   endogeneity and, 373–405
   error term and, 377–79
   fitted lines for, 385f, 387f
   flexible models for, 381–84
   fuzzy RD models, 392
   generalizability of, 394
   for Head Start, 401–2, 404–5, 404t
   histograms for, 393, 393f, 396f
   LATE and, 394
   limitations of, 391–97
   Medicare and, 374–76
   outcome variables and, 384
   polynomial models for, 383–84, 383f, 387f
   scatterplots for, 376–77, 377f, 378f, 400
   slope and, 381, 381f
   treatment group and, 376
   for universal prekindergarten, 389–90, 389f, 390t, 400–402, 401t
   windows and, 386–91, 387f
regression line
   in bivariate OLS, 47
   fitted values based on, 50–51
   scatterplot with, 85
regression models
   categorical variables in, 193–94
   for chicken market, 319–23
   constant in, 4, 5f
   error term in, 5–6
regression to the mean, 45n2
Reinhart, Carmen, 24–25
stable unit treatment value assumption (SUTVA), 324
Stack, Steven, 15
Staiger, Douglas, 312n6
standard deviation (SD)
   averages and, 26n2
   with data, 26
   equation for, 26n3
   se and, 61
standard error of the regression
   in bivariate OLS, 71
   in Stata, 83
standard error (se)
   for 2SLS, 300, 313
   autocorrelation and, 464
   in bivariate OLS, 61, 74
   fixed effects and, 268n5
   for height and wages, 74, 133
   heteroscedasticity-consistent standard errors, 68–70, 68n18
   for interaction variables, 205
   multicollinearity and, 149–50
   in multivariate OLS, 133
   Newey-West, 467
   for null hypothesis, 95
   for OLS, 499–501
   and power, 113
   in R, 86
   and sample size, 113–14
   substantive significance and, 115
   t tests and, 98
   variance of, 499–501
standardization, of variables, 156
standardized coefficients, 155–58
standardized regression coefficients
   in Stata, 169–71
stand your ground laws and homicide, 276–77, 280–81, 280t
Stasavage, David, 197, 200
Stata, 34–36
   for 2SLS, 325
   for autocorrelation, 488–90
   for balance, 365–66
   for bivariate OLS, 81–84
   for categorical variables, 213
   critical value in, 121, 170
   dfbeta in, 79n24
   for dummy variables, 212–13
   for fixed effects models, 285, 527–28
   F test in, 170
   for hypothesis testing, 121–22
   interaction variables in, 212
   ivregress in, 325
   jitter in, 83n25, 173n14
   limit sample in, 176n15
   linear-log model in, 246
   logit model in, 446–47
   LR test in, 446
   for LSDV, 284–85
   marginal-effects approach and, 451–52
   multicollinearity in, 169
   for multivariate OLS, 168
   _n in, 89n29
   for observed-value, discrete differences approach, 444–47
   for OLS, 170, 514
   for probit model, 444–47
   for quadratic models, 246
   reg in, 325
   robust in, 83, 168, 212
   root mean squared error in, 71
   scalar variables in, 405n3
   scatterplots in, 399
   standard error of the regression in, 83
   for standardized regression coefficients, 169–71
   test in, 446–47
   ttail in, 121n10
   twoway in, 83
   VIF in, 169
stationarity, 485n12
   augmented Dickey-Fuller test for, 482
   Dickey-Fuller test for, 482
   global warming and, 482–85, 483f, 485t
   time series data and, 476–82
   unit roots and, 477–81, 479f, 480f
statistically significant
   balance and, 336
   in hypothesis testing, 93, 120
statistical realism, 533–37
statistical software, 32–33
Stock, James, 312n6, 487
strength, causality and, 535
Stuart, Elizabeth, 365
substantive significance, hypothesis testing and, 115
suicide. See country music and suicide
summarize, in Stata, 34–35
supply equation, 320–22
SUTVA. See stable unit treatment value assumption
Swirl, 34
syntax files
   in R, 37
   in Stata, 35

Tabarrok, Alexander, 362
t distribution, 99–100, 99n1
   critical value for, 101, 103t
   d.f. and, 103
   inverse t function and, 121n9
   MLE and, 423
   normal distribution and, 100, 100f
teacher salaries. See education and wages
Tekin, Erdal, 280
television and public affairs, 367–68
   instrumental variables for, 328–29, 328t
temperature. See global warming; retail sales and temperature
terror alerts. See crime and terror alerts
test, in Stata, 446–47
test scores. See education and wages
Texas school boards, fixed effects model for, 291–93, 292t
time series data, 459–92
   autocorrelation in, 460–63
   correlated errors in, 68–70
   dependent variable and, 460
   dynamic models for, 473–76
   for global warming, 459
   stationarity and, 476–82
Torres, Michelle, 246
trade and alliances, fixed effects model for, 274–76, 275t
treatment group, 335
   2SLS for, 329
   attrition in, 354–55
   averages for, 182
   balance in, 335–40
   blocking for, 335–36
   compliance in, 342, 348
   control group and, 134, 134n1, 180, 334
   difference-in-difference models and, 285
   difference of means test for, 181, 186f
   dummy independent variables and, 181, 182f
   ITT and, 343
   in randomized experiments, 19, 334
   RD and, 376
   SUTVA and, 324
   variables in, 337
trimmed data set, attrition and, 355–56
Trump, President Donald, 1, 45, 183–85
TSTAT, 121–22, 121n10
t statistic
   critical value and, 104
   for economic growth and education, 143
   F test and, 312n6
   for height and wages, 104–5, 104t
   p value and, 108
ttail, in Stata, 121n10
t tests, 99n1
   for bivariate OLS, 97–106
   critical value for, 101
   for hypothesis testing, 97–106
   MLE and, 423
   for null hypothesis, 105
   se and, 98
Tufte, Edward, 34
two-sided alternative hypothesis, 94
   critical value and, 101–3, 102f
two-stage least squares (2SLS), 300n1
   for alcohol consumption and grades, 308, 308t
   assignment variable and, 348
   for balance, 366
   bias and, 312
   for crime and police, 296–98, 297t
   for domestic violence in Minneapolis, 352–53
   for education and wages, 301–3
   exclusion condition for, 300–301, 302f
   fitted value for, 299, 314, 348
   goodness of fit for, 314
   Hausman test for, 301n2
   inclusion condition for, 300, 302f
   instrumental variables and, 295–308, 313
   LATE with, 324
   with multiple instruments, 309
   for NICU, 305–8, 306t, 307t
   for non-compliance, 346–56
   observational data for, 323, 346, 349, 350
   OLS and, 298, 301n2
   overidentification test and, 309–10
   precision of, 313–15
   for quarter of birth, 301–3
   R2 for, 314
   R for, 326
   se for, 300, 313
   for simultaneous equation model, 317–18
   Stata for, 325
   for television and public affairs, 368
   for treatment group, 329
   variables in, 348–49
   variance of, 313–14
twoway, in Stata, 83
two-way fixed effects models, 271–75
Type I errors
   hypothesis testing and, 93, 93t
   null hypothesis and, 95, 97
   significance level and, 95–96
Type II errors
   hypothesis testing and, 93t
   null hypothesis and, 93, 95, 97
   power and, 109–11, 110f, 501–2
   probability of, 111n7
   significance level and, 95–96

unbiased estimator
   in bivariate OLS, 58–60, 58f
   correlation of, 61
   distributions of, 61
   ITT and, 344
   OLS and, 493–98
unbiasedness
   in bivariate OLS, 57–61
   of coefficient estimates, 57–59
Uncontrolled (Manzi), 21
unit roots
   augmented Dickey-Fuller test for, 481
   Dickey-Fuller test for, 480–81
   lagged dependent variable and, 477
   stationarity and, 477–81, 479f, 480f
universal prekindergarten, RD for, 389–90, 389f, 390t, 400–402, 401t
unrestricted model
   defined, 159
   for LR test, 439–40

variables
   in 2SLS, 348–49