Project 1 Instructions
Gregory Bruich
Spring 2019 Department of Economics, Harvard University
Empirical Project 1
Stories from the Atlas: Describing Data using Maps, Regressions, and Correlations
Posted on Thursday, February 7, 2019
Due at midnight on Thursday, February 21, 2019
The Opportunity Atlas was publicly released on October 1, 2018, and an accompanying article
appeared on the front page of the New York Times. The Opportunity Atlas is a freely available
interactive mapping tool that traces the roots of outcomes such as poverty and incarceration back
to the neighborhoods in which children grew up.
Policymakers, journalists, and the public have begun to explore the Opportunity Atlas, casting
new light on the geography of upward mobility in communities across the country. As an example,
see Jasmine Garsd’s recent analysis for the New York City neighborhood of Brownsville in
Brooklyn.
In this first empirical project, you will use the Opportunity Atlas mapping tool and the underlying
data to describe equality of opportunity in your hometown and across the United States. (If you
grew up outside the United States, you may select a community in which you have spent some
time, such as Boston, MA.)
The end product will be a 4-6 page narrative (or story) in which you describe what you have learned
from the Atlas. The next page lists specific analyses and questions that your narrative must
address. The narrative should be double spaced and include references, graphs, and maps.
This project focuses on the following methods for descriptive data analysis. (The later empirical
projects in this class will focus on causal inference and prediction.)
1. Data visualization. Maps are a powerful way to present descriptive statistics for data with
a geographic component. You will use maps to display upward mobility statistics for the
Census tracts in your hometown.
2. Regression and correlation analysis. You will use linear regressions and correlation
coefficients to quantify the statistical relationship between upward mobility and potential
explanatory variables.
The Stata data file that you will use in this assignment, atlas.dta, contains an extract of the
Opportunity Atlas data. I have also merged on several other variables, which you may use for the
correlational analysis.
We will invite 5-10 students who produce the most compelling and insightful stories/analyses to
discuss them with Professor Chetty and his team members at a lunch hosted at Opportunity
Insights.
Instructions
Please submit your Empirical Project on Canvas. Your submission should include three files:
1. A 4-6 page narrative as a Word or PDF document (double spaced and including references,
graphs, maps, and tables)
2. A do-file with your Stata code or an .R script file with your R code
3. A log file of your Stata or R output
1. Start by looking up the city where you grew up on the Opportunity Atlas. Zoom in to the
Census tracts around your home.
Figure 1 in your narrative should be a map of the Census tracts in your hometown from the
Opportunity Atlas. Examples for Milwaukee, WI (where Professor Chetty grew up) and
Los Angeles, CA (discussed in Lecture 1) are shown on the next page. The text of your
narrative should describe what you see, and what data are being visualized.
Examine the patterns for a number of different groups (e.g., lowest income children, high
income children) and outcomes (e.g., earnings in adulthood, incarceration rates). Only
choose one or two of these to include in your narrative.
2. (To answer this question, read the Opportunity Atlas manuscript) What period do the data
you are analyzing come from? Are you concerned that the neighborhoods you are
studying may have changed for kids now growing up there? What evidence do Chetty et
al. (2018) provide suggesting that such changes are or are not important? What type of
data could you use to test whether your neighborhood has changed in recent years?
3. Now turn to the atlas.dta data set. How does average upward mobility, pooling races and
genders, for children with parents at the 25th percentile (kfr_pooled_p25) in your
home Census tract compare to mean (population-weighted, using count_pooled)
upward mobility in your state and in the U.S. overall? Do kids where you grew up have
better or worse chances of climbing the income ladder than the average child in America?
Hint: The Opportunity Atlas website will give you the tract, county, and state FIPS codes
for your home address. For example, searching for “Lynwood Road, Verona, New
Jersey” will display Tract 34013021000, Verona, NJ. The first two digits refer to the
state code, the next three digits refer to the county code, and the last six digits refer to the
tract code. In Stata, listing this observation can be done as follows:
list kfr_pooled_p25 if state == 34 & county == 013 & tract == 021000
5. Now let’s turn to downward mobility: repeat questions (3) and (4) looking at children
who start with parents at the 75th and 100th percentiles. How do the patterns differ?
6. Using a linear regression, estimate the relationship between outcomes of children at the
25th and 75th percentile for the Census tracts in your home county. Generate a scatter
plot to visualize this regression. Do areas where children from low-income families do
well generally have better outcomes for those from high-income families, too?
7. Next, examine whether the patterns you have looked at above are similar by race. If there
is not enough racial heterogeneity in the area of interest (i.e., data are missing for most
racial groups), then choose a different area to examine.
8. Using the Census tracts in your home county, can you identify any covariates which help
explain some of the patterns you have identified above? Some examples of covariates
you might examine include housing prices, income inequality, fraction of children with
single parents, job density, etc. For 2 or 3 of these, report estimated correlation
coefficients along with their 95% confidence intervals.
9. Open question: formulate a hypothesis for why you see the variation in upward mobility
for children who grew up in the Census tracts near your home and provide correlational
evidence testing that hypothesis.
For this question, many covariates have been provided to you in the atlas.dta file, which
are described under the “Characteristics of Census tracts” header in Table 1.
You are welcome to use outside data that are not included in atlas.dta, but this is not
required. Diane Sredl has created a research guide for our class that contains links to
other data sources. You may wish to read this tutorial on how to add variables to a data
set in Stata.
10. Putting together all the analyses you did above, what have you learned about the
determinants of economic opportunity where you grew up? Identify one or two key
lessons or takeaways that you might discuss with a policymaker or journalist if asked
about your hometown. Mention any important caveats to your conclusions; for example,
can we conclude that the variable you identified as a key predictor in the question above
has a causal effect (i.e., changing it would change upward mobility) based on that
analysis? Why or why not?
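For question 3, the comparison can also be set up in R. The following is a minimal sketch, not a required approach. It assumes that state, county, and tract are stored as numbers (as the Stata example in the hint suggests), and it uses a small made-up data frame, atlas_toy, in place of atlas.dta so that it runs on its own. The helper names split_fips and wmean are hypothetical.

```r
# Toy stand-in for atlas.dta: all values below are made up for illustration.
atlas_toy <- data.frame(
  state          = c(34, 34, 34, 55, 55),
  county         = c(13, 13, 3, 79, 79),
  tract          = c(21000, 100, 200, 100, 200),
  kfr_pooled_p25 = c(40000, 30000, 36000, 25000, 33000),
  count_pooled   = c(1000, 3000, 2000, 1500, 2500)
)

# Split an 11-digit tract FIPS code: 2 digits state, 3 county, 6 tract.
# The code is passed as a character string so leading zeros are not lost.
split_fips <- function(fips) {
  stopifnot(is.character(fips), nchar(fips) == 11)
  list(
    state  = as.numeric(substr(fips, 1, 2)),
    county = as.numeric(substr(fips, 3, 5)),
    tract  = as.numeric(substr(fips, 6, 11))
  )
}

home <- split_fips("34013021000")  # Verona, NJ example from the hint

# Row for the home tract
home_row <- subset(
  atlas_toy,
  state == home$state & county == home$county & tract == home$tract
)

# Population-weighted mean that skips rows with a missing outcome or weight
wmean <- function(d) {
  ok <- !is.na(d$kfr_pooled_p25) & !is.na(d$count_pooled)
  weighted.mean(d$kfr_pooled_p25[ok], d$count_pooled[ok])
}

tract_value <- home_row$kfr_pooled_p25
state_mean  <- wmean(subset(atlas_toy, state == home$state))
us_mean     <- wmean(atlas_toy)

print(c(tract = tract_value, state = state_mean, us = us_mean))
```

With the real data, atlas_toy would be replaced by the data frame returned by read_dta("atlas.dta"), and the FIPS code by the one the Opportunity Atlas website reports for your own tract.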
Figure 1
Household Income in Adulthood for Children Raised in Low-Income Households
in Milwaukee, WI
Notes: This figure shows household income at ages 31-37 for low-income children who grew up
in Census tracts near Milwaukee, WI. The image was saved from www.opportunity-atlas.org by
first searching for “Milwaukee, WI” and then clicking on the “download as image” button.
Figure 2
Incarceration Rates for Black Men Raised in the Lowest-Income Households
in Los Angeles, CA
Notes: This figure is from the non-technical summary of the Opportunity Atlas and was discussed
in Lecture 1.
DATA DESCRIPTION, FILE: atlas.dta
The data consist of n = 73,278 U.S. Census tracts. For more details on the construction of the
variables included in this data set, please see Chetty, Raj, John Friedman, Nathaniel Hendren,
Maggie R. Jones, and Sonya R. Porter. 2018. “The Opportunity Atlas: Mapping the Childhood
Roots of Social Mobility.” NBER Working Paper No. 25147.
Table 1
Definitions of Variables in atlas.dta
singleparent_share2010        Share of Single-Headed Households with Children 2010                     72,564
singleparent_share1990        Share of Single-Headed Households with Children 1990                     72,196
singleparent_share2000        Share of Single-Headed Households with Children 2000                     72,285
traveltime15_2010             Share of Working Adults w/ Commute Time of 15 Minutes Or Less in 2010    72,939
emp2000                       Employment Rate 2000                                                     72,344
mail_return_rate2010          Census Form Rate Return Rate 2010                                        72,547
ln_wage_growth_hs_grad        Log wage growth for HS Grad., 2005-2014                                  51,635
jobs_total_5mi_2015           Number of Primary Jobs within 5 Miles in 2015                            72,311
jobs_highpay_5mi_2015         Number of High-Paying (>USD40,000 annually) Jobs within 5 Miles in 2015  72,311
nonwhite_share2010            Share of People who are not white 2010                                   73,111
popdensity2010                Population Density (per square mile) in 2010                             73,194
ann_avg_job_growth_2004_2013  Average Annual Job Growth Rate 2004-2013                                 70,664
job_density_2013              Job Density (in square miles) in 2013                                    72,463
kfr_asian_p100  Household income ($) at age 31-37 for Asian children with parents at the 100th percentile of the national income distribution     13,480
kfr_black_p25   Household income ($) at age 31-37 for Black children with parents at the 25th percentile of the national income distribution      34,086
kfr_black_p75   Household income ($) at age 31-37 for Black children with parents at the 75th percentile of the national income distribution      34,049
kfr_black_p100  Household income ($) at age 31-37 for Black children with parents at the 100th percentile of the national income distribution     32,536
kfr_hisp_p25    Household income ($) at age 31-37 for Hispanic children with parents at the 25th percentile of the national income distribution   37,611
kfr_hisp_p75    Household income ($) at age 31-37 for Hispanic children with parents at the 75th percentile of the national income distribution   37,579
kfr_hisp_p100   Household income ($) at age 31-37 for Hispanic children with parents at the 100th percentile of the national income distribution  35,987
kfr_white_p25   Household income ($) at age 31-37 for white children with parents at the 25th percentile of the national income distribution      67,978
kfr_white_p75   Household income ($) at age 31-37 for white children with parents at the 75th percentile of the national income distribution      67,968
kfr_white_p100  Household income ($) at age 31-37 for white children with parents at the 100th percentile of the national income distribution     67,627
3. Counts of number of children under 18 in 2000 (to calculate weighted summary statistics)
count_pooled Count of all children 72,451
count_white Count of White children 72,451
count_black Count of Black children 72,451
count_asian Count of Asian children 72,451
count_hisp Count of Hispanic children 72,451
count_natam Count of Native American children 72,451
Note: This table describes the variables included in the atlas.dta file.
Table 2a
STATA Hints
STATA command Description
    *clear the workspace
    clear
    set more off
    cap log close

    *change working directory and open data set
    cd "C:\Users\gbruich\Ec1152\Projects\"
    use atlas.dta
Description: This code shows how to clear the workspace, change the working directory, and open a Stata data file. To change directories on either a Mac or Windows PC, you can use the drop-down menu in Stata. Go to File -> Change working directory -> navigate to the folder where your data is located. The command to change directories will appear; it can then be copied and pasted into your .do file.
    *Summary stats
    sum yvar [aw = count_pooled]

    *Summary stats for Wisconsin
    sum yvar if state == 55 [aw = count_pooled]

    *Summary stats for Milwaukee County
    sum yvar if state == 55 & county == 079 [aw = count_pooled]
Description: These commands report means and standard deviations for yvar, weighted by the variable count_pooled. The first line calculates these statistics across the full sample. The second line calculates these statistics for observations in Wisconsin. The third line calculates these statistics for observations in Milwaukee County.
Table 2b: R Commands
R command Description
    #clear the workspace
    rm(list = ls())

    #Install and load haven package
    install.packages("haven")
    library(haven)

    #Change working directory and load stata data set
    setwd("C:/Users/gbruich/Ec1152/Projects")
    atlas <- read_dta("atlas.dta")
Description: This sequence of commands shows how to open Stata datasets in R. The first block of code clears the workspace. The second block of code installs and loads the “haven” package. The third block of code changes the working directory to the location of the data and loads in atlas.dta.
    #Install and load sandwich and lmtest packages
    install.packages("sandwich")
    install.packages("lmtest")
    library(sandwich)
    library(lmtest)

    #Run regression with homoskedasticity-only standard errors
    mod1 <- lm(yvar ~ xvar1 + xvar2 + xvar3, data = milwaukee)
    summary(mod1)

    #Report coefficients with heteroskedasticity robust standard errors
    coeftest(mod1, vcov = vcovHC(mod1, type = "HC1"))
Description: This sequence of commands shows how to estimate an ordinary least squares regression with heteroskedasticity-robust standard errors. The first block of code installs and loads the necessary packages. The second block of code estimates a regression of yvar against xvar1, xvar2, and xvar3, then reports the estimated coefficients, homoskedasticity-only standard errors, and regression diagnostics (R-squared, adjusted R-squared, and the RMSE/SER, which the output calls the residual standard error). The last block of code reports the coefficients with heteroskedasticity-robust standard errors.
    #Method 1
    ##Standardize variables
    milwaukee$x_std <- (milwaukee$yvar - mean(milwaukee$yvar))/sd(milwaukee$yvar)
Description: These commands show how to estimate correlation coefficients.
    summary(mod2)
    coeftest(mod2, vcov = vcovHC(mod2, type = "HC1"))

    #Note that regression output matches the following output
    cor(milwaukee$kfr_pooled_p25, milwaukee$job_density_2013)
Description: with heteroskedasticity robust standard errors. The second method is to use the cor command, which does not report standard errors.
    # Install and load ggplot2 package
    install.packages("ggplot2")
    library(ggplot2)

    # Draw scatter plot with linear fit line
    ggplot(data = milwaukee) + geom_point(aes(x = xvar1, y = yvar)) +
      geom_smooth(aes(x = xvar1, y = yvar), method = "lm", se = FALSE)
Description: These commands show how to draw a scatter plot of yvar against xvar1. The geom_smooth part of the code adds an OLS regression line. The last line saves the graph as a .png file.
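The two approaches to correlation coefficients in Table 2b can be checked against each other on a small example. The sketch below uses made-up numbers (the data frame toy is hypothetical) to show that the slope from a regression of standardized yvar on standardized xvar equals cor(xvar, yvar). It also uses base R's cor.test, which the table does not cover, as one possible way to obtain a 95% confidence interval. That interval is based on the Fisher z-transformation rather than on heteroskedasticity-robust standard errors, so it need not match an interval built from coeftest output.

```r
# Made-up data for illustration; these are not values from atlas.dta.
toy <- data.frame(
  xvar = c(1.2, 2.5, 3.1, 4.8, 5.0, 6.7, 7.3, 8.9),
  yvar = c(2.0, 2.9, 4.2, 4.1, 6.3, 6.0, 8.1, 8.4)
)

# Method 1: regress standardized yvar on standardized xvar.
toy$x_std <- (toy$xvar - mean(toy$xvar)) / sd(toy$xvar)
toy$y_std <- (toy$yvar - mean(toy$yvar)) / sd(toy$yvar)
fit   <- lm(y_std ~ x_std, data = toy)
slope <- unname(coef(fit)["x_std"])

# Method 2: the correlation coefficient computed directly.
r <- cor(toy$xvar, toy$yvar)

# 95% confidence interval from cor.test (Fisher z-transformation,
# which relies on approximate bivariate normality).
ct <- cor.test(toy$xvar, toy$yvar, conf.level = 0.95)
ci <- as.numeric(ct$conf.int)

print(c(slope = slope, r = r, lower = ci[1], upper = ci[2]))
```

The slope and r agree up to floating-point error, which is why the regression on standardized variables can be read as a correlation coefficient.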