Lecture set 1
Chapters 1, 2 and 3
The statistical analysis of
economic (and related)
data
Brief Overview of the Course
• Economics suggests important relationships, often with policy
implications, but virtually never suggests quantitative
magnitudes of causal effects.
– What is the quantitative effect of reducing class size on student
achievement?
– How does another year of education change earnings?
– What is the price elasticity of cigarettes?
– What is the effect on output growth of a 1 percentage point increase in
interest rates by the Fed?
This course is about using data to measure
causal effects.
• Ideally, we would like an experiment
– What would be an experiment to estimate the effect of class size on
standardized test scores?
• But almost always we only have observational
(nonexperimental) data.
– returns to education
– cigarette prices
– monetary policy
• Most of the course deals with difficulties arising from using
observational data to estimate causal effects
– confounding effects (omitted factors)
– simultaneous causality
– “correlation does not imply causation”
In this course you will:
• Learn methods for estimating causal effects using observational
data
• Learn methods for prediction – for which knowing causal effects
is not necessary – including forecasting using time series data;
• Focus on applications – theory is used only as needed to
understand the whys of the methods;
• Learn to evaluate the regression analysis of others – this means
you will be able to read/understand empirical economics papers.
• Get some hands-on experience with regression analysis in your
problem sets.
Review of Probability and Statistics
(SW Chapters 2, 3)
• Empirical problem: Class size and educational output
– Policy question: What is the effect on test scores (or some other outcome
measure) of reducing class size by one student per class? by 8
students/class? (class size measured by student-teacher ratio (STR))
– We must use data to find out (is there any way to answer this without
data?)
What comes next…
• The mechanics of estimation, hypothesis testing, and confidence
intervals should be familiar
• These concepts extend directly to regression and its variants.
• Before turning to regression, however, we will review some of
the underlying theory of estimation, hypothesis testing, and
confidence intervals.
Review of Statistical Theory
1. The probability framework for statistical inference
2. Estimation
3. Testing
4. Confidence Intervals
Random variable Y
• Numerical summary of a random outcome (e.g., district average
test score, district STR)
Population distribution of Y
• The probabilities of different values of Y that occur in the
population, for ex. Pr[Y = 650] (when Y is discrete)
• or: The probabilities of sets of these values, for ex. Pr[640 ≤ Y ≤
660] (when Y is continuous).
(b) Moments of a population distribution: mean,
variance, standard deviation, covariance, correlation
(1 of 3)
skewness = E[(Y – μY)^3]/σY^3 = measure of asymmetry of a distribution
• skewness = 0: distribution is symmetric
• skewness > (<) 0: distribution has long right (left) tail
kurtosis = E[(Y – μY)^4]/σY^4
= measure of mass in tails
= measure of probability of large extreme values (outliers)
• kurtosis = 3: normal distribution
• kurtosis > 3: heavy tails (“leptokurtic”)
2 random variables: joint distributions and
covariance (1 of 2)
• Random variables X and Z have a joint distribution
• The covariance between X and Z is
cov(X,Z) = E[(X – μX)(Z – μZ)] = σXZ
• The covariance is a measure of the linear association between X
and Z; its units are units of X × units of Z
• cov(X,Z) > 0 means a positive relation between X and Z
• If X and Z are independently distributed, then cov(X,Z) = 0
• The covariance of a RV with itself is its variance: cov(X,X) = E[(X – μX)^2] = σX^2 = var(X)
• The correlation coefficient is the covariance scaled by the standard deviations: corr(X,Z) = cov(X,Z)/(σXσZ) = σXZ/(σXσZ)
• –1 ≤ corr(X,Z) ≤ 1
• corr(X,Z) = 1 means perfect positive linear association
• corr(X,Z) = –1 means perfect negative linear association
• corr(X,Z) = 0 means no linear association
2 random variables: joint distributions and
covariance (2 of 2)
The covariance between Test Score and STR is negative:
(c) Conditional distributions and
conditional means (1 of 3)
Conditional distributions
• The distribution of Y, given value(s) of some other random variable, X
• Ex: the distribution of test scores, given that the Student-Teacher Ratio (STR) < 20
• The difference in conditional means, E[Test score | STR < 20] – E[Test score | STR ≥ 20], measures the difference in test scores associated with differences in class size.
• Because Y1 and Y2 are randomly drawn from the same population (i.e., come
from the same distribution) this implies that Y1, Y2 are identically distributed
– That is, under simple random sampling, Y1 and Y2 are independently and
identically distributed (i.i.d.).
Hence, we say the data set {Yi}, i = 1,…, n, is i.i.d.
This framework allows rigorous statistical inferences about
moments of population distributions using a sample of data
from that population…
1. The probability framework for statistical inference
2. Estimation
3. Testing
4. Confidence Intervals
Estimation
Ῡ is the natural estimator of the mean. But:
a) What are the properties of Ῡ ?
b) Why should we use Ῡ rather than some other estimator, such as:
– Y1 (the first observation)
– a weighted average with unequal weights (not the simple average)
– median(Y1,…, Yn)?
The starting point is the sampling distribution of Ῡ …
(a) The sampling distribution of Ῡ (1 of 3)
Key properties of an estimator: unbiasedness and consistency
The mean and variance of the sampling
distribution of Ῡ (3 of 3)
E(Ῡ) = μY
var(Ῡ) = σY^2/n
Implications:
1. Ῡ is an unbiased estimator of μY (that is, E(Ῡ ) = μY)
2. var(Ῡ ) is inversely proportional to n
1. the spread of the sampling distribution is proportional to 1/√n
2. Thus the sampling uncertainty associated with Ῡ is proportional to 1/√n (larger samples, less uncertainty, but square-root law)
The sampling distribution of Ῡ when n is
large
For small sample sizes, the distribution of Ῡ is complicated, but
if n is large, the sampling distribution is simple!
1. As n increases, the distribution of Ῡ becomes more tightly centered
around μY (the Law of Large Numbers)
2. Moreover, the sampling distribution of Ῡ becomes approximately normal, centered at μY (the Central Limit Theorem)
The Law of Large Numbers:
An estimator is consistent if the probability that it falls within an interval of the true population value tends to one as the sample size increases.
If (Y1,…, Yn) are i.i.d. and σY^2 < ∞, then Ῡ is a consistent estimator of μY, that is,
Pr[|Ῡ – μY| < ε] → 1 as n → ∞,
which can be written Ῡ →p μY
(“Ῡ →p μY” means “Ῡ converges in probability to μY”).
(The math: as n → ∞, var(Ῡ) = σY^2/n → 0, which implies that Pr[|Ῡ – μY| < ε] → 1.)
The Central Limit Theorem (CLT) (1 of 3)
If (Y1,…, Yn) are i.i.d. and 0 < σY^2 < ∞, then when n is large, the distribution of Ῡ is well approximated by a normal distribution.
• Ῡ is approximately distributed N(μY, σY^2/n) (“normal distribution with mean μY and variance σY^2/n”)
• That is, the “standardized” Ῡ, ZY = (Ῡ – E(Ῡ))/√var(Ῡ) = (Ῡ – μY)/(σY/√n), is approximately distributed as N(0, 1)
– The larger is n, the better is the approximation.
The Central Limit Theorem (CLT) (3 of 3)
Sampling distribution of ZY = (Ῡ – E(Ῡ))/√var(Ῡ) (see figure).
Summary: The Sampling Distribution of Ῡ
For Y1,…, Yn i.i.d. with 0 < σY^2 < ∞,
• The exact (finite sample) sampling distribution of Ῡ has mean μY (“Ῡ is an unbiased estimator of μY”) and variance σY^2/n
• Other than its mean and variance, the exact distribution of Ῡ is complicated and depends on the distribution of Y (the population distribution)
• When n is large, the sampling distribution simplifies:
– Ῡ →p μY (Law of Large Numbers)
– (Ῡ – E(Ῡ))/√var(Ῡ) is approximately N(0, 1) (CLT)
1. The probability framework for statistical inference
2. Estimation
3. Hypothesis Testing
4. Confidence intervals
Hypothesis Testing
The hypothesis testing problem (for the mean): make a
provisional decision based on the evidence at hand whether a null
hypothesis is true, or instead that some alternative hypothesis is
true. That is, test
– H0: E(Y ) = μY,0 vs. H1: E(Y ) > μY,0 (1-sided, >)
– H0: E(Y ) = μY,0 vs. H1: E(Y ) < μY,0 (1-sided, <)
– H0: E(Y ) = μY,0 vs. H1: E(Y ) ≠ μY,0 (2-sided)
Some terminology for testing statistical
hypotheses (1 of 2)
p-value = probability of drawing a statistic (e.g., Ῡ) at least as
adverse to the null as the value actually computed with your data,
assuming that the null hypothesis is true.
The significance level of a test is a pre-specified probability of
incorrectly rejecting the null, when the null is true.
Note
A p-value is a measure of the probability that an observed difference (or simply
an observed value that differs from the hypothesized value) could have occurred
just by random chance.
For a two-sided test based on a large sample, the p-value is the area in the two tails of the standard normal distribution beyond ± the absolute value of the computed t-statistic.
If the population std. deviation σY is unknown, then it must be
estimated using the sample variance of Y
Estimator of the variance of Y:
sY^2 = [1/(n – 1)] Σi=1,…,n (Yi – Ῡ)^2 (“sample variance of Y”)
Fact:
If (Y1,…,Yn) are i.i.d. and the fourth moment is finite, E(Y^4) < ∞, then
sY^2 →p σY^2
The t-statistic for testing whether two population means differ (e.g., test scores in districts with small (s) versus large (l) classes) is
t = (Ῡs – Ῡl)/SE(Ῡs – Ῡl), where SE(Ῡs – Ῡl) = √(ss^2/ns + sl^2/nl)
Confidence Intervals (1 of 2)
• A 95% confidence interval for μY is an interval or range that contains
the true value of μY in 95% of repeated samples.
• Digression: What is random here? The values of Y1,...,Yn and thus any
functions of them – including the confidence interval. The confidence
interval will differ from one sample to the next. The population
parameter, μY, is not random (we just don’t know it and it needs to be
estimated)
Confidence Intervals (2 of 2)
A 95% confidence interval can always be constructed as the
set of values of μY not rejected by a hypothesis test with a 5%
significance level.
{μY : |(Ῡ – μY)/(sY/√n)| ≤ 1.96}
= {μY : –1.96 ≤ (Ῡ – μY)/(sY/√n) ≤ 1.96}
= {μY : Ῡ – 1.96·sY/√n ≤ μY ≤ Ῡ + 1.96·sY/√n}
= {μY ∈ (Ῡ – 1.96·sY/√n, Ῡ + 1.96·sY/√n)}