Linear Regression
STAT-S-301
Policy question: What is the effect of reducing class size by one student per
class? By x students/class?
What is the right output (performance) measure?
parent satisfaction
student personal development
future adult welfare
future adult earnings
performance on standardized tests
Do districts with smaller classes (lower STR) have higher test scores?
$$\frac{\Delta\,\text{Test score}}{\Delta\, STR}$$
This is the slope of the line relating test score and STR.
This suggests that we want to draw a line through the Test Score v. STR
scatterplot, but how?
[Scatterplot: Test score vs. STR]
The sample mean $\bar Y$ is the least squares estimator of $E(Y)$; it solves
$$\min_m \sum_{i=1}^{n} (Y_i - m)^2.$$
By analogy, the OLS estimators $\hat\beta_0$ and $\hat\beta_1$ solve
$$\min_{b_0,\, b_1} \sum_{i=1}^{n} \bigl[Y_i - (b_0 + b_1 X_i)\bigr]^2.$$
The OLS estimator minimizes the average squared difference between the
actual values of Yi and the prediction (predicted value) based on the estimated
line.
This minimization problem can be solved using calculus. The result is the OLS estimators of $\beta_0$ and $\beta_1$:
$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^{n} (X_i - \bar X)^2}, \qquad \hat\beta_0 = \bar Y - \hat\beta_1 \bar X.$$
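To make the formulas concrete, here is a minimal numerical sketch (not part of the original slides): it computes $\hat\beta_0$ and $\hat\beta_1$ directly from these formulas on a synthetic data set whose values are made up purely for illustration.

```python
import numpy as np

# Synthetic data (hypothetical, for illustration only): Y = 700 - 2*X + noise
rng = np.random.default_rng(0)
n = 420
X = rng.uniform(14, 26, size=n)             # "student-teacher ratios"
Y = 700 - 2.0 * X + rng.normal(0, 15, n)    # "test scores"

# OLS formulas: slope and intercept that minimize the sum of squared residuals
Xbar, Ybar = X.mean(), Y.mean()
beta1_hat = np.sum((X - Xbar) * (Y - Ybar)) / np.sum((X - Xbar) ** 2)
beta0_hat = Ybar - beta1_hat * Xbar
print(beta0_hat, beta1_hat)                 # close to the true values 700 and -2
```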
Estimated slope: $\hat\beta_1 = -2.28$
Estimated intercept: $\hat\beta_0 = 698.9$
Estimated regression line: $\widehat{\text{TestScore}} = 698.9 - 2.28 \times STR$
That is, $\dfrac{\Delta\,\text{Test score}}{\Delta\, STR} = -2.28$: districts with one more student per teacher have, on average, test scores that are 2.28 points lower.
The intercept (taken literally) means that, according to this estimated line,
districts with zero students per teacher would have a (predicted) test score
of 698.9.
This interpretation of the intercept makes no sense: it extrapolates the line outside the range of the data. In this application, the intercept is not itself economically meaningful.
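With the estimated line in hand, the policy question from the start of the handout reduces to simple arithmetic. A small sketch (the reduction of 2 students per class is a hypothetical scenario):

```python
beta0_hat, beta1_hat = 698.9, -2.28    # estimates from the test score regression

# Predicted effect of reducing STR by 2 students per class
delta_STR = -2
print(beta1_hat * delta_STR)           # +4.56 points

# Predicted test score for a district with STR = 20 (inside the data range)
print(beta0_hat + beta1_hat * 20)      # about 653.3
```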
Predicted value: $\hat Y_i = \hat\beta_0 + \hat\beta_1 X_i$. Residual: $\hat u_i = Y_i - \hat Y_i$, the difference between the actual value $Y_i$ and its OLS predicted value.
                                              Number of obs =     420
                                              F(  1,   418) =   19.26
                                              Prob > F      =  0.0000
                                              R-squared     =  0.0512
                                              Root MSE      =  18.581

------------------------------------------------------------------------
        |             Robust
testscr |     Coef.   Std. Err.      t    P>|t|   [95% Conf. Interval]
--------+---------------------------------------------------------------
    str | -2.279808   .5194892   -4.39   0.000   -3.300945   -1.258671
  _cons |   698.933   10.36436   67.44   0.000    678.5602    719.3057
------------------------------------------------------------------------
$\hat\beta_1$ is computed from a sample of data; a different sample would give a different estimate. How precise is $\hat\beta_1$?
Data and sampling
The population objects (parameters) $\beta_0$ and $\beta_1$ are unknown; to draw inferences about these unknown parameters we must collect relevant data.
Simple random sampling:
Choose n entities at random from the population of interest, and observe
(record) X and Y for each entity
Simple random sampling implies that $\{(X_i, Y_i)\}$, $i = 1, \dots, n$, are independently and identically distributed (i.i.d.). (Note: $(X_i, Y_i)$ are distributed independently of $(X_j, Y_j)$ for different observations $i$ and $j$.)
Task at hand: to characterize the sampling distribution of the OLS estimator. To
do so, we make three assumptions:
1. The conditional distribution of $u$ given $X$ has mean zero: $E(u \mid X = x) = 0$.
2. $(X_i, Y_i)$, $i = 1, \dots, n$, are i.i.d.
3. Large outliers in $X$ and/or $Y$ are rare.
Other factors:
parental involvement
outside learning opportunities (extra math classes, etc.)
home environment conducive to reading
family income is a useful proxy for many such factors
So E(u|X=x) = 0 means E(Family Income|STR) = constant (which implies that
family income and STR are uncorrelated). This assumption is not innocuous!
2. The sampling distribution of the OLS estimator
Like $\bar Y$, $\hat\beta_1$ has a sampling distribution; we first derive its mean and variance.
The mean of $\hat\beta_1$:
$$Y_i = \beta_0 + \beta_1 X_i + u_i \qquad\text{and}\qquad \bar Y = \beta_0 + \beta_1 \bar X + \bar u,$$
so
$$Y_i - \bar Y = \beta_1 (X_i - \bar X) + (u_i - \bar u).$$
Thus,
$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^{n} (X_i - \bar X)^2}
= \frac{\sum_{i=1}^{n} (X_i - \bar X)\bigl[\beta_1 (X_i - \bar X) + (u_i - \bar u)\bigr]}{\sum_{i=1}^{n} (X_i - \bar X)^2}.$$
$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (X_i - \bar X)\bigl[\beta_1 (X_i - \bar X) + (u_i - \bar u)\bigr]}{\sum_{i=1}^{n} (X_i - \bar X)^2}
= \beta_1\,\frac{\sum_{i=1}^{n} (X_i - \bar X)(X_i - \bar X)}{\sum_{i=1}^{n} (X_i - \bar X)^2}
+ \frac{\sum_{i=1}^{n} (X_i - \bar X)(u_i - \bar u)}{\sum_{i=1}^{n} (X_i - \bar X)^2},$$
so
$$\hat\beta_1 - \beta_1 = \frac{\sum_{i=1}^{n} (X_i - \bar X)(u_i - \bar u)}{\sum_{i=1}^{n} (X_i - \bar X)^2}.$$
Next, note that
$$\sum_{i=1}^{n} (X_i - \bar X)(u_i - \bar u)
= \sum_{i=1}^{n} (X_i - \bar X) u_i - \Bigl[\sum_{i=1}^{n} (X_i - \bar X)\Bigr]\bar u
= \sum_{i=1}^{n} (X_i - \bar X) u_i,$$
because $\sum_{i=1}^{n} (X_i - \bar X) = 0$. Thus,
$$\hat\beta_1 - \beta_1 = \frac{\sum_{i=1}^{n} (X_i - \bar X) u_i}{\sum_{i=1}^{n} (X_i - \bar X)^2}
= \frac{\dfrac{1}{n}\sum_{i=1}^{n} v_i}{\dfrac{n-1}{n}\, s_X^2},$$
where $v_i = (X_i - \bar X) u_i$ and $s_X^2 = \dfrac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar X)^2$ is the sample variance of $X$.
Collecting terms,
$$\hat\beta_1 - \beta_1 = \frac{\dfrac{1}{n}\sum_{i=1}^{n} v_i}{\dfrac{n-1}{n}\, s_X^2},
\qquad\text{where } v_i = (X_i - \bar X) u_i.$$
The mean of $\hat\beta_1$:
$$E(\hat\beta_1 - \beta_1)
= E\!\left[\frac{\dfrac{1}{n}\sum_{i=1}^{n} v_i}{\dfrac{n-1}{n}\, s_X^2}\right]
= \frac{n}{n-1}\, E\!\left[\frac{1}{n}\sum_{i=1}^{n} \frac{v_i}{s_X^2}\right].$$
Now
$$E\!\left(\frac{v_i}{s_X^2}\right) = E\!\left[\frac{(X_i - \bar X) u_i}{s_X^2}\right] = 0,$$
because $E(u_i \mid X_i = x) = 0$. Thus,
$$E(\hat\beta_1 - \beta_1) = \frac{n}{n-1}\, E\!\left[\frac{1}{n}\sum_{i=1}^{n} \frac{v_i}{s_X^2}\right] = 0,$$
so
$$E(\hat\beta_1) = \beta_1.$$
That is, $\hat\beta_1$ is an unbiased estimator of $\beta_1$.
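A Monte Carlo sketch of this unbiasedness result (my own illustration, with made-up population values, not from the slides): draw many i.i.d. samples from a population in which $E(u \mid X) = 0$ holds by construction and average the resulting $\hat\beta_1$'s.

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1 = 5.0, 2.0                 # hypothetical true coefficients
n, n_reps = 100, 5000

estimates = np.empty(n_reps)
for r in range(n_reps):
    X = rng.normal(10, 2, size=n)
    u = rng.normal(0, 3, size=n)        # E(u|X) = 0 by construction
    Y = beta0 + beta1 * X + u
    Xbar = X.mean()
    estimates[r] = np.sum((X - Xbar) * (Y - Y.mean())) / np.sum((X - Xbar) ** 2)

print(estimates.mean())                 # close to beta1 = 2.0: beta1_hat is unbiased
```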
Next calculate the variance of $\hat\beta_1$. Write
$$\hat\beta_1 - \beta_1 = \frac{\dfrac{1}{n}\sum_{i=1}^{n} v_i}{\dfrac{n-1}{n}\, s_X^2}.$$
When $n$ is large, $\dfrac{n-1}{n} \approx 1$ and $s_X^2$ can be replaced by $\sigma_X^2$, so
$$\operatorname{var}(\hat\beta_1) = \frac{\operatorname{var}(\bar v)}{(\sigma_X^2)^2} = \frac{\operatorname{var}(v)}{n\,\sigma_X^4},
\qquad\text{where } \bar v = \frac{1}{n}\sum_{i=1}^{n} v_i.$$
The exact sampling distribution is complicated, but when the sample size is large we get some simple (and good) approximations:
(1) Because $\operatorname{var}(\hat\beta_1) = \dfrac{\operatorname{var}(v)}{n\,\sigma_X^4} \to 0$ as $n$ grows and $E(\hat\beta_1) = \beta_1$, we have $\hat\beta_1 \xrightarrow{\ p\ } \beta_1$; that is, $\hat\beta_1$ is consistent.
(2) When $n$ is large, the sampling distribution of $\hat\beta_1$ is well approximated by a normal distribution (CLT).
$$\hat\beta_1 - \beta_1 = \frac{\dfrac{1}{n}\sum_{i=1}^{n} v_i}{\dfrac{n-1}{n}\, s_X^2},
\qquad\text{where } v_i = (X_i - \bar X) u_i.$$
When $n$ is large:
$v_i = (X_i - \bar X) u_i \approx (X_i - \mu_X) u_i$, which is i.i.d. (why?) and has two moments, that is, $\operatorname{var}(v_i) < \infty$ (why?). Thus $\dfrac{1}{n}\sum_{i=1}^{n} v_i$ is approximately distributed $N(0, \operatorname{var}(v)/n)$ when $n$ is large.
Putting these together, the large-$n$ distribution of $\hat\beta_1$ is:
$$\hat\beta_1 - \beta_1 = \frac{\dfrac{1}{n}\sum_{i=1}^{n} v_i}{\dfrac{n-1}{n}\, s_X^2}
\approx \frac{\dfrac{1}{n}\sum_{i=1}^{n} v_i}{\sigma_X^2},$$
which is approximately distributed $N\!\left(0,\ \dfrac{\sigma_v^2}{n(\sigma_X^2)^2}\right)$.
Because $v_i = (X_i - \bar X) u_i \approx (X_i - \mu_X) u_i$, $\hat\beta_1$ is approximately distributed
$$N\!\left(\beta_1,\ \frac{\operatorname{var}[(X_i - \mu_X) u_i]}{n\,\sigma_X^4}\right).$$
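The same kind of simulation can be used to check this large-$n$ variance formula (again a sketch with made-up population values; when $u$ is independent of $X$ and homoskedastic, $\operatorname{var}[(X - \mu_X)u] = \sigma_X^2 \sigma_u^2$, so the formula simplifies as in the comment below).

```python
import numpy as np

rng = np.random.default_rng(2)
beta0, beta1 = 5.0, 2.0
n, n_reps = 200, 20000
sigma_X, sigma_u = 2.0, 3.0

estimates = np.empty(n_reps)
for r in range(n_reps):
    X = rng.normal(10, sigma_X, size=n)
    u = rng.normal(0, sigma_u, size=n)
    Y = beta0 + beta1 * X + u
    Xbar = X.mean()
    estimates[r] = np.sum((X - Xbar) * (Y - Y.mean())) / np.sum((X - Xbar) ** 2)

# var(beta1_hat) ~ var[(X - muX) u] / (n * sigma_X^4) = sigma_u^2 / (n * sigma_X^2)
var_formula = sigma_u**2 / (n * sigma_X**2)
print(estimates.var(), var_formula)      # the two numbers should be close
```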
Recall the analogous results for $\bar Y$: other than its mean and variance, the exact distribution of $\bar Y$ is complicated and depends on the distribution of $Y$;
$$\bar Y \xrightarrow{\ p\ } \mu_Y;$$
and
$$\frac{\bar Y - E(\bar Y)}{\sqrt{\operatorname{var}(\bar Y)}}$$
is approximately distributed $N(0, 1)$ when $n$ is large.
The same is true of $\hat\beta_1$: $\hat\beta_1$ has mean $\beta_1$ ($\hat\beta_1$ is unbiased); $\hat\beta_1$ is consistent ($\hat\beta_1 \xrightarrow{\ p\ } \beta_1$); and
$$\frac{\hat\beta_1 - E(\hat\beta_1)}{\sqrt{\operatorname{var}(\hat\beta_1)}}$$
is approximately distributed $N(0, 1)$ when $n$ is large.
3. Hypothesis Testing
Suppose a skeptic suggests that reducing the number of students in a class has no effect on learning or, specifically, on test scores. The skeptic thus asserts the hypothesis
$$H_0\colon \beta_1 = 0.$$
We wish to use data to test this hypothesis and reach a tentative conclusion about whether it is correct or incorrect.
For testing the mean of $Y$:
$$t = \frac{\bar Y - \mu_{Y,0}}{s_Y / \sqrt{n}}.$$
For testing $\beta_1$:
$$t = \frac{\hat\beta_1 - \beta_{1,0}}{SE(\hat\beta_1)},$$
where $\beta_{1,0}$ is the value of $\beta_1$ hypothesized under the null (for example, if the null is $H_0\colon \beta_1 = 0$, then $\beta_{1,0} = 0$).
Variance of $\hat\beta_1$ (large $n$):
$$\operatorname{var}(\hat\beta_1) = \frac{\operatorname{var}[(X_i - \mu_X) u_i]}{n (\sigma_X^2)^2} = \frac{\sigma_v^2}{n\,\sigma_X^4},
\qquad\text{where } v_i = (X_i - \bar X) u_i.$$
The estimator of the variance of $\hat\beta_1$ replaces the unknown population quantities with estimators constructed from the data:
$$\hat\sigma^2_{\hat\beta_1}
= \frac{1}{n} \times \frac{\text{estimator of } \sigma_v^2}{\bigl(\text{estimator of } \sigma_X^2\bigr)^2}
= \frac{1}{n} \times \frac{\dfrac{1}{n-2}\sum_{i=1}^{n} (X_i - \bar X)^2 \hat u_i^2}{\left[\dfrac{1}{n}\sum_{i=1}^{n} (X_i - \bar X)^2\right]^2},$$
and $SE(\hat\beta_1) = \sqrt{\hat\sigma^2_{\hat\beta_1}}$.
It is less complicated than it seems: the numerator estimates $\operatorname{var}(v)$ and the denominator estimates $[\operatorname{var}(X)]^2$.
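A sketch of this variance estimator in code (my construction; `X` and `Y` stand for any sample, such as the synthetic one above). This is the heteroskedasticity-robust formula, which is what Stata's `, robust` option reports (HC1), so applied to the class size data it should reproduce the 0.52 standard error up to rounding.

```python
import numpy as np

def se_beta1(X, Y):
    """SE(beta1_hat) from the variance estimator above (heteroskedasticity-robust)."""
    n = len(X)
    Xbar = X.mean()
    beta1 = np.sum((X - Xbar) * (Y - Y.mean())) / np.sum((X - Xbar) ** 2)
    beta0 = Y.mean() - beta1 * Xbar
    u_hat = Y - (beta0 + beta1 * X)                        # OLS residuals
    num = np.sum((X - Xbar) ** 2 * u_hat ** 2) / (n - 2)   # estimates var(v)
    den = (np.sum((X - Xbar) ** 2) / n) ** 2               # estimates [var(X)]^2
    return np.sqrt(num / den / n)
```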
Construct the $t$-statistic:
$$t = \frac{\hat\beta_1 - \beta_{1,0}}{SE(\hat\beta_1)} = \frac{\hat\beta_1 - \beta_{1,0}}{\sqrt{\hat\sigma^2_{\hat\beta_1}}}.$$
For the test score data, $\hat\beta_1 = -2.28$ and $SE(\hat\beta_1) = 0.52$, so
$$t = \frac{\hat\beta_1 - \beta_{1,0}}{SE(\hat\beta_1)} = \frac{-2.28 - 0}{0.52} = -4.38.$$
The p-value based on the large-$n$ standard normal approximation to the $t$-statistic is 0.00001 (about $10^{-5}$).
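Reproducing the $t$-statistic and $p$-value from the reported numbers (the standard normal CDF is evaluated through the complementary error function):

```python
import math

beta1_hat, se_beta1, beta1_null = -2.28, 0.52, 0.0
t = (beta1_hat - beta1_null) / se_beta1

# Two-sided p-value from the standard normal approximation: 2*(1 - Phi(|t|))
p_value = math.erfc(abs(t) / math.sqrt(2))
print(t, p_value)                          # t ~ -4.38, p ~ 1e-5
```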
4. Confidence intervals
From the regression output, the standard error of $\hat\beta_0$ is $SE(\hat\beta_0) = 10.4$ and the standard error of $\hat\beta_1$ is $SE(\hat\beta_1) = 0.52$.
A 95% confidence interval for $\beta_1$ is
$$\hat\beta_1 \pm 1.96 \times SE(\hat\beta_1) = -2.28 \pm 1.96 \times 0.52 = (-3.30,\ -1.26).$$
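The same two numbers give the confidence interval directly (1.96 is the large-$n$ 5% two-sided normal critical value):

```python
beta1_hat, se_beta1 = -2.28, 0.52
ci = (beta1_hat - 1.96 * se_beta1, beta1_hat + 1.96 * se_beta1)
print(ci)                                  # about (-3.30, -1.26), matching the output below
```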
                                              Number of obs =     420
                                              F(  1,   418) =   19.26
                                              Prob > F      =  0.0000
                                              R-squared     =  0.0512
                                              Root MSE      =  18.581

------------------------------------------------------------------------
        |             Robust
testscr |     Coef.   Std. Err.      t    P>|t|   [95% Conf. Interval]
--------+---------------------------------------------------------------
    str | -2.279808   .5194892   -4.39   0.000   -3.300945   -1.258671
  _cons |   698.933   10.36436   67.44   0.000    678.5602    719.3057
------------------------------------------------------------------------
so: the 95% confidence interval reported for the coefficient on str is $(-3.30,\ -1.26)$.
Regression when $X$ is binary:
When $X_i = 0$: $Y_i = \beta_0 + u_i$
When $X_i = 1$: $Y_i = \beta_0 + \beta_1 + u_i$
thus:
When $X_i = 0$, the mean of $Y_i$ is $\beta_0$
When $X_i = 1$, the mean of $Y_i$ is $\beta_0 + \beta_1$
that is:
$E(Y_i \mid X_i = 0) = \beta_0$
$E(Y_i \mid X_i = 1) = \beta_0 + \beta_1$
so:
$\beta_1 = E(Y_i \mid X_i = 1) - E(Y_i \mid X_i = 0)$
= the population difference in group means.
$$D_i = \begin{cases} 1 & \text{if } STR_i < 20 \\ 0 & \text{if } STR_i \ge 20 \end{cases}$$
The OLS estimate of the regression line relating TestScore to $D$ is
$$\widehat{\text{TestScore}} = 650.0 + 7.4\, D,$$
with a standard error on the coefficient of $D$ of about 1.8 (compare the group means below).
Compare the regression results with the group means, computed directly:

Class size           Average score ($\bar Y$)    N
Small (STR < 20)     657.4                       238
Large (STR ≥ 20)     650.0                       182

Estimation: $\bar Y_s - \bar Y_l = 657.4 - 650.0 = 7.4$
Test of the hypothesis that the difference in means is zero:
$$t = \frac{\bar Y_s - \bar Y_l}{SE(\bar Y_s - \bar Y_l)} = \frac{7.4}{1.83} = 4.05.$$
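A sketch of the equivalence between the OLS slope on a binary regressor and the difference in group means (the data here are simulated to mimic the group means above and are purely illustrative):

```python
import numpy as np

# Simulated scores for "small" (D = 1) and "large" (D = 0) class districts
rng = np.random.default_rng(3)
y_small = rng.normal(657, 19, size=238)
y_large = rng.normal(650, 19, size=182)
Y = np.concatenate([y_small, y_large])
D = np.concatenate([np.ones(238), np.zeros(182)])

# OLS slope on the binary regressor D ...
Dbar = D.mean()
slope = np.sum((D - Dbar) * (Y - Y.mean())) / np.sum((D - Dbar) ** 2)

# ... equals the difference in group means
diff = y_small.mean() - y_large.mean()
print(slope, diff)                          # identical up to floating point

# t-statistic for the difference in means (unpooled standard error)
se = np.sqrt(y_small.var(ddof=1) / len(y_small) + y_large.var(ddof=1) / len(y_large))
print(diff / se)
```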
The $R^2$:
Write $Y_i$ as the sum of the OLS prediction and the OLS residual: $Y_i = \hat Y_i + \hat u_i$.
$$R^2 = \frac{ESS}{TSS},
\qquad\text{where } ESS = \sum_{i=1}^{n} (\hat Y_i - \bar Y)^2
\ \text{ and }\ TSS = \sum_{i=1}^{n} (Y_i - \bar Y)^2.$$
The $R^2$:
$R^2 = 0$ means $ESS = 0$, so $X$ explains none of the variation of $Y$.
$R^2 = 1$ means $ESS = TSS$, so $\hat Y_i = Y_i$ and $X$ explains all of the variation of $Y$.
$0 \le R^2 \le 1$.
For regression with a single regressor (the case here), $R^2$ is the square of the correlation coefficient between $X$ and $Y$.
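A sketch computing $R^2$ both ways, as ESS/TSS and as the squared correlation, for any single-regressor sample `X`, `Y` (such as the synthetic one above):

```python
import numpy as np

def r_squared(X, Y):
    """R^2 = ESS/TSS for the OLS regression of Y on X."""
    beta1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    beta0 = Y.mean() - beta1 * X.mean()
    Y_hat = beta0 + beta1 * X
    ess = np.sum((Y_hat - Y.mean()) ** 2)    # explained sum of squares
    tss = np.sum((Y - Y.mean()) ** 2)        # total sum of squares
    return ess / tss

# With a single regressor, r_squared(X, Y) equals np.corrcoef(X, Y)[0, 1] ** 2.
```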
The standard error of the regression (SER) is (almost) the sample standard deviation of the OLS residuals:
$$SER = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n} (\hat u_i - \bar{\hat u})^2}
= \sqrt{\frac{1}{n-2}\sum_{i=1}^{n} \hat u_i^2}.$$
(The second equality holds because $\bar{\hat u} = \dfrac{1}{n}\sum_{i=1}^{n} \hat u_i = 0$.)
$$SER = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n} \hat u_i^2}$$
The SER:
has the units of $u$, which are the units of $Y$
measures the spread of the distribution of $u$
measures the average size of the OLS residual (the average "mistake" made by the OLS regression line)
The root mean squared error (RMSE) is closely related to the SER:
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \hat u_i^2}$$
This measures the same thing as the SER; the minor difference is division by $1/n$ instead of $1/(n-2)$.
Technical note: why divide by $n-2$ instead of $n-1$ in
$$SER = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n} \hat u_i^2}\,?$$
Division by $n-2$ is a degrees-of-freedom correction, just like the division by $n-1$ in $s_Y^2$; the difference is that, in the SER, two parameters have been estimated ($\beta_0$ and $\beta_1$, by $\hat\beta_0$ and $\hat\beta_1$), whereas in $s_Y^2$ only one has been estimated ($\mu_Y$, by $\bar Y$).
When $n$ is large, it makes negligible difference whether $n$, $n-1$, or $n-2$ is used, although the conventional formula uses $n-2$ when there is a single regressor.
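A sketch computing the SER and RMSE from a vector of OLS residuals `u_hat` (a hypothetical input, e.g. the residuals from the earlier synthetic regression):

```python
import numpy as np

def ser_and_rmse(u_hat):
    """Standard error of the regression (divides by n - 2) and RMSE (divides by n)."""
    n = len(u_hat)
    ser = np.sqrt(np.sum(u_hat ** 2) / (n - 2))
    rmse = np.sqrt(np.sum(u_hat ** 2) / n)
    return ser, rmse
```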
Heteroskedasticity, homoskedasticity, and the standard errors of $\hat\beta_0$ and $\hat\beta_1$
If $\operatorname{var}(u \mid X = x)$ is constant (it does not depend on $x$), then $u$ is said to be homoskedastic; otherwise, $u$ is heteroskedastic.
Homoskedasticity in a picture:
Heteroskedasticity in a picture:
[Scatterplot with fitted values; years of education (5 to 20) on the horizontal axis]
Is heteroskedasticity present in the class size data? Hard to say: the scatter looks nearly homoskedastic, but the spread might be tighter for large values of STR.
If $\operatorname{var}(u_i \mid X_i = x) = \sigma_u^2$ (that is, if $u$ is homoskedastic), then
$$\operatorname{var}(\hat\beta_1) = \frac{\operatorname{var}[(X_i - \mu_X) u_i]}{n (\sigma_X^2)^2} = \frac{\sigma_u^2}{n\,\sigma_X^2}.$$
Homoskedasticity thus simplifies the formula for the variance of $\hat\beta_1$.
The heteroskedasticity-robust estimator of the variance of $\hat\beta_1$ is
$$\hat\sigma^2_{\hat\beta_1}
= \frac{1}{n} \times \frac{\dfrac{1}{n-2}\sum_{i=1}^{n} (X_i - \bar X)^2 \hat u_i^2}{\left[\dfrac{1}{n}\sum_{i=1}^{n} (X_i - \bar X)^2\right]^2},$$
instead of the homoskedasticity-only estimator
$$\tilde\sigma^2_{\hat\beta_1}
= \frac{1}{n} \times \frac{\dfrac{1}{n-2}\sum_{i=1}^{n} \hat u_i^2}{\dfrac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar X)^2}.$$
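A sketch contrasting the two estimators on data that are heteroskedastic by construction (my own illustration with made-up values; with real data one does not know which case applies, which is the argument for using the robust formula):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 420
X = rng.uniform(14, 26, size=n)
u = rng.normal(0, 1, size=n) * (X - 14)     # sd(u|X) grows with X: heteroskedastic
Y = 700 - 2.0 * X + u

Xbar = X.mean()
sxx = np.sum((X - Xbar) ** 2)
beta1 = np.sum((X - Xbar) * (Y - Y.mean())) / sxx
beta0 = Y.mean() - beta1 * Xbar
u_hat = Y - (beta0 + beta1 * X)

# Heteroskedasticity-robust variance estimator (first formula above)
var_robust = (np.sum((X - Xbar) ** 2 * u_hat ** 2) / (n - 2)) / ((sxx / n) ** 2) / n
# Homoskedasticity-only variance estimator (second formula above)
var_homosk = (np.sum(u_hat ** 2) / (n - 2)) / (sxx / (n - 1)) / n

print(np.sqrt(var_robust), np.sqrt(var_homosk))   # these differ under heteroskedasticity
```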
The homoskedasticity-only and heteroskedasticity-robust standard errors of $\hat\beta_1$ differ when the errors are heteroskedastic, and only the robust formula is then reliable.
                                              Number of obs =     420
                                              F(  1,   418) =   19.26
                                              Prob > F      =  0.0000
                                              R-squared     =  0.0512
                                              Root MSE      =  18.581

------------------------------------------------------------------------
        |             Robust
testscr |     Coef.   Std. Err.      t    P>|t|   [95% Conf. Interval]
--------+---------------------------------------------------------------
    str | -2.279808   .5194892   -4.39   0.000   -3.300945   -1.258671
  _cons |   698.933   10.36436   67.44   0.000    678.5602    719.3057
------------------------------------------------------------------------
Use the , robust option!!!
Digression on Causality
The original question (what is the quantitative effect of an intervention that reduces class size?) is a question about a causal effect: the effect on $Y$ of applying a unit of the treatment is $\beta_1$.
But what is, precisely, a causal effect?
The common-sense definition of causality is not precise enough for our
purposes.
In this course, we define a causal effect as the effect that is measured in an
ideal randomized controlled experiment.
If the assumption $E(u \mid X = x) = 0$ fails (for example, because STR is correlated with omitted factors such as family income), then $\hat\beta_1$ is biased.