Linear Regression
STAT-S-301
Policy question: What is the effect of reducing class size by one student per
class? By x students/class?
What is the right output (performance) measure?
parent satisfaction
student personal development
future adult welfare
future adult earnings
performance on standardized tests
Do districts with smaller classes (lower STR) have higher test scores?
$$\frac{\Delta\,\text{Test score}}{\Delta\, STR}$$
This is the slope of the line relating test score and STR.
This suggests that we want to draw a line through the Test Score v. STR
scatterplot, but how?
[Scatterplot: Test score vs. STR]
The sample mean $\bar Y$ is the least squares estimator of $E(Y)$; it solves
$$\min_m \sum_{i=1}^{n} (Y_i - m)^2.$$
By analogy, the OLS estimators $\hat\beta_0$ and $\hat\beta_1$ solve
$$\min_{b_0,\, b_1} \sum_{i=1}^{n} \bigl[Y_i - (b_0 + b_1 X_i)\bigr]^2.$$
The OLS estimator minimizes the average squared difference between the
actual values of Yi and the prediction (predicted value) based on the estimated
line.
This minimization problem can be solved using calculus. The result is the OLS estimators of $\beta_0$ and $\beta_1$:
$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^{n} (X_i - \bar X)^2}, \qquad \hat\beta_0 = \bar Y - \hat\beta_1 \bar X.$$
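To make the formulas concrete, here is a minimal numerical sketch (not part of the original slides): it computes $\hat\beta_0$ and $\hat\beta_1$ directly from these formulas on a synthetic data set whose values are made up purely for illustration.

```python
import numpy as np

# Synthetic data (hypothetical, for illustration only): Y = 700 - 2*X + noise
rng = np.random.default_rng(0)
n = 420
X = rng.uniform(14, 26, size=n)             # "student-teacher ratios"
Y = 700 - 2.0 * X + rng.normal(0, 15, n)    # "test scores"

# OLS formulas: slope and intercept that minimize the sum of squared residuals
Xbar, Ybar = X.mean(), Y.mean()
beta1_hat = np.sum((X - Xbar) * (Y - Ybar)) / np.sum((X - Xbar) ** 2)
beta0_hat = Ybar - beta1_hat * Xbar
print(beta0_hat, beta1_hat)                 # close to the true values 700 and -2
```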
Estimated slope: $\hat\beta_1 = -2.28$
Estimated intercept: $\hat\beta_0 = 698.9$
Estimated regression line: $\widehat{\text{TestScore}} = 698.9 - 2.28 \times STR$
That is, $\dfrac{\Delta\,\text{Test score}}{\Delta\, STR} = -2.28$: districts with one more student per teacher have, on average, test scores that are 2.28 points lower.
The intercept (taken literally) means that, according to this estimated line,
districts with zero students per teacher would have a (predicted) test score
of 698.9.
This interpretation of the intercept makes no sense: it extrapolates the line outside the range of the data. In this application, the intercept is not itself economically meaningful.
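With the estimated line in hand, the policy question from the start of the handout reduces to simple arithmetic. A small sketch (the reduction of 2 students per class is a hypothetical scenario):

```python
beta0_hat, beta1_hat = 698.9, -2.28    # estimates from the test score regression

# Predicted effect of reducing STR by 2 students per class
delta_STR = -2
print(beta1_hat * delta_STR)           # +4.56 points

# Predicted test score for a district with STR = 20 (inside the data range)
print(beta0_hat + beta1_hat * 20)      # about 653.3
```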
Predicted value: $\hat Y_i = \hat\beta_0 + \hat\beta_1 X_i$. Residual: $\hat u_i = Y_i - \hat Y_i$, the difference between the actual value $Y_i$ and its OLS predicted value.
                                              Number of obs =     420
                                              F(  1,   418) =   19.26
                                              Prob > F      =  0.0000
                                              R-squared     =  0.0512
                                              Root MSE      =  18.581

------------------------------------------------------------------------
        |             Robust
testscr |     Coef.   Std. Err.      t    P>|t|   [95% Conf. Interval]
--------+---------------------------------------------------------------
    str | -2.279808   .5194892   -4.39   0.000   -3.300945   -1.258671
  _cons |   698.933   10.36436   67.44   0.000    678.5602    719.3057
------------------------------------------------------------------------
$\hat\beta_1$ is computed from a sample of data; a different sample would give a different estimate. How precise is $\hat\beta_1$?
Data and sampling
The population objects (parameters) $\beta_0$ and $\beta_1$ are unknown; to draw inferences about these unknown parameters we must collect relevant data.
Simple random sampling:
Choose n entities at random from the population of interest, and observe
(record) X and Y for each entity
Simple random sampling implies that $\{(X_i, Y_i)\}$, $i = 1, \dots, n$, are independently and identically distributed (i.i.d.). (Note: $(X_i, Y_i)$ are distributed independently of $(X_j, Y_j)$ for different observations $i$ and $j$.)
Task at hand: to characterize the sampling distribution of the OLS estimator. To
do so, we make three assumptions:
1. The conditional distribution of $u$ given $X$ has mean zero: $E(u \mid X = x) = 0$.
2. $(X_i, Y_i)$, $i = 1, \dots, n$, are i.i.d.
3. Large outliers in $X$ and/or $Y$ are rare.
Other factors:
parental involvement
outside learning opportunities (extra math classes, etc.)
home environment conducive to reading
family income is a useful proxy for many such factors
So E(u|X=x) = 0 means E(Family Income|STR) = constant (which implies that
family income and STR are uncorrelated). This assumption is not innocuous!
2. The sampling distribution of the OLS estimator
Like $\bar Y$, $\hat\beta_1$ has a sampling distribution; we first derive its mean and variance.
The mean of $\hat\beta_1$:
$$Y_i = \beta_0 + \beta_1 X_i + u_i \qquad\text{and}\qquad \bar Y = \beta_0 + \beta_1 \bar X + \bar u,$$
so
$$Y_i - \bar Y = \beta_1 (X_i - \bar X) + (u_i - \bar u).$$
Thus,
$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^{n} (X_i - \bar X)^2}
= \frac{\sum_{i=1}^{n} (X_i - \bar X)\bigl[\beta_1 (X_i - \bar X) + (u_i - \bar u)\bigr]}{\sum_{i=1}^{n} (X_i - \bar X)^2}.$$
$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (X_i - \bar X)\bigl[\beta_1 (X_i - \bar X) + (u_i - \bar u)\bigr]}{\sum_{i=1}^{n} (X_i - \bar X)^2}
= \beta_1\,\frac{\sum_{i=1}^{n} (X_i - \bar X)(X_i - \bar X)}{\sum_{i=1}^{n} (X_i - \bar X)^2}
+ \frac{\sum_{i=1}^{n} (X_i - \bar X)(u_i - \bar u)}{\sum_{i=1}^{n} (X_i - \bar X)^2},$$
so
$$\hat\beta_1 - \beta_1 = \frac{\sum_{i=1}^{n} (X_i - \bar X)(u_i - \bar u)}{\sum_{i=1}^{n} (X_i - \bar X)^2}.$$
Next, note that
$$\sum_{i=1}^{n} (X_i - \bar X)(u_i - \bar u)
= \sum_{i=1}^{n} (X_i - \bar X) u_i - \Bigl[\sum_{i=1}^{n} (X_i - \bar X)\Bigr]\bar u
= \sum_{i=1}^{n} (X_i - \bar X) u_i,$$
because $\sum_{i=1}^{n} (X_i - \bar X) = 0$. Thus,
$$\hat\beta_1 - \beta_1 = \frac{\sum_{i=1}^{n} (X_i - \bar X) u_i}{\sum_{i=1}^{n} (X_i - \bar X)^2}
= \frac{\dfrac{1}{n}\sum_{i=1}^{n} v_i}{\dfrac{n-1}{n}\, s_X^2},$$
where $v_i = (X_i - \bar X) u_i$ and $s_X^2 = \dfrac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar X)^2$ is the sample variance of $X$.
Collecting terms,
$$\hat\beta_1 - \beta_1 = \frac{\dfrac{1}{n}\sum_{i=1}^{n} v_i}{\dfrac{n-1}{n}\, s_X^2},
\qquad\text{where } v_i = (X_i - \bar X) u_i.$$
The mean of $\hat\beta_1$:
$$E(\hat\beta_1 - \beta_1)
= E\!\left[\frac{\dfrac{1}{n}\sum_{i=1}^{n} v_i}{\dfrac{n-1}{n}\, s_X^2}\right]
= \frac{n}{n-1}\, E\!\left[\frac{1}{n}\sum_{i=1}^{n} \frac{v_i}{s_X^2}\right].$$
Now
$$E\!\left(\frac{v_i}{s_X^2}\right) = E\!\left[\frac{(X_i - \bar X) u_i}{s_X^2}\right] = 0,$$
because $E(u_i \mid X_i = x) = 0$. Thus,
$$E(\hat\beta_1 - \beta_1) = \frac{n}{n-1}\, E\!\left[\frac{1}{n}\sum_{i=1}^{n} \frac{v_i}{s_X^2}\right] = 0,$$
so
$$E(\hat\beta_1) = \beta_1.$$
That is, $\hat\beta_1$ is an unbiased estimator of $\beta_1$.
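A Monte Carlo sketch of this unbiasedness result (my own illustration, with made-up population values, not from the slides): draw many i.i.d. samples from a population in which $E(u \mid X) = 0$ holds by construction and average the resulting $\hat\beta_1$'s.

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1 = 5.0, 2.0                 # hypothetical true coefficients
n, n_reps = 100, 5000

estimates = np.empty(n_reps)
for r in range(n_reps):
    X = rng.normal(10, 2, size=n)
    u = rng.normal(0, 3, size=n)        # E(u|X) = 0 by construction
    Y = beta0 + beta1 * X + u
    Xbar = X.mean()
    estimates[r] = np.sum((X - Xbar) * (Y - Y.mean())) / np.sum((X - Xbar) ** 2)

print(estimates.mean())                 # close to beta1 = 2.0: beta1_hat is unbiased
```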
Next calculate the variance of $\hat\beta_1$. Write
$$\hat\beta_1 - \beta_1 = \frac{\dfrac{1}{n}\sum_{i=1}^{n} v_i}{\dfrac{n-1}{n}\, s_X^2}.$$
When $n$ is large, $\dfrac{n-1}{n} \approx 1$ and $s_X^2$ can be replaced by $\sigma_X^2$, so
$$\operatorname{var}(\hat\beta_1) = \frac{\operatorname{var}(\bar v)}{(\sigma_X^2)^2} = \frac{\operatorname{var}(v)}{n\,\sigma_X^4},
\qquad\text{where } \bar v = \frac{1}{n}\sum_{i=1}^{n} v_i.$$
The exact sampling distribution is complicated, but when the sample size is large we get some simple (and good) approximations:
(1) Because $\operatorname{var}(\hat\beta_1) = \dfrac{\operatorname{var}(v)}{n\,\sigma_X^4} \to 0$ as $n$ grows and $E(\hat\beta_1) = \beta_1$, we have $\hat\beta_1 \xrightarrow{\ p\ } \beta_1$; that is, $\hat\beta_1$ is consistent.
(2) When $n$ is large, the sampling distribution of $\hat\beta_1$ is well approximated by a normal distribution (CLT).
$$\hat\beta_1 - \beta_1 = \frac{\dfrac{1}{n}\sum_{i=1}^{n} v_i}{\dfrac{n-1}{n}\, s_X^2},
\qquad\text{where } v_i = (X_i - \bar X) u_i.$$
When $n$ is large:
$v_i = (X_i - \bar X) u_i \approx (X_i - \mu_X) u_i$, which is i.i.d. (why?) and has two moments, that is, $\operatorname{var}(v_i) < \infty$ (why?). Thus $\dfrac{1}{n}\sum_{i=1}^{n} v_i$ is approximately distributed $N(0, \operatorname{var}(v)/n)$ when $n$ is large.
Putting these together, the large-$n$ distribution of $\hat\beta_1$ is:
$$\hat\beta_1 - \beta_1 = \frac{\dfrac{1}{n}\sum_{i=1}^{n} v_i}{\dfrac{n-1}{n}\, s_X^2}
\approx \frac{\dfrac{1}{n}\sum_{i=1}^{n} v_i}{\sigma_X^2},$$
which is approximately distributed $N\!\left(0,\ \dfrac{\sigma_v^2}{n(\sigma_X^2)^2}\right)$.
Because $v_i = (X_i - \bar X) u_i \approx (X_i - \mu_X) u_i$, $\hat\beta_1$ is approximately distributed
$$N\!\left(\beta_1,\ \frac{\operatorname{var}[(X_i - \mu_X) u_i]}{n\,\sigma_X^4}\right).$$
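The same kind of simulation can be used to check this large-$n$ variance formula (again a sketch with made-up population values; when $u$ is independent of $X$ and homoskedastic, $\operatorname{var}[(X - \mu_X)u] = \sigma_X^2 \sigma_u^2$, so the formula simplifies as in the comment below).

```python
import numpy as np

rng = np.random.default_rng(2)
beta0, beta1 = 5.0, 2.0
n, n_reps = 200, 20000
sigma_X, sigma_u = 2.0, 3.0

estimates = np.empty(n_reps)
for r in range(n_reps):
    X = rng.normal(10, sigma_X, size=n)
    u = rng.normal(0, sigma_u, size=n)
    Y = beta0 + beta1 * X + u
    Xbar = X.mean()
    estimates[r] = np.sum((X - Xbar) * (Y - Y.mean())) / np.sum((X - Xbar) ** 2)

# var(beta1_hat) ~ var[(X - muX) u] / (n * sigma_X^4) = sigma_u^2 / (n * sigma_X^2)
var_formula = sigma_u**2 / (n * sigma_X**2)
print(estimates.var(), var_formula)      # the two numbers should be close
```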
Recall the analogous results for $\bar Y$: other than its mean and variance, the exact distribution of $\bar Y$ is complicated and depends on the distribution of $Y$;
$$\bar Y \xrightarrow{\ p\ } \mu_Y;$$
and
$$\frac{\bar Y - E(\bar Y)}{\sqrt{\operatorname{var}(\bar Y)}}$$
is approximately distributed $N(0, 1)$ when $n$ is large.
The same is true of $\hat\beta_1$: $\hat\beta_1$ has mean $\beta_1$ ($\hat\beta_1$ is unbiased); $\hat\beta_1$ is consistent ($\hat\beta_1 \xrightarrow{\ p\ } \beta_1$); and
$$\frac{\hat\beta_1 - E(\hat\beta_1)}{\sqrt{\operatorname{var}(\hat\beta_1)}}$$
is approximately distributed $N(0, 1)$ when $n$ is large.
3. Hypothesis Testing
Suppose a skeptic suggests that reducing the number of students in a class has no effect on learning or, specifically, on test scores. The skeptic thus asserts the hypothesis
$$H_0\colon \beta_1 = 0.$$
We wish to use data to test this hypothesis and reach a tentative conclusion about whether it is correct or incorrect.
For testing the mean of $Y$:
$$t = \frac{\bar Y - \mu_{Y,0}}{s_Y / \sqrt{n}}.$$
For testing $\beta_1$:
$$t = \frac{\hat\beta_1 - \beta_{1,0}}{SE(\hat\beta_1)},$$
where $\beta_{1,0}$ is the value of $\beta_1$ hypothesized under the null (for example, if the null is $H_0\colon \beta_1 = 0$, then $\beta_{1,0} = 0$).
Variance of $\hat\beta_1$ (large $n$):
$$\operatorname{var}(\hat\beta_1) = \frac{\operatorname{var}[(X_i - \mu_X) u_i]}{n (\sigma_X^2)^2} = \frac{\sigma_v^2}{n\,\sigma_X^4},
\qquad\text{where } v_i = (X_i - \bar X) u_i.$$
The estimator of the variance of $\hat\beta_1$ replaces the unknown population quantities with estimators constructed from the data:
$$\hat\sigma^2_{\hat\beta_1}
= \frac{1}{n} \times \frac{\text{estimator of } \sigma_v^2}{\bigl(\text{estimator of } \sigma_X^2\bigr)^2}
= \frac{1}{n} \times \frac{\dfrac{1}{n-2}\sum_{i=1}^{n} (X_i - \bar X)^2 \hat u_i^2}{\left[\dfrac{1}{n}\sum_{i=1}^{n} (X_i - \bar X)^2\right]^2},$$
and $SE(\hat\beta_1) = \sqrt{\hat\sigma^2_{\hat\beta_1}}$.
It is less complicated than it seems: the numerator estimates $\operatorname{var}(v)$ and the denominator estimates $[\operatorname{var}(X)]^2$.
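A sketch of this variance estimator in code (my construction; `X` and `Y` stand for any sample, such as the synthetic one above). This is the heteroskedasticity-robust formula, which is what Stata's `, robust` option reports (HC1), so applied to the class size data it should reproduce the 0.52 standard error up to rounding.

```python
import numpy as np

def se_beta1(X, Y):
    """SE(beta1_hat) from the variance estimator above (heteroskedasticity-robust)."""
    n = len(X)
    Xbar = X.mean()
    beta1 = np.sum((X - Xbar) * (Y - Y.mean())) / np.sum((X - Xbar) ** 2)
    beta0 = Y.mean() - beta1 * Xbar
    u_hat = Y - (beta0 + beta1 * X)                        # OLS residuals
    num = np.sum((X - Xbar) ** 2 * u_hat ** 2) / (n - 2)   # estimates var(v)
    den = (np.sum((X - Xbar) ** 2) / n) ** 2               # estimates [var(X)]^2
    return np.sqrt(num / den / n)
```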
Construct the $t$-statistic:
$$t = \frac{\hat\beta_1 - \beta_{1,0}}{SE(\hat\beta_1)} = \frac{\hat\beta_1 - \beta_{1,0}}{\sqrt{\hat\sigma^2_{\hat\beta_1}}}.$$
For the test score data, $\hat\beta_1 = -2.28$ and $SE(\hat\beta_1) = 0.52$, so
$$t = \frac{\hat\beta_1 - \beta_{1,0}}{SE(\hat\beta_1)} = \frac{-2.28 - 0}{0.52} = -4.38.$$
The p-value based on the large-$n$ standard normal approximation to the $t$-statistic is 0.00001 (about $10^{-5}$).
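Reproducing the $t$-statistic and $p$-value from the reported numbers (the standard normal CDF is evaluated through the complementary error function):

```python
import math

beta1_hat, se_beta1, beta1_null = -2.28, 0.52, 0.0
t = (beta1_hat - beta1_null) / se_beta1

# Two-sided p-value from the standard normal approximation: 2*(1 - Phi(|t|))
p_value = math.erfc(abs(t) / math.sqrt(2))
print(t, p_value)                          # t ~ -4.38, p ~ 1e-5
```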
4. Confidence intervals
From the regression output, the standard error of $\hat\beta_0$ is $SE(\hat\beta_0) = 10.4$ and the standard error of $\hat\beta_1$ is $SE(\hat\beta_1) = 0.52$.
A 95% confidence interval for $\beta_1$ is
$$\hat\beta_1 \pm 1.96 \times SE(\hat\beta_1) = -2.28 \pm 1.96 \times 0.52 = (-3.30,\ -1.26).$$
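The same two numbers give the confidence interval directly (1.96 is the large-$n$ 5% two-sided normal critical value):

```python
beta1_hat, se_beta1 = -2.28, 0.52
ci = (beta1_hat - 1.96 * se_beta1, beta1_hat + 1.96 * se_beta1)
print(ci)                                  # about (-3.30, -1.26), matching the output below
```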
                                              Number of obs =     420
                                              F(  1,   418) =   19.26
                                              Prob > F      =  0.0000
                                              R-squared     =  0.0512
                                              Root MSE      =  18.581

------------------------------------------------------------------------
        |             Robust
testscr |     Coef.   Std. Err.      t    P>|t|   [95% Conf. Interval]
--------+---------------------------------------------------------------
    str | -2.279808   .5194892   -4.39   0.000   -3.300945   -1.258671
  _cons |   698.933   10.36436   67.44   0.000    678.5602    719.3057
------------------------------------------------------------------------
so: the 95% confidence interval reported for the coefficient on str is $(-3.30,\ -1.26)$.
Regression when $X$ is binary:
When $X_i = 0$: $Y_i = \beta_0 + u_i$
When $X_i = 1$: $Y_i = \beta_0 + \beta_1 + u_i$
thus:
When $X_i = 0$, the mean of $Y_i$ is $\beta_0$
When $X_i = 1$, the mean of $Y_i$ is $\beta_0 + \beta_1$
that is:
$E(Y_i \mid X_i = 0) = \beta_0$
$E(Y_i \mid X_i = 1) = \beta_0 + \beta_1$
so:
$\beta_1 = E(Y_i \mid X_i = 1) - E(Y_i \mid X_i = 0)$
= the population difference in group means.
$$D_i = \begin{cases} 1 & \text{if } STR_i < 20 \\ 0 & \text{if } STR_i \ge 20 \end{cases}$$
The OLS estimate of the regression line relating TestScore to $D$ is
$$\widehat{\text{TestScore}} = 650.0 + 7.4\, D,$$
with a standard error on the coefficient of $D$ of about 1.8 (compare the group means below).
Compare the regression results with the group means, computed directly:

Class size           Average score ($\bar Y$)    N
Small (STR < 20)     657.4                       238
Large (STR ≥ 20)     650.0                       182

Estimation: $\bar Y_s - \bar Y_l = 657.4 - 650.0 = 7.4$
Test of the hypothesis that the difference in means is zero:
$$t = \frac{\bar Y_s - \bar Y_l}{SE(\bar Y_s - \bar Y_l)} = \frac{7.4}{1.83} = 4.05.$$
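A sketch of the equivalence between the OLS slope on a binary regressor and the difference in group means (the data here are simulated to mimic the group means above and are purely illustrative):

```python
import numpy as np

# Simulated scores for "small" (D = 1) and "large" (D = 0) class districts
rng = np.random.default_rng(3)
y_small = rng.normal(657, 19, size=238)
y_large = rng.normal(650, 19, size=182)
Y = np.concatenate([y_small, y_large])
D = np.concatenate([np.ones(238), np.zeros(182)])

# OLS slope on the binary regressor D ...
Dbar = D.mean()
slope = np.sum((D - Dbar) * (Y - Y.mean())) / np.sum((D - Dbar) ** 2)

# ... equals the difference in group means
diff = y_small.mean() - y_large.mean()
print(slope, diff)                          # identical up to floating point

# t-statistic for the difference in means (unpooled standard error)
se = np.sqrt(y_small.var(ddof=1) / len(y_small) + y_large.var(ddof=1) / len(y_large))
print(diff / se)
```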
The $R^2$:
Write $Y_i$ as the sum of the OLS prediction and the OLS residual: $Y_i = \hat Y_i + \hat u_i$.
$$R^2 = \frac{ESS}{TSS},
\qquad\text{where } ESS = \sum_{i=1}^{n} (\hat Y_i - \bar Y)^2
\ \text{ and }\ TSS = \sum_{i=1}^{n} (Y_i - \bar Y)^2.$$
The $R^2$:
$R^2 = 0$ means $ESS = 0$, so $X$ explains none of the variation of $Y$.
$R^2 = 1$ means $ESS = TSS$, so $\hat Y_i = Y_i$ and $X$ explains all of the variation of $Y$.
$0 \le R^2 \le 1$.
For regression with a single regressor (the case here), $R^2$ is the square of the correlation coefficient between $X$ and $Y$.
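A sketch computing $R^2$ both ways, as ESS/TSS and as the squared correlation, for any single-regressor sample `X`, `Y` (such as the synthetic one above):

```python
import numpy as np

def r_squared(X, Y):
    """R^2 = ESS/TSS for the OLS regression of Y on X."""
    beta1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    beta0 = Y.mean() - beta1 * X.mean()
    Y_hat = beta0 + beta1 * X
    ess = np.sum((Y_hat - Y.mean()) ** 2)    # explained sum of squares
    tss = np.sum((Y - Y.mean()) ** 2)        # total sum of squares
    return ess / tss

# With a single regressor, r_squared(X, Y) equals np.corrcoef(X, Y)[0, 1] ** 2.
```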
The standard error of the regression (SER) is (almost) the sample standard deviation of the OLS residuals:
$$SER = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n} (\hat u_i - \bar{\hat u})^2}
= \sqrt{\frac{1}{n-2}\sum_{i=1}^{n} \hat u_i^2}.$$
(The second equality holds because $\bar{\hat u} = \dfrac{1}{n}\sum_{i=1}^{n} \hat u_i = 0$.)
$$SER = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n} \hat u_i^2}$$
The SER:
has the units of $u$, which are the units of $Y$
measures the spread of the distribution of $u$
measures the average size of the OLS residual (the average "mistake" made by the OLS regression line)
The root mean squared error (RMSE) is closely related to the SER:
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \hat u_i^2}$$
This measures the same thing as the SER; the minor difference is division by $1/n$ instead of $1/(n-2)$.
Technical note: why divide by $n-2$ instead of $n-1$ in
$$SER = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n} \hat u_i^2}\,?$$
Division by $n-2$ is a degrees-of-freedom correction, just like the division by $n-1$ in $s_Y^2$; the difference is that, in the SER, two parameters have been estimated ($\beta_0$ and $\beta_1$, by $\hat\beta_0$ and $\hat\beta_1$), whereas in $s_Y^2$ only one has been estimated ($\mu_Y$, by $\bar Y$).
When $n$ is large, it makes negligible difference whether $n$, $n-1$, or $n-2$ is used, although the conventional formula uses $n-2$ when there is a single regressor.
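A sketch computing the SER and RMSE from a vector of OLS residuals `u_hat` (a hypothetical input, e.g. the residuals from the earlier synthetic regression):

```python
import numpy as np

def ser_and_rmse(u_hat):
    """Standard error of the regression (divides by n - 2) and RMSE (divides by n)."""
    n = len(u_hat)
    ser = np.sqrt(np.sum(u_hat ** 2) / (n - 2))
    rmse = np.sqrt(np.sum(u_hat ** 2) / n)
    return ser, rmse
```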
Heteroskedasticity, homoskedasticity, and the standard errors of $\hat\beta_0$ and $\hat\beta_1$
If $\operatorname{var}(u \mid X = x)$ is constant (it does not depend on $x$), then $u$ is said to be homoskedastic; otherwise, $u$ is heteroskedastic.
Homoskedasticity in a picture:
Heteroskedasticity in a picture:
[Scatterplot with fitted values; years of education (5 to 20) on the horizontal axis]
Is heteroskedasticity present in the class size data? Hard to say: the scatter looks nearly homoskedastic, but the spread might be tighter for large values of STR.
If $\operatorname{var}(u_i \mid X_i = x) = \sigma_u^2$ (that is, if $u$ is homoskedastic), then
$$\operatorname{var}(\hat\beta_1) = \frac{\operatorname{var}[(X_i - \mu_X) u_i]}{n (\sigma_X^2)^2} = \frac{\sigma_u^2}{n\,\sigma_X^2}.$$
Homoskedasticity thus simplifies the formula for the variance of $\hat\beta_1$.
The heteroskedasticity-robust estimator of the variance of $\hat\beta_1$ is
$$\hat\sigma^2_{\hat\beta_1}
= \frac{1}{n} \times \frac{\dfrac{1}{n-2}\sum_{i=1}^{n} (X_i - \bar X)^2 \hat u_i^2}{\left[\dfrac{1}{n}\sum_{i=1}^{n} (X_i - \bar X)^2\right]^2},$$
instead of the homoskedasticity-only estimator
$$\tilde\sigma^2_{\hat\beta_1}
= \frac{1}{n} \times \frac{\dfrac{1}{n-2}\sum_{i=1}^{n} \hat u_i^2}{\dfrac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar X)^2}.$$
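A sketch contrasting the two estimators on data that are heteroskedastic by construction (my own illustration with made-up values; with real data one does not know which case applies, which is the argument for using the robust formula):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 420
X = rng.uniform(14, 26, size=n)
u = rng.normal(0, 1, size=n) * (X - 14)     # sd(u|X) grows with X: heteroskedastic
Y = 700 - 2.0 * X + u

Xbar = X.mean()
sxx = np.sum((X - Xbar) ** 2)
beta1 = np.sum((X - Xbar) * (Y - Y.mean())) / sxx
beta0 = Y.mean() - beta1 * Xbar
u_hat = Y - (beta0 + beta1 * X)

# Heteroskedasticity-robust variance estimator (first formula above)
var_robust = (np.sum((X - Xbar) ** 2 * u_hat ** 2) / (n - 2)) / ((sxx / n) ** 2) / n
# Homoskedasticity-only variance estimator (second formula above)
var_homosk = (np.sum(u_hat ** 2) / (n - 2)) / (sxx / (n - 1)) / n

print(np.sqrt(var_robust), np.sqrt(var_homosk))   # these differ under heteroskedasticity
```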
The homoskedasticity-only and heteroskedasticity-robust standard errors of $\hat\beta_1$ differ when the errors are heteroskedastic, and only the robust formula is then reliable.
                                              Number of obs =     420
                                              F(  1,   418) =   19.26
                                              Prob > F      =  0.0000
                                              R-squared     =  0.0512
                                              Root MSE      =  18.581

------------------------------------------------------------------------
        |             Robust
testscr |     Coef.   Std. Err.      t    P>|t|   [95% Conf. Interval]
--------+---------------------------------------------------------------
    str | -2.279808   .5194892   -4.39   0.000   -3.300945   -1.258671
  _cons |   698.933   10.36436   67.44   0.000    678.5602    719.3057
------------------------------------------------------------------------
Use the , robust option!!!
Digression on Causality
The original question (what is the quantitative effect of an intervention that reduces class size?) is a question about a causal effect: the effect on $Y$ of applying a unit of the treatment is $\beta_1$.
But what is, precisely, a causal effect?
The common-sense definition of causality is not precise enough for our
purposes.
In this course, we define a causal effect as the effect that is measured in an
ideal randomized controlled experiment.
If the assumption $E(u \mid X = x) = 0$ fails (for example, because STR is correlated with omitted factors such as family income), then $\hat\beta_1$ is biased.