
EC212: Introduction to Econometrics

Simple Regression Model


(Wooldridge, Ch. 2)

Tatiana Komarova

London School of Economics

Summer 2021

1
1. Definition of simple regression model
(Wooldridge, Ch. 2.1)

2
Simple regression

• We begin with cross-sectional analysis and assume we can collect a random sample from the population of interest

• Once we have a population in mind, consider two variables x and y. We study how y varies with changes in x

• Examples: x is years of schooling, y is hourly wage; x is class size, y is test score

3
Issues

• We must confront three issues:

1. How do we allow factors other than x to affect y? Usually there is no exact relationship between two variables
2. What is the functional relationship between y and x?
3. How can we capture a ceteris paribus relationship between y and x?

4
Simple linear regression model

• To address these issues, consider the following equation relating y to x

y = β0 + β1 x + u

which is assumed to hold in the population of interest

• This equation defines the simple linear regression model (or bivariate regression model)

• We explain y in terms of x (not x in terms of y)

• In this lecture, y is called the dependent variable, x is called the regressor, and u is called the error term (among other names)

5
How does the model deal with the three issues?

• Look at the simple linear regression model

y = β0 + β1 x + u

• This equation explicitly allows other factors, captured by u, to affect y (Issue #1 is addressed)

• This equation also addresses Issue #2, functional form. Namely, y is assumed to be linearly related to x

• We call β0 the intercept parameter and β1 the slope parameter. Our ultimate goal is to estimate them

• See Appendix A.2 for review on linear functions

6
• The equation also addresses Issue #3, the ceteris paribus issue. In

y = β0 + β1 x + u

all other factors affecting y are lumped into u

• We want to know how y changes when x changes, holding u fixed

• Let ∆ denote “change”. Holding u fixed means ∆u = 0. So

∆y = β1 ∆x + ∆u = β1 ∆x when ∆u = 0

• This equation also defines β1 as the slope

7
Examples
• A model to explain crop yield by fertilizer use is

yield = β0 + β1 fertilizer + u

where u contains land quality, rainfall on the plot of land, and so on. β1 tells us how yield changes when the amount of fertilizer changes, holding all else fixed

• Wage equation

wage = β0 + β1 educ + u

where u contains somewhat vague factors (“ability”) but also past work experience, tenure on the job, etc. In this case

∆wage = β1 ∆educ when ∆u = 0

8
Impose assumptions to estimate β0 and β1

• Consider the simple linear regression model

y = β0 + β1 x + u

• We wish to estimate the parameters β0 and β1 from a random sample of y and x. But we never observe u

• To estimate β0 and β1 , we restrict how u and x are related

9
Are we done with the causality issue?
• Thus, β1 measures the effect of x on y, holding all other factors (in u) fixed
• Are we done with the causality issue? – This seems too easy!

• Unfortunately, no

• How can we really learn the ceteris paribus effect of x on y, holding other factors fixed, when we are ignoring all those other factors?
• How can we hope to generally estimate the ceteris paribus effect of x on y when we have assumed all other factors affecting y are unobserved and lumped into u?
• The key is that the simple linear regression (SLR) model is a population model. When it comes to estimating β1 (and β0 ) using a random sample of data, we must restrict how u and x are related to each other

10
x and u are viewed as random variables
• x and u are viewed as random variables having distributions
in the population
• For example, if x = educ, then in principle we could figure out
its distribution in the population of adults over 30 years old,
say. See below the histogram of educ obtained from WAGE1
data.

11
x and u are viewed as random variables (cont.)

• Suppose u is cognitive ability. Assuming we can measure cognitive ability, even conditional on a given education level, there will be a range of individuals with different cognitive ability, i.e. u will have a distribution in the population

• So, what we must do is restrict how u and x are related in the population

• See App. B.1-4 for review on random variables and their distributions

12
Notation for random variables

• For this course, we follow the notation of Wooldridge and denote random variables by lower case letters, e.g. y, x, yi , xi

• Sometimes the same notation is used for particular outcomes (or realizations) of random variables

• For example, if we write E (y) = µ, y should be viewed as a random variable. On the other hand, if we write y = 1.6, it should be viewed as a particular outcome

• The distinction should be clear from context

13
First assumption: E (u) = 0

• First, we assume the expected value of u is zero in the population

E (u) = 0

where E (·) means the expected value

• This is an innocuous normalization and can be imposed without loss of generality

• E.g. normalizing “land quality” or “ability” to have zero expected value in the population should be harmless

14
• Also the presence of β0 makes the assumption E (u) = 0 innocuous

• When E (u) = α0 ≠ 0, we can rewrite the model as

y = β0 + β1 x + u
  = (β0 + α0 ) + β1 x + (u − α0 )
  = β0∗ + β1 x + u∗

where the new error term u∗ = u − α0 satisfies E (u∗ ) = 0

• The new intercept is β0∗ = β0 + α0 . The important point is: the slope β1 does not change

15
Second “key” assumption: E (u|x) = E (u)

• To restrict how x and u are related, we assume

E (u|x) = E (u) for each value of x

i.e. conditional expectation of u given x does not depend on x

• E (u|x) means “the expected value of u given x”

• We say u is mean independent of x

16
Example: Ability and education

• What does this assumption mean?

• Suppose u is “ability” and x is years of education. The mean independence assumption says

E (ability|x = 8) = E (ability|x = 12) = E (ability|x = 16)

i.e. average ability is the same in the portions of the population with an 8th grade education, a 12th grade education and a four-year college education

• Because people choose education levels partly based on ability, this assumption is almost certainly false

17
Example: Land quality and fertilizer amount

• Suppose u is “land quality” and x is fertilizer amount. In this case

E (u|x) = E (u)

holds true if fertilizer amounts are chosen independently of land quality

• This assumption is reasonable as long as fertilizer amounts are assigned at random

18
Implications of assumptions

• Combining E (u) = 0 and E (u|x) = E (u) gives

E (u|x) = 0 for all values of x

• By properties of conditional expectation (Property CE.1-2 in App. B.4), E (u|x) = 0 implies

E (y|x) = E (β0 + β1 x + u|x)
        = β0 + β1 x + E (u|x)
        = β0 + β1 x

• E (y|x) = β0 + β1 x is called the population regression function

19
Population regression and distribution of y given x
• Straight line is population regression function
E (y |x) = β0 + β1 x. Conditional distribution of y at three
different values of x are superimposed. Remember that
y = β0 + β1 x + u and u has distribution in population

20
2. Deriving ordinary least squares estimates
(Wooldridge, Ch. 2.2)

21
Purpose here

• Given data on x and y, how can we estimate the population parameters β0 and β1 ?

• Let {(xi , yi ) : i = 1, 2, . . . , n} be a sample of size n (number of observations) from the population. Think of this as a random sample

• The next graph shows n = 15 families and the population regression of saving on income

22
23
Conditions for β0 and β1 in population

• To estimate β0 and β1 , we use two conditions

E (u) = 0
E (xu) = 0

• The first condition is innocuous, and under this condition

Cov (x, u) = E (xu) − E (x)E (u) = E (xu)

so the second condition says: x and u are uncorrelated

• Both conditions are implied by

E (u|x) = 0

(see App. B.4, Property CE.5)

24
Conditions for parameters

• By plugging u = y − β0 − β1 x into the two conditions, we get equations for β0 and β1

E (u) = E (y − β0 − β1 x) = 0
E (xu) = E [x(y − β0 − β1 x)] = 0

• These conditions determine β0 and β1 in the population

• But we do not observe everyone in the population. So we do not know the pdf of (y, x) and cannot compute E (·)

• What we have is a random sample from the population (just a subset). So we use sample analogs to estimate β0 and β1 (called method of moments)

25
Method of moments

• Estimate the expected value “E (·)” by the sample average “(1/n) Σⁿᵢ₌₁ ·” (sample analog)

• E.g. E (x) (population mean) is estimated by x̄ = (1/n) Σⁿᵢ₌₁ xi (sample mean)

• E.g. Var (x) = E [(x − E (x))²] (population variance) is estimated by (1/n) Σⁿᵢ₌₁ (xi − x̄)² (sample variance) or often (1/(n − 1)) Σⁿᵢ₌₁ (xi − x̄)²

• See App. A.1 for review on the summation operator “Σⁿᵢ₌₁”

26
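As an aside (a minimal sketch, not part of the slides), these sample analogs are exactly what Stata's summarize command reports; note that its stored r(Var) uses the n − 1 denominator:

* assumes a variable x is in memory, e.g. educ from WAGE1.dta
quietly summarize x
display "sample mean:     " r(mean)
display "sample variance: " r(Var)   // (n-1)-denominator version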
Method of moments for β0 and β1

• To determine the estimates β̂0 and β̂1 , use the sample analogs

(1/n) Σⁿᵢ₌₁ (yi − β̂0 − β̂1 xi ) = 0
(1/n) Σⁿᵢ₌₁ xi (yi − β̂0 − β̂1 xi ) = 0

• Two linear equations for two unknowns β̂0 and β̂1

• We use the standard notation ȳ = (1/n) Σⁿᵢ₌₁ yi (called sample average)

27
• By properties of the summation operator, the first equation is

0 = (1/n) Σⁿᵢ₌₁ (yi − β̂0 − β̂1 xi )
  = (1/n) Σⁿᵢ₌₁ yi − β̂0 − β̂1 (1/n) Σⁿᵢ₌₁ xi
  = ȳ − β̂0 − β̂1 x̄

which means

β̂0 = ȳ − β̂1 x̄

28
• Plugging this into the second equation (and multiplying both sides by n), we get

Σⁿᵢ₌₁ xi [yi − (ȳ − β̂1 x̄) − β̂1 xi ] = 0

• Simple algebra gives an equation for β̂1

Σⁿᵢ₌₁ xi (yi − ȳ) = β̂1 [Σⁿᵢ₌₁ xi (xi − x̄)]

• We could conclude

β̂1 = Σⁿᵢ₌₁ xi (yi − ȳ) / Σⁿᵢ₌₁ xi (xi − x̄)

if Σⁿᵢ₌₁ xi (xi − x̄) ≠ 0

29
One step ahead

• Now we have a solution for β̂1 . But to get a more intuitive form for β̂1 , we use these facts about the summation operator

Σⁿᵢ₌₁ (xi − x̄) = 0
Σⁿᵢ₌₁ xi (yi − ȳ) = Σⁿᵢ₌₁ (xi − x̄)(yi − ȳ)
Σⁿᵢ₌₁ xi (xi − x̄) = Σⁿᵢ₌₁ (xi − x̄)²

(see App. A.1)

30
OLS estimate

• So, the equation for β̂1 is rewritten as

Σⁿᵢ₌₁ (xi − x̄)(yi − ȳ) = β̂1 [Σⁿᵢ₌₁ (xi − x̄)²]

• If Σⁿᵢ₌₁ (xi − x̄)² > 0, then we get

β̂1 = Σⁿᵢ₌₁ (xi − x̄)(yi − ȳ) / Σⁿᵢ₌₁ (xi − x̄)² = Sample Covariance(xi , yi ) / Sample Variance(xi )

which is easy to memorize

• Based on β̂1 above, the intercept estimate β̂0 is given by

β̂0 = ȳ − β̂1 x̄

31
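A quick numerical check (a sketch, not from the slides; assumes variables y and x in memory) computes the slope as sample covariance over sample variance and compares with Stata's built-in regression:

* OLS estimates "by hand" via the covariance/variance formula
quietly correlate y x, covariance
matrix C = r(C)                   // C[2,1] = cov(x,y), C[2,2] = var(x)
scalar b1 = C[2,1]/C[2,2]
quietly summarize y
scalar ybar = r(mean)
quietly summarize x
scalar b0 = ybar - b1*r(mean)     // intercept: ybar - b1*xbar
display "b0 = " b0 "    b1 = " b1
reg y x                           // should reproduce the same estimates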
Remarks

• For reasons we will see, β̂0 and β̂1 are called ordinary least
squares (OLS) estimates

• Formula for β̂1 is important. It shows how to estimate slope


parameter β1 from sample

• In practice, just let Stata compute it

• β̂1 can be computed whenever sample variance of xi is not


zero, which only rules out very rare situation: all xi ’s take
same value

32
Ordinary least squares
• Where does the name “ordinary least squares” come from?

• For any candidates β̂0 and β̂1 , define the residual

ûi = yi − β̂0 − β̂1 xi for i = 1, . . . , n

• Suppose we measure the size of the mistake for each i by squaring the residual: ûi². Then we add them all up

Σⁿᵢ₌₁ ûi² = Σⁿᵢ₌₁ (yi − β̂0 − β̂1 xi )²

which is called the sum of squared residuals

• If we choose β̂0 and β̂1 to minimize the sum of squared residuals, it can be shown (using calculus) that the solutions are the slope and intercept estimates we obtained by method of moments
33
OLS regression line

• Once we compute β̂0 and β̂1 , we get the OLS regression line as a linear function of x

ŷ = β̂0 + β̂1 x

• This line allows us to predict y for any (sensible) value of x

• The intercept β̂0 is the prediction for y when x = 0 (which is usually meaningless if x = 0 is not possible)

• The slope β̂1 allows us to predict changes in y for any (reasonable) change in x

∆ŷ = β̂1 ∆x

• If ∆x = 1 (x increases by one unit), then ∆ŷ = β̂1

34
Example: Effects of education on hourly wage (WAGE1.dta)
. des wage educ

              storage  display    value
variable name   type   format     label      variable label

wage            float  %8.2g                 average hourly earnings
educ            byte   %8.0g                 years of education

• Model
wage = β0 + β1 educ + u
E (u|educ) = 0
• Estimate β0 and β1 by OLS. General form of the Stata command:

reg y x

The order of y and x is critical!


35
Result for OLS
. reg wage educ

Source SS df MS Number of obs = 526


F( 1, 524) = 103.36
Model 1179.73204 1 1179.73204 Prob > F = 0.0000
Residual 5980.68225 524 11.4135158 R-squared = 0.1648
Adj R-squared = 0.1632
Total 7160.41429 525 13.6388844 Root MSE = 3.3784

wage Coef. Std. Err. t P>|t| [95% Conf. Interval]

educ .5413593 .053248 10.17 0.000 .4367534 .6459651


_cons -.9048516 .6849678 -1.32 0.187 -2.250472 .4407687

• In the output, β̂0 = −0.90 is the Coef. labeled “_cons” and β̂1 = 0.54 is the Coef. labeled “educ”, so the OLS regression line is

ŵage = −0.90 + 0.54 educ
36
• wage is measured in dollars per hour. So each additional year of schooling is estimated to be worth β̂1 = 54 cents per hour

• Plugging in educ = 0 gives the silly prediction ŵage = −0.90. In ranges where data are sparse, we may get strange predictions (only 18 people with educ < 8)

• When educ = 8

ŵage = −0.90 + 0.54(8) = 3.42

• Predicted hourly wage at eight years of education is $3.42, which we can think of as our estimate of the average wage in the population when educ = 8. But no one in the sample earns exactly $3.42: some earn more, some earn less. One worker earns $3.50, which is close

37
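The same prediction can be recovered from the stored coefficients after reg (a small sketch using the regression above):

* fitted wage at educ = 8 from stored coefficients
quietly reg wage educ
display _b[_cons] + _b[educ]*8   // about 3.43 at full precision (3.42 with the rounded coefficients)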
Subsample with educ = 8
. list wage if educ == 8

wage

4. 6
30. 3.3
58. 10
89. 9.9
120. 3

128. 1.5
203. 10
214. 7.4
220. 5.8
221. 3.5

226. 3
266. 2.9
284. 8.5
287. 5
303. 3

331. 3
367. 4.1
406. 8.4
412. 4.4
425. 3

463. 3
487. 2.2

38
Reminder

• When we write the model

wage = β0 + β1 educ + u
E (u|educ) = 0

we do not know β0 and β1

• Rather, β̂0 = −0.90 and β̂1 = 0.54 are our estimates from a particular sample of 526 workers. These estimates may or may not be close to β0 and β1 . If we obtain another sample of 526 workers, the estimates will change

• Intuitively, the population is everyone in a classroom, and the sample is only the people in some row

39
Summary so far

• Simple regression model

y = β0 + β1 x + u, E (u|x) = 0

• Suppose we have a random sample {(yi , xi ) : i = 1, ..., n} from the population. By method of moments using E (u) = 0 and E (xu) = 0, β0 and β1 are estimated by

β̂1 = Σⁿᵢ₌₁ (xi − x̄)(yi − ȳ) / Σⁿᵢ₌₁ (xi − x̄)² = Sample Covariance(xi , yi ) / Sample Variance(xi )
β̂0 = ȳ − β̂1 x̄

which are called OLS estimates

40
3. Properties of OLS on any sample from population
(Algebraic properties of OLS)
(Wooldridge, Ch. 2.3)

41
OLS fitted values and residuals

• Once we have the OLS estimates β̂0 and β̂1 , we get the OLS fitted values by plugging xi into the equation

ŷi = β̂0 + β̂1 xi for i = 1, . . . , n

• The OLS residuals are

ûi = yi − ŷi = yi − β̂0 − β̂1 xi for i = 1, . . . , n

• Note: ûi is different from ui = yi − β0 − β1 xi

42
OLS, then save ŷi and ûi
. reg wage educ

Source SS df MS Number of obs = 526


F( 1, 524) = 103.36
Model 1179.73204 1 1179.73204 Prob > F = 0.0000
Residual 5980.68225 524 11.4135158 R-squared = 0.1648
Adj R-squared = 0.1632
Total 7160.41429 525 13.6388844 Root MSE = 3.3784

wage Coef. Std. Err. t P>|t| [95% Conf. Interval]

educ .5413593 .053248 10.17 0.000 .4367534 .6459651


_cons -.9048516 .6849678 -1.32 0.187 -2.250472 .4407687

. predict wagehat
(option xb assumed; fitted values)

. predict uhat, resid

43
List for i = 1, . . . , 15
. list wage educ wagehat uhat in 1/15

wage educ wagehat uhat

1. 3.1 11 5.0501 -1.9501


2. 3.2 12 5.591459 -2.35146
3. 3 11 5.0501 -2.0501
4. 6 8 3.426023 2.573977
5. 5.3 12 5.591459 -.2914593

6. 8.8 16 7.756896 .9931036


7. 11 18 8.839615 2.410385
8. 5 12 5.591459 -.5914595
9. 3.6 12 5.591459 -1.991459
10. 18 17 8.298256 9.881744

11. 6.3 16 7.756896 -1.506896


12. 8.1 13 6.132819 1.997181
13. 8.8 12 5.591459 3.178541
14. 5.5 12 5.591459 -.0914594
15. 22 12 5.591459 16.60854

44
• Some residuals are positive, others are negative. None in the
first 15 is especially close to zero

• educ may not be a good predictor of wage. We need to formalize how good it is

• Want a measure to see how ûi ’s are overall close to zero

• For this purpose, we derive some useful properties of ûi and ŷi

45
Recall conditions for OLS

• Recall that the OLS estimates β̂0 and β̂1 are chosen to satisfy

Σⁿᵢ₌₁ (yi − β̂0 − β̂1 xi ) = 0
Σⁿᵢ₌₁ xi (yi − β̂0 − β̂1 xi ) = 0

• Plugging in the definition ûi = yi − β̂0 − β̂1 xi directly gives us two important properties

46
Algebraic properties

• (1) OLS residuals always add up to zero

Σⁿᵢ₌₁ ûi = 0

• (2) The sample covariance (and thus the sample correlation) between regressor and residual is always zero

Σⁿᵢ₌₁ xi ûi = 0

47
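Both properties are easy to verify numerically after any regression (a sketch using the wage example above):

* check Properties (1) and (2) on the WAGE1 regression
quietly reg wage educ
predict double uh, resid
quietly summarize uh
display "average residual: " r(mean)   // zero up to rounding error
quietly correlate educ uh
display "corr(x, uhat):    " r(rho)    // also zero up to rounding error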
Implications (nice exercises)

• By yi = ŷi + ûi and Property (1)

ȳ = (1/n) Σⁿᵢ₌₁ ŷi

i.e. the sample average of the actual yi is the same as the sample average of the fitted values ŷi

• By the definition ŷi = β̂0 + β̂1 xi and Properties (1)-(2)

Σⁿᵢ₌₁ ŷi ûi = 0

i.e. ŷi and ûi have zero sample covariance

48
Goodness-of-fit
• For each observation, write

yi = ŷi + ûi

• Define the total sum of squares (SST), explained sum of squares (SSE) (“model sum of squares” in Stata) and residual sum of squares (SSR) as

SST = Σⁿᵢ₌₁ (yi − ȳ)²
SSE = Σⁿᵢ₌₁ (ŷi − ȳ)²
SSR = Σⁿᵢ₌₁ ûi²

• They become sample variances if divided by n (or n − 1) (note: the sample average of ûi is 0 by Property (1))
49
• By writing

SST = Σⁿᵢ₌₁ (yi − ȳ)² = Σⁿᵢ₌₁ [(yi − ŷi ) + (ŷi − ȳ)]² = Σⁿᵢ₌₁ [ûi + (ŷi − ȳ)]²

and using the fact Σⁿᵢ₌₁ ŷi ûi = 0, we get

SST = SSE + SSR

• Define the fraction of total variation in yi that is explained by xi as

R² = SSE/SST = 1 − SSR/SST

which is called the R-squared of the regression

50
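A sketch of this decomposition in Stata (the WAGE1 regression assumed; Stata's output reports SSE and SSR as the "Model" and "Residual" SS):

* rebuild R-squared from SSR and SST
quietly reg wage educ
predict double uhat2, resid
quietly summarize wage
scalar SST = r(Var)*(r(N)-1)
quietly summarize uhat2
scalar SSR = r(Var)*(r(N)-1)   // residuals have mean zero, so this is the sum of squared residuals
display "R2 = " 1 - SSR/SST    // should match e(r2) from the regression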
Remark on R² (1)

• It always holds that 0 ≤ R² ≤ 1 by construction

• R² = 0 means no linear relationship between yi and xi . R² = 1 means a perfect linear relationship

• As R² increases, the yi ’s get closer to the OLS regression line
51
Remark on R² (2)

• A low R² does not mean your model is wrong or your estimates are biased
• See the simulation example below – the data generating process for the outcome y is linear in x but the error variance is large relative to the variance of x. As a result, R² is low but the estimates are close to the true parameter values that were used to generate the data

52
Remark on R² (3)
Simulation example 1 (figure)

53
Remark on R² (4)
Simulation example 1 (figure)

54
Remark on R² (5)

• A high R² does not mean your model is right or your estimates are unbiased
• See the simulation example below – the data generating process for the outcome y is non-linear in x but the error variance is small relative to the variance of x. As a result, R² is high

55
Remark on R² (6)
Simulation example 2 (figure)

56
Remark on R² (7)
Simulation example 2 (figure)

57
Remark on R² (8)
Simulation example 2 (figure)

58
Example: Wage equation (WAGE1.dta)
. reg wage educ

Source SS df MS Number of obs = 526


F( 1, 524) = 103.36
Model 1179.73204 1 1179.73204 Prob > F = 0.0000
Residual 5980.68225 524 11.4135158 R-squared = 0.1648
Adj R-squared = 0.1632
Total 7160.41429 525 13.6388844 Root MSE = 3.3784

wage Coef. Std. Err. t P>|t| [95% Conf. Interval]

educ .5413593 .053248 10.17 0.000 .4367534 .6459651


_cons -.9048516 .6849678 -1.32 0.187 -2.250472 .4407687

• We can see R² = .1648 from the output. So years of education explains only about 16.5% of the variation in hourly wage

59
4. Units of measurement and functional form
(Wooldridge, Ch. 2.4)

60
Units of measurement

• It is important to know how y and x are measured in order to interpret regression functions

• As an example, use CEOSAL1.dta. Take

y = salary (CEO salary measured in thousands of dollars)
x = roe (return on equity measured in percent)

. des salary roe

              storage  display    value
variable name   type   format     label      variable label

salary          int    %9.0g                 1990 salary, thousands $
roe             float  %9.0g                 return on equity, 88-90 avg
61
Regress salary on roe
. reg salary roe

Source SS df MS Number of obs = 209


F( 1, 207) = 2.77
Model 5166419.04 1 5166419.04 Prob > F = 0.0978
Residual 386566563 207 1867471.32 R-squared = 0.0132
Adj R-squared = 0.0084
Total 391732982 208 1883331.64 Root MSE = 1366.6

salary Coef. Std. Err. t P>|t| [95% Conf. Interval]

roe 18.50119 11.12325 1.66 0.098 -3.428196 40.43057


_cons 963.1913 213.2403 4.52 0.000 542.7902 1383.592

• A one percentage point increase in roe increases predicted salary by 18.501 (i.e. $18,501)

62
Change unit of x = roe

• What if we measure roe as a decimal, rather than a percent? Define

roedec = roe/100

• What will happen to the intercept, slope, and R² if we regress salary on roedec?

• Nothing should happen to the intercept: roedec = 0 is the same as roe = 0

• But the slope will be multiplied by 100. A one percentage point change in roe is the same as ∆roedec = .01. So we get the same effect as before

• Also R² should not change (and indeed it does not)

63
Regress salary on roedec = roe/100

. gen roedec = roe/100

. reg salary roedec

Source SS df MS Number of obs = 209


F( 1, 207) = 2.77
Model 5166418.8 1 5166418.8 Prob > F = 0.0978
Residual 386566563 207 1867471.32 R-squared = 0.0132
Adj R-squared = 0.0084
Total 391732982 208 1883331.64 Root MSE = 1366.6

salary Coef. Std. Err. t P>|t| [95% Conf. Interval]

roedec 1850.119 1112.325 1.66 0.098 -342.8196 4043.057


_cons 963.1913 213.2403 4.52 0.000 542.7902 1383.592

64
Change unit of y = salary

• What if we measure salary in dollars, rather than thousands of dollars?

salarydol = 1000 × salary

• Both intercept and slope should be multiplied by 1000. To get the same prediction when x = 0, the intercept should be multiplied by 1000. To get the same effect of a one percentage point change in x, the slope should be multiplied by 1000 as well

65
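A quick check of this claim (a sketch; CEOSAL1.dta assumed in memory):

* rescale y and rerun the regression
gen salarydol = 1000*salary
reg salarydol roe
* the slope and intercept should come out as 18501.19 and 963191.3,
* i.e. 1000 times the originals, with R-squared unchanged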
Two ways to find answer

• There are basically two ways to answer this kind of question

• Use intuition: β̂0 is the prediction for ŷ when x = 0. β̂1 gives us the change ∆ŷ = β̂1 ∆x. Both should match the ones from the original regression

• Use the definitions:

β̂0 = ȳ − β̂1 x̄
β̂1 = Σⁿᵢ₌₁ (xi − x̄)(yi − ȳ) / Σⁿᵢ₌₁ (xi − x̄)²

and replace y or x with cy or cx, say

66
Using natural logarithm
• Recall the wage example

ŵage = −0.90 + 0.54 educ

• Might be an okay approximation, but: the value of another year of schooling is constant (we expect additional years of schooling to be worth more than previous years)

• How can we incorporate such an increasing effect? One way is to assume a constant percentage effect by using the natural log (denoted by “log”, but “ln” is also common)

• See App. A.4 for review on logs. The key property is

100∆ log(x) ≈ %∆x

i.e. the log change times 100 approximates the percent change
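• E.g. if x increases from 100 to 105, then ∆ log(x) = log(105) − log(100) ≈ 0.0488, so 100∆ log(x) ≈ 4.88, close to the exact 5% change. The approximation works well for small changes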


67
Log-level model

• Now consider

log(wage) = β0 + β1 educ + u

• Holding u fixed, ∆ log(wage) = β1 ∆educ and

β1 = ∆ log(wage)/∆educ

• By the property of logs,

100 · ∆ log(wage) ≈ %∆wage

• Thus β1 is interpreted as

100β1 ≈ %∆wage when ∆educ = 1

68
Regress log(wage) on educ
. drop lwage

. gen lwage = log(wage)

. reg lwage educ

Source SS df MS Number of obs = 526


F( 1, 524) = 119.58
Model 27.5606288 1 27.5606288 Prob > F = 0.0000
Residual 120.769123 524 .230475425 R-squared = 0.1858
Adj R-squared = 0.1843
Total 148.329751 525 .28253286 Root MSE = .48008

lwage Coef. Std. Err. t P>|t| [95% Conf. Interval]

educ .0827444 .0075667 10.94 0.000 .0678796 .0976091


_cons .5837727 .0973358 6.00 0.000 .3925563 .7749891

• The estimated return to each year of education is about 8.3%

• Warning: This R² is not directly comparable to the R² when wage is the dependent variable
69
Log-log model

• We can use logs on both sides of the equation to get constant elasticity models

• For example, consider CEOSAL1.dta. If

log(salary) = β0 + β1 log(sales) + u

then

β1 = ∆ log(salary)/∆ log(sales) ≈ %∆salary/%∆sales

• The elasticity is free of the units of salary and sales

70
Regress log(salary ) on log(sales)
. reg lsalary lsales

Source SS df MS Number of obs = 209


F( 1, 207) = 55.30
Model 14.0661688 1 14.0661688 Prob > F = 0.0000
Residual 52.6559944 207 .254376785 R-squared = 0.2108
Adj R-squared = 0.2070
Total 66.7221632 208 .320779631 Root MSE = .50436

lsalary Coef. Std. Err. t P>|t| [95% Conf. Interval]

lsales .2566717 .0345167 7.44 0.000 .1886224 .3247209


_cons 4.821997 .2883396 16.72 0.000 4.253538 5.390455

• The estimated elasticity of CEO salary with respect to firm sales is about .257
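• E.g. a 10% increase in sales predicts a salary about 2.57% higher, regardless of the units in which salary and sales are measured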

71
Summary: Interpretation of β1

Model         Dep. Var.   Regressor   Interpretation of β1
Level-Level   y           x           β1 = ∆y/∆x
Level-Log     y           log(x)      β1/100 = ∆y/%∆x
Log-Level     log(y)      x           100β1 = %∆y/∆x
Log-Log       log(y)      log(x)      β1 = %∆y/%∆x

72
5. Expected value and variance of OLS
(Wooldridge, Ch. 2.5)

73
Statistical properties of OLS
• Our analysis so far has been purely algebraic and holds for any sample

• Now our job gets harder. We study statistical properties of the OLS estimator under random sampling

• Mathematical statistics: How do our estimators behave across different samples? On average, do we get the right answer if we could repeatedly sample? (Usually we only have one particular sample; just imagine)

• This leads to the notion of unbiasedness

• Need to find the expected value of the OLS estimators (i.e. the average outcome across all possible random samples) and see if we are right on average

• See App. C.1-C.2 for review on concepts for point estimation


74
Assumptions SLR (Simple Linear Regression)
• SLR.1 (Linear in parameters) The population model is

y = β0 + β1 x + u

where β0 and β1 are (unknown) parameters

• SLR.2 (Random sampling) We have a random sample of size n, {(yi , xi ) : i = 1, . . . , n}, following the population model

• SLR.3 (Sample variation in xi ) The sample outcomes on xi are not all the same value

• SLR.4 (Zero conditional mean) In the population, the error term u has zero mean given any value of the regressor

E (u|x) = 0 for all x

75
Assumptions SLR.1

• Assumption SLR.1 (Linear in parameters)

The population model is

y = β0 + β1 x + u

where β0 and β1 are (unknown) parameters

• We view x and u as random variables, so y is of course random
76
Assumptions SLR.2

• Assumption SLR.2 (Random sampling)

We have a random sample of size n, {(yi , xi ) : i = 1, ..., n}, following the population model

• We know how to use these data to estimate β0 and β1 by OLS

• Because each i is a draw from the population, we can write

yi = β0 + β1 xi + ui

for each i

• Note: ui is different from the residual ûi

77
Assumptions SLR.3

• Assumption SLR.3 (Sample variation in xi )

The sample outcomes on xi are not all the same value

• Equivalent to saying that the sample variance of {xi : i = 1, . . . , n} is not zero (a very mild assumption)

78
Assumptions SLR.4

• Assumption SLR.4 (Zero conditional mean)

In the population, the error term u has zero mean given any value of the regressor

E (u|x) = 0 for all x

• This is key for showing unbiasedness of the OLS estimator

• If x is non-random, then this assumption becomes E (u) = 0

79
Basic approach to show unbiasedness
• We focus on β̂1 . Want to show

E (β̂1 ) = β1

where the expected value means averaging across random samples

• There are basically two approaches to showing unbiasedness. One is to explicitly compute the expected value of β̂1 conditional on the sample outcomes of the regressors {x1 , . . . , xn }. But this is cumbersome

• Here we (and Wooldridge) take a second, shortcut approach: we treat {x1 , . . . , xn } as non-random in the derivation

• So randomness in β̂1 comes through the ui ’s (or yi ’s) only

• We use this only as a simplifying device for the derivation. In Assumptions SLR.1-4, the xi ’s are treated as random
80
Simplifying device “Conditional on X”

• Throughout my part of the lecture, “X” means the sample outcome of the regressors, i.e.

X = {x1 , . . . , xn }

• Next chapter we will have multiple regressors. In that case

X = {(xi1 , . . . , xik ) : i = 1, . . . , n}

• Intuitively, X means “the sample of all x variables”

• If we say:

Conditional on X

it means the sample outcomes of the regressors X are treated as non-random (or fixed or constant)

81
Show unbiasedness: Step 1
• What we want to show is

E (β̂1 ) = β1

conditional on X

• Step 1: Write down the formula for β̂1 . It is convenient to use

β̂1 = Σⁿᵢ₌₁ (xi − x̄)yi / Σⁿᵢ₌₁ (xi − x̄)²

which is one of several equivalent forms

• Let SSTx = Σⁿᵢ₌₁ (xi − x̄)² (total variation in xi ) and write

β̂1 = Σⁿᵢ₌₁ (xi − x̄)yi / SSTx
82
Show unbiasedness: Step 2
• Step 2: Replace yi with β0 + β1 xi + ui (which uses SLR.1-2)

• The numerator of β̂1 becomes

Σⁿᵢ₌₁ (xi − x̄)yi = Σⁿᵢ₌₁ (xi − x̄)(β0 + β1 xi + ui )
                = β0 Σⁿᵢ₌₁ (xi − x̄) + β1 Σⁿᵢ₌₁ (xi − x̄)xi + Σⁿᵢ₌₁ (xi − x̄)ui
                = β1 SSTx + Σⁿᵢ₌₁ (xi − x̄)ui

by Σⁿᵢ₌₁ (xi − x̄) = 0 and Σⁿᵢ₌₁ (xi − x̄)xi = Σⁿᵢ₌₁ (xi − x̄)²
83
• So we get

β̂1 = [β1 SSTx + Σⁿᵢ₌₁ (xi − x̄)ui ] / SSTx = β1 + Σⁿᵢ₌₁ (xi − x̄)ui / SSTx

• Now define

wi = (xi − x̄)/SSTx

• Then we have

β̂1 = β1 + Σⁿᵢ₌₁ wi ui

• The estimation error (or noise) β̂1 − β1 is written as a weighted sum of the ui ’s

84
Show unbiasedness: Step 3

• Step 3: Find E (β̂1 )

• Note that the wi ’s are all functions of X. Thus conditional on X, the properties of E (·) yield

E (β̂1 ) = E (β1 + Σⁿᵢ₌₁ wi ui ) = β1 + Σⁿᵢ₌₁ wi E (ui )

• SLR.4 implies E (ui ) = 0. Therefore, conditional on X,

E (β̂1 ) = β1

85
Theorem (Unbiasedness of OLS)

• Under Assumptions SLR.1-4 and conditional on X,

E (β̂1 ) = β1
E (β̂0 ) = β0

• Unbiasedness of β̂0 is shown in a similar way (nice exercise!)

• Remember β1 is a fixed constant in the population. The estimator β̂1 varies across samples and is the random outcome. Before we collect the data, we do not know what β̂1 will be

86
Simulation

• Now simulate data in Stata

y = 3 + 2x + u

where

x ∼ Normal(0, 9)
u ∼ Normal(0, 36)

and they are independent

87
Generate data

. range x 0 16 250

. * Generate Normal(0,9) distribution for x

. replace x = 3*invnormal(uniform())
(250 real changes made)

. * Generate Normal(0,36) distribution for u

. gen u = 6*invnormal(uniform())

88
Estimate by 1st sample

. gen y = 3 + 2*x + u

. reg y x

Source SS df MS Number of obs = 250


F( 1, 248) = 219.41
Model 7946.95533 1 7946.95533 Prob > F = 0.0000
Residual 8982.35506 248 36.2191736 R-squared = 0.4694
Adj R-squared = 0.4673
Total 16929.3104 249 67.9891983 Root MSE = 6.0182

y Coef. Std. Err. t P>|t| [95% Conf. Interval]

x 1.983109 .1338799 14.81 0.000 1.719422 2.246796


_cons 3.034248 .3806681 7.97 0.000 2.284493 3.784002

89
Estimate by 2nd sample
. * Now repeat the process, but keep x the same

. replace u = 6*invnormal(uniform())
(250 real changes made)

. replace y = 3 + 2*x + u
(250 real changes made)

. reg y x

Source SS df MS Number of obs = 250


F( 1, 248) = 198.83
Model 7936.87302 1 7936.87302 Prob > F = 0.0000
Residual 9899.70081 248 39.9181484 R-squared = 0.4450
Adj R-squared = 0.4427
Total 17836.5738 249 71.6328266 Root MSE = 6.3181

y Coef. Std. Err. t P>|t| [95% Conf. Interval]

x 1.981851 .1405502 14.10 0.000 1.705026 2.258675


_cons 2.661485 .399634 6.66 0.000 1.874376 3.448594

90
Estimate by 3rd sample
. replace u = 6*invnormal(uniform())
(250 real changes made)

. replace y = 3 + 2*x + u
(250 real changes made)

. reg y x

Source SS df MS Number of obs = 250


F( 1, 248) = 231.56
Model 8467.28171 1 8467.28171 Prob > F = 0.0000
Residual 9068.60664 248 36.5669623 R-squared = 0.4829
Adj R-squared = 0.4808
Total 17535.8884 249 70.4252544 Root MSE = 6.0471

y Coef. Std. Err. t P>|t| [95% Conf. Interval]

x 2.047002 .1345212 15.22 0.000 1.782052 2.311951


_cons 3.40232 .3824914 8.90 0.000 2.648974 4.155666

91
Estimate by 4th sample
. replace u = 6*invnormal(uniform())
(250 real changes made)

. replace y = 3 + 2*x + u
(250 real changes made)

. reg y x

Source SS df MS Number of obs = 250


F( 1, 248) = 248.84
Model 9298.98077 1 9298.98077 Prob > F = 0.0000
Residual 9267.60761 248 37.3693855 R-squared = 0.5008
Adj R-squared = 0.4988
Total 18566.5884 249 74.564612 Root MSE = 6.1131

y Coef. Std. Err. t P>|t| [95% Conf. Interval]

x 2.145181 .1359891 15.77 0.000 1.87734 2.413022


_cons 2.551488 .3866653 6.60 0.000 1.789921 3.313054

92
• The 1st and 2nd generated data sets give us β̂1 ≈ 1.98, very close to β1 = 2. The 4th data set gives us β̂1 ≈ 2.15, a bit far

• If we repeat the experiment again and again and then average the outcomes of β̂1 , we would get very close to 2

• The problem is: we do not know which kind of sample we have. We never know whether we are close to the population value

• Generally, we hope that our sample is “typical” and produces a slope estimate close to β1 , but again we never know

93
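The repeated-sampling experiment above can be automated (a sketch, not from the slides; it uses the modern rnormal() generator instead of invnormal(uniform()), and it redraws x each replication, whereas the slides hold x fixed):

* program that draws one sample and returns the OLS slope
program define olssim, rclass
    clear
    set obs 250
    gen x = 3*rnormal()    // x ~ Normal(0, 9)
    gen u = 6*rnormal()    // u ~ Normal(0, 36)
    gen y = 3 + 2*x + u
    reg y x
    return scalar b1 = _b[x]
end

set seed 123
simulate b1=r(b1), reps(1000) nodots: olssim
summarize b1               // the mean of b1 should be very close to 2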
Remarks on theorem

• Unbiasedness is a property of the estimator (not the estimate)

• After an estimation like

log(wage)ˆ = 0.584 + 0.083 educ

it is tempting to say 8.3% is an “unbiased estimate” of the return to education. Technically, this statement is incorrect. The estimator (or rule or recipe) used to get β̂1 = 0.083 is unbiased (under SLR.1-4)

• The key is SLR.4: E (u|x) = 0. If it fails, then OLS will be biased for β1 . For example, if some omitted factor contained in u is correlated with x, then SLR.4 will typically fail

94
Variance of OLS estimator
• Under SLR.1-4, OLS estimators are unbiased, i.e. estimates
are equal to population values on average
• We also need measure of dispersion in sampling distribution
of the estimators. We employ variance (and standard
deviation)

95
Additional assumption

• We could characterize the variance of the OLS estimators under SLR.1-4 (and will do so in Ch. 8). For now, it is easiest to introduce an assumption to simplify the derivation

• Assumption SLR.5 (Homoskedasticity, or constant variance)
The error term has the same variance given any value of the regressor x

Var (u|x) = σ² > 0 for all x

where σ² is unknown

96
Remark on SLR.5

• Combining Assumptions SLR.1 (y = β0 + β1 x + u), SLR.4 (E (u|x) = 0) and SLR.5 (Var (u|x) = σ²) gives

E (y|x) = β0 + β1 x
Var (y|x) = σ²

• So the expected value of y is allowed to change with x (this is our interest) but the variance does not change with x

• The constant variance assumption may not be realistic. It must be assessed on a case-by-case basis

97
Example: Saving and income

• For y = sav and x = inc, consider

E (sav|inc) = β0 + β1 inc

with β1 > 0, i.e. average family saving increases with income

• If we impose SLR.5, then

Var (sav|inc) = σ²

which means the variability in saving does not change with income

• There are reasons to think saving would be more variable as income increases

98
Theorem (Sampling variance of OLS)

• Under Assumptions SLR.1-5 and conditional on X,

Var (β̂1 ) = σ² / Σⁿᵢ₌₁ (xi − x̄)² = σ²/SSTx
Var (β̂0 ) = σ² [n⁻¹ Σⁿᵢ₌₁ xi²] / SSTx

99
Derive variance formula

• To show this result, write as before

β̂1 = β1 + Σⁿᵢ₌₁ wi ui

where wi = (xi − x̄)/SSTx . We treat wi as non-random in the derivation

• Because β1 is constant, it does not affect Var (β̂1 ). So

Var (β̂1 ) = Var (Σⁿᵢ₌₁ wi ui )

100
• Since the ui ’s are independent, they are uncorrelated with each other. So the variance of the sum equals the sum of the variances (see Property VAR.4 in App. B.4)

• Thus, conditional on X,

Var (Σⁿᵢ₌₁ wi ui ) = Σⁿᵢ₌₁ Var (wi ui )
                 = Σⁿᵢ₌₁ wi² Var (ui )
                 = Σⁿᵢ₌₁ wi² σ²
                 = σ² Σⁿᵢ₌₁ wi²

where the third equality uses SLR.5


101
• Also note that

Σⁿᵢ₌₁ wi² = Σⁿᵢ₌₁ (xi − x̄)² / (SSTx )² = SSTx /(SSTx )² = 1/SSTx

• Combining these results, we get

Var (β̂1 ) = σ²/SSTx

conditional on X

102
Remark on Var (β̂1 ) = σ²/SSTx

• This is the “standard” formula for the variance of β̂1 . It is not valid if Assumption SLR.5 does not hold

• SLR.5 was not used to show unbiasedness of β̂1 , which requires only SLR.1-4

• Based on this formula, we can easily study the two factors (σ² and SSTx ) that affect Var (β̂1 )

• As the error variance σ² increases, so does Var (β̂1 ). The more “noise” in the relationship between y and x, the harder it is to learn about β1

• By contrast, more variation in {xi } is a good thing: SSTx ↑ implies Var (β̂1 ) ↓

103
• Since SSTx /n is the sample variance of x, we can say

SSTx ≈ nσx²

and thus

Var (β̂1 ) ≈ σ²/(nσx²)

which means Var (β̂1 ) shrinks at rate 1/n
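• E.g. quadrupling the sample size cuts Var (β̂1 ) to roughly a quarter, and so cuts sd(β̂1 ) roughly in half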

• This is why more data is a good thing: more data shrinks the sampling variance of β̂1

• The standard deviation of β̂1 is the square root of the variance:

sd(β̂1 ) = σ/√SSTx

This turns out to be the measure of variation that appears in confidence intervals and test statistics
104
Estimating error variance σ²
• Look at the formula

Var (β̂1 ) = σ²/SSTx

• We can compute SSTx from the data {x1 , . . . , xn }. But σ² = E (u²) is unknown and needs to be estimated

• If we observed {u1 , . . . , un }, an unbiased estimator of σ² would be (1/n) Σⁿᵢ₌₁ ui². Since ui is unobservable, we use the OLS residual ûi as a proxy and estimate σ² by

σ̂² = (1/(n − 2)) Σⁿᵢ₌₁ ûi² = SSR/(n − 2)

• Indeed under SLR.1-5, E (σ̂²) = σ² (unbiased) [don’t worry about the proof]. The “−2” is crucial to achieve unbiasedness

105
Standard error of β̂1

• In regression output, usually

σ̂ = √σ̂²

is reported (called root mean squared error, “Root MSE”, in Stata)

• Given σ̂, we can now estimate sd(β̂1 ) = √Var (β̂1 ) = σ/√SSTx . Its estimate is called the standard error of β̂1 :

se(β̂1 ) = σ̂/√SSTx

106
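A sketch reconstructing se(β̂1 ) from these pieces (the WAGE1 log-wage regression assumed; see the output on the next slide):

* standard error of the slope "by hand"
quietly reg lwage educ
scalar sighat = e(rmse)           // sigma-hat, the Root MSE
quietly summarize educ
scalar SSTx = r(Var)*(r(N)-1)     // total variation in x
display "se(b1) = " sighat/sqrt(SSTx)   // should match the .0075667 reported for educ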
Example: Return to education (WAGE1.dta)
. reg lwage educ

Source SS df MS Number of obs = 526


F( 1, 524) = 119.58
Model 27.5606288 1 27.5606288 Prob > F = 0.0000
Residual 120.769123 524 .230475425 R-squared = 0.1858
Adj R-squared = 0.1843
Total 148.329751 525 .28253286 Root MSE = .48008

lwage Coef. Std. Err. t P>|t| [95% Conf. Interval]

educ .0827444 .0075667 10.94 0.000 .0678796 .0976091


_cons .5837727 .0973358 6.00 0.000 .3925563 .7749891

• In this regression, σ̂ = .4801 (see “Root MSE” in the output). Standard errors are reported in the “Std. Err.” column

107
