Chapter 2


LECTURE 2:

SIMPLE
REGRESSION I
Introducing Simple Regression

 simple regression = regression with 2 variables

Y is affected by cause X

names for Y               names for X
dependent variable        independent variable
explained variable        explanatory variable
response variable         control variable
predicted variable        predictor variable
regressand                regressor

 we are actually going to derive the linear regression model in three very different
ways
 these three ways reflect three types of econometric questions we
discussed in the first lecture (descriptive, causal and forecasting)
 while the math for doing it is identical, conceptually they are very different
ideas

Descriptive Approach
Why do we need a regression model?
Estimation & interpretation.
Correlation vs. causation.

Introducing Simple Regression

Definition of the simple linear regression model

    y = β0 + β1x + u

„Explains variable y in terms of variable x“

 y: dependent variable, explained variable, response variable, predicted variable, regressand, …
 x: independent variable, explanatory variable, control variable, predictor variable, regressor, …
 u: error term, disturbance, unobservables, …

 β0 is the intercept; β1 is the slope parameter.

 β1 gives us this information:
1. direction (+ or –, like the sign of the correlation)
2. magnitude (when x changes by 1 unit, what is the change in y)
3. significance
Introducing Simple Regression

Interpretation of the simple linear regression model

„Studies how y varies with changes in x“:

    Δy = β1 Δx   as long as   Δu = 0

By how much does the dependent variable change if the independent variable is
increased by one unit? The interpretation is only correct if all other things
remain equal when the independent variable is increased by one unit.

The simple linear regression model is rarely applicable in practice but its
discussion is useful for pedagogical reasons
Introducing Simple Regression

Example: Soybean yield and fertilizer

    yield = β0 + β1 fertilizer + u

 u contains rainfall, land quality, presence of parasites, …
 β1 measures the effect of fertilizer on yield, holding all other factors fixed.

Example: A simple wage equation

    wage = β0 + β1 educ + u

 u contains labor force experience, tenure with current employer, work ethic, intelligence, …
 β1 measures the change in hourly wage given another year of education,
holding all other factors fixed.
Introducing Simple Regression

When is there a causal interpretation?


 Conditional mean independence assumption:

    E[u|x] = E[u] = 0

 Example: wage equation wage = β0 + β1 educ + u

 The explanatory variable must not contain information about the mean of the
unobserved factors (e.g. intelligence, …).

 The conditional mean independence assumption is unlikely to hold here, because
individuals with more education will also be more intelligent on average; in
other words, the assumption requires that there be no correlation between
education and intelligence.
Introducing Simple Regression

Population regression function (PRF)

 The conditional mean independence assumption implies that

    E[y|x] = β0 + β1x

 This means that the average value of the dependent variable can be expressed
as a linear function of the explanatory variable.
Introducing Simple Regression

 goal: estimate E[y|x] (called the population regression function)


 x is discrete (A discrete random variable x has a countable number of
possible values) – wages vs. education example
 one can collect data for the individual categories (the
more categories, the more difficult)

[Figure: average hourly wage (€0–1,500) plotted against education (8–20 years)]
Introducing Simple Regression (cont’d)
 x is continuous – an infinite number of possible values. Continuous random variables are
usually measurements; examples include height, weight, the amount of sugar in an orange, and
the time required to run a mile. Problem: no categories.
 example: CEO’s salary vs. company’s ROE
 ROE = return on equity = net income as a percentage of common equity (ROE =
10 means if I invest $100 in equity in a firm I earn $10 a year)
 does a CEO’s salary (typically huge) reflect her performance?

[Figure: CEO salary ($1,000s, 0–4,000) plotted against the company’s ROE (0–50)]
Introducing Simple Regression (cont’d)
 therefore, we need a model for E[y|x]
 in other words, we need to find a “good” mathematical expression for f
in E[y|x] = f(x)
 the simplest model I can think of is E[y|x] = β0 + β1x

Why use simple models:


Simple models are:
▪ easier to estimate.
▪ easier to interpret (e.g., β1 = Δwage/Δeduc etc.).
▪ easier to analyze from the statistical standpoint.
▪ safe: they serve as a good approximation to the real
relationship, the functional nature of which might be
unknown and/or complicated. Things can’t go too wrong
when using a simple model.
Further reading: Angrist and Pischke (2008): Mostly Harmless
Econometrics: An Empiricist’s Companion.
Introducing Simple Regression

Population regression function

 For individuals with x = x0, the average value of y is E[y|x = x0] = β0 + β1x0.
Introducing Simple Regression

In order to estimate the regression model one needs data


A random sample of n observations: {(xi, yi): i = 1, …, n}

    first observation:   (x1, y1)
    second observation:  (x2, y2)
    third observation:   (x3, y3)
    …
    n-th observation:    (xn, yn)

xi is the value of the explanatory variable for the i-th observation;
yi is the value of the dependent variable for the i-th observation.
Introducing Simple Regression – difference between actual and fitted (estimated) values

Fit a regression line through the data points as well as possible:

    Fitted regression line: ŷ = β̂0 + β̂1x

For example, the i-th data point (xi, yi) lies some vertical distance ûi above or below the fitted line.
Linear Model for E[y|x]

 interpretation:
    intercept: β0 = E[y|x = 0]
    slope: β1 = ΔE[y|x] / Δx

[Figure: the line E[y|x] = β0 + β1x, with intercept β0; a one-unit increase in x raises E[y|x] by β1]
Estimation

 once we’ve decided about the functional form of the model (in our case, it’s E[y|
x] = β0 + β1x), we need to develop techniques to obtain estimates of the
parameters β0 and β1
 we base our estimates on a sample of the population → we need to make
inferences about the whole population
 sample:
 n observations (people, countries, etc.) indexed with i
 x and y values for ith person denoted as yi , xi (for i = 1,…,n)
Preliminaries
 it will be useful to define u = y – β0 – β1x
 what do we know about u?
 first, because E[y|x] = β0 + β1x, we have

    E[u|x] = E[y − β0 − β1x | x] = E[y|x] − β0 − β1x = β0 + β1x − β0 − β1x = 0
Estimation (cont’d)
 this means two things:
1. Eu = 0 (this should be intuitive: E[u|x] = 0 for all x → Eu = 0);
2. the expected value of u does not change when we change x
 the second property has numerous implications, the most important being:

 cov(x,u) = 0
 E[xu] = 0 (because cov(x,u) = E[xu] – E[x]·E[u], and E[u] = 0)
 how is this useful in estimation?
 we arrived at two important facts about expectations of u:
Eu=0
E[xu] = 0
 typically, expectations are estimated using a sample mean
(e.g., how would you estimate the mean wage with a sample of 10 people?)
 we’ll use the idea of the sample analogue
Estimation (cont’d)
 before we move on, we’ll revise some more statistical concepts connected
with random sample
 remember we need to make inference about the population based on our sample
and its characteristics
Population vs. sample characteristics (x is a random variable):

                Population (size N)                                Sample (size n)

 mean:          μx = E[x] = (1/N) Σi xi                            x̄ = (1/n) Σi xi

 variance:      var x = E[(x − μx)²] = (1/N) Σi (xi − μx)²         (1/(n−1)) Σi (xi − x̄)²

 covariance:    cov(x, y) = E[(x − μx)(y − μy)]                    (1/(n−1)) Σi (yi − ȳ)(xi − x̄)
                          = (1/N) Σi (yi − μy)(xi − μx)

 the expressions on the right are called sample mean, sample variance
and sample covariance
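To make the sample formulas concrete, here is a minimal Python sketch (standard library only) that computes the sample mean, sample variance and sample covariance exactly as defined above; the small x and y lists are made-up illustration values, not data from the course.

```python
# Sample analogues of the population mean, variance and covariance.
# The x and y values below are made-up numbers used only to illustrate the formulas.
x = [8, 10, 12, 12, 14, 16, 16, 18]                 # e.g. years of education
y = [6.0, 7.5, 9.0, 8.0, 11.0, 12.5, 13.0, 15.0]    # e.g. hourly wage

n = len(x)
x_bar = sum(x) / n                                   # sample mean of x
y_bar = sum(y) / n                                   # sample mean of y

var_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)                         # sample variance
cov_xy = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y)) / (n - 1)  # sample covariance

print(x_bar, var_x, cov_xy)
```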
Estimation (cont’d)
 for ith person (i.e., observation) in our sample, we have the population regression
model
yi = β0 + β1xi + ui ,
where β0 and β1 are the unknown parameters to be estimated
 to think about estimation let’s define the sample regression model

    yi = β̂0 + β̂1xi + ûi ,

where
 β̂0 and β̂1 are our estimates of β0 and β1 from the sample
 ûi is defined as ûi = yi − β̂0 − β̂1xi
 note: if β̂0 and β̂1 are like β0 and β1, then ûi should be like ui
 sample analogue:
 in order to make inferences about the population, we have to believe
that our sample looks like the population
 if it is so, then let’s force things to be true in the sample which we
know would be true in the population
Estimation (cont’d)
 from the discussion about (the population’s) u, we know that
Eu=0
E[xu] = 0
 the sample analogue to this is:

    0 = (1/n) Σi ûi = (1/n) Σi (yi − β̂0 − β̂1xi)
    0 = (1/n) Σi xiûi = (1/n) Σi xi(yi − β̂0 − β̂1xi)

 as yi and xi are the data, the above equations are in fact two linear equations in
the variables β̂0 and β̂1; they can be solved (fairly) easily (see Wooldridge, pages 28 and 29)
 note that the first equation tells us that β̂0 = ȳ − β̂1x̄, where x̄ and ȳ
are the sample means of the xi’s and yi’s
 solving the equations to get β̂1 yields:

    β̂1 = Σi (yi − ȳ)(xi − x̄) / Σi (xi − x̄)²
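As a sketch of how the solved equations translate into an estimator, the following Python fragment computes β̂1 and β̂0 directly from the closed-form expressions above; the data passed in at the bottom are made-up illustration values.

```python
# OLS estimates from the sample-analogue (moment) conditions:
#   beta1_hat = sum (y_i - y_bar)(x_i - x_bar) / sum (x_i - x_bar)^2
#   beta0_hat = y_bar - beta1_hat * x_bar
def ols_simple(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = sxy / sxx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat

# made-up illustration data (same as in the previous snippet)
b0, b1 = ols_simple([8, 10, 12, 12, 14, 16, 16, 18],
                    [6.0, 7.5, 9.0, 8.0, 11.0, 12.5, 13.0, 15.0])
print(b0, b1)
```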

Estimation (cont’d)

 we can rewrite the formula for the slope as

    β̂1 = [ (1/(n−1)) Σi (yi − ȳ)(xi − x̄) ] / [ (1/(n−1)) Σi (xi − x̄)² ] ,

which can be viewed as the sample analogue to

    cov(x, y) / var x
 both var x and its sample counterpart are always positive, therefore:
 the regression coefficient (β1), the covariance, and the correlation
coefficient must all have the same sign
 one of them is zero only if all of them are zero

Estimation

CEO salary and return on equity

    Fitted regression:  predicted salary = 963.191 + 18.501 roe

Salary is measured in thousands of dollars; roe is the return on equity of the CEO‘s firm (in percent).

Interpretation:
- Intercept β̂0 = 963.191: at roe = 0, the predicted salary is 963.191, i.e. 963,191 $.
- Slope β̂1 = 18.501: if the return on equity increases by 1 percentage point (one unit of roe),
  salary is predicted to change by 18.501, i.e. 18,501 $.
- The sign is positive => a positive association between salary and ROE.
Causal interpretation?
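A quick way to check this interpretation is to plug values into the fitted line. The sketch below uses the two estimates reported above (963.191 and 18.501); the ROE values fed in are hypothetical.

```python
# Fitted line from the slide: predicted salary (in $1,000s) = 963.191 + 18.501 * roe
def predicted_salary(roe):
    return 963.191 + 18.501 * roe

print(predicted_salary(0))                           # 963.191 -> 963,191 $ when roe = 0
print(predicted_salary(10))                          # 1148.201 -> about 1.15 million $ at roe = 10
print(predicted_salary(11) - predicted_salary(10))   # 18.501 -> 18,501 $ per extra point of ROE
```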
Estimation

 Fitted regression line (depends on the sample):  ŷ = β̂0 + β̂1x

 Unknown population regression line:  E[y|x] = β0 + β1x


Estimation

Wage and education

    Fitted regression:  predicted wage = −0.90 + 0.54 educ

Hourly wage is measured in dollars; educ is years of education.

Interpretation:
- Intercept β̂0 = −0.90: the predicted wage at educ = 0 is −0.90 $ (not meaningful by itself).
- Slope β̂1 = 0.54: in the sample, one more year of education was associated with
  an increase in hourly wage of 0.54 $.
- The sign is positive => a positive association between wage and education.
Causal interpretation?
Correlation Vs. Causation

 the difference between causation and correlation is a key concept in econometrics
 you saw that in the model with conditional expectations, the estimates were based
on the correlation between x and y (remember the formulas)
 there are many ways a causal interpretation can be given that is consistent
with the (correlation-based) results (see next slide)
 as we’ll see, no econometric tool can ever “prove” or “find” a causal
relationship by itself
 having an economic model is essential in establishing the causal
interpretation (we’ll talk about this in the causal-analytic part)
 conclusions:
 correlation ≠ causation
 statistical significance ≠ the effect of x on y is significant (only that
they are “tightly associated”)
 these issues are confused all the time by politicians and the popular press
Three Causation Schemes

 looking at the relationship between x and y from the causal standpoint, the
association (or correlation) between x and y can represent three basic situations:

 y ← x :  x causes y: if a CEO’s performance is good (high ROE), he gets paid a lot.

 y → x :  y causes x: high salaries motivate CEOs, thus making them perform
well (resulting in high ROE).

 y ← z → x :  there is another factor z that causes both x and y, which makes x
and y be associated: if a CEO is good (clever, motivated, etc.), he both performs
well and gets paid a lot.

 the descriptive approach makes no claims about which one is the case

Causal Approach
The need for causal analysis. Structural
model and its assumptions. Estimation.
Examples.

Causal Analysis

 in the descriptive framework, we couldn’t really say anything about causality
 however, in decision-making situations, causal questions are typically those we
need to answer:
 if I face a decision, it means I can influence something (I have control over an
economic variable)
 in order to decide effectively, I need to know the impact of my decision
on things I’m interested in
 examples:
 central bank sets interest rates in order to keep inflation within specified
bounds (inflation targeting)
 companies charge prices so as to maximize profit / revenues /
market share
 ministry of education introduces new school fees scheme; the aim is to
collect money without discouraging students from education. Therefore, one
has to find the effect of education on future income

Causal Analysis (cont’d)
 to say something about causality we need to make some more
assumptions
 in practice, these assumptions will have to be checked using an economic theory /
common sense + knowledge of the real problem
 mathematically, the formulation of these assumptions consists in writing
down a structural data-generating model
 this will look similar to what we have been doing, but conceptually it is very
different
 for the description-type analysis, we started with the data and then
asked which model could help summarize the conditional expectation
 for the structural (causal) case we start with the model and then use it to say
what the data will look like (even before we actually collect them)

Structural model

 a simple structural model may look like this


y = β0 + β1 x + u
 what has changed from the descriptive analysis?
 descriptive model: conditional expectation of y = a function of x
E[y|x] = f(x)
 structural model: y = a function of x and u
y = f(x,u)
 it should be clear that modeling y is much more ambitious than
modeling the expectation of y
 we have already encountered u, but this time it has a real content (see later)
 in choosing a particular structural model, we’re actually saying that we believe
that, in reality, the value of y is “created” from x and u using function f
 this is a daring claim, so we need to choose the model very carefully

How About u?

 u is probably the most important part of the structural model


 we’ll spend a lot of time talking about u and its relation to x
(note that the relation of u and y is obvious from the equation)
 what does u contain? Everything that the rest of the right-hand side of the
equation failed to capture
 in the previous example, this means everything that affects your wages
besides education:
 intelligence
 work effort
 …other suggestions?

 in some textbooks, you can read that u contains a couple more things:
 measurement errors (or poor proxy variables)
 the intrinsic randomness (in human behaviour etc.)
 model specification errors
→ these will not be all that relevant for our causal discussion

Crucial Assumption: E[u|x] = 0

 once we have specified the structural model, we need to estimate its
parameters, i.e. β0 and β1
 in order to be able to do so, we need to assume something about the relation
between u and x
 mathematically, the crucial assumption takes on the form E[u|x] = 0
 this is the same as what we did before (in the descriptive analysis), but
conceptually very different
 before u was defined simply u = y – β0 – β1x . It didn’t actually mean
anything
 now we think of u as this real thing that is actually out there and means
something – it is just that we don’t / can’t observe it
 even though we don’t observe it, we can still argue whether the
assumption is fulfilled or not
 in order to do so, we need to know what the assumption tells us in the first
place

Crucial Assumption: E[u|x] = 0 (cont’d)
 as before, the assumption that E[u|x] = 0 really means two separate things,
one of which is a big deal, the other is not:

1. E[u|x] doesn’t vary with x (i.e., it’s a constant).


 this holds true if…
 …u is assigned at random
 …u and x are independent
 …perhaps something else
 and is not true if u and x are correlated
 example: intelligence (a part of u) is correlated with education
(x) → we’re in trouble
(we’ll discuss the possible solutions as we go)

2. Eu = 0 (i.e., the “unconditioned” expectation is zero).


 the important part is 1; we assume 2 just for convenience
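A small simulation can make point 1 above tangible: if u is generated independently of x, the OLS slope recovers the true β1, but if a component of u (say, "intelligence") is correlated with x, the estimate is pulled away from the truth. The data-generating process and all numbers below are made up for illustration.

```python
# Monte Carlo sketch: OLS slope with and without a violation of E[u|x] = 0.
import random

random.seed(1)
beta0, beta1 = 1.0, 0.5      # "true" parameters (made up)
n = 5000

def ols_slope(x, y):
    xb, yb = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((yi - yb) * (xi - xb) for xi, yi in zip(x, y))
    sxx = sum((xi - xb) ** 2 for xi in x)
    return sxy / sxx

# Case 1: u independent of x -> the slope estimate is close to beta1
x = [random.gauss(12, 2) for _ in range(n)]
u = [random.gauss(0, 1) for _ in range(n)]
y = [beta0 + beta1 * xi + ui for xi, ui in zip(x, u)]
print("u independent of x :", ols_slope(x, y))

# Case 2: u contains "intelligence" that is correlated with x -> the estimate is biased upward
intel = [random.gauss(0, 1) for _ in range(n)]
x2 = [12 + 2 * a + random.gauss(0, 1) for a in intel]   # more intelligence -> more education
u2 = [0.8 * a + random.gauss(0, 1) for a in intel]      # intelligence also raises y directly
y2 = [beta0 + beta1 * xi + ui for xi, ui in zip(x2, u2)]
print("u correlated with x:", ols_slope(x2, y2))
```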

E[u|x] = 0 vs. Causation
 what does the assumption tell us about causation?
 remember the three causation schemes between x and y:
y←x , y→x ,
y←z→x
 in the causal analysis, we want to rule out all but the first one
 this is exactly what the assumption about E[u|x] effectively does
 the “arrow scheme” now contains three letters: y, x and u
 we know that u affects y (by definition), y←u
 note that E[u|x] = constant implies cov(x,u) = 0
 therefore, we’d like the arrow scheme to look like this:

 y ← x and y ← u, with no link between x and u: this suggests there should be no
association (correlation) between x and u

E[u|x] = 0 vs. Causation (cont’d)
 imagine we have the reversed causality y → x:

    y → x, with y ← u: here u affects y, which in turn affects x, so u affects x,
    and u and x are correlated

 therefore cov(x,u) ≠ 0 and the assumption is necessarily violated
 now consider the case y ← z → x :
 if there’s any z that affects y, it is a part of u (by definition), and therefore
z (and u) affect x

    y ← z → x, with z a part of u: u contains z, so u affects x,
    meaning the two are correlated

Estimation

 now, let’s take the assumption E[u|x] = 0 for granted


 then, nothing is really any different than in the descriptive case
 we can write down the sample regression function as

    yi = β̂0 + β̂1xi + ûi ,

 we know that

    E[ui] = 0
    E[xiui] = 0

 the sample analogue is

    0 = (1/n) Σi ûi = (1/n) Σi (yi − β̂0 − β̂1xi)
    0 = (1/n) Σi xiûi = (1/n) Σi xi(yi − β̂0 − β̂1xi)

which gives us

    β̂1 = Σi (yi − ȳ)(xi − x̄) / Σi (xi − x̄)² ,    β̂0 = ȳ − β̂1x̄ ,

exactly as before
 it is only the interpretation that has changed

Estimation

 Properties of OLS on any sample of data

 Fitted values and residuals:

    fitted (predicted) values:                         ŷi = β̂0 + β̂1xi
    deviations from the regression line (residuals):   ûi = yi − ŷi

 Algebraic properties of OLS regression:

    1. the deviations from the regression line (residuals) sum up to zero: Σi ûi = 0
    2. the correlation between the deviations and the regressor is zero: Σi xiûi = 0
    3. the sample averages of y and x lie on the regression line: ȳ = β̂0 + β̂1x̄
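These three algebraic facts hold mechanically for any OLS fit and can be verified numerically; the sketch below re-derives the fit on a small made-up sample and prints the three quantities.

```python
# Verify the algebraic properties of OLS residuals on a small made-up sample.
x = [8, 10, 12, 12, 14, 16, 16, 18]
y = [6.0, 7.5, 9.0, 8.0, 11.0, 12.5, 13.0, 15.0]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

y_hat = [b0 + b1 * xi for xi in x]                   # fitted values
u_hat = [yi - yhi for yi, yhi in zip(y, y_hat)]      # residuals

print(sum(u_hat))                                    # property 1: ~0 (up to rounding error)
print(sum(xi * ui for xi, ui in zip(x, u_hat)))      # property 2: ~0
print(y_bar, b0 + b1 * x_bar)                        # property 3: (x_bar, y_bar) lies on the line
```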
Estimation

 For example, CEO number 12‘s salary was 526,023 $ lower than predicted using
the information on his firm‘s return on equity.
Estimation

 Goodness-of-Fit

„How well does the explanatory variable explain the dependent variable?“

 Measures of variation:

    Total sum of squares:     SST = Σi (yi − ȳ)²    (total variation in the dependent variable)
    Explained sum of squares: SSE = Σi (ŷi − ȳ)²    (variation explained by the regression)
    Residual sum of squares:  SSR = Σi ûi²          (variation not explained by the regression)
Estimation

 Decomposition of total variation:

    SST = SSE + SSR
    (total variation = explained part + unexplained part)

 Goodness-of-fit measure (R-squared):

    R² = SSE/SST = 1 − SSR/SST

 R-squared measures the fraction of the total variation that is explained by the
regression.
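A minimal sketch of the decomposition and of R², computed on a small made-up sample (the numbers are for illustration only):

```python
# Goodness of fit on a small made-up sample: SST = SSE + SSR and R-squared.
x = [8, 10, 12, 12, 14, 16, 16, 18]
y = [6.0, 7.5, 9.0, 8.0, 11.0, 12.5, 13.0, 15.0]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]
u_hat = [yi - yhi for yi, yhi in zip(y, y_hat)]

SST = sum((yi - y_bar) ** 2 for yi in y)         # total variation
SSE = sum((yhi - y_bar) ** 2 for yhi in y_hat)   # explained variation
SSR = sum(ui ** 2 for ui in u_hat)               # unexplained variation

print(round(SST, 4), round(SSE + SSR, 4))            # the decomposition holds
print(round(SSE / SST, 4), round(1 - SSR / SST, 4))  # R-squared, two equivalent ways
```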
Estimation

CEO Salary and return on equity

The regression explains only 1.3 %


of the total variation in salaries

Voting outcomes and campaign expenditures

The regression explains 85.6 % of the


total variation in election outcomes
Caution: A high R-squared does not necessarily mean that the regression has a
causal interpretation!
Functional Form (cont’d)
 note that wage = exp(β0 + β1 educ) ↔ log(wage) = β0 + β1 educ
 logarithm transform is one of the basic econometric tools
 the rule to remember: taking the log of one of the variables means we shift
from absolute changes to relative changes:

    regression function          interpretation of β1

    y = β0 + β1x                 Δy = β1 Δx
    log y = β0 + β1x             %Δy = (100·β1) Δx
    y = β0 + β1 log x            Δy = (0.01·β1) %Δx
    log y = β0 + β1 log x        %Δy = β1 %Δx

 constant elasticity model:  log y = β0 + β1 log x + u

 x-elasticity of y:  β1 = E_{y,x} = ∂ log y / ∂ log x = (∂y/∂x)·(x/y) = %Δy / %Δx
Estimation

 Incorporating nonlinearities: semi-logarithmic form

 Regression of log wages on years of education:

    log(wage) = β0 + β1 educ + u

 The dependent variable is the natural logarithm of the wage. This changes the
interpretation of the regression coefficient: 100·β1 is the percentage change of
the wage if years of education are increased by one year.
Estimation

Fitted regression

The wage increases by 8.3 % for


every additional year of education
(= return to education)

For example:

Growth rate of wage is 8.3 %


per year of education
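One caveat: 100·β̂1 is only an approximation to the percentage change; the exact change implied by the log form is 100·(exp(β̂1) − 1). A quick check, assuming the slope behind the slide's "8.3 %" is 0.083:

```python
import math

# Semi-log model log(wage) = b0 + b1*educ: effect of one more year of education.
b1 = 0.083                         # slope consistent with the "8.3 %" quoted on the slide
approx = 100 * b1                  # approximate percentage change: 8.3 %
exact = 100 * (math.exp(b1) - 1)   # exact percentage change: about 8.65 %
print(approx, exact)
```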
Estimation

 Incorporating nonlinearities: log-logarithmic form

 CEO salary and firm sales:

    log(salary) = β0 + β1 log(sales) + u

 Both the CEO‘s salary and his/her firm‘s sales enter as natural logarithms.
This changes the interpretation of the regression coefficient: β1 is the
percentage change of salary if sales increase by 1 %.

 Logarithmic changes are (approximately) percentage changes.
Estimation

CEO salary and firm sales: fitted regression

 For example: +1 % sales → +0.257 % salary

The log-log form postulates a constant elasticity model, whereas the semi-log
form assumes a semi-elasticity model
Standard assumptions for the linear regression model

 Assumption SLR.1 (Linear in parameters)

    y = β0 + β1x + u

 In the population, the relationship between y and x is linear in the parameters β0 and β1.

 Assumption SLR.2 (Random sampling)

    {(xi, yi): i = 1, …, n} is a random sample drawn from the population

 Each data point therefore follows the population equation: yi = β0 + β1xi + ui.
Standard assumptions for the linear regression model

Discussion of random sampling: Wage and education


The population consists, for example, of all workers of country A
In the population, a linear relationship between wages (or log wages) and years of
education holds
Draw completely randomly a worker from the population
The wage and the years of education of the worker drawn are random because
one does not know beforehand which worker is drawn
Throw the worker back into the population and repeat the random draw n times
The wages and years of education of the sampled workers are used to estimate
the linear relationship between wages and education
Standard assumptions for the linear regression model

 The values drawn for the i-th worker: (xi, yi)

 The implied deviation from the population relationship for the i-th worker:

    ui = yi − β0 − β1xi
Standard assumptions for the linear regression model

Assumption SLR.3 (Sample variation in explanatory variable)


 The values of the explanatory variable are not all the same (otherwise it would
be impossible to study how different values of the explanatory variable lead to
different values of the dependent variable).

Assumption SLR.4 (Zero conditional mean)

 The value of the explanatory variable must contain no information about the
mean of the unobserved factors:

    E[u|x] = 0
EViews Output: An Overview

Annotations on the output below: "Mean dependent var" is ȳ = (1/n) Σi yi; "S.D. dependent var" is
sd(y) = √(SST/(n−1)); the coefficient column reports β̂0 (const) and β̂1 (roe); "S.E. of regression" is σ̂.
Model 1: OLS, using observations 1-209
Dependent variable: salary

coefficient std. error t-ratio p-value

const 963.191 213.240 4.517 1.05e-05 ***


roe 18.5012 11.1233 1.663 0.0978 *

Mean dependent var 1281.120 S.D. dependent var 1372.345


Sum squared resid 3.87e+08 S.E. of regression 1366.555
R-squared 0.013189 Adjusted R-squared 0.008421
F(1, 207) 2.766532 P-value(F) 0.097768
Log-likelihood -1804.543 Akaike criterion 3613.087
Schwarz criterion 3619.771 Hannan-Quinn 3615.789

    R² = SSE/SST = 1 − SSR/SST
    σ̂ = √( SSR / (n − k − 1) )
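In practice such a table is produced by a single library call rather than by hand. Below is a minimal sketch using Python's pandas and statsmodels; the file name ceosal.csv and its column names salary and roe are assumptions made for illustration, not files provided with the course.

```python
# Minimal sketch: reproduce an OLS summary table like the one above.
# "ceosal.csv" and its column names are assumptions for illustration only.
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("ceosal.csv")          # expects columns "salary" and "roe"
X = sm.add_constant(data["roe"])          # adds the intercept column
model = sm.OLS(data["salary"], X).fit()

print(model.summary())                    # coefficients, std. errors, t-ratios, R-squared, ...
print(model.params)                       # should be close to 963.191 (const) and 18.501 (roe)
```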
Standard assumptions for the linear regression model

 Theorem 2.1 (Unbiasedness of OLS): under assumptions SLR.1 – SLR.4,

    E[β̂0] = β0   and   E[β̂1] = β1

Interpretation of unbiasedness
The estimated coefficients may be smaller or larger, depending on the sample
that is the result of a random draw
However, on average, they will be equal to the values that characterize the true
relationship between y and x in the population
„On average“ means if sampling was repeated, i.e. if drawing the random sample
and doing the estimation was repeated many times
In a given sample, estimates may differ considerably from true values
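The "on average over repeated samples" idea can be illustrated with a small Monte Carlo experiment: draw many random samples from a known population model (where SLR.1–SLR.4 hold by construction), estimate the slope each time, and average. All parameter values below are made up.

```python
# Monte Carlo illustration of unbiasedness: average the slope estimate over many samples.
import random

random.seed(2)
beta0, beta1 = 2.0, 0.7                   # "true" population parameters (made up)
n, reps = 200, 2000

def ols_slope(x, y):
    xb, yb = sum(x) / len(x), sum(y) / len(y)
    return (sum((yi - yb) * (xi - xb) for xi, yi in zip(x, y))
            / sum((xi - xb) ** 2 for xi in x))

estimates = []
for _ in range(reps):
    x = [random.gauss(10, 3) for _ in range(n)]
    y = [beta0 + beta1 * xi + random.gauss(0, 2) for xi in x]   # SLR.1-SLR.4 hold by construction
    estimates.append(ols_slope(x, y))

print(sum(estimates) / reps)              # close to the true beta1 = 0.7
print(min(estimates), max(estimates))     # individual estimates scatter around it
```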
Standard assumptions for the linear regression model

Variances of the OLS estimators


Depending on the sample, the estimates will be nearer or farther away from the
true population values
How far can we expect our estimates to be away from the true population values
on average (= sampling variability)?
Sampling variability is measured by the estimator‘s variances

Assumption SLR.5 (Homoskedasticity)

 The value of the explanatory variable must contain no information about the
variability of the unobserved factors:

    Var(u|x) = σ²
Standard assumptions for the linear regression model

Graphical illustration of homoskedasticity

 The variability of the unobserved influences does not depend on the value of
the explanatory variable.
The Simple
Regression Model
An example for heteroskedasticity: Wage and education

The variance of the unobserved


determinants of wages increases
with the level of education
The Simple
Regression Model
Theorem 2.2 (Variances of OLS estimators)

 Under assumptions SLR.1 – SLR.5:

    Var(β̂1) = σ² / Σi (xi − x̄)²
    Var(β̂0) = σ² · (1/n) Σi xi² / Σi (xi − x̄)²
 Conclusion:
The sampling variability of the estimated regression coefficients is larger when
the variability of the unobserved factors is larger, and smaller when the
variation in the explanatory variable is larger.
The Simple
Regression Model
Estimating the error variance

 The variance of u does not depend on x, i.e. it is equal to the unconditional variance:

    Var(u|x) = Var(u) = σ²

One could estimate the variance of the


errors by calculating the variance of the
residuals in the sample; unfortunately this
estimate would be biased

 An unbiased estimate of the error variance can be obtained by subtracting the
number of estimated regression coefficients from the number of observations in
the denominator:

    σ̂² = SSR / (n − 2)
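A sketch of the corrected estimator: divide SSR by n − 2 (two estimated coefficients in simple regression) rather than by n. Again computed on made-up data:

```python
# Unbiased estimate of the error variance: sigma_hat^2 = SSR / (n - 2), on made-up data.
x = [8, 10, 12, 12, 14, 16, 16, 18]
y = [6.0, 7.5, 9.0, 8.0, 11.0, 12.5, 13.0, 15.0]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
SSR = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

naive = SSR / n               # residual variance without the correction (biased downward)
unbiased = SSR / (n - 2)      # corrected for the two estimated coefficients
sigma_hat = unbiased ** 0.5   # "standard error of the regression"
print(naive, unbiased, sigma_hat)
```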
Three Approaches: A Comparison

 descriptive approach
 goal: find the association between x and y expressed in terms of
conditional expectation E[y|x]
 assumptions: approximate functional form of E[y|x]
 causal approach
 goal: find the causal effect of x on y
 assumptions:
 structural model for y (“how y is created”)
 assumptions about u: E[u|x] = 0 (and thus, E[xu] = 0)
 forecasting approach
 goal: predict future values of y based on the knowledge of future x
 assumptions: approximate functional form of the relation “y vs. x”
→ different goals, different assumptions, same formulas for estimates
→ the econometric software has only one procedure for all three cases,
you have to know what you’re doing, check the assumptions etc.
