05 Diagnostic Test of CLRM 2
Multicollinearity
Model Specification Errors
Topics
Multicollinearity
Under-/overfitting a model
Functional Form
Errors of Measurement
07/04/15
MULTICOLLINEARITY
Nature of Multicollinearity
Multicollinearity: the explanatory variables are highly, but not perfectly, correlated with one another.
Perfect multicollinearity: one explanatory variable is an exact linear combination of the others. In that case $X'X$ is singular, so the OLS estimator $\hat{\beta} = (X'X)^{-1}X'y$ cannot be computed.
Sources of Multicollinearity
Detection of Multicollinearity
Scatterplots of the explanatory variables against one another
Condition number: the ratio of the maximum to the minimum eigenvalue of $X'X$; a large value indicates collinearity
VIF: if the VIF of a variable is higher than 10, that variable is said to be highly collinear
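The VIF rule above can be sketched numerically. A minimal implementation, assuming the standard definition $VIF_j = 1/(1 - R_j^2)$ where $R_j^2$ comes from regressing variable $j$ on the remaining regressors (all variable names here are illustrative):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (no constant column).
    VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the
    other columns plus an intercept."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
x2 = rng.normal(size=200)
x3 = x2 + 0.1 * rng.normal(size=200)   # nearly collinear with x2
X = np.column_stack([x2, x3])
print(vif(X))                          # both VIFs far above 10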
Suppose the true model is
$y_t = \beta_1 + \beta_2 X_{2t} + \beta_3 X_{3t} + u_{1t}$   (1)
but we omit $X_{3t}$ and estimate
$y_t = \alpha_1 + \alpha_2 X_{2t} + u_{2t}$   (2)
so that the error of the misspecified model absorbs the omitted variable:
$u_{2t} = \beta_3 X_{3t} + u_{1t}$   (3)
Consequences
If the omitted variable is correlated with the included variable in the regression (the correlation coefficient between the two variables is nonzero), then the parameter estimates (intercept and slope coefficients) are biased as well as inconsistent.
$E(\hat{\alpha}_2) = \beta_2 + \beta_3 b_{32}$, where $b_{32} = \dfrac{\sum x_{2t} x_{3t}}{\sum x_{2t}^2}$ is the slope from regressing $x_{3t}$ on $x_{2t}$.
The variance in the misspecified model is $\mathrm{var}(\hat{\alpha}_2) = \dfrac{\sigma^2}{\sum x_{2t}^2}$, whereas in the true model $\mathrm{var}(\hat{\beta}_2) = \dfrac{\sigma^2}{\sum x_{2t}^2 (1 - r_{2,3}^2)}$.
If the omitted variable is not correlated with the included variable in the regression (the correlation coefficient between the two variables is zero), the intercept is still biased.
The variance of the regression is incorrectly estimated.
The variances of the intercept and the other coefficients are biased.
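The bias formula $E(\hat{\alpha}_2) = \beta_2 + \beta_3 b_{32}$ can be checked with a small Monte Carlo sketch; the parameter values and seed below are illustrative, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(1)
T, beta2, beta3 = 500, 1.0, 2.0
x2 = rng.normal(size=T)
x3 = 0.5 * x2 + rng.normal(size=T)       # correlated with x2
b32 = (x2 @ x3) / (x2 @ x2)              # slope of x3 regressed on x2

est = []
for _ in range(2000):
    u = rng.normal(size=T)
    y = beta2 * x2 + beta3 * x3 + u      # true model (zero intercept for brevity)
    alpha2 = (x2 @ y) / (x2 @ x2)        # short regression: x3 omitted
    est.append(alpha2)

# The simulated mean of alpha2_hat matches beta2 + beta3 * b32
print(np.mean(est), beta2 + beta3 * b32)
```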
Consequences
Suppose the true model is
$y_t = \beta_1 + \beta_2 X_{2t} + u_{1t}$   (1)
but we include the irrelevant variable $X_{3t}$ and estimate
$y_t = \alpha_1 + \alpha_2 X_{2t} + \alpha_3 X_{3t} + u_{2t}$   (2)
The estimators remain unbiased and consistent, but they are inefficient:
$\mathrm{var}(\hat{\beta}_2) = \dfrac{\sigma^2}{\sum x_{2t}^2}$ and $\mathrm{var}(\hat{\alpha}_2) = \dfrac{\sigma^2}{\sum x_{2t}^2 (1 - r_{2,3}^2)}$,
then
$\dfrac{\mathrm{var}(\hat{\alpha}_2)}{\mathrm{var}(\hat{\beta}_2)} = \dfrac{1}{1 - r_{2,3}^2} \ge 1$
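The variance-inflation ratio $1/(1 - r_{2,3}^2)$ can also be verified by simulation. A sketch with illustrative numbers (the irrelevant regressor $x_3$ is correlated with $x_2$ but absent from the true model):

```python
import numpy as np

rng = np.random.default_rng(2)
T, beta2, reps = 200, 1.0, 4000
x2 = rng.normal(size=T)
x3 = 0.8 * x2 + 0.6 * rng.normal(size=T)   # correlated but irrelevant
X_short = np.column_stack([np.ones(T), x2])
X_long = np.column_stack([np.ones(T), x2, x3])

short, long_ = [], []
for _ in range(reps):
    y = beta2 * x2 + rng.normal(size=T)    # true model excludes x3
    short.append(np.linalg.lstsq(X_short, y, rcond=None)[0][1])
    long_.append(np.linalg.lstsq(X_long, y, rcond=None)[0][1])

r23 = np.corrcoef(x2, x3)[0, 1]
# Empirical variance ratio versus the theoretical 1/(1 - r23^2)
print(np.var(long_) / np.var(short), 1 / (1 - r23 ** 2))
```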
Log-Linear: $\ln(y_t) = \beta_1 + \beta_2 x_{2t} + u_t$. A 1-unit increase in $x_{2t}$ causes a $100\beta_2\%$ increase in $y_t$.
Linear-Log: $y_t = \beta_1 + \beta_2 \ln(x_{2t}) + u_t$. A 1% increase in $x_{2t}$ causes a $0.01\beta_2$-unit increase in $y_t$.
Double Log: $\ln(y_t) = \beta_1 + \beta_2 \ln(x_{2t}) + u_t$. A 1% increase in $x_{2t}$ causes a $\beta_2\%$ increase in $y_t$.
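The double-log (elasticity) interpretation is easy to confirm numerically; the coefficient values below are made up for illustration:

```python
import numpy as np

# With ln(y) = b1 + b2*ln(x), a 1% rise in x should raise y by about b2 percent.
b1, b2 = 0.5, 0.7
x = 10.0
y0 = np.exp(b1 + b2 * np.log(x))
y1 = np.exp(b1 + b2 * np.log(x * 1.01))   # x increased by 1%
pct_change = 100 * (y1 / y0 - 1)
print(pct_change)                          # close to b2 = 0.7 (percent)
```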
Ramsey's RESET test for functional form:
1. Regress $y_t = \beta_1 + \beta_2 X_{2t} + u_t$ and obtain the fitted values $\hat{y}_t$.
2. Run the auxiliary regression of the residuals on powers of the fitted values: $\hat{u}_t = \alpha_1 + \alpha_2 \hat{y}_t^2 + \alpha_3 \hat{y}_t^3 + \dots + \alpha_{p-1} \hat{y}_t^p + v_t$.
3. Obtain $R^2$ from this regression. The test statistic is given by $TR^2$ and is distributed as a $\chi^2(p-1)$.
So if the value of the test statistic is greater than the critical value of $\chi^2(p-1)$, then reject the null hypothesis that the functional form is correct.
Some nonlinear models can be linearised by taking logarithms: $y_t = A x_t^{\beta} e^{u_t} \Rightarrow \ln y_t = \alpha + \beta \ln x_t + u_t$, where $\alpha = \ln A$.
We can test the implicit assumption that the parameters are constant over the whole sample using parameter stability tests. The idea is essentially to split the data into sub-periods, estimate up to three models (one for each of the sub-parts and one for all the data), and then compare the RSS of the models.
A problem with the Chow test is that we need enough data to run the regression on both sub-samples, i.e. T1 >> k, T2 >> k.
An alternative formulation is the predictive failure test.
What we do with the predictive failure test is estimate the regression over a long sub-period (i.e. most of the data), then predict values for the other period and compare the two.
Test statistic $= \dfrac{RSS - RSS_1}{RSS_1} \times \dfrac{T_1 - k}{T_2}$
where $T_2$ = number of observations we are attempting to predict. The test statistic will follow an $F(T_2, T_1 - k)$ distribution.
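The statistic above can be computed directly. A sketch with a simulated structural break in the last $T_2$ observations (the break size, seed, and sample sizes are invented for illustration):

```python
import numpy as np
from scipy import stats

def predictive_failure(y, X, T2):
    T, k = X.shape
    T1 = T - T2
    # RSS1: regression over the long sub-period only
    b1, *_ = np.linalg.lstsq(X[:T1], y[:T1], rcond=None)
    rss1 = np.sum((y[:T1] - X[:T1] @ b1) ** 2)
    # RSS: regression over the whole sample
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ b) ** 2)
    stat = (rss - rss1) / rss1 * (T1 - k) / T2
    return stat, stats.f.ppf(0.95, T2, T1 - k)

rng = np.random.default_rng(4)
T, T2 = 120, 20
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = 1 + 2 * X[:, 1] + rng.normal(size=T)
y[-T2:] += 5.0                      # structural break in the prediction period
stat, crit = predictive_failure(y, X, T2)
print(stat > crit)                  # the break is detected
```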
Example: the computed test statistic is 0.164.
[Figure: Value of Series ($y_t$) plotted over the Sample Period]
Skewness and kurtosis are the (standardised) third and fourth moments of a
distribution.
Bera and Jarque formalise this by testing the residuals for normality: they test whether the coefficient of skewness and the coefficient of excess kurtosis are jointly zero.
$W = T\left[\dfrac{b_1^2}{6} + \dfrac{(b_2 - 3)^2}{24}\right] \sim \chi^2(2)$
where $b_1$ is the coefficient of skewness and $b_2$ is the coefficient of kurtosis.
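The Bera-Jarque statistic is simple to compute from the standardised third and fourth moments; the sketch below (illustrative simulated residuals) cross-checks the hand computation against SciPy's built-in `jarque_bera`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
u = rng.normal(size=1000)                 # simulated residuals under the null
T = len(u)
b1 = stats.skew(u)                        # coefficient of skewness
b2 = stats.kurtosis(u, fisher=False)      # coefficient of kurtosis (not excess)
W = T * (b1 ** 2 / 6 + (b2 - 3) ** 2 / 24)
print(W, stats.jarque_bera(u)[0])         # the two values coincide
```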
Central Limit Theorem: if $Y_1, \dots, Y_N$ are i.i.d. with mean $\mu$ and variance $\sigma^2$, then
$Z_N = \dfrac{\bar{Y} - \mu}{\sigma / \sqrt{N}}$
has a probability distribution that converges to the standard normal $N(0, 1)$ as $N \to \infty$.
Equivalently, $\bar{Y}$ is approximately distributed as $N(\mu, \sigma^2/N)$ for large $N$.
Example of CLT
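As one possible illustration of the CLT (the choice of Uniform(0, 1) draws here is an assumption, not the lecture's original example), the standardised sample mean is approximately standard normal even though the underlying draws are far from normal:

```python
import numpy as np

rng = np.random.default_rng(6)
N, reps = 100, 20000
mu, sigma = 0.5, np.sqrt(1 / 12)          # mean and sd of Uniform(0, 1)
Y = rng.uniform(size=(reps, N))
Z = (Y.mean(axis=1) - mu) / (sigma / np.sqrt(N))
print(Z.mean(), Z.std())                   # approximately 0 and 1
```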
The first step is to form a large model with many candidate variables on the right-hand side
This is known as a GUM (generalised unrestricted model)
At this stage, we want to make sure that the model satisfies all of the
assumptions of the CLRM
If the assumptions are violated, we need to take appropriate actions to
remedy this, e.g.
- taking logs
- adding lags
- dummy variables
We need to do this before testing hypotheses
Once we have a model which satisfies the assumptions, it could still be very large, with many lags and independent variables
Questions?