Applied Eco No Metrics

Applied Econometrics
15. Juli 2010

Inhaltsverzeichnis
1 Introduction and Motivation 3
1.1 3 motivations for CLRM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 The Classical Linear Regression Model (CLRM): Parameter Estimation by OLS 7
3 Assumptions of the CLRM 10
4 Finite sample properties of the OLS estimator 13
5 Hypothesis Testing under Normality 16
6 Condence intervals 22
6.1 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
7 Goodness-of-t measures 24
8 Introduction to large sample theory 25
8.1 A. Modes of stochastic convergence . . . . . . . . . . . . . . . . . . . . . . . . . . 26
8.2 B. Law of Largen Numbers (LLN) . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
8.3 C. Central Limit Theorems (CLTs) . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
8.4 D. Useful lemmas of large sample theory . . . . . . . . . . . . . . . . . . . . . . . . 30
8.5 Large sample properties for the OLS estimator . . . . . . . . . . . . . . . . . . . . . 31
9 Time Series Basics (Stationarity and Ergodicity) 34
10 Generalized Least Squares 37
11 Multicollinearity 39
12 Endogeneity 40
Applied Econometrics Inhaltsverzeichnis
13 IV estimation 42
14 Questions for Review 45
2
Applied Econometrics 1 Introduction and Motivation
1 Introduction and Motivation
What is econometrics?
Econometrics=economic statistics/data analysis economic theory mathematics
Conceptional:
Data perceived as realizations of random variables
Parameters are real numbers, not random variables
Joint distributions of random variables depend on parameters
General regression equations
Generalization: y
i
=
1
x
i1
+
2
x
i2
+... +
K
x
iK
+
i
(linear model) mit:
y
i
= dependent variable (observable)
x
ik
=explanatory variable (observable)
i
=unobservable component
Index of observations i=1,2,...,n (for individuals, time etc.) and regressors k=1,2...,K
y
i
(1x1)=
(1xK)
x
i
(Kx1) +
i
(1x1) (1)
=
1
...
andx
i
=
x
i1
...
x
iK
(2)
Structural parameters=suggested by economic theory
The key problem of econometrics: We deal with non-experimental data
Unobservable variables, interdependence, endogeneity, (reversed) causality
1.1 3 motivations for CLRM
Theory: Glosten/Harris
Notation:
Transaction price: P
t
3
Indicator of transaction type:
Q
t
=
1 buyer initiated trade

1 seller initiated trade
(3)
Trade volume: v
t
Drift parameter:
Earnings/costs of the market maker: c (operational costs, asymmetric information costs)
Private information: Q
t
z
t
Public information:
t
Ecient/fair price:m
t
= +m
t1
+
t
. .. .
Random walk with drift
+Q
t
z
t
mit z
t
= z
0
+z
1
v
t
Market maker sets:
Sell price (ask): P
a
t
= +m
t1
+
t
+z
t
+c
Buy price
. .. .
(for oneself)
(bid): P
b
t
= +m
t1
+
t
z
t
c
Spread: market maker anticipates price impact
Transactions occur at ask or bid price:
P
t
P
t1
= +m
t1
+
t
+Q
t
z
t
+cQ
t
[m
t1
+cQ
t1
] (4)
P
t
= +z
0
Q
t
+z
1
v
t
Q
t
+cQ
t
+
t
(5)
Observable: P
t
, Q
t
, v
t
x
i
= [1, Q
t
, v
t
Q
t
, Q
t
]
and y
i
= P
t
Unobservable:
t
Estimation of unknown structural parameters = (, z
0
, z
1
, c)
Theory: Asset pricing

Finance theory: Investors compensated for holding risk; demand expected return beyond risk-free
rate
x
j
t+1
: payo of risky asset j f
1,t+1
...f
R,t+1
(f:risk factor)
Linear (risk) factor models:
E(R
ej
t+1
)
. .. .
expected asset return
=
j
....
exposure of payo asset j to factor k risk

....
price of factor k risk
mit
j
= (
j
1
, ...,
j
k
)
und
4
= (
1
, ...,
k
)
R
ej
t+1
= R
j
t+1
. .. .
gross return
R
f
t+1
. .. .
risk-free rate
(6)
mit gross return=
x
j
t+1
P
j
t
=
P
j
t+1
+d
j
t+1
P
j
t
Single risk factor:
j
=
Cov(R
ej
,f)
V ar(f)
Some AP models:
CAPM: f = R
em
= (R
m
R
f
)
. .. .
marketrisk
Fama French (FF): f = (R
em
, HML, SMB)
f
1
, ..., f
K
excess returns themselves
Then = [E(f
1
), ..., E(f
K
)]
CAPM: E(R
ej
t+1
) =
j
E(R
em
t+1
) und FF: E(R
ej
t+1
) =
j
1
E(R
em
t+1
)+
j
2
E(HML
t+1
)+
j
3
E(SMB
t+1
)
To estimate risk loadings , we formulate sampling (regression, compatible) model:
R
ej
t+1
=
1
R
em
t+1
+
2
HML
t+1
+
3
SMB
t+1
+
j
t+1
(7)
Assume E(
j
t+1
|R
em
t+1
, HML
t+1
, SMB
t+1
) = 0 (implies E(
j
t+1
) = 0).
This model is compatible with the theoretical model which gets clear when you use expected values
on both sides.
This sampling model does not contain a constant:
j
0
= 0 (from theory) = testable restriction
Theory: Mincer equation
ln(WAGE
i
) =
1
+
2
S
i
+
3
TENURE
i
+
4
EXPR
i
+
i
(8)
Notation:
Logarithm of the wage rate: ln(WAGE
i
)
Years of schooling: S
i
Experience in the current job: TENURE
i
Experience in the labor market: EXPR
i
5
Estimation of the parameters
k
, where
2
: return to schooling
Statistical specication
E(y
i
|x
i
) = x
i
....
(x
i1
,...,x
ik
)

k
....
(
1
,...,
k
)
(linear function of the xs)

Marginal eect:
E(y
i
|x
i
)
x
K
=
K
(9)
Compatible regression model: y
i
= x
i
+
i
with E(y
i
|x
i
) = x
i
LTE (law of total expectation)
implies:
E(
i
|x
i
) = 0
Cov(
i
, x
ik
) = 0k
(10)
The specication E(y
i
|x
i
) = x
i
is ad hoc Alternative: non-parametric regression (leaves the
functional form open).
Justifyable by normality assumption:
Y
X
BV N
2
y

xy
xy

2
x
(11)
Y
X
1
...
X
k
....
(k+1x1)
MV N
x
....
(k+1x1)

2
y

xy
xy

2
x
(12)
mit
x
= E(x) = [E(x
1
)...E(x
k
)] und x = V ar(x) = kxk
E(y|x) =
y
+
1
x

yx
(x
x
) = +

x
. .. .
linear conditional mean
=
y
x
und

=
1
x

yx
V ar(y|x) =
2
y
yx
yx
. .. .
does not depend on x, homoscedasticity
Rubins causal model = Regression analysis of experiments
STAR experiment = small class experiment
Treatment: binary variable D
i
= 0, 1 (class size)
Outcome: y
i
(SAT scores)
Does D
i
y
i
? (causal eects)
Potential outcome:
One of the outcomes is hypothetical =
y
1i
ifD
i
= 1
y
0i
ifD
i
= 0
(13)
6
Applied Econometrics 2 The Classical Linear Regression Model (CLRM): Parameter
Estimation by OLS
Actual outcome: observed y
i
= y
0i
+ (y
1i
y
0i
)
. .. .
causaleffect
D
i
(the eect may be identical or dierent across
individuals i)
Uses causality explicitly in contrast to the other models.
Y
i
=
....
E(y
oi
)
+
....
y
1i
y
oi
D
i
+
....
y
oi
E(y
oi
)
+ z
....
STAR(gender, race, free lunch)
(14)
is constant across i. Goal: estimate
STAR: experimentrandom assignment of i to treatment: E(y
oi
|D
i
= 1) = E(y
oi
|D
i
= 0)
In a non-experimentselection bias: E(y
oi
|D
i
= 1) E(y
oi
|D
i
= 0) = 0
E(y
i
|D
i
= 1) = + E(
i
|D
i
= 1)
. .. .
E(y
oi
|D
i
=1)
(15)
sowie
E(y
i
|D
i
= 0) = +E(
i
|D
i
= 0)
. .. .
E(y
oi
|D
i
=0)
(16)
E(y
i
|D
i
= 1) E(y
i
|D
i
= 0) = + selectionbias
. .. .
E(y
io
|D
i
=1)E(y
oi
|D
i
=0)
Another way, apart from experiments, to avoid selection bias are natural (quasi) experiments.
2 The Classical Linear Regression Model (CLRM): Parameter
Estimation by OLS
Classical linear regression model:
y
i
=
1
x
i1
+
2
x
i2
+... +
K
x
iK
+
i
= x
i
....
(1xK)

....
(Kx1)
+
i
(17)
y
i
=(y
1
, ..., y
n
): Dependent variable, observed
x
i
=(x
i1
, ..., x
iK
): Explanatory variable, observed
=(
1
, ...,
K
): Unknown parameters
i
=Disturbance component, unobserved
b =(b
1
, ..., b
K
): Estimate of
e
i
=y
i
x
i
b: Estimated residual
7
Estimation by OLS
Based on i.i.d. variables, which refers to the independence of the is, not the xs.
Preferred technique: Least-squares (instead of maximum-likelihood and moments).
Two-sidedness (ambiquity): xs as random variables or realisations
For convenience we introduce matrix notation:
y
....
(nx1)
= X
....
(nxK)

....
(Kx1)
+
....
(nx1)
(18)
Constant: (x
11
, ..., x
n1
)=(1,...,1)
Writing extensively: A system of linear equations
y
1
=
1
+
2
x
12
+... +
K
x
1K
+
1
(19)
y
2
=
1
+
2
x
22
+... +
K
x
2K
+
2
(20)
... (21)
y
n
=
1
+
2
x
n2
+... +
K
x
nK
+
n
(22)
(23)
OLS estimation in detail:
Estimate by choosing b:
argmin(b)
. .. .
minimizesumofsquaredresiduals
e
2
i
=

(y
i
x
i
b)
2
=

(y
i
x
i1
... x
iK
)
2
= S(b)
Dierentiate with respect to b
1
, ..., b
K
FOC
S(b)
b
1
= 2[
y
i
x
i
b] =
1
n
e
i
= 0 (24)
S(b)
b
2
= 2[
y
i
x
i
b]x
i2
=
1
n
e
i
x
i2
= 0 (25)
... (26)
S(b)
b
K
= 2[
y
i
x
i
b]x
iK
=
1
n
e
i
x
iK
= 0 (27)
(28)
8
Estimation by OLS
System of K equations and K unknowns: solve for b (OLS estimator)
Characteristics:
e = 0
Cov(e
i
, x
i2
) = 0
Cov(e
i
, x
iK
) = 0
With 1 regressor:
y
i
=
1
+
2
x
i2
+
i
b
2
=
y
i
x
i2
y
i
x
i2
x
2
i2
[
x
i2
]
2
=
sample cov
sample var
Solution more complicated for K2 The system of K equations is solved by matrix algebra:
e=y-Xb FOC rewritten:
e
i
= 0 (29)
x
i2
e
i
= 0 (30)
... (31)
x
iK
e
i
= 0 (32)
X
e = 0 (33)
Extensively:
1...1
x
12
...x
n2
...
x
1K
...x
nK
e
1
e
2
...
e
n
(34)
X
e = X
(y Xb) = X
y
....
Kx1
X
X
. .. .
KxK
b
....
Kx1
= 0
XX has a rank of K (full rank, then [X
X]
1
exists and premultiplying by [X
X]
1
results in:
[X
X]
1
Xy-[X
X]
1
XXb=0
[X
X]
1
Xy-Ib=0
b=[X
X]
1
Xy
Alternative notation:
b = (
1
n
X
X)
1
1
n
X
y = (
1
n
x
i
x
i
)
1
. .. .
matrixofsamplemeans
1
n
x
i
y
i
. .. .
vectorofsamplemeans
(35)
Questions:
Properties? Unbiased, ecient, consistent?
9
Applied Econometrics 3 Assumptions of the CLRM
3 Assumptions of the CLRM
The four core assumptions of CLRM
1.1 Linearity in parameters: y
i
= x
i
+
i
This is not too restrictive, because reformulation is possible, e.g. using ln, quadratics
1.2 Strict exogeneity: E(
i
|X) = E(
i
|x
11
, ..., x
1k
, ..., x
i1
, ..., x
ik
, ..., x
n1
, ..., x
nk
) = 0
Implications:
a) E(
i
|x) = 0 E(
i
|x
ik
) = 0
. .. .
by LTE
E(
i
) = 0
. .. .
by LTE
b) E(
i
x
jk
)
a)
....
= Cov(
i
, x
jk
) = 0
. .. .
by LTE
i,j,k (use LTE, LIE)
unconditional moment restrictions (compare to OLS FOC)
Example where this may be violated:
ln(wages
i
) =
1
+
2
S
i
....
(+)
+... +
i
Ability
crime
i
. .. .
in district i
=
1
+
2
Police
i
. .. .
()
+... +
i
Social factors
unempl
i
. .. .
in country i
=
1
+
2
Lib
i
....
()
+... +
i
Macro shock
When E(
i
|x) = 0 Endogeneity (prevents us from estimating
s consistently).
Discussion: Endogeneity and sample selection bias
Rubins causal model: y
i
= +D
i
+
i
; E(y
i
|D
i
= 1)E(y
i
|D
i
= 0) = +E(y
oi
|D
i
= 1) E(y
oi
|D
i
= 0)
. .. .
selection bias
D
i
and {y
oi
, y
1i
}
. .. .
partly unobservable
assumed independent {y
oi
, y
1i
}D
i
With independence : E(y
oi
|D
i
= 1) = E(y
oi
|D
i
= 0) = E(y
oi
) =E(
i
....
i
| D
i
....
x
ik
) = E(
i
) = 0
10
Independence is normally not the case because of endogeneity and sample selection bias.
Conditional independence assumption (CIA): {y
oi
, y
1i
}D
i
|x
i
selection bias vanishes conditioning
on x
i
How do we instrumentalize CIA: by adding control variables to the right-hand side of the equation
Example: Mincer-type regression
ln(wage
i
) =
1
+
2
Highschool
i
+
3
Ten
i
+
4
Exp
i
+
5
Ability
i
+
6
Family
i
. .. .
Control variables
+
i
I assume
i
Highschool|Ability, Family,... Justies E(
i
|Highschool, ...) = 0
CIA justies inclusion control variable and E(
i
|x) = 0
Matching= sorting individuals into groups and then comparing the outcomes
1.3 No exact multicollinearity, P(rank(X) = K)
. .. .
Bernoulli distributed (is a random variable)
= 1
No linear dependencies in the data matrix, otherwise (X
X)
1
does not exist
Does not refer to a high correlation between the Xs
1.4 Spherical disturbances: V ar(
i
|x) = E(
2
i
|x) = V ar(
i
) =
2
i Homoscedasticity (relates to
the MVN)
Cov(
i
,
j
|x) = E(
i

j
|x) = E(
i

j
) = 0 No serial correlation
E(
1
, ...,
n
)
E[
|x] =
E(
2
1
|x) ... ...
E(
2

1
|x) ... ...
... ... E(
2
n
|x)
2
... 0
... ... ...
0 ...
2
= Cov(|x) (36)
By LTE E(
) = V ar() =
2
In
V ar(
i
) = E(
2
i
) =
2
Cov(
i
,
j
) = E(
i
, ...,
j
) = 0i = j
Interpreting the parameters of dierent types of linear equations
Linear model: y
i
=
1
+
2
x
i2
+ ... +
K
x
iK
+
i
. A one unit increase in the independent variable
x
iK
increases the dependent variable by
K
units
Semi-log form: ln(y
i
) =
1
+
2
x
i2
+... +
K
x
iK
+
i
. A one unit increase in the independent variable
increases the dependent variable approximately by 100*
k
percent
Log-linear model: ln(y
i
) =
1
ln(x
i1
) +
2
ln(x
i2
) +... +
K
ln(x
iK
) +
i
. A one percent increase in
x
iK
increases the dependent variable y
i
approximately by
k
percent.
e.g. y
i
= A x
i1
x
i2
i
(Cobb-Douglas)lny
i
= lnA+lnx
i1
+lnx
i2
+ln
i
11
Before the OLS proofs a useful tool: Law of total expectations (LTE)
E(y|X = x) =
y f
y|x
(y, x)dy =
y
f
xy
(x, y)
f
x
(x)
dy (37)
Using random variable x:
E(y|x) =
y
f
xy
(x, y)
f
x
(x)
dy
. .. .
g(x)
(38)
E
x
(g(x)) =
g(x)f
x
(x)dx =
y
f
xy
(x, y)
f
x
(x)
dy]f
x
(x)dx =
f
xy
dxdy =
yf
y
(y)dy = E(y)
(39)
E
x
[ E
y|x
[y|x]
. .. .
measurable function of x
] = E
y
(y) E[E(y|x)] = E(y) (40)
Notes:
works when X a vector
forecasting interpretation
LTE extension: Law of iterated expectation (LIE)
E
z|x
[[ E
y|x,z
(y|x, z)
. .. .
measurable function of x,z
]|x] = E(y|x) (41)
Other important laws
Double Expectation Theorem (DET):
E
x
[E
y|x
(g(y)|x)] = E
y
(g(y)) (42)
Generalized DET:
E
x
[E
y|x
(g(x, y))|x] = E
x,y
(g(x, y)) (43)
Linearity of Conditional Expectations:
E
y|x
[g(x)y|x] = g(x)E
y|x
[y|x] (44)
12
Applied Econometrics 4 Finite sample properties of the OLS estimator
4 Finite sample properties of the OLS estimator
Finite Sample Properties of b = (X
X)
1
X
y
1. With 1.1-1.3 and holding for any sample size: E[b|X] = and by LTE: E[b] =
unbiasedness
2. With 1.1-1.4: V ar[b|X] =
2
[X
X]
1
(importance for testing, depends on the data)
conditional variance
3. With 1.1-1.4: OLS is ecient: V ar[b|X] V ar[
|X]
Gauss-Markov theorem
4. OLS is BLUE
Starting point: sampling error
b =
b
1
1
...
b
K

K
= [X
X]
1
X
y = [X
X]
1
X
[X +] = [X
X]
1
X
(45)
Dening A = [X
X]
1
X
, so that:
A =
a
11
... a
1n
... ... ...
a
k1
... a
kn
1
...
(46)
whereas A is treated as a constant
Derive unbiasedness
Step 1:
E(b |X) = E(A |X) = A E(|X)
. .. .
=0(assumption 1.2)
= 0 (47)
Step 2: b conditionally unbiased
E(b |X) = E(b|X) E(|X) = E(b|X)
Step1
....
= 0 E(b|X) = (48)
Step 3: OLS unbiased
E
x
[E(b|X)]
LTE
....
= E(b)
Step2
....
= (49)
13
Derive conditional variance
V ar(b|X) = V ar(b|X)
sampling error
....
= V ar(A|X) = AV ar(|x)A
= AE(
|x)A
1.4
....
= A
2
InA
=
2
AA
(50)
Using [BA]=AB and inserting for A:
2
[X
X]
1
X
X[X
X]
1
=
2
[X
X]
1
= V ar(b|X) (51)
Und es gilt:
V ar(b|X) =
V ar(b
1
|x) Cov(b
2
, b
1
|x) ...
Cov(b
1
, b
2
|x) ... ...
... ... V ar(b
k
|x)
(52)
Derive Gauss-Markov-Theorem
V ar(
|x) V ar(b|x)
This refers to the fact that we are claiming that the elementwise dierence V ar(
|x) V ar(b|x) is
positive semi-denite.
Insertion:
A and B: separate matrizes, same size
A B if A-B is positive semi-denite
C
....
kxk
is psd if X
....
1xk
C
....
kxk
X
....
kx1
0 for all x= 0
a
[V ar(
|x) V ar(b|x)]a 0a = 0 (53)

If this is true for all a, it is true in particular for a=[1,0,...,0] which leads to: V ar(
1
|x) V ar(b
1
|x)
This works also for b
2
, ..., b
k
with the respective a. This means that any conditional variance of is
bigger than the repsective conditional variance.
The proof:
Note that b=A*y
is linear in y and unbiased and

=C*y with C being a function of X
D=C-A C=D+A
Step 1:
= (D +A)y = Dy +Ay = D(X +) +b = DX +D +b (54)

Step 2:
E(
|x) = DX +E(D|x) +E(b|X) = DX +D E(|x)

. .. .
=0
+ = DX + (55)
14
As we are working with an unbiased estimator: DX=0
Step 3: Going back to Step 1
= D +b (56)

. .. .
s.e.of
= D + b
. .. .
s.e.ofb=A
= [D +A] (57)
Step 4:
V ar(
|x) = V ar(
|x) = V ar[(D +A)|x] = [D +A]V ar(|x)[D +A]
=
2
[D +A][D +A]
(58)
=
2
[DD
+AD
+DA
+AA
](59)
Es gilt:
AA
= [X
X]
1
X
X[X
X]
1
= [X
X]
1
AD
= [X
X]
1
X
= [X
X]
1
[DX]
= 0 (as DX=0)
DA
= D[[X
X]
1
X
= DX[X
X]
1
= 0 (as DX=0)
V ar(
|x) =
2
[DD
+ [X
X]
1
] (60)
Step 5: Es muss also gelten
a
[
2
[DD
+ [X
X]
1
] [X
X]
1
2
]a 0 (61)
a
[
2
DD
]a 0 (62)
(63)
We have to show that aDDa 0 a is sdp
As zz=
z
2
i
0 a the above is true.
The OLS estimator is BLUE
OLS is linear: Holds under assumption 1.1
OLS is unbiased: Holds under assumption 1.1-1.3
OLS is the best estimator: Holds under the Gauss Markov theorem
15
Applied Econometrics 5 Hypothesis Testing under Normality
OLS anatomy
y
i
x
i
b; y = Xb; e = y y
p = X[X
X]
1
X
projection matrix; use: y = py = xb (lies in space spanned by columnes

of X); P is symmetric and idempotent (P*P=P)
M=In-p residual maker; use: e=M*y; M is symmetric and idempotent
y=Py+My [Xe=0 from FOC] (e orthogonal to columns of X; space is spanned by columns of
X) y
e = 0
e = M e
e =

e
2
i
=
M
with a constant

e
i
= 0 [FOC]
y = x
b (for x and y vectors of sample means)

y =
1
n
y
i
=
1
n
y
i
= y
5 Hypothesis Testing under Normality
Economic theory provides hypotheses about parameter.
Eg. asset pricing example: R
e
t
= +R
em
t
+
t
Hypothesis implied by APT: =0
If theory os rightTestable implications.
In order to test we need the distribution of b, so that hypotheses cant be tested without distributional
assumptions about .
In addition to 1.1-1.4 we assume that
1
, ...,
n
|X are normally distributed:
Distributional assumption (Assumption 1.5): Normality assumption about the conditional distribution
of |X MV N(E(|x) = 0, V ar(|x) =
2
I
n
) (Can be dispersed of later).
i
|x N(0,
2
)
Useful tools/results:
Fact 1: If x
i
N(0, 1) with i=1,...,m and x
i
independent y =

m
i=1
x
2
i

2
(m)
If x N(0,1) and y
2
(m) and x,y independent: t =
x
y/m
t(m)
Fact 2 omitted
Fact 3: If x and y are independent, so are f(x) and g(y)
Fact 4: Let x MV N(, ) and nonsingular: (x )
1
(x )
2
(random variable)
Fact 5: W
2
(a), g
2
(b) and W,g independent:
(W/a)
(g/b)
F(a, b)
16
Fact 6: X MV N(, ) and y=c+Ax mit c,A non-random vector/matrix y MV N(c +
A, AA
)
We use b
. .. .
s.e.
= (X
X)
1
X
:
Assuming |X MV N(0,
2
I
n
) from 1.5 and using Fact 6:
b |X MV N((X
X)
1
X
E(|X), (X
X)
1
X
2
I
n
X(X
X)
1
) (64)
b |X MV N(0,
2
(X
X)
1
) (65)
b |X MV N(E(b |X), V ar(b |X)) (66)
Note that V ar(b
k
|X) =
2
((X
X)
1
)
By the way: e|X MV N(0,
2
M) show using e=M* and M is symmetric and idempotent
Testing hypothesis about individual parameters (t-Test)
Null hypothesis: H
0
:
k
=
k
(a hypothesized value, a real number).
k
is often assumed to be 0.
It is suggested by theory.
Alternative hypothesis H
A
:
k
=
k
We control the type 1 error (=rejection when right) by xing a signicance level. We then aim for a
high power (=rejection when false) (low type 2 error (=no rejection when false)).
Construction of the test statistic:
By Fact 6: b
k

k
N(0,
2
((X
X)
1
)
kk
) [((X
X)
1
)
kk
is the k-th row k-th column element of
(X
X)
1
]
OLS-estimator conditionally normal distributed if |X is multivariate normal
b
k
|X N(
k
,
2
((X
X)
1
)
kk
)
Under H
0
:
k
=
k
:
b
k
|X N(
k
,
2
((X
X)
1
)
kk
)
If H
0
is true E(b
k
) =
k
:
z
k
=
b
k
2
((X
X)
1
)
kk
N(0, 1) (67)
Standard normally distributed under the null hypothesis
We dont say anything about the distribution under the alternative hypothesis
Distribution under H
0
does not depend on X
Value of the test statistic depends on X and on
2
whereas
2
is unknown
17
Call s
2
an unbiased estimate of
2
:
t
k
=
b
k
s
2
((X
X)
1
)
kk
t(n k)or
a
....
N(0, 1)for(n k) > 30 (68)
Nuisance parameter
2
can be estimated
2
= E(
2
i
|X) = V ar(
i
|X) = E(
2
i
) = V ar(
i
) (69)
We dont know
i
, but we use the estimator e
i
= y
i
x
i
b

2
= V ar(e
i
) =
1
n
(e
i
1
n
e
i
)
2
=
1
n
e
2
i
=
1
n
e
e (70)
sigma
2
is a biased estimator:
E(
2
|x) =
n K
n

2
(71)
Proof: in Hayashi p.30-31. Use:

e
2
i
= e
e =
M
trace(A)=
a
ii
(sum of the diagonal of the matrix)
Note that for n
2
is asymptotically unbiased as
lim
n
n K
n
= 1 (72)
Having an estimator where the bias and the variance vanishes when n , is what we call a
consistent estimator.
An unbiased estimator of
2
(see Hayashi p.30-31)
For s
2
=
1
nK
e
2
i
=
1
nK
e
e we get an unbiased estimator:

E(s
2
|X) =
1
n K
E(e
e|X) =
2
(73)
E(E(s
2
|X)) = E(s
2
) =
2
(74)
Using this provides an unbiased estimator of V ar(b|X) =
2
(X
X)
1
:
V ar(b|X) = s
2
(X
X)
1
(75)
t-statistic under H
0
:
t
k
=
b
k
(

V ar(b|X))
kk
=
b
k
k
SE(b
k
)
=
b
k
(

V ar(b
k
|X))
t(n K) (76)
18
Hayashi p.36-37 shows that with the unbiased estimator of
2
we have to replace the distribution of
the t-stat by the t-distribution. Sketch of the proof:
t
k
=
b
k
2
[(X
X)
1
]
kk
2
s
2
=
z
k
e e
(n k)/
2
=
z
k
q/n k
(77)
We need to shoq that q/n k
2
(n k) and q,z
k
independent.
Decision rule for the t-test
1. State your null hypothesis: H
0
:
k
=
k
is often
k
= 0; H
A
:
k
=
k
2. Given
k
, OLS-estimate b
k
and s
2
, we compute t
k
=
b
k
k
SE(b
k
)
3. Fix signicance level of two-sided test
4. Fix non-rejection and rejection regions
5. decision
Remark:
2
[(X
X)
1
]
kk
: standard deviation b
k
|X
s
2
[(X
X)
1
]
kk
: standard error b
k
|X
We want a test that keeps its size and at the same time has a high power.
Find critical value such that: Prob[ t
/2
(n k)
. .. .
real n
; lower crit value

< t(n k)
. .. .
r.v.;t-stat
< t
/2
(n k)
. .. .
real n
; upper crit value

] = 1
Example: n-k=30, =0.05
t
/2
(n k) = t
.975
(30) = 2.042
The probability that the t-stat takes on a value of 2.042 or smaller is equal to 97.5%.
t
/2
(n k) = t
.025
(30) = 2.042
The probability that the t-stat takes on a value of -2.042 or smaller is equal to 2.5%
Conventional s: 0.01; 0.05; 0.1
Use stars
Use p-values: compute the quantile taking your t-stat as the x (Interpretation!)
Use t-stat
Report standard error
Report condence intervals (Interpretation!)
19
F-test/Wald test
Model more complex hypotheses (linear hypotheses) = testing joint hypotheses.
Example 1: G/H-Modell
P
i
= +cQ
i
+z
0
Q
i
+z
1
v
i
Q
i
+
i
Hypothesis: no infromational content of traders
H
0
: z
0
= 0 and z
1
= 0; H
A
: z
0
= 0 or z
1
= 0
Example 2: Mincer eq.
ln(wage
i
) =
1
+
2
S
i
+
3
tenure
i
+
4
exper
i
+
i
Hypothesis: 1y more schooling shows the same eect as 1y more tenure AND experience has no eect
H
0
:
2
=
3
and
4
= 0; H
A
:
2
=
3
or
4
= 0
For the F-test write the null hypothesis as a system of linear equations:
R
....
#rxK

....
Kx1
= r
....
#rx1
R
11
... R
1K
... ... ...
R
#1
... R
#K
1
...
r
1
...
r
#
(78)
mit:
#=number of restrictions
R=matrix of real N
s
r=vector of real N
s
Example 1:
0 0 1 0
0 0 0 1
y
c
z
0
z
1
0
0
(79)
Example 2:
0 1 1 0
0 0 0 1
0
0
(80)
Use Fact 6: Y=A*Z (Y MV N(A, AA
))
In our context:
Replacing the = (
1
, ...,
k
) by estimator b = (b
1
, ..., b
K
)
.
R b = r (b is conditionally normally distributed under 1.1-1.5)
For the Wald-test: you only need the unconstrained parameter estimates.
For the other two, you need both the constrained and the unconstrained parameter estimate.
20
The Wald-test keeps its size and gives the highest power.
The restrictions are xed by the H
0
Denition of the F-test statistic
E( r|x) = R E(b|x) = R (under 1.1-1.3)
V ar( r|x) = R V ar(b|x)R
= R
2
(X
X)
1
R
(under 1.1-1.4)
r|x = R b|x MV N(R, R
2
(X
X)
1
R)
Using Fact 4 to construct the test for:
H
0
: R = r
H
A
: R = r
Under the hypothesis:
E(Rb|x)=r (=hypothesized value)
Var(Rb|x)=R
2
(X
X)
1
R (unaected by hypothesis)
Rb|x=r MV N(r, R
2
(X
X)
1
R
) (this is where the hypothesis goes in)

Problem: Rb is a random vector and the distribution depends on x.
Distribution of a random variabl under H
0
whose distribution does not depend on X.
With Fact 4: (Rb E(Rb|x))
(V ar(Rb|x))
1
(Rb E(Rb|x))
This leads to the Wald-statistic:
(Rb r)
[R
2
(X
X)
1
R]
1
(Rb r)
2
(#r) (81)
The smallest number for this is zero (which is the case when the restrictions are perfectly met and
the parenthesis are zero). Thus we always have a one-sided test here.
Replace
2
by its unbiased estimate s
2
=
1
nk
e
2
i
=
1
nk
e
e and dividing by #r (to nd the distri-

bution using Fact 5):
F = (Rb r)
[Rs
2
(X
X)
1
R
](Rb r)/#r F(#r, n k) (82)

F =
(Rb r)
[R(X
X)
1
R
]
1
(Rb r)/#r
(e
e)/(n K)
(83)
= (Rb r)
[R

V ar(b|X)R
]
1
(Rb r)/#r F(#r, n K) (84)
F-Test is one-sided.
For applied work:
N: s
2
(and

2
) get close to
2
, so we can use the approximation.
Decision rule for the Wald-test
1. Specify H
0
in the form R = r and H
A
: R = r
2. Calculate F-statistic
21
Applied Econometrics 6 Condence intervals
3. Look up entry in the table of the F-distribution for #r and n-K at given signicance level
4. Null is not rejected on the signicance level for F less than F
(#r, n K)
Alternative representation of the Wald/F-statistic
Minimization of the unrestricted sum of squared residuals min
(y
i
x
i
b)
2
SSR
u
Minimization of the restricted sum of squared residuals (constraints imposed) min
(y
i
x
b)
2
SSR
r
SSR
u
SSR
r
(always!)
F ratio : F =
(SSR
r
SSR
u
)/#r
SSR
u
/(n k)
(85)
This F-ratio is equivalent to the Wald-ratio. Problem: For the F-ratio we need two estimates b.
This is also known as the likelihood-ratio principle in contrast to the Wald-principles.
(Third principle: only using unrestricted estimates Langrange Multiplier principle)
6 Condence intervals
Duality of t-test and condence interval
Probability for non-rejection for t-test:
P(t
/2
(n K) t
k
t
/2
(n K)
. .. .
do not reject if this even occurs
) = 1
. .. .
if H0 is true
(86)
(t
k
is a random variable)
In 1-% of our samples t
k
lies inside the critical values.
Rewrite (86):
P(b
k
SE(b
k
)t
/2
(n K)
k
b
k
+SE(b
k
)t
/2
(n K)) = 1 (87)
(b
k
, SE(b
k
) is a random variable)
The condence interval
Condence interval for
k
(=if H
0
is true):
P(b
k
SE(b
k
)t
/2
(n K)
k
b
k
+SE(b
k
)t
/2
(n K)) = 1 (88)
The values inside the bounds of the condence interval are the value for which you cannot reject the
null-hypothesis.
In 1-% of our samples
k
would lie inside our bounds (the bounds would enclose
k
).
22
Applied Econometrics 6 Condence intervals
The condence bounds are random variables!
b
k
SE(b
k
)t
/2
(n K): lower bound of the 1- condence interval.
b
k
+SE(b
k
)t
/2
(n K): upper bound of the 1- condence interval.
Wrong interpretation: True parameter
k
lies with probability 1- within the bounds of the condence
interval. Problem: Condence bounds are not xed; they are random!
H
0
is rejected at signicance level if the hypothesized value does not lie within the condence
bounds of the 1- interval.
<Grak>
Count the number of times that
k
is inside the condence interval. After n sample, this gives us the
probability for the condence interval 1-. (
k
not inside the condence interval is equivalent to the
event that we reject the H
0
using the t-statistic).
This works the same way if
k
=
k
.
On the correct interpretation of condence intervals: given:
b
s
2
and se(b
k
)
t
/2
(n k)
(b
k
k
)/se(b
k
) (reject if outside the bounds of the critical values)
Rephrasing: for which values
k
do we (not) reject?
b
k
se(b
k
) t
/2
(n k) <
k
< b
k
+se(b
k
) t
/2
(n k) = not reject
b
k
se(b
k
) t
/2
(n k) >
k
> b
k
+se(b
k
) t
/2
(n k) = reject
We want small condence intervals, because the allow us to reject hypothesis (narrow range=high
power). We achieve this by increasing n (decreasing the standard error) or by increasing .
6.1 Simulation
Frequentists work with the idea that we can conduct experiments over and over again.
To avoid type 2 error we need smaller standard errors (narrower distribution) and thus we have to
increase the sample size.
We are running a simulation with multiple draws. To proof unbiasedness, we compute the mean for
all the draws and compare this to the true parameters. To get closer, we have to increase the number
of draws.
Increasing n leads to a smaller standard error.
Increasing draws leads to more precise estimates, but the standard error doesnt decrease.
23
Applied Econometrics 7 Goodness-of-t measures
You can use a condence interval to determine the power of a test. Determine in how many % of
the cases a wrong parameter estimate lies inside the condence interval (for the true parameter this
should be 1-). This is the power of the test. To increase this, increase the sample size.
The closer the hypothesized value and the tue value are, the less the power of the test.
7 Goodness-of-t measures
Is needed with conicting equations (theories)
Uncentered R
2
: R
2
uc
Variability of y
i
:

y
2
i
= y
y
Decomposition of yy:
=(Xb+e)
(Xb+e) = ( y+e)
( y+e) = y
y+2e
y+e
e = y
y
....
explainedvariation
+ e
e
....
unexplainedvariation(SSR)
(e
y = ( y
e)
= (e
y)
= y
e = (Xb)
e = b
e = 0 (as Xe=0))
<Grak>
R
2
uc
=
y
y
y
y
100% (% of explained variation)
1
e
e
y
y
100% = 1
e
2
i
y
2
i
100% (% of unexplained variation)
A good model explains much and therefore the residual variation is very small compared to the
explained variation.
Centered R
2
: R
2
c
Use centered R
2
if there is a constant in the model (x
i1
= 1)
Variance of y
i
:
1
n
(y
i
y)
2
Decomposition of
1
n
(y
i
y)
2
=
( y
i
y)
2
. .. .
variance of predicted values
+
e
2
i
. .. .
SSR
Proof:
1
n
(y
i
y)
2
=
1
n
( y
i
+e
i
y)
2
=

( y y)
2
+
e
2
i
+
2( y y)e
i
=
( y y)
2
+
e
2
i
+
2b Xe
i
....
0
2ye
i
=

( y y)
2
+
e
2
i
2y
e
i
. .. .
0
R
2
c
=
1
n
( y
i
y)
2
1
n
(y
i
y)
2
= 1
1
n
e
2
i
1
n
(y
i
y)
2
Both uncentered and centered R
2
lie in the interval [0,1] but describe dierent models. They are not
comparable.
The centered R
2
of a model with only a constant whereas the uncentered R
2
is not 0 as long as the
24
Applied Econometrics 8 Introduction to large sample theory
constant is =0 (but we explain only the level of y and not the variation). Without a constant the
cnetered R
2
canT be compared, because

e
i
= 0 only holds with a constant.
R
2
: coecients of determination; =[corr(y, y)]
2
Explanatory power of regression beyond constant
Test that
2
= ... =
k
= 0 is the same as the test that R
2
=0 (R
2
is a random varia-
ble/estimate)
Overall F-test:
SSR
R
: y
i
=
0
+
i
=

(y
i
y)
2
SSR
UR
: y
i
=
0
+
1
x
1
+... +
k
x
k
=

e
2
i
(SSR
R
SSR
UR
)/k
SSR
UR
/(nk1)
Trick: increase R
2
, add more regressors (k).
Modell comparison:
Conduct an F-test between the parsimonious and the extended model
Use model selection criteria, which give a penalty for parameterization
Selection criterion
R
2
adj
= 1
SSR/(n k)
SST/(n 1)
= 1
n 1
n k
. .. .
>1fork>1
SSR
SST
(89)
(can become negative)
Alternative model selection criteria
Implement Occams razor (the simpler explanation (parsimonious model) tends to be the right one)
Akaike criterion (from information theory): log[SSR/n]+2k/n (minimize, can be neg.)
Schwarz criterion (from Bayesian theory): log[SSR/n]+k*log(n)/n (minimize, can be neg.)
Schwarz tends to penalize more heavily for larger n (=favors parsimonious models).
not one is favored (all three are generally accepted), but one has to argue why one is used.
8 Introduction to large sample theory
Basic concepts of large sample theory
Using large sample theory we can dispense with basic assumptions from nite sample theory:
1.2 E(
i
|X) = 0: strict exogeneity
1.4 V ar(|X) =
2
I
n
: homoscedasticity
1.5 |X N(0,
2
I
n
): normality of the error term
25
Approximate/assymptotic distribution of b, and t- and the F-stat can be obtained.
Contents:
A. Stochastic convergence
B. Law of large numbers
C. Central limit theorem
D. Useful lemmas
A-D: pillars and building blocks of modern applied econometrics
CAN(consistent asymptotiv normal) - property of OLS (and other estimates)
8.1 A. Modes of stochastic convergence
First: non-stochastic convergence
{c
n
}=(c
1
, c
2
, ...)=sequence of real numbers
Q: Can you nd n() such that |c
n
-c|< for all nn() for all ?
A: Yes? Then {c
n
} converges to c.
Other notations: lim
n
c
n
= c; c
n
c; c=lim c
n
Examples:
c
n
= 1
1
n
so that c
n
1
c
n
= exp(n) so that c
n
0
c
n
= (1 +
a
n
)
n
so that c
n
exp(a)
Second: Stochastic convergence
{z
n
}=(Z
1
, Z
2
, ...) with Z
1
, Z
2
,... being random variables.
Illustration:
All Z
n
are iid e.g. N(0,1).
They can be described using their distribution function.
Example 1:

a.s.:
{u
n
} sequence of random variables (u
i
is iid with E(u
i
) = < , V ar(u
i
) < )
z
n
=
1
n
u
i
mit {z
n
}={u
1
,
1
2
(u
1
+u
2
)
. .. .
z
2
, ...,
1
100
(u
1
+... +u
100
)
. .. .
z
100
}
<Grak>
For every possible realization (a sequence of real numbers) of {z
n
}, the limit is
Mathematical: lim
n
z
n
= almost sure convergence; also written as: z
n
a.s.
a.s.: strongest mode of stochastic convergence
More formally:
Q: Can you nd an n(, w) such that |z
n
| < for all n>n(,w) for all <0 and for all w.
A: Yes? Then z
n
a.s.
26
Note: sample mean sequence {z
n
} is a special case of

a.s.. Generally: P(lim
n
Z
n
= ) = 1;
z
n
a.s. ; z
n
converges almost surely to .
Example 2:

p
:
<Grak> n=5: 1/3 realizations within limits
n=10: 2/3 realizations within limits
n=20: 3/3 realizations within limits
When # of realizations , then relative frequency of # of realizations in limits over # of
replications p.
Formalization:
Q: P[|z
n
| < ] = 1
A: Yes? Then z
n
; plimz
n
= ; z
n
convergence in probability to
If you dene P[|z
n
| < ] as a series of probabilities p
n
, you can express this as: p
n
1
A dierent illustration (reaching towards consistency):
<Grak>
Example 3:

m.s. (is similar to

p
):
lim
n
E[(Z
n
)
2
] = 0
m.s. implies

p
(m.s. is the stronger concept)
Implies that lim
n
E(Z
n
) = 0 and lim
n
V ar(Z
n
) = 0
Add remarks on the modes of convergence
(limit) can also be a random variable (not so important for us), so that Z
n
p
Z, Z being a
r.v. (also for a.s., m.s.)
Extends to vector/matrix sequence of r.v.: Z
n1
p

1
, Z
n2
p

2
, ...
Notation: Z
n
p
(element-wise convergence). This is used later to write: b
n
p

Convergence in distribution

d
Illustration:
<Grak>
f
z
: limiting distribution of Z
n
Denition: Denote F
n
cdf: P(Z
n
a) = F
n
(a), which means: {Z
n
} converges in distribution to a
r.v. Z, if cdf F
n
of Z
n
converges to cdf of Z at each point (of continuity) of F
z
<Grak>
Z
n
d
Z; F
z
: limiting distribution if Z
n
If distribution of Z known, we write e.g. Z
n
d
N(0, 1)
27
Extension to a sequence of random vectors:
Z
n
....
(kx1)
d
Z
....
(kx1)
; Z
n
=
Z
n1
...
Z
nk
andZ =
Z
1
...
Z
k
(90)
F
n
of Z
n
converges at each point to cdf of Z.
cdf of Z = F
Z
(a) = P[Z
1
a
1
, ..., Z
k
a
k
]
cdf of Z
n
= F
Zn
(a) = P[Z
n1
a
1
, ..., Z
nk
a
k
]
8.2 B. Law of Largen Numbers (LLN)
Weak law of large numbers (Khinchin)
{Z
i
} sequence of iid random variables
E(Z
i
) = < compute
1
n
Z
i
= Z
n
. Then:
lim
n
P[|Z
n
| > ] = 0( > 0) or plimZ
n
; Z
n
p

Extensions: The WLLN holds for
1. Multivariate Extension (seq. of random vectors):
{Z
i
}
z
11
...
z
1k
. .. .
i=1
,
z
21
...
z
2k
. .. .
i=2
, ... (91)
E(Z
i
) = < , being a vector of expectations of the rows [E(z
1
), ..., E(z
k
)]
Compute sample means for each element of seq. over the rows:
1
n
n
i=1
Z
i
= [
1
n
Z
1i
, ...,
1
n
Z
ki
]
(92)
Element-wise convergence:
1
n
Z
i
p

2. Relax independence:
Relax iid, allow for dependence in {Z
i
} through cov(Z
i
, Z
ij
) = 0, j = 0 (especially important in
time series; draws still from the same distribution)
ergodic theorem.
3. Functions of Z
i
h(z
i
)
. .. .
measurable function
e.g.{ ln(Z
i
)
. .. .
new sequence
} (93)
28
If {Z
i
} allows application of LLN and E(h(Z
i
)) < , we have:
1
n
h(z
i
)

p
E(h(z
i
)) (LLN also can
be used)
Application
{Z
i
}iid, E(z
i
) <
h(Z
i
) = (z
i
)
2
E(h(z
i
)) = V ar(z
i
) =
2
1
n
(z
i
)
2

p
V ar(z
i
)
4. Vector-valued function f(z
i
)
Vector-valued functions: one or many arguments and one or many returns.
{Z
i
}{f(Z
i
)} (mit f(Z
i
) = f(z
i1
, ..., z
ik
))
If {Z
i
} allow application of LLN and E(f(Z
i
))<:
1
n
f(Z
i
)

p
E(f(Z
i
)) (94)
Example:
{Z
i
}={Z
i1
, Z
i2
} vector sequence
f{Z
i
}=f(z
i1
, z
i2
)=(z
i1
z
i2
, z
i1
, z
i2
)
Apply the above result (WLLN)
8.3 C. Central Limit Theorems (CLTs)
Univariate CLT
If {Z
i
} iid, E(Z
i
) = < , Var(Z
i
) =
2
< and WLLN works:
Then:

n(Z
n
)

d
N(0,
2
) (without

n the variance and mean would go to zero as n goes into
innity; also used with: square-root n consistent estimators)
CLT implies LLN.
Other kind of writing this:
n(Z
n
)
a
N(0,
2
)
If y
n
=
n(Z
n
) then z
n
=
y
n
n
+ and thus z
n

a
N(,

2
n
)
Multivariate CLT
{Z
i
} is iid and can be drawn from ANY distribution, but the n for convergence may be dierent.
=
z
11
z
21
... z
n1
... ... ... ...
z
1k
z
2k
... z
nk
=

Z
1
, Z
2
, ..., Z
n
(95)
29
E(Z
1
) = E(Z
2
) = ... = = [E(Z
1
), ..., E(Z
k
)]
and
V ar(Z
i
) =
V ar(z
i1
)... ...
... .... ...
Cov(Z
i1
, z
in
) ... V ar(z
ik
)
= < (96)
and WLLN works.
Then:

n(z
n
)

d
MV N(0, )
Other writings:
n(z
n
)
a
MV N(0, ) or:
Z
n

a
MV N(, /n)
iid can be relaxed, but not as far with the WLLN
8.4 D. Useful lemmas of large sample theory
Lemmas 1&2 continuous mapping theorem (probabilitylimit
p
and

apasses through continuous
function)
Lemma 1:
Z
n
can be a scalar, vector or matrix sequence.
Z
n
p
with a as a continuous function which does not depend on n then:
a(Z
n
)

p
a() or plim
n
a(z
n
) = a(plim
n
(z
n
))
Examples:
x
n
p
ln(x
n
)

p
ln()
x
n
p
and y
n
p
x
n
+y
n
p
+
Lemma 2:
z
n
d
z then:
a(z
n
)

d
a(z) (for z: dierent degrees of diculty to nd it)
Examples:
z
n
d
z N(0, 1) z
2
n
d

2
(1) (as [N(0, 1)]
2
=
2
(1))
Lemma 3:
x
n
....
(mx1)
d
x and y
n
p
(y
n
treated as non-stochastic, a vector of real numbers), then:
x
n
+y
n
d
x +
Examples:
x
n
d
N(0, 1), y
n
p
x
n
+y
n
d
N(, 1)
30
Lemmas 4& 5 (& 6) Slotzkys theorem
Lemma 4:
x
n
d
x and y
n
p
0 (we say that y
n
is 0P), then:
x
n
y
n
p
0
Lemma 5:
x
n
....
(mx1)
d
x and A
n
....
(kxm)
p
A, then:
A
n
x
n
d
A x (A treated as non-random, x as random vector)
Examples:
x
n
d
MV N(0, ), A
n
p
A A
n
x
n
d
MV N(0, Aa
)
Lemma 6:
x
n
d
x and A
n
p
A then:
x
n
A
1
n
x
n
d
x
A
1
x (x is random and A is non-random)
Examples:
x
n
d
x MV N(0, A) and A
n
p
A x
n
A
1
x
n
d
x
A
1
x
2
(rows(x))
8.5 Large sample properties for the OLS estimator
(2.1) maintained: Linearity: y
i
= x
i
+
i
, i = 1, 2, ..., n
(2.2) Assumptions regarding dependence of {y
i
, x
i
} (get rid of iid)
(2.3) Replacing strict exogeneity: Orthogonality/predetermined regressors: E(x
ik

i
) = 0
If x
ik
= 1 E(
i
) = 0 Cov(x
ik
,
i
) = 0i, k. If violated: endogeneity
(2.4) Rank condition: E(x
i
x
i
)
. .. .
KxK
=
xx
is non-singular.
Remarks on the properties
{y
i
, x
i
} allows application of WLLN (so DGP is not necessarily iid)
We assume {y
i
, x
i
} are (jointly) stationary & ergodic (identical but not independent)
{x
i

i
}=
x
i1

i
...
x
ik

i
= {g
i
} (97)
allows applicatio of central limit theorem
{g
i
} a martingal dierence sequence (m.d.s.): E(g
i
|g
i1
, ...) = 0 (g
i
is uncorrelated with its past =
zero autocorrelation)
Assuming iid {y
i
, x
i
,
i
} covers this (iid as a special case), but its not necessary (and not useful in
time series)
31
Overview
We get for b=(X
X)
1
Xy:
b
n
= [
1
n
x
i
x
i
]
1
1
n
x
i
y
i
(98)
WLLN and continuous mapping & Slotzky theorem:
b
n
p
(consistency = asymptotically unbiased)
n(b
n
)

d
MV N(0, AV ar(b)) or b
a
MV N(,
AV ar(b)
n
) (approximate distribution)
b
n
is consistent, asymptotically normal (CAN) (We loose unbiasedness & eciency here)
We show that b
n
= (X
X)
1
X
y is consistent
b
n
= [
1
n
x
i
x
i
]
1 1
n
x
i
y
i
b
n
. .. .
sampling error
= [
1
n
x
i
x
i
]
1 1
n
x
i
i
We show: b
n
p

When sequence {y
i
, x
i
} allows application of WLLN:
1
n
x
i
x
p
E(x
i
x
i
)
=
E(x
2
i1
) ... E(x
i1
, x
ik
)
... ... ...
E(x
i1
, x
ik
) ... E(x
2
ik
)
(99)
So by Lemma 1:
[
1
n
x
i
x
i
]
1

p
[E(x
i
x
i
)]
1
1
n
x
i
p
E(x
i
i
) = 0 (we assume predeterminant regressors)
Lemma 1 implies:
b
n
= [
1
n
x
i
x
i
]
1 1
n
x
i
p
E(x
i
x
i
)
1
E(x
i
i
)

p
E(x
i
x
i
)
1
0 = 0
b
n
= (X
X)
1
X
y is consistent.
If the assumption E(x
i
i
) = 0 does not hold, we have inconsistent estimators (=death of estima-
tors)
We show that b
n
= (X
X)
1
X
y is asymptotically normal
Sequence {g
i
}={x
i
i
} allows applying CLT for
1
n
x
i
i
= g
n(g E(g
i
))

d
MV N(0, E(g
i
g
i
))
n(b
n
) = [
1
n
x
i
x
i
]
1
. .. .
A
n
ng
....
x
n
(starting point: sampling error expression *
n)
Applying Lemma 5:
A
n
= [
1
n
x
i
x
i
]
1

p
A =
1
xx
x
n
=
ng
d
x MV N(0, E(g
i
g
i
))
32
n(b
n
)

d
MV N(0,
1
xx
E(g
i
g
i
)
1
xx
. .. .
AV ar(b)
)
Practical use: b
n

a
MV N(,
AV ar(b)
n
)
b
n
is CAN
In detail: CLT on {x
i

i
}
n
1
n
x
i
i
=
ng
CLT:

n(g E(g
i
))

d
MV N(0, E[(g
i
E(g
i
))(g
i
E(g
i
))
]
. .. .
V ar(g
i
)
)
With predetermined x
i
: E(x
i
i
) = 0 E(g
i
) = 0 V ar(g
i
) = E(g
i
g
i
) = E(
2
i
x
i
x
i
)
How to estimate Avar(b)
Avar(b)=
1
xx
E(g
i
g
i
)
1
xx
1.
xx
= E(x
i
x
i
) S
xx
=
1
n
x
i
x
i
p
E(x
i
x
i
) =
xx
(also for inverse using Lemma 1)
2. E(g
i
g
i
)

S =?

p
E(g
i
g
i
) = E(
2
i
x
i
x
i
)
Consistent estimate for Avar(b):

Avar(b) = S
1
xx

S S
1
xx
p

1
xx
E(g
i
g
i
)
1
xx
So how to estimate

S = E(
2
i
x
i
x
i
):
Recall: Z
i
(vector) allows application of LLN
1
n
f(Z
i
)

p
E(f(Z
i
))
Our Z
i
= [
i
, x
i
]
1
n
2
i
x
i
x
p
E(
2
i
x
i
x
i
)
Problem:
i
is unknownwe use e
i
= y
i
x
i
b
s =
1
n
e
2
i
x
i
x
p
E(
2
i
x
i
x
i
) = E(g
i
g
i
)
Hayashi: p.123 shows that s is a consistent estimator
Result:
Avar(b) = [
1
n
x
i
x
i
]
1 1
n
e
2
i
x
i
x
i
[
1
n
x
i
x
i
]
1

p
E(x
i
x
i
)
1
E(g
i
g
i
)E(x
i
x
i
)
1
= Avar(b)
is estimated without the assumption 1.4 (homoscedasticity)
Developing a test statistic under the assumption of conditional homoscedasticity
Assumption: E(
2
i
|x
i
) =
2
(<0 and <) (We dont have to condition on all xs, but only x
i
)
E[E(
2
i
|x
i
)] = 2 = E(
2
i
) = V ar(
i
)
Avar(b) = [
1
n
x
i
x
i
]
1

2 1
n
x
i
x
i
[
1
n
x
i
x
i
]
1
= [
1
n
x
i
x
i
]
1

2
with

S =
1
n
e
2
i
1
n
x
i
x
i
and
2
=
1
n
e
2
i
;

S is still a consistent estimate if E(
2
i
|x
i
) =
2
This is the asymptotic variance covariance matrix

Avar(b). The variance of b

V ar(b) =

Avar(b)
n
=

2
[
x
i
x
i
]
1
=
2
(X
X)
1
is the same as before.
When using the above assumption, we get to the same results as before asymptotic theory, but we
also have the general result for

Avar(b) without the above assumption:
1. If you assume conditional homoscedasticity (cf. 1.4)
Avar(b) =
2
[
1
n
x
i
x
i
]
1
33
Applied Econometrics 9 Time Series Basics (Stationarity and Ergodicity)
We estimate Var(b):
Avar(b)
n
=

2
[
1
n
x
i
x
i
]
1
n
=
2
(X
X)
1

2
=
1
n
e
2
i
(consistent but biased)
We can use instead of
2
: s
2
=
1
nk
e
2
i
p

2
(consistent and unbiased)
2. If you dont assume homoscedasticity:
Avar(b) = [
1
n
x
i
x
i
]
1
[
1
n
e
2
i
x
i
x
i
][
1
n
x
i
x
i
]
1
In EViews: Heteroscedasticity-consistent variance-covariance matrix is used to compute t-stats, s.e.,
f-tests which are robust (=we dont need the homoscedasticity assumption)
3. Assuming wrongly conditional homoscedasticity: t-stats, s.e., f-tests are wrong (more precise: in-
consistent)
White standard errors
Adjusting the test statistics to make them robust against violations of conditional homoscedasticity.
t-ratio
t
k
=
b
k
[
[
1
n
x
i
x
i
]
1
1
n
e
2
i
x
i
x
i
[
1
n
x
i
x
i
]
1
n
]
kk
a
N(0, 1) (100)
Holds under H
0
:
k
=
k
F-ratio
W = (Rb r)
[R

Avar(b)
n
R
]
1
(Rb r)
2
(#r)
Holds under H
0
: R r = 0; allows for linear restrictions on
9 Time Series Basics (Stationarity and Ergodicity)
Dependence in the data:
Certain degree of dependence in the data in time series analyis; only one realization of the data
generating process is given.
CLT and WLLN rely on iid data, but dependence in real world data.
Examples:
Ination rate, stock market returns
Stochastic process: sequence of random variables/vectors indexed by time {z
1
, z
2
, ...} or {z
i
} with
i=1,2,...
A realization/sample path: one possible outcome of the process (p.97)
Distinguish between:
Realization (sample path) = sequence of real numbers
34
Process {Z
i
} = sequence of random variables (not a real number)
Hayashi: p.99f
Annual ination rate = sequence of real numbers
<Grak>
vs. Three realizations (r) of stochastic process of US ination rates = sequence of random variables
<Grak>
Ensemble mean:
1
R
infl
1995,r
p
E[infl
1995
] (by LLN, here: R=3)
What can we infer from one realiztation about the process?
We can learn something when the DGP has certain properties/restrictions. Condition: stationarity of
the process.
Stationarity
We draw from the same distribution over and over again, but maybe the draws are dependent.
Additionally, the dependence doesnt change over time.
Strict stationarity:
The joint distribution of z
i
, z
i1
, ..., z
ir
depends only on the relative position i
1
i, ..., i
1
i (distance)
but not on i itself.
In other words: the joint distribution of (z
i
, z
ir
) is the same as the joint distribution of (z
j
, z
jr
) if
i i
r
= j j
r
.
Weak stationarity:
E(z
i
) = (nite/exists, but does not depend on i)
Cov(z
i
, z
ij
) depends on j (distance), but not on i (absolute position)implies V ar(z
i
) =
2
(does
not depend on i)
(We only need the rst moments and not the whole distribution)
If all moments exist: strict stationarity implies weak stationarity
Illustration:
<Grak>
It seems that E(z
i
) does depend on i
<Grak>
IT seems that E(z
i
) = mu does not depend on i; the realization oszilliates around the expected value
(some process drags them back, although there are some dependencies over or below the mean).
V ar(z
i
) =
2
does not depend on i
This second graph seems to fulll the rst two requirements.
We now turn to the assumption that the Cov(z
i
, z
ij
) only depends on the distance.
j=0V ar(z
i
) <Grak>
The ensemble means are still constant, but V ar(z
i
) increases with i.
Example: Random walk: Z
i
= z
i1
+
i
[
i
N(0,
2
) iid]
Stationarity is testable
35
Ergodicity
If I space two random variables apart in the process, he dependence has to decrease with the distance
(restricted memory of the process; in the limit the past is forgotten)
So a stationary process is also called ergodic if: independence
lim
n
E[f(z
i
, z
i+1
, ..., z
i+k
)g(z
i+n
, z
i+n+1
, ..., z
i+n+l
)] = E[f(z
i
, z
i+1
, ..., z
i+k
)]E[g(z
i+n
, z
i+n+1
, ..., z
i+n+l
)]
(101)
Ergodic processes are stationary processes which allow the application of a LLN: Ergodic theorem:
Sequence {z
i
} is stationary and ergodic with E(z
i
) = , then
z
n
=
1
n
z
i
p
(102)
So for an ergodic process we are back to the fact that one realization is enough to infer from.
Martingale dierence sequence
Stationarity and Ergodicity are not enough for applying the CLT. To derive the CAN property of the
OLS-estimator we assume:
{g
i
}={x
i
i
}
{g
i
} is a stationary and ergodic martingale dierence sequence (m.d.s.):
E(g
i
| g
i1
, g
i2
, ...
. .. .
history of the process
) = 0E(g
i
) = 0 by LTE
Derive asymptotic distribution of b
n
(X
X)
1
X
y. We assume {x
i
,
i
}
. .. .
g
i
is a stationary & ergodic m.d.s.:
g
i
=
i
1
i
x
i2
...
i
x
ik
(103)
Why? Because there exists a CLT for s&e m.d.s
E(g
i
|g
i1
, ...) = 0 with x
i1
= 1 E(
i
|g
i1
, ...) = 0 by LTE: E(
i
) = 0 and also by LIE:
E(
i

ij
) = 0i = 1, 2, ... so that Cov(
i
,
ij
) = 0.
Summarize
We assume for large sample OLS:
{y
i
, x
i
} stationary& ergodic
LLN (ergodic theorem):
1
n
x
i
x
p
E(x
i
x
i
) and
1
n
x
i
y
i
p
E(x
i
y
i
)
Consistency of b=(XX)Xy
g
i
=
i
x
i
is a stationary&ergodic m.d.s
E(g
i
|g
i1
, ...) = 0 E(g
i
) = [E(
i
), ..., E(
i
x
ik
)] = 0
CLT for s&e m.d.s applied to {g
i
}:
1
n
i
x
i
= g
n
:
36
Applied Econometrics 10 Generalized Least Squares
n(g
n
E(g
i
)
. .. .
=0
)

d
MV N(0, E(g
i
g
i
)
. .. .
V ar(g
i
)
)
Distribution of b
b = [
1
n
x
i
x
i
]
1 1
n
x
i
y
i
is CAN
Notes:
{
i
x
i
} is a s&e m.d.s rules out Cov(
i
,
ij
) = 0.
This autocorrelation is ok for cross-sectional data, but not so ok for time eries. Can be relaxed (but
more complex): HAC properties (=heteroscedasticity autocorrelation consistent).
Robust inference
Robust s.e./t-stats etc. (HAC) and those using the restrictive assumptions (CAN) should not
dier too much (Often happens when one tries to account for autocorrelation).
10 Generalized Least Squares
Assumptions of GLS
Linearity y
i
= x
i
+
i
Full rank: rank(X)=K (with Prob=1)
Strict exogeneity E(
i
|X) = 0 E(
i
) = 0 and Cov(
i
, x
ik
) = E(
i
x
ik
) = 0
Not assumed: V ar(|X) =
2
I
n
.
Instead:
V ar(|X) = E(
|X) =
V ar(
1
|X) Cov(
1
,
2
|X) ... Cov(
1
,
n
|X)
Cov(
1
,
2
|X) V ar(
2
|X) ... ...
... ... ... ...
Cov(
1
,
n
|X) ... ... V ar(
n
|X)
(104)
V ar(|X) = E(
|X) =
2
V (X) (a positive real number extracted from V ar(|X))
Consequences:
b
n
= (X
X)
1
X
y no longer BLUE, but unbiased

V ar(b
n
|X) =
2
(X
X)
1
Deriving the GLS estimator
Derived under the assumption that V(X) is knowm, symmetric and positive denite. Then:
V=P P (with P=characteristic vectors/eigenvectors of V(nxn), =diagonal matrix with eigenvalues
on the diagonal) = spectral decompositionC=P*
1/2
(C is (nxn) and nonsingular)
V (X)
1
=CC
Transformation: y = Cy,

X = CX
y = X +
37
Applied Econometrics 10 Generalized Least Squares
Cy = CX +C
y =

X +
Least squares estimation of

using transformed data
GLS
= (

X)
1

X
y = (X
CX)
1
X
Cy = (X
1
2
V
1
X)
1
X
1
2
V
1
y
= [X
[V ar(|X)]
1
]X
[V ar(|X)]
1
y
GLS estimator is BLUE:
linearity
full rank X
strict exogeneity
V ar(|X) =
2
I
n
V ar(
|X) =
2
[X
V
1
X]
1
Assumptions 1.1-1.4 are fullled for this model
Problems:
Dicult to work out the asymptotic properties of

GLS
In real world applications V ar(|X) not known (and this time we have a lot more unknowna)
If V ar( < X) is estimated the BLUE-property of

GLS
is lost
Special case of GLS - weighted least squares
E(
) = V ar(|X) =
2
V
1
(X) 0 ... 0
0 V
2
(X) ... 0
... ... ... ...
0 0 ... V
N
(X)
=
2
V (x
i
) (105)
As V (X)
1
=CC C=
V
1
(X)
0 ... 0
0
1
V
2
(X)
... 0
... ... ... ...
0 0 ...
1
V
n
(X)
1
s
1
0 ... 0
0
1
s
2
... 0
... ... ... ...
0 0 ...
1
s
n
(106)
argmin
(
y
1
s
i
1
s
1
i

2
x
i2
s
i
...

k
x
iK
s
i
)
2
Observations are weighted by standard deviation (higher penalty for wrong estimates)
For this we would need the conditional variances V ar(|x
i
). We assume that we can use a variance
conditioned on x
i
.
38
Applied Econometrics 11 Multicollinearity
Notes on GLS
If regressors X are not strictly exogeneous, GLS may be inconsistent (OLS does not have this problem:
robustness property of OLS).
Cross-section data: E(
i
|X) = 0 better justied and E(
i

j
) = 0, but:
Conditional heteroscedasticity: V ar(
i
|) =
2
V (x
i
) WLS using
2
V (x
i
)
Functional form of V (x
i
)? Typically not known, but has to be assumed. E.g.:
V (x
i
) =

Z
i
....
observables
....
parameters to be estimated
Feasible GLS (FGLS) ( ,

estimated in one step)
Finite sample properties of GLS are lost
For large sample properties:
FGLS may be more ecient (asymptotic) than the OLS, but: the functional form of V (x
i
) has to be
correctly specied. If this is not the case: FGLS may be even less ecient than OLS.
So we face a trade-o between consistency and eciency.
One can work out the distribution of th GLS also asusming s&e m.d.s and you also get a normal
distribution.
11 Multicollinearity
Exact multicollinearity
Expressing a regressor as linear combination of (an)other regressor(s)
rank(X)= K: No full rank Assumption 1.3 or 2.4 is violated and (X
X)
1
does not exist.
Example: seasonal dummies/sex dummies and a constant
Often economic variables are correlated to some degree.
Example: tenure and work experience
BLUE result is not afected
Large sample results are not aected
Var(b|X) is aected
Eects of Multicollinearity and solution to the problem
Eects:
Coecients may have high standard errors and low signicance levels
Estimates may have the wrong sign (compared to a theoretical sign)
Small changes in the data produces wide swings in the parameter
39
Applied Econometrics 12 Endogeneity
Recall: V ar(b
k
|x) =
2
[(X
X)
1
]
kk
, X contains a constant & 2 regressors x
1
, x
2
Then V ar(b
k
|X) =

2
(1r
2
x
1
,x
2
)
(x
ik
x
k
)
2
(with r
2
= empirical correlation of regressors and second
part as sample variance of the k-th regressor)
General: [(X
X)
1
]
kk
=
1
(1R
2
k.
)
(x
ik
x
k
)
2
(with R
2
k.
= explained variance of the regression of x
k
on
the rest (data matrix excluding x
k
))
The stronger the correlation between the regressors, the smaller the denominator of the variance
and the higher the variance of b
k
: less precision and we can hardly reject the null hypothesis (small
t-stat)
The smaller
2
, the smaller V ar(b
k
).
The higher n, the higher the variance in the regressor x
k
and thus a high n can compensate for
high correlation: the larger the sample variance, the smaller V ar(b
k
).
Solutions:
Increasing precision by implementing more data (= higher nhigher variance of x
k
) (costly!)
Building a better tting model that leaves less unexplained (= smaller
2
)
Excluding some regressors (Omitted variables bias!) (= smaller correlation)
12 Endogeneity
Omitted variable bias
Correctly specied model: y = X
1
1
+X
2
2
+
Regression of y on X
1
X
2
gets into the error term: y = X
1
1
+
....
X
2
2
+
Omitted variable bias:
b
1
= (X
1
X
1
)
1
X
1
y
= (X
1
X
1
)
1
X
1
(X
1
1
+X
2
2
+)
. .. .
y
=
1
+ (X
1
X
1
)
1
X
1
X
2
. .. .
RegressionofX
2
onX
1
2
+ (X
1
X
1
)
1
X
. .. .
=0ifE(|X)=0
OLS estimator is biased:
If
2
= 0 (which would mean that X
2
is not part of the model in the rst place)(X
1
X
1
)
1
X
1
X
2
2
=
0
If (X
1
X
1
)
1
X
1
X
2
= 0 (the regression of X
2
on X
1
gives you non-zero coecients)(X
1
X
1
)
1
X
1
X
2
2
=
0
Recall:
1. E(
i
|x) = 0 = strict exogeneity E(
i
x
i
) = 0
2. E(
i
x
ik
) = 0k = large sample OLS assumption; predetermined/orthogonal regressors
E(
i
x
i
) = 0 or E(
i
x
ik
) = 0 for some k
40
Applied Econometrics 12 Endogeneity
Endogeneity bias: Working example
Simultaneous equations model of market equilibrium (structural form):
q
d
i
=
0
+
1
p
i
+u
i
(u
i
=demand shifter (e.g. preferences))
q
s
i
=
0
+
1
p
i
+v
i
(v
i
=supply shifter (e.g. temperature)
Assumptions: E(u
i
) = 0, E(v
i
) = 0, E(u
i
v
i
) = 0(Cov)
Clear markets: q
d
i
= q
s
i
Our data show clear point of the market. It is not possible to estimate
0
,
1
,
0
,
1
(levels and slopes)
as we do not know whether changes in the market equilibrium are due to supply or demand shocks.
We observe many possible equilibria, however we can not explain the slope of the demand and the
supply curve from the data.
Here: Simultaneous equation bias
From structural form to reduced form
Solving for q
i
and p
i
(q
d
i
= q
s
i
= q) yields reduced form (only exogeneous things on the right-hand
side):
p
i
=

0
1
+
v
i
u
i
1
q
i
=

1
1
+

1
v
i
1
u
i
1
Both price and quantity are a function of the tow error terms (no regression can be used at this point
as we dont observe the error terms): v
i
=supply shifter and u
i
:demand shifter
Endogeneity: Correlation between errors and regressors, regressors are not predetermined as the price
is a function of these error terms
Calculating the covariance of p
i
and the demand shifter u
i
:
Cov(p
i
, u
i
) =
V ar(u
i
)
1
(using reduced form)
If endogeneity is present the OLS-estimator is not consistent
FOCs in simple regression context yield:

1
=
1
n
(q
i
q)(p
i
p)
1
n
(p
i
p)
2
p
Cov(p
i
,q
i
)
V ar(p
i
)
(formula for estimation)
But here: Cov(p
i
, q
i
) =
1
V ar(p
i
) +Cov(p
i
, u
i
) (from structured form)
Cov(p
i
,q
i
)
V ar(p
i
)
=
1
+
Cov(p
i
,u
i
)
V ar
i
=
1
OLS is not consistent
Same holds for
1
Instruments for the market model
Properties of the instruments: uncorrelated with the errors, instruments are predetermined and corre-
lated with the endogenous regressors.
New model:
q
d
i
=
0
+
1
p
i
+u
i
q
s
i
=
0
+
1
p
i
+
2
x
i
+
i
q
d
i
= q
s
i
with:
E(
i
) = 0
41
Applied Econometrics 13 IV estimation
Cov(x
i
,
i
) = 0
Cov(x
i
, u
i
) = 0
Cov(x
i
, p
i
) =

2
1
V ar(x
i
) = 0
x
i
an instrument for p
i
yields new reduced form:
p
i
=

0
1
+

2
1
x
i
+

i
u
i
1
q
i
=

1
1
+

1
1
x
i
+

1
1
u
i
1
Cov(x
i
, q
i
) =

1
1
V ar(x
i
) =
1
Cov(x
i
, p
i
) (from reduced form)
Using Method of Moments Theory (not regression):
1
=
Cov(x
i
,q
i
)
Cov(x
i
,p
i
)
by WLLN:
1
p

1
(see
estimator:
1
n
x
i
y
i
1
n
x
i
z
i
)
A simple macroeconomic model: Haavelmo
Aggregated consumption function: C
i
=
0
+
1
Y
i
+u
i
GDP identity: Y
i
= C
i
+I
i
Y
i
aects C
i
but at the same time C
i
inuences Y
i
Reduced form:Y
i
=

0
1
1
+
1
1
1
I
i
+
u
i
1
1
C
i
cannot be regressed on Y
i
as the regressor is correlated with the residual:
Cov(Y
i
, u
i
) =
V ar(u
i
)
1
1
> 0 (using reduced form)
OLS-estimator is inconsistent: upwards biased (as Cov(C
i
, Y
i
) =
1
V ar(y
i
) + Cov(Y
i
, u
i
) from
structured form):
Cov(C
i
,Y
i
)
V ar(Y
i
)
=
1
+
Cov(Y
i
,u
i
)
V ar(Y
i
)
=
1
Valid instrument for income Y
i
: investment I
i
Cov(I
i
, u
i
) = 0
Cov(I
i
, Y
i
) =
V ar(I
i
)
1
1
= 0 (from reduced form)
Errors in variables
Explanatory variable is measured with error (e.g. reporting errors)
Classical example: Friedmans permanent income hypothesis
Permanent consumption is proportional to permanent income C
i
= kY
i
Observed variables:
Y
i
= Y
i
+y
i
C
i
= C
i
+c
i
c
i
= ky
i
+u
i
Endogeneity due to measurement errors
Solution: IV-estimators; here: x
i
= 1
13 IV estimation
General solution to endogeneity problem: IV
Note: change in notation: z
i
=regressors, x
i
=instruments
42
Linear regressions: y
i
= z
i
+
i
But the assumption of predetermined regressors does not hold: E(z
i

i
) = 0
To get consistent estimators instrumental variables are needed: x
i
= [x
i1
, ..., x
iK
]
x
i
is correlated with the endogeneous regressor, but uncorrelated with the error term.
Assumptions for IV-estimators
3.1 Linearity y
i
= z
i
+
i
3.2 Ergodic stationarity: K instruments and L regressors. Data sequence w
i
= {y
i
, z
i
, x
i
} is stationary
and ergodic [LLN can be applied as with OLS]
We focus on K=L IV (other possibility K>LGMM)
3.3 Orthogonality conditions: E(x
i

i
) = 0
E(x
i1
(y
i
z
i
)) = 0
...
E(x
iK
(y
i
z
i
)) = 0
E(x
i
(y
i
z
i
)) = 0 E(x
i

i
) = E(g
i
) = 0
Later we also assume that a CLT can be applied to g
i
as for OLS
3.4 Rank condition for identication: rank(
xy
)=L with
E(x
i
z
i
) =
E(x
i1
) ... E(x
i1
z
iL
)
... ... ...
E(x
iK
z
i1
) ... E(x
iK
z
iL
)
=
xz
....
(KxL)
(107)
(=full column rank)
1
xz
exists of K=L
Core question: Do moment conditions provide enough information to determine uniquely?
Some notes on IV-estimation
Start from:
g
i
= x
i
(y
i
z
i
) = g
i
(w
i
, ) (with w
i
=the data)
Assumption 3.3 E(g
i
) = 0 or E(g(w
i
, )) = 0 = orthogonality condition: E(x
i
(y
i
z
i
)) = 0 =
System of K equations relating L unknown parameters to K unconditional moments
Core question: Are population moment conditions sucient to determine uniquely?
as a solution to system of equation. But: only solution?
hypothetical value of that solves E(g(w

i
,
)) = 0
only solution:
E(x
i
y
i
)
. .. .
xy
(Kx1)
E(x
i
z
i
)
. .. .
xz
(KxL)
= 0
xz
=
xy
(System of linear equations Ax=b)
Necessary& sucient conditions that

only solution
xz
full column rank(L)
Reason for assumption 3.4
43
Another necessary condition (but not sucient): Order condition: K(#instruments)L(#regressors)
Deriving the IV-estimator (K=L)
E(x
i
i
) = E(x
i
(y
i
z
i
) = 0
E(x
i
y
i
) E(x
i
z
i
) = 0
= [E(x
i
z
i
)]
1
E(x
i
y
i
)
IV
= [
1
n
x
i
z
i
]
1
[
1
n
x
i
y
i
]

p

To show this proceed as for OLS, start from sampling error:

IV

Alternative notation:
IV
= (X
Z)
1
X
y
If K=L the rank condition implies that
1
xz
exists and the system is exactly identied.
Applying WLLN, CLT and the lemmas it can be shown that IV-estimator

IV
is CAN.
To show this, proceed as for OLS, start from

n(
IV
). Assume that a CLT can be applied to
{g
i
}:

n(
1
n
g
i
E(g
i
))

d
MV N(0, E(g
i
g
i
))
n(
IV
)

d
MV N(0, Avar(
IV
)) (Find expression for the variance and the estimator of the
variance)
So:
Show consistency and asymptotic normality of

IV
.

IV
is CAN:

a
MV N(,
Avar(
IV )
n
)
Relate OLS- to IV-estimators:
If regressors predetermined E(
i
z
i
) = 0 x
i
= z
i
(use regressors as instruments)
IV
= [
1
n
z
i
z
i
]
1 1
n
z
i
y
i
= OLS estimate
OLS a special case of IV
44
Applied Econometrics 14 Questions for Review
14 Questions for Review
First set
1. Ragnar Frischs statement in the rst Econometrica issue claims that Econometrics is not (only)
economic statistics and not (only) mathematics applied to economics. What is it. then?
Econometrics=
Economic statistics (Basis)
Economic theory (Basis and application)
Mathematics (Methodology)
2. How does the fundamental asset value evolve over time in the Glosten/Harris model presented in
the lecture? Which parts of the nal equation are associated with public and which parts with private
information that inuences the fundamental asset value?
m
t
....
fundamentalassetvalue
= +m
t1
+
t
. .. .
randomwalkw.drift
+Q
t
z
0
+Q
t
z
1
v
t
. .. .
privateinfo
(108)
P
t
=
....
publicinfo
+z
0
Q
t
+z
1
v
t
Q
t
+cQ
t
. .. .
privateinfo
+
t
....
publicinfo
(109)
3. Explain the components of the equations that give the market maker buy (bid) and the market
maker sell (ask) price in the Glosten/Harris model.
P
a
t
= mu
....
driftparameter
+ m
t1
. .. .
lastfairprice
+
t
....
publicinfo
z
t
....
volume
c
....
earnings/costs
(110)
4. Explain how in the Glosten/Harris model the market maker anticipates the impact that a trade
event exerts on the fundamental asset price when setting buy and sell prices.
The market maker sells with the ask price, thus z
t
and c enter positively in the price. This means
the market maker adds a volume parameters and his costs to the price to be rewarded for his work
appropriately.
5. Why should it be interesting for a) an investor and b) for a stock exchange to estimate the
parameters of the Glosten/Harris model (z
0
, z
1
, c, )?
a) To nd out the fundamental value thus to prevent the investor from paying too much
b) To trade with ecient prices (avoid bubbles, avoid exploitation through market maker).
45
6. What does the (bid-ask) spread mean? What are explanations for the existence of a positive spread
and what could be the reason that the spread diers when the same asset is traded on dierent market
environments?
P
a
t
P
b
t
= 2z
t
+ 2c (111)
Positive spread: earnings of the market maker to earn a living and avoid costs (operational costs,
protection/storage costs)
Market environment: inuences costs and trade volume eg. in a crisis
7. Which objects (variables and parameters) in the Glosten/Harris model are a) observable to the
market maker, but not to the econometrician and b) observable to the econometrician?
a) costs/earnings c
b) P
t
, Q
t
, v
t
8. The nal equation of the Glosten/Harris model contains observable objects, unknown parameters
and an unobservable component. What are these? What is the meaning of the unobservable variable
in the nal equation?
Observable: P
t
, Q
t
, v
t
Unobservable:
t
Parameter: , z
0
, z
1
, c
9. Why did we transform the two equations for the market makers buy and sell price that the market
maker posts at point in time t into the equation P
t
?
So that we can estimate the structural parameters
10. In the "Mincer equation"derived from human capital theory what are the observable variables and
what would a sensible interpretation of of the unobservable component be?
Observable: Wage, S, Tenure, Exp
Unobservable:
i
(eg. ability, luck, motivation, gender)
11. Why should the government (ministry of labour or education) be interested in estimating the
parameters of the Mincer equation from human capital theory?
To determine how many years people should go to school (costs!)
12. Explain why we can conceive ln(WAGEi), Si, TENUREi, and EXPRi in the Mincer equation as
random variables.
The individuals are randomly drawn. The variables are the results of a random experiment and thus
random variables
13. An empirical researcher using a linear regression model is confronted with the critique this is
measurement without theory". What is meant by that and why is this a problem.
46
A correlation is found, but it may be too small or without meaning for theory. Always start with a theo-
retical assumption. Without a theoretical assumption behind the statistics, we cant test hypothesis.
Causality is a problem, because we cant refer to a model/reasons for the form of our regression
14. The researcher counters that argument by stating that he is only interested in prediction and
causality issues do not matter for him. Exploiting a correlation would be just ne for his purposes.
Provide a critical assessment of this statement.
To act upon such a test, causality is needed. Correlation could be because of self-selection, unobserved
variables, reversed causality, interdependence, endogeneity
15. A possibility to motivate the linear regression model is to assume that E(Y|X)=x. What does
this imply for the disturbance term ?
uncorrelated with the Xs and its expectation value is zero.
16. Some researchers argue that nonparametric analysis provides a better way to model the conditional
mean of the dependent variable. What can be said in favor of it and what is the drawback?
E(y
i
|x
i
) = x
i
= assuming linearity (=parametric) vs. E(y
i
|x
i
) = f(x
i
, ) = no functional form
assumed (nonparametric)
Favor: no linearity assumed (robust)
Drawback: data intensive (with a lot of variety in X), curse of dimensionality
17. The parameters of asset pricing models relating expected excess returns to risk factors can be
estimated by regression analysis. Describe the idea. Which implications for the "compatible regres-
sion model"follow from theory with respect to the regression parameters and the distribution of the
disturbance term? How can we test the parameter restriction?
E(R
ej
t+1
) =
j
E(R
em
t+1
) (112)
Compatible model assumes that (un)conditional mean of
j
is 0:
E(R
ej
t+1
) =
j
E(R
em
t+1
) +
j
t+1
(113)
Parameter restriction???
18. Despite the fact that the Fama-French model is in line with the fundamental equation from
nance theory which states that E(R
ej
) =
, it is nevertheless prone to the "measurement without

theory"critique. Why is that so? What do and stand for in the rst place?
=exposure of payo asset to factor k risk
=price of factor k risk
Causality problem: if risk increases, does that cause the expected return to increase or is that because
of other factors (unobserved variables, reversed causality,...)
47
Zusatz. If one has stable correlation, we can predict without causality, eg. animals leaving the shore
and tsunamis.
Second set
1. Explain the key dierence of experimental data and the "process-producedconomic data that is
typically available.
In process-produced data: self selection (individuals not randomly assigned)
2. Why is the Tennessee STAR study a proper experiment while the SQ study of the sociology student
(probably) not?
STAR participants randomly assigned to the two groups (small or big classes). SQ study with self-
selection bias and without control group and pre-test
3. Explain the term election bias". Write down the expression and explain its terms to your fellow
student.
Selection-bias:
Self-selection: individuals decide themselves which group they want to be part of
Bias: The parameters are biased because of self-selection; errors in choosing the individuals/groups
that participate
4. Comparing the average observable outcome y
i
(e.g. entry income) of a random sample for individuals
who received a treatment d
i
(e.g. attended an SQ course) with the average outcome of a random
sample who did not receive a treatment (did not attend an SQ course) yields a consistent estimate
of E(Y
i
|d
i
= 1) E(Y
i
|d
i
= 0) (by a law of large numbers). Under which conditions does this yield
a consistent estimate of a treatment eet assumed to be identical for all i. What causes a bias of
the treatment eect? Write down an explicit expression of this bias. How can this bias be avoided?
The individuals have to be randomly assigned to either treatment or non-treatment.
E(b ) = 0 bzw. E(y
i
|d
i
= 1) E(y
i
|d
i
= 0) = + selection bias
. .. .
E(y
i0
|d
i
=1)E(y
i0
|d
i
=0)
5. Consider the police-crime example. Put it in a Rubin causal framework. Why would be suspect a
selection bias, i.e. E(y0ijdi = 1) E(y0ijdi = 0) 6= 0?
Y
i
= y
0i
+ (y
1i
y
0i
)D
i
Y
i
: Crime
D
i
: 0=police doesnt change; 1=police increases
y
1i
=crime if D
i
= 1 und y
0i
=crime if D
i
= 0
Social factors, which means police and
i
are correlated. Police ocers are not randomly assigned to
the districts.
6. What is a quasi-experiment?
48
=a natural experiment:
Without assigning individuals to groups, those groups form in a way they do in experiments (random
assignment)
7. In the lecture I promoted the xperimental ideal"for the measurement of causal eects. For instance,
the US senate passed (in 2004) a law that educational studies have to be based on experimental or
quasi-experimental designs. Discuss the pros and cons of setting up an experiment in a socio-economic
context. Give argument against conducting such experiments which do a good thing as they help to
estimate the true causal eect of treatments.
Pro: measures causal eects; self-selection bias avoided
Contra: unethical, costly
8. Take the Mincer equation as an example. Why are the years of schooling ndogenous", i.e. the
result of an economic decision process that is not made explicit in the equation. What is the role of
the disturbance term in this context? Place this endogeneity in the context of the US senates law
mentioned above.
Years of schooling depend on the ability which is part of the (S
i
could be written as a function of
IQ). No decision based on the Mincer equation.
9. Estimating the parameters of the Mincer equation yields an estimate of the coecients that multiply
with the years of schooling of 0.12. Interpret that number economically. When would the US senate
rely on this result when it has to decide on budget measures that involve endowing the US education
system with tax money? What are the concerns? What should be done to convince the senate that
the result is trustworthy?
An additional year of schooling increases the wage by 12%. The US senate would require an expe-
riment which would ask some students to leave school earlier or stay longer than planned (random
assignment). As this is unethical the selection bias cant be avoided. One might want to collect as
many pieces of information as possible to avoid endogeneity.
10. Provide another economic example (like the crime example or the liberalization example, but one
that is not given in the lecture) where endogeneity is present. Cast your example in a CLRM. What
is the economic interpretation of in your model?
???
11. There is a special name for model parameters which have an economic meaning. What do we call
them?
Structural parameters
12. Based on a sample of rms, a researcher estimates by OLS the coecients of a Cobb-Douglas
production function. The CD function is nonlinear, but the regression coecients are computed
despite the rst CLRM assumption of linearity. How is this possible? Estimates of the production
function parameters yield numbers equal to 0.62 and 0.41, respectively. How do you interpret these
estimates?
49
Log-linear model
A 1% increase in capital leads to a .62% increase in production
13. Prove your algebraic knowledge and write the CLRM in matrix and vector notation. Indicate the
dimensions of scalars, vectors and matrices explicitly!
y
1
...
y
n
. .. .
(nx1)
=
1 x
12
... x
1k
...
1 x
n2
... x
nk
. .. .
(nxk)
1
...
. .. .
(kx1)
+
1
...
. .. .
(nx1)
(114)
14. Which objects in the CLRM are observable and which are not?
Observable: y,X
Unobservable: ,
15. In the lecture we had a possible choice of b = (X
X)
1
X
y. Write out that expression in detail

using the denition of the matrix X and the vector y. Can you write the right hand side expression
using x
i
= x
i1
, x
i2
, ..., x
iK
and y
i
instead of the matrix X and the vector y?
b =
1 ... 1
x
12
... x
n2
...
x
1k
... x
nk
1 x
12
... x
1k
...
1 x
n2
... x
nk
1 ... 1
x
12
... x
n2
...
x
1k
... x
nk
y
1
...
y
n
(115)
b = (
1
n
x
i
x
i
)
1
1
n
x
i
y
i
(116)
16. Using observed data on explanatory variables and the dependent variable you can compute the
OLS estimator as b = (X
X)
1
X
y. This yields a K dimensional column vector of real numbers.

Thats an easy task using a modern computer. Explain why we conceive the OLS estimator also as
a vector of random variables (a random vector), and that the concrete computation of b yields a
particular realization of this vector of random variables. Explain the source of randomness of b.
Random vector because b is computed as a function of X andy; and X and y are random variables
because we assume they are drawn randomly. When individuals are drawn, those are the realizations
and we can compute the realized b
17. Linearity (y
i
= x
i
+
i
) seems a restrictive assumption. Provide an example (not from lecture)
where an initially nonlinear relation of dependent variable and regressors can be reformulated such
that an equation linear in parameters result. Give another example (not in lecture) where this is not
feasible.
50
y = x
lny = x +z
y =
x
oder y = x
+z
as impossible variations.
18. When are two random variables orthogonal?
For r.v. E(xy)=0
For vectors xy=0
19. Strict exogeneity is a restrictive assumption. In another assignment you will be asked to prove the
result (given in the lecture) that strict exogeneity implies that the regressors and the disturbance term
are uncorrelated (or, equivalently, have zero covariance or, also equivalently, are orthogonal). Explain,
using one of the economic examples given in the lecture (crime example or liberalization example), that
assuming that error term and regressor(s) are uncorrelated is doubtful from an economic perspective
(You have to give the an economic meaning like in the wage example where measured unobserved
ability of an individual.)
Social factors as part of
i
incluence crime and police.
20. Why is Z = rank(X) is a random variable? Which values can Z take (X is a nxK matrix where
nK)? What do we assume w.r.t. the probability function of Z? Why do we make this assumption in
the rst place and why is it important for the computation of the OLS estimator?
rank(x) is a function of X
Z=rank(X) can take on 1...K values and is Bernoulli distributed. Probability distribution: 1 for full
rank, 0 for all other outcomes degenerates
P(Z=rank(X))=1 OLS is otherwise not computable
21. How many random variables are contained in the equations
a) Y = X+: Y(n rv), X(kxn rv)
b) Y = Xb + e: Y(n rv), X(kxn rv), b(k rv), e(k rv)
Third set
1. Show the following results explicitly by writing out in detail the expectations for the case of conti-
nuous and discrete random variables.
Useful:
E(g(x, y)) =

g(x, y)f
xy
dxdy (117)
E(g(x, y)|x) =
g(x, y)
f
xy
f
x
dy (118)
a) E
X
[E
Y |X
(Y |X)] = E
Y
(Y ) (Law of Total Expectations LTE)
E
Y |X
(Y |X) =

y
f
xy
(x,y)
f
x
(x)
dy
E
X
[E
Y |X
(Y |X)] =

[
y
f
xy
(x,y)
f
x
(x)
dy]f
x
(x)dx =

y
f
xy
dxdy =

yf
y
(y)dy = E
y
(y)
51
b) E
X
[E
Y |X
(g(Y )|X)] = E
Y
(g(Y )) (Double Expectation Theorem DET)
E
Y |X
(g(Y )|X) =

g(y)
f
xy
f
x
(x)
dy
E
X
[E
Y |X
(g(Y )|X)] =

[
g(y)
f
xy
f
x
(x)
dy]f
x
(x)dx =

g(y)
f
xy
dxdy =

g(y)f
y
(y)dy = E
y
(g(y))
c) E
XZ
[E
Y |X,Z
(Y |X, Z)] = E
Y
(Y ) (LTE)
E
Y |X,Z
(Y |X, Z) =

y
f
xyz
(x,y,z)
f
xz
(x,z)
dy
E
XZ
[E
Y |X,Z
(Y |X, Z)] =

[
y
f
xyz
(x,y,z)
f
xy
(x,z)
dy]f
xz
(x, z)dxdz =

g(y)
f
xy
dxdy =

g(y)f
y
(y)dy =
E
y
(g(y))
d) E
Z|X
[E
Y |X,Z
(Y |X, Z)|X] = E
Y |X
(Y |X) (Law of Iterated Expectations LIE)
E
Y |X,Z
(Y |X, Z)|X =

y
f
xyz
(x,y,z)
f
xz
x,z
dy
E
Z|X
[E
Y |X,Z
(Y |X, Z)|X] =

[
y
f
xyz
(x,y,z)
f
xz
(x,z)dy
]
f
xz
(x,z)
f
x
(x)
dz =

y
1
f
x
(x)
f
xyz
(x, y, z)dzdy =
y
f
xy
(x,y)
f
x
(x)
dy = E
Y |X
(Y |X)
e) E
X
[E
Y |X
(g(X, Y )|X)] = E
XY
(g(X, Y )) (Generalized DET)
E
Y |X
(g(X, Y )|X) =

g(X, Y )
f
xy
(x,y)
f
x
(x)
dy
E
X
[E
Y |X
(g(X, Y )|X)] =

[
g(x, y)
f
xy
(x,y)
f
x
(x)
dy]f
x
(x)dx =

g(x, y)f
xy
(x, y)dxdy = E
xy
(g(x, y))
f) E
Y |X
[g(X)Y |X] = g(X)E
Y |X
(Y |X) (Linearity of Conditional Expectations) E
Y |X
[g(X)Y |X] =
g(x)y
f
xy
(x,y)
f
x
(x)
dy = g(x)
y
f
xy
(x,y)
f
x
(x)
dy = g(x)E
y|x
(y|x)
2. Show that for K = 2, E(
i
|x
i1
, x
i2
) = 0 implies
a) E(
i
) = 0
E(
i
|x
i1
, x
i2
) = 0 E(
i
|x
i1
) = 0 by LIE
E(
i
|x
i1
) = 0 E(
i
) = 0 by LTE
b) E(
i
x
i1
) = 0
Show: E(
i
x
i1
) = 0
E
x
i1
[E
i
|x
i1
(
i
x
i1
|x
i1
)] by DET
E
x
i1
[x
i1
E(
i
|x
i1
)
. .. .
0
] = E(0) = 0 by Linearity of Cond. Exp.
Show: E(
i
x
i1
) = Cov(
i
, x
i1
) = 0
Cov(
i
, x
i1
) = E(x
i1

i
)
. .. .
0
E(x
i1
) E(
i
)
. .. .
0
= 0
c) Show that E(
i
|X) = 0 implies that Cov(
i
, x
ik
) = 0 i,k.
E
x
[E(
i
|x)] = E(
i
) = 0 by LTE
E(
i
x
ik
) = E
x
[E(
i
x
ik
|x)] by DET
E
x
[x
ik
E(
i
|x)] = E(0) = 0 by Linearity of Cond. Exp.
Cov(x
ik
,
i
) = E(x
ik
i
) E(x
ik
)E(
i
) = 0
52
3. Explain how the positive semi-deniteness of the dierence of the two variance-covariance matrices
of two alternative estimators is related to the relative eciency of one estimator vs. another.
Positive semi-deniteness: a
[V ar(
|x) V ar(b|x)]a 0
When positive semi-deniteness is true we can choose a=[1,0,...,0],=[0,1,0,...,0] usw., so that:
V ar(
1
|x) V ar(b
1
|x), V ar(
2
|x) V ar(b
2
|x) usw. fr alle b
k
.
Thus we can assume eciency when a matrix is positive semi-denite.
4. Show by writing in detail for K=2, = (
1
,
2
)
that a
[var(
|X)var(b|X)]a 0a = 0 implies
that var(
1
|X) var(b
1
|X) for a=(1, 0).
a
[var(
|X) var(b|X)]a 0 (119)

(1, 0)
V ar(
1
|x) Cov(
2
,

1
|x)
Cov(
1
,

2
|x) V ar(
2
|x)
V ar(b
1
|x) Cov(b
2
, b
1
|x)
Cov(b
1
, b
2
|x) V ar(b
2
|x)
1
0
(120)
(1, 0)
V ar(
1
|x) V ar(b
1
|x) Cov(
2
,

1
|x) Cov(b
2
, b
1
|x)
Cov(
1
,

2
|x) Cov(b
1
, b
2
|x) V ar(
2
|x) V ar(b
2
|x)
1
0
(121)
(1, 0)
a b
c d
1
0
= [a, b]
1
0
= a = V ar(
1
|x) V ar(b
1
|x) 0 (122)
V ar(
1
|x) V ar(b
1
|x)
5. We require a variance covariance matrix to be symmetric and positive denite. Why is the latter
a sensible restriction? Denition of positive (semi-)denite matrix: A is a symmetric matrix. x is a
nonzero vector.
1. if xAx>0 for all nonzero x, then A is positive denite
2. if xAx0 for all nonzero x, then A is positive semi denite
They are symmetric by construction
X (, )
a
= [a
1
, ..., a
n
] Z=aX with V ar(Z) = a
V ar(X)a
. .. .
has to be >0
6. How does the conditional independence (CIA) assumption justify the inclusion of additional regres-
sors (control variables) in addition to the key regressor, i.e. the explanatory variable of most interest
(e.g. small class, schooling, foreign indebtedness). In which way can one, by invoking the CIA get
closer to the experimental ideal?
CIA in general: p(a|b,c)=p(a|c) or a b|c: a is conditionally independent of b given c
i
x
1
|x
2
, x
3
, ...
The more control variables, the closer we come to the ideal that
i
x
1
53
7. The CIA motivates the idea of a matching estimator briey discussed in the lecture. What is the
simple basic idea of it?
To measure treatment eects without random assignment:
Matching means sorting individuals in similar groups, where one received treatment and then compare
the out comes.
E(y
i
|x
i
, d
i
= 1) E(y
i
|x
i
, d
i
= 0) = local average treatment eect
Sort groups according to xs. Replace expectations by the sample means for the individuals and the
dierence tells the treatment eect of a certain group.
E
x
i
[E(y
i
|x
i
, d
i
= 1) E(y
i
|x
i
, d
i
= 0)] = average treatment eect
Averaging over all xs, so we get the treatment eect for all groups.
Fourth set
0. Where does the null hypothesis enter into the t-statistic?
In the numerator, as we replace
k
with
k
1. Explain what unbiasedness, eciency, and consistency mean. Give an explanation of how we can
use these concepts to assess the quality of the OLS estimator b = (X
X)
1
X
y. (We will be more

specic about consistency in later lectures, this question is just to make you aware of the issue.)
Unbiasedness E(b)=: If an estimator is unbiased, its probability distribution has an expected value
equal to the parameter it is supposed to estimate.
Eciency V ar(
|x) V ar(b|X): We would like the distribution of the sample estimates to be as

tightly distributed around the true as possible.
Consistency P(b > ) 0 as n : b is asymtotically unbiased
b is BLUE = best
....
efficiency
linear unbiased estimator
2. Refresh your basic Mathematics and Statistics:
a) v=a*Z. Z is a (nx1) vector of random variables. a is (1xn) vector of real numbers.
b) V=A*Z. Z is a (nx1) vector of random variables. A is (mxn) matrix of real numbers.
Compute for v mean and variance as well as mean vector and variance covariance matrix of V .
a) E(v) = E(a
z) = a
....
(1xn)
E(z)
. .. .
(nx1)
V ar(v) = V ar(a
z) = a
....
(1xn)
V ar(z)
. .. .
(nxn)
a
....
(nx1)
b) E(V ) = E(AZ) = A
....
(mxn)
E(Z)
. .. .
(nx1)
V ar(V ) = V ar(A Z) = A
....
(mxn)
V ar(Z)
. .. .
(nxn)
A
....
(nxm)
54
3. Under which assumptions can the OLS estimator b = (X
X)
1
X
y be called BLUE?
Best estimator: Gauss-Markov (1.1-1.5)
Linear: 1.1
Unbiased: 1.1-1.3
4. Which additional assumption is used to establish that the OLS estimator is conditionally normally
distributed? Which key results from mathematical statistics have we used to employed to show the
normality of the OLS estimator b = (X
X)
1
X
y which results from this assumption?

After assumption 1.5: |X MV N(0,
2
In)
As we know b = (X
X)X
e and using Fact 6:

b |X MV N(E(b |X), V ar(b |X))
b |X MV N(0,
2
(X
X)
1
)
b|X MV N(,
2
(X
X)
1
)
5. Why are the elements of b random variables in the rst place?
They are randomly drawn and are a result of a random experiment.
6. For which purpose is it important to know the distribution of the parameter estimate in the rst
place?
Hypothesis testing (including condence intervals, critical values, p-values)
7. What does the expression a random variable is distributed under the null hypothesis as .... mean?
Random variable has a distribtuiton, as the realizations are only one way that the estimate can look
like. Under the null hypothesis means that we suppose that b=0 meaning that the null hypothesis is
true.
8. Have we said something about the unconditional distribution of the OLS estimate b? Have we said
something about the unconditional means? The unconditional variances and covariances?
Unconditional distribution: nein
Unconditional means: ja (LTE)
Unconditional variances: nein
9. What can we say about the conditional (on X) and unconditional distribution of the t and z statistic
(under the null hypothesis)? What is their respective mean and variance?
The type of distribution stays the same when conditioning.
z
k
: conditional: N(0,1); mean:
k
; variance:
2
[(X
X)
1
]
kk
t
k
: unconditional: t(n-k); mean:
k
; variance: s
2
[(X
X)
1
]
kk
10. Explain
what is meant by a type 1 error and a type 2 error?
what is meant by signicance level = size.
what is meant by the power of a test?
55
Type 1 error : rejected although correct; Type 2 error : accepted although wrong
Probability of making Type 1 error
1-Type 2 error: Rejecting when false
11. What are the two key properties that a good statistical test has to provide?
Tests should have a high power (=reject a false null) and should keep its size
12. There are two main schools of statistics: Frequentist and Bayesian. Describe their main die-
rences!
Frequentists: x the probability (signicance level) H
0
can be rejected rejection approach
(favored)
Bayesian: the probability changes with the evidence we get from the data H
0
cannot be rejected,
but we get closer to it conrmation approach
13. What is the null and the alternative hypothesis of the t-test presented in the lecture? How is the
test statistic constructed?
H
0
:
k
=
k
, H
A
:
k
=
k
b
k
|X N(
k
,
2
[(X
X)
1
]
kk
)
H
0
: b
k
|X N(
k
,
2
[(X
X)
1
]
kk
)
z
k
=
b
k
2
[(X
X)
1
]
kk
N(0, 1)
14. On which grounds could you criticize a researcher for choosing a signicance level of 0.00000001
(or 0.95, respectively)? Give examples for more conventional signicance levels.
0.00000001: hypothesis almost never rejected (type 1 error too low)
0.95: hypothesis is too often rejected (type 1 error too high)
15. Assume the value of a t-test statistic equals 3. The number of observations n in the study is 105,
the number of explanatory variables is 5. Give a quick interpretation of this result.
t-stat=3; n=105 and k=5 leads to df=100
We can use standard normal distribution as df>30.
For 5% signicance level the critical values are 2 (rule of thumb), so we would reject the hypothe-
sis.
16. Assume the value of a t-test statistic equals 0.4 and n-K=100. Give a quick interpretation of this
result.
We cant reject the H
0
on a 5% signicance level (t-stat approximately normal and thus we can use
the rule of thumb for the critical values)
17. In a regression analysis with K=4 you obtain parameter estimates b
2
= 0.2 and b
3
= 0.04. You
are interested in testing (separately) the hypotheses
2
= 0.1 against = 0.1 and
3
= 0 against
56
3
= 0.
You obtain
s
2
X
X
1
=
0.00125 0.023 0.0017 0.0005

0.023 0.0016 0.015 0.0097
0.0017 0.015 0.0064 0.0006
0.0005 0.0097 0.00066 0.001
(123)
where s
2
is an unbiased estimated of
2
. Compute the standard errors of the two estimates, the two
t-statistics and the associated p-values. Note that the t-test (as we have dened it) is two-sided. I
assume that you are familiar with p-values, students of Quantitative Methods denitely should be.
Otherwise, refresh you knowledge!
b
2
= 0.2,
2
= 0.1 standard error for b
2
=
0.0016 = 0.04
b
3
= 0.04,
3
= 0 standard error for b
3
=
0.0064 = 0.08
t-stat 2:
0.20.1
0.04
=2.5
t-stat 3:
0.04
0.08
=0.5
p-value 2: 1-0.9938=0.0062
p-value 3: 1-0.6915=0.3085
Computing the p-value is computing the implied .
18. List the assumptions that are necessary for the result that the z-statistic z
k
=
b
k
2
[(X
X)
1
]
kk
is
standard normally distributed under the null hypothesis, i.e. z
k
N(0, 1).
Bentigt: 1.1-1.5
Zustzlich: n-k>30 bzw. Var(b|X) known.
19. Is the t-statistic nuisance parameter-free?
Yes, as the nuisance parameter 2 is replaced by s
2
.
Denition: any parameter which is not of immediate interest but must be accounted for in the analysis
of those parameters which are of interest.
20. I argue that using the quantile table of the standard normal distribution instead of the quantile
table of the t-distribution doesnt make much of a dierence in many applications: The respective
quantiles are very similar anyway. Do you agree? Discuss.
This is correct for n-k>30. For lower values of n-k: t-dist is more spread out in the tails.
21. I argue that with respect to the distribution of the t-statistic under the null substituting for
2
the unbiased estimator
2
=
1
nk
e
e of
2
does not make much of a dierence. When would you
subscribe to that argument? Discuss.
For n the bias
nK
n
for
2
vanishes. Thus the statement is ok for large n. For small n the
unbiased estimator should be used.
22.
a. Show that P = X(X
X)
1
X
and M=In-P are symmetric (i.e. A=A) and idempotent, (i.e.

57
A=A*A). A useful result is: (BA)=AB.
b. Show that y=Xb=Py and e=y-Xb=My are orthogonal vectors, i.e. y
e = 0
c. Show that e=M and

e
2
i
= e
e =
M
d. Show that if x
i1
= 1 we have

e
i
= 0 (use OLS FOC) and that y = x
b (i.e. the regression

hyperplane passes through the sample means).
a.
P symmetric:
X(X
X)
1
X
=[X][(X
X)
1
][X]=X[(X
X)
]
1
X=X[X
[X
]
1
X=X(X
X)
1
X
M symmetric:
-X(X
X)
1
X
=In-[X(X
X)
1
X]=In-X(X
X)
1
X
P idempotent:
X(X
X)
1
X
[X(X
X)
1
X]=X(X
X)
1
(XX)(X
X)
1
X=X(X
X)
1
X
M idempotent:
-X(X
X)
1
X
[In-X(X
X)
1
X]= In[In-X(X
X)
1
X]-X(X
X)
1
X[In-X(X
X)
1
X]= In-X(X
X)
1
X-X(X
X)
1
X+X(X
X)
1
X=In-
X(X
X)
1
X
b.
Py*My=0
X(X
X)
1
Xy
*[(In-X(X
X)
1
X)y]= y[X(X
X)
1
X]*[y-X(X
X)
1
Xy]= yX(X
X)
1
Xy-yX(X
X)
1
XX(X
X)
1
Xy=0
c.
e=M*
-X(X
X)
1
X
=-X(X
X)
1
X=(y-x)-X(X
X)
1
X(y-X)= y-X-X(X
X)
1
Xy+X(X
X)
1
XX= y-X-X(X
X)
1
Xy+X=
y-X(X
X)
1
Xy=y-Xb=e
ee=
M
ee=(M)
M =
M =
M
d.
Starting point: X(y-Xb)=Xy-XXb=0
1 ... 1
x
12
... x
n2
... ... ...
x
1k
... x
nk
y
1
...
y
n
1 ... 1
x
12
... x
n2
... ... ...
x
1k
... x
nk
1 x
12
... x
1k
... ... ... ...
1 x
n2
... x
nk
b
1
...
b
k
= 0 (124)
58
Only rst row:
y
i
[n
x
i2
, ...,
x
ik
] b = 0 y = b
1
+b
2
x
2
+.... +b
k
x
k
(125)
y
i
= x
i
b +e
i
= y
i
+e
i
1
n
y
i
=
1
n
y
i
+
1
n
e
i
. .. .
0, if x=1
y = y
Fifth set
1. What is the dierence between the standard deviation of a parameter estimate b
k
and the standard
error of the estimate?
Standard deviation is the square root of the true variance and the standard error is the square root
of the estimated variance (and thus is dependent on the sample size).
2. Discuss the pros and conts of alternative ways to present the results for a t-test:
a) parameter estimate and
*** for signicant parameter estimate (at =1%)
** for signicant parameter estimate (at =5%)
* for signicant parameter estimate (at =10%)
b) parameter estimate and p-value
c) parameter estimate and t-statistic
d) parameter estimate and parameter standard error
e) your preferred choice
a. Pro: Choice for all conventional signicance levels and easy to recognize; Con: not as much infor-
mation as with p-value (densed information)
b. Pro: p-value with information for any signicance level; Con: not as fast/recognizable as stars; for
a specic null hypothesis
c. Pro: t-stat is basis for p-values; Con: no clear choice possible based on the information given (table);
for a specic null hypothesis
d. Pro: Is the basis for p-values and shows variance and thus if any signicance level is useful, can
be used to compute dierent null hypothesis; Con: no clear choice possible based on the information
given (table, t-stat)
e. I would take case b. because it gives the best information very fast.
4. Consider Fama-Frenchs asset pricing model and its compatible regression representation (see
your lecture notes of the rst week). Suppose you want to test the restriction that none of these three
risk factors play a role in explaining the expected excess return of the asset (that is the parameters
in the regression equation are all zero) State the null- and alternative hypothesis in proper statistical
terms and construct the Wald statistic for that hypothesis, i.e. dene R and r.
59
R
ej
t+1
=
1
R
em
t+1
+
2
HML
t+1
+
3
SMB
t+1
+
j
t+1
Hypothesis:
H
0
:
1
=
2
=
3
= 0; H
A
: at least one =0 bzw.
0
: R = r; H
A
: R = r
R =
1 0 0
0 1 0
0 0 1
; =
; r =
0
0
0
; b =
b
1
b
2
b
3
(126)
5. How is the result from multivariate statistics that (z )
1
(z )
2
(rows(z))withz
MV N(, ) used to construct the Wald statistic to test linear hypotheses about the parameters?
We compute E(r|x) = R E(b|x) = R; V ar(r|x) = R V ar(b|X)R
= R
2
(X
X)
1
R
; r|x =
Rb|x MV N(R, R
2
(X
X)
1
R)
Under the null hypothesis:
E(Rb|x)=r (=hypothesized value)
Var(Rb|x)=R
2
(X
X)
1
R
(=unaected by hypothesis)
Rb|x MV N(r,
2
(X
X)
1
R
) (=this is where the null hypothesis goes in)

Using Fact 4:
[Rb E(Rb|x)]
[V ar(Rb|x)]
1
[Rb E(Rb|x)]
[Rb r]
[R
2
(X
X)
1
R]
1
[Rb r]
2
(#r)
Sixth set
Suppose you have estimated a parameter vector b=(0.55,0.37,1.46,0.01) with an estimated variance-
covariance matrix.
V ar(b|X) =
0.01 0.023 0.0017 0.0005

0.023 0.0025 0.015 0.0097
0.0017 0.015 0.64 0.0006
0.0005 0.0097 0.0006 0.001
(127)
a) Compute the 95% condence interval each parameter b
k
.
b) What does the specic condence interval computed in a) tell you?
c) Why are the bounds of a condence interval for
k
random variables?
d) Another estimation yields an estimated b
k
with the corresponding standard error se(b
k
). You
conclude from computing the t-statistic t
k
=

k
k
se(b
k
)
that you can reject the null hypothesis H
0
: b
k
=
k
on the % signicance level. Now, you compute the (1-)% condence interval. Will
k
lie inside
or outside the condence interval?
a. Condence intervals for:
b
1
: 0.551.96*0.1=[0.354;0.746] (bounds as realizations of r.v.)
b
2
: 0.371.96*0.05=[0.272;0.468]
b
3
: 1.461.96*0.8=[-0.108;3.028]
b
4
: 0.011.96*0.032=[-0.0527;0.0727]
60
b. The rst two reject the H
0
m the other two dont reject on a 5% signicance level for a two-sided
test.
c. Bounds are a function of X (=r.v.): se(b
k
) = s
2
[X
X]
1
d. Lies outside as we reject the t-stat.
2. Suppose, computing the lower bound of the 95% condence interval yields b
k
t
/2
(n K)
se(b
k
) = 0.01. The upper bound is b
k
+t
/2
(nK)se(b
k
)=0.01. Which of the following statements
are correct?
With probability of 5% the true parameter
k
lies in the interval -0.01 and 0.01.
The null hypothesis H
0
:
k
=
k
cannot be rejected for values (0.01 beta
k
0.01) on
the 5% signicance level.
The null hypothesis H
0
:
k
= 1 can be rejected on the 5% signicance level.
The true parameter
k
is with probability 1 = 0.95 greater than -0.01 and smaller than
0.01.
The stochastic bounds of the 1 condence interval overlap the true parameter with proba-
bility 1 .
If the hypothesized parameter value
k
falls within the range of the 1 condence interval
computed from the estimates b
k
and se(b
k
) then we do not reject H
0
:
k
=
k
at the
signicance level of 5%.
false
correct
correct
false
correct (if we suppose that the null hypothesis is true)
correct
Seventh Set
1.
a) Show that if the regression includes a constant:
y
i
=
1
+
2
x
i2
+... +
K
xi
K
+
i
then the variance of the dependent variable can be written as:
1
N
N
i=1
(y
i
y)
2
=
1
N
N
i=1
( y
i
y)
2
+
1
N
N
i=1
2
i
(128)
61
Hint: y = y
b) Take your result from a) and formulate an expression for the coecient of determination R
2
.
c) Suppose, you estimated a regression with an R2 = 0.63. Interpret this value.
d) Suppose, you estimate the same model as in c) without a constant. You know that you cannot
compute a meaningful centered R
2
. Therefore, you compute the uncentered R
2
uc
: R
2
uc
=
y
y
y
y
= 0.84
Compare the two goodness of t measures in c) and d). Would you conclude that the constant can
be excluded because R
2
uc
> R
2
?
a)
1
N
(y
i
y)
2
. .. .
SST
(129)
=
1
N
( y
i
+e
i
y)
2
=
1
N
( y
i
y +e
i
)
2
=
1
N
( y
i
y)
2
+
1
N
(e
i
)
2
(130)
1
N
2( y
i
y)e
i
=
1
N
( y
i
y)
2
+
1
N
(e
i
)
2
1
N
2 y
i
e
i
+
1
N
2y
e
i
(131)
=
1
N
( y
i
y)
2
+
1
N
(e
i
)
2
1
N
2b
Xe
i
. .. .
0by FOC
+
1
N
2y
e
i
. .. .
0with constant
(132)
=
1
N
( y
i
y)
2
. .. .
SSE
+
1
N
(e
i
)
2
. .. .
SSR
(133)
(Is our desired result as y = y if x
i1
= 1)
b)
R
2
=
SSE
SST
= 1
SSR
SST
(134)
c) 63% of the variance of the dependent variable can be explained with the estimated variance.
d) R
2
and R
2
uc
can never be compared as they are based on dierent models.
2. In a hedonic price model the price of an asset is explained with its characteristics. In the following
we assume that housing pricing can be explained by its size sqrft (measured as square feet), the
number of bedrooms bdrms and the size of the lot lotsize (also measured as square feet. Therefore,
we estimate the following equation with OLS:
log(price) =
1
+
2
log(sqrft) +
3
bdrms +
4
log(lotsize) +
Results of the estimation can be found in the following table: (a) Interpret the estimates b
2
und b
3
.
(b) Compute the missing values for Std. Error and t-Statistic in the table and comment on the
statistical signicance of the estimated coecients (H
0
:
j
= 0 vs. H
1
:
j
= 0, j=0,1,2,3).
(c) Test the null hypothesis H
0
:
1
= 1vs.H
1
:
1
= 1.
(d) Compute the p-value for the estimate b
2
and interpret the result.
62
(e) What is the null hypothesis of that specic F- statistic missing in the table? How does it relate
to the R
2
? Compute the missing value of the statistic and interpret the result.
(f) Interpret the value of R-squared.
(g) An alternative specication of the model that excludes the lot size as an explanatory variable
provides you with values for the Akaike information criterion of -0.313 and a Schwartz criterion of
-0.229. Which specication would you prefer?
a) Everything else constant, an increase of 1% in size leads to a .7% increase in price.
Everything else constant, a one unit increase in bedrooms leads to a 3.696% increase in price.
b) Starting point: Coecient/Std.Err=T-stat
Se
2
=0.0929
T stat
3
=1.3425
The constant, sqrft and lotsize are signicant on a 5% signicant level; bdrms is not signicant
c)
1.297041
0.65128
= 3.52696 As the critical value is 1.98 we reject on a 5% signicance level.
d) bdrms=0.70023; t-stat: 7.54031 => almost 0 in table for N(0,1)
e) H
0
:
2
=
3
=
4
= 0 bzw.
0 1 0 0
0 0 1 0
0 0 0 1
0
0
0
(135)
F =
R
2
/k
(1 R
2
)/(n K)
=
0.64297/4
(1 0.64297)/84
=
0.1607
0.0042003
= 38.25 (136)
So we reject for any conventional signicance level
Alternative: get SSR
R
=

(y
i
y)
2
=n*sample variance of y
i
f) 64% of the variance of the dependent variable can be explained with the estimated variance.
g) -0.313 vs. -0.4968 for Akaike and -0.229 vs. -0.3842 for Schwartz. We prefer the rst model (with
lot size) as both AIC and SIC are smaller.
63
3. What can you do to narrow the parameter condence bounds? Discuss the possibilities: - Increase
- Increase n
Can you explain how the eect of an increase of the sample size on the condence bounds works?
- Increase in : not good, as type 1 error increases
- Increases n: standard error descreasesbounds decrease
4. When would you use the uncentered R
2
uc
and when would you use the centered R
2
. Why is the
uncentered R
2
uc
higher than a centered R
2
? What is the range of the R
2
uc
and R
2
?
R
2
uc
: without a constanthigher (as the level is explained too)
R
2
c
: with constantlower (explaines only the variance around the level)
Range for both is from 0 to 1.
5. How do you interpret a R
2
of 0.38?
38% of the variance of the dependent variable can be explained with the estimated variance.
6. Why would you use an adjusted R
2
adj
? What is the idea behind the adjustment of the R
2
adj
? Which
values can the R
2
adj
take?
R
2
adj
: any value between and 1. Has a penalty term for the use of degrees of freedom (heavy
parameterization) (Occams Razor)
7. What is the intuition behind the computation of AIC and SBC?
Find the smallest SSR, but adds a penalty term for the use of degrees of freedom (=heavy parame-
terization)
Eighth set
1. A researcher conducts an OLS regression with K = 4 with a computer software that is unfortunately
not able to report p-values. Besides the four coecients and their standard errors the program only
reports the t-statistics that test the null hypothesis H
0
:
k
= 0 against H
A
: k = 0 for k=1,2,3,4.
Interpret the t-statistics below and compute the associated p-values. (Interpret the p-values for a
reader who works with a signicance level =5%.)
a)t1=-1.99
b)t2=0.99
c)t3=-3.22
d)t4=2.3
T-stat shows where we are on the x-axis of the cdf.
Assumption: n-k>30. Then we reject for
1
,
3
,
4
using the critical value 1.96.
P-values:
a) T-stat in standard normal distribution is 0.9767. 2*(1-0.9767)=0.0466 reject
b) T-stat in standard normal distribution is 0.8398. 2*(1-0.8398)=0.3224 not reject
c) T-stat in standard normal distribution is close to one, so p-value close to zero reject
d) T-stat in standard normal distribution is 0.9893. 2*(1-0.9893)=0.0214 reject
64
2. Explain your own in words: What is convergence in probability? What is convergence almost surely?
Which concept is stronger?
Convergence in probability: The probability that the n-th element of a row of n numbers lies within
the bounds around a certain number (e.g. mean) is equal to 1 as n goes into innity.
Almost sure convergence: The probability that the n-th element of a row of n numbers is equal to a
certain number (e.g. mean) is equal to 1 as n goes into innity.
a.s is stronger as it requires Z

n
to be equal to whereas

p
only requires Z
n
to lie in bounds
around .
3. Illustrate graphically the concept of convergence in probability. Illustrate graphically a random
sequence that does not converge in probability.
See Vorlesungsunterlagen
4. Review the notion of non-stochastic convergence. In which way do we refer to the notion of non-
stochastic convergence when we consider almost sure convergence, convergence in probability and
convergence in mean square? What are the non-stochastic sequences in each of the three modes of
stochastic convergence?
In non-stochastic convergence we review what happens to a series when n goes into innity (there is
no randomness then in the elements of the series).
In

a.s
p
, und

m.s we have n going into innity as the non-stochastic component. As the stochastic
component we have to consider the dierent worlds of innite draws.
Examples for non-stochastic series:
A series of probabilities
p
1
Realizations of X

a.s.0
5. Explain in your own words: What does convergence in mean square mean? Does convergence in
mean square imply convergence in probability? Or does convergence in probability imply convergence
in mean square?
Convergence in mean square: The expectation value of the mean error is equal to 0 as n grows into
innity (also: the variance). Convergence in mean square implies convergence in probability.
6. Illustrate graphically the concept of convergence in distribution. What does convergence in distribu-
tion mean? Think of an example and provide a graphical illustration where the c.d.f. of the sequence
of random variables does not converge in distribution.
Convergence in distribution: The series Z
n
converges to Z, if the distribution function F
n
of Z
n
converges to the cdf of Z at each point of F
Z
.
With this mode of convergence, we increasingly expect to see the next outcome in a sequence of
random experiments becoming better and better be modeled by a given probability distribution.
Example where convergence in distribution doesnt work:
1. We draw random variables with a probability of 0.5 from a normal distribution with either mean 0
65
or with mean 1.
2. The distribution of b
n
, as it doesnt have a distribution (only a point)
3. We draw from a distribution where the variance doesnt exist. Computing the means doesnt have
a limit distribution.
7. I argue that extending convergence almost surely/in probability/mean square to vector sequences (or
matrices) does not increase complexeity as the basic concept remains the same. However, extending
convergence in distribution of a random sequence to a random vector sequence entails an increase of
complexity. Why?
For

a.s,
p
and

m.s extending the concepts to vectors only require elementwise-convergence.
For

d
all the k elements of the vectors have to be lower or equal to a constant at the same time for
every of the n elements of the series Z
n
(joint distribution).
8. Convergence in distribution implies the existence of a limit distribution. What do we mean by
that?
{Z
n
} has a distribution F
n
. The limit distribution F of Z
n
is the distribution for which we assume
that F
n
converges to at every point (F and F
n
belong to the same class of distributions and we just
adjust the parameters)
9. Suppose that the random sequence zn converges in distribution to z, where z is a
2
(2) random
variable. Write this formally using two alternative notations.
Z
n
d
Z
2
(2)
Z
n
d

2
(2)
10. What assumptions have to be fulllled that you can apply Khinchins WLLN?
{Z
i
} sequence of iid random variables
E(Z
i
) = <
11. I argue that extending the WLLN to a vector random sequences does not add complexeity to the
case of a scalar sequence, it just saves space because of notational convencience. Why?
for a vector we have to use elementwise convergence (of the means towards the expectation values).
We use the same argument as for convergence in probability (we just have to dene the means and
expectations for the rows for elementwise convergence).
Ninth set
1. What does the Lindeberg-Levy (LL) central limit theorem state? What assumptions have to be
fulllled that you can apply the LL CLT?
Formula:

n(z
n
)

d
N(0,
2
).
The dierence between the mean of a series and its expected value follows a normal distribution and
thus the mean of the series follows also a normal distribution (z
n

a
N(,

2
n
)).
Assumptions:
66
{Z
i
} iid
E(Z
i
) = <
V ar(Z
i
) =
2
<
WLLN can be applied
2. Name the concept that is associated to the following short hand notations and explain their meaning:
a. z
n
d
N(0, 1): Convergence in distribution
b. plim
n
z
n
= : Convergence in probability
c. z
n
m.s. : Convergence in mean square

d. z
n
d
z: Convergence in distribution
e.

n(z
n
)

d
MV N(0, ): Lindenberg-Levy CLT (Multivariate CLT)
f. y
n
p
: Convergence in probability
g. z
n

a
N(,

2
n
): CLZ (follows from the univariate CLT)
h. z
n
a.s. : Convergence almost surely/almost sure convergence

i. z
2
n
d

2
(1): Convergence in distribution/Lemma 2
3. Apply the seful lemmasf the lecture. State the name of the respective lemma(s) that you use.
Whenever matrix multiplications or summations are involved, assume that the respective operations
are possible. The underscore means that we deal with vector or matrix sequences. and A indicate
vectors or matrices of real numbers. I
m
is the identity matrix of dimension m:
a. z
n
p
, then: plimexp(z
n
) =?
b. z
n
d
z N(0, 1), then: z
2
n
d
?
c. z
n
d
z MV N(, ), A
n
p
A, then: A
n
(z
n
)
a
?
d. x
n
d
MV N(, ), y
n
p
, A
n
p
A, then: A
n
x
n
+y
n
d
?
e. x
n
d
x, y
n
is Op, then: x
n
+y
n
d
?
f. x
n
d
x MV N(, ), y
n
p
0, then: plimx
n
y
n
=?
g. x
n
d
MV N(0, I
m
), A
n
p
A, then: z
n
= A
n
x
n
d
?
and then: z
n
(A
n
A
n
)
1
z
n

a
?
a. Using Lemma 1: plimexp(z
n
) = exp() (if exp does not depend on n)
b. Using Lemma 2: z
2
n
d
[N(0, 1)]
2
=
2
(1)
c. Using Lemma 2: z
n
d
MV N(0, ) and then using Lemma 5: A
n
(z
n
)
a
MV N(0, AA
)
d. Using Lemma 5: A
n
x
n
d
A x = MV N(, AA
) and then using Lemma 3: A

n
x
n
+y
n
d
A x + = MV N( +, AA
)
e. Using Lemma 3: x
n
+y
n
d
x + 0
f. Using Lemma 4: plimx
n
y
n
p
0
67
g. Using Lemma 5: z
n
d
A x = MV N(0, AA
) and then using Lemma 6: z
n
(A
n
A
n
)
1
z
n

a
(Ax)
. .. .
z
(AA
)
1
. .. .
V ar(z)
(Ax)
. .. .
z
Fact 4

2
(rows(z))
4. When large sample theory is used to derive the properties of the OLS estimator the set of ass-
umptions for the nite sample properties of the OLS estimator are altered. Which assumptions are
retained? Which are replaced?
Two are retained:
(1.1)(2.1) Linearity
(1.3)(2.4) Rank condition
Four are replaced:
iid(2.2) Dependencies allowed
(1.2)(2.3) Replacing strict exogeneity by orthogonality: E(x
ik

i
) = 0
(1.4)deleted
(1.5)deleted
5. What are the properties of b
n
= (X
X)
1
X
y under the "newssumptions?

b
n
p
(consistency)
b
a
MV N(,
Avar(b)
n
) (approximate normal distribution)
Unbiasedness & eciency lost
6. What does CAN mean?
CAN = Consistent Asymptotically Normal
7. Does consistency of b
n
need an iid sample of dependent variable and regressors?
iid is a special case of a martingale dierence sequence. Necessary is only a stationary & ergodic
m.d.s., which means an identical but not an independent sample.
8. Where does a WLLN and where does a CLT come into play when deriving the properties of
b = (X
X)
1
X
y?
Deriving consitency: We use Lemma 1 twice. But to apply Lemma 1 on [
1
n
x
i
x
i
], we have to assure
that
1
n
x
i
x
p
E(x
i
x
i
) which we do by applying WLLN.
Deriving distribution: We dene
1
n
x
i
i
= g and apply CLT to get
n(gE(g
i
))

d
MV N(0, E(g
i
g
i
)).
This is then used together with Lemma 5.
Tenth Set
1. At which stage of the derivation of the consistency property of the OLS estimator do we have to
invoke a WLLN?
2. What does it mean when an estimator has the CAN property?
3. At which stage of the derivation of the asymptotic normality of the OLS estimator do we have to
invoke a WLLN and when a CLT?
1-3: see Ninth set
68
4. Which of the useful lemmas 1-6 is used at which stage of a) consistency proof and b) asymptotic
normality proof?
a) We use Lemma 1 to show that together with WLLN: [
1
n
x
i
x
i
]
1

p
[E(x
i
x
i
)]
1
b) We know that

n(b ) = [
1
n
x
i
x
i
]
1
. .. .
A
n
ng
....
x
n
. For:
A
n
p
A =
1
xx
and
x
n
d
x MV N(0, E(g
i
g
i
))
So we can apply Lemma 5:

n(b ) MV N(0,
1
xx
E(g
i
g
i
)
1
xx
)
5. Explain the dierence of the assumptions regarding the variances of the disturbances in the nite
sample context and using asymptotic reasoning.
Finite sample variance: E(
2
i
|X) =
2
(assumption 1.4)
Large sample variance: E(
2
i
|x
i
) =
2
We only condition on x
i
and not on all xs at the same time
6. There is a special case when the nite sample variances of the OLS estimator based on nite
sample assumptions and based on large sample theory assumptions (almost) coincide. When does this
happen?
When we have only one observation i.
7. What would you counter an argument of someone who says that working with the variance cova-
riance estimate s
2
(X
X)
1
is quite OK as it is mainly consistency of the parameter estimates that
counts?
Using s
2
(X
X)
1
we would have to assume conditional homoscedasticity and using (X
X)
1
S(X
X)
1
would allow us to get rid of that assumption.
8. Consider the following assumptions:
(a) linearity
(b) rank condition: KxK matrix E(x
i
x
i
) =
xx
is nonsingular
(c) predetermined regressors: E(g
i
) = 0 where g
i
= x
i
i
(d) g
i
is a martingale dierence sequence with nite second moments
i) Show that under those assumptions, the OLS estimator is approximately normally distributed:
n(b )

d
N(0,
1
xx
E(
2
i
x
i
x
i
)
1
xx
) (137)
Starting point: sampling error
We have

n(b ) = [
1
n
x
i
x
i
]
1
. .. .
A
n
ng
....
x
n
. For:
A
n
p
A =
1
xx
(only possible if
xx
is nonsingular)
x
n
d
x MV N(0, E(g
i
g
i
)) (as

n(g E(g
i
))

d
MV N(0, E(g
i
g
i
)) by CLT)
69
So we can apply Lemma 5:

n(b ) MV N(0,
1
xx
E(g
i
g
i
)
1
xx
. .. .
Avar(b)
)
b
n

a
MV N(,
Avar(b)
n
)
ii) Further, show that assumption 4 implies that the
i
are serially uncorrelated or E(
i

ij
) = 0.
E(g
i
|g
i1
, ..., g
1
). As x
i1
= 1 we focus on the rst row:
E(
i
|
i1
, ...,
1
,
i1
x
i1,2
, ...,
1
x
1K
) = 0
by LIE: E
z|x
[E(y|x, z)|x] = E(y|x) with x =
i1
, ...,
1
; y =
i
and z =
i1
x
i1,2
, ...,
1
x
1K
:
E(
i
|
i1
, ...,
1
) = 0
by LTE: E(
i
) = 0
Second step in exercise 19
9. Show that the test statistic
t
k
=
(b
k
k
)

Avar(b
k
)
n
(138)
converges in distribution to a standard normal distribution. Note, that bk is the k-th element of b
and Avar(b
k
) is the (k,k) element of the KxK matrix Avar(b). Use the facts, that

n(b
k

k
)

d
N(0, Avar(b
k
)) and

Avar(b)

p
Avar(b). Why is the latter true? Hint: use continuous mapping and
the Slutzky theorem (the seful lemmas")!
b
k

Avar(b
k
)
n
=
n(b
k
k
)

Avar(b
k
)
a
N(0, 1)
We know:
1.

n(b
k
k
)

d
N(0, Avar(b
k
)) by CLT and from joint distribution of the estimators.
2.
1

Avar(b
k
)
p
1
Avar(b
k
)
by Lemma 1 and from the derivation of

V ar(b
k
)
Lemma 5:
t
k
=
n(b
k
k
)

Avar(b
k
)
d
N(E(a x), V ar(a x)) = N(0, a
2
V ar(x)) = N(0,
Avar(b
k
)
Avar(b
k
)
2
) = N(0, 1)
10. Show that the Wald statistic:
W = (Rb r)
[R
Avar(b)
n
R
]
1
(Rb r) (139)
converges in distribution to a Chi-square with degrees of freedom equal to the number of restrictions.
As a hint, rewrite the equation above as W = c
n
Q
1
n
c
n
. Use Hayashis lemma 2.4(d) and the footnote
on page 41.
W = [Rb r]
[R

Avar(b)
n
R
]
1
[Rb r]
Under the null: Rb-r=Rb-R=R(b-)
c
n
= R
n(b
n
)

d
c MV N(0, R Avar(b) R
) (Lemma 2)
70
Q
n
= R

Avar(b) R

p
Q = R Avar(b) R
= V ar(c) (derivation of

V ar(b) and Lemma 1)
W
n
= c
n
Q
1
n
c
n
d
c
Q
1
....
V ar(c)
c
2
( #r
....
rows(c)
) (Lemma 6 + Fact 4)
11. What is an ensemble mean?
An ensemble is consisting of a large number of mental copies.
The ensemble mean is obtained by averaging all ensemble forecasts. This has the eect of ltering
out features of the forecast that are less predictable.
12. Why do we need the stationarity and ergodicity in the rst place?
Especially for time series we want to get rid of the iid assumption, as we observe dependencies. Using
s&e allows dependencies, but we still need to draw from the same distribution and the dependencies
should be less between draws that are far apart from each other.
But: Mean and variance do not change over timeWe need this for correct estimates, tests.
13. Explain the concepts of weak stationarity and strong stationarity.
Strong stationarity: the joint distribution of (z
i
, z
ir
) is the same as the joint distribution of (z
j
, z
jr
)
if i-ir=j-jr relative position
Weak stationarity: only the rst two moments are constant (not the whole distribution) and the
cov(z
i
, z
ij
) only depends on j.
15. When is a stationary process ergodic?
If the dependence decreases with distance so that:
lim
n
E(f(z
i
, z
i+1
, ..., z
i+k
) g(z
i+n
, z
i+n+1
, ..., z
i+n+l
)) =
E(f(z
i
, z
i+1
, ..., z
i+k
)) E(g(z
i+n
, z
i+n+1
, ..., z
i+n+l
))
16. Which assumptions have to be fullled to apply Kinchins WLLN? Which of these assumptions
are weakened by the ergodic theorem? Which assumption is used instead?
Assumptions:
{Z
i
} iidinstead: stationarity&ergodicity
E(Z
i
) = <
17. Which assumptions have to be ful
lled to apply the Lindeberg-Levy CLT? Is stationarity and ergodicity sucient to apply a CLT? What
property of the sequence {g
i
}0{
i
x
i
} do we assume to apply a CLT?
Assumptions:
{Z
i
} iidinstead: stationary&ergodic m.d.s.
E(Z
i
) = <
V ar(Z
i
) =
2
<
WLLN
18. How do you call a stochastic process for which E(g
i
|g
i1
, g
i2
, ...) = 0?
71
M.D.S
19. Show that if a constant is included in the model it follows from E(g
i
|g
i1
, g
i2
, ...) = 0, that
cov(
i
,
ij
) = 0j = 0 Hint: Use the law of iterated expectations.
From exercise 8: E(
i
|
i1
, ...) = E(
i
) = 0
Cov(
i
,
ij
) = E(
i

ij
)
. .. .
?
E(
i
)
. .. .
=0
E(
ij
)
E(
i

ij
) = 0
E[E(
i

ij
)|
i1
, ...,
ij
, ...,
1
] by LTE backwards
E[
ij
E(
i
|
i1
, ...,
ij
, ...,
1
)
. .. .
0(s.o.)
] = E(0) = 0 by Linearity of Conditional Expectations
Cov(
i
,
ij
) = 0
20. When using the heteroskedasticity-consistent covariance matrix,
Avar(b) =
1
xx
E(
2
i
x
i
x
i
)
1
xx
(140)
which assumption regarding the covariances of the disturbances
i
and
ij
do we make?
Cov(
i
,
ij
) = 0
Eleventh Set
1. When and why would you use GLS? Describe the limiting nature of the GLS approach.
GLS is used to estimate unknown parameters in a linear regression model. It is used when heterosce-
dasticity is present and when observations are correlated in the error term (cf. WLS).
Assumptions:
Linearity y
i
= x
i
+
i
Full rank
Strict exogeneity
V ar(|X) = sigma
2
V (X). V(X) has to be known, symmmetric and positive denite.
Limits:
V(X) normally not known, and thus we even have to estimate more. If V ar(|x) is estimated the
BLUE property is lost and

GLS
might even be worse than OLS.
If X not strictly exogeneous, GLS might be inconsistent
Large sample properties are more dicult to obtain.
2. A special case of the GLS approach is weighted least square (WLS). What diculties could arise
in a WLS estimation? How are the weights constructed?
Observations are wighted by standard deviations (higher penalty for wrong estimates). Used when all
the o-diagonal entries of (Var.-Cov.-Matrix) are zero.
Here: V ar(|X) =
2
V (x
i
) V (x
i
) = Z
i
(V (x
i
) typically unknown)
72
So again, the variance has to be estimated and thus our estimates might be inconsistent and even
less ecient than OLS.
3. When does exact multicollinearity occur? What happens to the OLS estimator in this case?
Occurs when one regressor is a linear combination of other regressors: rank(X)=K
Then: Assumption 1.3/2.4 violated and (X
X)
1
and thus the OLS estimator cannot be computed.
We can still estimate the linear combination of the parameters, though.
4. How is the OLS estimator and its standard error aected by (not exact) multicollinearity?
BLUE result is not aected
Var(b|X) is aected: Coecients have high standard errors (eect: wide condence intervals
and low t-stats and thus diculties with rejection)
Estimates may have the wrong sign
Small changes in the data produce wide swings in the parameters
5. Which steps can be taken to overcome the problem of multicollinearity?
Increase n (and thus the variance of the xs)
Get smaller
2
(=better tting model)
Get smaller correlation (=exclude some regressors, but: omitted variable bias)
Other ways:
Dont do anything (OLS is still BLUE)
Joint hypothesis testing
Use empirical data
Combine the multicollinear variables
Twelth Set
1. When does an omitted regressor not cause biased estimates?
With an omitted regressor we get: b
1
=
1
+ (X
1
X
1
)
1
X
1
X
2
. .. .
0?
2
+ (X
1
X
1
)
1
X
. .. .
0with strict exogeneity
2
= 0 if X
2
is not part of the model in the rst place
(X
1
X
1
)
1
X
1
X
2
= 0 if regression of X
2
on X
1
gives you zero coecients
2. Explain the term endogeneity bias. Give a practical economic example when an endogeniety bias
occurs.
73
Endogeneity = no strict exogeneity bzw. no predetermined regressots: Xs correlated with the error
term estimator is biased.
Example: Haavelmo: C
i
=
0
+
1
Y
i
+u
i
(I
i
part of u
i
and correlated with Y
i
)
3. What is solution to the endogeneity problem in a linear regression framework?
IV-variables: include variables from u
i
that are correlated with x
i
4. Use the same tools used to derive the CAN property of the OLS property to derive the CAN
property of the IV-estimator. Start with the sampling error and make your assumptions on the way.
(applicability of WLLN, CLT...)
Compute the sampling error:
(X
Z)
1
X
y
(X
Z)
1
X
(Z +)
(X
Z)
1
X
Z + (X
Z)
1
X

+ (X
Z)
1
X
= (X
Z)
1
X
= [
1
n
x
i
z
i
]
1 1
n
x
i
i
Consistency:

= [
1
n
x
i
z
i
]
1 1
n
x
i
y
i
p

1.
1
n
x
i
z
p
E(x
i
z
i
) and by Lemma 1:
[
1
n
x
i
z
i
]
1

p
[E(x
i
z
i
)]
1
2.
1
n
x
i
p
E(x
i
i
) = 0
= [
1
n
x
i
z
i
]
1 1
n
x
i
p
E(x
i
z
i
)
1
E(x
i

i
)
. .. .
0
= 0
Asymptotic normality:

n(
)

d
MV N(0, Avar(
))
n*sampling error:

n(
) = [
1
n
x
i
z
i
]
1
. .. .
A
n
n
1
n
x
i
i
. .. .
x
n
1. A
n
= [
1
n
x
i
z
i
]
1

p
A = E(x
i
z
i
)
1
2. x
n
=
ng
d
x MV N(0, E(g
i
g
i
)
. .. .
S.33
) (apply CLT as E(g
i
) = 0)
Lemma 5:

n(
)

d
MV N(0, E(x
i
z
i
)
1
E(g
i
g
i
)E(x
i
z
i
)
1
. .. .
Avar(
)
)
5. When would you use an IV estimator instead of an OLS estimator? (Hint: Which assumption of
the OLS estimation is violated and what is the consequence.)
When x
i
and
i
correlated (violated: Assumption 2.3)
6. Describe the basic idea of instrumental variables estimation. How are the unknown parameters
related to the data generating process?
IV can be used to obtain consistent estimators in presence of omitted variables.
x
i
and z
i
correlated; x
i
and u
i
not correlated x
i
= instrument
74
7. Which assumptions are necessary to derive the instrumental variables estimator and the CAN
property of the IV estimator?
Assumptions:
3.1 Linearity
3.2 Ergodic stationarity
3.3 Orthogonality conditions
3.4 Rank condition for identication (full rank)
8. Where do the assumptions enter the derivation of the instrumental variables estimator and the
CAN property of the IV estimator?
Derivation:
E(x
i

i
) = E(x
i
(y
i
Z
i
)
. .. .
3.1
) =
....
3.3
0
E(x
i
y
i
) E(x
i
z
i
) = 0
= [E(x
i
z
i
)]
1
. .. .
3.4
E(x
i
y
i
)
p
....
3.2
9. Show that the OLS estimator can be conceived as a special case of the IV estimator.
Relation:
E(
i
z
i
) = 0 and x
i
= z
i
. Then:
= [
1
n
z
i
z
i
]
1 1
n
z
i
y
i
= OLS estimator
10. What are possible sources endogeneity?
see additional page
75

Applied Eco No Metrics

Uploaded by

Document Informationclick to expand document information

Copyright:

Available Formats

Applied Eco No Metrics

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Applied Eco No Metrics

Uploaded by

Copyright:

Available Formats

Applied Econometrics

15. Juli 2010

1 buyer initiated trade

Theory: Asset pricing

(linear function of the xs)

|x) V ar(b|x)]a 0a = 0 (53)

is linear in y and unbiased and

= (D +A)y = Dy +Ay = D(X +) +b = DX +D +b (54)

|x) = DX +E(D|x) +E(b|X) = DX +D E(|x)

|x) = V ar[(D +A)|x] = [D +A]V ar(|x)[D +A]

projection matrix; use: y = py = xb (lies in space spanned by columnes

b (for x and y vectors of sample means)

e we get an unbiased estimator:

; lower crit value

; upper crit value

) (this is where the hypothesis goes in)

e and dividing by #r (to nd the distri-

](Rb r)/#r F(#r, n k) (82)

y no longer BLUE, but unbiased

hypothetical value of that solves E(g(w

, it is nevertheless prone to the "measurement without

y. Write out that expression in detail

y. This yields a K dimensional column vector of real numbers.

|X) var(b|X)]a 0 (119)

y. (We will be more

|x) V ar(b|X): We would like the distribution of the sample estimates to be as

y which results from this assumption?

e and using Fact 6:

0.00125 0.023 0.0017 0.0005

and M=In-P are symmetric (i.e. A=A) and idempotent, (i.e.

b (i.e. the regression

) (=this is where the null hypothesis goes in)

0.01 0.023 0.0017 0.0005

a.s is stronger as it requires Z

m.s. : Convergence in mean square

a.s. : Convergence almost surely/almost sure convergence

) and then using Lemma 3: A

) and then using Lemma 6: z

y under the "newssumptions?

You might also like