Lecture Notes Fall Term 2013
Advanced Econometrics I
1 Slightly modified
Literature
[1] Ash, R. B. and Doléans-Dade, C. (1999). Probability & Measure Theory. Academic Press.
[2] Bauer, H. (1990). Measure and Integration Theory. de Gruyter.
[3] Bauer, H. (1991). Wahrscheinlichkeitstheorie. de Gruyter.
[4] Billingsley, P. (1994). Probability and Measure. Wiley.
[5] Breiman, L. (2007). Probability. SIAM.
[6] Davidson, R. and MacKinnon, J.G. (2004). Econometric Theory and Methods. Oxford University Press.
[7] Dehling, H. and Haupt, B. (2004). Einführung in die Wahrscheinlichkeitstheorie und Statistik. Springer.
[8] Georgii, H.-O. (2007). Stochastik: Einführung in die Wahrscheinlichkeitstheorie und Statistik. de Gruyter.
Contents
2 Asymptotic theory
2.1 Convergence of expectations
2.2 Modes of convergence
2.2.1 Definitions
2.2.2 Relation between different modes of convergence
2.2.3 Discussion of convergence in distribution
2.2.4 Discussion of stochastic boundedness
2.3 Strong law of large numbers and central limit theorem
4 Linear regression
4.1 The classical model
4.2 Parameter estimation - the OLS approach
4.2.1 Estimation of β
4.2.2 Estimation of σ²
4.3 Hypothesis tests in the classical linear regression model
4.3.1 Introduction to statistical testing
4.3.2 Wald tests to test linear restrictions on the regression coefficients
4.3.3 Hypothesis tests in the classical linear regression model under normality
Introduction
Motivation:
study relationship between variables, e.g. consumption and income: How does raising income affect consumption behaviour?
...
Y = β0 + β1 X + ε,
e.g. Y consumption, X wage, ε error term
(2) application of these tools to the classical multiple linear regression model
application of these results to economic problems in Advanced Econometrics II/III and follow-up elective courses
1 Elementary probability theory
Setup:
Ω: set of the possible outcomes of an experiment (sample space),
e.g. Ω = N = {1, 2, ...}.
Unless otherwise stated, Ω is non-empty.
A ⊆ Ω: event,
e.g. A = {2, 4, 6, 8, ...}.
ω ∈ Ω: outcome.
We intend to define P(A), the probability of the event A, first for a countable sample space
Ω = {ω1, ω2, ω3, ...}
(e.g. Ω = N, Ω = Z).
Definition 1.1. A probability measure P on a countable set Ω is a set function that maps subsets of Ω to [0, 1] and has the following properties:
(i) P(∅) = 0 (here ∅ denotes the empty set),
(ii) P(Ω) = 1,
(iii) P(∪_{i=1}^∞ Ai) = ∑_{i=1}^∞ P(Ai) for Ai ⊆ Ω, i ∈ N, pairwise disjoint (i.e. Ai ∩ Aj = ∅ for i ≠ j).
Remark:
interpretation of (i): probability that nothing happens = 0.
interpretation of (iii): P(∪_{i=1}^k Ai) = ∑_{i=1}^k P(Ai) is reasonable for k < ∞, but questionable for k = ∞. Anyway, one cannot proceed without making this assumption.
Lemma 1.2. For a countable sample space Ω = {ωi}_{i∈I} (with countable I) a probability measure P is specified by
pi = P({ωi}) for i ∈ I.
For every A ⊆ Ω, it holds that
P(A) = ∑_{i: ωi ∈ A} pi.
An event {ω} (ω ∈ Ω) that only contains one element is also called an elementary event.
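To make Lemma 1.2 concrete, here is a minimal Python sketch with a made-up four-point distribution; the point masses pi below are arbitrary illustrative values, not taken from the notes.

```python
# Toy illustration of Lemma 1.2: on a countable sample space, a probability
# measure is determined by the point masses p_i = P({omega_i}).
p = {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.125}   # arbitrary p_i for Omega = {1, 2, 3, 4}

def prob(A):
    """P(A) = sum of p_i over the outcomes omega_i contained in A."""
    return sum(p[w] for w in A if w in p)

assert abs(prob({1, 2, 3, 4}) - 1.0) < 1e-12   # P(Omega) = 1
assert prob(set()) == 0                         # P(empty set) = 0
# finite additivity on the disjoint events {2,4} and {1,3}:
assert abs(prob({2, 4}) + prob({1, 3}) - prob({1, 2, 3, 4})) < 1e-12
```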
1.1.2 Arbitrary sample spaces
Typical examples: Ω = R (or R^k or [0, ∞))
Problem: Mathematics (general measure theory): It is often impossible to define P appropriately for all subsets A ⊆ Ω such that Definition 1.1(i)-(iii) holds; see e.g. Billingsley [4, p. 45f.].
Motivation: If we naively carried over the axioms of Definition 1.1(i)-(iii) to the present setting, then we would obtain:
Regarding (S i): P(∅) = 0.
Regarding (S ii): If I know the value of P(A), I will also know the value of P(A^C) = 1 − P(A).
Regarding (S iii): If I know the values of P(A1), P(A2), ..., then I will know the value of P(∪_{i=1}^∞ Ai). This is clear for pairwise disjoint A1, A2, .... For A1, A2, ... not pairwise disjoint, a slightly more complicated argument is needed (see Theorem 1.7(vi) below).
Definition 1.3. A class A of subsets of Ω with (S i)-(S iii) is called a σ-field (also: σ-algebra).
Definition 1.5. Suppose that A is a σ-field on a set Ω. Then the tuple (Ω, A) is called a measurable space.
A set function P: A → [0, 1] is a probability measure on (Ω, A) if
(a) P(Ω) = 1 and
(b) (σ-additivity) for A1, A2, ... ∈ A pairwise disjoint it holds that
P(∪_{i=1}^∞ Ai) = ∑_{i=1}^∞ P(Ai).
Example 1.6 (Dirac measure). Let (Ω, A) be a measurable space and ω0 ∈ Ω. The Dirac measure δ_{ω0} is then defined by
δ_{ω0}(A) := 1_A(ω0) ∀ A ∈ A.
The Dirac measure is indeed a probability measure.
Theorem 1.7 (Properties of probability measures). Suppose that (Ω, A, P) is a probability space. Then the following hold true:
(i) P(∅) = 0,
(ii) Finite additivity: A1, ..., An ∈ A pairwise disjoint imply P(∪_{i=1}^n Ai) = ∑_{i=1}^n P(Ai),
(iii) P(A^C) = 1 − P(A),
(iv) Monotonicity: A, B ∈ A, A ⊆ B implies P(A) ≤ P(B),
(v) Subtractivity: A, B ∈ A, A ⊆ B implies P(B\A) = P(B) − P(A),
(vi) Poincaré-Sylvester: P(A ∪ B) = P(A) + P(B) − P(A ∩ B),
(vii) Continuity from below: A1, A2, ... ∈ A, An ⊆ An+1, n ∈ N, implies P(An) → P(∪_{k=1}^∞ Ak) as n → ∞.
Proof. Exercise.
Suppose that A* is a field and define A as the smallest σ-field with A* ⊆ A (notation: A = σ(A*)). Now choose P: A* → [0, 1] with
(a) P(∅) = 0,
(b) P(Ω) = 1,
(c) for A1, A2, ... ∈ A* pairwise disjoint with ∪_{i=1}^∞ Ai ∈ A* it holds that
P(∪_{i=1}^∞ Ai) = ∑_{i=1}^∞ P(Ai).
This method can be applied to characterize probability measures on R endowed with a σ-algebra B.
To this end, we first introduce another function.
Definition 1.12. For a probability measure P on (R, B) the function F: R → [0, 1] given by
F(b) = P((−∞, b]) ∀ b ∈ R
is called a (cumulative) distribution function (CDF).
Theorem 1.13 (Properties of the CDF). Suppose that F is the distribution function of a probability measure P on (R, B). Then
(i) P((a, b]) = F(b) − F(a), b > a,
(ii) F is non-decreasing (i.e. F(b′) ≥ F(b) for b′ ≥ b),
(iii) F is continuous from the right (i.e. F(bn) → F(b) for bn ≥ b, bn → b (or for bn ↓ b)),
(iv) 1. lim_{x→−∞} F(x) = 0,
2. lim_{x→+∞} F(x) = 1.
(ii) Exercise.
(iii) Suppose that (bn)n is an arbitrary monotonously decreasing sequence in R with bn ↓ b. Then we have to show that
F(bn) → F(b) as n → ∞.
By continuity of probability measures from above,
F(bn) = P((−∞, bn]) → P(∩_{k=1}^∞ (−∞, bk]) = P((−∞, b]) = F(b) as n → ∞.
(iv) 1. Exercise.
2. For any monotonously increasing sequence (bn)n with bn ↑ ∞,
lim_{n→∞} F(bn) = lim_{n→∞} P((−∞, bn]) = P(∪_{k=1}^∞ (−∞, bk]) = P((−∞, +∞)) = 1.
Now the idea is to choose F with the properties (ii) to (iv) in the previous theorem and to define P on A4:
P((−∞, b]) = F(b)
and on A1:
P((a, b]) = F(b) − F(a), b > a.
Can this function be uniquely extended to a set function on B?
Theorem 1.14. Consider a function F: R → R satisfying (ii) to (iv) of Theorem 1.13. Then F is a distribution function (i.e. there exists a unique probability measure P on (R, B) with F(b) = P((−∞, b]) for all b ∈ R).
Now we extend this function as follows: P: A → [0, 1], where A consists of the empty set and all finite unions of sets of A1 and their complements, and for disjoint intervals
P(∪_{i=1}^n (ai, bi]) = ∑_{i=1}^n P((ai, bi]), with the notation (c, ∞] = (c, ∞).
Definition 1.15. A probability measure P on the measurable space (R, B) is discrete if there is an at most countable set A = {ai ∈ R | (... < a−1 < a0 <) a1 < a2 < ...} such that P(A) = 1.
If
P({ai}) = pi > 0,
then F has jumps at (... < a−1 < a0 <) a1 < a2 < ... with jump heights (..., p−1, p0,) p1, p2, ....
Parameters: 0 ≤ p ≤ 1, n ≥ 1.
2. Geometric distribution
P({i}) = p(1 − p)^{i−1} for i = 1, 2, ...
Parameter: 0 ≤ p ≤ 1.
3. Poisson distribution
P({i}) = e^{−λ} λ^i / i!, i = 0, 1, ...
Parameter: λ > 0.
Absolutely continuous probability measures
If there is a (Riemann) integrable function f: R → [0, ∞) such that
F(x) = ∫_{−∞}^x f(t) dt,
then f is called Riemann density (probability density) and the corresponding probability measure (and the CDF) is called absolutely continuous.
There is a more general definition of absolute continuity. However, this needs deeper mathematics. Therefore we will stick to the one above, which suffices for many applications. Moreover, note that if F is continuously differentiable at x and is absolutely continuous, then f(x) = F′(x).
Lemma 1.17. Suppose that f: R → [0, ∞) is a bounded function with at most finitely many points of discontinuity and such that ∫_{−∞}^∞ f(x) dx = 1. Then there exists a unique probability measure on (R, B) such that
P((a, b]) = ∫_a^b f(x) dx.
Proof. It suffices to show that x ↦ ∫_{−∞}^x f(t) dt defines a function satisfying (ii) to (iv) of Theorem 1.13 and to use Theorem 1.14. This in turn is straightforward.
Parameters: μ ∈ R, σ > 0.
2. Uniform distribution
f(x) = 1/(b − a) · 1_{[a,b]}(x)
Parameters: −∞ < a < b < ∞.
3. Exponential distribution
f(x) = λ e^{−λx} 1_{[0,∞)}(x)
Parameter: λ > 0.
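As a quick numerical illustration of Lemma 1.17, the following Python sketch recovers P((a, b]) for the exponential density by a Riemann sum and compares it with the closed-form value exp(−λa) − exp(−λb); the rate λ = 2 and the grid size are arbitrary choices for the illustration.

```python
import math

# Illustration of Lemma 1.17 for the exponential density f(x) = lam*exp(-lam*x)
# on [0, inf): approximate P((a, b]) = integral_a^b f(x) dx by a midpoint
# Riemann sum and compare with the closed form exp(-lam*a) - exp(-lam*b).
lam = 2.0   # arbitrary rate parameter

def f(x):
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def prob_interval(a, b, n=20000):
    # midpoint Riemann sum for the integral of f over (a, b]
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

exact = math.exp(-lam * 0.5) - math.exp(-lam * 1.5)
assert abs(prob_interval(0.5, 1.5) - exact) < 1e-6
```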
1.2.3 Extensions to R^k
The Borel σ-field B^k is the σ-field generated by the open intervals (a1, b1) × ... × (ak, bk). As in the real-valued case, probability measures on (R^k, B^k) are uniquely defined via the multivariate distribution function:
F(b1, ..., bk) = P({(x1, ..., xk): x1 ≤ b1, ..., xk ≤ bk}).
F is called absolutely continuous if
F(b1, ..., bk) = ∫_{−∞}^{b1} ... ∫_{−∞}^{bk} f(x1, ..., xk) dxk ... dx1.
In this case,
∂^k F / (∂x1 ... ∂xk) (x1, ..., xk) = f(x1, ..., xk).
For a more detailed discussion, we refer the reader to Billingsley [4].
1.3 Random Variables
(iii) Notation: Capital letters X, Y, Z, ... are used for random variables (exception: Greek letters, e.g. ε), lower case letters x, y, z, ... are used for elements of R^k and denote realisations (possible outcomes of X, Y, Z, ...). (Typical convention in statistics.) However, in econometrics one often uses x, y, z, ... for random variables and their realisations at the same time.
(iv) Suppose that (Ω, A) = (R, B). Then (real-valued) indicator functions, monotone functions, continuous functions and functions with only finitely many discontinuities are (Borel) measurable (see Jacod and Protter [10, Theorems 8.3, 8.4]).
For the indicator function it works as follows: Let A ∈ A and X = 1_A. Then we get X^{−1}(B) ∈ {∅, A, A^C, Ω} ⊆ A for every B ∈ B, so X is measurable.
(v) Suppose that (Ω, A) = (R, B) and that (Xn)n is a sequence of random variables on this space; then X1 + X2, X1 X2, sup_n Xn, inf_n Xn, and lim_n Xn (provided it exists) are random variables (see Jacod and Protter [10, Corollary 8.1, Theorem 8.4]).
For two random variables X1: Ω → R^{k1}, X2: Ω → R^{k2} defined on the same probability space, one defines the joint distribution P^{(X1′, X2′)′} as the distribution of the stacked R^{k1+k2}-valued random variable (X1′, X2′)′.
(ii) Suppose that (Xt)_{t∈T} with some nonempty index set T is a family of R^k-valued random variables on (Ω, A, P). These random variables are independent if for any finite, nonempty I0 ⊆ T and any Qt ∈ B^k, t ∈ I0,
P(∩_{t∈I0} Xt^{−1}(Qt)) = ∏_{t∈I0} P(Xt^{−1}(Qt)).
Independence is one of the most important tools to construct complex probabilistic models.
Lemma 1.22. If the distribution of an R^k-valued random variable X = (X1, ..., Xk)′ has a bounded, piecewise continuous density fX, then
fX1(x1) = ∫ ... ∫ fX(x1, ..., xk) dx2 ... dxk.
Note that the expectation of a random variable only depends on its distribution.
E 1A = P (A).
Definition 1.25 (General definition of expectation). (i) For a real-valued random variable X ≥ 0 define
Xn(ω) = k/n if k/n ≤ X(ω) < (k+1)/n for a k ∈ N0.
Define
EX = lim_{n→∞} EXn.
Remark:
Xn = ∑_{k=0}^∞ (k/n) 1_{X∈[k/n,(k+1)/n)} and hence EXn = ∑_{k=0}^∞ (k/n) P(X ∈ [k/n, (k+1)/n)).
sup_ω |Xn(ω) − Xm(ω)| ≤ max{n^{−1}, m^{−1}}.
lim_{n→∞} EXn always exists but might be infinite.
The definition of the expectation above can also be interpreted as an integral, the so-called Lebesgue integral, which is based on a vertical grid.
Alternative notations:
EX = ∫ x dP^X(x) = ∫ x P^X(dx) = ∫ x dF^X(x) = ∫ x F^X(dx) = ∫ X(ω) P(dω) = ∫ X(ω) dP(ω).
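The vertical-grid construction of Definition 1.25 can be sketched numerically. The following Python code computes EXn = ∑_k (k/n) P(X ∈ [k/n, (k+1)/n)) for X ~ Exp(1), where EX = 1 and the interval probabilities have the closed form e^{−k/n} − e^{−(k+1)/n}; the truncation point K is a numerical convenience, not part of the definition.

```python
import math

# Grid approximation E X_n of Definition 1.25 for X ~ Exp(1) (so EX = 1):
# E X_n = sum_k (k/n) * P(X in [k/n, (k+1)/n)),
# P(X in [k/n, (k+1)/n)) = exp(-k/n) - exp(-(k+1)/n).
def EX_n(n, K=50):
    total = 0.0
    for k in range(n * K):   # truncate the (negligible) tail beyond X = K
        p = math.exp(-k / n) - math.exp(-(k + 1) / n)
        total += (k / n) * p
    return total

# E X_n increases towards EX = 1 as the grid refines, with error at most 1/n:
assert EX_n(10) <= EX_n(100) <= 1.0
assert 1.0 - EX_n(100) < 1e-2
```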
Theorem 1.26 (Properties of expectations). For real-valued random variables X1, X2 with finite expectations and α1, α2 ∈ R it holds that
(i) |E[X1]| ≤ sup_ω |X1(ω)|,
(ii) E[α1 X1 + α2 X2] = α1 E[X1] + α2 E[X2] (linearity),
(iii) E[X1] ≤ E[X2] if X1 ≤ X2 (monotonicity),
(iv) E[X1 X2] = E[X1] E[X2] if X1, X2 are independent.
Remark: One can show that additivity of the expectation of positive random variables also holds if the corresponding expectations are infinite; in particular E|X| = EX⁺ + EX⁻.
Proof. The proofs of the first three items are deferred to the exercises.
(iv):
(1) Discrete case: Suppose that X1, X2 ≥ 0 and discrete. Say, X1(Ω) = {a1, a2, ...} and X2(Ω) = {b1, b2, ...}. For Y := X1 X2 ≥ 0 it follows that
E(X1 X2) = EY
= ∑_{yi ∈ Y(Ω)} yi P(Y = yi)
= ∑_{aj ∈ X1(Ω), bk ∈ X2(Ω)} aj bk P(X1 = aj, X2 = bk)
= ∑_{aj ∈ X1(Ω), bk ∈ X2(Ω)} aj bk P(X1 = aj) P(X2 = bk)   (by independence)
= (∑_{aj ∈ X1(Ω)} aj P(X1 = aj)) (∑_{bk ∈ X2(Ω)} bk P(X2 = bk))
= EX1 · EX2.
This implies the result because for n → ∞ the lower and the upper bound converge to E[X1]E[X2].
(3) General case:
Theorem 1.27. If the distribution function FX of X, given by FX(x) = P(X ≤ x), x ∈ R, has a bounded and piecewise continuous density fX and if E[X] is finite, then
E[X] = ∫_{−∞}^∞ x fX(x) dx.   (2)
Proof. See exercises for X ≥ 0. An extension to the general case is then straightforward.
Extensions of Theorem 1.27: Assume as above that fX is bounded and piecewise continuous and, moreover, that g is a piecewise continuous function that is either bounded or non-negative; then
Eg(X) = ∫_{−∞}^∞ g(x) fX(x) dx.   (3)
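Formula (3) is easy to check numerically. The Python sketch below evaluates Eg(X) for X ~ Exp(1) and g(x) = x², where E[X²] = 2 in closed form; the Riemann grid and the integration cutoff are arbitrary numerical choices.

```python
import math

# Numerical check of (3): E g(X) = integral g(x) f_X(x) dx,
# for X ~ Exp(1) (density exp(-x) on [0, inf)) and g(x) = x^2, with E[X^2] = 2.
def Eg(g, lam=1.0, upper=40.0, n=200000):
    # midpoint Riemann sum of g(x) * f_X(x) over [0, upper]
    h = upper / n
    return h * sum(g((i + 0.5) * h) * lam * math.exp(-lam * (i + 0.5) * h)
                   for i in range(n))

assert abs(Eg(lambda x: x * x) - 2.0) < 1e-3
```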
E|X| < ∞. Hence, substituting y = x − μ,
EX = ∫ x (2πσ²)^{−1/2} e^{−(x−μ)²/(2σ²)} dx = ∫ y (2πσ²)^{−1/2} e^{−y²/(2σ²)} dy + μ ∫ (2πσ²)^{−1/2} e^{−y²/(2σ²)} dy = μ,
since the first integral vanishes by symmetry and the second integral equals 1.
In particular, we see that there are random variables with the same expectation but different distributions.
Var(∑_{i=1}^n Xi) = E[(∑_{i=1}^n Xi − E[∑_{i=1}^n Xi])²]
= E[(∑_{i=1}^n Xi − ∑_{i=1}^n EXi)²]   (by linearity)
= E[(∑_{i=1}^n (Xi − EXi))²]   (by linearity)
= E[(∑_{i=1}^n (Xi − EXi)) (∑_{j=1}^n (Xj − EXj))]
= ∑_{i=1,j=1}^n E[(Xi − EXi)(Xj − EXj)]   (by linearity)
= ∑_{i≠j} E[(Xi − EXi)(Xj − EXj)] + ∑_{i=1}^n E[(Xi − EXi)²],
where E[(Xi − EXi)²] = Var(Xi).
By the fact that functions of independent random variables are also independent it follows for independent X1, ..., Xn that
Var(∑_{i=1}^n Xi) = ∑_{i≠j} E[Xi − EXi] E[Xj − EXj] + ∑_{i=1}^n Var(Xi) = ∑_{i=1}^n Var(Xi),
since E[Xi − EXi] = 0.
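The additivity of the variance under independence can be verified exactly by enumeration. The Python sketch below uses two independent fair dice as a toy example (exact rational arithmetic via the standard-library `fractions` module); the derivation above of course holds for any independent Xi.

```python
from itertools import product
from fractions import Fraction

# Exact check of Var(X1 + X2) = Var(X1) + Var(X2) for two independent fair dice.
outcomes = range(1, 7)
p = Fraction(1, 36)            # P(X1 = i, X2 = j) for each pair (i, j)

def E(h):
    """Expectation of h(X1, X2) by full enumeration."""
    return sum(p * h(i, j) for i, j in product(outcomes, outcomes))

var_sum = E(lambda i, j: (i + j) ** 2) - E(lambda i, j: i + j) ** 2
var_x1 = E(lambda i, j: i ** 2) - E(lambda i, j: i) ** 2
var_x2 = E(lambda i, j: j ** 2) - E(lambda i, j: j) ** 2
assert var_x1 == var_x2 == Fraction(35, 12)   # variance of one fair die
assert var_sum == var_x1 + var_x2 == Fraction(35, 6)
```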
For an R^k-valued random variable X = (X1, ..., Xk)′ such that EXj exists for all j, we define the expectation (vector) as
μ = E[X] = (E[X1], ..., E[Xk])′.
If E|Xj|² < ∞ for all j, the covariance matrix is defined by
Σ = E[(X − μ)(X − μ)′],
i.e. Σ is the k × k matrix with entries
Σ_{ij} = E[(Xi − μi)(Xj − μj)], i, j = 1, ..., k,
so that the diagonal entries are E[(Xi − μi)²] and the off-diagonal entries are E[(Xi − μi)(Xj − μj)].
2 Asymptotic theory
In applications, we are typically interested in the behaviour of X_{n0} for fixed n0. Via asymptotic theory we get an approximation by embedding X_{n0} into a sequence (Xn)n.
Suppose that for a sequence of random variables (Xn)n and a random variable X on a probability space (Ω, A, P)
Xn(ω) → X(ω) as n → ∞ for all ω ∈ Ω.   (4)
In general this is not sufficient for
EXn → EX as n → ∞
(see exercises). Our aim is now to set up additional conditions that assure convergence of the corresponding expectations. There are two main tools, the monotone convergence theorem and Lebesgue's dominated convergence theorem.
Then
EXn → EX as n → ∞.
If the random variables do not converge, one can obtain a weaker result.
Lemma 2.2 (Fatou's Lemma). Consider a sequence of nonnegative random variables (Xn)n (defined on the same probability space (Ω, A, P)) and define
X(ω) = lim inf_{n→∞} Xn(ω).
Then
EX ≤ lim inf_{n→∞} EXn.
Proof. Recall that lim inf_{n→∞} Xn(ω) = lim_{n→∞} (inf_{k≥n} Xk(ω)). First note that the sequence (Yn)n, given by Yn = inf_{k≥n} Xk, satisfies 0 ≤ Yn(ω) ≤ Yn+1(ω) ∀ n, ω, and Yn(ω) → X(ω) as n → ∞. By the monotone convergence theorem and the fact that Yn ≤ Xk ∀ k ≥ n, we obtain
EX = lim_{n→∞} EYn ≤ lim_{n→∞} inf_{k≥n} EXk = lim inf_{n→∞} EXn.
Using the latter result, Lebesgue's dominated convergence theorem can be proven.
Remark: Both theorems also hold if (4), (5), (6) only hold a.s. (almost surely), i.e. for all ω outside a set of probability zero.
2.2 Modes of convergence
2.2.1 Definitions
Let ‖·‖ denote a norm on R^k, e.g. ‖x‖ = ‖x‖1 = ∑_{j=1}^k |xj| or ‖x‖ = ‖x‖2 = (∑_{j=1}^k |xj|²)^{1/2}.
Definition 2.4. Suppose that (Xn)n and X are random variables on a probability space (Ω, A, P) and with values in (R^k, B^k).
(i) (Convergence in probability) The sequence (Xn)n converges in probability to X if
P({ω ∈ Ω | ‖Xn(ω) − X(ω)‖ > ε}) = P(‖Xn − X‖ > ε) → 0 as n → ∞, ∀ ε > 0.
Notation: Xn →ᴾ X, p-lim_{n→∞} Xn = X.
(ii) (Almost sure convergence) The sequence (Xn)n converges almost surely to X if
P({ω ∈ Ω | Xn(ω) → X(ω) as n → ∞}) = 1.
Notation: Xn → X P-a.s., Xn → X a.s.
(iii) (Convergence in p-th mean) The sequence (Xn)n converges to X in p-th mean (in Lp) if
E‖Xn − X‖^p → 0 as n → ∞.
Notation: Xn → X in Lp, Lp-lim_{n→∞} Xn = X.
Now we skip the assumption that all random variables live on the same probability space.
Definition 2.5. Suppose that (Xn)n and X are random variables with values in (R^k, B^k).
(i) (Convergence in distribution) The sequence (Xn)n converges to X in distribution if
Ef(Xn) → Ef(X) as n → ∞
for all bounded, continuous functions f: R^k → R.
Notation: Xn →ᵈ X.
(ii) (Stochastic boundedness) The sequence (Xn)n is stochastically bounded if for every ε > 0 there exists a C > 0 such that sup_n P(‖Xn‖ > C) ≤ ε.
Notation: Xn = OP(1).
Remark:
(i) Rough interpretation of convergence in probability: Xn is approximately equal to X.
(ii) Rough interpretation of convergence in distribution: Xn has approximately the same distribution as X.
Theorem 2.6. Suppose that (Xn)n and X are real-valued with distribution functions (Fn)n and F, respectively. Then
Xn →ᵈ X ⟺ Fn(x) → F(x) as n → ∞
at all continuity points of F.
Overview of the relations between the modes of convergence:
Xn → X in Lp ⟹ Xn → X in Lp−1 ⟹ ... ⟹ Xn → X in L1 ⟹ Xn →ᴾ X ⟹ Xn →ᵈ X,
and Xn → X a.s. ⟹ Xn →ᴾ X.
Lp-convergence (p ≥ 1):
Xn → X in Lp ⟹ Xn → X in Lp−1 ⟹ ... ⟹ Xn → X in L1.
This follows from Jensen's inequality:
(E‖Y‖)^p ≤ E(‖Y‖^p) for p ≥ 1,
see Billingsley [4].
Theorem 2.7 (Markov's inequality). For a random variable X and a (strictly) monotonously increasing function g: [0, ∞) → [0, ∞) it holds for ε > 0 that
P(‖X‖ ≥ ε) ≤ E g(‖X‖) / g(ε).
Proof. It holds that 1(x ≥ ε) ≤ g(x)/g(ε). We apply this inequality with x equal to ‖X(ω)‖. This gives
P(‖X‖ ≥ ε) = E[1(‖X‖ ≥ ε)] ≤ E[g(‖X‖)/g(ε)] = E[g(‖X‖)]/g(ε).
Corollary 2.8. Suppose that (Xn)n and X are random variables on some probability space (Ω, A, P). Then
Xn → X in p-th mean ⟹ Xn →ᴾ X.
Before we discuss further relations between modes of convergence, two other important applications
of Markov's inequality are presented.
Application 1. For a real-valued random variable X with finite second moment and ε > 0 it holds that
P(|X − EX| ≥ ε) ≤ Var(X)/ε²   (Chebychev's inequality).
Proof. The application of Markov's inequality with g(u) = u² and Y = X − EX yields the assertion.
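Chebychev's inequality can be checked exactly for a small discrete example. The Python sketch below uses a fair die (a toy choice, not from the notes) and ε = 2, with exact rational arithmetic.

```python
from fractions import Fraction

# Chebychev's inequality checked exactly for a fair die X, uniform on {1,...,6}:
# P(|X - EX| >= 2) <= Var(X) / 2^2.
p = Fraction(1, 6)
EX = sum(p * x for x in range(1, 7))                     # = 7/2
VarX = sum(p * (x - EX) ** 2 for x in range(1, 7))       # = 35/12
lhs = sum(p for x in range(1, 7) if abs(x - EX) >= 2)    # P(|X - EX| >= 2)

assert EX == Fraction(7, 2) and VarX == Fraction(35, 12)
assert lhs == Fraction(1, 3)          # only x = 1 and x = 6 deviate by >= 2
assert lhs <= VarX / 4                # Chebychev bound 35/48 holds
```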
Application 2.
Theorem 2.9 (Weak law of large numbers 1). Suppose that X1, X2, ... are uncorrelated random variables (i.e. cov(Xi, Xj) = 0, i ≠ j) with EX1 = EX2 = ... = μ ∈ R and Var X1, Var X2, ... ≤ σ². Then
X̄n = (1/n)(X1 + ... + Xn) →ᴾ μ.
Proof. As
E[(X̄n − μ)²] = Var(X̄n) = (1/n²)[Var(X1) + ... + Var(Xn)] ≤ σ²/n,
we obtain from Chebychev's inequality that
P(|X̄n − μ| ≥ ε) ≤ σ²/(ε² n) → 0 as n → ∞.
Theorem 2.10 (Weak law of large numbers 2). For a sequence of i.i.d. random variables X1, X2, ... with finite mean μ = E[Xj] it holds that X̄n →ᴾ μ.
Theorem 2.10 can be proved similarly to Theorem 2.9 but in a more complex manner. Here, Chebychev's inequality is applied to an average of truncated versions of X1, X2, ... instead of X̄n.
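A Monte Carlo illustration of the weak law of large numbers in Python: sample means of i.i.d. Uniform(0,1) draws concentrate around μ = 1/2. The seed, sample size, ε, and number of repetitions are arbitrary illustrative choices.

```python
import random

# WLLN illustration: X_bar_n for i.i.d. Uniform(0,1), mu = 1/2, Var = 1/12.
random.seed(0)   # fixed seed so the run is reproducible

def sample_mean(n):
    return sum(random.random() for _ in range(n)) / n

# Estimate P(|X_bar_1000 - 1/2| >= 0.05) by 200 repetitions; Chebychev's bound
# for this event is (1/12)/(0.05**2 * 1000) ~ 0.033, the true value is far smaller.
hits = sum(abs(sample_mean(1000) - 0.5) >= 0.05 for _ in range(200))
assert hits / 200 < 0.05
```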
Remark: In general, convergence in probability does not imply convergence in Lp, see exercises.
Convergence in probability does not imply a.s. convergence either. To see this, a counterexample is provided.
Example 2.12. Suppose that (Ω, A, P) = ([0, 1], B, Uniform[0, 1]) and define
Convergence in probability and convergence in distribution
Theorem 2.13. For R^k-valued random variables (Xn)n and X it holds:
Xn →ᴾ X ⟹ Xn →ᵈ X.
Proof. Choose ε > 0 and a bounded continuous function f: R^k → R. We will show that there exists an n0 > 0 such that
(*) |E[f(Xn)] − E[f(X)]| ≤ ε
for n ≥ n0. W.l.o.g. we assume that |f(x)| ≤ 1 ∀ x.
First, we choose C > 0 such that P(‖X‖ ≥ C) ≤ ε/6. Such a C exists because P(‖X‖ ≥ m) → 0 for m → ∞ by monotone convergence (or dominated convergence).
Second, choose 0 < δ ≤ 1 such that |f(x) − f(z)| ≤ ε/3 for all x, z with ‖x‖ ≤ C + 1, ‖z‖ ≤ C + 1 and ‖x − z‖ ≤ δ. Such a δ exists because f is uniformly continuous on the compact set {x: ‖x‖ ≤ C + 1}.
Third, choose n0 such that P(‖Xn − X‖ ≥ δ) ≤ ε/6 for n ≥ n0.
Now define the events An,1 = {‖X‖ ≥ C}, An,2 = {‖X‖ < C, ‖Xn − X‖ < δ}, and An,3 = {‖X‖ < C, ‖Xn − X‖ ≥ δ}. Denote the indicator functions of these events by 1_{An,1}, 1_{An,2} and 1_{An,3}. Define also Ãn,3 = {‖Xn − X‖ ≥ δ}. Note that An,3 ⊆ Ãn,3 and An,1 ∪ An,2 ∪ An,3 = Ω. We have now all parts of our argument prepared to show (*):
Theorem 2.14. For R^k-valued random variables (Xn)n on a probability space (Ω, A, P) and deterministic a ∈ R^k it holds:
Xn →ᴾ a ⟺ Xn →ᵈ a.
For the nontrivial direction one can use the bounded continuous function f(x) = ‖x − a‖/(1 + ‖x − a‖), for which
Ef(Xn) ≥ (ε/(1 + ε)) E[1_{‖Xn−a‖≥ε}] = (ε/(1 + ε)) P(‖Xn − a‖ ≥ ε).
Theorem 2.15 (Continuous mapping theorem (CMT)). Suppose that (Xn)n, X are random variables with values in (R^k, B^k) and that g: R^k → R^l is continuous at every point of a set C with P(X ∈ C) = 1. Then convergence of Xn to X (in distribution, in probability, or almost surely) implies the corresponding convergence of g(Xn) to g(X).
Proof. Exercise.
Example 2.16 (Plug-in principle). Suppose that we observe realizations of the R^k-valued random variables X1, ..., Xn and we want to estimate an unknown parameter θ ∈ R^p. Then
θ̂n = T(X1, ..., Xn)
with some measurable function T: R^{kn} → R^p is called a parameter estimator. A sequence of estimators (θ̂n)n for an unknown parameter θ is consistent if θ̂n →ᴾ θ. If (θ̂n)n is a consistent sequence of estimators for a parameter θ and if g is a continuous function, then g(θ̂n) →ᴾ g(θ). E.g. suppose that X1, ..., Xn are i.i.d. (independent and identically distributed) with P^{X1} = Exp(λ), λ > 0, and values in (0, ∞). We search for a consistent estimator of λ. Since by the WLLN X̄n = (1/n) ∑_{k=1}^n Xk →ᴾ EX1 = 1/λ, the CMT gives that the definition λ̂n = 1/X̄n yields a consistent sequence of estimators for λ.
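The plug-in estimator from Example 2.16 can be simulated directly in Python; the true rate λ = 2, the sample size, and the seed are arbitrary illustrative choices.

```python
import random

# Plug-in principle: X_1,...,X_n i.i.d. Exp(lam), so X_bar_n -> 1/lam in
# probability (WLLN) and lam_hat = 1/X_bar_n -> lam by the CMT.
random.seed(1)   # fixed seed for reproducibility
lam = 2.0        # arbitrary true parameter
n = 100000

xbar = sum(random.expovariate(lam) for _ in range(n)) / n
lam_hat = 1.0 / xbar
assert abs(lam_hat - lam) < 0.05   # generous tolerance for this n
```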
Even though our definition of distributional convergence is technically suitable, it is sometimes not very convenient to apply when checking distributional convergence. Next, we state a useful tool to deduce multivariate distributional convergence from the univariate one.
Theorem 2.17 (Cramér-Wold device). Suppose that (Xn)n and X are R^k-valued random variables. Then
Xn →ᵈ X ⟺ a′Xn →ᵈ a′X ∀ a ∈ R^k.
Proof. For the proof, we refer the reader to Theorem 29.4 in Billingsley [4].
Remark. In particular, the theorem implies that the distribution of a random vector X is uniquely determined by the distributions of a′X for all a ∈ R^k. This is used in computer tomography to recover images. It is also useful to deduce a multivariate CLT from a univariate one, see below.
Theorem 2.18. Suppose that (Xn)n, X are R^k-valued random variables. Choose l ∈ N. The following statements are equivalent:
(i) Xn →ᵈ X.
(ii) Ef(Xn) → Ef(X) as n → ∞ for all functions f: R^k → R that are continuous and have bounded support.
(iii) Ef(Xn) → Ef(X) as n → ∞ for all functions f: R^k → R that are l-times differentiable and bounded.
(iv) E[exp(ia′Xn)] → E[exp(ia′X)] as n → ∞ for all a ∈ R^k. Here i = √(−1). The function a ↦ E[exp(ia′X)] is also called the characteristic function or Fourier transform of X.
Proof. Clearly, (i) ⟹ (ii). The converse can be proved using the ideas of the proof of Theorem 2.11. (i) ⟺ (iii) can be found in Pollard [11, Theorem III.3.12]. The proof of (i) ⟺ (iv) can be found in Section 29 in Billingsley [4].
Lemma 2.19 (Slutsky's Lemma). For R^k-valued random variables (Xn)n, (Zn)n and X it holds:
‖Xn − Zn‖ →ᴾ 0, Zn →ᵈ X ⟹ Xn →ᵈ X.
Proof. Note that the functions considered in Theorem 2.18(ii) are uniformly continuous. Using this particular characterization of convergence in distribution, we get
|Ef(Xn) − Ef(X)| ≤ |Ef(Zn) − Ef(X)| + E[|f(Zn) − f(Xn)| 1_{‖Zn−Xn‖≤δ}] + 2 ‖f‖∞ P(‖Zn − Xn‖ > δ).
The right-hand side is less than any prescribed ε > 0 whenever δ > 0 is chosen sufficiently small and n ≥ n0(δ), where n0 has to be chosen sufficiently large.
2.2.4 Discussion of stochastic boundedness
Recall: (Xn)n is stochastically bounded if for every ε > 0 there exists a C > 0 such that sup_n P(‖Xn‖ > C) ≤ ε.
Notation: Xn = Op(1)
Why do we not use
P(‖Xn‖ ≤ C) = 1 ∀ n, for some C?
In this sense, also constant sequences (Xn)n would not be bounded, i.e. sequences with Xn ≡ X (the whole sequence Xn is identical to a fixed random variable X). Note that in general it does not hold that
P(‖X‖ ≤ C) = 1 for some C.
Proof. We show the statement for real-valued Xn and X with continuous CDF FX. The general proof follows from Theorem 29.1 in Billingsley [4].
For ε > 0 choose x_ε, y_ε with FX(x_ε) < ε/2 and FX(y_ε) > 1 − ε/2. Then:
This implies that P(x_ε < Xn ≤ y_ε) > 1 − ε for n ≥ n0 if n0 is chosen large enough.
We write Xn = oP(1) for R^k-valued random variables (Xn)n if Xn →ᴾ 0k. This notation is often applied via the following calculation rules.
Theorem 2.21.
(i) Xn = X + op(1) ⟹ Xn = Op(1) (Xn, X scalar or vector or matrix).
(ii) For Xn = op(1), Yn = op(1), Un = Op(1), Wn = Op(1) it holds
(a) Xn + Yn = op(1),
(b) Un + Wn = Op(1),
(c) Un Wn = Op(1),
(d) Xn Un = op(1).
(iii) Let g: R^k → R^l be continuous at x0. Then
Xn = x0 + op(1) ⟹ g(Xn) = g(x0) + op(1).
Proof. (i) Apply Theorems 2.13 and 2.20: We have that Xn − X = oP(1) and therefore Xn →ᴾ X, which in turn implies Xn →ᵈ X and hence Xn = OP(1).
(ii) (a) See exercises.
(b) Can be shown analogously to (a).
(c) We only state the real-valued case. The general case can be treated similarly, invoking sub-multiplicativity of certain matrix norms and equivalence of matrix norms. The assertion follows from choosing KW such that P(|Wn| > KW) ≤ ε/2 and then choosing K_ε such that P(|Un| > K_ε/KW) ≤ ε/2.
(d) Can be verified analogously to (c).
(iii) This is a corollary of the CMT in the case where g is continuous everywhere: Similar to (i), we get that Xn →ᵈ x0 and therefore also g(Xn) →ᵈ g(x0). Finally, apply Theorem 2.14 to deduce the assertion.
If c_n^{−1} Xn = Op(1) for a sequence of positive numbers (cn)n, one writes
Xn = Op(cn).
If c_n^{−1} Xn = op(1), one writes also
Xn = op(cn).
For example,
√n (θ̂n − θ) = Op(1) ⟺ θ̂n = θ + Op(1/√n).
The following theorems provide very powerful tools for asymptotic statistics. Still, their proofs are lengthy. Therefore, we skip them and only give the corresponding references.
Theorem 2.22 (Strong law of large numbers (SLLN)). For a sequence of i.i.d. random variables X1, X2, ... on some probability space (Ω, A, P) with finite mean μ = E[Xj] it holds that X̄n → μ almost surely.
Theorem 2.23 (Central limit theorem for i.i.d. sequences). Let X1, X2, X3, ... be i.i.d. real-valued random variables with EXi = μ, Var(Xi) = σ² ∈ (0, ∞). Then
(X1 + ... + Xn − nμ) / (σ √n) →ᵈ Z ∼ N(0, 1).
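A Monte Carlo sketch of Theorem 2.23 in Python: standardized sums of i.i.d. Uniform(0,1) draws (μ = 1/2, σ² = 1/12) are approximately N(0,1); we compare the empirical frequency of {Zn ≤ 1} with Φ(1). The seed, sample sizes, and tolerance are arbitrary choices.

```python
import random
import math

# CLT illustration: Z_n = (X_1+...+X_n - n*mu)/(sigma*sqrt(n)) for i.i.d.
# Uniform(0,1) summands is approximately standard normal for moderate n.
random.seed(2)   # fixed seed for reproducibility
mu, sigma = 0.5, math.sqrt(1 / 12)
n, reps = 100, 2000

def z():
    s = sum(random.random() for _ in range(n))
    return (s - n * mu) / (sigma * math.sqrt(n))

phi_1 = 0.5 * (1 + math.erf(1 / math.sqrt(2)))   # Phi(1), approx. 0.8413
freq = sum(z() <= 1 for _ in range(reps)) / reps
assert abs(freq - phi_1) < 0.05
```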
Theorem 2.24 (Lyapounov CLT). Let X1, X2, ... be independent real-valued random variables with μt = EXt, σt² = Var(Xt) and m3,t = E|Xt − μt|³ < ∞. Assume
[∑_{t=1}^n m3,t]^{1/3} / [∑_{t=1}^n σt²]^{1/2} → 0 as n → ∞.
Then
(X1 + ... + Xn − μ1 − ... − μn) / (σ1² + ... + σn²)^{1/2} →ᵈ Z ∼ N(0, 1).
An R^k-valued random variable X with density
φ(x) = (2π)^{−k/2} (det Σ)^{−1/2} exp(−0.5 (x − μ)′ Σ^{−1} (x − μ)), x ∈ R^k,
has a multivariate normal distribution with mean μ ∈ R^k and covariance matrix Σ ∈ R^{k×k}, which is assumed to be positive definite. One can show that a′X ∼ N(a′μ, a′Σa) for any a ∈ R^k \ {0k}.
Theorem 2.25 (Multivariate CLT). Suppose that X1, X2, ... are i.i.d. R^k-valued random variables with mean vector μ and finite, positive definite covariance matrix Σ. Then
(1/√n)(X1 + ... + Xn − nμ) →ᵈ Z̃ ∼ N(0k, Σ).
Proof. This proof is an application of the Cramér-Wold device and a one-dimensional CLT.
Choose a ∈ R^k: a′X1, ..., a′Xn are one-dimensional i.i.d. random variables with mean a′μ and variance a′Σa. Thus
(a′X1 + ... + a′Xn − n a′μ) / (√n (a′Σa)^{1/2}) →ᵈ Z ∼ N(0, 1)
⟹ (1/√n) a′(X1 + ... + Xn − nμ) →ᵈ Z ∼ N(0, a′Σa)
⟹ (1/√n)(X1 + ... + Xn − nμ) →ᵈ Z̃ ∼ N(0k, Σ).
3 Conditional expectations, probabilities and variances
For a square-integrable, real-valued random variable Y and an R^k-valued random variable X, consider minimizing over measurable functions g the mean squared error
E[{Y − g(X)}²].   (∗)
Definition 3.1. Each (measurable) function g that minimizes (∗) is called conditional expectation of Y given X.
Notation:
E[Y|X] = g(X), E[Y|X = x] = g(x).
Remark:
One can show that E[Y|X] exists, i.e. there is a measurable function g that minimizes (∗); see Chapters 22 and 23 in Jacod and Protter [10].
Note that E[Y|X] = g(X) is a random variable, while E[Y|X = x] is a real number.
Conditional expectations can also be defined if only E|Y| < ∞ or Y ≥ 0, but this definition is less intuitive. Therefore, we stick to this version and discuss the general variant only briefly later on.
Recall the relation between expectation and probability, E1A = P(A). Now, we define conditional distributions via conditional expectations.
Definition 3.2. Suppose that X and Z are random variables with values in (R^k, B^k) and (R^l, B^l), respectively. Then a conditional distribution of Z given X is defined as
P^{Z|X}(B) = P(Z ∈ B | X) = E(1_{Z∈B} | X), B ∈ B^l.
Moreover, P^{Z|X=x}(B) = E(1_{Z∈B} | X = x) is called conditional distribution of Z given X = x.
Remark: Note that again, P^{Z|X} is random. One can show that the corresponding minimizers can be chosen such that P^{Z|X} and P^{Z|X=x} are probability measures (a.s.) (these are called regular conditional distributions). This also holds for random variables with other state spaces (under certain assumptions). In this case E(Y | X = x) = ∫ y dP^{Y|X=x}(y); see Theorem 34.5 in Billingsley [4].
For discrete X and Y, minimizing (∗) amounts to minimizing
∑_x ∑_y [y − g(x)]² P(Y = y, X = x), or equivalently, for each x, ∑_y [y − g(x)]² P(Y = y, X = x)/P(X = x).
Note that the latter quotient gives a probability measure on the discrete state space of Y. Since we know that the mean of a square-integrable random variable Z minimizes E(Z − a)², we obtain
E(Y | X = x) = g(x) = ∑_y y P(Y = y, X = x)/P(X = x).
More generally speaking, we re-obtain the definition of conditional distribution for discrete random variables as
P^{Y|X=x}(A) = P^{X,Y}({x} × A) / P^X({x}) if P^X({x}) > 0.   (+)
For other values of x define P^{Y|X=x}(A) as you like (the set of such x is a P^X-null set).
Example 3.3. Suppose that X and Y describe two independent dice experiments and define Z = X + Y. Then, for x = 1, ..., 6,
E(Z | X = x) = ∑_{z=2}^{12} 6z P(Z = z, X = x) = ∑_{z=2}^{12} 6z P(Y = z − x, X = x) = ∑_{z=x+1}^{x+6} z/6 = x + 3.5.
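The computation in Example 3.3 can be verified exactly by enumeration in Python (exact rational arithmetic via the standard-library `fractions` module):

```python
from fractions import Fraction

# Exact enumeration for Example 3.3: two independent fair dice X, Y, Z = X + Y,
# and E(Z | X = x) = sum_z z * P(Z = z, X = x) / P(X = x) = x + 3.5.
p = Fraction(1, 36)          # P(X = x, Y = y) for each pair
for x in range((1), 7):
    px = Fraction(1, 6)      # P(X = x)
    cond_exp = sum((x + y) * p for y in range(1, 7)) / px
    assert cond_exp == Fraction(2 * x + 7, 2)   # = x + 3.5
```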
Example 3.4. Suppose that X1, ..., Xn are i.i.d. random variables with X1 ∼ Bin(1, θ), where θ ∈ (0, 1) is an unknown parameter. Then
P(X1 = x1, ..., Xn = xn) = θ^{∑_{i=1}^n xi} (1 − θ)^{n − ∑_{i=1}^n xi}.
We consider the statistic T(X := (X1, ..., Xn)′) = ∑_{i=1}^n Xi, i.e. T ∼ Bin(n, θ). Then
P(X = (x1, ..., xn)′ | T(X) = k) = 1/(n choose k) if T(x) = k, and 0 else,
which is independent of θ, i.e. T already contains the whole information of the observations X1, ..., Xn regarding θ. A statistic with this property is called sufficient. These kinds of statistics can be used to construct so-called UMVU estimators for the unknown parameter (uniformly minimal variance unbiased).
In the absolutely continuous case, minimizing
E[{Y − g(X)}²] = ∫∫ {y − g(x)}² f_{X,Y}(x, y) dx dy = min!
amounts to minimizing, for each x,
∫ {y − g(x)}² f_{X,Y}(x, y) dy = min!
This yields
g(x) = ∫ y f_{X,Y}(x, y) dy / ∫ f_{X,Y}(x, y) dy = ∫ y f_{X,Y}(x, y) dy / fX(x) a.s.
⟹ E[Y|X] = ∫ y f_{X,Y}(X, y) dy / fX(X) a.s. and E[Y|X = x] = ∫ y f_{X,Y}(x, y) dy / fX(x) a.s.
More generally, for arbitrary fX and for x fixed, E[Y|X = x] is the mean of the distribution with density
f_{Y|x}(y) = f_{X,Y}(x, y)/fX(x) if fX(x) > 0, and any density else;
see e.g. Example 33.5 in Billingsley [4]. This distribution is the conditional distribution of Y given X = x. The (random) distribution with density f_{Y|X} is the conditional distribution of Y given X.
Theorem 3.5. For an R^k-valued random variable X and a real-valued random variable Y assume that EY² < ∞. Then the following are equivalent (TFAE):
(i) g(X) = E[Y|X] a.s.
(ii) E[{Y − g(X)} h(X)] = 0 for all measurable functions h with E[h²(X)] < ∞.
(iii) E[{Y − g(X)} h(X)] = 0 for all measurable functions h: R^k → {0, 1}.
Remark:
(iii) can be rewritten as E[Y 1(X ∈ B)] = E[g(X) 1(X ∈ B)] for all (Borel) sets B ⊆ R^k.
Property (iii) is often used as definition of a conditional expectation. Note that it does not require that E[Y²] < ∞. It suffices to assume that E|Y| < ∞ or Y ≥ 0. Conditional expectations are (more generally) typically defined under these conditions.
There exists an even more general notion of conditional expectations. Suppose that Y is a random variable defined on a probability space (Ω, A, P) with E|Y| < ∞. Suppose that A0 ⊆ A is a sub-σ-field of A. Then the random variable Y0 = E[Y|A0] is defined as an A0-measurable random variable that fulfills:
E[Y0 1B] = E[Y 1B] for all sets B ∈ A0.
Under the additional assumption of E[Y²] < ∞ this is equivalent to:
see Satz 15.8 in Bauer [3]. This notion of conditional expectations generalizes conditional expectations of the form E[Y|X] because of E[Y|X] = E[Y|A0] if A0 is equal to the σ-field generated by X, i.e. A0 = {X^{−1}(C): C measurable}. An example for such conditional expectations are cases of time series where A0 denotes the σ-field of events of the past.
G = g minimizes E[{Y − G(X)}²]
⟹ ∂/∂a E[{Y − g(X) − a h(X)}²] |_{a=0} = 0 for all measurable functions h with E[h²(X)] < ∞
⟹ ∂/∂a (E[{Y − g(X)}²] − 2a E[{Y − g(X)} h(X)] + a² E h²(X)) |_{a=0} = 0
for all measurable functions h with E[h²(X)] < ∞
⟹ E[{Y − g(X)} h(X)] = 0 for all measurable functions h with E h²(X) < ∞.
(i) ⟸ (ii):
Assume that
E[{Y − g(X)} h(X)] = 0 for all measurable functions h with E h²(X) < ∞.
This implies
Proof. Suppose that there exists an ε > 0 such that P(g₁(X) − g₂(X) ≥ ε) > 0. From Theorem 3.5(iii) we get with h(x) = 1{g₁(x) − g₂(x) ≥ ε}
Theorem 3.7 (Iterated expectations). Suppose that Y is a square-integrable, real-valued random variable, X is an R^k-valued random variable and Z is an R^l-valued random variable on a probability space (Ω, A, P). Then
(i) E[E[Y|X]] = EY,
(ii) E[E[Y|X, Z]|Z] = E[Y|Z] a.s.,
(iii) E[E[Y|X]|X, Z] = E[Y|X] a.s.,
(iv) E[Y f(X)|X] = f(X) E[Y|X] a.s., where f is an R-valued function such that E[f²(X)] + E[Y f(X)]² < ∞,
(v) E[E[Y|X]|f(X)] = E[Y|f(X)] a.s.,
(vi) E[Y|X, Z, X², XZ] = E[Y|X, Z] a.s., where k = l = 1.
Remark: (i)-(iii) are also called Law of Iterated Expectations (LIE) or tower property.
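The tower property (i) and the orthogonality characterization of Theorem 3.5(ii) can be illustrated by a small Monte Carlo experiment. The model below (Y = 2X + U with U independent of X and mean zero, so that E[Y|X] = 2X) is an illustrative choice, not taken from the notes:

```python
import numpy as np

# Monte Carlo illustration of the tower property and of Theorem 3.5(ii).
# Illustrative model: Y = 2*X + U with U independent of X, E[U] = 0,
# so a version of E[Y|X] is g(X) = 2*X.
rng = np.random.default_rng(0)
n = 1_000_000
X = rng.normal(size=n)
U = rng.normal(size=n)          # independent of X, mean zero
Y = 2 * X + U
g_X = 2 * X                     # version of E[Y|X]

# (i) E[E[Y|X]] = E[Y]: both sample means estimate EY = 0
print(abs(g_X.mean() - Y.mean()))        # small

# Theorem 3.5(ii): Y - g(X) is orthogonal to square-integrable h(X)
h_X = np.cos(X)
print(abs(((Y - g_X) * h_X).mean()))     # close to 0
```

Any other square-integrable h (e.g. polynomials of X) gives the same near-zero sample covariance, in line with Theorem 3.5(ii).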
We want to show
f(Z) = E[Y|Z] a.s.
This is equivalent to
E[f(Z)h(Z)] = E[Y h(Z)] for all measurable h with E[h²(Z)] < ∞.
The latter follows from
(iii) Exercise.
(iv) Suppose that g(X) is a version of E[Y|X]; then E(h(X)[Y − g(X)]) = 0 for any square-integrable function h. In particular E(h(X)f(X)[Y − g(X)]) = 0 for any function h that takes values 0 or 1 only. The application of Theorem 3.5(iii) finally yields the assertion.
(vi) This follows directly by the application of the definitions of conditional expectations. Now put W = (X, Z)′ and g(W) = (X², XZ)′.
(v) Due to (10) we obtain
Example 3.8 (Application of (vi)).
E[Wage | education, experience] = β₀ + β₁ educ + β₂ exper + β₃ educ·exper + β₄ educ²
= E[Wage | educ, exper, educ², educ·exper] a.s.
Theorem 3.9 (Properties of conditional expectation). Suppose that Y₁, Y₂ are square-integrable real-valued random variables, X is an R^k-valued random variable on a probability space (Ω, A, P) and a₁, a₂ are scalars. Then
(i) E[a₁Y₁ + a₂Y₂|X] = a₁E[Y₁|X] + a₂E[Y₂|X] a.s.,
(ii) Y₁ ≤ Y₂ ⟹ E[Y₁|X] ≤ E[Y₂|X] a.s.,
(iii) (E[XY|Z])² ≤ E[X²|Z] E[Y²|Z] a.s. for real-valued, square-integrable X and another random variable Z on the same space (Cauchy-Schwarz inequality),
(iv) φ : R → R convex, E[φ(Y)²] < ∞ ⟹ φ(E[Y|X]) ≤ E[φ(Y)|X] a.s. (Jensen inequality),
(v) 0 ≤ Yₙ ↑ Y ⟹ E[Yₙ|X] ↑ E[Y|X] a.s. (monotone convergence),
(vi) P(|Y| ≥ ε | X) ≤ ε⁻² E[Y²|X] a.s. for any ε > 0.
Proof. (i) Apply Theorem 3.5(iii) and linearity of the ordinary expectation.
The next theorem describes the relation between independence and conditional distributions:
Proof. Obviously the second part follows from the first one in conjunction with the remark below Definition 3.2. Moreover,
3.2 Conditional Variances
Definition 3.11. For a real-valued random variable Y (with E[Y⁴] < ∞) and an R^k-valued random variable X on a probability space (Ω, A, P) a conditional variance of Y given X is defined as
Var[Y|X] = E[(Y − E[Y|X])² | X].
(i) Var[a(X)Y + b(X)|X] = E(a²(X)(Y − E(Y|X))² | X) = a²(X) Var[Y|X] a.s.
(ii) E[Var(Y|X)] + Var(E[Y|X]) = E(E(Y²|X) − [E(Y|X)]²) + E[E(Y|X)]² − (EY)² = E[Y²] − (EY)² = Var(Y).
(iii) Exercise.
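The variance decomposition Var(Y) = E[Var(Y|X)] + Var(E[Y|X]) can be verified numerically. The model below is an illustrative choice (not from the notes) with known conditional mean X and conditional variance 1 + X², so that Var(Y) = (1 + EX²) + Var(X) = 3:

```python
import numpy as np

# Monte Carlo check of Var(Y) = E[Var(Y|X)] + Var(E[Y|X]).
# Illustrative model: E[Y|X] = X, Var(Y|X) = 1 + X^2, X standard normal,
# hence Var(Y) = (1 + 1) + 1 = 3.
rng = np.random.default_rng(1)
n = 1_000_000
X = rng.normal(size=n)
U = rng.normal(size=n)
Y = X + np.sqrt(1 + X**2) * U

lhs = Y.var()                          # Var(Y)
rhs = (1 + X**2).mean() + X.var()      # E[Var(Y|X)] + Var(E[Y|X])
print(lhs, rhs)                        # both close to 3
```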
4 Linear regression
Remark:
Y: dependent variable, regressand
Strict exogeneity implies Eε_i = 0 (tower property of conditional expectation), which is not restrictive (existence of β₀).
Homoskedasticity: E(ε_i² | X) = σ² a.s., i.e. Var(ε_i | X) = σ² a.s. (using E(ε_i | X) = 0).
4.2.1 Estimation of β
We aim to establish an estimator for the unknown parameter β. The ordinary least squares estimator (OLS estimator) is the minimizer of Σ_{i=1}^n (Yᵢ − Xᵢ′β)². In matrix notation we get the following definition.
Definition 4.2. In the classical linear regression model an OLS estimator is given by
β̂_OLS = arg min_{β ∈ R^{K+1}} (Y − Xβ)′(Y − Xβ).
Theorem 4.3. In the classical linear regression model the OLS estimator is unique a.s. and
β̂_OLS = (X′X)⁻¹X′Y a.s.
Proof. We only work on a set Ω₀ with P(Ω₀) = 1 and such that for each ω ∈ Ω₀, X(ω) has full rank. Set Q(β) = (Y − Xβ)′(Y − Xβ). Then
∂Q(β)/∂β = −2X′Y + 2X′Xβ
and, therefore, ∂Q(β)/∂β = 0_{K+1} iff β = (X′X)⁻¹X′Y. We found the only candidate for a minimizer. It remains to show that β̂_OLS indeed minimizes (not maximizes) the function Q. Therefore, suppose that β̃ ≠ β̂_OLS and obtain (with probability 1)
Q(β̃) > Q(β̂_OLS).
Remark:
Note that if a matrix A has full column rank, then A′A is positive definite. See Exercise.
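The closed form β̂_OLS = (X′X)⁻¹X′Y of Theorem 4.3 can be checked against a generic least-squares solver. The simulated design and the true coefficient values below are arbitrary illustrative choices:

```python
import numpy as np

# OLS via the normal equations, (X'X)^{-1} X'Y, compared with numpy's
# least-squares solver. Data are simulated for illustration.
rng = np.random.default_rng(2)
n, K = 200, 2
X = np.column_stack([np.ones(n)] + [rng.normal(size=n) for _ in range(K)])
beta = np.array([1.0, 2.0, -0.5])        # (K+1)-vector incl. intercept
Y = X @ beta + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)      # (X'X)^{-1} X'Y
beta_lstsq = np.linalg.lstsq(X, Y, rcond=None)[0]
print(np.max(np.abs(beta_hat - beta_lstsq)))      # agree up to rounding
```

Solving the linear system with `np.linalg.solve` is numerically preferable to forming the inverse (X′X)⁻¹ explicitly.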
The residuals e = Y − Xβ̂_OLS satisfy the normal equations, which can be written as
(1/n) Σ_{i=1}^n Xᵢeᵢ = 0_{K+1}.
Moreover,
β̂_OLS = S_XX⁻¹ s_XY a.s.
with
S_XX = (1/n) Σ_{i=1}^n XᵢXᵢ′ (sample mean of XᵢXᵢ′) and s_XY = (1/n) Σ_{i=1}^n XᵢYᵢ (sample mean of XᵢYᵢ).
The following result summarizes the finite-sample properties of the OLS estimator.
Theorem 4.4 (Gauss-Markov theorem). In the classical linear regression model the OLS estimator is BLUE (best linear unbiased estimator), that is, for any (conditionally) unbiased estimator β̃ of β that is linear in Y,
Cov(β̃ | X) − Cov(β̂_OLS | X) ≥ 0 a.s.
Remark: Cov(β̃ | X) − Cov(β̂_OLS | X) ≥ 0 means that x′(Cov(β̃ | X) − Cov(β̂_OLS | X))x ≥ 0 for all x ∈ R^{K+1}. Taking x as the k-th unit vector, this implies in particular that Var(β̃_k | X) ≥ Var(β̂_OLS,k | X).
Proof. Linearity and unbiasedness of the OLS estimator follow immediately from Theorem 4.3 and it remains to prove optimality. To this end, suppose that β̃ = AY is another linear, conditionally unbiased estimator of β, i.e.
How much of the variation of the dependent variable can be explained by the variability of the regressors? We could use ‖Y − Ȳₙ1ₙ‖₂² − ‖Y − Xβ̂_OLS‖₂². However, this would lead to a scale-dependent measure. Instead let us consider
R² = 1 − ‖Y − Xβ̂_OLS‖₂² / ‖Y − Ȳₙ1ₙ‖₂²,
which is referred to as the coefficient of determination in the literature. This is indeed the fraction of variability of Y that can be explained by X since
Σ_{i=1}^n (Yᵢ − Ȳₙ)² = Σ_{i=1}^n (Ŷᵢ − Ȳₙ)² + Σ_{i=1}^n eᵢ² + Sₙ,
where the three terms on the right are the variability of the regression, the variability of the residuals, and Sₙ (the left-hand side is the total variability of Y). Here, Ŷᵢ = Xᵢ′β̂_OLS and eᵢ = Yᵢ − Ŷᵢ denotes the i-th residual, and we use X′e = 0_{K+1} (normal equations) to see that
Sₙ = 2 Σ_{i=1}^n (Ŷᵢ − Ȳₙ)eᵢ = 2( β̂_OLS′ X′e − Ȳₙ Σ_{i=1}^n eᵢ ) = 0 + 0 = 0.
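A numeric check of the decomposition of the total sum of squares and of R², using a simulated regression with an intercept (the column of ones is what makes the residuals sum to zero); the data-generating process is an illustrative choice:

```python
import numpy as np

# Check SST = SSR + SSE (i.e. S_n = 0) and compute R^2 on simulated data.
rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
Y = 1.0 + 0.8 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])     # intercept included

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ beta_hat
e = Y - Y_hat

sst = np.sum((Y - Y.mean())**2)          # total variability
ssr = np.sum((Y_hat - Y.mean())**2)      # variability of regression
sse = np.sum(e**2)                       # variability of residuals
r2 = 1 - sse / sst
print(sst, ssr + sse)                    # equal up to rounding
print(r2)                                # between 0 and 1
```

Without an intercept column the residuals need not sum to zero and the decomposition (and hence this definition of R²) can fail.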
For the rest of this paragraph, we assume the observations (Yᵢ, Xᵢ,₁, ..., Xᵢ,K), i = 1, ..., n, to be i.i.d. and study the asymptotics of the OLS estimator for β. To this end, we have to deal with asymptotics for matrices. We can apply our Definition 2.4, just using a matrix norm, e.g. the 2-norm of a p×q matrix,
‖A‖ = ( Σ_{i=1}^p Σ_{j=1}^q |aᵢⱼ|² )^{1/2}.
Theorem 4.5 (Consistency of the OLS estimator). In the classical linear regression model with i.i.d. (Yᵢ, Xᵢ,₁, ..., Xᵢ,K), i = 1, 2, ..., we assume that EX₁X₁′ is finite and invertible. Then
β̂_OLS,n →^P β.
Proof. Recall that
β̂_OLS = S_XX⁻¹ s_XY a.s.
and note that S_XX →^P EX₁X₁′ by the WLLN. Using that the inversion of a matrix is a continuous transformation and the CMT for convergence in probability (see Theorem 1.9.5 in van der Vaart and Wellner [13]), we get
S_XX⁻¹ →^P (EX₁X₁′)⁻¹.
Again by the WLLN, s_XY converges in probability to EX₁Y₁ = (EX₁X₁′)β + EX₁ε₁ = (EX₁X₁′)β. Therefore,
β̂_OLS = (S_XX⁻¹ − (EX₁X₁′)⁻¹)s_XY + (EX₁X₁′)⁻¹s_XY
= o_P(1)O_P(1) + (EX₁X₁′)⁻¹(EX₁X₁′)β + o_P(1)
= β + o_P(1).
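Consistency can be observed directly in simulation: the estimation error shrinks as n grows. The design, true coefficients, and tolerance below are illustrative choices, and the bound is a loose Monte Carlo tolerance rather than a theoretical rate:

```python
import numpy as np

# Illustration of Theorem 4.5: beta_hat approaches beta as n grows.
rng = np.random.default_rng(4)
beta = np.array([1.0, -2.0])

def beta_ols(n):
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    Y = X @ beta + rng.normal(size=n)
    return np.linalg.solve(X.T @ X, X.T @ Y)

err_small = np.max(np.abs(beta_ols(100) - beta))
err_large = np.max(np.abs(beta_ols(100_000) - beta))
print(err_small, err_large)   # the second error is typically much smaller
```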
Theorem 4.6 (Asymptotic normality of β̂_OLS). In the classical linear regression model with i.i.d. (Yᵢ, Xᵢ,₁, ..., Xᵢ,K), i = 1, 2, ..., we assume that EX₁X₁′ is finite and invertible. Then
√n (β̂_OLS − β) →^d Z ~ N(0_{K+1}, σ²(E(X₁X₁′))⁻¹).
Proof. First note that β̂_OLS − β = (X′X)⁻¹X′ε almost surely. Moreover, the multivariate CLT gives
(1/√n) X′ε = (1/√n) Σ_{i=1}^n Xᵢεᵢ →^d Z̃ ~ N(0_{K+1}, σ²(E(X₁X₁′))).
To sum up, we have
√n (β̂_OLS − β) = (n⁻¹X′X)⁻¹ (1/√n) X′ε = o_P(1)·O_P(1) + (E(X₁X₁′))⁻¹ (1/√n) X′ε = o_P(1) + Zₙ,
where Zₙ = (E(X₁X₁′))⁻¹ (1/√n) X′ε →^d Z by CMT. Finally, Slutsky's Lemma yields the result.
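The limit covariance σ²(EX₁X₁′)⁻¹ can be checked by Monte Carlo: over many replications, the sample covariance of √n(β̂_OLS − β) should be close to it. The design below (intercept plus one standard normal regressor, so EX₁X₁′ = I₂) and all tolerances are illustrative choices:

```python
import numpy as np

# Monte Carlo check of Theorem 4.6: the empirical covariance of
# sqrt(n)*(beta_hat - beta) should approximate sigma^2 * (E X1 X1')^{-1}.
rng = np.random.default_rng(5)
n, reps, sigma = 200, 2000, 1.0
beta = np.array([0.5, 1.5])

draws = np.empty((reps, 2))
for r in range(reps):
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    Y = X @ beta + sigma * rng.normal(size=n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    draws[r] = np.sqrt(n) * (beta_hat - beta)

# Here E X1 X1' = I_2, so the limit covariance is sigma^2 * I_2.
emp_cov = np.cov(draws, rowvar=False)
print(emp_cov)   # approximately the 2x2 identity matrix
```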
If σ² is known, this suggests the interval
Iₙ^(k) = [ β̂_k,OLS − z_{1−α/2} σ √([(X′X)⁻¹]_{k,k}) , β̂_k,OLS + z_{1−α/2} σ √([(X′X)⁻¹]_{k,k}) ],
where z_q denotes the q-quantile of the standard normal distribution. Hence, P(β_k ∈ Iₙ^(k)) →_{n→∞} 1 − α, i.e. Iₙ^(k) is an asymptotic (1−α)-confidence interval for β_k. It is desirable to provide confidence intervals in the more realistic setting that σ² is unknown.
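A coverage check for the known-σ interval above: across many simulated samples, the interval should contain β_k with probability close to 1 − α. The data-generating process is an illustrative choice, and z_{0.975} ≈ 1.96 is hard-coded:

```python
import numpy as np

# Coverage of beta_hat_k +/- z_{0.975} * sigma * sqrt([(X'X)^{-1}]_{kk})
# over repeated samples with known sigma (illustrative simulation).
rng = np.random.default_rng(6)
n, reps, sigma, z975 = 100, 2000, 1.0, 1.959964
beta = np.array([1.0, 2.0])
k = 1

hits = 0
for _ in range(reps):
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    Y = X @ beta + sigma * rng.normal(size=n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    se = sigma * np.sqrt(np.linalg.inv(X.T @ X)[k, k])
    if beta_hat[k] - z975 * se <= beta[k] <= beta_hat[k] + z975 * se:
        hits += 1
print(hits / reps)   # close to the nominal level 0.95
```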
4.2.2 Estimation of σ²
The aim of this paragraph is to establish an estimator for the variance of the error terms based on the OLS estimator for the regression coefficients.
Definition 4.7. If n > K + 1, the OLS estimate of the variance σ² > 0 is given by
σ̂²_OLS = e′e / (n − K − 1).
Due to the spherical error variance assumption the latter term reduces to
E(e′e | X) = σ² Σ_{i=1}^n (Iₙ − X(X′X)⁻¹X′)_{i,i} = σ² ( n − Σ_{i=1}^n (X(X′X)⁻¹X′)_{i,i} )
and it remains to show that the sum on the r.h.s. is equal to K + 1 almost surely. This in turn follows from
Σ_{i=1}^{K+1} (X′X(X′X)⁻¹)_{i,i} = Σ_{i=1}^{K+1} (I_{K+1})_{i,i} = K + 1
if we can show that Σ_{i=1}^n (X(X′X)⁻¹X′)_{i,i} = Σ_{i=1}^{K+1} (X′X(X′X)⁻¹)_{i,i}. To see this, let us consider a p×q matrix A and a q×p matrix B. The trace of a square matrix C is defined as tr(C) = Σᵢ C_{i,i}. Then it holds that tr(AB) = Σ_{i=1}^p Σ_{j=1}^q A_{i,j}B_{j,i} = Σ_{j=1}^q Σ_{i=1}^p B_{j,i}A_{i,j} = tr(BA). Finally, put A = X(X′X)⁻¹ and B = X′.
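Both ingredients of this argument, the trace identity tr(AB) = tr(BA) and the degrees-of-freedom count tr(X(X′X)⁻¹X′) = K + 1, are easy to verify numerically on a simulated design (an illustrative choice):

```python
import numpy as np

# Check tr(AB) = tr(BA) and tr(X (X'X)^{-1} X') = K + 1.
rng = np.random.default_rng(7)
n, K = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K))])

H = X @ np.linalg.inv(X.T @ X) @ X.T      # "hat" matrix, an n x n projection
print(np.trace(H))                        # = K + 1 = 4 up to rounding

A = rng.normal(size=(4, 6))
B = rng.normal(size=(6, 4))
print(np.trace(A @ B) - np.trace(B @ A))  # 0 up to rounding
```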
Theorem 4.9. In the classical linear regression model with i.i.d. (Yᵢ, Xᵢ,₁, ..., Xᵢ,K), i = 1, 2, ..., we assume that EX₁X₁′ is finite and invertible. Then σ̂²_OLS →^P σ² as n → ∞.
Proof. It suffices to show that (1/n) Σ_{i=1}^n eᵢ² →^P σ². By the WLLN, consistency of β̂_OLS and since S_XX = O_P(1) we have
(1/n) Σ_{i=1}^n eᵢ² = (1/n) Σ_{i=1}^n (εᵢ − Xᵢ′(β̂_OLS − β))²
= (1/n) Σ_{i=1}^n εᵢ² − 2(β̂_OLS − β)′ (1/n) Σ_{i=1}^n Xᵢεᵢ + (β̂_OLS − β)′ ( (1/n) Σ_{i=1}^n XᵢXᵢ′ ) (β̂_OLS − β)
= σ² + o_P(1).
Typically, X models the vector of observations and θ is an unknown parameter lying in the parameter space Θ. Based on our data X we aim to decide a test problem of the following form
H₀ : θ ∈ Θ₀ vs. H₁ : θ ∈ Θ₁ = Θ\Θ₀. (11)
Definition 4.13. Suppose that (X, A, {P_θ | θ ∈ Θ}) is a statistical experiment and φ is a test to decide the problem (11).
(i) A type I error (error of the first kind) occurs when H₀ is true but rejected.
(ii) A type II error (error of the second kind) occurs when H₁ is true but H₀ is not rejected.
Example 4.14 (Example 4.10 cont'd). Typically, both errors occur; see the sketch in the lecture.
Hence, the idea is to minimize the type II error under the condition that the type I error probability is at most a prescribed level α.
Definition 4.15. In the set-up of Definition 4.13,
(i) a test φ is an α-test if E_θ φ(X) = P_θ(φ(X) = 1) ≤ α for all θ ∈ Θ₀,
(ii) a sequence of tests (φₙ)ₙ is consistent if P_θ(φₙ(X) = 1) →_{n→∞} 1 for all θ ∈ Θ₁.
We consider linear hypotheses of the form
H₀ : Rβ = q
for some prescribed (r × (K+1)) matrix R of rank r and a prescribed r-dimensional vector q.
Example 4.16. This general hypothesis covers several interesting special cases.
1. H₀ : β_k = 0 with R = (0, ..., 0, 1, 0, ..., 0) (the 1 in the (k+1)-th position) and q = 0
2. H₀ : β₀ = β₁ with R = (1, −1, 0, ..., 0) and q = 0
3. H₀ : β₀ + β₁ + β₂ = 1 with R = (1, 1, 1, 0, ..., 0) and q = 1
4. H₀ : β₀ = β₁ = β₂ = 0 with
R = ( 1 0 0 0 ... 0 )
    ( 0 1 0 0 ... 0 )   and q = (0, 0, 0)′.
    ( 0 0 1 0 ... 0 )
Under the conditions of Theorem 4.6, we know that √n(β̂_OLS − β) is asymptotically normal. This property carries over to √n(Rβ̂_OLS − Rβ) by the CMT and the fact that linear transformations of normal random variables are normally distributed again. This relation is now invoked to construct a so-called Wald test. For its definition we require knowledge of a certain distribution.
Here Γ denotes the so-called Gamma function, i.e. Γ(a) = ∫₀^∞ x^{a−1}e^{−x} dx, a > 0.
It can be shown that the sum of the squares of l independent standard normal variables is χ²_l distributed; see Jacod and Protter [10, Example 6, Chapter 15].
Definition 4.18. A Wald test of level α ∈ (0, 1) for H₀ : Rβ = q based on n observations Zᵢ = (Yᵢ, Xᵢ,₁, ..., Xᵢ,K), i = 1, ..., n, is given by
φₙ((Z₁, ..., Zₙ)) = 0 if n σ̂_OLS⁻² (Rβ̂_OLS − q)′(R S_XX⁻¹ R′)⁻¹(Rβ̂_OLS − q) ∈ [χ²_{r,α/2}, χ²_{r,1−α/2}], and 1 else,
where χ²_{r,p} denotes the p-quantile of the χ²_r distribution.
Theorem 4.19. Suppose that n > K + 1 and that the conditions of Theorem 4.6 hold. Then a sequence of Wald tests (φₙ)ₙ is an asymptotic α-test and consistent.
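A minimal sketch of computing the Wald statistic for the joint hypothesis β₁ = β₂ = 0 (case 4 of Example 4.16 with two restrictions). The simulation below generates data under H₀; the one-sided rejection rule W > χ²_{2,0.95} ≈ 5.991 is a common variant of the definition above, and the hard-coded quantile and the model are illustrative assumptions:

```python
import numpy as np

# Wald statistic n * sigma_hat^{-2} (R b - q)' (R S_XX^{-1} R')^{-1} (R b - q)
# for H0: beta_1 = beta_2 = 0 (r = 2 restrictions), simulated under H0.
rng = np.random.default_rng(8)
n = 500
beta = np.array([1.0, 0.0, 0.0])              # H0 holds in the simulation
R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
q = np.zeros(2)

X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ beta + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta_hat
sigma2_hat = e @ e / (n - 3)                  # n - K - 1 with K = 2
S_xx = X.T @ X / n

d = R @ beta_hat - q
W = n / sigma2_hat * d @ np.linalg.solve(R @ np.linalg.inv(S_xx) @ R.T, d)
print(W, W > 5.991)   # under H0, W is approximately chi^2_2 distributed
```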
Proof. Step 1: Under H₀ we show
n σ̂_OLS⁻² (Rβ̂_OLS − q)′(R S_XX⁻¹ R′)⁻¹(Rβ̂_OLS − q) →^d χ²_r.
By consistency of σ̂²_OLS and S_XX together with Slutsky's Lemma, it suffices to study
n σ̂_OLS⁻² (Rβ̂_OLS − q)′(R(EX₁X₁′)⁻¹R′)⁻¹(Rβ̂_OLS − q).
Since (EX₁X₁′)⁻¹ is symmetric and positive definite and since R has full rank it can be shown similarly to Exercise 35(i) that R(EX₁X₁′)⁻¹R′ is symmetric and positive definite and, hence, the same properties hold for its inverse. By Seber [12, 10.8] and references therein we can decompose R(EX₁X₁′)⁻¹R′ = [R(EX₁X₁′)⁻¹R′]^{1/2}[R(EX₁X₁′)⁻¹R′]^{1/2} and [R(EX₁X₁′)⁻¹R′]⁻¹ = ([R(EX₁X₁′)⁻¹R′]⁻¹)^{1/2}([R(EX₁X₁′)⁻¹R′]⁻¹)^{1/2} such that
√n σ̂_OLS⁻¹ [(R(EX₁X₁′)⁻¹R′)⁻¹]^{1/2}(Rβ̂_OLS − q) →^d Z̃ ~ N(0_r, I_r)
and the CMT gives
n σ̂_OLS⁻² (Rβ̂_OLS − q)′(R(EX₁X₁′)⁻¹R′)⁻¹(Rβ̂_OLS − q) →^d Z̃′Z̃ ~ χ²_r.
Step 2: For consistency it remains to show that under any fixed alternative
P(φₙ(Z) = 1) ≥ 1 − α for all n ≥ n₀.
By step 1 of this proof the second term can be bounded from above by α/4 for n ≥ n₀ if we choose K̃ sufficiently large. Due to consistency of σ̂²_OLS and S_XX we have for large n
P(φₙ(Z) = 1) ≥ P( ‖√n σ̂_OLS⁻¹ [(R S_XX⁻¹ R′)⁻¹]^{1/2}(Rβ − q)‖₂ > 2(χ_{r,1−α/2} + K̃) )
≥ P( √n ‖[(R(EX₁X₁′)⁻¹R′)⁻¹]^{1/2}(Rβ − q)‖₂ > 4σ(χ_{r,1−α/2} + K̃) )
→ 1,
since Rβ ≠ q under the alternative, so that the term inside the probability grows like √n.
Note that the theorem above states asymptotic results only. They might be very unreliable in small
samples. Can we do better?
4.3.3 Hypothesis tests in the classical linear regression model under normality
From now on, we assume that P_{ε|X} = N(0ₙ, σ²Iₙ). This implies that the conditional distribution of ε does not depend on X and therefore X and ε are independent by Theorem 3.10. Hence, the marginal distribution of the errors is also given by P_ε = N(0ₙ, σ²Iₙ). Moreover, this assumption allows us to construct exact tests.
We consider the test problem H₀ : β_k = β_k⁰ (for a prescribed value β_k⁰) for a fixed k ∈ {0, ..., K}. First note that under H₀
(β̂_OLS,k − β_k⁰) / ( σ √([(X′X)⁻¹]_{k,k}) ) ~ N(0, 1).
However this quantity cannot be used as a test statistic if σ² is unknown. Therefore we substitute the unknown term by its estimate σ̂²_OLS and consider the t statistic
Tₙ = (β̂_OLS,k − β_k⁰) / ( σ̂_OLS √([(X′X)⁻¹]_{k,k}) ).
Theorem 4.21. Suppose that n > K + 1 and that P_{ε|X} = N(0ₙ, σ²Iₙ). Then Tₙ ~ t_{n−K−1} under the null hypothesis.
Proof. (Sketch) We use the representation of the t distribution below Definition 4.20.
1. Modified numerator.
(β̂_OLS,k − β_k⁰) / ( σ √([(X′X)⁻¹]_{k,k}) ) ~ N(0, 1).
2. Modified denominator. Noting that the OLS residuals are distributed as σ(Iₙ − X(X′X)⁻¹X′)Z, where Z ~ N(0ₙ, Iₙ) is independent of X, we have
(n − K − 1) σ̂²_OLS / σ² = e′e / σ² =^d Z′(Iₙ − X(X′X)⁻¹X′)Z.
It can be shown that the latter is indeed χ²_{n−K−1} distributed. First note that by Davidson and MacKinnon [6, Theorem 4.1(b)]: if Z ~ N(0ₙ, Iₙ) and P is an n-dimensional projection matrix (i.e. symmetric and P² = P) of rank p, then Z′PZ ~ χ²_p. Denoting A(X₁, ..., Xₙ) = Iₙ − X(X′X)⁻¹X′ we get by the Tonelli-Fubini theorem (Jacod and Protter [10, Theorem 10.3])
P(Z′(Iₙ − X(X′X)⁻¹X′)Z ≤ y) = ∫∫ 1{z′A(x₁, ..., xₙ)z ≤ y} P^Z(dz) P^{(X₁,...,Xₙ)}(dx₁, ..., dxₙ)
= ∫ F_{χ²_{n−K−1}}(y) P^{(X₁,...,Xₙ)}(dx₁, ..., dxₙ)
= F_{χ²_{n−K−1}}(y),
since A(x₁, ..., xₙ) is a projection matrix of rank n − K − 1 by Seber [12, 4.11] for almost all (x₁, ..., xₙ).
3. Independence. A sketch of the proof can be found in Hayashi [9, Proof of Proposition 1.3].
It follows immediately from the previous theorem that this test is an α-test.
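A minimal sketch of computing the t statistic Tₙ on simulated data generated under the null. The model, the choice k = 1, and the null value are illustrative assumptions; the critical value would come from the t_{n−K−1} distribution and is not computed here:

```python
import numpy as np

# t statistic T_n = (beta_hat_k - b0) / (sigma_hat * sqrt([(X'X)^{-1}]_{kk}))
# on data simulated under H0: beta_1 = b0.
rng = np.random.default_rng(9)
n, K, k, b0, sigma = 60, 1, 1, 2.0, 1.0

X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, b0]) + sigma * rng.normal(size=n)   # H0 holds

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta_hat
sigma_hat = np.sqrt(e @ e / (n - K - 1))                   # sigma_hat_OLS
T = (beta_hat[k] - b0) / (sigma_hat * np.sqrt(np.linalg.inv(X.T @ X)[k, k]))
print(T)   # one draw from the t_{n-K-1} = t_58 distribution under H0
```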
Lemma 4.22. Under the conditions of the previous theorem and if (Y₁, X₁,₁, ..., X₁,K)′, ..., (Yₙ, Xₙ,₁, ..., Xₙ,K)′ are i.i.d. such that EX₁X₁′ is finite and invertible, the t test is consistent.
Proof. Under a fixed alternative β_k ≠ β_k⁰,
Tₙ = (β̂_OLS,k − β_k) / ( σ̂_OLS √([(X′X)⁻¹]_{k,k}) ) + (β_k − β_k⁰) / ( σ̂_OLS √([(X′X)⁻¹]_{k,k}) )
= O_P(1) + (β_k − β_k⁰) / ( σ̂_OLS √([(X′X)⁻¹]_{k,k}) ).
Hence, for any fixed ε > 0 it holds, with Rₙ denoting the O_P(1) term in the previous line, that
P(|Tₙ| ≥ t_{n−K−1,1−α/2}) ≥ P( |β_k − β_k⁰| / ( σ̂_OLS √([(X′X)⁻¹]_{k,k}) ) ≥ t_{n−K−1,1−α/2} + K̃ ) − P(|Rₙ| > K̃)
≥ P( |β_k − β_k⁰| / ( σ √([(X′X)⁻¹]_{k,k}) ) ≥ 2(t_{n−K−1,1−α/2} + K̃) ) − 2ε
≥ P( √n |β_k − β_k⁰| ≥ 4σ √([(EX₁X₁′)⁻¹]_{k,k}) (t_{n−K−1,1−α/2} + K̃) ) − 2ε
for sufficiently large K̃. Since the left-hand side inside the last probability grows like √n, we can deduce the assertion of the lemma if (t_{n−K−1,1−α/2})ₙ is uniformly bounded in n. To this end, recall that any t_n-distributed random variable can be written as Z₀ / √( (1/n) Σ_{k=1}^n Z_k² ) →^d Z ~ N(0, 1) for i.i.d. standard normal Z₀, Z₁, ...
Remark:
More generally it can be shown that under the conditions above
Tₙ^(c) = c′(β̂_OLS − β) / ( σ̂_OLS √(c′(X′X)⁻¹c) ) ~ t_{n−(K+1)} for all c ∈ R^{K+1}\{0}.
Hence, we can establish a t test for Problems 2 and 3 described in Example 4.16.
Under the normality assumption we can also provide a test for the general hypothesis of paragraph 4.3.2. It can be shown that the finite-sample distribution of the corresponding test statistic divided by r has a so-called F distribution with r and n − K − 1 degrees of freedom, which is defined as the distribution of (Z₁/r)/(Z₂/(n − K − 1)) with independent Z₁ ~ χ²_r and Z₂ ~ χ²_{n−K−1}. Hence we can again construct an α-test.