Lecture Notes 2019
Yixiao Sun
Department of Economics
Spring 2019
Contents

Preface
2.1 Introduction
3.2.1 Assumptions
3.3.1 Assumptions
3.8.3 Caveats
5 Extremum Estimators
5.1 Definitions
5.2 Consistency
5.3.2 ML Estimator
5.3.4 GMM
5.3.5 MD Estimator
Preface
The primary goal of Econ 220C is to introduce tools necessary to understand and implement empirical studies in economics, focusing on issues other than time-series analysis. This course contains two parts. The first part deals with panel data models: (1) static panel data models; (2) dynamic panel data models. The second part of the course deals with limited-dependent-variable models: (1) discrete choice models; (2) censored and truncated regression models; (3) sample selection models; and (4) evaluation of treatment effects. While the second part focuses mainly on cross-sectional data, it also covers panel Probit/Logit, panel Tobit, and panel attrition models. From an econometric theory perspective, the unifying framework for the second part is the asymptotic theory of extremum estimators, which includes GMM as a special case.

We will study different issues in the specification, estimation, and testing of these models with cross-sectional data and with panel data. The emphasis of the course is on both econometric ideas and econometric techniques. For some of the problem sets you will have to deal with actual data or perform simulation experiments. You should become familiar as soon as possible with the general features of the econometric package that you choose. MATLAB is widely used by econometricians. STATA has gained increasing popularity in recent years among applied microeconomists. R has been widely used in statistics but not as much in econometrics.
Chapter 1
Causal/Structural Inference
We compare predictive modeling with causal modeling. Let us first define what a (linear) predictive model is.

Given two scalar random variables (X, Y), suppose we want to predict Y based on X. The best linear predictor has coefficients

β* = cov(X, Y)/var(X) and α* = EY − (EX)β*.

Note that these are purely statistical objects. They may not carry any physical, chemical, or economic meaning. Define the prediction residual

e = Y − (α* + Xβ*).

I want to emphasize that this is just a mathematical definition. The mathematical equation can be rewritten as

Y = (α* + Xβ*) + e.

We add whatever it takes to bring α* + Xβ* to Y. The added amount may not represent anything structural. By construction,

Ee = EY − (α* + (EX)β*) = 0.

So we can write

Y = (α* + Xβ*) + e

where e satisfies

Ee = 0 and cov(X, e) = 0.
This is our linear predictive model. We choose the coefficients α* and β* such that the prediction residual e has mean zero and is uncorrelated with X. That is, the prediction residual contains no information that is linearly related to X.

We could equally well predict X based on Y. The best linear predictor in that direction has coefficients

γ* = cov(X, Y)/var(Y) and δ* = EX − (EY)γ*,

with prediction residual

ẽ = X − δ* − Yγ*.

Then we have

X = δ* + Yγ* + ẽ.
Suppose we observe that X is higher by one unit; then we expect Y to be higher by β* units. When we observe such a change in X, other things, including both observables and unobservables, may have changed too. So the "all else being equal" condition may not be met. The expected change of β* units could be due to the change of X and/or other variables that change with X.

Across individuals, if we observe that Xi is higher than Xj by one unit, then we expect Yi to be higher than Yj by β* units. Here we do not know whether other things are the same across the two individuals. In fact, we let other things change freely with X. That is, individuals make their own choices.
Example. Let Y1 and Y2 be iid random variables with mean zero and common variance, and set Y = Y1 and X = Y1 + Y2. What is the (linear) statistical prediction model between Y and X? That is, what are α*, β*, and e in

Y = α* + Xβ* + e?

We compute

β* = cov(X, Y)/var(X) = var(Y1)/(2 var(Y1)) = 1/2,
α* = EY − (EX)β* = 0,

so

e = Y − (α* + Xβ*) = Y − X/2 = (Y1 − Y2)/2.

That is,

Y = 0 + (1/2)X + (1/2)(Y1 − Y2),

so α* = 0, β* = 1/2, and e = (Y1 − Y2)/2. Again, this is a purely statistical decomposition with no causal content.
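A quick simulation check of this decomposition (a sketch; the sample size and the normal distributions are my own choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Y1, Y2 iid mean-zero; set Y = Y1 and X = Y1 + Y2 as in the example.
y1 = rng.normal(size=n)
y2 = rng.normal(size=n)
Y = y1
X = y1 + y2

# Population formulas: beta* = cov(X, Y)/var(X), alpha* = EY - (EX) beta*.
beta_star = np.cov(X, Y)[0, 1] / np.var(X)
alpha_star = Y.mean() - X.mean() * beta_star
print(alpha_star, beta_star)  # close to 0 and 1/2

# The residual e = Y - (alpha* + X beta*) has mean ~0 and cov(X, e) ~ 0.
e = Y - (alpha_star + X * beta_star)
print(e.mean(), np.cov(X, e)[0, 1])
```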
A statistical relationship does not have to be linear. How should we interpret (α* + Xβ*)? Note that

(α*, β*) = argmin over (a, b) of E[(Y − a − Xb)²].

(α*, β*) are the population regression coefficients. Compare them with the sample regression coefficients

(α̂_OLS, β̂_OLS) = argmin over (a, b) of f̂n(a, b), where f̂n(a, b) = (1/n) Σ_{i=1}^n (Yi − a − Xi b)²,

or equivalently

β̂_OLS = ĉov(X, Y)/v̂ar(X) and α̂_OLS = Ȳ − X̄ β̂_OLS,

where ĉov and v̂ar denote the sample covariance and variance.
Define

m(x) = E(Y|X = x),

which is the conditional mean function. This is a function of x: for each given x, we can compute E(Y|X = x) and assign this value to m(x). When we do not look at a particular value of x, m(X) is a random variable. Write

f(a, b) = E[(Y − a − Xb)²]
        = E{[Y − m(X)]² + [m(X) − a − Xb]² + 2[Y − m(X)][m(X) − a − Xb]}.

By the Law of Iterated Expectations (LIE), which is also called the "law of total expectation", the cross term vanishes: for any function h(X) of X,

E{[Y − m(X)] h(X)} = E{E[Y − m(X)|X] h(X)} = 0,

because

E[Y − m(X)|X] = m(X) − m(X) = 0.

Taking h(X) = m(X) − a − Xb, we obtain

f(a, b) = E{[Y − m(X)]²} + E{[m(X) − a − Xb]²}.

The first term E{[Y − m(X)]²} does not depend on (a, b). Its presence does not change the minimizer. To summarize,

(α*, β*) = argmin over (a, b) of E{[m(X) − a − Xb]²}.

Remark 1. α* + xβ* is the best linear approximation to the conditional expectation function (CEF) m(x): α* + xβ* is the closest linear function to m(x) according to the mean-squared-error criterion.

Remark 2. We also say that α* + Xβ* is the best linear prediction of Y given X in the MSE sense.

We can always write Y = m(X) + e1 (without any identification conditions). This is just an identity, but we know that E(e1|X) = 0. So any random variable can be decomposed into two pieces: a piece that is statistically explained by X, that is, the CEF, and a piece left over that is orthogonal to (i.e., uncorrelated with) any function of X. This decomposition is purely statistical.
E (v) = 0. What is β ∗ ?
is β ∗ ?
Now consider a (linear) causal/structural model:

Y = α + Xβ + u.
Interpretation of β: if we intervene and set X to change by 1 unit while keeping all else constant, then Y will change by β units. The difference between β* and β lies in whether all else is held equal. If we intervene and change Xi by one unit while keeping all else constant, then we expect Yi to change by β units. We compute the causal effect by looking at the same individual under two different scenarios: one is observed and the other is counterfactual. That is, we observe one scenario. Then we ask: what would have happened if Xi were increased by one unit while keeping all else equal?
Measuring the causal effect requires changing X while holding all else equal. This is only possible in ideal, controlled experiments. In an observational study, we do not control the causal factors, so X, the causal factor of interest, may covary with unobserved causal factors collected in u. For example, X is years of schooling, Y is the hourly wage rate, and one causal factor in u is ability. In observational studies where individuals make their own decisions on X, X is likely to be correlated with u. When we observe a unit increase in X from individual i to individual j, we have also implicitly gone through a change in ability. The change in Y (from individual i to individual j) may be partly due to the change in X and partly due to the change in ability. Running an OLS regression of Y on X will give us a good estimator of β*, which aggregates the two effects. However, in policy analysis, we ask: what would happen to individual i's Y if we intervene and change his or her X by one unit? That is, we care about the effect of education on Y while holding ability equal.
In summary:

                            Predictive Analysis                         Causal Inference
Model                       Y = α* + Xβ* + e                            Y = α + Xβ + u
Interpretation of slope     other variables run their own course;       all else is held equal
                            all else may not be equal

Example (common cause). Suppose

y = az,
x = bz,

for b ≠ 0. Graphically, z causes both x and y, but x does not cause y:

      z
     / \
    x   y
Let z be generated as a sequence of iid random variables Zi, so that in the absence of intervention,

Xi = Zi × b,
Yi = Zi × a = (a/b) Xi.

For the purpose of this example, we assume that we do not observe the Zi's. Thus our observations consist of (Xi, Yi) lying on the line y = (a/b)x. Given any Xi, the best prediction of Yi is m(Xi) = (a/b)Xi. Thus Xi is useful for predicting Yi, even though there is no causal relation between Xi and Yi. Furthermore, the regression coefficient a/b definitely does not measure the causal effect of x on y. Instead, the regression coefficient a/b works together with Xi to give an optimal prediction of Yi.
In any equation system, like the one above, if we intervene on a variable (say x), then the equation that determines this variable has to be crossed out: that equation no longer describes how the variable is generated. After the intervention x ← x0 (set), the system becomes

x = x0 (set),
y = az.

Graphically, z still causes y, but x is now set externally:

   x0     z
    |      \
    x       y

Now x and y are not connected in any way: the causal effect is zero. The causal effect is (when x is changed from x0 to x0′)

y(x0′) − y(x0) = 0.
Next consider a simultaneous system in which x and y cause each other:

y = ax + u,
x = by + v,

or graphically

    v     u
    |     |
    x <-> y

Example: x: crime rate; y: police spending. (The two causal directions may not operate at exactly the same time, but if we observe the variables infrequently, then both operate within one observation interval.)

Suppose that the values of (u, v) are generated as an iid sequence of pairs (Ui, Vi) such that

(Ui, Vi) ~ N(0, [σuu σuv; σuv σvv]).

We do not observe (Ui, Vi). The reduced form (the equilibrium solution in terms of Ui and Vi) is given by

Xi = (bUi + Vi)/(1 − ab),
Yi = (Ui + aVi)/(1 − ab).

Hence

β* = cov(Xi, Yi)/var(Xi) = [bσuu + (1 + ab)σuv + aσvv] / [b²σuu + 2bσuv + σvv].
Sufficient freedom exists to deliver a wide range of possible values for β*. For example, when σuu = 0, we have

β* = aσvv/σvv = a,

while setting

σuv = −(bσuu + aσvv)/(1 + ab)

gives

β* = 0,

so that Xi is useless as a predictor of Yi (under normality, the best prediction is linear in Xi).
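The reduced-form formula for β*, and in particular the choice of σuv that makes β* = 0, can be checked by simulation (the structural coefficients below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
a, b = 0.5, 0.3                  # illustrative structural coefficients, ab != 1
s_uu, s_vv = 1.0, 1.0

# The sigma_uv that sets the numerator b*s_uu + (1+ab)*s_uv + a*s_vv to zero.
s_uv = -(b * s_uu + a * s_vv) / (1 + a * b)

cov = np.array([[s_uu, s_uv], [s_uv, s_vv]])
U, V = rng.multivariate_normal([0, 0], cov, size=500_000).T

# Reduced form of y = a x + u, x = b y + v.
X = (b * U + V) / (1 - a * b)
Y = (U + a * V) / (1 - a * b)

beta_star = np.cov(X, Y)[0, 1] / np.var(X)
print(beta_star)  # close to 0: X is useless as a predictor of Y
```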
The optimal linear prediction interpretation of β* holds regardless of whether

(a) x causes y;
(b) y causes x;
(c) x and y are mutually non-causal, although both have a common cause; or
(d) x and y mutually cause each other in the presence of additional causal variables.

In the first three cases, the predictions are in fact perfect, while in the last case we can have Xi and Yi useless as predictors of one another despite their causal relationships. While in case (a) the conditional mean coincides with the causal function, this is not true in any of the other cases. The conditional expectation cannot by itself tell us what we should expect if we intervene; it only tells us what we can expect Yi to be given Xi when Yi and Xi are generated by whatever mechanism is in place.
Chapter 2
Modeling
2.1 Introduction
Recently, empirical research in economics has been enriched by the availability of a wealth of new sources of data: cross sections of individuals observed over time. This type of dataset is called panel data. Other terms used for such data include longitudinal data and repeated measures (in statistics and biostatistics). The availability of panel data has stimulated rapid growth in both methodological approaches and applications during the last thirty years.
Consider the linear panel data model Yit = Xit β + εit. If Xit contains no lagged dependent variables, the model is a static linear panel data model. Otherwise, it is a dynamic linear panel data model. The set of explanatory variables may include:

• variables that vary across individuals and time periods, e.g., wage, age, and years of schooling. Denote them as Xit.

• variables that are time-invariant, i.e., vary only across individuals, e.g., race and sex. Denote them as Xi.

• variables that vary only over time but not across individuals, e.g., the economy-wide unemployment rate. Denote them as Xt.

The model given above is not meaningful unless we explain what εit is. Consider the case where β is defined as the best linear prediction coefficient, and let εit = Yit − Xit β; then εit is the prediction error and Xit β is the best linear prediction of Yit given Xit. In this case, we have cov(Xit, εit) = 0 by definition, and the above model is a predictive model.
If we assume that (i) Xit is a causal factor of Yit, (ii) the causal link from Xit to Yit is linear, and (iii) εit contains all other (unobserved) causal factors, then the above model is a causal model, in which Xit and εit are possibly correlated. By default, the linear model we consider in this course should be interpreted as a causal model.
What is generally referred to as the panel data approach to economic research provides several major advantages over conventional cross-sectional or time-series approaches. Both Hsiao (2014) in his seminal monograph and Baltagi (2013) in his excellent book provide extensive summaries.

• More informative data: more variability, more degrees of freedom, and more efficiency.

• The ability to distinguish situations that look identical in a cross section: for example, if the unemployment rate is 6% in every year, we need panel data to determine whether the same 6% are unemployed each year.
• Repeated observations on the same unit allow identification in the presence of some types of unobserved heterogeneity, as the following figures illustrate.

Figure 2.1: We are interested in the slope of the thin lines. If we have only cross-sectional data, the fitted line will be the thick one. The estimated slope is obviously biased downward.

Figure 2.2: We are interested in the slope of the thin lines. If we have only cross-sectional data, the fitted line will be the thick one. The estimated slope is obviously biased upward.
Example: suppose Zi is a productivity factor (say, soil quality) that is known by the farmer but not by the econometrician. Then the profit-maximizing choice of the input Xit will depend on Zi. Therefore, Xit will be (positively) correlated with Zi and hence with the error term that contains Zi.
Example: traffic fatalities and beer taxes (Stock and Watson, 2007).

• The data are for 48 states, where each state is observed in T = 7 time periods (each of the years 1982–1988).

• A study estimated that 25% of drivers on the road between 1am and 3am have been drinking.

Objective: estimate the effect of government policies designed to discourage drunk driving on the fatality rate.

Y = Fatality rate: the number of annual traffic deaths per 10,000 people in a state.

X = Beer tax: the "real" tax on a case of beer, i.e., the beer tax put into 1988 dollars.
Scatterplots: The Traffic Fatality Rate and the Tax on a Case of Beer (in 1988 dollars), for the 1982 data and for the 1988 data.

1982 estimation and 1988 estimation:

• Higher taxes are associated with more, not fewer, traffic fatalities??? But if we focus on changes within a state over time, factors that differ across states but are fixed over time are held constant.

(Figure: the average fatality rate, vfrall, plotted by year, 1982–1988.)
• Omitted variable bias: quality of the autos, highway conditions, and social attitudes toward drinking and driving may be correlated with the beer tax.

• Suppose high traffic density means more traffic deaths, and (Western) states with lower traffic density have lower beer taxes. Then the beer tax is related to traffic density (−), which is related to traffic deaths (+), and the OLS slope picks up this indirect association.

• Solution: collect all the relevant data and augment the simple regression. However, some relevant variables are unobservable or hard to measure.

• Keep those variables constant across different periods ⇒ fixed effects model.

Let Zi be a variable that determines the fatality rate in state i but does not change over time, e.g., cultural attitudes toward drinking and driving. Cultural attitudes affect the level of drunk driving and thus the fatality rate. However, if they do not change over time, then they do not produce any change in fatalities in the state. The changes must arise from other sources.
For more examples on unobserved heterogeneity, See Arellano (2003, pages 8-10).
For some macro panels and financial panels, both N and T can be large. We need to allow both N and T to go to infinity. This is the so-called multidimensional asymptotics. We may let N → ∞ first and then T → ∞, or let T → ∞ first and then N → ∞, or let N and T go to ∞ at the same time while controlling the relative rate of expansion (e.g., √N/T → 0). The first two schemes are called sequential asymptotics and the last one is called joint asymptotics.
Data of this kind have been prominent, for example, in research on models of growth.

In micro panels, N is typically very large (several hundreds or even thousands) while T is quite small (ranging from 2 to 10 in most cases, and very rarely exceeding 20). If T is much smaller than N, the usual asymptotics is to let N → ∞ with T fixed.
Examples of micro panels are household- or firm-level panels, which are based on surveys, censuses, administrative records, or company balance accounts. Two widely used data sets are
• Small N, large T (Seemingly Unrelated Regression Equations (SURE)). This type of data set is referred to as time-series and cross-sectional data (TSCS) in political science.

N and T do not necessarily refer to the number of individuals and time periods, respectively. Other examples include families and family members, schools and classes, and industries and firms. Many types of cross-sectional survey data are obtained through "cluster" sampling. Certain geographical units are first selected (e.g., villages), then individuals are sampled within each village. Thus, the village from which the individual observation comes may be thought of as one dimension of the data, and panel data methods are of special importance in research in development economics. Consider a model

y_{ci} = x_{ci} β + ε_{ci},

where c indexes the cluster and i indexes individuals in the cluster. If we have a large number of clusters and relatively small group sizes (max(Ic) is small), then we have a traditional linear panel data model with large N and small T.

If you will do research in development and deal with survey data, it is worthwhile reading Deaton (1997).
A SURE system with K equations can be represented as

y1j = X1j β1 + e1j
y2j = X2j β2 + e2j
...
yKj = XKj βK + eKj

for j = 1, 2, ..., N, where the Xkj are regressors that are assumed to be exogenous and the βk are vectors of parameters. The equations in the system may be related, as the error terms in different equations may be correlated. For example, ykj may be individual j's expenditure on good k or budget share for good k. In a SURE system, K is typically small while N is large.
Now we can write our panel model with small N and large T as a SURE system:

y1t = X1t β1 + e1t
y2t = X2t β2 + e2t
...
yNt = XNt βN + eNt

for t = 1, ..., T, or more compactly

yt = Xt β + et,

where yt = (y1t, ..., yNt)′, et = (e1t, ..., eNt)′,

β = (β1′, β2′, ..., βN′)′, an (Nd) × 1 vector,

and Xt is the N × (Nd) block-diagonal matrix

Xt = diag(X1t, X2t, X3t, ..., XNt).
Stacking over t, we can write the system as

y = Xβ + e

with

E(e|X) = 0

and

E(ee′|X) = Φ.

A special case is

Φ = IT ⊗ Σ, where Σ = (σij) is the N × N matrix with (i, j)-th element σij = E(eit ejt),
which allows correlation across equations at each t but rules out serial correlation and time-varying heteroskedasticity. If you are overwhelmed by the notation, you can consider the special case N = 1, which reduces to a standard linear regression.

Recall the definition of the Kronecker product: for A = (aij),

A ⊗ B = [ a11 B  a12 B  ...  a1N B
          a21 B  a22 B  ...  a2N B
          ...
          aN1 B  aN2 B  ...  aNN B ].

A useful identity is

vec(ABC) = (C′ ⊗ A) vec(B).
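The vec identity can be verified numerically (a minimal sketch; note that vec stacks the columns of a matrix, i.e., column-major order in NumPy):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 5))
C = rng.normal(size=(5, 2))

def vec(M):
    # vec stacks the columns of M on top of each other
    return M.flatten(order="F")

lhs = vec(A @ B @ C)
rhs = np.kron(C.T, A) @ vec(B)
print(np.allclose(lhs, rhs))  # True
```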
The GLS estimator is

β̂_GLS = (X′Φ⁻¹X)⁻¹ X′Φ⁻¹y,

where X′Φ⁻¹ is (Nd) × (NT) and X is (NT) × (Nd), and β̂_GLS is BLUE.
Under one of the following two conditions, OLS applied to each equation is equivalent to GLS when Φ = IT ⊗ Σ:

• (i) Σ = IN (more generally, Σ diagonal);

• (ii) all equations have the same regressors.

The proof of this last condition involves rewriting the SUR system individual-by-individual
as follows:
ỹi = X̃i βi + ẽi,

where

ỹi = (yi1, yi2, ..., yiT)′ (T × 1), X̃i = (Xi1′, Xi2′, ..., XiT′)′ (T × d), ẽi = (ei1, ei2, ..., eiT)′ (T × 1).

That is,

[ ỹ1 ]   [ X̃1   0   ...   0  ]       [ ẽ1 ]
[ ỹ2 ] = [  0   X̃2  ...   0  ] β  +  [ ẽ2 ]     or     ỹ = X̃β + ẽ.
[ ... ]  [ ...               ]       [ ... ]
[ ỹN ]   [  0    0   ...  X̃N ]       [ ẽN ]
Now impose condition (ii): the regressors are the same for all individuals, so that

X̃i = (Xi1′, Xi2′, ..., XiT′)′ = (x1′, x2′, ..., xT′)′ = x_{T×d} for all i.

Then X̃ = IN ⊗ x_{T×d} and, with the errors stacked individual by individual, var(ẽ) = Σ ⊗ IT. So

β̂_GLS = [X̃′(Σ ⊗ IT)⁻¹X̃]⁻¹ X̃′(Σ ⊗ IT)⁻¹ỹ
       = {(IN ⊗ x′_{T×d})(Σ⁻¹ ⊗ IT)(IN ⊗ x_{T×d})}⁻¹ (IN ⊗ x′_{T×d})(Σ⁻¹ ⊗ IT) ỹ
       = {Σ⁻¹ ⊗ (x′_{T×d} x_{T×d})}⁻¹ (Σ⁻¹ ⊗ x′_{T×d}) ỹ
       = [IN ⊗ (x′_{T×d} x_{T×d})⁻¹ x′_{T×d}] ỹ.
Written out block by block,

β̂_GLS = [ (x′_{T×d} x_{T×d})⁻¹ x′_{T×d} ỹ1 ]
        [ (x′_{T×d} x_{T×d})⁻¹ x′_{T×d} ỹ2 ]
        [ ...                              ]
        [ (x′_{T×d} x_{T×d})⁻¹ x′_{T×d} ỹN ],
which is the same as the equation-by-equation OLS estimator. This is a purely algebraic result; there does not seem to be any simple intuition for why OLS is numerically identical to GLS here. The numerical equivalence between OLS and GLS in this case is a well-known result in econometrics.
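The numerical equivalence is easy to confirm (a sketch with arbitrary illustrative values; Sigma and the data-generating process below are my own choices):

```python
import numpy as np

rng = np.random.default_rng(4)
N, T, d = 3, 50, 2                      # N equations, T observations, d regressors
x = rng.normal(size=(T, d))             # common regressor matrix x_{T x d}
Sigma = np.array([[1.0, 0.6, 0.3],
                  [0.6, 1.0, 0.5],
                  [0.3, 0.5, 1.0]])     # cross-equation error covariance
beta = rng.normal(size=(N, d))

E = rng.multivariate_normal(np.zeros(N), Sigma, size=T)               # T x N errors
ytil = np.concatenate([x @ beta[i] + E[:, i] for i in range(N)])      # stacked by equation
Xtil = np.kron(np.eye(N), x)            # block-diagonal regressor matrix

# GLS with variance Sigma (x) I_T versus equation-by-equation OLS
Phi_inv = np.kron(np.linalg.inv(Sigma), np.eye(T))
b_gls = np.linalg.solve(Xtil.T @ Phi_inv @ Xtil, Xtil.T @ Phi_inv @ ytil)
b_ols = np.concatenate([np.linalg.lstsq(x, ytil[i*T:(i+1)*T], rcond=None)[0]
                        for i in range(N)])
print(np.allclose(b_gls, b_ols))  # True
```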
For more details on estimating systems of equations, read Chapter 7 in Wooldridge (2010).
A good reference for the Kronecker product and matrix algebra in general is Abadir and
Magnus (2005).
Bibliography
[1] Abadir, K. and J. Magnus (2005): Matrix Algebra, Cambridge University Press.

[2] Arellano, M. (2003): Panel Data Econometrics, Oxford University Press.
[3] Baltagi, Badi H. (2013): Econometric Analysis of Panel Data, John Wiley & Sons.
[4] Deaton, Angus (1997): The Analysis of Household Surveys, The Johns Hopkins University Press.
[5] Hsiao, Cheng (2014): Analysis of Panel Data, Cambridge University Press.
[6] Stock, J. and M. Watson (2007): Introduction to Econometrics, 2nd Edition, Addison-Wesley.
[7] Wooldridge J. (2010): Econometric Analysis of Cross Section and Panel Data, The MIT
Press.
Chapter 3
In this chapter, we consider only the static model so that Xit contains no lagged dependent
variables.
Notation: all vectors except Xit are column vectors; Xit is a 1 × k row vector. Denote Xi = (Xi1′, Xi2′, ..., XiTi′)′ ∈ R^{Ti×k}, where k is the number of elements in Xit, and Yi = (Yi1, ..., YiTi)′ ∈ R^{Ti×1}. The composite error has two components:

εit = αi + uit.

αi = b2 Zi   (3.2)
• The key issue is whether αi is correlated with Xi or, put in a stronger form, whether E(αi|Xi) = 0. Following Wooldridge (2010, Chapter 10), we always treat αi as a random variable. When E(αi|Xi) = 0, we say the model is a random-effects model. Otherwise, it is a fixed-effects model. Some authors refer to the random-effects model and fixed-effects model as the uncorrelated-effect model and correlated-effect model, respectively.

• Regardless of the type of model, there are two approaches to estimating β: the random-effects approach and the fixed-effects approach. In the former approach, αi is not treated as a parameter; in the latter, each αi is treated as a parameter to be estimated.
3.2.1 Assumptions
E(αi|Xi) = E(αi) = 0 is the random-effects assumption. E(ui|Xi) = 0 is the so-called strict (strong) exogeneity assumption. This is certainly stronger than zero contemporaneous correlation: it rules out feedback from past shocks to current regressors.

*Feedback effect: if one attributes the higher wage at time t, which is really due to the random shock uit, to the training program, then he is more likely to join the training program in the future. In this scenario, a higher uit leads to a higher value for Progit+1, and so uit and Xit+1 are correlated, violating strict exogeneity.

Assumption RE.1(b) is very strong, as it rules out cross-sectional dependence. It is difficult to allow cross-sectional dependence, especially when N is large and the time series are short. Panel data models with cross-sectional dependence have attracted much attention in recent years.
The random-effects model can be depicted as

Xi → Yi ← ui
      ↑
      αi

where the absence of a link between Xi and ui indicates E(ui|Xi) = 0, and the absence of a link between Xi and αi indicates E(αi|Xi) = 0.

Assumption RE.2. rank[E(Σ_{t=1}^{Ti} Xit′ Xit)] = k.

Assumption RE.3. E(αi²|Xi) = σα², E(αi ui|Xi) = 0, and E(ui ui′|Xi) = σu² ITi.
Under these assumptions,

E(εit εjs) = σα² + σu²   if i = j and t = s,
             σα²         if i = j and t ≠ s,
             0           if i ≠ j.

Such a covariance structure implies serial correlation in the error terms. Hence OLS estimation is not efficient.
Collect the errors for individual i into εi = (εi1, ..., εiTi)′ and let

Ωi = E(εi εi′|Xi) = [ σα² + σu²   σα²          ...   σα²
                     σα²          σα² + σu²   ...   σα²             (3.7)
                     ...                             σα²
                     σα²          σα²          ...   σα² + σu² ]
   = σu² ITi + σα² JTi,

where ITi is the Ti × Ti identity matrix and JTi is the Ti × Ti matrix with unity in every element. Using PTi = JTi/Ti and QTi = ITi − JTi/Ti, we can write

Ωi = σu² QTi + (σu² + Ti σα²) PTi.
The GLS estimator of β is

β̂_GLS = (Σ_{i=1}^N Xi′ Ωi⁻¹ Xi)⁻¹ (Σ_{i=1}^N Xi′ Ωi⁻¹ Yi).   (3.9)
Let

X = (X1′, X2′, ..., XN′)′, Y = (Y1′, Y2′, ..., YN′)′, and V = diag(Ω1, Ω2, ..., ΩN).

Then

β̂_GLS = (X′V⁻¹X)⁻¹ X′V⁻¹Y.
• Under assumptions RE.1, RE.2, and RE.3, the GLS estimator is efficient in the class of unbiased estimators.

Traditionally, the statistical and econometric literature focuses on the class of unbiased estimators. Nowadays unbiasedness is just one of many different criteria used to evaluate an estimator.

The estimator in (3.9) is not feasible. As in the typical GMM setup, we replace Ωi by ITi to get an initial consistent estimator of β. This estimator is the pooled OLS estimator:
β̂_OLS = (Σ_{i=1}^N Xi′ Xi)⁻¹ (Σ_{i=1}^N Xi′ Yi)
       = Σ_{i=1}^N Wi β̂_OLS^(i), where Wi = (Σ_{j=1}^N Xj′ Xj)⁻¹ Xi′ Xi

and

β̂_OLS^(i) = (Xi′ Xi)⁻¹ Xi′ Yi

is the OLS estimator using only the time-series observations for individual i. With β̂_OLS, we can estimate the variance components:
σ̂ε² = [1/(Σ_{i=1}^N Ti)] Σ_{i=1}^N Σ_{t=1}^{Ti} (Yit − Xit β̂_OLS)².   (3.15)

σ̂α² = [1/(Σ_{i=1}^N Ti(Ti − 1)/2)] Σ_{i=1}^N Σ_{t=1}^{Ti−1} Σ_{s=t+1}^{Ti} (Yit − Xit β̂_OLS)(Yis − Xis β̂_OLS).   (3.16)
Plugging σ̂ε² and σ̂α² into the definition of Ωi yields Ω̂i. Using Ω̂i⁻¹, we get the feasible GLS estimator

β̂_RE = (Σ_{i=1}^N Xi′ Ω̂i⁻¹ Xi)⁻¹ (Σ_{i=1}^N Xi′ Ω̂i⁻¹ Yi).

In principle, the estimator in (3.16) can be negative; a negative estimate indicates negative serial correlation in uit, probably a substantial amount, which means one of our assumptions is violated.
Under assumptions RE.1, RE.2, and RE.3, β̂_RE is asymptotically equivalent to the infeasible GLS estimator. Note that

Ωi⁻¹ = (1/σu²) QTi + [1/(σu² + Ti σα²)] PTi

for

PTi = JTi/Ti = J̄Ti ∈ R^{Ti×Ti} and QTi = ITi − J̄Ti ∈ R^{Ti×Ti}.

It is easy to see that both PTi and QTi are projection matrices and PTi QTi = 0.
Hence

Ωi^{−1/2} Xi = (1/σu) QTi Xi + [1/√(σu² + Ti σα²)] PTi Xi.

To see the transformation behind Ωi^{−1/2}, we note that

Ωi^{−1/2} Xi = (1/σu) QTi Xi + [1/√(σu² + Ti σα²)] PTi Xi
            = (1/σu)(Xi − X̄i,·) + [1/√(σu² + Ti σα²)] X̄i,·
            = (1/σu){Xi − [1 − σu/√(σu² + Ti σα²)] X̄i,·}
            = (1/σu)(Xi − θi X̄i,·),

where X̄i,· = PTi Xi is the Ti × k matrix whose rows all equal the time average of Xit, and

θi = 1 − σu/√(σu² + Ti σα²).   (3.19)

Applying the same transformation to Yi and εi gives the quasi-demeaned equation

Yit − θi Ȳi,· = (Xit − θi X̄i,·)β + (εit − θi ε̄i,·),

but θi is individual-dependent. By definition, the variance of εit − θi ε̄i,· is σu². So OLS is BLUE if it is based on the above regression model.
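A numerical check that σu Ωi^{−1/2} indeed quasi-demeans (the values of T, σu², and σα² below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
T, s2_u, s2_a = 5, 1.5, 0.8            # illustrative values
J = np.ones((T, T))
Omega = s2_u * np.eye(T) + s2_a * J    # RE error covariance for one individual

# Symmetric inverse square root via the eigendecomposition
w, V = np.linalg.eigh(Omega)
Om_inv_half = V @ np.diag(w ** -0.5) @ V.T

theta = 1 - np.sqrt(s2_u) / np.sqrt(s2_u + T * s2_a)
x = rng.normal(size=T)

lhs = np.sqrt(s2_u) * Om_inv_half @ x  # sigma_u * Omega^{-1/2} x
rhs = x - theta * x.mean()             # quasi-demeaned x
print(np.allclose(lhs, rhs))  # True
```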
It can be shown that β̂_GLS and the feasible GLS estimator β̂_RE are asymptotically equivalent in the sense that

√N (β̂_GLS − β̂_RE) = op(1),

so that the asymptotic distribution of √N(β̂_GLS − β) is the same as that of √N(β̂_RE − β). To show this, we consider the case with a balanced panel (Ti = T, θi = θ) for simplicity. It is not hard to show that

√N (β̂_RE − β) = [(1/N) Σ_{i=1}^N (Xi − θ̂X̄i,·)′(Xi − θ̂X̄i,·)]⁻¹ [(1/√N) Σ_{i=1}^N (Xi − θ̂X̄i,·)′(εi − θ̂ε̄i,·)],

and the same holds for β̂_GLS with θ in place of θ̂. We now consider

(1/√N) Σ_{i=1}^N (Xi − θ̂X̄i,·)′(εi − θ̂ε̄i,·)
  = (1/√N) Σ_{i=1}^N (Xi − θX̄i,·)′(εi − θε̄i,·)
    − √N(θ̂ − θ) (1/N) Σ_{i=1}^N X̄i,·′εi − √N(θ̂ − θ) (1/N) Σ_{i=1}^N Xi′ε̄i,· + √N(θ̂² − θ²) (1/N) Σ_{i=1}^N X̄i,·′ε̄i,·.

Since √N(θ̂ − θ) = Op(1) while each of the averages (1/N) Σ X̄i,·′εi, (1/N) Σ Xi′ε̄i,·, and (1/N) Σ X̄i,·′ε̄i,· has mean zero under the maintained assumptions and so converges in probability to zero, we obtain

(1/√N) Σ_{i=1}^N (Xi − θ̂X̄i,·)′(εi − θ̂ε̄i,·) = (1/√N) Σ_{i=1}^N (Xi − θX̄i,·)′(εi − θε̄i,·) + op(1).

A similar expansion shows that

(1/N) Σ_{i=1}^N (Xi − θ̂X̄i,·)′(Xi − θ̂X̄i,·) = (1/N) Σ_{i=1}^N (Xi − θX̄i,·)′(Xi − θX̄i,·) + op(1).

Therefore,

√N (β̂_RE − β) = [(1/N) Σ (Xi − θX̄i,·)′(Xi − θX̄i,·)]⁻¹ [(1/√N) Σ (Xi − θX̄i,·)′(εi − θε̄i,·)] + op(1)
             = √N (β̂_GLS − β) + op(1),

as desired.
3.3.1 Assumptions
Now we assume that Xit and αi are correlated. In this case, the random-effects estimator is biased.

• If Xit contains some time-invariant variables, then we cannot identify the effects of these time-invariant variables on Yit. For individuals, factors such as race and gender cannot have their effects identified.

• For identification, we only require that each element of Xit varies over time for some individuals.

The fixed-effects model can be depicted as Xi → Yi ← ui with αi → Yi, where now there is a link between αi and Xi (they are correlated), while the absence of a link between Xi and ui still indicates E(ui|Xi) = 0.
The idea is to transform the equation to eliminate the unobserved effect αi. There are several transformations that can be used for this purpose. Recall that we already used first differencing for a two-period model. Now we consider the fixed-effects transformation, which is also called the within transformation. Averaging Yit = Xit β + αi + uit over t and subtracting gives

Ÿit = Ẍit β + üit,

where Ÿit = Yit − Ȳi,·, Ẍit = Xit − X̄i,·, and üit = uit − ūi,·. Under strict exogeneity, E(üit|Xi) = 0. So the OLS estimator is consistent and unbiased. Note that the above condition may not hold if uit is only contemporaneously uncorrelated with Xit.

Assumption FE.2: rank{E[Σ_{t=1}^{Ti} Ẍit′ Ẍit]} = k.

The fixed-effects (within) estimator is

β̂_FE = (Σ_{i=1}^N Σ_{t=1}^{Ti} Ẍit′ Ẍit)⁻¹ (Σ_{i=1}^N Σ_{t=1}^{Ti} Ẍit′ Ÿit).
If we estimate β using OLS based on equation (3.22) (the time-averaged equation Ȳi,· = X̄i,· β + αi + ūi,·), then we get the (weighted) between estimator:

β̂_BE = (Σ_{i=1}^N (1/σ1i²) X̄i,·′ X̄i,·)⁻¹ (Σ_{i=1}^N (1/σ1i²) X̄i,·′ Ȳi,·),   (3.27)

where σ1i² = Ti · var(αi + ūi,·) = Ti · (σα² + σu²/Ti) = Ti σα² + σu².

• β̂_BE is inconsistent under the fixed-effects assumption because X̄i,· and αi are correlated.
Let

P = diag(JT1/T1, JT2/T2, ..., JTN/TN)

and Q = I − P, where I = I_{T1+...+TN} is the (Σ_{i=1}^N Ti) × (Σ_{i=1}^N Ti) identity matrix. Then

β̂_FE − β = (X′QX)⁻¹ X′Qu  ~a  N(0, σu² (E X′QX)⁻¹),   (3.29)

where ~a denotes "is distributed approximately as". More precisely, we should write

√N (β̂_FE − β) →d N(0, σu² [lim_{N→∞} (1/N) E X′QX]⁻¹).   (3.30)
The FE residual vector is

û = QY − QX (X′QX)⁻¹ X′Q′QY
  = Qu − QX (X′QX)⁻¹ X′Q′Qu
  = (I − QX (X′QX)⁻¹ X′Q′) Qu.   (3.31)

Note that

û′û = u′Q (I − QX (X′QX)⁻¹ X′Q′) Qu
    = u′Qu − u′QX (X′QX)⁻¹ X′Q′u.   (3.32)

So

E(û′û) = E(u′Qu) − E[u′QX (X′QX)⁻¹ X′Q′u]
       = E tr(Q uu′) − E tr[QX (X′QX)⁻¹ X′Q′ uu′]
       = Σ_{i=1}^N (Ti − 1) σu² − tr[QX (X′QX)⁻¹ X′Q′] σu²
       = [Σ_{i=1}^N (Ti − 1) − k] σu².

This motivates the unbiased variance estimator

σ̂u² = SSR / [Σ_{i=1}^N (Ti − 1) − k] = û′û / [Σ_{i=1}^N (Ti − 1) − k].   (3.33)
Given that

Yit = Xit β + αi + uit

and there is a correlation between αi and Xit, we need to control for αi in order to obtain a consistent estimator of β. That is, we need to look at the data individual by individual and then aggregate the individual information on β to obtain our final estimator. From the perspective of the modern control function approach, this is exactly what the least squares dummy variable (LSDV) estimator does. Define the dummies

Dji = 1 if i = j, and 0 otherwise.

Then

Yit = Xit β + αi + uit = Xit β + Di′α + uit = (Xit, Di′)(β′, α′)′ + uit,

where

Di = (D1i, D2i, ..., DNi)′ and α = (α1, α2, ..., αN)′.   (3.34)

Therefore, the linear causal model reduces to the usual linear statistical model if

E(ui|Xi, Di) = 0.
• β̂_FE is consistent with fixed Ti as N → ∞.

• α̂i is an unbiased estimator of αi but may not be consistent for αi when Ti is fixed.
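The numerical identity between the within estimator and LSDV can be confirmed directly (the simulated design below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
N, T, k = 40, 6, 2
alpha = rng.normal(size=N)
X = rng.normal(size=(N, T, k)) + alpha[:, None, None]   # X correlated with alpha_i
beta = np.array([1.0, -0.5])
Y = X @ beta + alpha[:, None] + rng.normal(size=(N, T))

# Within (FE) estimator: demean within each individual, then pooled OLS
Xdd = X - X.mean(axis=1, keepdims=True)
Ydd = Y - Y.mean(axis=1, keepdims=True)
b_fe = np.linalg.lstsq(Xdd.reshape(-1, k), Ydd.reshape(-1), rcond=None)[0]

# LSDV: OLS of Y on (X, individual dummies); keep only the X coefficients
D = np.kron(np.eye(N), np.ones((T, 1)))                 # NT x N dummy matrix
Z = np.hstack([X.reshape(-1, k), D])
b_lsdv = np.linalg.lstsq(Z, Y.reshape(-1), rcond=None)[0][:k]
print(np.allclose(b_fe, b_lsdv))  # True
```

The identity is a consequence of the Frisch–Waugh–Lovell theorem: partialling out the dummies is exactly the within transformation.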
The problem was originally pointed out by Neyman and Scott (Econometrica, 1948). In our notation, the problem can be described as follows: for Yi = (Yi1, Yi2), the observations (Y1, ..., Yn) are independent with Yit ~ N(αi, σ²), t = 1, 2. We can think of this as a panel data set with two periods and no covariate Xit. In this problem, there are (n + 1) parameters. Neyman and Scott (1948) consider the MLEs of {αi} and σ²:

α̂i = (1/2)(Yi1 + Yi2), i = 1, 2, ..., n;

σ̂² = (1/2n) Σ_{i=1}^n Σ_{t=1}^2 (Yit − α̂i)².

Note that α̂i doesn't converge to αi (it is based on only two observations), and we can show that σ̂² converges in probability to σ²/2 rather than σ²:

σ̂² = (1/4n) Σ_{i=1}^n (Yi1 − Yi2)² →p (1/4) E(Yi1 − Yi2)² = σ²/2.
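A simulation illustrating the inconsistency (n, σ², and the distribution of the αi below are my own choices):

```python
import numpy as np

rng = np.random.default_rng(7)
n, sigma2 = 200_000, 2.0
alpha = rng.normal(size=n)                       # incidental parameters
Y = alpha[:, None] + np.sqrt(sigma2) * rng.normal(size=(n, 2))

alpha_hat = Y.mean(axis=1)                       # MLE of each alpha_i
sigma2_hat = ((Y - alpha_hat[:, None]) ** 2).sum() / (2 * n)   # MLE of sigma^2
print(sigma2_hat)  # close to sigma2 / 2 = 1.0, not 2.0
```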
When the number of parameters grows with the sample size, the usual argument for consistency may not work any more. In the model of Neyman and Scott (1948), the αi's are the incidental parameters, because they are deemed of secondary importance. Depending on the context, the term "incidental parameter problem" is used to refer to the problem where the number of parameters grows with the sample size.

How about the clustered robust variance (see the section on Robust Variance Estimator)

σ̃² = (1/n) Σ_{i=1}^n [Σ_{t=1}^2 (Yit − α̂i)]² ?
Lagging the equation by one period and subtracting eliminates αi:

ΔYit = ΔXit β + Δuit, t = 2, ..., Ti.

The first-difference (FD) estimator is the pooled OLS estimator for the above regression:

β̂_FD = (Σ_{i=1}^N Σ_{t=2}^{Ti} ΔXit′ ΔXit)⁻¹ (Σ_{i=1}^N Σ_{t=2}^{Ti} ΔXit′ ΔYit).

Under the strict exogeneity assumption, we have E(Δuit|ΔXi2, ΔXi3, ..., ΔXiT) = 0, so β̂_FD is consistent and unbiased. Without strict exogeneity, E(ΔXit′ Δuit) may not equal zero if uit is correlated with Xit−1, Xit, or Xit+1.

Assumption FD.2: rank(Σ_{i=1}^N Σ_{t=2}^{Ti} E ΔXit′ ΔXit) = k.
• A computational warning: if you stack the data, the difference across different individuals (the first observation of individual i + 1 minus the last observation of individual i) is not a valid first difference and must be dropped.

• The FD estimator is less efficient than the FE estimator under the FE assumptions.

• Under assumptions FD.1–FD.3, β̂_FD is the most efficient estimator.

• The residuals êit = ΔYit − ΔXit β̂_FD can be used to estimate the variance of Δuit.

What if Assumption FE.3, E(ui ui′|Xi) = σu² ITi, holds? In this case, β̂_FD is less efficient than β̂_FE because var(Δui2, ..., ΔuiTi) is not a diagonal matrix.
Let

Di = [ −1   1   0  ...   0
        0  −1   1  ...   0
        ...
        0  ...  0  −1    1 ]   ((Ti − 1) × Ti);

then ΔYi = Di Yi, ΔXi = Di Xi, and Δui = Di ui. The variance of Δui is then given by σu² Di Di′. The GLS estimator based on the first-differenced model

Di Yi = Di Xi β + Di ui

is

β̂_FD,GLS = {Σ_{i=1}^N Xi′ Di′ (Di Di′)⁻¹ Di Xi}⁻¹ {Σ_{i=1}^N Xi′ Di′ (Di Di′)⁻¹ Di Yi}.

Note that Di′(Di Di′)⁻¹ Di is a projection matrix projecting onto the row space of Di. Since Di ℓTi = 0, where ℓTi = (1, ..., 1)′, ℓTi is orthogonal to the row space of Di. So projecting onto the row space of Di is the same as projecting onto the orthogonal complement of ℓTi:

Di′(Di Di′)⁻¹ Di = ITi − ℓTi (ℓTi′ ℓTi)⁻¹ ℓTi′ = QTi.

As a consequence,

β̂_FD,GLS = β̂_FE.
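Both steps of the argument, Di′(Di Di′)⁻¹Di = QTi and β̂_FD,GLS = β̂_FE, can be checked numerically (arbitrary data; the identity is algebraic, so no model needs to be simulated):

```python
import numpy as np

rng = np.random.default_rng(8)
N, T, k = 30, 5, 2
X = rng.normal(size=(N * T, k))
Y = rng.normal(size=N * T)

# First-difference matrix D ((T-1) x T) and within projection Q_T
D = np.eye(T - 1, T, k=1) - np.eye(T - 1, T)
P_rows = D.T @ np.linalg.inv(D @ D.T) @ D   # projection onto the row space of D
Q = np.eye(T) - np.ones((T, T)) / T
print(np.allclose(P_rows, Q))               # True: the same projection

# FE estimator via the stacked within transformation
QX = np.kron(np.eye(N), Q) @ X
QY = np.kron(np.eye(N), Q) @ Y
b_fe = np.linalg.lstsq(QX, QY, rcond=None)[0]

# GLS on the first-differenced model with variance D D'
A = np.kron(np.eye(N), D)
W = np.kron(np.eye(N), np.linalg.inv(D @ D.T))
b_fdgls = np.linalg.solve(X.T @ A.T @ W @ A @ X, X.T @ A.T @ W @ A @ Y)
print(np.allclose(b_fe, b_fdgls))           # True
```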
It is worth pointing out that the β̂_FD,GLS estimator is the OLS estimator based on the transformed model

(Di Di′)^{−1/2} Di Yi = (Di Di′)^{−1/2} Di Xi β + (Di Di′)^{−1/2} Di ui.

A natural question is: what is u*i = (Di Di′)^{−1/2} Di ui for a vector ui = (ui,1, ..., ui,Ti)′? Note that (Di Di′)^{−1/2} is not unique. If we use the Cholesky decomposition Di Di′ = Ri′ Ri for some upper triangular Ri and take

(Di Di′)^{−1/2} = Ri⁻¹,

which is still upper triangular, then some algebra shows that we can take

u*it = cit [uit − (1/(Ti − t)) (ui,t+1 + ... + ui,Ti)],

where

cit² = (Ti − t)/(Ti − t + 1).

This is the forward orthogonal deviations transformation. If var(ui) = σu² ITi, then var(u*i) = σu² I_{Ti−1}. Therefore, the forward transformation removes the fixed effects but, in contrast to first differencing, does not introduce serial correlation in the transformed errors. Forward orthogonal deviations are especially convenient in dynamic panel data models.
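The two key properties of the forward orthogonal deviations transformation (it annihilates the individual effect, and spherical errors stay spherical) can be verified by building the (Ti − 1) × Ti matrix directly from the closed form above:

```python
import numpy as np

T = 5
# Row t applies u*_t = c_t [u_t - (u_{t+1} + ... + u_T)/(T - t)],
# with c_t^2 = (T - t)/(T - t + 1).
F = np.zeros((T - 1, T))
for t in range(1, T):
    c = np.sqrt((T - t) / (T - t + 1))
    F[t - 1, t - 1] = c
    F[t - 1, t:] = -c / (T - t)

print(np.allclose(F @ np.ones(T), 0))       # constant (fixed effect) is removed
print(np.allclose(F @ F.T, np.eye(T - 1)))  # var(F u) = sigma_u^2 I if var(u) = sigma_u^2 I
print(np.allclose(F.T @ F, np.eye(T) - np.ones((T, T)) / T))  # F'F is the within projection
```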
• The FD and FE estimators are identical in a balanced data set with T = 2 for all individuals. In that case, the FD model and the FE model are numerically identical, since Yi2 − Yi1 = 2Ÿi2 and Xi2 − Xi1 = 2Ẍi2, and the factor of 2 cancels in OLS.

• When T > 2, the choice between FD and FE hinges on the assumptions on {uit}.

• The FD estimator and FE estimator will have different probability limits when the strict exogeneity assumption fails.

• Correlation between uit and Xis for some s ≠ t leads to inconsistent FD and FE estimators.
Using

Ωi^{−1/2} Xi = (1/σu) QTi Xi + (1/σ1i) PTi Xi,

we have

Xi′ Ωi⁻¹ Xi = [(1/σu) QTi Xi + (1/σ1i) PTi Xi]′ [(1/σu) QTi Xi + (1/σ1i) PTi Xi]
            = (1/σu²) Xi′ QTi Xi + (1/σ1i²) Xi′ PTi Xi.

So, summing over i and applying the same decomposition to Xi′ Ωi⁻¹ Yi,

β̂_RE = W1 β̂_between + (I − W1) β̂_within,   (3.39)

where

W1 = [Σ_{i=1}^N ((1/σu²) Xi′ QTi Xi + (1/σ1i²) Xi′ PTi Xi)]⁻¹ [Σ_{i=1}^N (1/σ1i²) Xi′ PTi Xi].
• If $\sigma_\alpha^2 = 0$, then $\sigma_{1i} = \sigma_u$ and
$$\hat\beta_{RE} = \left(\sum_{i=1}^N X_i'X_i\right)^{-1}\sum_{i=1}^N X_i'Y_i = \hat\beta_{POLS}.$$
• If $\min_i T_i \to \infty$, then $\sigma_u/\sigma_{1i} \to 0$. As a result, $\hat\beta_{RE} \to \hat\beta_{within} = \hat\beta_{FE}$ (the "within" variation dominates).
• If $\sigma_\alpha^2 \to \infty$, $\hat\beta_{RE} \to \hat\beta_{within} = \hat\beta_{FE}$. The larger $\sigma_\alpha^2$ is, the closer $\hat\beta_{RE}$ is to $\hat\beta_{FE}$.
Note that
$$\mathrm{asymvar}(\hat\beta_{RE}) = \left[\sum_{i=1}^N\frac{1}{\sigma_u^2}X_i'Q_{T_i}X_i + \frac{1}{\sigma_{1i}^2}X_i'P_{T_i}X_i\right]^{-1}$$
and
$$\mathrm{asymvar}(\hat\beta_{within}) = \left[\frac{1}{\sigma_u^2}\sum_{i=1}^N X_i'Q_{T_i}X_i\right]^{-1}.$$
Hence $\mathrm{asymvar}(\hat\beta_{RE}) \le \mathrm{asymvar}(\hat\beta_{within})$.
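The matrix-weighted-average identity (3.39) can be checked numerically. The following is a small illustrative simulation (a scalar regressor, variance components treated as known), not part of the lecture note:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 200, 5
sigma_u2, sigma_a2 = 1.0, 0.5
sigma1_2 = T * sigma_a2 + sigma_u2           # sigma_1^2 = T sigma_alpha^2 + sigma_u^2

X = rng.normal(size=(N, T))
alpha = np.sqrt(sigma_a2) * rng.normal(size=(N, 1))
Y = 2.0 * X + alpha + np.sqrt(sigma_u2) * rng.normal(size=(N, T))

Q = np.eye(T) - np.ones((T, T)) / T           # within (demeaning) projection
P = np.ones((T, T)) / T                       # between (averaging) projection

Sxx_w = sum(X[i] @ Q @ X[i] for i in range(N))
Sxy_w = sum(X[i] @ Q @ Y[i] for i in range(N))
Sxx_b = sum(X[i] @ P @ X[i] for i in range(N))
Sxy_b = sum(X[i] @ P @ Y[i] for i in range(N))

beta_within = Sxy_w / Sxx_w
beta_between = Sxy_b / Sxx_b
denom = Sxx_w / sigma_u2 + Sxx_b / sigma1_2
beta_re = (Sxy_w / sigma_u2 + Sxy_b / sigma1_2) / denom
W1 = (Sxx_b / sigma1_2) / denom               # weight on the between estimator

# identity (3.39): beta_RE = W1 * beta_between + (1 - W1) * beta_within
print(abs(beta_re - (W1 * beta_between + (1 - W1) * beta_within)) < 1e-10)
```

The identity holds exactly in the sample, not just asymptotically, because it is pure linear algebra.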
The panel estimators in the previous sections can be obtained by OLS estimation of $\beta$ in the pooled regression
$$\tilde Y_{it} = \tilde X_{it}\beta + \tilde\varepsilon_{it},$$
where $\tilde Y_{it}$, $\tilde X_{it}$ are the demeaned or quasi-demeaned versions of $Y_{it}$ and $X_{it}$ for the FE and RE estimators. For the FD estimator, $\tilde Y_{it}$, $\tilde X_{it}$ are the first differences of $Y_{it}$ and $X_{it}$.
$$\tilde\beta - \beta = \left(\sum_{i=1}^N\sum_{t=1}^{T_i}\tilde X_{it}'\tilde X_{it}\right)^{-1}\sum_{i=1}^N\sum_{t=1}^{T_i}\tilde X_{it}'\tilde\varepsilon_{it} = \left(\sum_{i=1}^N\sum_{t=1}^{T_i}\tilde X_{it}'\tilde X_{it}\right)^{-1}\sum_{i=1}^N v_i,$$
where
$$v_i = \sum_{t=1}^{T_i}\tilde X_{it}'\tilde\varepsilon_{it}.$$
Under cross-sectional independence and standard regularity conditions,
$$\operatorname*{plim}_{N\to\infty}\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^{T_i}\tilde X_{it}'\tilde X_{it} = S_{XX} \quad \text{and} \quad \frac{1}{\sqrt N}\sum_{i=1}^N v_i \to_d N(0, S_{X\varepsilon}),$$
where
$$S_{X\varepsilon} = \lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^N E\left(v_iv_i'\right).$$
As a consequence,
$$\sqrt{N}\left(\tilde\beta - \beta\right) \to_d N\!\left(0,\ S_{XX}^{-1}S_{X\varepsilon}S_{XX}^{-1}\right).$$
$S_{X\varepsilon}$ can be estimated by
$$\hat S_{X\varepsilon} = \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^{T_i}\sum_{s=1}^{T_i}\tilde X_{it}'\tilde\varepsilon_{it}^{est}\tilde\varepsilon_{is}^{est}\tilde X_{is} = \frac{1}{N}\sum_{i=1}^N\tilde X_i'\tilde\varepsilon_i^{est}\left(\tilde\varepsilon_i^{est}\right)'\tilde X_i,$$
where $\tilde\varepsilon_{it}^{est}$ is the estimated residual.
• The validity of the above formula depends crucially on the cross-sectional independence
assumption.
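A minimal numerical sketch of this clustered (sandwich) variance formula, with simulated data and illustrative names; the check at the end confirms that the matrix form of $\hat S_{X\varepsilon}$ equals the explicit per-cluster sum:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, k = 50, 4, 2
Xt = rng.normal(size=(N, T, k))                          # transformed regressors X~_it
eps = rng.normal(size=(N, T)) + rng.normal(size=(N, 1))  # residuals correlated within i

Sxx = np.einsum('itk,itl->kl', Xt, Xt) / N               # (1/N) sum_{i,t} X~' X~
v = np.einsum('itk,it->ik', Xt, eps)                     # v_i = sum_t X~_it' eps~_it
Sxe = v.T @ v / N                                        # (1/N) sum_i v_i v_i'

Sxx_inv = np.linalg.inv(Sxx)
V = Sxx_inv @ Sxe @ Sxx_inv / N                          # sandwich variance of beta~
se = np.sqrt(np.diag(V))                                 # cluster-robust standard errors

# The matrix form equals the explicit sum of outer products over clusters.
Sxe_loop = np.zeros((k, k))
for i in range(N):
    vi = Xt[i].T @ eps[i]
    Sxe_loop += np.outer(vi, vi)
Sxe_loop /= N
print(np.allclose(Sxe, Sxe_loop))
```

Summing $v_iv_i'$ over individuals, rather than over individual observations, is exactly what makes the estimator robust to arbitrary serial correlation within each cluster.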
One might expect that the random effects estimator is superior to the fixed effects estimator. After all, it is the GLS estimator; moreover, the previous discussion shows that the fixed effects estimator is a limiting case of the RE estimator, corresponding to situations where the variation in the individual effects is large. However, there is a very strong assumption built into the RE approach: the individual effects must be uncorrelated with the regressors. The Hausman test examines this assumption.
Suppose we have two alternative estimators, $\hat\beta_I$ and $\hat\beta_{II}$, for a true parameter vector $\beta$. Further suppose that if the null hypothesis H0 is correct, both estimators are consistent and asymptotically normal with (approximate) variance-covariance matrices $V_I$ and $V_{II}$, and (approximate) matrix of covariances between the two estimators $V_{I,II}$. Finally, suppose that if the null hypothesis is false the two estimators converge to different limits: for example, one of them might remain consistent while the other one becomes inconsistent, or both of them might become inconsistent but in different ways. Then, under H0, the Wu-Hausman quadratic form
$$m = \left(\hat\beta_I - \hat\beta_{II}\right)'\left(V_I + V_{II} - V_{I,II} - V_{II,I}\right)^-\left(\hat\beta_I - \hat\beta_{II}\right) \qquad (3.41)$$
converges in distribution to a $\chi^2(k)$, where $k$ is the rank of the asymptotic variance of $\hat\beta_I - \hat\beta_{II}$.
If one of the estimators, say $\hat\beta_{II}$, is efficient under H0, it follows from a Rao-Blackwell-type argument that $V_{I,II} = V_{II,I} = V_{II}$. Hence, the variance-covariance expression in the quadratic form simplifies to $V_I - V_{II}$.
The intuition behind this result is as follows. Suppose we have two consistent estimators $\hat\beta_I$ and $\hat\beta_{II}$, and $\hat\beta_I$ is an efficient estimator. Then the variance $\mathrm{var}(a\hat\beta_I + (1-a)\hat\beta_{II})$ is smallest when $a = 1$. But the FOC for the minimization problem $\min_a \mathrm{var}(a\hat\beta_I + (1-a)\hat\beta_{II})$ is
$$2a\,\mathrm{var}(\hat\beta_I) - 2(1-a)\mathrm{var}(\hat\beta_{II}) + (2-4a)\mathrm{cov}(\hat\beta_I,\hat\beta_{II}) = 0. \qquad (3.42)$$
Letting $a = 1$ yields
$$\mathrm{var}(\hat\beta_I) = \mathrm{cov}(\hat\beta_I,\hat\beta_{II}), \qquad (3.43)$$
as desired.
Applying this approach to the linear panel data problem, we can use the $m$ statistic based on $\hat\beta_{RE}$ and $\hat\beta_{FE}$ to test the null H0: $\mathrm{cov}(\alpha_i, X_{it}) = 0$ for all $t$.
• Under H0, $\hat\beta_{RE}$ is consistent, asymptotically normal and efficient; $\hat\beta_{FE}$ is consistent and asymptotically normal.
• Under the alternative, $\hat\beta_{RE}$ is inconsistent while $\hat\beta_{FE}$ is consistent.
The test statistic is
$$m = \left(\hat\beta_{RE} - \hat\beta_{FE}\right)'\left[\widehat{\mathrm{Var}}\left(\hat\beta_{RE} - \hat\beta_{FE}\right)\right]^{-1}\left(\hat\beta_{RE} - \hat\beta_{FE}\right). \qquad (3.44)$$
For the purpose of exposition, suppose $X'QX$ is of full rank, which rules out a constant regressor. Then
$$\hat\beta_{RE} - \hat\beta_{FE} = \left(X'V^{-1}X\right)^{-1}X'V^{-1}\varepsilon - \left(X'QX\right)^{-1}X'Q\varepsilon. \qquad (3.45)$$
So, under H0, $\hat\beta_{RE} - \hat\beta_{FE} \approx 0$ and
$$\mathrm{cov}(\hat\beta_{RE}, \hat\beta_{FE}) = E\left(X'V^{-1}X\right)^{-1}X'V^{-1}\varepsilon\varepsilon'QX\left(X'QX\right)^{-1} = E\left(X'V^{-1}X\right)^{-1}X'V^{-1}VQX\left(X'QX\right)^{-1} = E\left(X'V^{-1}X\right)^{-1} = \mathrm{var}(\hat\beta_{RE}). \qquad (3.46)$$
Thus
$$\mathrm{Var}\left(\hat\beta_{RE} - \hat\beta_{FE}\right) = \mathrm{Var}(\hat\beta_{FE}) - \mathrm{Var}(\hat\beta_{RE}) = \sigma_u^2\left(X'QX\right)^{-1} - \left(X'V^{-1}X\right)^{-1}, \qquad (3.47)$$
which can be estimated by
$$\hat\sigma_u^2\left(X'QX\right)^{-1} - \left(X'\hat V^{-1}X\right)^{-1}. \qquad (3.48)$$
Alternatively,
$$\widehat{\mathrm{Var}}(\hat\beta_{RE}) = \hat\sigma_u^2\left[\sum_{i=1}^N\sum_{t=1}^{T_i}\left(X_{it} - \hat\theta_i\bar X_{i\cdot}\right)'\left(X_{it} - \hat\theta_i\bar X_{i\cdot}\right)\right]^{-1}, \qquad (3.49)$$
$$\widehat{\mathrm{Var}}(\hat\beta_{FE}) = \hat\sigma_u^2\left[\sum_{i=1}^N\sum_{t=1}^{T_i}\left(X_{it} - \bar X_{i\cdot}\right)'\left(X_{it} - \bar X_{i\cdot}\right)\right]^{-1}. \qquad (3.50)$$
To ensure the positive definiteness of $\widehat{\mathrm{Var}}(\hat\beta_{RE} - \hat\beta_{FE})$, we need to use the same $\hat\sigma_u^2$ in both places.
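The statistic (3.44) is easy to package as a small function. The sketch below is illustrative (the function name and interface are ours, not from the note); it uses a pseudo-inverse since the estimated variance difference can be singular:

```python
import numpy as np

def hausman(b_eff, V_eff, b_cons, V_cons):
    """Wu-Hausman statistic m = d' (V_cons - V_eff)^+ d with d = b_cons - b_eff.

    b_eff, V_eff: estimator efficient under H0 (e.g. RE) and its variance;
    b_cons, V_cons: estimator consistent under both hypotheses (e.g. FE).
    A pseudo-inverse is used because V_cons - V_eff may be singular.
    """
    d = np.asarray(b_cons, dtype=float) - np.asarray(b_eff, dtype=float)
    Vd = np.asarray(V_cons, dtype=float) - np.asarray(V_eff, dtype=float)
    m = float(d @ np.linalg.pinv(Vd) @ d)
    df = int(np.linalg.matrix_rank(Vd))
    return m, df

# toy numbers: d = (0.1, 0.2), Vd = diag(0.01, 0.01) -> m = 1 + 4 = 5, df = 2
m, df = hausman([1.0, 2.0], np.diag([0.01, 0.01]),
                [1.1, 2.2], np.diag([0.02, 0.02]))
```

Compare $m$ to the $\chi^2(\mathrm{df})$ critical value; large values reject H0.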
3.8.3 Caveats
• RE.3 may not hold. In this case, we need to implement a robust Hausman test.
• The quadratic form of the Hausman-Wu test does not extend easily to other situations. A regression-based version of the test turns out to be asymptotically equivalent to the quadratic form of the test.
Here we have assumed balanced panel for simplicity. If αi is correlated with {Xit , t = 1, ..., T } ,
then we can follow Mundlak (1978) and assume that the linear projection of αi onto {Xit , t = 1, ..., T }
is
αi = γ 0 + X̄i,· γ + ei , (3.52)
where ei is uncorrelated with Xi . To test whether αi is correlated with {Xit }, we can test
H0 : γ = 0.
The key is that the new error term (uit − θūi,· ) + (1 − θ)ei is uncorrelated with the regressors
Xit∗ and X̄i,· . Therefore, a test of H0 : γ = 0 can be done using the standard Wald test on the
variable X̄i,· in an OLS regression that includes both Xit∗ and X̄i,· . Such a test is equivalent to
the quadratic-form Hausman-Wu test. The advantage of the Wald test is that we can easily
take the possible heteroscedasticity into account when computing the asymptotic variance of
γ̂.
Note that the linear projection in (3.52) is indeed an assumption. In general, we have the
linear projection
αi = γ̃ 0 + Xi γ̃ + ẽi , (3.54)
where ẽi is uncorrelated with Xi . In the above more general projection, the covariates consist
of the whole trajectory Xi not merely X̄i,· . The above holds as long as the variances of αi and
Xi are finite. Methods based on (3.54) are often said to implement the Chamberlain device,
Consider the model
$$Y_i = X_i\beta + \varepsilon_i$$
with available instruments $Z_i$, which we are confident are independent of $\varepsilon_i$. We are considering the null hypothesis that $X_i$ is exogenous. To this end, we can estimate the model under the null hypothesis by OLS:
$$\hat\beta_{OLS} = \left(X'X\right)^{-1}X'Y, \qquad \mathrm{Var}(\hat\beta_{OLS}) = \sigma_\varepsilon^2\left(X'X\right)^{-1},$$
and under the alternative by IV:
$$\hat\beta_{IV} = (\hat X'\hat X)^{-1}\hat X'Y, \qquad \text{where} \quad \hat X = P_ZX = Z\left(Z'Z\right)^{-1}Z'X.$$
If $\hat\beta_{OLS}$ is asymptotically efficient under the null, then we can construct the Hausman test statistic as follows:
$$W = \left(\hat\beta_{IV} - \hat\beta_{OLS}\right)'\hat\Omega^-\left(\hat\beta_{IV} - \hat\beta_{OLS}\right), \qquad \hat\Omega = \hat\sigma_\varepsilon^2\left[(\hat X'\hat X)^{-1} - \left(X'X\right)^{-1}\right].$$
Under the null hypothesis, $W \to_d \chi_k^2$, where $k$ is the rank of $(\hat X'\hat X)^{-1} - (X'X)^{-1}$, which is the same as the column rank of $\hat X - X$. That is, only regressors not included in the set of instruments are counted in the degrees of freedom. To show this, suppose $X_1$ is a part of $Z$, so that with $X = (X_1, X_2)$,
$$\hat X = (X_1, \hat X_2), \qquad D := \hat X - X = (0,\ \hat X_2 - X_2).$$
Since $\hat X = P_ZX$, we have $\hat X - X = -M_ZX$ with $M_Z = I - P_Z$, and
$$X'X - \hat X'\hat X = X'M_ZX = (\hat X - X)'(\hat X - X) = D'D.$$
Using $(\hat X'\hat X)^{-1} - (X'X)^{-1} = (\hat X'\hat X)^{-1}\left(X'X - \hat X'\hat X\right)(X'X)^{-1}$, we obtain
$$\mathrm{rank}\left[(\hat X'\hat X)^{-1} - (X'X)^{-1}\right] \le \mathrm{rank}(D'D) = \mathrm{rank}(D) = \#\text{ of }X_2 = \#\text{ of regressors not included in the set of instruments}.$$
Under additional identification conditions, we have $\mathrm{rank}\left[(\hat X'\hat X)^{-1} - (X'X)^{-1}\right] = \mathrm{rank}(D)$.
An asymptotically equivalent regression-based test runs the regression
$$Y_i = X_i\beta + V_i\gamma + \text{error}_i,$$
where $V_i = X_i - \hat X_i$ is the first-stage prediction residual. If we ignore the estimation uncertainty in $\hat X_i$, $V_i$ is uncorrelated with $\varepsilon_i$ under the null, so the regression coefficient $\gamma$ in the pseudo-regression is expected to be zero. On the other hand, when $X_i$ is not exogenous, then $V_i$ will be correlated with $\varepsilon_i$ and we expect $\gamma$ to be different from zero. Therefore, to test the exogeneity of $X$, we can perform the standard Wald test based on the above pseudo-regression with possibly robust standard errors. The test can be carried out in Stata using commands similar to
estat endogenous
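The pseudo-regression has an exact finite-sample property in the linear model: the coefficient on $X$ in the augmented (control-function) regression equals the 2SLS estimate. A small simulated check (illustrative names only, not from the note):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
Z = rng.normal(size=(n, 2))                            # instruments
X = Z @ np.array([1.0, -0.5]) + rng.normal(size=n)     # single regressor
Y = 2.0 * X + rng.normal(size=n)
Xm = X[:, None]

PZ = Z @ np.linalg.solve(Z.T @ Z, Z.T)                 # projection onto instruments
Xhat = PZ @ Xm
beta_2sls = np.linalg.solve(Xhat.T @ Xhat, Xhat.T @ Y)

V = Xm - Xhat                                          # first-stage residual
W = np.hstack([Xm, V])                                 # augmented pseudo-regression
coef = np.linalg.solve(W.T @ W, W.T @ Y)               # [beta, gamma]

print(np.allclose(coef[0], beta_2sls[0]))              # control-function identity
```

The identity follows from the Frisch-Waugh-Lovell theorem: partialling $V = M_ZX$ out of $X$ leaves exactly $\hat X = P_ZX$.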
One way to measure the impact of a treatment in the setting of a natural experiment is to use the differences-in-differences (DD or DiD) method. To apply this method, longitudinal or repeated cross-section data are needed, with at least one period before and one period after the program (here, the training period). In period 1, there is no program for either group; program participation occurs only between the two periods, and only for the treatment group. Let

Values of $prog_{it}$:
                    Period 1    Period 2
Control group          0           0
Treatment group        0           1
where $\lambda_t$ is the time effect and $\alpha_i$ is the time-invariant unobserved effect. Note that $\alpha_i$ is likely correlated with program participation. The DD estimator
$$\hat\beta = \left(\bar Y_{treat,2} - \bar Y_{treat,1}\right) - \left(\bar Y_{control,2} - \bar Y_{control,1}\right)$$
is consistent provided that
$$E\bar u_{treat,2} - E\bar u_{treat,1} = E\bar u_{control,2} - E\bar u_{control,1};$$
that is, it assumes that the changes of the averaged $u$ are not systematically different across the two groups. This is the so-called parallel paths assumption (or common trend assumption) in the DD literature.
Note that in the above table the individual effects matter only via $\bar\alpha_{treat}$ and $\bar\alpha_{control}$, the group averages for the treatment and control groups. The DD estimator would be the same if $\alpha_i$ were replaced by $\alpha_{g(i)}$, where $g(i) \in \{treatment, control\}$ indicates individual $i$'s group. Now if we introduce the group dummy
$$G_i = \begin{cases} 1, & \text{if } i \text{ belongs to the treatment group} \\ 0, & \text{otherwise} \end{cases}$$
and the period dummy
$$B_t = \begin{cases} 1, & \text{if } t = 2 \\ 0, & \text{otherwise} \end{cases}$$
we can write
$$\lambda_t + \alpha_{g(i)} = c_0 + c_GG_i + c_BB_t$$
for some constants $c_0$, $c_G$ and $c_B$. Now it is easy to see that $prog_{it} = G_i \times B_t$, and so
$$Y_{it} = c_0 + c_GG_i + c_BB_t + \beta\left(G_i \times B_t\right) + u_{it}.$$
This is a common formulation in empirical studies. The OLS estimator of the coefficient on the interaction term $G_i \times B_t$ is exactly the DD estimator.
For the above formulation, we don't actually need panel data. It is enough to have repeated cross-section data. The difference between panel data and repeated cross-section data is that for panel data the same individuals appear in both periods, while for repeated cross-section data the same individuals may not appear in both periods. For repeated cross-section data, we can run
$$Y_i = c_0 + c_GG_i + c_BB_{t(i)} + \beta\left(G_i \times B_{t(i)}\right) + u_i,$$
where $t(i)$ indicates the period that individual $i$'s observation belongs to.
consider the case with two states CA and N Y. There is a policy change in CA only. At each
time period, we have observations for some individuals from each of the two states. For a
given state, the groups of individuals do not have to be the same across the two periods.
What if $E\bar u_{treat,2} - E\bar u_{treat,1} = E\bar u_{control,2} - E\bar u_{control,1}$ does not hold unconditionally, but the parallel paths assumption holds conditional on covariates $X$? In this case, we have to control for $X$; that is, we can do a DiD analysis conditional on each value of $X$.
In the panel data case, we may add the time-varying variables $X_{it}$ as regression controls. Again, let's assume that program participation only occurs in the second period. Then, after partialling out the controls,
$$\hat\beta = \Delta\tilde Y_{treat} - \Delta\tilde Y_{control}. \qquad (3.69)$$
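The equivalence between the DiD of group means and the OLS coefficient on the interaction $G_i \times B_t$ in the saturated dummy regression can be checked numerically (simulated data; the identity holds exactly whenever all four group-period cells are nonempty):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400
G = rng.integers(0, 2, size=n).astype(float)   # group dummy
B = rng.integers(0, 2, size=n).astype(float)   # period dummy
Y = 1.0 + 0.5 * G + 0.3 * B + 1.2 * G * B + rng.normal(size=n)

# OLS on the saturated dummy regression Y = c0 + cG*G + cB*B + beta*(G*B)
W = np.column_stack([np.ones(n), G, B, G * B])
beta_ols = np.linalg.lstsq(W, Y, rcond=None)[0][3]

# DiD of cell means: (treat after - treat before) - (control after - control before)
m = lambda g, b: Y[(G == g) & (B == b)].mean()
beta_did = (m(1, 1) - m(1, 0)) - (m(0, 1) - m(0, 0))

print(np.allclose(beta_ols, beta_did))
```

Because the regression is saturated, OLS fits the four cell means exactly, so the interaction coefficient is mechanically the double difference of means.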
A more general formulation accommodates more than two periods and other time effects; it can be estimated by the least squares dummy variable (LSDV) regression.
New notation:
• $Y_{it}(1)$: the value of the outcome variable at period $t$ for individual $i$ had the individual participated in the program;
• $Y_{it}(0)$: the value of the outcome variable at period $t$ for individual $i$ had the individual not participated.
The observed outcome is
$$Y_{it} = D_iY_{it}(1) + (1-D_i)Y_{it}(0),$$
or
$$Y_{it} = Y_{it}(0) + D_i\left[Y_{it}(1) - Y_{it}(0)\right].$$
Note that the average treatment effect on the treated in period 2 is
$$ATT = \underbrace{E\left[Y_{i2}(1)|D_i=1\right]}_{identified} - E\left[Y_{i2}(0)|D_i=1\right],$$
where the first term can be identified from the data. The challenge is to identify the second, counterfactual term $E[Y_{i2}(0)|D_i=1]$. If the parallel paths assumption
$$E\left[Y_{i2}(0) - Y_{i1}(0)|D_i=1\right] = E\left[Y_{i2}(0) - Y_{i1}(0)|D_i=0\right]$$
holds, then
$$E\left[Y_{i2}(0)|D_i=1\right] = \underbrace{E\left[Y_{i1}|D_i=1\right]}_{identified} + \underbrace{E\left[Y_{i2} - Y_{i1}|D_i=0\right]}_{identified},$$
so the ATT is identified.
The same argument applies conditionally on covariates. Assume that
$$E\left[Y_{i2}(0) - Y_{i1}(0)|D_i=1, X_i=x\right] = E\left[Y_{i2}(0) - Y_{i1}(0)|D_i=0, X_i=x\right],$$
and use $E\left[Y_{i1}(0)|D_i=1, X_i=x\right] = E\left[Y_{i1}|D_i=1, X_i=x\right]$ (no one is treated in period 1). Then the conditional counterfactual mean $E\left[Y_{i2}(0)|D_i=1, X_i=x\right]$ is identified, and therefore the conditional ATT is identified for each $x$.
Example 6 Consider a simple example to illustrate the basic philosophy behind the differences-in-differences approach, applied to a job training program intended to raise employment. We have a group that participates in the program (treatment group) and a comparison group (control group) of non-participants. We also have data on the outcome measure for the participants and the comparison group in the time prior to and after the program. The data are summarized in the table below. The number in each cell is the average employment rate for the group in the period.
Graphically (figure omitted), the outcome is plotted against time (t = 1, 2):

                  t = 1 (before)    t = 2 (after)
Treatment group       14.7%            17.6%
Control group         16.7%            18.4%

In the figure, the dashed blue line signifies the counterfactual path had the treatment group not been exposed to the treatment.
Let us consider di§erent ways to evaluate this program based on the data presented in this
table.
Method 1: Suppose that we look simply at the employment rate for participants after
the program and compare that to the employment rate for the comparison group after the
program. If we do this, we must conclude that the program actually reduces employment since
17.6% − 18.4% = −0.8%. Obviously this is a very unsatisfying result since, just by looking at
the table, we can see that it neglects to take into account the fact that participants started out with a lower employment rate than the comparison group.
Method 2: A second approach is to look only at the treated population. That is, we can look at program participants before and after the program. By doing
this, we see a very strong result of the program: 17.6%−14.7% = 2.9%. Yet this answer is also
open to criticism. By looking at the table, we see that the comparison group also improved
between the before and after time periods. This leads us to wonder if there is some external
force acting on everyone — both the control group and treatment group — that leads to higher
employment rates. If that is the case, then some portion of the improvement for participants
may be due to this external force rather than the program itself. For example, if the overall
employment rate has been rising, then both the treatment and control group members would have experienced employment gains even without the program.
Method 3: The approach that takes into account all of the information in the table above is the "difference-in-differences"
approach. First, compute the difference in employment for the treatment group before and after the program: 17.6% − 14.7% = 2.9%. Second, compute the difference in employment for the control group before and after the program: 18.4% − 16.7% = 1.7%. By subtracting off the 1.7%, we are removing the increased employment that would have occurred anyway (the benefit of an improving economy, for example), leaving us with an estimate of the program effect: 2.9% − 1.7% = 1.2 percentage points.
Many papers use the DD to identify and estimate the causal effect of a policy change. For example, Eissa and Liebman (1996) want to estimate the effect of the earned income tax credit (EITC) on labor supply of women. The EITC is a subsidy that goes mostly to low-income women who have children. Eissa and Liebman evaluate the effect of the EITC from the Tax Reform Act of 1986, at which time only people with children were eligible. Their treatment group consists of single women with kids, and their control group consists of single women without kids. They compare the variable of interest before and after the EITC.
As a second example, Richardson and Troost (2009) want to study whether the Federal Reserve would have been able to mitigate the banking crisis that preceded the Great Depression. To overcome the many obstacles to answering this question, they find a group of banks within an economically similar environment which were subject to the same state regulations but influenced by different monetary policies. Banks in Mississippi fit the bill. In 1913, the state was split evenly into two Federal Reserve districts (district borders were determined by the population size in 1913 at the birth of the Federal Reserve System). The top half of the state was placed in the Eighth District presided over by the St. Louis Federal Reserve Bank. The lower half was part of the Sixth District, the domain of the Atlanta Fed. While the Atlanta Fed acted as a lender of last resort and provided credit to troubled institutions, the St. Louis Fed allowed the supply of credit to contract as the economy contracted, because less credit is demanded during times of weak economic activity.
The textbook by Stock and Watson (2002) provides a very nice discussion on program evaluation; you may also benefit from chapters such as Ch 11: Experiments and Quasi-Experiments. There are also modern treatments and excellent surveys of the DiD method in the potential outcomes framework.
Exercises

1. Consider the random effects model
$$Y_{it} = X_{it}\beta + \alpha_i + u_{it},$$
where $\alpha_i \sim iid(0, \sigma_\alpha^2)$ and $u_{it} \sim iid(0, \sigma_u^2)$, independent of each other and among themselves. Hausman suggested computing the difference between the random effects estimator and the fixed effects estimator, $q = \hat\beta_{RE} - \hat\beta_{FE}$.
(a) Show that
$$\hat\beta_{RE} = \left(\sum_{i=1}^N X_i'\Omega_T^{-1}X_i\right)^{-1}\left(\sum_{i=1}^N X_i'\Omega_T^{-1}Y_i\right), \qquad \hat\beta_{FE} = \left(\sum_{i=1}^N X_i'Q_TX_i\right)^{-1}\left(\sum_{i=1}^N X_i'Q_TY_i\right),$$
where $\Omega_T = \sigma_1^2P_T + \sigma_u^2Q_T$ and $\sigma_1^2 = T\sigma_\alpha^2 + \sigma_u^2$.
(b) Consider
$$Y_i^* = X_i^*\beta + \tilde X_i\gamma + v_i, \qquad (3.79)$$
where
$$Y_i^* = \sigma_u\Omega_T^{-1/2}Y_i \ \text{(so that } Y_{it}^* = Y_{it} - \theta\bar Y_{i\cdot}\text{)}, \qquad X_i^* = \sigma_u\Omega_T^{-1/2}X_i,$$
and $\tilde X_i$ collects the individual means of the regressors. Show that
$$\tilde\beta_{OLS} = \hat\beta_{between}, \qquad \tilde\gamma_{OLS} = \hat\beta_{within} - \hat\beta_{between}. \qquad (3.81)$$
(c) Show that under H0, $\mathrm{var}(\tilde\gamma_{OLS}) = \mathrm{var}(\hat\beta_{within}) + \mathrm{var}(\hat\beta_{between})$. Prove that the Wald statistic $w = \tilde\gamma'\left(\widehat{\mathrm{var}}(\tilde\gamma)\right)^{-1}\tilde\gamma$ is identical to $m$.
2. Use your favorite package to answer this question. A sample Matlab program is posted
on the TED. I encourage you to write your own program before reading the sample program.
The sample program works for a scalar Xit . For vector cases, some modifications are required.
Publish your code and report the URL on the TED. Group study is encouraged but you have
Consider the fixed effects model
$$Y_{it} = X_{it}\beta + \alpha_i + u_{it},$$
where $\beta = 1$, $X_{it} \sim iid\,N(0,1)$ across $i$ and $t$, and $u_{it}|X_i \sim iid$ across $t$ with distribution $N(0, |X_{it}|^2)$; $u_{it}$ is independent of $u_{js}$ for any $i \ne j$. Now suppose we use the FE estimator to estimate $\beta$:
$$\hat\beta = \beta + \left(\sum_{i=1}^N\sum_{t=1}^T\left(X_{it} - \bar X_{i\cdot}\right)^2\right)^{-1}\left(\sum_{i=1}^N\sum_{t=1}^T\left(X_{it} - \bar X_{i\cdot}\right)\left(u_{it} - \bar u_{i\cdot}\right)\right). \qquad (3.83)$$
(a) Let N = 500 and T = 5. Simulate the sampling distribution of β̂ using 1000 simulation
replications. For each simulated sample, compute the robust standard errors σ̂ β and σ̃ β
according to
$$\hat\sigma_\beta^2 = S_{XX}^{-2}\sum_{i=1}^N\left[\sum_{t=1}^T\left(X_{it} - \bar X_{i\cdot}\right)\left(\hat u_{it} - \bar{\hat u}_{i\cdot}\right)\right]^2 = S_{XX}^{-2}\sum_{i=1}^N\sum_{t=1}^T\sum_{s=1}^T\left(X_{it} - \bar X_{i\cdot}\right)\left(\hat u_{it} - \bar{\hat u}_{i\cdot}\right)\left(\hat u_{is} - \bar{\hat u}_{i\cdot}\right)\left(X_{is} - \bar X_{i\cdot}\right) \qquad (3.84)$$
and
$$\tilde\sigma_\beta^2 = S_{XX}^{-2}\sum_{i=1}^N\sum_{t=1}^T\left(X_{it} - \bar X_{i\cdot}\right)^2\left(\hat u_{it} - \bar{\hat u}_{i\cdot}\right)^2,$$
where
$$S_{XX} = \sum_{i=1}^N\sum_{t=1}^T\left(X_{it} - \bar X_{i\cdot}\right)^2.$$
(b) Compute the standard deviation sd(β̂) of the finite sample distribution of β̂.
(c) Compute the bias, std and rmse (root mean squared error) of $\hat\sigma_\beta$ and $\tilde\sigma_\beta$. According to these criteria, which standard error estimator performs better?
(d) Repeat (a)-(c) for T = 10, 20. Does the relative rmse advantage of the two estimators change as T grows?
(f) Finally, do you think your results will change if you change the value of $\beta$, say let $\beta = 314.15926$?
3. The debate regarding crime and guns is of course long running. The book ‘More Guns,
Less Crime: Understanding Crime and Gun Control Laws’ by Lott (American Enterprise
Institute) loudly made the claim that ‘shall’ laws reduce crime based on correlation analysis.
In this question, we will evaluate the claim and see whether we can shoot down the ‘More
Guns, Less Crime’ hypothesis (Ayres and Donohue III in the Stanford Law Review (2003)).
The book received 4.5 out of five stars at Amazon.com and there are 175 customer reviews.
Everybody has something to say about this issue. Let’s see what we can conclude from
econometric analysis.
The questions are based on the dataset handguns.dta which you can download from the
Ted. The data consists of data from 50 States plus DC for each year from 1977 to 1999. The
data we will be analyzing are crime rates for various crime definitions provided by the Bureau of Justice Statistics. The variables are described in the STATA data set. The main regressor we will be focusing on is a dummy variable for whether or not the state allows widespread carrying of concealed weapons. The variable shall is one for states which have 'shall issue' laws, under which licenses must be given to all applicants that are citizens, mentally competent, and have not been convicted of a felony. For background, see
http://en.wikipedia.org/wiki/More_Guns,_Less_Crime#Shall_issue_laws
or http://islandia.law.yale.edu/ayers/Ayres_Donohue_article.pdf
Note: you do not need to submit your STATA output. However, please submit your Stata
do file.
I. We will examine the effect of shall on rates of violent crime, murder rates and robberies.
To this end, run regressions of the logs of each of these variables on shall (including an
intercept) with the robust option. Report the results in a table with a column for each
regression and the values and their standard errors in rows. That is, fill in the following table:
                ln(vio)     ln(mur)     ln(rob)
β̂₀              6.13        ___         ___
                (0.02)      (   )       (   )
β̂₁ (shall)     −0.443       ___         ___
                (0.048)     (   )       (   )
R²               0.09        ___         ___
(a) What is the effect of 'shall' laws on each of the crime rates? Are the effects large statistically? Explain.
To get started, you can first download the file ‘handguns.dta’ from the course webpage
and then use the following commands in your STATA do file. A do file is a text file that
contains a sequence of STATA commands. If you do not feel like using do files, you can type
the commands in the STATA command window. In this case, omit the delimiter semicolon ‘;’.
clear
clear matrix
#delimit ;
cd "D:\Teaching\";
use handguns.dta;
desc;
summarize;
gen log_vio=log(vio);
gen log_mur=log(mur);
gen log_rob=log(rob);
For this problem set, Stata may need some help with memory allocation. Because we will
set up lots of dummy variables, we need to allocate memory in a way to do this. So, at the
start of your do file, include the commands: set memory 50m and set matsize 300
II. Now we will control for a number of variables. First, it is well understood that de-
mographic variables play a role. Many have argued socioeconomic variables also play a part.
Most also would at least hope that jail is a deterrent. Run the above regressions but now
add the variables incarc_rate, density, pop, pm1029, and avginc to the regression. Report the
(b) Is the difference between the results here and the results from Question (I) large in a practical sense?
                ln(vio)     ln(mur)     ln(rob)
β̂₀               ___       −0.17        ___
                (   )       (0.29)      (   )
β̂₁ (shall)      ___       −0.309        ___
                (   )       (0.037)     (   )
R²               ___        0.55         ___
Note: incarc_rate, density, pop, pm1029, and avginc should be included in the regressions, but their coefficients need not be reported in the table.
III. One omitted variable from the above analysis is differences in laws and law enforcement across states and time. We want to understand how this might affect results to provide more foundation for the internal validity of the results. Recall the omitted variable bias formula:
$$\hat\beta_1 \to \beta_1 + \frac{\mathrm{cov}(X_{1i}, u_i)}{\mathrm{var}(X_{1i})}.$$
Stronger laws would hopefully deter crime, especially crimes that are more rational in nature like robberies, and perhaps violence. In this sense we would expect that stronger laws would be associated with lower crime rates.
(a) Typically ‘shall’ laws are pushed using law and order arguments. States with a larger
‘law and order’ constituency would have stronger laws and would be more likely to have ‘shall’
laws. What does this suggest about the sign of $\mathrm{cov}(X_{1i}, u_i)$, where $X_{1i}$ is the dummy variable for 'shall'?
IV. Since we have a panel data set, we are able to control for omitted variables that are
constant over time. We want to run the same regressions (i.e. use the same control variables)
as in QII, but now add state fixed effects. Do this for each of the three dependent variables we have examined, and construct three tables (one for each dependent variable). In each table, report the coefficient on 'shall' along with its standard error, and a test for the inclusion of state effects where applicable.
Each table should look like the following (with the entries added instead of the XX’s, of
course).
Dep=ln(violence) 1 2
(a) Describe the effect of controlling for state effects on the coefficient estimate for the 'shall' variable.
(b) What does this tell us about omitted variables in the specification without state or time effects?
(c) What is the statistical evidence that state dummy variables should be included?
(d) Do these results suggest that the arguments in QIII are correct?
(e) What types of effects do you think the time fixed effects are capturing?
Stata issues:
The command tab state, generate(statedummy) will take a variable in your data set called state, which has a number for each state, and construct dummy variables named statedummy1, statedummy2, etc., with statedummy1=1 for state equal to 1 and zero otherwise, statedummy2=1 for state equal to 2 and zero otherwise, etc.
testparm statedummy*;
/* if you want to compute standard errors that are robust to the time series correlation */
reg log_vio shall incarc_rate density pop pm1029 avginc statedummy*, cluster(state);
testparm statedummy*;
Note: testparm provides a useful alternative to test that permits a varlist rather than a list of coefficients (which is often nothing more than a list of variables), allowing the use of standard variable-name wildcards.
4.(Estimating Linear Panel Data Models. Not covered in year 2017) For this problem,
please form a group of two students. The group are required to use both Matlab and STATA
to solve the problem. Presumably, one student in the group focuses on the Matlab solution
and the other one focuses on the STATA solution. The group should communicate and share
their programming experiences. Please include names of the group members in your code and
publish the code to TED. That is, post your code and results to a homepage and report the URL.
Download the data file ret_edu.xls from the course web site. The panel data are drawn
from years 1976-1982 of the non-Survey of Economic Opportunity portion of the Panel Study
of Income Dynamics (PSID). The individuals in the sample are 595 heads of household between
the ages of 18 and 65 in 1976, who report a positive wage in some private, non-farm employment
area).
X1 and Z1 are assumed to be exogenous so that X1it and Z1i are uncorrelated with αi and
uis , for all t and s, while X2 and Z2 are endogenous because X2it and Z2i are correlated with
(a) Estimate the model using the within estimator $\hat\beta_{within}$.
(b) Estimate the model using the GLS estimator (please follow the procedure outlined in
the lecture note. The GLS is OLS based on the following equation:
(c) Compare (a) and (b) through the use of Hausman (1978) specification tests. What can
(d) Using the following instruments suggested by Hausman and Taylor (1981)
estimate (3.90) by IV approach. What assumptions do you need to ensure that A contains
valid instruments? Compare the estimate with (a) through the use of Hausman tests.
(e) Consider the alternative instrument set suggested by Amemiya and MaCurdy (1986)
where
$$(QX_1)^* = \left(X_{1,i1}-\bar X_{1,i\cdot}\quad X_{1,i2}-\bar X_{1,i\cdot}\quad X_{1,i3}-\bar X_{1,i\cdot}\quad \cdots\quad X_{1,iT}-\bar X_{1,i\cdot}\right) \otimes l_T \qquad (3.93)$$
Estimate (3.90) using $A_{AM}$ as instruments. What assumptions do you need to ensure that $A_{AM}$ contains valid instruments?
(f) As more instruments are added, we expect to have more efficient estimates. Comparing the standard errors in (d) and (e), do you find any noticeable reduction in standard errors?
(g) Discuss your estimation results with an emphasis on the return to education.
Chapter 4
4.1 Models with Sequentially Exogenous Variables
Consider the linear panel data model below:
$$Y_{i,t} = X_{i,t}\beta_0 + \varepsilon_{i,t}, \quad i = 1, \ldots, N, \quad t = 1, \ldots, T.$$
Unlike under strict exogeneity, the error $\varepsilon_{i,t}$ is allowed to be correlated with future values of $X_{i,t}$, i.e., with $(X_{i,t+1}, X_{i,t+2}, \ldots, X_{i,T})$, in the trajectory
$$(X_{i,1},\ X_{i,2},\ X_{i,3},\ \cdots,\ X_{i,T}).$$
In a dynamic panel data model where $X_{i,t} = Y_{i,t-1}$, the error $\varepsilon_{i,t}$ is obviously correlated with future values of $X_{i,t}$, as can be seen using
$$Y_{i,t} = Y_{i,t-1}\beta_0 + \varepsilon_{i,t}. \qquad (4.3)$$
Note on notation: in the case of dynamic panels, we assume that $Y_{i0}$ is available.
To identify the model parameters, we introduce the so-called sequential moment restriction:
$$E\left(\varepsilon_{i,t}\,|\,X_{i,t}, X_{i,t-1}, \ldots, X_{i,1}\right) = 0.$$
Example 7 Suppose
So ui,t is correlated with future values of Yi,t−1 and is allowed to be correlated with future
values of $Z_{i,t}$.

Example 8 In a consumption-based asset pricing model, the Euler equation implies
$$E\left[\frac{U_c^t(C_t)}{U_c^{t-1}(C_{t-1})}R_t\,\Big|\,I_{t-1}\right] = 1. \qquad (4.9)$$
Suppose
$$U_t = \frac{C_t^{1-\gamma} - 1}{1-\gamma}, \qquad U_c^t = C_t^{-\gamma}. \qquad (4.10)$$
Then
$$E\left[\left(\frac{C_t}{C_{t-1}}\right)^{-\gamma}R_t\,\Big|\,I_{t-1}\right] = 1, \qquad (4.11)$$
or
$$E\left[\left(C_t^{-\gamma}R_t - C_{t-1}^{-\gamma}\right)Z_{t-1}\right] = 0 \qquad (4.13)$$
for any $Z_{t-1}$ measurable with respect to $I_{t-1}$.
4.2 Properties of FE and FD Estimators under Sequential Exogeneity

4.2.1 Inconsistency of the FE Estimator

Under sequential exogeneity only, the FE estimator is in general inconsistent for fixed T:
$$\operatorname*{plim}_{N\to\infty}\hat\beta_{FE} = \beta_0 + \left[\frac{1}{T}\sum_{t=1}^T E\left(X_{i,t}-\bar X_{i\cdot}\right)'\left(X_{i,t}-\bar X_{i\cdot}\right)\right]^{-1}\frac{1}{T}\sum_{t=1}^T E\left(X_{i,t}-\bar X_{i\cdot}\right)'\left(u_{i,t}-\bar u_{i\cdot}\right).$$
Under the sequential moment restriction, $EX_{i,t}'u_{i,s} = 0$ for $t \le s$, so
$$\frac{1}{T}\sum_{t=1}^T E\left(X_{i,t}-\bar X_{i\cdot}\right)'\left(u_{i,t}-\bar u_{i\cdot}\right) = -E\,\bar X_{i\cdot}'\bar u_{i\cdot} = -\frac{1}{T}\left(\frac{1}{T}\sum_{t=1}^T\sum_{s=1}^T EX_{i,t}'u_{i,s}\right). \qquad (4.15)$$
The double sum $\frac{1}{T}\sum_{t=1}^T\sum_{s=1}^T EX_{i,t}'u_{i,s}$ can be regarded as the finite sample version of the long run covariance between $X_{i,t}$ and $u_{i,t}$. It is reasonable to assume that
$$\frac{1}{T}\sum_{t=1}^T\sum_{s=1}^T EX_{i,t}'u_{i,s} = O(1).$$
Under this assumption, we have
$$\operatorname*{plim}_{N\to\infty}\hat\beta_{FE} = \beta_0 + O\!\left(\frac{1}{T}\right) \text{ as } T\to\infty, \qquad (4.16)$$
so the FE estimator is biased for fixed T, but the bias vanishes as T grows.
4.2.2 Inconsistency of the FD Estimator
Now
$$\frac{1}{T-1}\sum_{t=2}^T E\Delta X_{i,t}'\Delta u_{i,t} = \frac{1}{T-1}\sum_{t=2}^T E\left(X_{i,t} - X_{i,t-1}\right)'\left(u_{i,t} - u_{i,t-1}\right)$$
$$= \frac{1}{T-1}\sum_{t=2}^T\left(EX_{i,t}'u_{i,t} - EX_{i,t}'u_{i,t-1} - EX_{i,t-1}'u_{i,t} + EX_{i,t-1}'u_{i,t-1}\right) = -\frac{1}{T-1}\sum_{t=2}^T EX_{i,t}'u_{i,t-1},$$
where the last equality uses the sequential moment restriction. It is reasonable to assume that each of the summands is O(1) and that
$$\frac{1}{T-1}\sum_{t=2}^T EX_{i,t}'u_{i,t-1} = O(1).$$
Hence
$$\operatorname*{plim}_{N\to\infty}\hat\beta_{FD} = \beta_0 + O(1) \text{ as } T\to\infty, \qquad (4.19)$$
so the FD estimator is inconsistent, and its bias does not vanish even as T grows.
Example 9 (Nickell (1981)) Dynamic Panel Data Model: $Y_{i,t} = Y_{i,t-1}\beta_0 + \alpha_i + u_{i,t}$. Assume $|\beta_0| < 1$ so that $Y_{i,t}$ is (weakly) stationary (we will investigate the consequences of relaxing this assumption later), with
$$Y_{i,t} = \frac{\alpha_i}{1-\beta_0} + \sum_{s=0}^\infty \beta_0^su_{i,t-s}.$$
Assume $u_{i,t}$ is iid over $i$ and $t$. The fixed effects (LSDV) estimator is
$$\hat\beta = \frac{\sum_{i=1}^N\sum_{t=1}^T\left(Y_{i,t}-\bar Y_{i\cdot}\right)\left(Y_{i,t-1}-\bar Y_{i,-1}\right)}{\sum_{i=1}^N\sum_{t=1}^T\left(Y_{i,t-1}-\bar Y_{i,-1}\right)^2} = \beta_0 + \frac{\sum_{i=1}^N\sum_{t=1}^T\left(u_{i,t}-\bar u_{i\cdot}\right)\left(Y_{i,t-1}-\bar Y_{i,-1}\right)/(NT)}{\sum_{i=1}^N\sum_{t=1}^T\left(Y_{i,t-1}-\bar Y_{i,-1}\right)^2/(NT)}. \qquad (4.21)$$
The probability limit of the second term of (4.21) gives the asymptotic bias of the LSDV estimator. Since $u_{i,t}$ is independent of $Y_{i,t-1}$, the numerator converges to $-E\bar Y_{i,-1}\bar u_{i\cdot}$. Using the MA($\infty$) representation and $E\alpha_iu_{i,t} = 0$, and collecting the surviving terms $Eu_{i,q}u_{i,\tau} = \sigma_u^2\,1\{q=\tau\}$,
$$-E\bar Y_{i,-1}\bar u_{i\cdot} = -E\left(\frac{1}{T}\sum_{t=1}^T\sum_{s=0}^\infty\beta_0^su_{i,t-1-s}\right)\left(\frac{1}{T}\sum_{\tau=1}^Tu_{i,\tau}\right) = -\frac{\sigma_u^2}{T^2}\sum_{q=1}^{T-1}\frac{\beta_0^{T-q}-1}{\beta_0-1} = -\frac{T-1-\beta_0T+\beta_0^T}{T^2\left(\beta_0-1\right)^2}\sigma_u^2. \qquad (4.23)$$
Similarly, some tedious algebra shows that
$$T^{-1}\sum_{t=1}^T E\left(Y_{i,t-1}-\bar Y_{i,-1}\right)^2 = -\frac{\sigma_u^2}{T^2\left(\beta_0-1\right)^2\left(\beta_0^2-1\right)}\left(\beta_0^2T^2 + T\beta_0^2 - 2\beta_0T^2 + T^2 - T + 2\beta_0 - 2\beta_0^{T+1}\right). \qquad (4.26)$$
The asymptotic bias is therefore the ratio of (4.23) to (4.26):
$$\operatorname*{plim}_{N\to\infty}\hat\beta - \beta_0 = \frac{\left(T-1-\beta_0T+\beta_0^T\right)\left(\beta_0^2-1\right)}{\beta_0^2T^2 + T\beta_0^2 - 2\beta_0T^2 + T^2 - T + 2\beta_0 - 2\beta_0^{T+1}}. \qquad (4.27)$$
When T = 2, the bias reduces to $-\left(\frac{1}{2}\beta_0 + \frac{1}{2}\right)$. When T = 3, the bias reduces to $-\left(\beta_0+1\right)\left(\beta_0+2\right)/\left(2\left(\beta_0+3\right)\right)$. When T is large,
$$\operatorname*{plim}_{N\to\infty}\hat\beta - \beta_0 = \frac{-\left(\beta_0-1\right)\left(\beta_0^2-1\right)}{\beta_0^2-2\beta_0+1}\cdot\frac{1}{T}\left(1+o(1)\right) = -\frac{\beta_0+1}{T}\left(1+o(1)\right). \qquad (4.28)$$
• For small T and $\beta_0 > 0$, we can see that the bias is always negative.
• When T is large, the right-hand side variables become asymptotically uncorrelated with the demeaned error, and the bias vanishes at rate 1/T.
• In the derivation, $Y_{i,0}$ is assumed to be drawn from the stationary distribution, just as any other $Y_{i,t}$ for $t > 0$. Sometimes we assume instead that $Y_{i,0}$ is a fixed constant; in this case, the exact expression for the asymptotic bias will be different.
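The exact bias formula (4.27) is easy to code and check against the special cases above (a small verification sketch, not part of the note):

```python
def nickell_bias(b, T):
    """Exact LSDV (Nickell) bias from (4.27): plim(beta_hat) - beta_0."""
    num = (T - 1 - b * T + b**T) * (b**2 - 1)
    den = (b**2 * T**2 + T * b**2 - 2 * b * T**2
           + T**2 - T + 2 * b - 2 * b**(T + 1))
    return num / den

# T = 2 special case: bias = -(1 + beta_0)/2 -> -0.75 for beta_0 = 0.5
print(abs(nickell_bias(0.5, 2) + 0.75) < 1e-12)
# Large T: bias ~ -(1 + beta_0)/T, per (4.28)
print(abs(nickell_bias(0.5, 200) + 1.5 / 200) < 1e-4)
```

Both checks print `True`, confirming the closed form matches the T = 2 reduction and the $-(1+\beta_0)/T$ approximation.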
Exercise 10 Calculate the asymptotic bias of the FD estimator. Compare it with that of the
FE estimator.
4.3 IV Estimation (Anderson and Hsiao)

Under the sequential moment restriction, we have
$$E\Delta X_{i,s}'\Delta u_{i,t} = 0 \quad \text{for all } s = 2, \ldots, t-1. \qquad (4.31)$$
So, at time $t$, we can use $\Delta X_{i,t-1}$ as the potential instrument for $\Delta X_{i,t}$ (Anderson and Hsiao (1981)): validity holds because
$$E\Delta X_{i,t-1}'\Delta u_{i,t} = 0,$$
and relevance requires $\Delta X_{i,t-1}$ to be correlated with $\Delta X_{i,t}$.
• Estimate $\Delta Y_{i,t} = \Delta X_{i,t}\beta_0 + \Delta u_{i,t}$ by pooled 2SLS using $\Delta X_{i,t-1}$ as instruments. When T = 3, for example, the instrument for $\Delta X_{i,3} = X_{i,3} - X_{i,2}$ is $\Delta X_{i,2} = X_{i,2} - X_{i,1}$.
• Rather than use lagged $\Delta X_{i,t}$ as instruments, we can use lagged $X_{i,t}$ in levels as instruments. For example, choosing $(X_{i,t-1}, X_{i,t-2})$ as instruments is at least as efficient as the procedure here that uses $\Delta X_{i,t-1}$ as instruments. The former also gives $k$ overidentifying restrictions that can be used to test the sequential exogeneity. It has been found that the estimator resulting from instrumenting using differences has a singularity point and very large variances over a significant range of parameter values. Instrumenting using levels does not lead to the singularity problem and results in much smaller variances, and so is preferable.
• When T = 2, we have only
$$\Delta Y_{i,2} = \Delta X_{i,2}\beta_0 + \Delta u_{i,2},$$
and we use $X_{i,1}$ as the instrument for $\Delta X_{i,2}$. In this case, $\beta_0$ may be poorly identified if the instrument is only weakly correlated with $\Delta X_{i,2}$.
Example 11 Let
$$Y_{i,t} = \rho_0Y_{i,t-1} + \alpha_i + u_{i,t}, \qquad (4.32)$$
then the simplest IV estimators are
$$\hat\beta_{IV,1} = \frac{\sum_{i=1}^N\sum_{t=3}^T\left(Y_{i,t} - Y_{i,t-1}\right)\left(Y_{i,t-2} - Y_{i,t-3}\right)}{\sum_{i=1}^N\sum_{t=3}^T\left(Y_{i,t-1} - Y_{i,t-2}\right)\left(Y_{i,t-2} - Y_{i,t-3}\right)}, \qquad (4.33)$$
$$\hat\beta_{IV,2} = \frac{\sum_{i=1}^N\sum_{t=3}^T\left(Y_{i,t} - Y_{i,t-1}\right)Y_{i,t-2}}{\sum_{i=1}^N\sum_{t=3}^T\left(Y_{i,t-1} - Y_{i,t-2}\right)Y_{i,t-2}}. \qquad (4.34)$$
Consider the case where $\alpha_i = 0$ and $\rho_0 = 1$, so that
$$Y_{i,t} = Y_{i,t-1} + u_{i,t} \qquad (4.35)$$
and
$$\Delta Y_{i,t} = u_{i,t}. \qquad (4.36)$$
First differencing yields
$$\Delta Y_{i,t} = \rho_0\Delta Y_{i,t-1} + \Delta u_{i,t}. \qquad (4.37)$$
Notice that $\Delta Y_{i,t-s}$ is uncorrelated with $\Delta Y_{i,t-1}$ for $s = 2, \ldots, t-2$, so $\Delta Y_{i,t-s}$ is not a relevant instrument for $\Delta Y_{i,t-1}$.
great
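To make (4.33) and (4.34) concrete, here is a short Python sketch (my own illustration, not part of the original notes; all settings are arbitrary) that simulates the AR(1) panel (4.32) and computes both Anderson-Hsiao estimators:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, rho = 500, 8, 0.5

# Simulate Y_{i,t} = rho * Y_{i,t-1} + alpha_i + u_{i,t}, t = 1, ..., T
alpha = rng.standard_normal(N)
Y = np.zeros((N, T + 1))
for t in range(1, T + 1):
    Y[:, t] = rho * Y[:, t - 1] + alpha + rng.standard_normal(N)

dY = np.diff(Y, axis=1)            # dY[:, t-1] = Y_{i,t} - Y_{i,t-1}

num1 = den1 = num2 = den2 = 0.0
for t in range(3, T + 1):          # t = 3, ..., T as in (4.33)-(4.34)
    z_diff = Y[:, t - 2] - Y[:, t - 3]           # difference instrument
    num1 += np.sum(dY[:, t - 1] * z_diff)
    den1 += np.sum(dY[:, t - 2] * z_diff)
    num2 += np.sum(dY[:, t - 1] * Y[:, t - 2])   # level instrument
    den2 += np.sum(dY[:, t - 2] * Y[:, t - 2])

beta_iv1, beta_iv2 = num1 / den1, num2 / den2
print(beta_iv1, beta_iv2)          # both should be near rho = 0.5
```

Here β̂_IV,1 instruments ∆Y_{i,t−1} with the difference Y_{i,t−2} − Y_{i,t−3}, while β̂_IV,2 uses the level Y_{i,t−2}; the levels version typically has the smaller variance, as discussed above.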
4.4 Panel GMM estimator (Arellano and Bond)
4.4.1 The GMM Estimator: Definition
The Anderson-Hsiao instrumental variables estimator may be consistent, but it is not efficient because it does not take into account all the available moment restrictions. Arellano and Bond (1991) propose a more efficient estimator which uses additional moment restrictions.
Consider

∆Y_{i,t} = ∆X_{i,t} β_0 + ∆u_{i,t}. (4.38)

At t = 2, we have

Y_{i,2} − Y_{i,1} = (X_{i,2} − X_{i,1}) β_0 + u_{i,2} − u_{i,1}. (4.39)
We can convincingly argue that X_{i,1} is a valid instrument for (X_{i,2} − X_{i,1}), since it is likely to be correlated with (X_{i,2} − X_{i,1}), and is not correlated with u_{i,2} − u_{i,1}. At t = 3, we have
Y_{i,3} − Y_{i,2} = (X_{i,3} − X_{i,2}) β_0 + u_{i,3} − u_{i,2}. (4.40)
Here both X_{i,1} and X_{i,2} are valid instruments: both are likely to be correlated with (X_{i,3} − X_{i,2}), and neither is correlated with u_{i,3} − u_{i,2}. Proceeding in this manner, we can see that at time t, the valid instrument set is X^o_{i,t−1}, where

X^o_{i,t−1} = (X_{i,1}, X_{i,2}, . . . , X_{i,t−1}). (4.41)
Collect the instruments for all time periods into the block-diagonal matrix

Z_i = diag( X^o_{i,1}, X^o_{i,2}, . . . , X^o_{i,T−1} ). (4.42)
With this Z_i,

Z_i′ ∆u_i = ( (X^o_{i,1})′ ∆u_{i,2}, (X^o_{i,2})′ ∆u_{i,3}, . . . , (X^o_{i,T−1})′ ∆u_{i,T} )′. (4.43)
The moment condition E Z_i′ ∆u_i = 0 is not enough to identify β_0. We need the following rank condition:

Assumption GMM: Rank(E Z_i′ ∆X_i) = k, where k is the number of regressors after first differencing.
Note that Z_i′ ∆X_i stacks the blocks (X^o_{i,t−1})′ ∆X_{i,t}:

Z_i′ ∆X_i = ( (X^o_{i,1})′ ∆X_{i,2} ; (X^o_{i,2})′ ∆X_{i,3} ; . . . ; (X^o_{i,T−1})′ ∆X_{i,T} ). (4.44)
Under the assumption E Z_i′ ∆u_i = 0 and Rank(E Z_i′ ∆X_i) = k, β_0 is the unique vector that solves the population moment equations

E Z_i′ (∆Y_i − ∆X_i β) = 0. (4.45)

In general, we have k(T(T − 1)/2) > k (the only exception is T = 2), so the above equation is overidentified and its sample analogue generally has no exact solution. Letting W_N be a k(T(T − 1)/2) × k(T(T − 1)/2) symmetric and positive semi-definite matrix (with enough rank), a GMM estimator of β_0 is
β̂_GMM = arg min_β ( Σ_{i=1}^N Z_i′ (∆Y_i − ∆X_i β) )′ W_N ( Σ_{i=1}^N Z_i′ (∆Y_i − ∆X_i β) ). (4.47)
The solution is

β̂_GMM = [ ( N^{−1} Σ_{i=1}^N ∆X_i′ Z_i ) W_N ( N^{−1} Σ_{i=1}^N Z_i′ ∆X_i ) ]^{−1} (4.48)
× ( N^{−1} Σ_{i=1}^N ∆X_i′ Z_i ) W_N ( N^{−1} Σ_{i=1}^N Z_i′ ∆Y_i ). (4.49)
Let Z, ∆X, and ∆Y stack the individual blocks:

Z = (Z_1′, Z_2′, . . . , Z_N′)′, ∆X = (∆X_1′, . . . , ∆X_N′)′, ∆Y = (∆Y_1′, . . . , ∆Y_N′)′. (4.50)
Then

β̂_GMM = ( ∆X′ Z W_N Z′ ∆X )^{−1} ∆X′ Z W_N Z′ ∆Y. (4.51)
4.4.2 The GMM Estimator: Asymptotics
Under the orthogonality and rank conditions, we can show that β̂_GMM is consistent. Note that
β̂_GMM − β_0 = [ ( N^{−1} Σ_{i=1}^N ∆X_i′ Z_i ) W_N ( N^{−1} Σ_{i=1}^N Z_i′ ∆X_i ) ]^{−1}
× ( N^{−1} Σ_{i=1}^N ∆X_i′ Z_i ) W_N ( N^{−1} Σ_{i=1}^N Z_i′ ∆u_i ). (4.52)
But

N^{−1} Σ_{i=1}^N Z_i′ ∆X_i →p E Z_i′ ∆X_i = C (4.53)
with rank(C) = k, and

N^{−1} Σ_{i=1}^N Z_i′ ∆u_i →p E Z_i′ ∆u_i = 0. (4.54)
Hence, with W_N →p W_1,

β̂_GMM − β_0 →p ( C′ W_1 C )^{−1} C′ W_1 · 0 = 0, (4.55)

since rank(C′ W_1 C) = k. Under additional regularity conditions,

√N ( β̂_GMM − β_0 ) ⇒ N(0, V_β), (4.56)

where

V_β = ( C′ W_1 C )^{−1} C′ W_1 Λ W_1 C ( C′ W_1 C )^{−1}, (4.57)
and ‘⇒’ signifies convergence in distribution. This follows easily from the fact that
N^{−1/2} Σ_{i=1}^N Z_i′ ∆u_i ⇒ N(0, Λ), Λ := E( Z_i′ ∆u_i ∆u_i′ Z_i ), (4.58)

provided that some moment conditions and cross-sectional independence hold.
Let

W_N = ( N^{−1} Σ_{i=1}^N Z_i′ Z_i )^{−1} = ( Z′Z/N )^{−1}. (4.59)
Then

β̂_GMM = ( ∆X′ Z W_N Z′ ∆X )^{−1} ∆X′ Z W_N Z′ ∆Y
= ( ∆X′ Z (Z′Z)^{−1} Z′ ∆X )^{−1} ∆X′ Z (Z′Z)^{−1} Z′ ∆Y. (4.60)
Let

W_N = ( N^{−1} Σ_{i=1}^N Z_i′ G Z_i )^{−1},
where G is the (T − 1) × (T − 1) tridiagonal matrix

     (  2  −1   0  · · ·   0   0 )
     ( −1   2  −1  · · ·   0   0 )
G =  (  0  −1   2  · · ·   0   0 )    (4.61)
     (  .   .   .  · · ·   .   . )
     (  0   0   0  · · ·   2  −1 )
     (  0   0   0  · · ·  −1   2 ).
Here, for two matrices A and B, A ≤ B signifies the negative semi-definiteness of the difference A − B.
• Let β̂_2SLS denote the preliminary pooled 2SLS estimator.

• Define ∆ũ_i = ∆Y_i − ∆X_i β̂_2SLS.

• Construct the estimator Λ̂ = N^{−1} Σ_{i=1}^N Z_i′ ∆ũ_i ∆ũ_i′ Z_i and choose W_N = Λ̂^{−1}.

W_N is a consistent estimator of Λ^{−1} under general conditions. It is easy to see that the consistency and asymptotic normality of β̂_GMM remain valid with W_N = Λ̂^{−1}.
The weight matrix based on G is asymptotically optimal when

E( Z_i′ ∆u_i ∆u_i′ Z_i ) = E( Z_i′ G Z_i ). (4.63)

A sufficient condition is

E( ∆u_i ∆u_i′ | Z_i ) = G.
• Wald test. With unrestricted residuals ∆û_i = ∆Y_i − ∆X_i β̂_GMM and restricted residuals ∆û_{r,i} under the null hypothesis, restrictions on β_0 can be tested by comparing the GMM criterion values

( Σ_{i=1}^N Z_i′ ∆û_i )′ W_N ( Σ_{i=1}^N Z_i′ ∆û_i ) (4.64)

and

( Σ_{i=1}^N Z_i′ ∆û_{r,i} )′ W_N ( Σ_{i=1}^N Z_i′ ∆û_{r,i} ). (4.65)
Question: What if

Z_i = diag( X_{i,1}, X_{i,2}, . . . , X_{i,T−1} )? (4.66)
In the previous sections, we maintained the assumption that the X_{i,t} are sequentially exogenous and are correlated with α_i. Now suppose that there are other types of independent variables. Consider

Y_{i,t} = X_{i,t} β_0 + W_{i,t} δ_0 + α_i + u_{i,t}, (4.67)

where X_{i,t} is sequentially exogenous while W_{i,t} is strictly exogenous. Then the instruments can be collected as
Z_i = diag( (X^o_{i,1}, W^o_{i,T}), (X^o_{i,2}, W^o_{i,T}), . . . , (X^o_{i,T−1}, W^o_{i,T}) ), (4.68)

where W^o_{i,T} = (W_{i,1}, W_{i,2}, . . . , W_{i,T}). Now, Z_i′ ∆u_i becomes
Z_i′ ∆u_i = ( (X^o_{i,1})′ ∆u_{i,2}, (W^o_{i,T})′ ∆u_{i,2}, (X^o_{i,2})′ ∆u_{i,3}, (W^o_{i,T})′ ∆u_{i,3}, . . . , (X^o_{i,T−1})′ ∆u_{i,T}, (W^o_{i,T})′ ∆u_{i,T} )′. (4.69)
We consider the case that W_{i,t} is sequentially exogenous and cov(W_{i,t}, α_i) = 0. In this case, observations on W_{i,t} up to and including t = s are valid instruments for the levels equation at t = s.
To combine the moment conditions for both the first-differenced equations and the levels equations, define ε^+_i, Y^+_i, and X^+_i by stacking the differenced equations on top of the levels equations: ε^+_i stacks ∆u_{i,2}, . . . , ∆u_{i,T} on top of the levels errors ε_{i,1}, ε_{i,2}, . . . , ε_{i,T}; Y^+_i stacks ∆Y_{i,2}, . . . , ∆Y_{i,T} on top of Y_{i,1}, Y_{i,2}, . . . , Y_{i,T}; and X^+_i stacks the differenced regressors on top of the levels regressors (X_{i,1}, W_{i,1}), (X_{i,2}, W_{i,2}), . . . , (X_{i,T}, W_{i,T}). (4.70)
Denote

Z^+_i = diag( X^o_{i,1}, X^o_{i,2}, . . . , X^o_{i,T−1}, W^o_{i,1}, W^o_{i,2}, . . . , W^o_{i,T} ), (4.71)

where the first T − 1 blocks instrument the differenced equations and the last T blocks instrument the levels equations.
Then

(Z^+_i)′ ε^+_i = ( (X^o_{i,1})′ ∆u_{i,2}, (X^o_{i,2})′ ∆u_{i,3}, . . . , (X^o_{i,T−1})′ ∆u_{i,T}, (W^o_{i,1})′ ε_{i,1}, (W^o_{i,2})′ ε_{i,2}, . . . , (W^o_{i,T})′ ε_{i,T} )′. (4.72)
We now consider the case that W_{i,t} is strictly exogenous and cov(W_i, α_i) = 0. In this case, the observations on W_{i,t} for all periods become valid instruments in the levels equations. Using
Z^+_i = diag( X^o_{i,1}, X^o_{i,2}, . . . , X^o_{i,T−1}, W^o_{i,T}, W^o_{i,T}, . . . , W^o_{i,T} ), (4.73)

where the full history W^o_{i,T} now instruments every levels equation,
we have E (Z^+_i)′ ε^+_i = 0. Again, we can use the GMM approach as before.
There is a large literature on weak instruments, although there is no consensus on the definition of weak instruments. All researchers seem to agree that when the instruments are weakly correlated with the regressors, the problem of weak instruments is present. In this case, the GMM estimator may not be consistent. More recently, many papers find that using too many overidentifying restrictions leads to poor finite sample properties. In practice, it may be better to use a couple of lags (say 3) rather than lags all the way back to t = 1. A rigorous study of the weak instrument problem is beyond the scope of these notes.
The GMM estimator in the previous sections is in general consistent and asymptotically normal regardless of how the process is initialized. The only exception is that the instruments become weak for certain initializations. The GMM estimator is robust at the cost of ignoring information in the first observation. In the time series context, whether the first observation is used in the estimation does not matter for robustness and asymptotic efficiency. With short panels, the situation is fundamentally different. As shown by Blundell and Bond (1998) and Hahn (1999), imposing restrictions on the initial condition can greatly improve the efficiency of GMM over certain parts of the parameter space. In this section, we will not discuss how to incorporate the information in the first observation in the GMM framework. Instead, we take a likelihood approach that models the initial observation explicitly.
To use information in the first observation Y_{i,0}, we need to specify how Y_{i,0} is generated. We assume that

Y_{i,0} = δ_0 + δ_1 α_i + v_i,

where v_i ∼ iid N(0, σ²_v) and is independent of α_i and {u_{i,t}}_{t=1}^T. Some special cases of this specification are:
c. δ_0 = µ/(1 − β), δ_1 = 1/(1 − β) and σ²_v = σ²_u/(1 − β²): when X_{i,t} = 0, Y_{i,0} follows the stationary distribution of the process.

The likelihood function of the data (for i = 1, 2, . . . , N) is:
L = Π_{i=1}^N f(Y_{i,0}, Y_{i,1}, . . . , Y_{i,T} | X_i) = Π_{i=1}^N f(Y_{i,1}, . . . , Y_{i,T} | Y_{i,0}, X_i) f(Y_{i,0} | X_i), (4.75)

with

Π_{i=1}^N f(Y_{i,0} | X_i) = (2π)^{−N/2} (σ²_0)^{−N/2} exp( − Σ_{i=1}^N (Y_{i,0} − δ_0)²/(2σ²_0) ),
where

σ²_0 = δ²_1 σ²_α + σ²_v.

Conditional on Y_{i,0},

α_i ∼ N( φ(Y_{i,0} − δ_0), σ²_α − φ² σ²_0 ), where φ = δ_1 σ²_α / σ²_0.
Therefore, conditional on Y_{i,0}, ε_i = (ε_{i,1}, ε_{i,2}, . . . , ε_{i,T})′ has mean φ(Y_{i,0} − δ_0) in each component and variance Ω, where

Ω = ( σ²_α − φ² σ²_0 ) J_T + σ²_u I_T
 := σ²_{α|0} J_T + σ²_u I_T.
So Π_{i=1}^N f(ε_{i,1}, . . . , ε_{i,T} | Y_{i,0}, X_i) is

(2π)^{−NT/2} |Ω|^{−N/2} exp( −(1/2) Σ_{i=1}^N e_i′ Ω^{−1} e_i ).
Using

Π_{i=1}^N f(Y_{i,0}, Y_{i,1}, . . . , Y_{i,T} | X_i) = Π_{i=1}^N f(Y_{i,1}, . . . , Y_{i,T} | Y_{i,0}, X_i) f(Y_{i,0} | X_i),

we have

L(σ²_0, σ²_u, σ²_α, δ_0, µ, β, γ, φ)
= (2π)^{−N/2} (σ²_0)^{−N/2} exp( − Σ_{i=1}^N (Y_{i,0} − δ_0)²/(2σ²_0) )
× (2π)^{−NT/2} |Ω|^{−N/2} exp( −(1/2) Σ_{i=1}^N e_i′ Ω^{−1} e_i ),

where

e_i = Y_i − µ − β Y_{i,−1} − X_i γ − φ(Y_{i,0} − δ_0).
When φ = 0, the MLE reduces to the random effects estimator (or the GLS estimator) if the quasi-demeaning uses the MLE of θ. Therefore the random effects estimator is consistent when Y_{i0} is a fixed constant or Y_{i0} is random but uncorrelated with α_i.
Remark 13 The random effects estimator is inconsistent when φ ≠ 0. The consistency of the random effects estimator thus depends crucially on the initialization of the process.
The conditional likelihood is

L_C( σ²_u, σ²_{α|0}, µ̃, β, γ, φ )
= Π_{i=1}^N f(Y_{i,1}, . . . , Y_{i,T} | Y_{i,0}, X_i)
= (2π)^{−NT/2} |Ω|^{−N/2} exp( −(1/2) Σ_{i=1}^N e_i′ Ω^{−1} e_i ).

We can obtain the conditional MLE by maximizing this conditional likelihood. If there is no restriction on the DGP for Y_{i0}, then the conditional MLE is asymptotically as efficient as the MLE. Otherwise, the conditional MLE may be less efficient.
Using

Ω^{−1} = ( σ²_u + T σ²_{α|0} )^{−1} P_T + σ^{−2}_u Q_T := σ^{−2}_1 P_T + σ^{−2}_u Q_T

and

|Ω| = σ²_1 ( σ²_u )^{T−1},

we have

e_i′ Ω^{−1} e_i = σ^{−2}_1 e_i′ P_T e_i + σ^{−2}_u e_i′ Q_T e_i.
The likelihood function can be regarded as based on the following two equations: the between equation (in time averages, entering through P_T e_i) and the within equation (in deviations from time averages, entering through Q_T e_i). The conditional log-likelihood is then

log L_C = − (NT/2) log(2π) − (N/2) log σ²_1 − (N(T − 1)/2) log σ²_u
− (1/(2σ²_1)) Σ_{i=1}^N e_i′ P_T e_i − (1/(2σ²_u)) Σ_{i=1}^N e_i′ Q_T e_i.
Remark 16 If the model is correctly specified, the MLE will be asymptotically more efficient than the GMM estimator based on the first differenced equation. To improve the asymptotic efficiency of the GMM estimator, Ahn and Schmidt (1995), Arellano and Bover (1995) and Blundell and Bond (1998) proposed an additional set of moment conditions. See Ahn and Schmidt (1999) for a survey on the GMM approach applied to the dynamic panel context.
1. Write a Matlab program to investigate the finite sample bias, standard error and root mean squared error of different estimators for a simple dynamic panel data model. Specifically, for

(iv) Given the simulated panel data Y_{i,t}, i = 1, 2, · · · , N and t = 1, · · · , T, estimate the dynamic panel data model (4.77) using the pooled OLS estimator, fixed effects estimator, first differenced estimator and Anderson-Hsiao estimator (with Y_{i,t−2} as the instrument).
(v) Repeat (i)-(iv) 1000 times and calculate the finite sample bias, standard error (se)
and root mean squared error (rmse) of each estimator. Let ρ̂(r) be the estimate for the r-
th replication, then the finite sample bias, standard error and root mean squared error are
computed as follows
bias(ρ̂) = (1/1000) Σ_{r=1}^{1000} ρ̂^{(r)} − ρ,

se(ρ̂) = { (1/1000) Σ_{r=1}^{1000} ( ρ̂^{(r)} − (1/1000) Σ_{s=1}^{1000} ρ̂^{(s)} )² }^{1/2},

rmse(ρ̂) = { (1/1000) Σ_{r=1}^{1000} ( ρ̂^{(r)} − ρ )² }^{1/2}.
(a) Let N = 100, T = 6. Graph the bias, se and rmse of each estimator as functions of ρ for ρ = 0, 0.1, 0.2, · · · , 1. For ρ = 0.7, graph the histogram of each estimator and compare it with the normal density.
2. Download the data file cigar.xls from the course web site. The file contains cigarette consumption for 46 states over the years 1963-1992. We are interested in estimating a dynamic demand model for cigarettes using Matlab or STATA. The model is given below:

ln C_{i,t} = β_0 + β_1 ln C_{i,t−1} + β_2 ln P_{i,t} + β_3 ln Y_{i,t} + β_4 ln Pn_{i,t} + ε_{i,t},

where C_{i,t} is real per capita packs of cigarettes sold in state i in year t, P_{i,t} is the average retail price of a pack of cigarettes measured in real terms, Y_{i,t} is per capita disposable income, and Pn_{i,t} is the minimum real price of cigarettes in neighboring states.
(a) Estimate the model using the OLS estimator with no α_i and λ_t. In this case, the model becomes a pooled regression with error ε_{i,t} = u_{i,t}. (Do not forget to report the standard errors or t-statistics. Report only the estimates for the slope coefficients.)
(b) Estimate the model using the OLS estimator with no α_i, i.e. ε_{i,t} = λ_t + u_{i,t}. (Construct the year dummies, say year62, year63, year64, . . ., year92, and include year64, . . ., year92 as additional regressors.)
(c) Estimate the model using the within estimator with no λ_t, i.e. ε_{i,t} = α_i + u_{i,t}. Assume u_{i,t} ∼ iid(0, σ²_u) over i and t and independent of the regressors. Is the within estimate of β_1 consistent?
(d) Estimate the model using the within estimator, i.e. ε_{i,t} = α_i + λ_t + u_{i,t}. Test for the joint significance of the time effects.
(e) Estimate the model (ε_{i,t} = α_i + λ_t + u_{i,t}) using the Anderson-Hsiao estimator. (Please use the model in its first difference and employ ln C_{i,t−2}, the second lag of log consumption, as the only instrument.) Is ln C_{i,t−2} a valid instrument when u_{i,t} is autocorrelated, say u_{i,t} ∼ MA(1)?
(f) Compare the estimate of β_1 in (b) with that in (e). Which one is larger? Can we intuitively explain the difference?
(g) Estimate the model (ε_{i,t} = α_i + λ_t + u_{i,t}) using the Arellano-Bond estimator, including

(ii) the first-step AB estimator using ln C_{i,t−2}, ln C_{i,t−3} and ln C_{i,t−4} as instruments.

The moment conditions used in g(iii) are different from those in (e). Can you explain why?
In g(iii), the first-step weight matrix takes the form ( Σ_{i=1}^N Z_i′ G Z_i )^{−1}, where

     (  2  −1   0  · · ·   0   0 )
     ( −1   2  −1   0   0       )
G =  (  0  −1   2  · · ·   0   0 )    (4.81)
     (  .   .   .  · · ·   .   . )
     (  0   0   0  · · ·   2  −1 )
     (  0   0   0  · · ·  −1   2 ).
(iv) Can you perform a two-step GMM estimation based on the specification in g(i)?
Bibliography

[1] Ahn, S. C., and Schmidt, P. (1995). "Efficient estimation of models for dynamic panel data," Journal of Econometrics, 68, 5-27.

[2] Ahn, S. C., and Schmidt, P. (1999). "Estimation of linear panel data models using GMM," in L. Mátyás (ed.), Generalized Method of Moments Estimation, Cambridge University Press.

[3] Anderson, T.W., and Cheng Hsiao (1981). "Estimation of dynamic models with error components," Journal of the American Statistical Association, 76, 598-606.

[4] Anderson, T.W., and Cheng Hsiao (1982). "Formulation and estimation of dynamic models using panel data," Journal of Econometrics, 18, 47-82.

[5] Arellano, M., and Bond, S. (1991). "Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations," Review of Economic Studies, 58, 277-297.

[6] Arellano, M., and Bover, O. (1995). "Another look at the instrumental variable estimation of error-components models," Journal of Econometrics, 68, 29-51.

[7] Blundell, R., and Bond, S. (1998). "Initial conditions and moment restrictions in dynamic panel data models," Journal of Econometrics, 87, 115-143.

[8] Hahn, J. (1999). "How informative is the initial condition in the dynamic panel model with fixed effects?" Journal of Econometrics, 93, 309-326.

[9] Nickell, S. (1981). "Biases in dynamic models with fixed effects," Econometrica, 49, 1417-1426.

[10] Roodman, D. (2009). "A note on the theme of too many instruments," Oxford Bulletin of Economics and Statistics, 71(1), 135-158.
Chapter 5
Extremum Estimators
We define and investigate the properties of a very general class of estimators called extremum
estimators. As we will see, many popular estimators that you have encountered before are
part of this class, e.g. (nonlinear) Least Squares, Maximum Likelihood, Generalized Method
of Moments estimator.
5.1 Definitions
Denote by Θ ⊂ Rd the parameter space of interest. Extremum estimators (EE) {bθn : n > 1}
are defined to be random elements of Θ that approximately minimize a stochastic criterion
function Qn (θ). That is, b
θn is defined to satisfy
p
minimicer
Assumption EE: b θn 2 Θ and Qn (b θn ) ≤ inf θ2Θ Qn (θ) + op (1). of Q fn
F
small error
OP
Qaarcheted d estered Ipso a
for as ξ n = asymptotes alee error
Remark 17 A random quantity ξ_n is of smaller stochastic order than b_n, written ξ_n = o_p(b_n), if

plim_{n→∞} (ξ_n / b_n) = 0.

Equivalently, for any δ > 0, lim_{n→∞} P(|ξ_n / b_n| > δ) = 0.
Remark 18 Note that for any nonnegative sequence s_n, we have limsup_{n→∞} s_n = 0 iff lim_{n→∞} s_n = 0. First, if lim_{n→∞} s_n = 0, then limsup_{n→∞} s_n = lim_{n→∞} s_n = 0. Second, note that 0 ≤ s_n ≤ sup_{m≥n} s_m. So limsup_{n→∞} s_n = 0 implies that lim_{n→∞} s_n = 0.

Remark 19 When we do not know whether lim s_n exists or not a priori, we may work with limsup_{n→∞} s_n, which always exists if s_n is bounded. In our setting, s_n will be a probability, which by definition is bounded by 1.
Remark 20 A random quantity ξ_n is stochastically bounded, written as ξ_n = O_p(1), if for any ε > 0, there exists M_ε < ∞ such that P(|ξ_n| ≤ M_ε) > 1 − ε for n sufficiently large.
(1) Maximum Likelihood (ML) Estimator: Suppose the data {W_i : i = 1, ..., n} are iid. Suppose we specify the density to be f(w, θ) (with respect to some measure µ, in most cases the Lebesgue measure). Let l(w, θ) := log f(w, θ). The (quasi) ML estimator θ̂_n minimizes (at least up to o_p(1))

Q_n(θ) := − (1/n) Σ_{i=1}^n l(W_i, θ)

over θ ∈ Θ.
(2) Least Squares (LS) Estimator for Nonlinear Regression: Suppose the data {W_i = (Y_i, X_i′)′ : i = 1, ..., n} are iid. We specify a possibly nonlinear function for E(Y_i | X_i). That is,

E(Y_i | X_i) = g(X_i, θ_0)

for some function g(x, θ). The LS estimator θ̂_n minimizes (at least up to o_p(1))

Q_n(θ) := (1/n) Σ_{i=1}^n (Y_i − g(X_i, θ))² / 2

over θ ∈ Θ. (The scale factor 1/2 is used because it is notationally convenient for the asymptotic normality result given below. It has of course no effect on the estimator.)
(3) Generalized Method of Moments (GMM) Estimator: Suppose the data {W_i : i = 1, ..., n} are iid, and we have the moment conditions

E g(W_i, θ_0) = 0, (5.1)

where g(w, θ) ∈ R^k (k ≥ d) is a known function. Let A_n be a k × k random (i.e. depending on the data) weight matrix. Then, the GMM estimator θ̂_n minimizes (at least up to o_p(1))

Q_n(θ) := || A_n (1/n) Σ_{i=1}^n g(W_i, θ) ||² / 2

over θ ∈ Θ, where || · || denotes the Euclidean norm on R^k.
(4) Minimum Distance (MD) Estimator: Let π̂_n be an unrestricted estimator of a k-vector parameter π_0. Suppose π_0 is known to be a function of a d-vector parameter θ_0, where d ≤ k:

π_0 = g(θ_0).

Let A_n be a k × k random weight matrix. Then, the MD estimator θ̂_n minimizes

Q_n(θ) = || A_n (π̂_n − g(θ)) ||² / 2

over θ ∈ Θ.
As an example, consider the simple linear panel data model

Y_it = α + X_it β + ε_it, i = 1, ..., N, t = 1, ..., T,

where the error ε_it contains an individual effect α_i that may be correlated with {X_it}. Chamberlain's approach (1982, 1984) is to replace α_i with its linear projection onto {X_it}. Assume that α_i and {X_it} have finite second moments. Then

α_i = λ_0 + X_i1 λ_1 + ... + X_iT λ_T + v_i. (5.3)

Substituting this projection into the model gives

Y_it = (α + λ_0) + X_i1 λ_1 + ... + X_it (β + λ_t) + ... + X_iT λ_T + r_it
    := π_t0 + X_i1 π_t1 + X_i2 π_t2 + ... + X_iT π_tT + r_it. (5.4)

Now π_t = (π_t0, π_t1, ..., π_tT)′ can be estimated by a cross sectional regression using observations for time t. Let θ collect the parameters α, β, λ_0, λ_1, ..., λ_T; then π = Hθ for some matrix H, and θ can be estimated by minimum distance.
(5) Two-step (TS) Estimator: Suppose that the criterion function C_n(θ; τ_0) depends on both θ and τ_0. The infeasible extremum estimator is defined by

θ̃_n ∈ Θ and C_n(θ̃_n; τ_0) ≤ inf_{θ∈Θ} C_n(θ; τ_0) + o_p(1).

To obtain a feasible version, we assume that τ̂_n is a preliminary consistent estimator of τ_0. We can then define a two-step extremum estimator:

θ̂_n ∈ Θ and Q_n(θ̂_n) ≤ inf_{θ∈Θ} Q_n(θ) + o_p(1) for Q_n(θ) = C_n(θ; τ̂_n).
As an example, consider the nonlinear regression model

Y_i = g(X_i, θ_0) + U_i,

with E(U_i² | X_i) = σ²(X_i, τ_0) almost surely. Suppose τ̂_n is a consistent estimator of τ_0; then the (feasible) WLS estimator (up to o_p(1))

θ̂_WLS = arg min_{θ∈Θ} Q_n(θ)

for

Q_n(θ) = C_n(θ; τ̂_n) = (1/(2n)) Σ_{i=1}^n (Y_i − g(X_i, θ))² / σ²(X_i, τ̂_n)

is a two-step estimator.
As another example, suppose the data {W_i : i = 1, 2, ..., n} are iid, and we have the moment conditions E g(W_i, θ_0, τ_0) = 0. Let

G_n(θ, τ) = (1/n) Σ_{i=1}^n g(W_i, θ, τ) ∈ R^k

and let τ̂_n be a preliminary consistent estimator of τ_0. Then

θ̂_n = arg min_{θ∈Θ} Q_n(θ), Q_n(θ) = || A_n G_n(θ, τ̂_n) ||² / 2,

for some matrix A_n, is a two-step estimator.
There is also a TS version of the MD estimator. In this case G_n(θ, τ) = π̂(τ) − g(θ, τ), which is a random k-vector that should be close to 0 when θ = θ_0, τ = τ_0, and n is large. Let A_n be a k × k random weight matrix. Then, the TS MD estimator θ̂_n minimizes

Q_n(θ) = || A_n G_n(θ, τ̂_n) ||² / 2

over θ ∈ Θ.
5.2 Consistency
5.2.1 Consistency Theorem
Consistency of extremum estimators is implied by the following assumptions:

Assumption ULLN: sup_{θ∈Θ} |Q_n(θ) − Q(θ)| →p 0 for some nonstochastic function Q : Θ → R.

Assumption ID: ∀ε > 0, there exists δ > 0 such that inf_{θ∈Θ\B(θ_0,ε)} Q(θ) − Q(θ_0) ≥ δ, where by B(θ_0, ε) we denote the open ball of radius ε centered at θ_0, i.e. B(θ_0, ε) := {θ : ||θ − θ_0|| < ε}.
Before we discuss these assumptions in detail, we prove the following theorem.

Theorem 21 Assume that EE, ULLN, and ID hold. Then θ̂_n →p θ_0.
Proof: Let ε > 0. By ID, there exists δ > 0 such that whenever θ ∈ Θ\B(θ_0, ε) we have

Q(θ) − Q(θ_0) ≥ δ. (5.5)

Thus

P( ||θ̂_n − θ_0|| > ε ) = P( θ̂_n ∈ Θ\B(θ_0, ε) )
≤ P( Q(θ̂_n) − Q(θ_0) ≥ δ )
= P( Q(θ̂_n) − Q_n(θ̂_n) + Q_n(θ̂_n) − Q(θ_0) ≥ δ )
≤ P( Q(θ̂_n) − Q_n(θ̂_n) + Q_n(θ_0) + o_p(1) − Q(θ_0) ≥ δ )
≤ P( 2 sup_{θ∈Θ} |Q_n(θ) − Q(θ)| + o_p(1) ≥ δ ) = P( o_p(1) ≥ δ ) → 0.

The first inequality follows by (5.5), the third inequality by EE, and the last equality by ULLN. So, lim_{n→∞} P(||θ̂_n − θ_0|| > ε) = 0 for any ε > 0. That is, θ̂_n →p θ_0. ∎
We now discuss the assumptions.

Assumption ULLN. Note that the assumption is stronger than pointwise convergence in probability Q_n(θ) →p Q(θ) for each θ ∈ Θ. As an analogue, we can think of pointwise convergence of nonrandom functions versus uniform convergence.
Definition (Uniform Convergence): The sequence of functions {f_n(t)} converges uniformly to a function f(t) if for every ε > 0, there is an N such that n ≥ N implies |f_n(t) − f(t)| < ε for all t ∈ R.
C
in 3D
Seats
Obviously every uniformly convergent sequence is pointwise convergent. The di§erence Con get
between pointwise and uniform convergence is this: If fn (t) I converges pointwise to f (t), then
for every " > 0 and for every t 2 R, there is an integer N depending on " and t such that IneOof
|fn (t) − f (t)| < " holds if n ≥ N . If fn (t) converges uniformly to f (t), it is possible for each endnote
" > 0 to find one integer N that will do for all t 2 R. act
[Figure: the functions f_2(t), f_4(t), and f_20(t) plotted against t ∈ [0, 1], with values between 0 and 1.]
Lemma 22 Let {W_i : i = 1, ..., n} be a sequence of iid random variables. Let {m(w, θ) : θ ∈ Θ} be a class of R^s-valued functions for which E ||m(W_i, θ)|| < ∞ ∀θ ∈ Θ. Then,

(1/n) Σ_{i=1}^n m(W_i, θ) →p E m(W_i, θ) as n → ∞, ∀θ ∈ Θ.
Using the lemma, we see that Assumption LLN holds in the following examples: (1) the ML estimator, provided the data {W_i : i = 1, ..., n} are iid and E |log f(W_i, θ)| < ∞ ∀θ ∈ Θ; (2) the LS estimator, provided {(Y_i, X_i) : i = 1, ..., n} are iid and E(Y_i − g(X_i, θ))² < ∞ ∀θ ∈ Θ; and (3) the GMM estimator, provided A_n →p A, {W_i : i = 1, ..., n} are iid and E ||g(W_i, θ)|| < ∞, ∀θ ∈ Θ.
Assumption LLN holds for the MD estimator provided that A_n →p A and π̂_n →p π_0.
Assumption LLN holds for the TS WLS estimator if

sup_{τ∈B(τ_0,ε)} | (1/n) Σ_{i=1}^n (Y_i − g(X_i, θ))² / σ²(X_i, τ) − E[ (Y_i − g(X_i, θ))² / σ²(X_i, τ) ] | →p 0

for some ε and any θ ∈ Θ, and E[ (Y_i − g(X_i, θ))² σ^{−2}(X_i, τ) ] / 2 is continuous at τ_0 for any θ ∈ Θ.
Assumption LLN holds for the TS GMM/MD estimator provided that A_n →p A, τ̂_n →p τ_0,

sup_{τ∈B(τ_0,ε)} |G_n(θ, τ) − G(θ, τ)| →p 0

for some ε and any θ ∈ Θ, and G(θ, τ) is continuous at τ_0 for any θ ∈ Θ. These conditions ensure that G_n(θ, τ̂_n) →p G(θ, τ_0), because τ̂_n ∈ B(τ_0, ε) with probability approaching one, and

G_n(θ, τ̂_n) − G(θ, τ_0) = [G_n(θ, τ̂_n) − G(θ, τ̂_n)] + [G(θ, τ̂_n) − G(θ, τ_0)] →p 0.
Next we consider how one can verify Assumption ULLN. Suppose we have a sequence of random variables indexed by θ ∈ Θ, {H_n(θ) : n ≥ 1}, that converges in probability to zero ∀θ ∈ Θ. In our setting, we can take H_n(θ) := Q_n(θ) − Q(θ). What additional conditions are sufficient to obtain uniform convergence to zero, sup_{θ∈Θ} |H_n(θ)| →p 0? The following results can be applied to verify Assumption ULLN. We need the following definition:
Definition: {H_n(θ) : n ≥ 1} is stochastically equicontinuous (SE) on Θ if ∀ε > 0 and η > 0, ∃δ := δ(ε, η) > 0 (which may depend on ε, η but not n) such that

limsup_{n→∞} P( sup_{θ∈Θ} sup_{θ′∈B(θ,δ)} |H_n(θ′) − H_n(θ)| > ε ) < η.

The "equi" in "equicontinuous" emphasizes the fact that δ does not depend on n. A more precise term is stochastic uniform equicontinuity. "Uniformity" refers to the fact that δ does not depend on θ. However, it is common in the literature to use "stochastic equicontinuity". Sometimes, we also use "asymptotic equicontinuity".
Note: by definition,

limsup_{n→∞} f_n = lim_{n→∞} sup_{m≥n} f_m.
Define

w_n(δ) := sup_{θ∈Θ} sup_{θ′∈B(θ,δ)} |H_n(θ′) − H_n(θ)|,

which is the so-called modulus of continuity of H_n. Then SE holds if ∀ε > 0 and η > 0, ∃δ > 0 such that

limsup_{n→∞} P( w_n(δ) > ε ) < η.

This is equivalent to saying that w_n(δ_n) = o_p(1) for any δ_n → 0. See Andrews (1994).
Note: In mathematical analysis, a family of functions is equicontinuous if all the functions are continuous and they have equal variation over a given neighborhood, in a precise sense described below:

The family F is equicontinuous at a point x_0 ∈ X if for every ε > 0, there exists a δ = δ(x_0) > 0 such that d(f(x_0), f(x)) < ε for all f ∈ F and all x such that d(x_0, x) < δ. Note that δ = δ(x_0) should work for all f ∈ F. The family is (pointwise) equicontinuous if it is equicontinuous at each point of X.

The concept of equicontinuity can be applied to a sequence of functions: the sequence of functions {f_n(x), n = 1, 2, ...} is equicontinuous at a point x_0 if for every ε > 0, there exists a δ = δ(x_0) > 0 such that d(f_n(x_0), f_n(x)) < ε for all n and all x such that d(x_0, x) < δ. Here δ may depend on x_0 but not on n. When there is a δ that applies to all x ∈ X, then the sequence of functions is uniformly equicontinuous.
Assumption LLN: H_n(θ) →p 0, ∀θ ∈ Θ.

Assumption SE: {H_n(θ) : n ≥ 1} is SE on Θ.
Theorem 23 (a) If Θ is compact and Assumptions LLN and SE hold, then sup_{θ∈Θ} |H_n(θ)| →p 0. (b) If sup_{θ∈Θ} |H_n(θ)| →p 0, then SE holds.
Proof: (a) Let ∪_{θ∈Θ} {B(θ, δ)} be an open cover of Θ, where δ is given in the definition of stochastic equicontinuity. Since Θ is compact, there exists a finite subcover such that Θ ⊆ ∪_{j=1,...,J} {B(θ_j, δ)}. Then for any ε and η > 0, we have

limsup_{n→∞} P( sup_{θ∈Θ} |H_n(θ)| > 2ε )
≤ limsup_{n→∞} P( max_{j≤J} sup_{θ∈B(θ_j,δ)} |H_n(θ) − H_n(θ_j)| > ε ) + limsup_{n→∞} P( max_{j≤J} |H_n(θ_j)| > ε )
≤ Σ_{j=1}^J limsup_{n→∞} P( sup_{θ′∈B(θ_j,δ)} |H_n(θ′) − H_n(θ_j)| > ε ) + Σ_{j=1}^J limsup_{n→∞} P( |H_n(θ_j)| > ε )
≤ 2Jη,

where in the last inequality we use Assumption SE to deal with the first summand and Assumption LLN to deal with the second summand. Since η > 0 is arbitrary, sup_{θ∈Θ} |H_n(θ)| →p 0.
(b) We have

limsup_{n→∞} P( sup_{θ∈Θ} sup_{θ′∈B(θ,δ)} |H_n(θ′) − H_n(θ)| > ε ) ≤ limsup_{n→∞} P( 2 sup_{θ∈Θ} |H_n(θ)| > ε ) = 0,

using the triangle inequality and sup_{θ∈Θ} |H_n(θ)| →p 0. ∎
p p
Remark 24 If Θ is finite, then supθ2Θ |Hn (θ)| ! 0 follows automatically from |Hn (θ)| ! 0
for each θ 2 Θ. A common example of an infinite set that exhibits properties similar to finite
sets is the compact set. Combining some continuity (here SE) and compactness, we can extend
pointwise convergence, a local result, to uniform convergence, a global result.
We now obtain a ULLN by applying the previous theorem. We look at the case where

H_n(θ) := (1/n) Σ_{i=1}^n [ m(W_i, θ) − E m(W_i, θ) ].

In examples (1) ML and (2) LS, we can think of Q_n(θ) as n^{−1} Σ_{i=1}^n m(W_i, θ) and of Q(θ) as E m(W_i, θ). Let W denote the support of W_i.
We claim that, for

Y_iδ := sup_{θ∈Θ} sup_{θ′∈B(θ,δ)} |m(W_i, θ) − m(W_i, θ′)|,

we have

E Y_iδ → 0 as δ → 0.

To see this, note that Y_iδ → 0 as δ → 0 a.s. (this follows by (ii) and (iv), which imply uniform continuity). Furthermore, Y_iδ ≤ 2 sup_{θ∈Θ} |m(W_i, θ)| ∀δ > 0 and E sup_{θ∈Θ} |m(W_i, θ)| < ∞ by (iii). The claim thus follows by the dominated convergence theorem.
Now

w_n(δ) ≤ (1/n) Σ_{i=1}^n sup_{θ∈Θ} sup_{θ′∈B(θ,δ)} |m(W_i, θ) − m(W_i, θ′)| + E sup_{θ∈Θ} sup_{θ′∈B(θ,δ)} |m(W_i, θ) − m(W_i, θ′)|
= (1/n) Σ_{i=1}^n Y_iδ + E Y_iδ = (1/n) Σ_{i=1}^n [ Y_iδ + E Y_iδ ].
I argmn In o
Proof:
where the inequality holds by Jensen's inequality since log(·) is a concave function. The inequality is strict, implying that θ*_0 uniquely minimizes Q(θ) over Θ, if and only if P( f(W_i, θ) ≠ f(W_i, θ*_0) ) > 0 for all θ ∈ Θ different from θ*_0.
Case (ii), the true distribution g(w) is not part of the parametric family {f(w, θ) : θ ∈ Θ}. By

    KLIC(g, f(·, θ)) := E_g log g(W_i) − E_g log f(W_i, θ)

we denote the Kullback–Leibler Information Criterion between g and f(·, θ), where E_g denotes the expectation when W_i has density g. Note that KLIC(g, f(·, θ)) = E_g log g(W_i) + Q(θ), so minimizing the KLIC over θ is the same as minimizing Q(θ).
Thus, the ML estimator under misspecification (often called the quasi-ML estimator) converges in probability to the parameter value θ_0 that uniquely minimizes the KLIC between the true density g and the densities in the parametric family {f(w, θ) : θ ∈ Θ}, provided such a unique value exists.
More generally, the Kullback–Leibler Information Criterion is a non-symmetric measure of the difference between two probability distributions P_1 and P_2. If P_1 is absolutely continuous with respect to P_2, then

    KLIC(P_1, P_2) = ∫ (dP_1/dP_2) log(dP_1/dP_2) dP_2.

If both P_1 and P_2 are absolutely continuous with respect to some measure µ, then

    KLIC(P_1, P_2) = E_{P_1} [ log(dP_1/dµ) − log(dP_2/dµ) ].
For P_1 = N(λ_1, Σ_1) and P_2 = N(λ_2, Σ_2) on R^k, we have

    KLIC(P_1, P_2)
      = E_{P_1} [ log(dP_1/dµ) − log(dP_2/dµ) ]
      = −(1/2) E_{P_1} [ log( det(Σ_1)/det(Σ_2) ) + (X − λ_1)'Σ_1^{−1}(X − λ_1) − (X − λ_2)'Σ_2^{−1}(X − λ_2) ]
      = −(1/2) log( det(Σ_1)/det(Σ_2) ) − (1/2) { k − E_{P_1} (X − λ_1 + λ_1 − λ_2)'Σ_2^{−1}(X − λ_1 + λ_1 − λ_2) }
      = −(1/2) log( det(Σ_1)/det(Σ_2) ) − (1/2) { k − tr(Σ_1 Σ_2^{−1}) − (λ_1 − λ_2)'Σ_2^{−1}(λ_1 − λ_2) }
      = (1/2) (λ_1 − λ_2)'Σ_2^{−1}(λ_1 − λ_2) + (1/2) { tr(Σ_1 Σ_2^{−1}) − k } − (1/2) log( det(Σ_1)/det(Σ_2) ).

In particular, when λ_1 = λ_2,

    KLIC(P_1, P_2) = (1/2) { tr(Σ_1 Σ_2^{−1}) − k } − (1/2) [ log det(Σ_1) − log det(Σ_2) ].
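As a quick numerical sanity check of the closed form above, the following Python sketch (the function name `klic_gaussian` and the test values are our own, not from the text) compares the formula against the known univariate result KLIC(N(0,1), N(m, s²)) = log s + (1 + m²)/(2s²) − 1/2.

```python
import numpy as np

def klic_gaussian(mu1, S1, mu2, S2):
    """KLIC(N(mu1, S1), N(mu2, S2)) via the closed form derived above."""
    k = len(mu1)
    S2inv = np.linalg.inv(S2)
    d = mu1 - mu2
    return 0.5 * (d @ S2inv @ d
                  + np.trace(S1 @ S2inv) - k
                  - np.log(np.linalg.det(S1) / np.linalg.det(S2)))
```

Identical distributions give a KLIC of zero, and the univariate case matches the textbook expression, which is a useful check that the trace and determinant terms carry the right signs.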
What is the KLIC between a probability distribution P_1 with mean λ_1 and variance Σ_1 and P_2 := N(λ_2, Σ_2), where P_1 may not be normal? We can find this as follows:

    E_{P_1} [ log(dP_1/dµ) − log(dP_2/dµ) ]
      = E_{P_1} log(dP_1/dµ) + (1/2) E_{P_1} { log det(Σ_2) + (X − λ_2)'Σ_2^{−1}(X − λ_2) }
      = E_{P_1} log(dP_1/dµ) + C(Σ_1, Σ_2) + (1/2) (λ_1 − λ_2)'Σ_2^{−1}(λ_1 − λ_2),

where C(Σ_1, Σ_2) does not depend on λ_2. Suppose P_1 is the true distribution, and we use P_2 to perform the ML estimation. Then, according to our general theory, the MLE of the mean is consistent for the true mean, even if the normal distribution is mis-specified.
More generally, consider the linear model

    Y = Xβ_0 + U,

where U may not be normal and may not be independent of X but satisfies E(U|X) = 0. The true pdf of (Y, X) can be written as g_{Y,X}(y, x) = g_{U|X=x}(y − xβ_0) g_X(x). Suppose we specify U to be N(0, σ²) and independent of X. Under this (mis)specification, the pdf of (Y, X) is

    f_{Y,X}(y, x) = (1/√(2πσ²)) exp( −(y − xβ)²/(2σ²) ) g_X(x).

Now

and

    E_{g_{Y,X}} [ (Y − Xβ)²/(2σ²) ] = E_X E_{Y|X} [ (Y − Xβ)²/(2σ²) ]
      = E_X E_{Y|X} [ (Y − E(Y|X) + E(Y|X) − Xβ)²/(2σ²) ]
      = E_X var_{Y|X}(Y)/(2σ²) + E_X [X(β_0 − β)]²/(2σ²).
If KLIC(g_{Y,X}, f_{Y,X}) has a unique minimizer (β_o, σ²_o), then β_o = β_0 and σ²_o = E_X var_{Y|X}(Y). That is, under the misspecification, β̂_MLE is consistent for β_0, and σ̂²_MLE converges to the mean of the conditional variance of U.
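A small simulation of this result (our own illustration, with an assumed design): the error below is a scaled, centered exponential with E(U|X) = 0 and conditional variance X², so the Gaussian quasi-MLE of β (which here is just OLS) should approach β_0 while σ̂² approaches E var(U|X) = E X² = 7/3 for X ~ U(1, 2).

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta0 = 200_000, 2.0
x = rng.uniform(1.0, 2.0, n)
# non-normal, heteroskedastic error with E(U|X) = 0 and var(U|X) = X^2
u = x * (rng.exponential(1.0, n) - 1.0)
y = x * beta0 + u

# Gaussian quasi-ML: beta_hat is OLS, sigma2_hat is the mean squared residual
beta_hat = (x @ y) / (x @ x)
sigma2_hat = np.mean((y - x * beta_hat) ** 2)
# theory: beta_hat -> beta0 = 2, sigma2_hat -> E[X^2] = 7/3
```

Even though the likelihood is wrong in both shape and heteroskedasticity, the mean parameter is estimated consistently; only the variance parameter converges to an averaged quantity.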
(2) LS Estimator:
Case (i), the correctly specified case, i.e., there is a θ*_0 ∈ Θ such that E(Y_i|X_i) = g(X_i, θ*_0) almost surely. In this case, θ_0 = θ*_0:

    2(Q(θ*_0) − Q(θ)) = E(U_i)² − E(Y_i − g(X_i, θ) + g(X_i, θ*_0) − g(X_i, θ*_0))²
      = E(U_i)² − E(U_i + g(X_i, θ*_0) − g(X_i, θ))²
      = −E(g(X_i, θ*_0) − g(X_i, θ))² − 2E[(g(X_i, θ*_0) − g(X_i, θ))U_i]
      = −E(g(X_i, θ*_0) − g(X_i, θ))² ≤ 0.

Note that E(U_i|X_i) = 0 implies E[(g(X_i, θ*_0) − g(X_i, θ))U_i] = 0. The last inequality is strict, and so θ*_0 uniquely minimizes Q(θ), if P( g(X_i, θ) ≠ g(X_i, θ*_0) ) > 0 for all θ ∈ Θ different from θ*_0.
Case (ii), the model is not correctly specified. Define h(x) = E(Y_i|X_i = x). Then

    Q(θ) = E(Y_i − h(X_i))²/2 + E(h(X_i) − g(X_i, θ))²/2,

and θ_0 uniquely minimizes Q(θ) over Θ if it uniquely minimizes E(h(X_i) − g(X_i, θ))² over Θ. In other words, the LS estimator in this case converges to the point θ_0 that gives the best mean-squared approximation in the family {g(·, θ) : θ ∈ Θ} to the conditional mean of Y_i given X_i (provided the best approximation is unique).
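A minimal simulated illustration of this pseudo-true-value result (our own example, with an assumed design): the true conditional mean is h(x) = x², the fitted family is g(x, θ) = θx, and for X ~ U(0, 1) the best mean-squared approximation is θ_0 = E X³ / E X² = 3/4, which the LS estimator should approach.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200_000
x = rng.uniform(0.0, 1.0, n)
y = x**2 + rng.normal(0.0, 0.1, n)   # true conditional mean h(x) = x^2

# misspecified family g(x, theta) = theta * x; LS estimate is OLS through the origin
theta_hat = (x @ y) / (x @ x)

# pseudo-true value: argmin_theta E (h(X) - theta X)^2 = E[X^3] / E[X^2]
theta_star = (1 / 4) / (1 / 3)       # = 0.75 for X ~ U(0, 1)
```

The estimator does not converge to any "true" coefficient (none exists), but it does converge to the L2 projection of h onto the misspecified family.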
(3) GMM Estimator: If A is nonsingular and there exists a unique vector θ*_0 ∈ Θ such that Eg(W_i, θ*_0) = 0, then θ_0 = θ*_0, which uniquely minimizes Q(θ) over Θ.
(4) MD Estimator: If A is nonsingular and there exists a unique vector θ*_0 ∈ Θ such that π_0 = g(θ*_0), then θ*_0 uniquely minimizes Q(θ) over Θ. Suppose the restrictions are misspecified and there is no value of θ ∈ Θ such that π_0 = g(θ). Then θ_0 is the value that uniquely minimizes Q(θ) = ‖A(π_0 − g(θ))‖²/2, if such a value exists.
(5) TS/GMM and TS/MD Estimators: If A is nonsingular and there exists a unique vector θ*_0 ∈ Θ such that G(θ*_0, τ_0) = 0, then θ_0 = θ*_0 is the value that uniquely minimizes Q(θ) over Θ.
(6) TS/NLS Estimator: Similar to the NLS estimator. Details are omitted here.
We now make Assumption ID1, which is more primitive than, and implies, Assumption ID.
Assumption ID1: (i) Θ is compact. (ii) Q is continuous. (iii) θ_0 uniquely minimizes Q(θ) over θ ∈ Θ.
ID1(i) is typically assumed. Continuity of Q holds if Q_n(θ) is continuous and E sup_{θ∈Θ} |Q_n(θ)| < ∞.
Show as an exercise that Assumption ID1 implies ID (proof by contradiction), but that no two of the three conditions in ID1 alone are enough to imply ID. The figures below illustrate that all three conditions are needed.
[Figure: Θ is not compact]
For the proof, the following lemma, which can be viewed as a generalization of Slutsky's theorem, is helpful.

Lemma 27 Suppose (i) β̂_n →p β_0 ∈ R^s, (ii) sup_{β∈B(β_0,ε)} |L_n(β) − L(β)| →p 0 for some ε > 0, and (iii) the non-stochastic function L(β) is continuous at β_0. Then L_n(β̂_n) →p L(β_0).

Proof (lemma):

    |L_n(β̂_n) − L(β_0)| ≤ |L_n(β̂_n) − L(β̂_n)| + |L(β̂_n) − L(β_0)|
      ≤ sup_{β∈B(β_0,ε)} |L_n(β) − L(β)| + |L(β̂_n) − L(β_0)| →p 0,
5.3 ASYMPTOTIC NORMALITY OF EXTREMUM ESTIMATORS 91
where the first inequality holds by the triangle inequality. The second inequality holds with probability approaching 1 (wp → 1), because β̂_n ∈ B(β_0, ε) wp → 1. By assumption (ii) the first summand converges to zero in probability, and by assumptions (i), (iii) and Slutsky's theorem the second does so too. ∎
Proof (theorem): Using CF (i) and (ii) and EE2 (ii), element-by-element mean value expansions of (∂/∂θ)Q_n(θ̂_n) about θ_0 yield

    o_p(n^{−1/2}) = (∂/∂θ) Q_n(θ̂_n) = (∂/∂θ) Q_n(θ_0) + (∂²/∂θ∂θ') Q_n(θ*_n)(θ̂_n − θ_0),   (5.6)

where the mean value θ*_n lies on the segment joining θ̂_n and θ_0 (and hence satisfies θ*_n →p θ_0 by EE2 (i)) and may differ across the rows of (∂²/∂θ∂θ')Q_n(θ*_n). Applying the lemma (using θ*_n →p θ_0
We now provide sufficient conditions for Assumption CF and discuss the form of the covariance matrix B_0^{−1} Ω_0 B_0^{−1} for each of the examples introduced in the previous section.
5.3.2 ML Estimator
Recall Q_n(θ) := −(1/n) Σ_{i=1}^{n} log f(W_i, θ). Thus we have

    (∂/∂θ) Q_n(θ) = −(1/n) Σ_{i=1}^{n} (∂/∂θ) log f(W_i, θ),
    (∂²/∂θ∂θ') Q_n(θ) = −(1/n) Σ_{i=1}^{n} (∂²/∂θ∂θ') log f(W_i, θ),
    Ω_0 = E [ (∂/∂θ) log f(W_i, θ_0) (∂/∂θ') log f(W_i, θ_0) ],
    B(θ) = −E (∂²/∂θ∂θ') log f(W_i, θ).

Assumption CF (ii) holds if f(w, θ) ∈ C²(Θ_0) on some neighborhood Θ_0 ⊂ Θ of θ_0 for all w in the support W of W_i.
CF (iii) holds by the central limit theorem (CLT) for iid random vectors with finite second moment provided

    E (∂/∂θ) Q_n(θ_0) = 0 and E ‖(∂/∂θ) log f(W_i, θ_0)‖² < ∞.   (5.8)

The former condition in (5.8) holds by the first order conditions for minimization of Q(θ) over Θ, assuming that Q is differentiable at θ_0, provided θ_0 is an interior point of Θ_0. That is,

    0 = (∂/∂θ) Q(θ_0) = −(∂/∂θ) E log f(W_i, θ_0) = −E (∂/∂θ) log f(W_i, θ_0) = E (∂/∂θ) Q_n(θ_0),
provided the interchange of the order of E and ∂/∂θ is justified. Sufficient conditions for this are that 1) log f(w, θ) ∈ C¹(Θ_0) on some neighborhood Θ_0 ⊂ Θ of θ_0 for all w ∈ W, and 2) E sup_{θ∈Θ_0} ‖(∂/∂θ) log f(W_i, θ)‖ < ∞. These conditions also imply that Q is differentiable at θ_0. Note that 0 = (∂/∂θ)Q(θ_0) holds by definition of θ_0 (as the value that minimizes Q(θ), recall Assumption ID) whether or not the model is correctly specified.
The second condition in (5.8) is equivalent to requiring the information matrix at θ_0 to be well defined, since the latter equals

    I_0 = E [ (∂/∂θ) log f(W_i, θ_0) (∂/∂θ') log f(W_i, θ_0) ].
Assumption CF (iv) states that { (∂²/∂θ∂θ') log f(W_i, θ) : i ≥ 1 } satisfies a uniform WLLN over Θ_0. The information matrix equality Ω_0 = B_0 holds if the model is correctly specified and one can switch the order of differentiation and integration in the definition of B_0 = B(θ_0). More specifically, under the correct specification,
    E [ (∂/∂θ') log f(W, θ_0) ] = ∫ [ (∂/∂θ') f(w, θ_0) / f(w, θ_0) ] f(w, θ_0) dw
      = ∫ (∂/∂θ') f(w, θ_0) dw = (∂/∂θ') ∫ f(w, θ_0) dw = ∂1/∂θ' = 0.   (5.9)

That is,

    ∫ [ (∂/∂θ') log f(w, θ_0) ] f(w, θ_0) dw = 0.
Taking another derivative on both sides, we have

    ∫ [ (∂/∂θ) log f(w, θ_0) ] [ (∂/∂θ') f(w, θ_0) ] dw + ∫ [ (∂²/∂θ∂θ') log f(w, θ_0) ] f(w, θ_0) dw = 0,

i.e., using (∂/∂θ') f = [ (∂/∂θ') log f ] f,

    E [ (∂/∂θ) log f(W, θ_0) (∂/∂θ') log f(W, θ_0) ] = −E (∂²/∂θ∂θ') log f(W, θ_0).   (5.10)
The information matrix equality is

    B_0 = Ω_0.

Hence, in this case, the asymptotic covariance matrix of √n(θ̂_n − θ_0) simplifies to B_0^{−1}, the inverse of the information matrix.
We call (5.9) and (5.10) the Bartlett identities of order one and two. Taking higher order derivatives will give us more identities.
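The information matrix equality can also be checked numerically. The sketch below (our own illustration, using a simple logit likelihood) estimates Ω_0 by the average squared score and B_0 by minus the average Hessian, which should agree under correct specification.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta0 = 500_000, 0.7
x = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-(x * beta0)))      # P(Y=1|X) under the logit model
y = (rng.uniform(size=n) < p).astype(float)

# per-observation score and Hessian of log f(y|x; beta) at beta0 for the logit
s = (y - p) * x                              # score: (y - F) x
omega_hat = np.mean(s**2)                    # outer product of scores -> Omega_0
b_hat = np.mean(p * (1 - p) * x**2)          # minus the Hessian -> B_0
```

Under correct specification the two estimates converge to the same limit; under misspecification they would generally differ, which is the basis of White's information matrix test.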
    0 = (∂/∂θ) Q(θ_0) = (∂/∂θ) E(Y_i − g(X_i, θ_0))²/2 = E (∂/∂θ) (Y_i − g(X_i, θ_0))²/2 = E (∂/∂θ) Q_n(θ_0),

provided the interchange of the E and ∂/∂θ operators in the third equality is justified. (Sufficient conditions for that are that g(x, ·) ∈ C¹(Θ_0) for all x ∈ X and E sup_{θ∈Θ_0} ‖(Y_i − g(X_i, θ)) (∂/∂θ) g(X_i, θ)‖ < ∞.) As in the ML example, E (∂/∂θ) Q_n(θ_0) = 0 holds by definition of θ_0 whether or not the model is correctly specified.
Assumption CF(iv) holds by Theorem 25 provided g(x, ·) ∈ C²(Θ_0) for all x ∈ X (as assumed above) and

    E sup_{θ∈Θ_0} ‖ (∂/∂θ) g(X_i, θ) (∂/∂θ') g(X_i, θ) − (Y_i − g(X_i, θ)) (∂²/∂θ∂θ') g(X_i, θ) ‖ < ∞.

If, in addition, the errors U_i are homoskedastic, i.e. σ²(X_i) ≡ σ² a.s. for some σ² > 0 (σ²(X_i) does not depend on the realization of X_i), then Ω_0 = σ²B_0 and thus B_0^{−1}Ω_0B_0^{−1} = σ²B_0^{−1}.
The above holds only when the model is correctly specified. In a misspecified model, the definition of θ_0 only ensures that E U_i (∂/∂θ) g(X_i, θ_0) = 0. This does not imply that E(U_i|X_i) = 0 a.s.
5.3.4 GMM
We consider only the case with correctly specified moment conditions here.
Recall Q_n(θ) := ‖A_n (1/n) Σ_{i=1}^{n} g(W_i, θ)‖²/2 for θ ∈ Θ ⊂ R^d, for some function g that maps
for ℓ, j = 1, ..., d,
5.3.5 MD Estimator
Again, we consider only the case with correct specification in some detail here.
Recall that Q_n(θ) = ‖A_n(π̂_n − g(θ))‖²/2. We have

    (∂/∂θ) Q_n(θ) = −[ (∂/∂θ') g(θ) ]' A_n'A_n (π̂_n − g(θ)),
    [ (∂²/∂θ∂θ') Q_n(θ) ]_{ℓ,j} = [ (∂/∂θ_ℓ) g(θ) ]' A_n'A_n [ (∂/∂θ_j) g(θ) ] − [ (∂²/∂θ_ℓ∂θ_j) g(θ) ]' A_n'A_n (π̂_n − g(θ)),

    B_0 = Γ_0'A'AΓ_0.

CF(iv) holds under the assumptions given above, provided that Γ_0 and A are of full rank.
    (∂/∂θ') G_n(θ_0, τ̂_n) →p (∂/∂θ') G(θ_0, τ_0) = Γ_0,   A_n →p A,
    (∂/∂τ') G_n(θ_0, τ*_n) →p (∂/∂τ') G(θ_0, τ_0) = Λ_0,
    sup_{θ∈Θ_0} ‖ (∂²/∂θ∂θ') Q_n(θ) − B(θ) ‖ →p 0,
and

    B(θ_0) = Γ_0'A'AΓ_0.

To find the asymptotic distribution N(0, Ω_0) of √n ∂Q_n(θ_0)/∂θ, as required for CF (iii), we carry out element-by-element mean value expansions of √n ∂Q_n(θ_0)/∂θ about τ_0 and use the above assumptions:

    √n (∂/∂θ) Q_n(θ_0) = [ (∂/∂θ') G_n(θ_0, τ̂_n) ]' A_n'A_n √n G_n(θ_0, τ̂_n)
      = (Γ_0 + o_p(1))' A_n'A_n [ √n G_n(θ_0, τ_0) + (∂/∂τ') G_n(θ_0, τ*_n) √n (τ̂_n − τ_0) ]
      = (Γ_0 + o_p(1))' A_n'A_n [ I_k ⋮ Λ_0 + o_p(1) ] √n ( G_n(θ_0, τ_0) ; τ̂_n − τ_0 )
      →d Γ_0'A'A (Z_1 + Λ_0 Z_2) ~ N(0, Ω_0),

where

    Ω_0 = Γ_0'A'A ( V_{10} + Λ_0 V_{20}' + V_{20} Λ_0' + Λ_0 V_{30} Λ_0' ) A'AΓ_0.

Note that if Λ_0 = 0, then Ω_0 simplifies to an expression that is the same as one would get if τ_0 replaced τ̂_n in Q_n(θ). In this case, the asymptotic distribution of √n(θ̂_n − θ_0) is the same whether τ_0 is known or estimated. In general, however, Λ_0 ≠ 0 and the estimation of τ_0 by τ̂_n affects the limit distribution of θ̂_n.
    X = Zτ_0 + e_X.

For simplicity, here we have assumed that there is no intercept in the model. Let τ̂_n be the first-stage OLS estimator. For the 2SLS, we have A_n = 1,

    G_n(θ, τ̂_n) = (1/n) Σ_{i=1}^{n} [ Y_i − (Z_iτ̂_n)θ ] (Z_iτ̂_n),
    G(θ, τ) = E [ Y − (Zτ)θ ] (Zτ),

and

    (∂/∂τ) G(θ_0, τ_0) = E [ Y − (Zτ_0)θ_0 ] Z − (EZ²)τ_0θ_0.

So, in general,

    Λ_0 = −(EZ²)τ_0θ_0 ≠ 0.
We have

    √n G_n(θ_0, τ̂_n) = (1/√n) Σ_{i=1}^{n} [ Y_i − (Z_iτ_0)θ_0 ] Z_iτ_0 − ( (τ_0/n) Σ_{i=1}^{n} Z_i² ) θ_0 √n (τ̂_n − τ_0) + o_p(1)
      = (1/√n) Σ_{i=1}^{n} [ Y_i − (Z_iτ_0)θ_0 ] Z_iτ_0 − ( (τ_0/n) Σ_{i=1}^{n} Z_i² ) θ_0 ( (1/n) Σ_{i=1}^{n} Z_i² )^{−1} (1/√n) Σ_{i=1}^{n} Z_i e_{i,X} + o_p(1)
      = (1/√n) Σ_{i=1}^{n} [ Y_i − (Z_iτ_0)θ_0 ] Z_iτ_0 − τ_0θ_0 (1/√n) Σ_{i=1}^{n} Z_i e_{i,X} + o_p(1)
      = (1/√n) Σ_{i=1}^{n} τ_0 Z_i { [ Y_i − (Z_iτ_0)θ_0 ] − θ_0 e_{i,X} } + o_p(1)
      = (1/√n) Σ_{i=1}^{n} τ_0 Z_i { Y_i − (Z_iτ_0 + e_{i,X})θ_0 } + o_p(1)
      = (1/√n) Σ_{i=1}^{n} τ_0 Z_i (Y_i − X_iθ_0) + o_p(1).
Except for a multiplicative scalar change, which has no effect on calculating the asymptotic variance of θ̂_IV, this is exactly the moment condition behind the IV estimator. The correction leads to the expected moment conditions. If we ignored the estimation error in τ̂_n, we would approximate the distribution of √n G_n(θ_0, τ̂_n) by

    (1/√n) Σ_{i=1}^{n} τ_0 Z_i [ Y_i − (Z_iτ_0)θ_0 ].

There is clearly a discrepancy between the above and (1/√n) Σ_{i=1}^{n} τ_0 Z_i { Y_i − X_iθ_0 }. Ignoring the estimation error in τ̂_n leads to an inconsistent estimator of the asymptotic variance of the 2SLS estimator. Note that
    (1/√n) Σ_{i=1}^{n} τ_0 Z_i { Y_i − X_iθ_0 } = (1/√n) Σ_{i=1}^{n} τ_0 Z_i u_i,
    (1/√n) Σ_{i=1}^{n} τ_0 Z_i [ Y_i − (Z_iτ_0)θ_0 ] = (1/√n) Σ_{i=1}^{n} τ_0 Z_i [ θ_0 e_{i,X} + u_i ].

The ratio of the asymptotic variances based on the above expressions is

    ρ = var(u) / var(θ_0 e_X + u).

The ratio is 1 if θ_0 = 0. In this case, the correction is asymptotically ignorable.
Suppose u = κe_X + v for some v such that cov(e_X, v) = 0. Then

    ρ = var(u) / var(θ_0 e_X + κe_X + v) = ( κ² var(e_X) + var(v) ) / ( (θ_0 + κ)² var(e_X) + var(v) ).
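A Monte Carlo sketch of the variance discrepancy (our own; the parameter values are made up) comparing the corrected moment τ_0 Z_i u_i against the naive moment τ_0 Z_i(θ_0 e_{i,X} + u_i):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400_000
theta0, tau0, kappa = 1.5, 2.0, 0.8

z = rng.normal(size=n)
e_x = rng.normal(size=n)
v = rng.normal(size=n)
u = kappa * e_x + v                 # cov(e_x, v) = 0 by construction
x = z * tau0 + e_x
y = x * theta0 + u

s_correct = tau0 * z * (y - x * theta0)        # = tau0 * z * u
s_naive = tau0 * z * (y - z * tau0 * theta0)   # = tau0 * z * (theta0*e_x + u)

rho_mc = s_correct.var() / s_naive.var()
# theory (with var(e_X) = var(v) = 1):
rho_theory = (kappa**2 + 1) / ((theta0 + kappa)**2 + 1)
```

With θ_0 ≠ 0 the ratio is well below one, so a variance estimator built on the naive moment would be badly misleading here.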
We have

    (∂/∂θ) Q_n(θ) = (1/n) Σ_{i=1}^{n} [ (Y_i − g(X_i, θ)) / σ²(X_i, τ̂_n) ] ∂g(X_i, θ)/∂θ,

and so

    √n (∂/∂θ) Q_n(θ_0) = (1/√n) Σ_{i=1}^{n} [ (Y_i − g(X_i, θ_0)) / σ²(X_i, τ̂_n) ] ∂g(X_i, θ_0)/∂θ
      = (1/√n) Σ_{i=1}^{n} [ (Y_i − g(X_i, θ_0)) / σ²(X_i, τ_0) ] ∂g(X_i, θ_0)/∂θ − Λ_n √n (τ̂_n − τ_0) + o_p(1),

where

    Λ_n = (1/n) Σ_{i=1}^{n} [ (Y_i − g(X_i, θ_0)) / [σ²(X_i, τ_0)]² ] (∂g(X_i, θ_0)/∂θ) (∂σ²(X_i, τ_0)/∂τ').
Under correct specification, we have

Note that for a linear regression model, σ̂² B̃_n^{−1} = σ̂² ( (1/n) Σ_{i=1}^{n} X_i'X_i )^{−1}.
Note that the definition of B̂_n does not include the second summand of (∂²/∂θ∂θ')Q_n(θ̂_n) in (5.11). The reason is that the second summand converges in probability to zero because Eg(W_i, θ_0) = 0 and, hence, can be omitted.
Each component of Ω̂_n has been shown in the previous section to converge in probability to the corresponding component of Ω_0. The only exception is the component (1/n) Σ_{i=1}^{n} g(W_i, θ̂_n) g(W_i, θ̂_n)'. The latter converges in probability to E g(W_i, θ_0) g(W_i, θ_0)' by Lemma 27 and Theorem 25, provided θ̂_n →p θ_0, g(w, ·) ∈ C⁰(Θ_0) for all w ∈ W, and E sup_{θ∈Θ_0} ‖g(W_i, θ)‖² < ∞.
    Ω̂_n = [ (∂/∂θ') G_n(θ̂_n, τ̂_n) ]' A_n'A_n ( V̂_{1n} + Λ̂_n V̂_{2n}' + V̂_{2n} Λ̂_n' + Λ̂_n V̂_{3n} Λ̂_n' ) A_n'A_n (∂/∂θ') G_n(θ̂_n, τ̂_n),

where

    Λ̂_n = (∂/∂τ') G_n(θ̂_n, τ̂_n)

and the V̂_{jn} are consistent estimators of V_{j0} for j = 1, 2, 3. If Λ_0 is zero, as occurs in some cases, such as feasible GLS estimation, then one can take Λ̂_n = 0 and the estimators V̂_{2n} and V̂_{3n} are not required.
5.5 OPTIMAL WEIGHT MATRIX 102
    (Γ_0'CΓ_0)^{−1} Γ_0'CΣ_0CΓ_0 (Γ_0'CΓ_0)^{−1},   (5.13)

When this last condition holds, the asymptotic covariance in (5.13) simplifies to (Γ_0'Σ_0^{−1}Γ_0)^{−1}. This choice minimizes the asymptotic covariance matrix of θ̂_n, because we now show that always

    (Γ_0'CΓ_0)^{−1} Γ_0'CΣ_0CΓ_0 (Γ_0'CΓ_0)^{−1} − (Γ_0'Σ_0^{−1}Γ_0)^{−1} ≥ 0,   (5.14)

where "≥" denotes "is psd". Note that for two invertible matrices F and G it holds that F^{−1} − G^{−1} ≥ 0 if and only if G − F ≥ 0. Thus, (5.14) holds if and only if

    Γ_0'Σ_0^{−1}Γ_0 − Γ_0'CΓ_0 (Γ_0'CΣ_0CΓ_0)^{−1} Γ_0'CΓ_0 ≥ 0.   (5.15)

Defining H := Γ_0'Σ_0^{−1/2} and P := I_k − Σ_0^{1/2}CΓ_0 (Γ_0'CΣ_0CΓ_0)^{−1} Γ_0'CΣ_0^{1/2}, the left hand side of (5.15) equals

    Γ_0'Σ_0^{−1/2} P Σ_0^{−1/2}Γ_0 = HPH' = HP(HP)' ≥ 0,

where the second equality uses the fact that P is a projection matrix (i.e. P is symmetric and idempotent, P² = P). The final inequality follows because a matrix of the form HP(HP)' is necessarily psd: z'HP(HP)'z = ‖PH'z‖² ≥ 0 for all z ∈ R^d.
In sum, the optimal weight matrix for the GMM, MD and TS estimators depends on the asymptotic covariance matrix of n^{1/2} (∂/∂θ) Q_n(θ_0), which is Ω_0 = Γ_0'CΣ_0CΓ_0. The optimal weight matrix A_n is such that A_n'A_n →p Σ_0^{−1}. For the GMM and MD estimators, Σ_0 = V_0 and the optimal weight matrix A_n is such that

    A_n'A_n →p A'A = V_0^{−1}.
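The psd inequality (5.14) is easy to verify numerically. The sketch below (our own, with an arbitrary made-up Γ_0, Σ_0 and a deliberately suboptimal weight C) checks that the difference of the two asymptotic covariance matrices has no negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(6)
k, d = 5, 2
gamma = rng.normal(size=(k, d))            # Gamma_0, full column rank (a.s.)
a = rng.normal(size=(k, k))
sigma = a @ a.T + np.eye(k)                # Sigma_0, positive definite
c = np.diag(rng.uniform(0.5, 2.0, k))      # some weight C != Sigma_0^{-1}

m_inv = np.linalg.inv(gamma.T @ c @ gamma)
avar_c = m_inv @ gamma.T @ c @ sigma @ c @ gamma @ m_inv     # covariance under C
avar_opt = np.linalg.inv(gamma.T @ np.linalg.solve(sigma, gamma))  # optimal C = Sigma^{-1}

diff = avar_c - avar_opt
eigs = np.linalg.eigvalsh((diff + diff.T) / 2)   # symmetrize before eigendecomposition
```

Any draw of (Γ_0, Σ_0, C) should give a difference with eigenvalues bounded below by zero (up to floating-point noise), exactly as the projection argument shows.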
    Y_i = X_iβ_0 + ε_i,

where

    P(ε_i < 0 | X_i = x) = α ∈ (0, 1)

for almost all x. That is, conditional on X_i, the 100α% quantile of Y_i is X_iβ_0. We could write X_iβ_0 as X_iβ_{α0} if we want to emphasize that it is the 100α% conditional quantile. For this problem, we can take

    Q_n(β) = ‖E_n g(W, β)‖², with g(W, β) = ( 1{Y < Xβ} − α ) X',

and E_n g(W, β) is the expectation of g(W, β) with respect to the empirical distribution. The limiting function is Q(β) = ‖Eg(W, β)‖².
Often in the literature (e.g., Pollard (1984)), the following notation is used: Pg(W, β) := Eg(W, β), P_n g(W, β) := E_n g(W, β), and
    √n (P_n − P) g(W, β) = (1/√n) Σ_{i=1}^{n} [ g(W_i, β) − E g(W_i, β) ].
5.6.1 Consistency
Theorem 29 Assume
(i) (Definition of β_n)

    Q_n(β_n) ≤ inf_{β∈B} Q_n(β) + o_p(1).

(v) (ULLN)

    sup_{β∈B} ‖(P_n − P) g(W, β)‖ = o_p(1).

Then

    β_n − β_0 = o_p(1).
5.6 NON-DIFFERENTIABLE OBJECTIVE FUNCTION 104
(ii) D(β) = ∂[Pg(W, β)]/∂β' ∈ R^{k×d} exists in a neighborhood of β_0, is of full column rank, and is continuous at β = β_0.
(iii) For all δ_n such that δ_n = o_p(1),

    sup_{‖β−β_0‖≤δ_n} ‖ √n (P_n − P) [ g(W, β) − g(W, β_0) ] ‖ = o_p(1).

Then

    √n (β_n − β_0) →d N(0, Ω),

where

    Ω = (D'D)^{−1} D'V D (D'D)^{−1} and D = D(β_0).
Remark on condition (iii). For each given g, we can apply a CLT to show that

    √n (P_n − P) g →d N( 0, P g² − (Pg)² ).

A uniform version of the above CLT (uniform CLT) gives conditions under which the convergence is locally uniform in g, in the sense that a small perturbation in g leads to only a small change in √n (P_n − P) g. We do not need a precise statement of the uniform CLT, which says that the empirical process √n (P_n − P) g converges in distribution to a continuous Gaussian process. What we need here is the local perturbation property of √n (P_n − P) g given in the lemma below (Pakes and Pollard (1989, Lemma 2.16)).
Lemma 31 Let G be a Euclidean class of functions with envelope Ḡ for which P Ḡ² < ∞. For each η > 0 and ε > 0, there exists a δ > 0 such that

    lim sup_n P( sup_{[δ]} | √n (P_n − P) g_1 − √n (P_n − P) g_2 | > η ) < ε.

Let G = { g(·, β) : ‖β − β_0‖ ≤ δ_n }. Then the above lemma can be used to verify condition (iii), as the class of functions [δ] eventually contains all the pairs g(·, β_1) and g(·, β_2) for which ‖β_1 − β_2‖ ≤ δ_n. For the definition of "Euclidean class", please refer to Pakes and Pollard (1989, Lemma 2.16), Pollard (1984) or Andrews (1999).
Proof of the Theorem.
We begin with the √n-consistency step. The fact that Pg(W, β) is differentiable in β at β = β_0 and D = D(β_0) is of full column rank implies that

    I_2 = ‖P_n g(W, β_n)‖ ≤ inf_{β∈B} ‖P_n g(W, β)‖ + o_p(1/√n)
      ≤ ‖P_n g(W, β_0)‖ + o_p(1/√n)
      = O_p(1/√n).
Define

    L_n(β) = D(β − β_0) + P_n g(W, β_0).

We can now proceed as in the case with a smooth objective function. Define

    B_n = { β : ‖β − β_0‖ ≤ M_n/√n }

for some M_n → ∞ arbitrarily slowly. Given the √n-consistency of β̂_n, we have β̂_n ∈ B_n with probability approaching one. So we can focus on β ∈ B_n. Now, uniformly over β ∈ B_n, we have

    P_n g(W, β) = [P_n − P] g(W, β) − [P_n − P] g(W, β_0) + Pg(W, β) + [P_n − P] g(W, β_0)
      = Pg(W, β) + P_n g(W, β_0) + o_p(1/√n)   by (iii)
      = D(β̃)(β − β_0) + P_n g(W, β_0) + o_p(1/√n)   by (ii)
      = D(β_0)(β − β_0) + P_n g(W, β_0) + o_p(1/√n)   by √n-consistency.

That is, uniformly over β ∈ B_n,

    ‖P_n g(W, β) − L_n(β)‖ = o_p(1/√n).

This implies

    β_n = β*_n + o_p(1/√n),

where

    β*_n = arg min ‖L_n(β)‖ + o_p(1/√n).

As a result,

    √n (β_n − β_0) = √n (β*_n − β_0) + o_p(1) = −(D'D)^{−1} D' √n P_n g(W, β_0) + o_p(1) →d N(0, Ω)

by assumption (iv).
Consequently,

    D = E X_i'X_i f_{ε_i|X_i}(0).

In addition,

    √n P_n g(W, β_0) = (1/√n) Σ_{i=1}^{n} [ 1{Y_i < X_iβ_0} − α ] X_i' →d N(0, V),

and

    Ω = D^{−1} V D^{−1}
      = α(1 − α) { E X_i'X_i f_{ε_i|X_i}(0) }^{−1} × { E X_i'X_i } × { E X_i'X_i f_{ε_i|X_i}(0) }^{−1}.
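A quick simulation check of the moment condition E[(1{Y < Xβ_0} − α)X'] = 0 that drives this estimator (our own sketch; the error is a shifted exponential whose conditional α-quantile is zero by construction):

```python
import numpy as np

rng = np.random.default_rng(3)
n, alpha, beta0 = 500_000, 0.25, 1.0
x = rng.uniform(0.5, 1.5, n)
# error with conditional alpha-quantile zero: for E ~ Exp(1),
# P(E < -log(1 - alpha)) = alpha, so shift E by log(1 - alpha)
eps = rng.exponential(1.0, n) + np.log(1 - alpha)
y = x * beta0 + eps

# per-observation g(W, beta_0) = (1{Y < X beta_0} - alpha) X'
m = (1.0 * (y < x * beta0) - alpha) * x
```

The sample average of the moment should be close to zero at the true β_0, even though the error distribution is asymmetric and non-normal; only the conditional quantile restriction matters.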
For further reading on extremum estimators, consult the Handbook of Econometrics chapters and advanced textbooks in econometrics, such as those listed below.
Bibliography
[3] Pakes, A. and D. Pollard (1989): "Simulation and the Asymptotics of Optimization Estimators," Econometrica, Vol. 57(5), pp. 1027–1057.
[4] Pollard, D. (1984): Convergence of Stochastic Processes. Springer. Available online from his webpage at www.stat.yale.edu.
[5] Newey, W. K., and D. L. McFadden (1994): "Large Sample Estimation and Hypothesis Testing." In Handbook of Econometrics, Vol. 4, edited by R. F. Engle and D. L. McFadden. Amsterdam, The Netherlands: Elsevier Science, pp. 2113–2245.
[6] Wooldridge, J. (2010): Econometric Analysis of Cross Section and Panel Data, Chapters 12–14. The MIT Press.
Chapter 6
Many economic variables are observed as the result of individuals' choices between a limited number of alternatives. In this chapter, we shall assume that only two alternatives are available, e.g., purchase/not purchase a car, apply for/not apply for a job, obtain/not obtain a loan, travel to work by own car/public transport. These are examples of genuine qualitative choices. Since there are two alternatives, we call it a binary choice. We represent the outcome of the choice by a binary variable.
where we have assumed that Pr(Y_i = 1|X_i) depends on X_i through the linear index X_iβ.
Suppose we were to use a linear regression model to explain Y_i:

    Y_i = X_iβ + ε_i.   (6.2)

Because Y_i can take only two values, the error term ε_i, for a given value of X_i, can take only two values:

    ε_i = 1 − X_iβ if Y_i = 1, and ε_i = −X_iβ if Y_i = 0.   (6.3)
6.2 SUPPORT VECTOR MACHINE 110
Clearly, ε_i is heteroskedastic! Given (6.2) and (6.6), we know that the OLS estimator of β is unbiased and consistent. Because of the heteroskedasticity, we have to use the robust variance estimator and the robust t-statistic to make inferences.
Since we know the form of the heteroskedasticity, we can use WLS to obtain more efficient estimates by regressing Y_i/σ̂_i on X_i/σ̂_i, where

    σ̂_i² = X_iβ̂_OLS (1 − X_iβ̂_OLS).   (6.7)
• The model provides 'reasonable' estimates of the partial effects near the center of the distribution of X. This is not rigorously established.
• But X_iβ̂ may be outside of the interval [0, 1]. The partial effect may not be reliable for extreme values of X.
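A sketch of the OLS-then-WLS procedure in (6.7) on simulated LPM data (our own example; the clipping of fitted probabilities away from 0 and 1 is an added safeguard, not in the text, so the estimated variances stay positive):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000
x = rng.uniform(size=n)
X = np.column_stack([np.ones(n), x])
beta0 = np.array([0.2, 0.5])            # keeps p = 0.2 + 0.5x inside (0, 1)
p = X @ beta0
y = (rng.uniform(size=n) < p).astype(float)

# step 1: OLS
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
# step 2: WLS with sigma_i^2 = (X b_ols)(1 - X b_ols), clipped for safety
phat = np.clip(X @ b_ols, 0.01, 0.99)
w = 1.0 / np.sqrt(phat * (1 - phat))
b_wls = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)[0]
```

Both estimators are consistent here; the WLS step only improves efficiency, and the clipping matters in designs where fitted probabilities stray outside [0, 1].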
    r = (wᵀX_i + b) / ‖w‖.

This works out for the case of a positive training example at X in Figure 6.3 (i.e., the case with Y = 1). More generally, we have

    r = Y_i (wᵀX_i + b) / ‖w‖.
or, equivalently,

    min_{w,b} (1/2)‖w‖²  s.t.  Y_i(wᵀX_i + b) ≥ 1 for all i = 1, ..., n.

This is a quadratic optimization problem subject to linear constraints.
At the solution, there will be a few points on the plus and minus planes. These points are called the support vectors. The number of support vectors can be much smaller than the size of the training set.
The Lagrangian for the optimization problem is

    L(w, b, λ) = (1/2)‖w‖² − Σ_{i=1}^{n} λ_i { Y_i(wᵀX_i + b) − 1 }.
The support vectors are given by those X_i's whose λ_i are positive. In fact,

    wᵀx + b = Σ_{i=1}^{n} λ_i Y_i ⟨X_i, x⟩ + b.

Thus, we need to find only the inner products between x and the support vectors in order to make our prediction. More general inner products can be used, leading to the so-called "kernel trick" in the machine learning literature.
    min_{w,b,ξ} (1/2)‖w‖² + C Σ_{i=1}^{n} ξ_i
      s.t.  Y_i(wᵀX_i + b) ≥ 1 − ξ_i for all i = 1, ..., n,
            ξ_i ≥ 0 for all i = 1, ..., n.

The problem is almost the same as before. The only change is that there is now an upper bound on the Lagrangian multipliers.
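The soft-margin problem can equivalently be written as unconstrained hinge-loss minimization, min_{w,b} (1/2)‖w‖² + C Σ_i max(0, 1 − Y_i(wᵀX_i + b)), since the optimal ξ_i equals the hinge term. The sketch below (our own primal subgradient-descent illustration, not the dual QP the text describes) solves it on separable toy data:

```python
import numpy as np

def svm_subgradient(X, y, C=10.0, lr=1e-3, epochs=3000):
    """Soft-margin linear SVM as unconstrained hinge-loss minimization:
    (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w'x_i + b)),
    solved by plain subgradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1.0                              # margin violators
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# two linearly separable clusters (made-up toy data), labels in {-1, +1}
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-2.0, 0.3, (50, 2)), rng.normal(2.0, 0.3, (50, 2))])
y = np.array([-1.0] * 50 + [1.0] * 50)
w, b = svm_subgradient(X, y)
```

Only the points with margin near 1 (the support vectors) end up influencing the solution, mirroring the dual characterization in the text.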
6.3 PROBIT AND LOGIT 115
    F(w) = L(w) = exp(w) / (1 + exp(w)) : logistic distribution ⇒ Logit Model.   (6.11)

For the latter, we have

    f(w) = F'(w) = exp(w) / (1 + exp(w))².   (6.12)
The probit model, which uses the normal distribution, may be justified by appealing to a central limit theorem, while the logit model can be justified by the fact that it is similar to the normal distribution but has a much simpler form. The difference between the logit and normal distributions is that the logit has slightly heavier tails. The standard normal has mean zero and variance 1, while the logistic has mean zero and variance equal to π²/3.
Often the binary choice model is derived from underlying behavioral assumptions: a woman will choose to work for pay if the utility she derives from working is larger than the utility from not working for pay. This leads to a latent variable representation of the model.
Assuming a linear additive relationship, we obtain the utility difference, denoted by Y_i*:

    Y_i* = X_iβ + ε_i,   (6.13)

where ε_i/σ_ε has the CDF F(·). Because the utility difference Y_i* is unobserved, it is referred to as a latent variable. We assume that an individual chooses to work if the utility difference exceeds a certain threshold level, which can be set equal to zero without loss of generality (assuming our model contains an intercept):

    Y_i = 1{Y_i* > 0}.   (6.14)

Consequently, we have

    Pr(Y_i = 1|X_i) = Pr(Y_i* > 0|X_i) = Pr(ε_i/σ_ε > −X_iβ/σ_ε) = F(X_iβ/σ_ε),   (6.15)

provided that the distribution F is symmetric, where F(·) is the c.d.f. of ε_i/σ_ε.
In limited dependent variable models we typically lack identification for some unknown parameter(s). This is due to the fact that we observe only a limited transformation of the latent variable: Y_i = τ(Y_i*) for a noninvertible map τ.
First, in the binary choice example, σ_ε is not identified. Observing Y_i, we only know whether Y_i* exceeds the threshold or not; there is no way to find the scale of Y_i*. In the sequel, therefore, we will set σ_ε = 1.
Second, setting the threshold for Y* at 0 is likewise innocuous if the model contains a constant term. (In general, unless there is some compelling reason, binary choice models should not be estimated without constant terms.)
Remark 32 (Marginal Effect)

    ∂Pr(Y_i = 1|X_i)/∂X_ij = f(X_iβ)β_j.   (6.16)

Note that if we do not set σ_ε = 1, then

    ∂Pr(Y_i = 1|X_i)/∂X_ij = f(X_iβ/σ_ε)(β_j/σ_ε).

The marginal effect depends on the point (X_i1, X_i2, ..., X_ik) at which we evaluate the marginal effect.
Remark 35 Apart from the sign of the coefficients, the coefficients in these binary choice models are not easily interpretable, except perhaps in the logit model, where one can consider the β's to represent the marginal effects of X_ik on the log of the odds: log("odds") = X_iβ, where

    "odds" = Pr(Y_i = 1|X_i) / (1 − Pr(Y_i = 1|X_i)).   (6.19)
6.3.2 Estimation
If we assume that F(·) is known, then the optimal parametric estimator for this problem is ML, which maximizes

    Σ_{i=1}^{n} Y_i log F(X_iβ) + (1 − Y_i) log(1 − F(X_iβ)).   (6.20)

The score is

    s_i(β) = [ Y_i / F(X_iβ) ] f(X_iβ)X_i' − [ (1 − Y_i) / (1 − F(X_iβ)) ] f(X_iβ)X_i'
      = [ Y_i / F(X_iβ) − (1 − Y_i) / (1 − F(X_iβ)) ] f(X_iβ)X_i'
      = [ (Y_i − F(X_iβ)) / ( F(X_iβ)(1 − F(X_iβ)) ) ] f(X_iβ)X_i'.   (6.21)
So,

    ∂s_i(β)/∂β' = −[ f²(X_iβ) / ( F(X_iβ)(1 − F(X_iβ)) ) ] X_i'X_i
      − [ (Y_i − F(X_iβ))(1 − 2F(X_iβ)) / ( F(X_iβ)(1 − F(X_iβ)) )² ] f²(X_iβ) X_i'X_i
      + [ (Y_i − F(X_iβ)) / ( F(X_iβ)(1 − F(X_iβ)) ) ] f(X_iβ)f'(X_iβ) X_i'X_i
      = −[ ( Y_i − 2F(X_iβ)Y_i + F²(X_iβ) ) / ( F(X_iβ)(1 − F(X_iβ)) )² ] f²(X_iβ) X_i'X_i
      + [ (Y_i − F(X_iβ)) / ( F(X_iβ)(1 − F(X_iβ)) ) ] f(X_iβ)f'(X_iβ) X_i'X_i,

and the (normalized) expected Hessian is

    E [ ∂s_i(β)/∂β' ] = −E [ f²(X_iβ) X_i'X_i / ( F(X_iβ)(1 − F(X_iβ)) ) ],   (6.22)
    exp(X_iβ) / (1 + exp(X_iβ)),   (6.23)

The approximate variance of β̂_MLE is

    ( Σ_{i=1}^{n} f²(X_iβ) X_i'X_i / ( F(X_iβ)(1 − F(X_iβ)) ) )^{−1},   (6.25)
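The ML estimator can be computed by Newton's method using the score (6.21) and the expected Hessian (6.22); for the logit, f = F(1 − F), so the score simplifies to (Y_i − F(X_iβ))X_i'. A self-contained sketch (our own implementation, on simulated data):

```python
import numpy as np

def logit_mle(X, y, iters=25):
    """Newton's method for logit ML using the score (6.21) and
    the expected Hessian (6.22)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        F = 1.0 / (1.0 + np.exp(-(X @ beta)))
        score = X.T @ (y - F)                      # sum_i (y_i - F_i) x_i'
        hess = (X * (F * (1 - F))[:, None]).T @ X  # sum_i F_i(1-F_i) x_i'x_i
        beta = beta + np.linalg.solve(hess, score)
    return beta

rng = np.random.default_rng(5)
n = 100_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta0 = np.array([0.5, -1.0])
p = 1.0 / (1.0 + np.exp(-(X @ beta0)))
y = (rng.uniform(size=n) < p).astype(float)
beta_hat = logit_mle(X, y)
```

Because the logit log-likelihood is globally concave in β, Newton's method is reliable here; using the expected Hessian rather than the observed one makes each step a weighted least-squares update (IRLS).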
For testing hypotheses about the coefficients, the full menu of procedures is available (LR, LM and Wald):
• The null H_0: γ = 0.
5. Partial effects: ∆P(Y = 1|X) ≈ f(Xβ̂)β̂_j ∆X_j for small ∆X_j.
6. Compare logit and probit: for probit, f(0) = φ(0) ≈ 0.4; for logit, f(0) = 0.25. The logit estimates can therefore be expected to be larger by a factor of 0.4/0.25 = 1.6. To compare with the LPM, logit estimates should be divided by 4 while probit estimates should be divided by 2.5.
7. Variance of the partial effect.
    Y_1* = Xβ + Y_2α + U,
    Y_2 = Xγ + Zδ + V,   (6.28)

where (X, Z) is exogenous and Y_2 is endogenous. The first equation is the structural equation, and the second equation is the reduced-form equation (not structural). Y_2 is endogenous because U and V are correlated.
Recall the 2SLS regression:
(i) First-stage regression: Ŷ_2 = Xγ̂ + Zδ̂ with estimated residual V̂ = Y_2 − Ŷ_2;
(ii) Second-stage regression: Y_1* = Xβ + Ŷ_2α + error.
The essence of 2SLS is to decompose Y_2 into two parts: one part (i.e., Xγ + Zδ) is exogenous and the other part (i.e., V) is endogenous. We include the exogenous part as a regressor in the second-stage regression while excluding the endogenous part from the regression. Effectively, we move the endogenous part into the error term.
Symmetrically, we can decompose U into two parts: U = ρV + e, where by definition V is correlated with Y_2 and e is not. We include e in the error term while excluding V from the error term. That is, we add V as an additional regressor, leading to

    Y_1* = Xβ + Y_2α + ρV + e.

If V is observable, we can regress Y_1* on X, Y_2 and V by OLS. Since X, Y_2 and V are not correlated with e, the OLS estimator of (α, β, ρ) is consistent. When V is not observable
6.4 PROBIT WITH ENDOGENOUS COVARIATES 120
in practice, we can replace it by V̂ from the first-stage regression. This, combined with Y_2 = Ŷ_2 + V̂, gives rise to the model

    Y_1* = Xβ + Y_2α + ρV̂ + [ e + ρ(V − V̂) ]
         = Xβ + Ŷ_2α + (α + ρ)V̂ + [ e + ρ(V − V̂) ].
    Y_1* = G(X, Y_2, U),

where Y_2 is correlated with U? In this case, we hope to find a control variable W such that the full conditional independence (CI) holds:

    U ⊥ Y_2 | (X, W).

We then compute

    E(Y_1* | X = x, Y_2 = y_2, W = w)   [identifiable]
      = E[ G(x, y_2, U) | X = x, Y_2 = y_2, W = w ]
      = E[ G(x, y_2, U) | X = x, W = w ]   (by conditional independence)
      = ∫_U G(x, y_2, u) f_{U|X,W}(u|x, w) du,
and so

    ρ(x, y_2, w) := ∂E(Y_1* | X = x, Y_2 = y_2, W = w)/∂y_2
      = (∂/∂y_2) ∫_U G(x, y_2, u) f_{U|X,W}(u|x, w) du
      = ∫_U (∂/∂y_2) G(x, y_2, u) f_{U|X,W}(u|x, w) du,

provided that the interchange of ∂/∂y_2 and ∫_U is justified.
Note that (∂/∂y_2) G(x, y_2, u) has a causal interpretation. It is exactly the ceteris paribus causal effect that we defined before. When computing such an effect, we hold x and u constant while making a small change to y_2.
Such an effect depends on the unobserved causal factor U. In general, we should not expect to recover (∂/∂y_2) G(x, y_2, u) at each given value of (x, y_2, u); that is, the ceteris paribus causal effects cannot be identified at the individual level. Different individuals have different ceteris paribus causal effects. Instead of focusing on individual treatment effects, we consider an averaged version of them: ρ(x, y_2, w) is an average treatment effect, where the average is taken over the conditional distribution of the unobservable U given (X, W) = (x, w).
If we would like to average over the conditional distribution of U given X = x, we can integrate out w using the conditional pdf f_{W|X}(w|x):

    ρ̃(x, y_2) = ∫_W ρ(x, y_2, w) f_{W|X}(w|x) dw
      = ∫_U [ ∫_W (∂/∂y_2) G(x, y_2, u) f_{U|X,W}(u|x, w) f_{W|X}(w|x) dw ] du
      = ∫_U (∂/∂y_2) G(x, y_2, u) f_{U|X}(u|x) du.
Consider the model

Y1∗ = Xβ + Y2 α + U
Y2 = Xγ + Zδ + V
Y1 = 1 {Y1∗ > 0}                                     (6.29)

where (U, V ) follows a bivariate normal distribution, is independent of X and Z, and U is
possibly correlated with V. As an example, Y1 is a binary variable indicating a woman’s labor
force participation, Y2 is her years of education, and X contains her parental education.
Assume that var(U ) = 1 and denote var(V ) = σ². Write

U = θV + e                                           (6.30)

where

θ = cov(V, U )/var(V ) = [cov(V, U )/(√var(V ) √var(U ))] · [1/√var(V )] := ρ/σ,   (6.31)

for ρ = corr(V, U ), and

e ∼ N (0, 1 − θ²σ²) = N (0, 1 − ρ²).                 (6.32)

Then, we have

Y1∗ = Xβ + Y2 α + θV + e,                            (6.33)

and

P (Y1 = 1|X, Y2 , V ) = Φ( (Xβ + Y2 α + θV )/√(1 − ρ²) ).   (6.34)

But we do not know V and have to estimate it. Define the scaled coefficients

βρ := β/√(1 − ρ²),   αρ := α/√(1 − ρ²),   and   θρ := θ/√(1 − ρ²).   (6.35)
Ordinary probit standard errors calculated from the second step are inconsistent because
estimated residuals are treated as if they were observations of the true first-stage errors. To
get consistent standard errors, we need to take into account the additional uncertainty that
results from using (γ̂, δ̂) as opposed to the true values γ and δ.
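The two-step procedure can be sketched on simulated data as follows (a minimal illustration; all numerical parameter values below are hypothetical, not from the text): step 1 runs OLS of Y2 on (X, Z) to obtain V̂; step 2 runs a probit of Y1 on (X, Y2, V̂). The second-stage slopes estimate the scaled parameters (βρ, αρ, θρ), not (β, α, θ).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 5000

# Simulated triangular system (hypothetical parameter values):
#   Y2 = 0.5*X + 1.0*Z + V,  Y1 = 1{1.0*X + 0.8*Y2 + U > 0},  corr(U, V) = 0.5
X = rng.normal(size=n)
Z = rng.normal(size=n)
V = rng.normal(size=n)                                    # sigma = 1
rho = 0.5
U = rho * V + np.sqrt(1 - rho**2) * rng.normal(size=n)    # var(U) = 1
Y2 = 0.5 * X + 1.0 * Z + V
Y1 = (1.0 * X + 0.8 * Y2 + U > 0).astype(float)

# Step 1: OLS of Y2 on (1, X, Z); keep the residuals V-hat.
W = np.column_stack([np.ones(n), X, Z])
gamma_hat, *_ = np.linalg.lstsq(W, Y2, rcond=None)
Vhat = Y2 - W @ gamma_hat

# Step 2: probit of Y1 on (1, X, Y2, V-hat).  The slopes estimate the scaled
# parameters (beta, alpha, theta)/sqrt(1 - rho^2), not (beta, alpha, theta).
R = np.column_stack([np.ones(n), X, Y2, Vhat])

def negloglik(b):
    p = np.clip(norm.cdf(R @ b), 1e-10, 1 - 1e-10)
    return -np.sum(Y1 * np.log(p) + (1 - Y1) * np.log(1 - p))

b_hat = minimize(negloglik, np.zeros(4), method="BFGS").x
```

As the text notes, the naive second-stage standard errors ignore the estimation error in (γ̂, δ̂); bootstrapping both steps jointly is one simple way to account for it.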
Note that once βρ , αρ , θρ , and σ² are estimated, we can back out the estimates for α, β,
and ρ. More specifically, note that

θρ = [ρ/√(1 − ρ²)] · (1/σ).

So we can back out ρ if we know θρ and σ². Once we know ρ, we can back out α and β from
αρ and βρ :

α = αρ √(1 − ρ²) = αρ /√(1 + θρ² σ²),
β = βρ √(1 − ρ²) = βρ /√(1 + θρ² σ²).
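These back-out formulas can be checked numerically; the sketch below round-trips hypothetical structural values through the scaling and back.

```python
import numpy as np

def back_out(alpha_rho, beta_rho, theta_rho, sigma2):
    """Recover (alpha, beta, rho) from the scaled parameters.

    theta_rho = rho/(sigma*sqrt(1 - rho^2)) implies
    theta_rho^2 * sigma^2 = rho^2/(1 - rho^2), so rho^2 = t/(1 + t)."""
    t = theta_rho**2 * sigma2
    rho = np.sign(theta_rho) * np.sqrt(t / (1 + t))
    scale = np.sqrt(1 - rho**2)        # equals 1/sqrt(1 + theta_rho^2 * sigma^2)
    return alpha_rho * scale, beta_rho * scale, rho

# Round trip: scale hypothetical structural values, then recover them.
alpha, beta, rho, sigma = 0.8, 1.0, 0.5, 1.5
theta = rho / sigma
s = np.sqrt(1 - rho**2)
a, b, r = back_out(alpha / s, beta / s, theta / s, sigma**2)
```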
For the subpopulation whose (X, Y2 ) is equal to (xo , y2o ), we ask what the average
response would be had their U followed a certain distribution, say N (0, 1), the marginal distribution of
U in the population. The answer is

EY1 (xo , y2o , U ) = Φ(xo β + y2o α) = ∫ 1 {xo β + y2o α + u > 0} φU (u) du,

where u is the dummy of integration, and EY1 (xo , y2o , U ) is the expectation of Y1 when we set
(X, Y2 ) to (xo , y2o ) while letting U follow the distribution N (0, 1) instead of its conditional
distribution given that (X, Y2 ) = (xo , y2o ). EY1 (xo , y2o , U ) has a causal/structural interpretation,
as it is the average value of Y1 when (X, Y2 ) is set to any value (xo , y2o ) while keeping
U the same in an average sense (i.e., the distribution of U does not change with the settings
of (X, Y2 )). For another pair of values (x∗o , y∗2o ), we would do exactly the same calculation:

EY1 (x∗o , y∗2o , U ) = Φ(x∗o β + y∗2o α) = ∫ 1 {x∗o β + y∗2o α + u > 0} φU (u) du.

When U ∼ N (0, 1) but the conditional distribution of U given (X, Y2 ) = (xo , y2o ) is not
N (0, 1), in general

EY1 (xo , y2o , U ) ≠ E [Y1 (X, Y2 , U ) | (X, Y2 ) = (xo , y2o )] .

While EY1 (xo , y2o , U ) can be regarded as a “structural” expectation, E [Y1 (X, Y2 , U ) | (X, Y2 ) = (xo , y2o )]
is the statistical expectation.
Assuming that the unobserved factor U has the marginal distribution N (0, 1), the partial
effect can be computed as

APE1 = ∂EY1 (xo , y2o , U )/∂y2o = α φ(xo β + y2o α),

which can be estimated by

ÂPE1 = α̂ φ(xo β̂ + y2o α̂).

Consider changing y2o by ∆ for a given individual:

Y1 (xo , y2o + ∆) − Y1 (xo , y2o ) = 1 {xo β + (y2o + ∆)α + U > 0} − 1 {xo β + y2o α + U > 0} .

The effect will be different for different individuals because their U ’s are different. For some
individuals, Y1 will not change; for others, Y1 will change. We ask: what is the average change
in the population? To address this question, we take the average of the above difference, leading to

E [Y1 (xo , y2o + ∆) − Y1 (xo , y2o )] = Φ(xo β + (y2o + ∆)α) − Φ(xo β + y2o α).

Scaling this by ∆ and letting ∆ → 0, we obtain APE1 given above. For more discussion
along this line, see Section 2.2.5 in Wooldridge (2010).
If we are not interested in a particular pair (xo , y2o ) , we can let (xo , y2o ) vary over the
sample and obtain
ÃPE1 = α̂ · (1/n) Σ_{i=1}^{n} φ(Xi β̂ + Y2i α̂).
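A minimal sketch of both APE estimators (point evaluation and sample average), using hypothetical values of α̂ and β̂ and artificial data for (Xi, Y2i):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical estimates from the second step (illustration only).
alpha_hat, beta_hat = 0.8, 1.0

def ape_at(x_o, y2_o):
    # APE-hat_1 evaluated at a chosen point (x_o, y2_o)
    return alpha_hat * norm.pdf(x_o * beta_hat + y2_o * alpha_hat)

def ape_averaged(X, Y2):
    # Sample-averaged version: alpha-hat times the mean of phi(Xi*beta + Y2i*alpha)
    return alpha_hat * np.mean(norm.pdf(X * beta_hat + Y2 * alpha_hat))

rng = np.random.default_rng(1)
X = rng.normal(size=1000)
Y2 = rng.normal(size=1000)
ape_point = ape_at(0.0, 0.0)
ape_avg = ape_averaged(X, Y2)
```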
The distinction between the “structural” expectation and the statistical expectation is not specific
to probit models or more general nonlinear and nonseparable models. Such a distinction
applies to linear models as well. Consider the linear causal model with endogeneity Y = W θ + U. Then

EY (wo , U ) = wo θ + EU = wo θ,   whereas
E [Y (W, U ) | W = wo ] = wo θ + E(U |W = wo ).

Second Approach

An alternative method of computing the APE is to look at

E(Y1 (xo , y2o , U ) | V = vo ) = ∫ 1 {xo β + y2o α + u > 0} φU|V (u|vo ) du.

Here we look at the subpopulation whose V value is equal to vo . For this subpopulation, (X, Y2 )
and U are independent. So

E(Y1 (xo , y2o , U ) | V = vo ) = E [Y1 | (X, Y2 , V ) = (xo , y2o , vo )] .

The right-hand side is the usual conditional expectation. We have connected a structural object,
i.e., E(Y1 (xo , y2o , U ) | V = vo ), with a statistical object, i.e., E [Y1 | (X, Y2 , V ) = (xo , y2o , vo )] .
Define the Average Structural Function as

ASF (xo , y2o , vo ) = Φ(xo βρ + y2o αρ + θρ vo ),

where

βρ = β/√(1 − ρ²),   αρ = α/√(1 − ρ²),   and   θρ = θ/√(1 − ρ²).   (6.36)
Let fV (·) be the marginal pdf of V. The second way to compute the average partial effect
with respect to Y2 at (X, Y2 ) = (xo , y2o ) is

APE2 = ∫_V [∂ASF (xo , y2o , vo )/∂y2 ] fV (vo ) dvo        (6.37)
     = (∂/∂y2 ) EV [ASF (xo , y2o , V )]                    (6.38)
     = (∂/∂y2 ) EV [Φ(xo βρ + y2o αρ + θρ V )] ,            (6.39)

where EV is the expectation operator with respect to the marginal distribution of V.
Under the normality assumption, i.e., V ∼ N (0, σ²), we can find ẽ ∼ N (0, 1 − ρ²) that
is independent of V such that

EV [Φ(xo βρ + y2o αρ + θρ V )]
= EV Eẽ 1 {xo β + y2o α + θV + ẽ > 0}
= E 1 {xo β + y2o α + θV + ẽ > 0}
= E 1 {xo β + y2o α + Ũ > 0}    (for Ũ := θV + ẽ ∼ N (0, 1))
= Φ(xo β + y2o α).

So

APE2 = α φ(xo β + y2o α).

It is now clear that APE1 = APE2 . For APE1 , the average is taken over the marginal
distribution of U. For APE2 , the average is taken over the conditional distribution of U given
V , followed by averaging over the marginal distribution of V.
Based on the equivalence of the two definitions, we can find a different way to compute
the APE. Given that V ∼ N (0, σ²), we have

APE2 = (∂/∂y2 ) EV [Φ(xo βρ + y2o αρ + θρ V )]
     = EV [φ(xo βρ + y2o αρ + θρ V )] αρ
     = [αρ /√(1 + σ²θρ²)] φ( (xo βρ + y2o αρ )/√(1 + σ²θρ²) ).
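The last equality uses EV φ(a + θρ V ) = φ(a/√(1 + σ²θρ²))/√(1 + σ²θρ²) for V ∼ N (0, σ²); a quick Monte Carlo check of this closed form (with hypothetical parameter values):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Hypothetical scaled parameters (illustration only).
beta_r, alpha_r, theta_r, sigma = 1.2, 0.9, 0.6, 1.5
a = 0.3 * beta_r + 1.0 * alpha_r            # index x_o*beta_rho + y2_o*alpha_rho

# Closed form for APE2.
s = np.sqrt(1 + sigma**2 * theta_r**2)
ape2_closed = (alpha_r / s) * norm.pdf(a / s)

# Monte Carlo version of E_V[phi(a + theta_rho*V)]*alpha_rho with V ~ N(0, sigma^2).
V = rng.normal(scale=sigma, size=1_000_000)
ape2_mc = alpha_r * np.mean(norm.pdf(a + theta_r * V))
```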
Obviously ÂPE1 = ÂPE2 . While ÂPE1 uses the estimates of the deep structural parameters
α and β, ÂPE2 uses the estimates of the parameters αρ , βρ , θρ² and σ², which are (arguably)
more of a reduced-form nature.
Like ÃPE1 , we could also compute

ÃPE2 = α̂ρ · (1/n) Σ_{i=1}^{n} φ(xo β̂ρ + y2o α̂ρ + θ̂ρ V̂i ).   (6.41)

The asymptotic variance of the APE is difficult to compute, but we can employ standard
bootstrap methods.
If we replace γ and δ by the first-stage LS estimators γ̂ and δ̂ and perform the probit MLE

(α̂ρ⁺ , β̂ρ⁺ ) = arg max Σ_{i=1}^{n} { Y1i log Φ(Xi βρ⁺ + Ŷ2i αρ⁺ ) + (1 − Y1i ) log[1 − Φ(Xi βρ⁺ + Ŷ2i αρ⁺ )] },

where

βρ⁺ = β/√(1 + α²σ² + 2ρασ)   and   αρ⁺ = α/√(1 + α²σ² + 2ρασ),

then, using the general theory of two-step MLE, we can show that α̂ρ⁺ →p αρ⁺ and β̂ρ⁺ →p βρ⁺ .
Note that σ can be consistently estimated from the first-stage regression. The problem is that
we do not know ρ, and this two-step IV procedure does not provide a consistent estimator of it.
So we cannot recover α and β. As a result, we cannot estimate the APE.
Alternatively, note that

Pr(Y1 = 1|Y2 , X, Z)
= Pr(Xβ + Y2 α + U > 0|Y2 , X, Z)
= Φ( [Xβ + Y2 α + θ(Y2 − Xγ − Zδ)]/√(1 − ρ²) )
= Φ( [Xβ + Y2 α + ρσ⁻¹(Y2 − Xγ − Zδ)]/√(1 − ρ²) )
:= Φ(q).                                              (6.43)

We can then maximize the sum of the log-likelihood with respect to α, β, γ, δ, ρ, and σ², leading
to a conditional MLE (CMLE).

Remark 38 1. The conditional MLE is more efficient than the two-step procedure but computationally
more demanding.
2. Testing H0 : ρ = 0 is straightforward ⇒ use either a t test or an LR test.
3. It is easy to abuse the two-step procedure.
6.5 PANEL LOGIT AND PROBIT MODELS
f (Y1∗ , Y2 |X, Z)
= f (Y1∗ |Y2 , X, Z) f (Y2 |X, Z)
= (1/σe ) φ( [Y1∗ − Xβ − Y2 α − θ(Y2 − Xγ − Zδ)]/σe ) · (1/σV ) φ( [Y2 − Xγ − Zδ]/σV ).   (6.45)

The CMLE based on the above pdf is the Limited Information Maximum Likelihood (LIML)
estimator. When both Y1∗ and Y2 are observable, we can treat them symmetrically and obtain
a different representation of f (Y1∗ , Y2 |X, Z). If the error terms are normal, then (Y1∗ , Y2 )|X, Z
follows a bivariate normal distribution.

where G is a known function (the normal or logistic CDF). Note that the conditioning set is the
contemporaneous Xit , not Xi .

is consistent and asymptotically normal. The partial likelihood is just a way to combine the
β̂τ ’s. Of course, we could just take a simple or weighted average of the β̂τ ’s to get our final estimator.
But it is typical in the literature to pool the objective functions and define β̂ as the maximizer
of the partial log-likelihood in (6.47). In fact, the so-defined β̂ is a weighted average of the β̂τ ’s
with weights depending on asyvar(β̂τ ).
We now show that the partial MLE is asymptotically normal. We usually proceed as
follows: the FOC is

Σ_{i=1}^{N} Σ_{t=1}^{T} sit (β̂) = 0,                      (6.50)

where sit (β) = ∇β log f (Yit |Xit , β). A Taylor expansion of the above FOC gives

0 = Σ_{i=1}^{N} Σ_{t=1}^{T} sit (β0 ) − [ Σ_{i=1}^{N} Σ_{t=1}^{T} Hit (β̃) ] (β̂ − β0 ).   (6.51)

Now, under mild regularity conditions, we have the weak convergence result

(1/√N ) Σ_{i=1}^{N} Σ_{t=1}^{T} sit (β0 ) ⇒ N (0, B),      (6.53)

where

B = lim_{N→∞} var( (1/√N ) Σ_{i=1}^{N} Σ_{t=1}^{T} sit (β0 ) )   (6.54)

and

(1/N ) Σ_{i=1}^{N} Σ_{t=1}^{T} Hit (β̃) →p E Σ_{t=1}^{T} Hit (β0 ) := A.   (6.55)

So

√N (β̂ − β0 ) ⇒ N (0, A⁻¹BA⁻¹).                            (6.56)

or

Â = (1/N ) Σ_{i=1}^{N} Σ_{t=1}^{T} sit (β̂) sit (β̂)′.       (6.58)

Due to cross-sectional independence, var( (1/√N ) Σ_{i=1}^{N} Σ_{t=1}^{T} sit (β0 ) ) is

(1/N ) Σ_{i=1}^{N} E [ ( Σ_{t=1}^{T} sit (β0 ) ) ( Σ_{t=1}^{T} sit (β0 ) )′ ].   (6.59)

So B can be estimated by

(1/N ) Σ_{i=1}^{N} ( Σ_{t=1}^{T} sit (β̂) ) ( Σ_{t=1}^{T} sit (β̂) )′
= (1/N ) Σ_{i=1}^{N} Σ_{t=1}^{T} sit (β̂) sit (β̂)′ + (1/N ) Σ_{i=1}^{N} Σ_{t≠τ} sit (β̂) siτ (β̂)′,   (6.60)

where the second term in the above expression accounts for possible serial correlation in the
score.
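The sandwich variance A⁻¹BA⁻¹/N with the cluster-summed scores in (6.60) can be computed as follows (a sketch assuming the per-observation scores s_it(β̂) and Hessian blocks H_it(β̂) have already been evaluated):

```python
import numpy as np

def sandwich_variance(scores, hessians):
    """Cluster-robust sandwich estimator A^{-1} B A^{-1} / N.

    scores:   (N, T, k) array of per-(i,t) scores s_it(beta-hat)
    hessians: (N, T, k, k) array of per-(i,t) negative-Hessian blocks H_it(beta-hat)
    """
    N = scores.shape[0]
    A = hessians.sum(axis=(0, 1)) / N          # (1/N) sum_i sum_t H_it
    s_i = scores.sum(axis=1)                   # cluster (unit) sums of the scores
    B = s_i.T @ s_i / N                        # (1/N) sum_i (sum_t s_it)(sum_t s_it)'
    A_inv = np.linalg.inv(A)
    return A_inv @ B @ A_inv / N               # approximate var(beta-hat)
```

Summing the scores within each unit before taking the outer product is exactly what allows arbitrary serial correlation in the score, as in (6.60).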
For the probit model, a simple, general estimator of the asymptotic variance is

[ Σ_{i=1}^{N} Σ_{t=1}^{T} Ait (β̂) ]⁻¹ [ Σ_{i=1}^{N} si (β̂) si (β̂)′ ] [ Σ_{i=1}^{N} Σ_{t=1}^{T} Ait (β̂) ]⁻¹,   (6.61)

where

Ait (β̂) = φ²(Xit β̂) X′it Xit / { Φ(Xit β̂)[1 − Φ(Xit β̂)] },   (6.62)

and

si (β̂) = Σ_{t=1}^{T} sit (β̂) = Σ_{t=1}^{T} φ(Xit β̂) X′it [Yit − Φ(Xit β̂)] / { Φ(Xit β̂)[1 − Φ(Xit β̂)] }.   (6.63)

For τ < t,

E sit (β0 ) siτ (β0 )′
= E { E [ sit (β0 ) siτ (β0 )′ | Xit , Yit−1 , Xit−1 , Yit−2 , ..., Yi1 , Xi1 ] }
= E { E [ sit (β0 ) | Xit , Yit−1 , Xit−1 , Yit−2 , ..., Yi1 , Xi1 ] siτ (β0 )′ }
= 0,                                                       (6.68)

since the inner conditional expectation of the score is zero.
where
Ui ∼ N (0, IT ) conditional on Xi and αi .    (6.70)
Implicitly, we assume that the distribution of Ui is independent of Xi , conditional on αi . In
this sense, Xi is strictly exogenous. The strict exogeneity rules out lagged dependent variables,
as well as explanatory variables whose future movements depend on current or past values of
Ui .
For this model, the probability density of Yi1 , Yi2 , ..., YiT conditional on Xi and αi is

f (Yi |Xi , αi , β) = Π_{t=1}^{T} f (Yit |Xi , αi , β)       (6.71)
                  = Π_{t=1}^{T} f (Yit |Xit , αi ; β) = Π_{t=1}^{T} Φ(Xit β + αi )^{Yit} (1 − Φ(Xit β + αi ))^{1−Yit} .   (6.72)

We can maximize the following conditional log-likelihood function with respect to both β and σα :

Σ_{i=1}^{N} log f (Yi |Xi ) = Σ_{i=1}^{N} log ∫_{−∞}^{∞} [ Π_{t=1}^{T} f (Yit |Xit , α; β) ] (1/σα ) φ(α/σα ) dα.   (6.74)

Since β and σα can be consistently estimated, we can estimate the partial effect at α = 0 and
the APE, viz.

∂P (Yit = 1|Xit )/∂Xit,j = [βj /√(1 + σα²)] φ( Xit β/√(1 + σα²) ).   (6.75)
The integral in the log-likelihood function can be approximated using M -point Gauss-Hermite
quadrature:

∫_{−∞}^{∞} e^{−X²} g(X) dX ≈ Σ_{m=1}^{M} w∗m g(a∗m ),

where the w∗m denote the quadrature weights and the a∗m denote the quadrature nodes. The
log-likelihood function is then calculated as

L = Σ_{i=1}^{N} log [ Σ_{m=1}^{M} w∗m (1/√π) Π_{t=1}^{T} f (Yit |Xit , √2 σα a∗m ; β) ].

When T is small, Π_{t=1}^{T} f (Yit |Xit , α; β) can be well approximated by a polynomial in α. In this
case, the Gauss-Hermite quadrature provides a good approximation to the integral. Some
simulations show that M = 50 is a safe upper bound. When T is large, the Gauss-Hermite
approximation will be very poor. The quality of the approximation also depends on the value
of σα : the larger σα is, the poorer the approximation.
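A sketch of the Gauss-Hermite approximation for a single unit's likelihood contribution. Here I use the probabilists' version of the quadrature (weight e^{−a²/2}, available as numpy's hermegauss), which absorbs the √2 and 1/√π factors of the formula above:

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss
from scipy.stats import norm

def re_probit_loglik_i(y, X, beta, sigma_alpha, M=20):
    """M-point Gauss-Hermite approximation to one unit's log-likelihood in the
    random-effects probit model.  hermegauss uses the weight exp(-a^2/2), so
    the N(0, sigma_alpha^2) density is handled by evaluating at
    sigma_alpha * node and dividing by sqrt(2*pi)."""
    nodes, weights = hermegauss(M)
    lik = 0.0
    for a, w in zip(nodes, weights):
        p = norm.cdf(X @ beta + sigma_alpha * a)
        lik += w * np.prod(np.where(y == 1, p, 1.0 - p))
    return np.log(lik / np.sqrt(2.0 * np.pi))
```

With σα near zero the integrand is flat in α and the value collapses to the pooled probit log-likelihood, which gives a convenient correctness check.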
Some Extensions
Assumptions (6.70) and (6.73) are very strong, and it is possible to relax them.
where, conditional on Xit , αi ∼ N (0, σα²), Uit ∼ N (0, 1), and αi and Uit are independent.
However, Uit may be correlated with Uis for t ≠ s. This assumption may be more reasonable
and weaker than the assumption in the previous section, which assumes that, conditional on
Xi , αi and Ui are independent with αi ∼ N (0, σα²) and Ui ∼ N (0, IT ).
For this model, we have

P (Yit = 1|Xit ) = P (Uit + αi > −Xit β|Xit ) = Φ( Xit β/√(1 + σα²) ).   (6.77)

Therefore, as in the previous section, we can estimate β/√(1 + σα²) from a pooled probit of Yit
on Xit . If αi is truly present or Uit is autocorrelated, then Yit will not be independent across
t. Robust inference is needed to account for the serial correlation, as discussed in the previous
section.
or

αi = ψ + X̄i,· ξ + ai ,                                (6.80)

with ai ∼ N (0, σa²) and independent of Xi , where ψ denotes the intercept. As in the linear
model, we cannot estimate the effect of time-invariant variables. This is because they are
indistinguishable from the effect of X̄i,· ξ.
The latent structure representation becomes Yit = 1 {Yit∗ > 0} with

Yit∗ = ψ + Xit β + X̄i,· ξ + ai + Uit ,

where, conditional on (Xi , ai ), Uit ∼ iid N (0, 1). The parameters β, ψ, ξ, and σa can be estimated
as before, i.e., by maximizing (6.74) with Xit properly defined. More specifically, the
log-likelihood function is

Σ_{i=1}^{N} log f (Yi |Xi )
= log Π_{i=1}^{N} ∫_{−∞}^{∞} [ Π_{t=1}^{T} Φ(X̃it θ + ai )^{Yit} [1 − Φ(X̃it θ + ai )]^{1−Yit} ] (1/σa ) φ(ai /σa ) dai
= Σ_{i=1}^{N} log ∫_{−∞}^{∞} [ Π_{t=1}^{T} Φ(X̃it θ + ai )^{Yit} [1 − Φ(X̃it θ + ai )]^{1−Yit} ] (1/σa ) φ(ai /σa ) dai ,   (6.82)

where X̃it = (1, Xit , X̄i,· ) and θ = (ψ, β′, ξ′)′.
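Constructing X̃it amounts to appending a constant and the unit means X̄i,· to the regressors; a minimal sketch:

```python
import numpy as np

def mundlak_design(X, ids):
    """Append a constant and the unit means X-bar_i to X_it, giving the
    augmented regressor X-tilde_it = (1, X_it, X-bar_i).
    X: (NT, k) stacked panel regressors; ids: (NT,) unit labels."""
    Xbar = np.empty_like(X, dtype=float)
    for g in np.unique(ids):
        m = ids == g
        Xbar[m] = X[m].mean(axis=0)
    return np.hstack([np.ones((X.shape[0], 1)), X, Xbar])
```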
If, conditional on (Xi , ai ), Uit ∼ N (0, 1) but Uit may be correlated with Uis , we can still
estimate a scaled version of θ. In this case, we have

P (Yit = 1|Xi ) = Φ( [Xit β + ψ + X̄i,· ξ]/√(1 + σa²) ) := Φ( X̃it θa ).   (6.83)

The partial likelihood function Π_{t=1}^{T} Π_{i=1}^{N} P (Yit |Xi ) can be regarded as derived from different
waves, with each wave corresponding to one time period. For example, in wave/period t, the
observations are (Yit , Xi )_{i=1}^{N} . It is important to point out that

Π_{t=1}^{T} [ Π_{i=1}^{N} P (Yit |Xi ) ]

If, in addition, Uit ∼ iid N (0, 1) across t conditional on (Xi , ai ), then the above likelihood
function becomes

Π_{i=1}^{N} ∫_{−∞}^{∞} [ Π_{t=1}^{T} P (Yit |Xi , ai ) ] (1/σa ) φ(ai /σa ) dai .
where Eα (Ea ) is the expectation with respect to the marginal distribution of αi (ai ). The
corresponding APE is

∂Eα P (Yit = 1|Xit = xo , αi )/∂xo,j = [βj /√(1 + σa²)] E φ( [xo β + ψ + X̄i,· ξ]/√(1 + σa²) ).

The average (structural) response probability (over the whole population) can be estimated by

(1/N ) Σ_{i=1}^{N} Φ( xo β̂a + ψ̂a + X̄i,· ξ̂a ).   (6.85)

The corresponding estimator of the APE is then

β̂a,j · (1/N ) Σ_{i=1}^{N} φ( xo β̂a + ψ̂a + X̄i,· ξ̂a ).

The latter is

where

APE(fX̄i,· |Xit (x̄|xo )) = ∫ [ ∂Φ( (xo β + ψ + x̄ξ)/√(1 + σa²) )/∂xo,j ] fX̄i,· |Xit (x̄|xo ) dx̄,

which has a structural interpretation, and

Bias = ∫ Φ( (xo β + ψ + x̄ξ)/√(1 + σa²) ) [ ∂ log fX̄i,· |Xit (x̄|xo )/∂xo,j ] fX̄i,· |Xit (x̄|xo ) dx̄
     = E [ Φ( (Xit β + ψ + X̄i,· ξ)/√(1 + σa²) ) · ∂ log fX̄i,· |Xit (X̄i,· |Xit )/∂Xit,j | Xit = xo ]
     = cov [ Φ( (Xit β + ψ + X̄i,· ξ)/√(1 + σa²) ), ∂ log fX̄i,· |Xit (X̄i,· |Xit )/∂Xit,j | Xit = xo ] ,

where the covariance term captures the structural or causal discrepancy.
Similarly,

P (Yi1 = 1, Yi2 = 0|Xi , αi , ni = 1) = 1 − Λ((Xi2 − Xi1 )β).   (6.88)

The conditional log-likelihood contribution for individual i is

1 {ni = 1} { Wi log Λ((Xi2 − Xi1 )β) + (1 − Wi ) log [1 − Λ((Xi2 − Xi1 )β)] },   (6.89)

where

Wi = 1 {Yi1 = 0, Yi2 = 1} .   (6.90)

The above likelihood approach is equivalent to a standard cross-sectional logit of Wi on Xi2 −
Xi1 using the observations for which ni = 1.
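This equivalence is easy to verify by simulation: generate a fixed-effects logit panel with T = 2, keep the switchers, and run a logit of Wi on Xi2 − Xi1 without an intercept (a sketch with hypothetical parameter values):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, beta_true = 4000, 1.0
alpha = rng.normal(size=n)                        # fixed effects
X1 = rng.normal(size=n) + 0.5 * alpha             # regressors correlated with alpha
X2 = rng.normal(size=n) + 0.5 * alpha
Y1 = (beta_true * X1 + alpha + rng.logistic(size=n) > 0).astype(int)
Y2 = (beta_true * X2 + alpha + rng.logistic(size=n) > 0).astype(int)

# Keep the switchers (n_i = 1) and run a logit of W_i = 1{Y_i1=0, Y_i2=1}
# on X_i2 - X_i1 with no intercept; alpha_i drops out of this conditional model.
sw = (Y1 + Y2) == 1
W = ((Y1 == 0) & (Y2 == 1))[sw].astype(float)
dX = (X2 - X1)[sw]

def negloglik(b):
    p = np.clip(1.0 / (1.0 + np.exp(-dX * b[0])), 1e-12, 1 - 1e-12)
    return -np.sum(W * np.log(p) + (1 - W) * np.log(1 - p))

beta_hat = minimize(negloglik, np.array([0.0]), method="BFGS").x[0]
```

Even though the covariates are built to be correlated with the fixed effects, the conditional logit recovers β without ever estimating the αi's.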
To generalize the result from T = 2 to a more general T, we derive an alternative representation
of P (Yi1 = 0, Yi2 = 1|Xi , αi , ni = 1):

P (Yi1 = 0, Yi2 = 1|Xi , αi , ni = 1) = exp(Xi2 β) / Σ′ exp [ai1 (Xi1 β) + ai2 (Xi2 β)] ,

where Σ′ sums over (ai1 , ai2 ) ∈ {0, 1}² with ai1 + ai2 = 1. Similarly,

P (Yi1 = 1, Yi2 = 0|Xi , αi , ni = 1) = exp(Xi1 β) / Σ′ exp [ai1 (Xi1 β) + ai2 (Xi2 β)] .
For a general T, the log-likelihood is more complicated, but it is tractable. First,
where

Ri = { ai ∈ R^T : ait ∈ {0, 1}, Σ_{t=1}^{T} ait = ni }.   (6.92)

The log-likelihood summed over i can be used to obtain a √N -asymptotically normal estimator
of β, and all inference follows from conditional MLE theory.

Remark 40 We cannot estimate the average partial effect because we do not know the distribution
of αi . Even worse, the mean of αi may be nonzero. Alternatively, we can include a
constant in the regression such that the mean of αi is zero. But in this case, the constant is
“differenced” out and cannot be estimated from the conditional logit.
Remark 41 The consistency relies on the assumption that Uit is iid logistic across i and t
conditional on (Xi , αi ) .
P (Wi = t|Xi , αi , ni = 1) = exp(Xit β) / Σ_{s=1}^{T} exp(Xis β).   (6.94)
Formally, this is the same as the conditional logit model in the next chapter when we model
unordered multinomial responses.
where Uit ∼ iid G(·), with the density G′(·) symmetric. In addition, Uit is independent of
{Yi,t−1 , ..., Yi,0 , Zi , αi }. The joint density is

With fixed-T asymptotics, this density will not deliver a consistent estimator of β, due to the
incidental parameter problem. To avoid the incidental parameter problem, we again make
distributional assumptions on the αi ’s and integrate them out:

f (Yi,1 , Yi,2 , ..., Yi,T |Yi,0 , Zi ; θ) = ∫_{−∞}^{∞} f (Yi,1 , Yi,2 , ..., Yi,T |Yi,0 , Zi , α; β) h(α|Yi0 , Zi ; γ) dα.   (6.98)

where eit ∼ iid N (0, 1) and is independent of the other variables. Therefore, the density of
Yi,1 , Yi,2 , ..., Yi,T given (Yi,0 , Zi ) yields the log-likelihood

Σ_{i=1}^{N} log ∫_{−∞}^{∞} [ Π_{t=1}^{T} f (Yi,t |Yi,0 , Zi , a; β) ] (1/σa ) φ(a/σa ) da,   (6.100)

where

f (Yi,t |Yi,0 , Zi , a; β) = Φ(Xi,t β + a)^{Yi,t} (1 − Φ(Xi,t β + a))^{1−Yi,t}   (6.101)

and Xi,t = (1, Zi,t , Yi,t−1 , Yi,0 , Zi ).
For more details, such as how to initialize the process differently, see Ch. 7.4 of Hsiao (2003).
where Ψ(Xθ) is some parametric model of the conditional probability of the binary variable
Y given X, i.e. Ψ(Xθ) = P {Y = 1|X, θ}.
(a) Using the artificially generated data, compute maximum likelihood estimates of the
parameters (θ0 , θ1 , θ2 ) of the logit and probit specifications, where Xθ is given by X′θ =
θ0 + θ1 X + θ2 X².
(b) Can one estimate θ consistently by nonlinear least squares (NLLS) applied to the regression

Y = Ψ(Xθ) + η                                            (6.103)

instead of doing maximum likelihood? If so, provide a proof of the consistency of the NLLS
estimator. If not, provide a counterexample showing that the NLLS estimator is inconsistent.
(c) Estimate both the probit and logit specifications by nonlinear least squares as suggested
in part (b). How do the parameter estimates and standard errors compare to the maximum
likelihood estimates computed in part (a)?
(d) Is there any problem of heteroscedasticity in the nonlinear regression formulation of the
problem in part (c)? If so, derive the form of the heteroscedasticity and, using the estimated
“first stage” parameters from part (b) above, compute second stage “feasible generalized least
squares” (FGLS) estimates of θ. More specifically, the FGLS estimator is defined to be
θ̂FGLS = arg min_{θ∈Θ} Σ_{i=1}^{N} [Yi − Ψ(Xi θ)]² / { [1 − Ψ(Xi θ̂)] Ψ(Xi θ̂) }   (6.104)
6.6 PROBLEM SET
(e) Are the FGLS estimates of θ consistent and asymptotically normally distributed (assuming
the model is correctly specified)? If so, derive the asymptotic distribution of the FGLS
estimator, and if not, provide a counterexample showing that the FGLS estimator is inconsistent
or not asymptotically normally distributed. If you conclude that the FGLS estimator is
asymptotically normally distributed, is it as efficient as the maximum likelihood estimator of
θ? Explain your reasoning for full credit.

and claims that θ̂A is asymptotically equivalent to θ̂FGLS . Do you agree? Explain. (What is
the limit of θ̂A ?)
(g) Under some conditions, the FGLS estimator is asymptotically equivalent to θ̃FGLS , which
satisfies the first-order condition

Σ_{i=1}^{N} [Yi − Φ(Xi θ̃FGLS )] φ(Xi θ̃FGLS ) Xi / { [1 − Φ(Xi θ̂)] Φ(Xi θ̂) } = 0.   (6.106)

and claims that θ̂B is asymptotically equivalent to θ̂FGLS . Do you agree? Explain.
2. Download the data employ.xls from the class webpage. The data file contains em-
ployment information for 1881 young men over the years 1981-1987. Restrict your attention
to black men.
(a) Use pooled probit to estimate the model
What assumption is needed to ensure that the usual standard errors and test statistics from
pooled probit are asymptotically valid? Can you compute standard errors that are robust
to the serial correlation?
(c) Add a full set of year dummies to the analysis in part (a), and estimate the probabilities
in part (b) for 1987. Are there important differences from the estimates in part (b)?

where ai ∼ iid N (0, σa²) across i, eit ∼ iid N (0, 1) across i and t, and {ai }_{i=1}^{N} is independent
of {eit , i = 1, ..., N ; t = 1, ..., Ti }.
Compute the average partial effect for year 1987 by averaging the partial effect over employ1981 .
Chapter 7
So far we have talked about 0/1 decisions. What if there are more response categories?
An important distinction is between ordered categorical data, where the response categories
possess a natural ordering (e.g., low income, mid-level income, or high income; bond rating
A, B, C, ...), and unordered categorical data, where the response categories are mere labels
totally devoid of structure (e.g., traveling by bus, by train, or by car). Different models are
used in the two cases.

where xij is the vector of values of the attributes of the j-th choice as perceived by the i-th individual,
and aij is a (0, σ²) random variable which is used to capture factors unobservable to the
researcher. Each individual chooses the option that maximizes his utility. Let yi denote the
choice of individual i; then

yi = arg max_j { y∗i0 , y∗i1 , ..., y∗iJ }.   (7.2)
As an example, consider a person who can take a car, a bus or a subway to work. The
researcher observes the time and cost that the person would incur under each mode. However,
the researcher realizes that there are factors other than time and cost that a§ect the person’s
utility and hence his choice. The researcher specifies
y∗ic = Tic β1 + Mic β2 + aic    (7.3)
y∗ib = Tib β1 + Mib β2 + aib    (7.4)
y∗is = Tis β1 + Mis β2 + ais    (7.5)
where Tic and Mic are the time and cost (in money) that the person incurs traveling to work
by car, Tib and Mib , Tis and Mis are defined analogously for bus and subway.
7.1 PROBABILISTIC CHOICE MODEL FOR UNORDERED RESPONSE
The probability that the person chooses bus instead of car and subway is the probability
that
β 1 Tib + β 2 Mib + aib > β 1 Tic + β 2 Mic + aic
and
β 1 Tib + β 2 Mib + aib > β 1 Tis + β 2 Mis + ais
Remark 43 Can we include a constant in the utility specification, so that

y∗ij = α + xij β + aij , j = 0, 1, 2, ..., J?   (7.6)

The answer is no. The presence of α changes all the utilities (y∗i0 , y∗i1 , ..., y∗iJ ) by the same
amount. The ranking of the utilities, and thus the individual’s choice, does not depend on α.
Therefore, the intercept α is not identified.

Remark 44 Can we include an alternative-specific constant in the utility specification, so that

y∗ij = αj + xij β + aij , j = 0, 1, 2, ..., J?   (7.7)

The answer is yes, but we cannot identify all the αj ’s and have to normalize one α, say α0 , to
be zero. The alternative with zero α is called the base category.

Remark 45 Since

arg max_j { y∗ij } = arg max_j { y∗ij − y∗i0 }   (7.8)
                 = arg max_j { (xij − xi0 )β + aij − ai0 },   (7.9)

we cannot include a variable in xij if it is constant across the different alternatives. For example,
we cannot include age in y∗ij by simply letting

y∗ic = Tic β1 + Mic β2 + Agei β3 + aic    (7.10)
y∗ib = Tib β1 + Mib β2 + Agei β3 + aib    (7.11)
y∗is = Tis β1 + Mis β2 + Agei β3 + ais    (7.12)

because in this specification age does not affect one’s decision. If we believe that age actually
plays a role, then we have to allow the coefficient associated with age to change with the
alternative. In general, we can assume

y∗ij = xij β + zi γj + aij , j = 0, 1, 2, ..., J,   (7.13)

where zi is the vector of individual-specific variables. In the above specification, we cannot
identify all the γ’s and need to normalize one γ, say γ0 , to be zero. It is recommended to
normalize the γ for the base category to zero.

Remark 46 A variant of the utility specification is to allow β to depend on individual-specific
characteristics. For example,

βk^{(i)} = βk + wi θk + σk uk^{(i)} ,

where βk^{(i)} can be regarded as individual i’s marginal utility for the k-th covariate. This specification
is used widely in the empirical IO literature, which considers aggregation over consumers’
choices to estimate demand parameters using market-level data. We will not discuss
this extension in this class.
7.2 CONDITIONAL AND MULTINOMIAL LOGIT MODELS
[Figure: solid line, PDF of the Type I extreme value distribution; dotted line, PDF of the
normal distribution with the same mean and variance.]
In this case, we can show that

P (yi = j|xi , zi ) = exp(vij ) / Σ_{h=0}^{J} exp(vih ),   where vij = xij β + zi γj .   (7.16)

So P (yi = j|xi , zi ) is

The above probabilities constitute what is usually called the conditional logit model. See
McFadden (1973).

P (yi = j|zi ) = exp(zi γj ) / Σ_{h=0}^{J} exp(zi γh ),   j = 0, 1, ..., J.   (7.22)
The above probabilities constitute what is usually called the multinomial logit model.
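Both models reduce to a softmax over the deterministic utilities vij. The sketch below computes the choice probabilities and checks them against the utility-maximization definition by drawing Type I extreme value (Gumbel) shocks (utility values here are illustrative):

```python
import numpy as np

def choice_probs(V):
    """Conditional/multinomial logit probabilities: a softmax over the
    deterministic utilities V (n individuals x (J+1) alternatives)."""
    V = V - V.max(axis=1, keepdims=True)      # stabilize the exponentials
    eV = np.exp(V)
    return eV / eV.sum(axis=1, keepdims=True)

# Check against the random-utility definition: with Type I extreme value
# (Gumbel) taste shocks, choice frequencies should match the logit formula.
rng = np.random.default_rng(0)
V = np.array([[0.5, 0.0, -0.3]])              # illustrative utilities
p = choice_probs(V)[0]
draws = V + rng.gumbel(size=(200_000, 3))
freq = np.bincount(draws.argmax(axis=1), minlength=3) / 200_000
```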
Remark 50 The difference between the conditional logit and multinomial logit models:

• In the MNL model, the conditioning variables do not change across alternatives: for each
i, zi contains variables specific to the individual but not to the alternatives. The model
is appropriate for problems where characteristics of the alternatives are not important.
Multinomial logit models help us answer the question “how do individuals’ characteristics
affect their choice probabilities?” The multinomial logit is a generalization of the binary logit.
What if instead we are interested in how the characteristics of the categories affect individuals’
likelihood of being in them?

• The CL model is intended specifically for problems where the individual choices are at least
partly made based on the observable attributes of each alternative. Conditional logit models
help us answer the question “how do the characteristics of the categories affect individuals’
choice probabilities?”
Then

γj = D0j γ0 + D1j γ1 + ... + DJj γJ

and

zi γj = (zi × D0j )γ0 + (zi × D1j )γ1 + ... + (zi × DJj )γJ .

Let xij = (zi × D0j , zi × D1j , ..., zi × DJj ) and γ = (γ′0 , γ′1 , ..., γ′J )′; then

P (yi = j|zi ) = exp(zi γj ) / Σ_{h=0}^{J} exp(zi γh ) = exp(xij γ) / Σ_{h=0}^{J} exp(xih γ) := P (yi = j|xi ).

Therefore the CL model formally contains the MNL model as a special case, so without loss of
generality we can focus on the CL model hereafter.
= Σ_{i=1}^{n} Σ_{j=0}^{J} 1 {yi = j} ( ∂vij /∂β − Σ_{k=0}^{J} pik ∂vik /∂β ).

When ∂vij /∂β = x′ij , we have

Sn (β) = Σ_{i=1}^{n} Σ_{j=0}^{J} 1 {yi = j} (xij − x̄i )′ = 0,

where x̄i = Σ_{k=0}^{J} pik xik is the weighted average of the xik ’s. The FOC can be rewritten as

Σ_{j=0}^{J} Σ_{i=1}^{n} 1 {yi = j} xij = Σ_{j=0}^{J} Σ_{i=1}^{n} pij xij .

If an alternative-specific constant is included in vij , then one of the elements of xij is Dhj , the
dummy for the h-th choice. In this case,

Σ_{j=0}^{J} Σ_{i=1}^{n} 1 {yi = j} Dhj = Σ_{j=0}^{J} Σ_{i=1}^{n} pij Dhj ;

that is,

(1/n) Σ_{i=1}^{n} 1 {yi = h} = (1/n) Σ_{i=1}^{n} pih .
where

yij := 1 {yi = j} .

So the MLE is a moment-based estimator with the moment conditions

E ( Σ_{j=0}^{J} [yij − pij ] x′ij ) = 0.   (7.24)

In fact, we know that E(yij − pij |xi ) = 0, which implies that E(yij − pij )x′ij = 0 for all
j = 0, ..., J. This is a set of overidentifying moment conditions. The moment conditions in
(7.24) can be regarded as the optimal linear combination of E(yij − pij )x′ij = 0 for j = 0, ..., J.
The variance of the score conditional on {xi }_{i=1}^{n} is

Var(Sn (β)|{xi }_{i=1}^{n} ) = Σ_{i=1}^{n} Σ_{j=0}^{J} Σ_{k=0}^{J} E(yij − pij )(yik − pik ) (xij − x̄i )′(xik − x̄i ).

By definition,

E(yij − pij )(yik − pik ) = −pij pik + pij 1 {j = k} .¹

It follows that

Var(Sn (β)|{xi }_{i=1}^{n} ) = Σ_{i=1}^{n} Σ_{j=0}^{J} pij (xij − x̄i )′(xij − x̄i ).

This is because

Σ_{j=0}^{J} pij (xij − x̄i )′ = 0 for any i.

If the model is correctly specified, Var(Sn (β)|{xi }_{i=1}^{n} ) is the negative expected Hessian
matrix. Using the general MLE theory, we deduce that the distribution of β̂ − β can be
approximated by

N ( 0, [ Σ_{i=1}^{n} Σ_{j=0}^{J} p̂ij (xij − x̄i )′(xij − x̄i ) ]⁻¹ ).

If we use a version of the BHHH algorithm to obtain the MLE, the iterative step is given by

β^{(k+1)} = β^{(k)} + [ Σ_{i=1}^{n} Σ_{j=0}^{J} pij^{(k)} (xij − x̄i^{(k)} )′(xij − x̄i^{(k)} ) ]⁻¹ Σ_{i=1}^{n} Σ_{j=0}^{J} (xij − x̄i^{(k)} )′(yij − pij^{(k)} ),

where

pij^{(k)} = pij (β^{(k)})   and   x̄i^{(k)} = Σ_{j=0}^{J} pij^{(k)} xij .
¹ For those who are not familiar with the multinomial distribution, we can derive the equation from first
principles. Note that for k ≠ j,

(yij , yik ) = (1, 1) with probability zero,
              (0, 0) with probability 1 − pik − pij ,
              (1, 0) with probability pij ,
              (0, 1) with probability pik .

So

E(yij − pij )(yik − pik ) = pij pik (1 − pik − pij ) − (1 − pij )pik pij − pij (1 − pik )pik = −pij pik ,

as desired.
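The BHHH-type iteration above can be implemented directly; the sketch below fits a conditional logit on simulated data (hypothetical design), using the information matrix Σᵢ Σⱼ pij (xij − x̄i)′(xij − x̄i) as the weighting matrix:

```python
import numpy as np

def fit_clogit(X, y, tol=1e-10, max_iter=100):
    """Conditional logit MLE via the scoring iteration above.
    X: (n, J1, k) alternative attributes x_ij; y: (n,) chosen alternatives."""
    n, J1, k = X.shape
    beta = np.zeros(k)
    Y = np.eye(J1)[y]                              # indicators y_ij = 1{y_i = j}
    for _ in range(max_iter):
        V = X @ beta                               # utilities v_ij
        P = np.exp(V - V.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)          # probabilities p_ij
        xbar = np.einsum("ij,ijk->ik", P, X)       # weighted averages x-bar_i
        D = X - xbar[:, None, :]                   # deviations x_ij - x-bar_i
        score = np.einsum("ij,ijk->k", Y - P, D)
        info = np.einsum("ij,ijk,ijl->kl", P, D, D)
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulated check: utilities x_ij'beta with Gumbel taste shocks.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3, 2))
beta_true = np.array([1.0, -0.5])
y = (X @ beta_true + rng.gumbel(size=(5000, 3))).argmax(axis=1)
beta_hat = fit_clogit(X, y)
```

Because the CL log-likelihood is globally concave in β, this Newton-type update typically converges in a handful of iterations.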
Since the person is indifferent between car and bus, it would be reasonable to assume that

Pr(C|C, B, R)/Pr(B|C, B, R) = Pr(C|C, B)/Pr(B|C, B) = 1   (7.28)

and

Pr(C|C, B, R)/Pr(R|C, B, R) = Pr(C|C, R)/Pr(R|C, R) = 1,  (7.29)

so

Pr(C|C, B, R) = Pr(B|C, B, R) = Pr(R|C, B, R) = 1/3,      (7.30)

which is less than 0.5.
The IIA problem arises because we assume that the aij are independent across j. If two alternatives
are close substitutes, we expect their random utilities to be correlated. We can
test whether some alternatives are potentially correlated by using a typical Hausman test
(Hausman and McFadden, 1984). Under the null H0 of IIA, one can estimate all β parameters
consistently and efficiently using the full data set. Denote this estimator by β̂f . A useful way
to obtain a robust estimator is to drop some potentially correlated alternatives. For example,
suppose we suspect that alternatives J − 1 and J are correlated with each other and with
the rest of the alternatives. We can drop these two alternatives and all the individuals
who choose them, leading to a restricted data set with only J − 1 alternatives
(j now runs from j = 0 to j = J − 2) and fewer than n observations. Based on the restricted
choice set and data set, we can still estimate all of the parameters (if the model is a CL model).
The choice probabilities with only J − 1 alternatives (instead of J + 1 alternatives) are

P (yi = j|xi , zi ) = exp(vij ) / Σ_{h=0}^{J−2} exp(vih )   for j = 0, ..., J − 2.

Under IIA, the estimator β̂r from the restricted choice set is consistent. When IIA is
violated, β̂r is still consistent but β̂f is not. This provides the usual basis for a Hausman test.
In the absence of some natural grouping of the alternatives, the choice of the subset to leave
out is arbitrary and, hence, so is the test.

7.3 MULTINOMIAL PROBIT MODEL
Note: the original Σ has 6 free elements but Σ̃ can have only three free elements. In
addition, not all three free elements of Σ̃ can be identified. We need to impose one restriction
on Σ̃. One way to achieve identification is to set ai0 = 0 so that σ 00 = σ 01 = σ 02 = 0 and
normalize σ 11 = 1.
Alternatively, the independence assumption of CL can be relaxed using the generalized
extreme value (GEV) models. The GEV distribution generalizes the independent univariate
extreme value cdfs to allow for the correlation of ai across choices:
    F(a_{i0}, a_{i1}, a_{i2}, \ldots, a_{iJ}) = \exp[-G(\exp(-a_{i0}), \ldots, \exp(-a_{iJ}))]   (7.34)
for some function G. The GEV approach has been widely used in the context of the nested
logit model. See Train (2003).
Example 52 Choice of house: choose the neighborhood and select a specific house within a
chosen neighborhood. Choose to travel by plane, then choose among the airlines.
In the presence of a nested structure, we assume that the utility from house j in neighborhood k looks as follows:

    V_{kj} = x_{kj}\beta + z_k\alpha + a_{kj},   (7.35)
where zk are characteristics of neighborhoods and xkj are house-specific characteristics. To
facilitate estimation when the number of choices is very large but the decision problem has a
tree structure, we use p_{kj} = p_k p_{j|k}, where it turns out that p_{j|k} involves only β but not α. Under
the assumption that akj has iid type I extreme value distribution, we have
    p_{kj} = \frac{\exp(x_{kj}\beta + z_k\alpha)}{\sum_{m=1}^{C}\sum_{h=1}^{N_m}\exp(x_{mh}\beta + z_m\alpha)},   (7.39)

which is obvious if we think of each individual as having \sum_{m=1}^{C} N_m options.
One can therefore first estimate β off the choice within neighborhoods (based on p(j|k)) and then use the β̂ to impute the inclusive values Î_k and estimate α by maximizing a likelihood consisting of p_k. This sequential estimation provides consistent estimators and can be applied in all problems in which the number of choices is very large but the decision process has a tree structure.
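The two levels of this sequential scheme can be sketched numerically. The attribute values, the coefficients, and the normalization λ_k = 1 below are all hypothetical; the point is that p(j|k) uses only β, while α enters only through the upper level together with the inclusive value I_k.

```python
import math

# Hypothetical scalar house attributes x_kj for two neighborhoods and
# hypothetical coefficients.
beta, alpha = 1.0, 0.8
x = {0: [0.2, 0.5], 1: [0.1, 0.4, 0.3]}   # neighborhood k -> house attributes
z = {0: 1.0, 1: 0.5}                      # neighborhood attributes z_k

def within_probs_and_iv(xk):
    """p(j|k) and the inclusive value I_k = log sum_j exp(x_kj * beta)."""
    ev = [math.exp(v * beta) for v in xk]
    s = sum(ev)
    return [e / s for e in ev], math.log(s)

p_jk, I = {}, {}
for k in x:
    p_jk[k], I[k] = within_probs_and_iv(x[k])

# Upper level (with lambda_k = 1): p_k proportional to exp(z_k*alpha + I_k).
num = {k: math.exp(z[k] * alpha + I[k]) for k in x}
tot = sum(num.values())
p_k = {k: v / tot for k, v in num.items()}

# The product p_k * p(j|k) recovers the joint probability p_kj of (7.39).
p_00 = p_k[0] * p_jk[0][0]
```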
7.4 NESTED LOGIT MODEL 152
The extension of this model to cases involving several branches of a tree is obvious. See Maddala (Ch. 3).

Like the multinomial/conditional logit model, the above nested logit model suffers from the IIA property. When two alternatives are in the same nest, the IIA property holds, since the ratio of their probabilities is independent of the existence of other alternatives:

    \frac{p(j_1|k)}{p(j_2|k)} = \frac{\exp(x_{kj_1}\beta)}{\exp(x_{kj_2}\beta)}.

However, if two alternatives are placed in different nests, then the IIA property no longer holds. One way to avoid the IIA problem within the same nest is to assume a variance-component structure for the random utility:
    a_{kj} = \epsilon_k + \lambda_k \epsilon_{kj}   (7.40)

for some λ_k ∈ [0, 1], where ϵ_{kj} ∼ type I extreme value and ϵ_k ∼ C(λ_k), where C(λ) is a distribution defined below.

Definition 53 The C(λ) distribution is defined to be the unique distribution such that if v and e are independent, v ∼ C(λ), and e ∼ type I extreme value, then v + λe ∼ type I extreme value. See Cardell (1997).
With the social surplus function, we can place a dollar value on the effect of changing one or more of the determinants of choices, such as the price and time cost of travel in transportation mode choice.
We now come back to the nested logit model. Note that if we choose the k-th neighborhood, the utility derived from the houses in this neighborhood is

    \max_j V_{kj} = z_k\alpha + \lambda_k I_k + \eta_k, \quad \text{where } I_k = \ln\sum_{h=1}^{N_k}\exp(x_{kh}\beta/\lambda_k),

and where

    \eta_k = \epsilon_k + \lambda_k\left[\max_j\left(\frac{x_{kj}\beta}{\lambda_k} + \epsilon_{kj}\right) - \ln\sum_{h=1}^{N_k}\exp(x_{kh}\beta/\lambda_k)\right]

is a type I extreme value random variable and η_k is independent of I_k and z_k. It now follows from (7.44) that equation (7.41) holds.
Let y be an ordered response taking on the values {0, 1, 2, ..., J} for some known integer J. The ordered probit model can be derived from a latent variable model. Assume that a latent variable y^* is defined by

    y^* = x\beta + e, \quad e|x \sim N(0, 1).   (7.45)
Let c_1 < c_2 < \ldots < c_J be unknown cut points and define

    y = \begin{cases} 0, & \text{if } y^* \le c_1 \\ 1, & \text{if } c_1 < y^* \le c_2 \\ \vdots & \\ J, & \text{if } y^* > c_J \end{cases}   (7.46)
Given the standard normal assumption, we can compute each response probability:
    P(y = 0|x) = P(y^* < c_1) = P(x\beta + e < c_1) = \Phi(c_1 - x\beta)
Note that P(y ≤ j|x) = \Phi(c_{j+1} - x\beta) for j = 0, ..., J − 1, which is a system of J probit equations. Optimal GMM in this system should be asymptotically equivalent to the ordered probit MLE.
Other distribution functions can be used in place of Φ. Replacing Φ with the logit function Λ gives the ordered logit model.

The focus of interest is ∂P(y = 1|x)/∂x_j and the c_j's. Interpreting the coefficients based on their sign alone is not straightforward in the ordered response model. See the textbook by Wooldridge (2010).
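The response probabilities implied by the cut points are straightforward to compute. A minimal sketch, with a hypothetical index value and hypothetical cut points:

```python
import math

def Phi(t):
    # Standard normal cdf via the error function.
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def ordered_probit_probs(xb, cuts):
    """P(y = j|x) for j = 0,...,J implied by P(y <= j|x) = Phi(c_{j+1} - x*beta)."""
    cdf = [Phi(c - xb) for c in cuts]            # P(y <= j|x) for j = 0,...,J-1
    probs = [cdf[0]]
    probs += [cdf[j] - cdf[j - 1] for j in range(1, len(cuts))]
    probs.append(1.0 - cdf[-1])                  # P(y = J|x)
    return probs

# Hypothetical index x*beta = 0.3 and cut points c_1 < c_2 < c_3.
p = ordered_probit_probs(0.3, [-1.0, 0.5, 1.5])
```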
7.6 Problems
Download the file ca_heating.xls from the course website. The file contains data on the choice of heating system in California houses. The observations consist of single-family houses in California that were newly built and had central air-conditioning. The choice is among heating systems. Five types of systems are considered to have been possible:
(1) gas central,
(2) gas room,
(3) electric central,
(4) electric room,
(5) heat pump.
There are 900 observations with the following variables:
idcase gives the observation number (1-900)
depvar identifies the chosen alternative (1-5)
ic1 is the installation cost for a gas central system
ic2 is the installation cost for a gas room system
ic3 is the installation cost for an electric central system
ic4 is the installation cost for an electric room system
ic5 is the installation cost for a heat pump
oc1 is the annual operating cost for a gas central system
oc2 is the annual operating cost for a gas room system
oc3 is the annual operating cost for an electric central system
oc4 is the annual operating cost for an electric room system
oc5 is the annual operating cost for a heat pump
income is the annual income of the household
agehed is the age of the household head
rooms is the number of rooms in the house
(a) Estimate a conditional logit model with representative utility

    v_{ij} = \beta_1 x_{ij}^{(1)} + \beta_2 x_{ij}^{(2)} \quad \text{for } j = 1, 2, 3, 4, 5,

where x_{ij}^{(1)} is the installation cost of the j-th heating system for household i and x_{ij}^{(2)} is the annual operating cost of the j-th heating system for household i. Do the estimated coefficients (β̂_1, β̂_2) have the expected signs? Are both coefficients significantly different from zero?
(b) How closely do the average probabilities match the shares of customers choosing each
alternative?
(c) The ratio of coefficients usually provides economically meaningful information. The willingness to pay (wtp) through higher installation cost for a one-dollar reduction in operating costs is the ratio of the operating cost coefficient to the installation cost coefficient. What is the estimated wtp from this model? Is it reasonable in magnitude?
(d) We can use the estimated wtp to obtain an estimate of the discount rate that is implied
by the model of choice of operating system. The present value of the future operating costs is
the discounted sum of operating costs over the life of the system:
    PV = \sum_{t=1}^{L} \frac{OC}{(1+r)^t},   (7.49)
where r is the discount rate and L is the life of the system. As L rises, the PV approaches (1/r)OC. Therefore, for a system with a sufficiently long life (which we will assume these systems have), a one-dollar reduction in OC reduces the present value of future operating costs by (1/r). This means that if the person choosing the system were incurring the installation costs and the operating costs over the life of the system, and rationally traded off the two at a discount rate of r, the decision maker's wtp for operating cost reductions would be (1/r). Given this, what value of r is implied by the estimated wtp that you calculated in part (c)? Is this reasonable?
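The limit PV → (1/r)OC and the implied discount rate r = 1/wtp can be checked numerically. The OC, r, and wtp values below are hypothetical, not the estimates the problem asks for.

```python
# Finite-life present value of operating costs versus its infinite-life limit.
OC, r, L = 100.0, 0.12, 200
pv = sum(OC / (1 + r) ** t for t in range(1, L + 1))
limit = OC / r   # PV -> (1/r) * OC as L -> infinity

# If the estimated willingness to pay for a one-dollar reduction in OC is
# wtp dollars of installation cost, the implied discount rate is 1/wtp.
wtp = 4.0        # hypothetical value
r_implied = 1.0 / wtp
```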
(f) Calculate the wtp and discount rate r that is implied by the estimates. Are these
reasonable?
(g) If you want to include income in the model, what does P(y_i = j|x) look like? Are all the parameters identifiable? Explain. (Note: no need to estimate this model.)
(h) The California Energy Commission (CEC) is considering whether to offer rebates on heat pumps. The CEC wants to predict the effect of the rebates on the heating system choices of customers in California. The rebates will be set at 10% of the installation cost. The new installation cost for heat pumps will therefore be nic5 = 0.90*ic5. Using the estimated coefficients from the model in part (e), calculate new probabilities and predicted shares using nic5 instead of ic5. How much do the rebates raise the share of houses with heat pumps?
(i) Suppose a new technology is developed that provides more efficient central heating. The new technology costs $200 more than the central electric system that we have specified as our alternative 3. However, it saves 25% of the electricity, such that its operating costs are 75% of the operating costs of our alternative 3. We want to predict the potential market penetration of this technology. Note that there are now six alternatives: the original five alternatives plus this new one. Calculate the probability and predict the market share (i.e., the average probability) for all six alternatives, using the model that is estimated on the original five alternatives. (Note: (i) Be sure to use the original installation cost for heat pumps, rather than the reduced cost in part (h). (ii) For the new technology, assume all else is the same as alternative 3.) What is the predicted market share for the new technology? From which of the original five systems does the new technology draw the most customers?
Bibliography
[1] Cardell, N. Scott (1997). "Variance Components Structures for the Extreme-Value and Logistic Distributions with Application to Models of Heterogeneity." Econometric Theory, 13(2): 185-213.
[2] Hausman, Jerry and Daniel McFadden (1984). "Specification Tests for the Multinomial Logit Model." Econometrica, 52: 1219-1240.
[3] Train, Kenneth (2003). Discrete Choice Methods with Simulation. Cambridge University Press.
[4] Maddala, G.S. (1983). Limited-Dependent and Qualitative Variables in Econometrics. Cambridge University Press.
Chapter 8
However, we can only sample people with a net wealth greater than $10,000, so the sample is selected on the basis of wealth (people with net wealth less than $10,000 may be hard to reach). We only observe the Y_i^* satisfying Y_i^* > c, where c is a known constant, i.e.,

    (Y_i, X_i) = \begin{cases} (Y_i^*, X_i), & \text{if } Y_i^* > c \\ \text{no observation}, & \text{if } Y_i^* \le c \end{cases}   (8.3)

Here we consider only truncation from below. The extension to truncation from above is straightforward.
8.1 TRUNCATED REGRESSION MODEL 160
[Figure: E(Z|Z > c) plotted against the truncation point c.]
So

    \mathrm{var}(Z|Z > c) = \frac{c\phi(c)}{1 - \Phi(c)} + 1 - \left[\frac{\phi(c)}{1 - \Phi(c)}\right]^2
      = 1 - \frac{\phi(c)}{1 - \Phi(c)}\left[\frac{\phi(c)}{1 - \Phi(c)} - c\right]
      = 1 - \lambda(c)[\lambda(c) - c] \le 1.
Figure 8.2 graphs the mean and variance of truncated standard normal distributions
against the truncation point c.
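These truncated-normal moment formulas are easy to verify by simulation. A minimal check of E(Z|Z > c) = λ(c) and var(Z|Z > c) = 1 − λ(c)[λ(c) − c] at a hypothetical truncation point:

```python
import math, random

def phi(t): return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
def Phi(t): return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
def lam(c): return phi(c) / (1.0 - Phi(c))   # inverse Mills ratio

c = 0.5
mean_formula = lam(c)                        # E(Z | Z > c)
var_formula = 1.0 - lam(c) * (lam(c) - c)    # var(Z | Z > c)

# Monte Carlo check using standard normal draws truncated at c.
random.seed(0)
draws = [t for t in (random.gauss(0.0, 1.0) for _ in range(400000)) if t > c]
mc_mean = sum(draws) / len(draws)
mc_var = sum((t - mc_mean) ** 2 for t in draws) / len(draws)
```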
Therefore, if u ∼ N(0, σ²), then

    E(u^2 | u > c) = \sigma^2 E\!\left(\frac{u^2}{\sigma^2} \,\Big|\, \frac{u}{\sigma} > \frac{c}{\sigma}\right) = \sigma^2\left[1 + \frac{c}{\sigma}\lambda\!\left(\frac{c}{\sigma}\right)\right],

and

    \mathrm{var}(u | u > c) = \sigma^2\,\mathrm{var}\!\left(\frac{u}{\sigma} \,\Big|\, \frac{u}{\sigma} > \frac{c}{\sigma}\right) = \sigma^2\left\{1 - \lambda\!\left(\frac{c}{\sigma}\right)\left[\lambda\!\left(\frac{c}{\sigma}\right) - \frac{c}{\sigma}\right]\right\}.   (8.10)
    E[(Y_i - X_i\beta)^2 | X_i, Y_i > c] = E(\varepsilon_i^2 | \varepsilon_i > c - X_i\beta, X_i) = \sigma^2\left[1 + \frac{c - X_i\beta}{\sigma}\,\lambda\!\left(\frac{c - X_i\beta}{\sigma}\right)\right]   (8.12)

and

    \mathrm{var}(Y_i | X_i, Y_i > c) = \sigma^2\left\{1 - \lambda\!\left(\frac{c - X_i\beta}{\sigma}\right)\left[\lambda\!\left(\frac{c - X_i\beta}{\sigma}\right) - \frac{c - X_i\beta}{\sigma}\right]\right\} \le \sigma^2,
where the inequality follows from 1 − λ(c)[λ(c) − c] = var(Z|Z > c) ∈ [0, 1]. The above calculation is indicative only. The OLS estimator may not converge to ∂E(Y_i|X_i, Y_i > c)/∂X_i (see a question in PS1).
In passing, we note that the sample selection bias does not arise if selection is based on regressors, not on the dependent variable. Note that the unbiasedness of the OLS estimator relies on

    E(\varepsilon_i | X_i, S_i = 1) = 0,

where S_i = 1 indicates that the observation is sampled, i.e., S_i = 1\{Y_i > C\} or S_i = 1\{X_i > C\}. When S_i = 1\{X_i > C\}, we have

    E(\varepsilon_i | X_i, S_i = 1) = E(\varepsilon_i | X_i) = 0,

provided that E(\varepsilon_i | X_i) = 0 in the population.
    \sum_{i=1}^{n}\left\{\frac{(Y_i - X_i\beta)^2}{\sigma^2} - \left[1 + \frac{c - X_i\beta}{\sigma}\,\lambda\!\left(\frac{c - X_i\beta}{\sigma}\right)\right]\right\} = 0.

The conditions that E \frac{\partial}{\partial\theta}\log L(\theta) = 0 are compatible with (8.11) and (8.12). The usual asymptotics for the MLE applies.
Stata command for truncation from below: truncreg y x, ll(#). See http://www.stata.com/manuals13/rtruncreg.pdf
Figure 8.4 presents finite sample distributions of the OLS estimator and MLE for a certain
DGP. For details of the simulation, see the Stata program.
Figure 8.4: Finite sample distribution of the OLS and MLE in the presence of truncation.
clear
capture postclose tempid
postfile tempid beta_hat_ols beta_hat_mle using mydata_truncated.dta, replace
set seed 1
forvalues i = 1(1)1000 {
    drop _all
    quietly set obs 200
    gen e = rnormal()
    gen x = rnormal()^2
    gen y = x + e
    qui drop if y < -1 /* truncation from below at -1 */
    qui reg y x
    scalar beta_hat_ols = _b[x]
    qui truncreg y x, ll(-1) /* truncated-regression MLE */
    scalar beta_hat_mle = _b[x]
    post tempid (beta_hat_ols) (beta_hat_mle)
}
postclose tempid
8.2 Tobit and Censored Regressions

The Tobit model (Tobin, 1958) specifies

    Y^* = X\beta + u, \quad u|X \sim N(0, \sigma^2), \quad Y = \max(0, Y^*),

where Y^*, Y, and u ∈ R. For example, Y^* is the willingness to donate and Y is the donation. We have an iid sample {X_i, Y_i} from this population model. Here Y_i has a limited support in the population.
A closely related model is the censored regression model under which the population model
is
    Y^* = X\beta + u, \quad u|X \sim N(0, \sigma^2).   (8.18)
We have an iid sample from this population but we observe only {X_i, Y_i}, where as before

    Y_i = \begin{cases} Y_i^*, & \text{if } Y_i^* > 0 \\ \text{missing}, & \text{if } Y_i^* \le 0 \end{cases}   (8.19)

Here Y_i^* is the variable of interest and its support is not limited. However, Y_i has a limited support because the data on Y_i^* are missing when Y_i^* ≤ 0. Sometimes we fill in the missing value by zero, leading to

    Y_i = \begin{cases} Y_i^*, & \text{if } Y_i^* > 0 \\ 0, & \text{if } Y_i^* \le 0 \end{cases}   (8.20)
However, it may not make much sense in some applications to impute the missing value to be zero. We will be clear in the following about when the result relies on the specification that Y_i = 0 when Y_i^* ≤ 0.

The difference between the truncated regression model and the censored regression model is that in the former case, X_i is not observable when Y_i^* ≤ 0, while in the latter case, X_i is always observable regardless of whether Y_i^* > 0 or not.

Statistically, the Tobit model and the censored regression model (with imputed zeros) are the same. However, their interpretations are different. In the censored regression model, we are interested in the marginal effect of X_i on Y_i^* (not Y_i). The marginal effect is β. In the Tobit model, we are interested in the marginal effect of X_i on Y_i (not Y_i^*). This effect is not equal to β, as we demonstrate below.
In both Tobit and censored regressions, the threshold value may be different from zero. More generally,

    Y_i^* = X_i\beta + u_i, \quad u_i|X \sim N(0, \sigma^2),   (8.21)

and we have data on X_i and Y_i where

    Y_i = \begin{cases} Y_i^*, & \text{if } Y_i^* > c \\ c, & \text{if } Y_i^* \le c \end{cases}   (8.22)
Many researchers mistakenly think these two models are equivalent. A model equivalent to (8.23) is

    Y_i - c = \max(0, X_i\beta - c + u_i),

which is now in the same form as (8.24) once the dependent variable is transformed to Y_i − c. If c is not known, then we can estimate it by ĉ = min(Y_i). Carson and Sun (2007) show that ĉ converges to c at the rate of n^{-1}, which is faster than the parametric rate of 1/\sqrt{n}, so the estimation uncertainty can be ignored in making inference on β. We will use (8.24) in the subsequent sections but keep in mind that when c is not zero, we have to transform the dependent variable first.
and

    \Pr(Y > 0|X) = \Pr(X\beta + u > 0) = \Phi(X\beta/\sigma).   (8.26)

Hence, in order to find E(Y|X), we only need to compute E(Y|X, Y > 0):

    E(Y|X, Y > 0) = X\beta + E(u | u > -X\beta) = X\beta + \sigma\lambda\!\left(-\frac{X\beta}{\sigma}\right) = X\beta + \sigma\,\frac{\phi(X\beta/\sigma)}{\Phi(X\beta/\sigma)}.

As a consequence,

    E(Y|X) = \Phi\!\left(\frac{X\beta}{\sigma}\right)X\beta + \sigma\phi\!\left(\frac{X\beta}{\sigma}\right)   (8.29)
and

    \frac{\partial E(Y|X)}{\partial X_j} = \Phi\!\left(\frac{X\beta}{\sigma}\right)\beta_j + \phi\!\left(\frac{X\beta}{\sigma}\right)\frac{\beta_j}{\sigma}X\beta - X\beta\,\phi\!\left(\frac{X\beta}{\sigma}\right)\frac{\beta_j}{\sigma}
      = \Phi\!\left(\frac{X\beta}{\sigma}\right)\beta_j = \Pr(Y > 0|X)\,\beta_j.   (8.30)

Hence

    \left|\frac{\partial E(Y|X)}{\partial X_j}\right| \le |\beta_j|.
On the selected sample with Y_i > 0, we can write

    Y_i = X_i\beta + \sigma\lambda\!\left(-\frac{X_i\beta}{\sigma}\right) + e_i

with

    E(e_i | X_i, Y_i > 0) = 0.   (8.32)

This implies that if we run the OLS of Y_i on X_i using the sample for which Y_i > 0, we effectively omit the variable λ. Due to the omitted variable bias, the OLS estimator is inconsistent. This is effectively a truncated regression with omitted variables.
Even if we use all the data, the OLS estimator is still inconsistent because
    E(Y|X) = \Phi\!\left(\frac{X\beta}{\sigma}\right)X\beta + \sigma\phi\!\left(\frac{X\beta}{\sigma}\right).   (8.33)
Define

    I(Y_i) = 1\{Y_i > 0\} = 1\{X_i\beta + u_i > 0\}.

Clearly, Y has a mixed distribution, a distribution that is partly discrete and partly continuous. Part of the distribution is concentrated at the point Y = 0, and the rest of the distribution is spread continuously over (0, ∞).
The dominating measure behind the pdf is the sum of the counting measure and the Lebesgue measure. Note that the above pdf does not depend on the value of Y_i when I(Y_i) = 0, i.e., when Y^* < 0.

Let θ = (β', σ²)'; then the log-likelihood is
    l(\theta) = \sum_{i=1}^{n} l_i(\theta),

where

    l_i(\theta) = [1 - I(Y_i)]\log\left[1 - \Phi\!\left(\frac{X_i\beta}{\sigma}\right)\right] + I(Y_i)\log\left[\frac{1}{\sigma}\phi\!\left(\frac{Y_i - X_i\beta}{\sigma}\right)\right]   (8.36)
      = [1 - I(Y_i)]\log\left[1 - \Phi\!\left(\frac{X_i\beta}{\sigma}\right)\right] - I(Y_i)\left[\frac{(Y_i - X_i\beta)^2}{2\sigma^2} + \frac{\log\sigma^2}{2}\right] + \text{const.}
By definition, E\,\partial l_i(\theta)/\partial\beta = 0. In fact,

    E\left\{-[1 - I(Y_i)]\,\frac{\sigma^{-1}\phi(X_i\beta/\sigma)}{1 - \Phi(X_i\beta/\sigma)} + I(Y_i)\,\frac{Y_i - X_i\beta}{\sigma^2} \,\Big|\, X_i\right\}
    = -[1 - P\{Y_i > 0|X_i\}]\,\frac{\sigma^{-1}\phi(X_i\beta/\sigma)}{1 - \Phi(X_i\beta/\sigma)} + E\left[I(Y_i)\,\frac{Y_i - X_i\beta}{\sigma^2} \,\Big|\, X_i\right]
    = -\frac{1}{\sigma}\phi\!\left(\frac{X_i\beta}{\sigma}\right) + E\left[\frac{Y_i - X_i\beta}{\sigma^2} \,\Big|\, X_i, Y_i > 0\right] P(Y_i > 0|X_i)
    = -\frac{1}{\sigma}\phi\!\left(\frac{X_i\beta}{\sigma}\right) + \frac{1}{\sigma^2}\,\sigma\lambda\!\left(-\frac{X_i\beta}{\sigma}\right)\Phi\!\left(\frac{X_i\beta}{\sigma}\right) = 0,

since \lambda(-X_i\beta/\sigma)\Phi(X_i\beta/\sigma) = \phi(X_i\beta/\sigma).
for any measurable function h (·) , but the theory of MLE suggests that there is no additional
information in large samples beyond the moment conditions obtained by letting h (Xi ) = Xi .
l(θ) has a single maximum, but two-step procedures have been devised by Heckman and Amemiya. The two-step procedure starts with a probit on whether Y_i > 0 or not. This delivers a consistent estimator of β/σ. In the second step, we bring in the continuous information and consider

    E(Y|X) = \Phi\!\left(\frac{X\beta}{\sigma}\right)X\beta + \sigma\phi\!\left(\frac{X\beta}{\sigma}\right).   (8.37)

Use the first-step estimator \widehat{\beta/\sigma} to predict \hat\Phi_i = \Phi(X_i\widehat{\beta/\sigma}) and \hat\phi_i = \phi(X_i\widehat{\beta/\sigma}), and estimate β and σ from the regression of Y_i on \hat\Phi_i X_i and \hat\phi_i.
The argument in this subsection relies on the assumption that Yi = 0 when Yi∗ ≤ 0.
    E(Y|D = 1) - E(Y|D = 0)
      = \left[\Phi\!\left(\frac{\alpha+\beta}{\sigma}\right)(\alpha+\beta) + \sigma\phi\!\left(\frac{\alpha+\beta}{\sigma}\right)\right] - \left[\Phi\!\left(\frac{\alpha}{\sigma}\right)\alpha + \sigma\phi\!\left(\frac{\alpha}{\sigma}\right)\right].

We can run the Tobit regression and then plug the parameter estimates into the right-hand side of the above equation to obtain an estimator of E(Y|D = 1) − E(Y|D = 0). Angrist and Pischke point out that this is unnecessary. They propose to run a simple OLS regression:

    Y_i = \delta + \gamma D_i + \text{error}_i,   (8.43)

and claim that the OLS estimator of γ will provide a legitimate estimator of E(Y|D = 1) − E(Y|D = 0). To verify their claim, we note that E(Y|D) can be rewritten as

    E(Y|D) = \delta + \gamma D,

where

    \delta = \Phi\!\left(\frac{\alpha}{\sigma}\right)\alpha + \sigma\phi\!\left(\frac{\alpha}{\sigma}\right),
    \gamma = \left[\Phi\!\left(\frac{\alpha+\beta}{\sigma}\right)(\alpha+\beta) + \sigma\phi\!\left(\frac{\alpha+\beta}{\sigma}\right)\right] - \left[\Phi\!\left(\frac{\alpha}{\sigma}\right)\alpha + \sigma\phi\!\left(\frac{\alpha}{\sigma}\right)\right].
So indeed the OLS estimator γ̂_OLS based on the regression in (8.43) is consistent for the average causal effect E(Y|D = 1) − E(Y|D = 0).

In fact, consistency of γ̂_OLS for E(Y|D = 1) − E(Y|D = 0) does not rely on any distributional assumption on u. To see this, note that with f(d) := E(Y|D = d) we can always write

    Y = f(0) + [f(1) - f(0)]D + \text{error}   (8.44)

with E(error|D) = 0. Given this, we know that the OLS estimator of the slope coefficient is consistent for f(1) − f(0) = E(Y|D = 1) − E(Y|D = 0). This result reminds us of binary choice models with only dummy covariates. There we show that the linear probability model, probit model and logit model give us exactly the same predicted probabilities.

The equation that drives all these results is (8.44), which can be rewritten as

    E(Y|D) = E(Y|D = 0) + [E(Y|D = 1) - E(Y|D = 0)]\,D.

For a binary variable D, the above equation holds for any type of dependent variable Y, be it continuous or discrete. There is no mis-specification problem. That is, the linear specification is always correct.
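The algebra behind this claim is easy to check: with a single binary regressor, the OLS slope is numerically identical to the difference in group means, whatever Y looks like. The data below are made up.

```python
# With a single binary regressor, the OLS slope equals the difference in
# group means of the observed outcome, even for a censored-looking Y.
y = [0.0, 0.0, 1.2, 2.5, 0.0, 0.4, 3.1, 1.8]   # hypothetical outcomes
d = [0, 0, 0, 0, 1, 1, 1, 1]                   # hypothetical treatment dummy

mean1 = sum(yi for yi, di in zip(y, d) if di == 1) / d.count(1)
mean0 = sum(yi for yi, di in zip(y, d) if di == 0) / d.count(0)

# OLS slope: sample cov(D, Y) / var(D).
n = len(y)
ybar, dbar = sum(y) / n, sum(d) / n
slope = sum((di - dbar) * (yi - ybar) for yi, di in zip(y, d)) / \
        sum((di - dbar) ** 2 for di in d)
```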
The question is: what if we have continuous covariates in the Tobit model? Can we ignore the "LDVness" (Limited Dependent Variable) of Y and just run OLS regardless of the type of the dependent variable we have? Angrist and Pischke again provide a somewhat positive answer to this question. They argue that the OLS estimator still provides a good approximation to some average causal effect of interest. As they admit, this is not a theorem. They only provide some empirical evidence that the linear OLS estimator is close to the average causal effect based on nonlinear models. For more discussion, see Angrist and Pischke (2009, Sec. 3.4.2).
    Y_1 = \max(0, X\beta + Y_2\alpha + u)
    Y_2 = X\gamma + Z\delta + v   (8.45)

where (u, v) are zero-mean normally distributed, independent of (X, Z). For identification, we need the usual rank condition δ ≠ 0, and E[(X, Z)'(X, Z)] is assumed to have full rank, as always.

Under the normality assumption, we have

    u = \theta v + e   (8.46)

where

    \theta = \sigma_{uv}/\sigma_v^2, \quad \sigma_{uv} = \mathrm{cov}(u, v), \quad \sigma_v^2 = \mathrm{var}(v),   (8.47)

and e ∼ N(0, σ_e²) and is independent of (Z, v). Obviously, σ_e² = σ_u² − σ_{uv}²/σ_v². Plugging u = θv + e into

    Y_1 = \max(0, X\beta + Y_2\alpha + u)   (8.48)

gives

    Y_1 = \max(0, X\beta + Y_2\alpha + \theta v + e).   (8.49)
The Smith-Blundell procedure:
(1) Run OLS of Y_2 on X and Z and get the residual v̂ = Y_2 − Xγ̂ − Zδ̂.
(2) Estimate a standard Tobit of Y_1 on X, Y_2 and v̂ to get consistent estimates of β, α, θ and σ_e². (Stata: tobit y1 x y2 v_hat, ll(0))
The usual t-statistic on v̂ provides a simple test of the null H_0: θ = 0, which says that Y_2 is exogenous. Note that when we compute the asymptotic variance, we need to account for the fact that this is a two-step procedure.
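The first step can be sketched on simulated data with hypothetical coefficients. The second-step Tobit itself requires an MLE routine and is omitted here; the sketch only shows how the control function v̂ is formed and that it tracks the structural error v, which carries the endogeneity.

```python
import math, random

# Simulated first stage Y2 = X*gamma + Z*delta + v (hypothetical coefficients).
random.seed(1)
n = 50000
x = [random.gauss(0.0, 1.0) for _ in range(n)]
z = [random.gauss(0.0, 1.0) for _ in range(n)]
v = [random.gauss(0.0, 1.0) for _ in range(n)]
gamma, delta = 0.5, 1.0
y2 = [gamma * xi + delta * zi + vi for xi, zi, vi in zip(x, z, v)]

def slope(w, y):
    # Simple-regression slope; valid here because x and z are independent.
    wbar, ybar = sum(w) / n, sum(y) / n
    return (sum((wi - wbar) * (yi - ybar) for wi, yi in zip(w, y))
            / sum((wi - wbar) ** 2 for wi in w))

g_hat, d_hat = slope(x, y2), slope(z, y2)
v_hat = [y2i - g_hat * xi - d_hat * zi for y2i, xi, zi in zip(y2, x, z)]

# v_hat is close to the structural error v; including it in the second-step
# Tobit absorbs the theta*v component of u.
corr = (sum(a * b for a, b in zip(v_hat, v))
        / math.sqrt(sum(a * a for a in v_hat) * sum(b * b for b in v)))
```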
A full MLE approach avoids the two-step estimation problem:
Once the MLE has been obtained, we can easily test the null hypothesis of exogeneity of
Y2 using the t-statistic for θ̂.
    E\max(0, x^o\beta + y_2^o\alpha + U)
      = \Phi\!\left(\frac{x^o\beta + y_2^o\alpha}{\sigma_u}\right)\left[x^o\beta + y_2^o\alpha + \sigma_u\lambda\!\left(-\frac{x^o\beta + y_2^o\alpha}{\sigma_u}\right)\right]
      = \Phi\!\left(\frac{x^o\beta + y_2^o\alpha}{\sigma_u}\right)(x^o\beta + y_2^o\alpha) + \sigma_u\phi\!\left(\frac{x^o\beta + y_2^o\alpha}{\sigma_u}\right).
Interest lies in estimating E(W_i^o|X_i), where W_i^o is the hourly wage offer for a randomly drawn individual i. If W_i^o were observed for everyone of working age, we would proceed in a standard regression framework. However, a potential sample selection problem arises because W_i^o is observed only for people who work.

Suppose we want to estimate a wage offer function. The true model is the Mincer-type wage offer equation:

    \log W_i^o = X_{i1}\beta_1 + u_{i1}   (8.52)

where W_i^o is the wage and X_{i1} is a vector of human capital attributes (e.g., work experience and education), with β_1 being the associated vector of coefficients. This model has been examined on many datasets, and it is one of the most widely used models in empirical economics. Due to the sample selection problem, the assumption of the classical regression model, namely E[u_{i1}|X_{i1}, worker] = 0, is unlikely to hold. This is because a person who chooses to work may be particularly diligent or have other characteristics that make him more desirable as a worker, and E[u_{i1}|X_{i1}, worker] may well be positive.
We now model the decision to work by a simple rule. We assume that everyone of working age has a reservation wage W_i^r. The person chooses to work only if

    W_i^o > W_i^r,

where the log reservation wage is modeled as

    \log W_i^r = X_{i2}\beta_2 + a_i\gamma + u_{i2},

where X_{i2} contains variables that determine the marginal utility of leisure and income, and a_i is the nonwage income of person i. We assume that (u_{i1}, u_{i2}) is independent of (X_{i1}, X_{i2}, a_i). Person i decides to work if

    \log W_i^o - \log W_i^r = X_{i1}\beta_1 - X_{i2}\beta_2 - a_i\gamma + u_{i1} - u_{i2} > 0,

or

    X_i\delta + v_i > 0, \quad X_i = (X_{i1}, X_{i2}, a_i), \quad v_i = u_{i1} - u_{i2}.   (8.56)
Remark 63 If W_i^r is observed and is exogenous, and X_{i1} is always available, then we would be in the censored regression framework.

Remark 64 If W_i^r is observed and is exogenous, and X_{i1} is available only when W_i^o is available, then we would be in the truncated regression framework.
Let Y_1 = \log W^o and Y_2 be the binary labor force participation indicator; then

    Y_1 = X_1\beta_1 + u   (8.57)

and

    Y_2 = 1\{X\delta + v > 0\}.   (8.58)

We discuss the estimation of the model under the following set of assumptions:

Assumption A:
(d) E(u|v) = v\gamma_1.

The above model is the general Heckman selection model. Amemiya (1985) calls the above model the Type II Tobit model. Wooldridge (2009) calls it the probit selection model. When X = X_1, δ = β_1, and v = u, the model reduces to the standard Tobit model.
Note that under Assumption A we can write u = vγ_1 + η with E(η|X, v) = 0. Therefore,

    E(Y_1|X, Y_2 = 1) = E(X_1\beta_1 + u|X, Y_2 = 1)
      = X_1\beta_1 + E(u|X, X\delta + v > 0)
      = X_1\beta_1 + E(v\gamma_1 + \eta|X, X\delta + v > 0)
      = X_1\beta_1 + E(v|X, v > -X\delta)\,\gamma_1
      = X_1\beta_1 + \lambda(-X\delta)\,\gamma_1.   (8.63)
The above equation makes it clear that an OLS regression of Y_1 on X_1 using the selected sample omits the term λ(−Xδ)γ_1 and generally leads to an inconsistent estimator of β_1.

Following Heckman (1979), we can consistently estimate β_1 and γ_1 using the selected sample by regressing Y_{i1} on X_{i1} and λ(−X_iδ). The problem is that δ is unknown. Fortunately, δ can be consistently estimated by a probit based on Y_{i2}. This two-step procedure is sometimes called Heckit.

To estimate the asymptotic variance of β̂_1, we have to make a correction for the fact that we are not using λ(−Xδ) but only λ(−Xδ̂), so that the error term contains the following:

    \left[\lambda(-X\delta) - \lambda(-X\hat\delta)\right]\gamma_1 \approx -\frac{\partial\lambda(Z)}{\partial Z}\left[X\left(\delta - \hat\delta\right)\right]\gamma_1   (8.64)

evaluated at Z = −Xδ. More specifically, the second-step regression is

    Y_{1i} = X_{1i}\beta_1 + \lambda(-X_i\hat\delta)\gamma_1 + e_i,

where

    e_i = u_i - E(u_i|v_i > -X_i\delta) + \gamma_1\left[\lambda(-X_i\delta) - \lambda(-X_i\hat\delta)\right].

The second term in e_i needs to be taken into consideration when computing the standard errors of the two-step estimator β̂_1.
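The second-step regression can be illustrated on simulated data, treating δ as known to keep the sketch short (in practice δ̂ comes from the first-step probit, and the standard errors then need the correction above). All parameter values are hypothetical.

```python
import math, random

def phi(t): return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
def Phi(t): return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
def lam(c): return phi(c) / (1.0 - Phi(c))   # inverse Mills ratio

# Simulate selection: observe Y1 = X1*beta1 + u only when X*delta + v > 0,
# with E(u|v) = gamma1 * v (all coefficients hypothetical).
random.seed(3)
beta1, gamma1, delta = 1.0, 0.8, 0.7
rows = []
for _ in range(200000):
    xi = random.gauss(0.0, 1.0)
    vi = random.gauss(0.0, 1.0)
    ui = gamma1 * vi + random.gauss(0.0, 0.5)
    if xi * delta + vi > 0:                  # selected: Y2 = 1
        rows.append((xi, lam(-xi * delta), beta1 * xi + ui))

# OLS of Y1 on (X1, lambda(-X*delta)) over the selected sample, by solving
# the 2x2 normal equations directly.
S11 = sum(r[0] * r[0] for r in rows); S12 = sum(r[0] * r[1] for r in rows)
S22 = sum(r[1] * r[1] for r in rows)
T1 = sum(r[0] * r[2] for r in rows); T2 = sum(r[1] * r[2] for r in rows)
det = S11 * S22 - S12 * S12
b_hat = (S22 * T1 - S12 * T2) / det   # consistent for beta1
g_hat = (S11 * T2 - S12 * T1) / det   # consistent for gamma1
```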
In view of the above analysis, the contribution to the partial likelihood function for a given observation (Y_1, Y_2, X) is

    L(Y_1, Y_2|X) = \left\{\frac{1}{\sigma_u}\phi\!\left(\frac{Y_1 - X_1\beta_1}{\sigma_u}\right)\Phi\!\left(\frac{X\delta + \sigma_{uv}\sigma_u^{-2}(Y_1 - X_1\beta_1)}{\sqrt{1 - \sigma_{uv}^2\sigma_u^{-2}}}\right)\right\}^{Y_2}\left[1 - \Phi(X\delta)\right]^{1-Y_2}.

Note that L(Y_1, Y_2|X) is not f(Y_1, Y_2|X), the conditional likelihood function. When viewed as a function of the parameters, we call L(Y_1, Y_2|X) the partial likelihood function.

Let θ be a vector that collects all the parameters. The partial MLE is defined by

    \hat\theta = \arg\max \prod_{i=1}^{n} L(Y_{1i}, Y_{2i}|X_i) = \arg\max \sum_{i=1}^{n} \log L(Y_{1i}, Y_{2i}|X_i).
log L(Y1 , Y2 |X) = 1{Y2 = 1} log f (Y1 |Y2 = 1, X) + log f (Y2 |X).
When the model is correctly specified, the true parameter θ_0 maximizes E log f(Y_2|X). For all (Y_2, X), the true parameter θ_0 also maximizes E log f(Y_1|Y_2, X). In particular, it maximizes E log f(Y_1|Y_2 = 1, X). Therefore, θ_0 maximizes E log L(Y_1, Y_2|X).

For identification, we have to assume or verify that θ_0 is the unique maximizer of E log L(Y_1, Y_2|X). Consistency of θ̂ now follows if we can show that a ULLN

    \sup_{\theta\in\Theta}\left\|\frac{1}{n}\sum_{i=1}^{n}\log L(Y_{1i}, Y_{2i}|X_i) - E\log L(Y_1, Y_2|X)\right\| \to^p 0
8.5 SAMPLE SELECTION WITH TOBIT SELECTION 178
holds.
How do we compute the standard error of θ̂? The score function can be written as

    s_i(\theta) = 1\{Y_{2i} = 1\}\frac{\partial\log f(Y_{1i}|Y_{2i} = 1, X_i)}{\partial\theta} + \frac{\partial\log f(Y_{2i}|X_i)}{\partial\theta}
               =: 1\{Y_{2i} = 1\}s_{1i}(\theta) + s_{2i}(\theta).

Therefore

    E s_i(\theta)s_i(\theta)' = E\,1\{Y_{2i} = 1\}s_{1i}(\theta)s_{1i}(\theta)' + E s_{2i}(\theta)s_{2i}(\theta)'
      + E\,1\{Y_{2i} = 1\}s_{1i}(\theta)s_{2i}(\theta)' + E\,1\{Y_{2i} = 1\}s_{2i}(\theta)s_{1i}(\theta)'.

Since

    E[s_{1i}(\theta_0)|Y_{2i}, X_i] = E\left[\frac{\partial\log f(Y_{1i}|Y_{2i}, X_i)}{\partial\theta}\Big|_{\theta=\theta_0}\,\Big|\,Y_{2i}, X_i\right] = 0

for any given Y_{2i} and X_i, we have E[s_{1i}(\theta_0)|Y_{2i} = 1, X_i] = 0. As a consequence,

    E s_i(\theta_0)s_i(\theta_0)' = E\,1\{Y_{2i} = 1\}s_{1i}(\theta_0)s_{1i}(\theta_0)' + E s_{2i}(\theta_0)s_{2i}(\theta_0)'
      = -E\,1\{Y_{2i} = 1\}H_{1i}(\theta_0) - E H_{2i}(\theta_0) = -E H_i(\theta_0),

where

    H_{1i}(\theta_0) = \frac{\partial^2\log f(Y_{1i}|Y_{2i} = 1, X_i)}{\partial\theta\partial\theta'}, \quad H_{2i}(\theta_0) = \frac{\partial^2\log f(Y_{2i}|X_i)}{\partial\theta\partial\theta'}.

Here we have used the fact that

    E[s_{1i}(\theta_0)s_{1i}(\theta_0)'|Y_{2i}, X_i] = -E[H_{1i}(\theta_0)|Y_{2i}, X_i] \quad \text{for any } Y_{2i} \text{ and } X_i.
    Y_1 = X_1\beta_1 + u
    Y_2 = \max(0, X\delta + v)   (8.66)

A familiar example occurs when Y_1 is the log of the hourly wage offered and Y_2 is hours of labor supply. The model is sometimes called the Type III Tobit model.
Assumption B
(c) v ∼ N(0, σ_v²)
(d) E(u|v) = vγ_1

(a) Estimate Y_2 = max(0, Xδ + v) by standard Tobit using the whole sample. For Y_{i2} > 0, define

    \hat v_i = Y_{i2} - X_i\hat\delta.   (8.68)

(b) Using the observations for which Y_{i2} > 0, estimate β_1 and γ_1 by the OLS regression

    Y_{i1} = X_{i1}\beta_1 + \gamma_1\hat v_i + \text{error}_i.   (8.69)
To make a comparison with the partial likelihood function for the Type II Tobit model, the above partial likelihood can be rewritten as

    \left[1 - \Phi\!\left(\frac{X\delta}{\sigma_v}\right)\right]^{1-I(Y_2)}\left[\frac{1}{\sigma_u}\phi\!\left(\frac{Y_1 - X_1\beta_1}{\sigma_u}\right)\right]^{I(Y_2)}
    \times\left\{\frac{1}{\sqrt{\sigma_v^2 - \sigma_{uv}^2\sigma_u^{-2}}}\,\phi\!\left(\frac{Y_2 - X\delta - \sigma_{uv}\sigma_u^{-2}(Y_1 - X_1\beta_1)}{\sqrt{\sigma_v^2 - \sigma_{uv}^2\sigma_u^{-2}}}\right)\right\}^{I(Y_2)}.
8.6 Problem Set
On the basis of the first form of the partial likelihood function, we can use the same argument as in the previous section to show that the partial MLE is consistent and asymptotically normal. Its asymptotic variance can be consistently estimated by the (negative) inverse Hessian matrix.
    Y_1 = Y_2^*\alpha + X\beta + u   (8.70)
    Y_2^* = Y_1\gamma + Z\delta + v   (8.71)

where Y_1 and Y_2^* are dependent variables, and X and Z are scalar exogenous explanatory variables. Suppose Y_1, X, and Z are always observable, but Y_2^* is only partially observable in the sense that we observe only Y_2:

    Y_2 = \begin{cases} 1, & \text{if } Y_2^* > 0 \\ 0, & \text{if } Y_2^* \le 0 \end{cases}   (8.72)
What is the log-likelihood function based on the observations {Yi , i = 1, 2, ..., n}? Is γ ∗ iden-
tified? What is the mean of Yi given X = Xi ? What is the median of Yi given X = Xi ?
8.6.2 Answers
1. From the two-equation system, we can solve for Y_1:

    Y_1 = \frac{X\beta + Z\alpha\delta}{1 - \alpha\gamma} + e,

where e = (u + v\alpha)/(1 - \alpha\gamma), with

    \sigma_e^2 = \mathrm{var}(e) = \frac{\sigma_u^2 + \alpha^2\sigma_v^2}{(1 - \alpha\gamma)^2} \quad\text{and}\quad \mathrm{cov}(e, v) = \frac{\sigma_v^2\alpha}{1 - \alpha\gamma}.
We will estimate the model by MLE. First, the density of Y_{1i} conditional on X_i and Z_i is

    f(Y_{1i}|X_i, Z_i) = \frac{1}{\sigma_e}\phi\!\left(\frac{Y_{1i} - (X_i\beta + Z_i\alpha\delta)/(1 - \alpha\gamma)}{\sigma_e}\right).

Second, we can write

    v = \frac{\mathrm{cov}(e, v)}{\mathrm{var}(e)}e + w = \frac{\sigma_v^2\alpha(1 - \alpha\gamma)}{\sigma_u^2 + \alpha^2\sigma_v^2}\,e + w,

where

    \sigma_w^2 = \mathrm{var}(w) = \sigma_v^2 - \frac{\sigma_v^4\alpha^2}{\sigma_u^2 + \alpha^2\sigma_v^2} = \frac{\sigma_u^2\sigma_v^2}{\alpha^2\sigma_v^2 + \sigma_u^2}.

So conditional on e_i, v_i is a normal random variable with mean \frac{\sigma_v^2\alpha(1-\alpha\gamma)}{\sigma_u^2 + \alpha^2\sigma_v^2}e_i and variance \frac{\sigma_u^2\sigma_v^2}{\alpha^2\sigma_v^2 + \sigma_u^2}. As a result, the probability of Y_{2i} = 1 conditional on Y_{1i}, X_i and Z_i is

    f(Y_{2i} = 1|Y_{1i}, X_i, Z_i) = \Phi\!\left(\frac{1}{\sigma_w}\left[Y_{1i}\gamma + Z_i\delta + \frac{\sigma_v^2\alpha(1-\alpha\gamma)}{\sigma_u^2 + \alpha^2\sigma_v^2}\left(Y_{1i} - \frac{X_i\beta + Z_i\alpha\delta}{1 - \alpha\gamma}\right)\right]\right).
Therefore, the likelihood function is

    \prod_{i=1}^{n} f(Y_{1i}|X_i, Z_i)f(Y_{2i}|Y_{1i}, X_i, Z_i)
    = \prod_{i=1}^{n}\frac{1}{\sigma_e}\phi\!\left(\frac{Y_{1i} - (X_i\beta + Z_i\alpha\delta)/(1 - \alpha\gamma)}{\sigma_e}\right)
    \times\left\{\Phi\!\left(\frac{1}{\sigma_w}\left[Y_{1i}\gamma + Z_i\delta + \frac{\sigma_v^2\alpha(1-\alpha\gamma)}{\sigma_u^2 + \alpha^2\sigma_v^2}\left(Y_{1i} - \frac{X_i\beta + Z_i\alpha\delta}{1 - \alpha\gamma}\right)\right]\right)\right\}^{Y_{2i}}
    \times\left\{1 - \Phi\!\left(\frac{1}{\sigma_w}\left[Y_{1i}\gamma + Z_i\delta + \frac{\sigma_v^2\alpha(1-\alpha\gamma)}{\sigma_u^2 + \alpha^2\sigma_v^2}\left(Y_{1i} - \frac{X_i\beta + Z_i\alpha\delta}{1 - \alpha\gamma}\right)\right]\right)\right\}^{1-Y_{2i}}.
Note that we cannot identify all the model parameters and have to normalize σ_v² = 1. Maximizing the above likelihood function with respect to the remaining parameters leads to consistent estimators of these parameters.
Next we consider the case when we observe only Y_2:

    Y_2 = \begin{cases} Y_2^*, & \text{if } Y_2^* > 0 \\ 0, & \text{if } Y_2^* \le 0 \end{cases}   (8.76)
Using the same steps, we can show that the likelihood function is:

    \prod_{i=1}^{n} f(Y_{1i}|X_i, Z_i)f(Y_{2i}|Y_{1i}, X_i, Z_i)
    = \prod_{i=1}^{n}\frac{1}{\sigma_e}\phi\!\left(\frac{Y_{1i} - (X_i\beta + Z_i\alpha\delta)/(1 - \alpha\gamma)}{\sigma_e}\right)
    \times\left\{\frac{1}{\sigma_w}\phi\!\left(\frac{1}{\sigma_w}\left[Y_{2i} - Y_{1i}\gamma - Z_i\delta - \frac{\sigma_v^2\alpha(1-\alpha\gamma)}{\sigma_u^2 + \alpha^2\sigma_v^2}\left(Y_{1i} - \frac{X_i\beta + Z_i\alpha\delta}{1 - \alpha\gamma}\right)\right]\right)\right\}^{I(Y_{2i})}
    \times\left\{1 - \Phi\!\left(\frac{1}{\sigma_w}\left[Y_{1i}\gamma + Z_i\delta + \frac{\sigma_v^2\alpha(1-\alpha\gamma)}{\sigma_u^2 + \alpha^2\sigma_v^2}\left(Y_{1i} - \frac{X_i\beta + Z_i\alpha\delta}{1 - \alpha\gamma}\right)\right]\right)\right\}^{1-I(Y_{2i})}.

The MLE of the model parameters can be obtained by maximizing the above likelihood function.
2. (i) First, the probability of Y_i = 0 is

    P(Y_i = 0|X_i) = P(Y_i^* < \gamma^*|X_i) = P[\exp(X_i\beta + \varepsilon_i) < \gamma^*|X_i]
      = P[\varepsilon_i < \ln(\gamma^*) - X_i\beta|X_i]
      = \Phi\!\left(\frac{\ln(\gamma^*) - X_i\beta}{\sigma}\right).
Second, we find the density of Y_i^* given that Y_i^* ≥ γ^*. To this end, we compute

    \Pr(Y_i^* \le Y | Y_i^* \ge \gamma^*) = \frac{\Pr(\gamma^* \le Y_i^* \le Y)}{P(Y_i^* \ge \gamma^*)} = \frac{\Pr(\ln\gamma^* \le \ln Y_i^* \le \ln Y)}{P(Y_i^* \ge \gamma^*)}
      = \frac{\Pr(\ln\gamma^* - X_i\beta \le \varepsilon_i \le \ln Y - X_i\beta)}{P(Y_i^* \ge \gamma^*)}
      = \frac{\Phi\!\left(\frac{\ln Y - X_i\beta}{\sigma}\right) - \Phi\!\left(\frac{\ln\gamma^* - X_i\beta}{\sigma}\right)}{P(Y_i^* \ge \gamma^*)}.
So the probability of observing a nonzero Y_i that is less than or equal to Y is

    \Pr(Y_i \le Y, Y_i \ne 0) = \Pr(Y_i^* \le Y | Y_i^* \ge \gamma^*) \times \Pr(Y_i^* \ge \gamma^*)
      = \Phi\!\left(\frac{\ln Y - X_i\beta}{\sigma}\right) - \Phi\!\left(\frac{\ln\gamma^* - X_i\beta}{\sigma}\right).

The pdf is therefore

    \frac{1}{\sigma Y}\phi\!\left(\frac{\ln Y - X_i\beta}{\sigma}\right).
Combining the above analysis, we find that the likelihood function is

    \prod_{i=1}^{n}\left[\Phi\!\left(\frac{\ln(\gamma^*) - X_i\beta}{\sigma}\right)\right]^{1\{Y_i = 0\}}\left[\frac{1}{\sigma Y_i}\phi\!\left(\frac{\ln Y_i - X_i\beta}{\sigma}\right)\right]^{1\{Y_i \ne 0\}},

so the log-likelihood function is

    \sum_{i=1}^{n} 1\{Y_i = 0\}\ln\Phi\!\left(\frac{\ln(\gamma^*) - X_i\beta}{\sigma}\right) + \sum_{i=1}^{n} 1\{Y_i \ne 0\}\log\left[\frac{1}{\sigma}\phi\!\left(\frac{\ln Y_i - X_i\beta}{\sigma}\right)\right] - \sum_{i=1}^{n} 1\{Y_i \ne 0\}\log(Y_i).
(iv) If \ln(\gamma^*) - X_i\beta \ge 0, then

    P(Y_i = 0|X_i) = \Phi\!\left(\frac{\ln(\gamma^*) - X_i\beta}{\sigma}\right) \ge \frac{1}{2}.
[1] Angrist, J. D. and Pischke, J.-S. (2009), Mostly Harmless Econometrics, Princeton University Press.
[2] Amemiya, T. (1985), Advanced Econometrics, Harvard University Press, Cambridge, MA.
[3] Carson, R. T. and Y. Sun (2007), "The Tobit model with a non-zero threshold," The Econometrics Journal, 10(3), 488-502.
[4] Heckman, J. (1979), "Sample Selection Bias as a Specification Error," Econometrica, 47(1), 153-162.
[5] Tobin, J. (1958), "Estimation of relationships for limited dependent variables," Econometrica, 26(1), 24-36.
[6] Wooldridge, J. (2009), Econometric Analysis of Cross Section and Panel Data, MIT Press.
Chapter 9
Causal Inference
In the microeconometrics textbooks of both Wooldridge (2010) and Angrist and Pischke (2009), the very first page describes the estimation of causal effects as the principal goal of empirical microeconomists. According to Angrist and Pischke, "In the beginning, we should ask, What is the causal relationship of interest? Although purely descriptive research has an important role to play, we believe that the most interesting research in social science is about questions of cause and effect, such as the effect of class size on children's test scores. . . ." Similarly, the first sentences in the Wooldridge textbook are, "The goal of most empirical studies in economics and other social sciences is to determine whether a change in one variable, say w, causes a change in another variable, say y. For example, does having another year of education cause an increase in monthly salary? Does reducing class size cause an improvement in student performance? Does lowering the business property tax rate cause an increase in city economic activity?"
Yi : T → Y
9.1 The Framework of Potential Outcomes
We observe Yi(Ti), but we are interested in the causal effect for individual i:

Yi(1) − Yi(0).

The fundamental problem is that we cannot observe both Yi(1) and Yi(0). For individuals in the treatment group, we observe only their Yi(1). For individuals in the control group, we observe only their Yi(0). For example, we may have the following data:
The causal effect for each individual is not fully observable. For this reason, we often focus on the average treatment effect (ATE).
Observed outcome:

Yi := (1 − Ti)Yi(0) + Ti Yi(1),

so we only observe one of the two potential outcomes. The other potential outcome is counterfactual.
Consider a concrete example: some potential outcomes are observed, while the unobserved Yi’s are the numbers in a box. Here the ATE is

(−2 + 3 + 1 − 1 + 1)/5 = 2/5.

A naive estimator of the ATE is

(5 + 4 + 2)/3 − (5 + 3)/2 = −1/3.
Yi(0) := α + εi,
Yi(1) := α + β + εi.

So the linear structural/causal model is a special case of the general potential-outcome model, with the restriction that

Yi(1) − Yi(0) = β

for all i. We call this the constant treatment effect assumption.
In general, the potential-outcome model allows for heterogeneous treatment effects across individuals, as Yi(1) − Yi(0) is different for different i’s. This can also be seen by noting that

Yi = (1 − Ti)Yi(0) + Ti Yi(1) = Yi(0) + Ti [Yi(1) − Yi(0)],

where the first equation is the definition of Yi in the potential-outcomes framework. So the potential-outcomes framework is equivalent to the linear structural/causal model of the form

Yi = α + Ti βi + εi,

with the coefficient βi being different for different individuals. In addition to being empirically relevant, treatment heterogeneity has been important in the development of econometric thought.
The above can be mapped into a causal model. Let

Yi = c(Ti, Ẍi),

where Ẍi collects the unobserved causes of Yi, and write ct(Ẍi) := c(t, Ẍi) for t = 0, 1. Then Yi(0) = c0(Ẍi) is the value of Yi when Ti is set to 0 while keeping Ẍi constant. Similarly, Yi(1) = c1(Ẍi) is the value of Yi when Ti is set to 1 while keeping Ẍi constant. Given the above definitions, we have

Yi = α + Ti βi + εi,

where

α = E c0(Ẍi),
βi = c1(Ẍi) − c0(Ẍi) = Yi(1) − Yi(0),
εi = c0(Ẍi) − E c0(Ẍi) = Yi(0) − E Yi(0).
However, since we only observe one of the two potential outcomes for any individual, there is a limit on how much we can learn about the distribution of individual-level treatment effects.
Now suppose we run the regression based on the predictive model:

Yi = α* + Ti β* + ui.

Then

α* = EYi − (ETi) β*,

β* = cov(Yi, Ti)/var(Ti) = [E(Yi Ti) − E(Yi)(ETi)] / var(Ti)
   = [E(Yi Ti | Ti = 1) P(Ti = 1) + E(Yi Ti | Ti = 0) P(Ti = 0) − E(Yi)(ETi)] / var(Ti)
   = [E(Yi | Ti = 1)(ETi) − E(Yi)(ETi)] / [(ETi)(1 − ETi)] = [E(Yi | Ti = 1) − E(Yi)] / (1 − ETi)
   = [E(Yi | Ti = 1) − E(Yi | Ti = 1)(ETi) − E(Yi | Ti = 0)(1 − ETi)] / (1 − ETi)
   = E(Yi | Ti = 1) − E(Yi | Ti = 0).
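The last equality can be verified numerically: with a binary regressor, the OLS slope is exactly the difference in group means (a sketch with simulated data):

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.integers(0, 2, size=200)           # binary "treatment"
y = 1.0 + 2.0 * t + rng.normal(size=200)   # any outcome variable will do

# beta* = cov(Y, T) / var(T), the OLS slope from regressing Y on (1, T)
beta_star = np.cov(y, t, ddof=0)[0, 1] / np.var(t)
diff_means = y[t == 1].mean() - y[t == 0].mean()

print(beta_star, diff_means)  # identical up to floating-point error
```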
Exercise: Let E(ui) = 0 and E(ui Ti) = 0, where Ti is a binary random variable. Show that E(ui | Ti = 1) = 0 and E(ui | Ti = 0) = 0.
The question: is β* = β? In general, no. Consider E[Yi | Ti = 1] = E[Yi(1) | Ti = 1]:
• E[Yi (1)] is the mean of the potential outcomes under treatment 1 for all individuals
in the population.
• E[Yi (1)|Ti = 1] is the mean of the potential outcomes under treatment 1 for those who
actually received treatment 1.
• So E[Yi (1)] is the mean for the population while E[Yi (1)|Ti = 1] is the mean for a
subpopulation.
• We expect that E[Yi(1) | Ti = 1] ≠ E[Yi(1)] in general. Mathematically, the unconditional mean may be different from the conditional mean.
Similarly,

E[Yi | Ti = 0] = E[Yi(0) | Ti = 0] ≠ E[Yi(0)] in general.

Hence β* ≠ β in general.
But β* = β if

E[Yi(1) | Ti = 1] = E[Yi(1)] and E[Yi(0) | Ti = 0] = E[Yi(0)].
Example: Suppose that for all i, Yi(1) = Yi(0), and Ti indicates college attendance. Then ATE = 0: there is no causal effect of college. However, if Ti is positively correlated with Yi(1) = Yi(0) (more motivated students go to college), then Ti is positively correlated with [Yi(0) − EYi(0)], which is the error term in the regression Yi = EYi(0) + Ti × 0 + [Yi(0) − EYi(0)]. Then E(Yi | Ti = 1) > E(Yi | Ti = 0), and hence β* > 0.
Another way to look at the problem is to focus on the linear causal model representation:

Yi = α + Ti βi + εi
   = α + Ti (Eβi) + εi + Ti [βi − (Eβi)]
   = α + Ti × ATE + ei,

where

ei = [Yi(0) − EYi(0)] + Ti [Yi(1) − Yi(0) − E[Yi(1) − Yi(0)]]
   = [Yi(0) − EYi(0)] + Ti {[Yi(1) − EYi(1)] − [Yi(0) − EYi(0)]}.
The above representation has taken on the look and feel of a regression model: ei is akin to an error term, even though it represents both heterogeneity of the baseline no-treatment potential outcome, i.e., [Yi(0) − EYi(0)], and heterogeneity of the treatment effects, i.e., Yi(1) − Yi(0) − E[Yi(1) − Yi(0)], and even though it includes within it the observed variable Ti. The above representation is in fact quite different from the traditional bivariate regression in the sense that it is not only more finely articulated but also tied to a particular formulation of causal effects that are allowed to vary across individuals.
Whether OLS is consistent depends on whether Ti is endogenous, i.e., whether

cov(Ti, Yi(0) − EYi(0)) = 0

and

cov(Ti, [Yi(1) − Yi(0)] − E[Yi(1) − Yi(0)]) = 0.

If the above two equations hold, then OLS is consistent for β. This is exactly the same conclusion we obtained by comparing β* with β directly.
There are two ways that Ti could be endogenous. First, cov(Ti, Yi(0) − EYi(0)) may not be zero. That is, there is a correlation between the treatment membership and the (net) baseline difference in the hypothetical no-treatment state. Second,

cov(Ti, [Yi(1) − Yi(0)] − E[Yi(1) − Yi(0)])

may not be zero. That is, there is a correlation between the net treatment effect difference and the treatment membership.
The quantities in the definition of ATT can be illustrated by the figure below:
In some cases, we are interested in some policy or treatment which we will make available to individuals, but we will not force them to take the treatment. If individuals who are likely to benefit from the treatment are the ones who end up taking it, then we could have ATT > ATE.
We should be a little careful in using this definition. If information about the efficacy of the treatment becomes widely known, individuals may change their behavior over time. So ATT is arguably somewhat more sensitive to problems of “external generalizability.”
We can decompose the apparent effect E(Yi | Ti = 1) − E(Yi | Ti = 0) as follows:

E(Yi | Ti = 1) − E(Yi | Ti = 0) = E[Yi(1) − Yi(0) | Ti = 1] + {E[Yi(0) | Ti = 1] − E[Yi(0) | Ti = 0]},

i.e., the ATT plus a selection-bias term.
Define ATE(x) := E[Yi(1) − Yi(0) | Xi = x]. This is the average effect for individuals with covariate value x. It could be particularly useful to a social planner who wants to make treatment assignments on the basis of individual characteristics.
So randomized experiments permit consistent estimation of the ATE by OLS, without strong distributional or functional form assumptions. We say the ATE is nonparametrically identified.
An alternative definition of identification (roughly equivalent): the ATE (or other object of interest) is identified if we can recover it from the distribution of observables, that is, from the distribution of (Ti, Yi).
Example 67 Vitamin C. Cameron and Pauling (1976) gave vitamin C to 100 patients believed to be terminally ill from cancer. The comparison group was constructed by matched sampling: for each treated patient, select 10 patients from historical records with the same type of cancer and other characteristics (age, gender). Patients receiving vitamin C lived about 4 times longer than controls, a highly significant difference. Later, a careful randomized experiment was conducted at the Mayo Clinic, with patients randomly assigned to receive vitamin C or a placebo. It found NO evidence that vitamin C prolonged survival.
Example 68 The RAND Health Insurance Experiment (RAND HIE) was an experimental study of health care costs, utilization and outcomes in the United States, which assigned people randomly to different kinds of plans and followed their behavior from 1974 to 1982. As a result, it provided stronger evidence than observational studies of people who were not randomly assigned.
People assigned to more generous plans used substantially more health care.
Consider the causal model

Yi = c(Ti, Ẍi),

where Ẍi are unobserved causes. The conditional independence assumption in the above model entails

Ti ⊥ Ẍi | Xi

for some Xi. But Yi(1) = c1(Ẍi) and Yi(0) = c0(Ẍi), so the above assumption implies that

Ti ⊥ (Yi(1), Yi(0)) | Xi.
When Xi contains pre-treatment variables that determine the treatment selection and affect the outcome of interest, we have the following DAG:

        Xi
       ↙  ↘
     Ti    Ẍi
       ↘  ↙
        Yi
9.3 Strongly Ignorable Treatment Assignment

Suppose treatment is selected according to

Ti = 1{φ0 + Xi φ1 + Ui > 0},

where Xi represents all observed variables that determine treatment selection and Ui represents all unobserved variables. When Ui is completely random in that it is independent of Yi, then selection depends only on the observables systematically, and Ui can be regarded as a “randomizer.” The left panel of the graph below illustrates “selection on observables”:
There are no back-door paths from T to Y other than the one that is blocked by X. The
term U represents completely random and idiosyncratic determinants of treatment selection.
In contrast, the right panel of the graph below illustrates “selection on unobservables”: the term U, like the elements in X, is not completely random; it is correlated with Y. There are now back-door paths from T to Y other than those via X. Conditioning on X does not block the back-door path T ← U → Y. In this case, we have to make strong assumptions about the links between unobservables and observables. For example, we may try to find a randomized instrumental variable that does not affect the unobservables, leading to the instrumental variables approach.
        Xi                        Xi
       ↙  ↘                      ↙  ↘
     Ti    Ẍi                  Ti    Ẍi
    ↗  ↘  ↙                   ↗  ↘  ↙
  Ui     Yi                 Ui  →  Yi

left: selection on observables; right: selection on unobservables
• Selection on unobservables
Yi (1) − Yi (0) = β
Partial Overlap: for very large values of x, everyone is treated and for very small values of x,
everyone is untreated. In this case, ATE(x) for a very large x or a very small x is not
identified. However, ATE(x) for x in the middle range is identified.
9.4 Identification under Strong Ignorability
When the unconfoundedness and overlap assumptions hold, we say that the treatment assignment is strongly ignorable.
Note that

E[Yi(1) | Xi = x] = E[Yi(1) | Ti = 1, Xi = x] = E[Yi | Ti = 1, Xi = x],

where the first equality follows from the unconfoundedness assumption. Since Pr(Ti = 1 | Xi = x) > 0 by assumption, we can consistently estimate E[Yi | Ti = 1, Xi = x], and therefore we can identify E[Yi(1) | Xi = x].
Likewise,

E[Yi(0) | Xi = x] = E[Yi(0) | Ti = 0, Xi = x] = E[Yi | Ti = 0, Xi = x],

so we can estimate E[Yi(0) | Xi = x] as well. Thus, we can identify ATE(x). For this, we need to estimate E[Yi | Ti, Xi] for all values of Ti and Xi.
Notice also that

ATE = E[ATE(Xi)] = ∫ ATE(x) dFX(x).
Note that E(Y | X = x, T = t), and hence dG(t, x), can be estimated from the data. The discrete analogue of the “partial derivative” is

dG(1, x) − dG(0, x) = E[ c(1, Ẍ) − c(0, Ẍ) | X = x ],

and therefore ATE(x) = dG(1, x) − dG(0, x) is identified.
Mathematically, the above conditions are weaker than Ti ⊥ (Yi(1), Yi(0)) | Xi, but it is hard to see the practical advantage of the weaker conditional mean independence assumption. Recall that conditional mean independence is sufficient for a linear causal model. Here, when Ti is a binary variable, we have a linear or quasi-linear model by construction. That is why it suffices to have conditional mean independence.
For the ATT, we only need to assume that

E[Yi(0) | Xi = x] = E[Yi(0) | Ti = 0, Xi = x],

because Yi(1) is not missing for the treated individuals; we use the above to impute their missing Yi(0). To identify the ATT, the overlap condition can be relaxed to

Pr(Ti = 1 | Xi = x) < 1

for all x in the support of Xi conditional on Ti = 1. Similarly, for the ATC, we only need to assume

E[Yi(1) | Xi = x] = E[Yi(1) | Ti = 1, Xi = x],

and the overlap assumption can be similarly weakened.
Let

m1(x) = E[Yi | Ti = 1, Xi = x],  m0(x) = E[Yi | Ti = 0, Xi = x],

and let m̂1(x) and m̂0(x) be consistent estimators of m1(x) and m0(x), respectively. Then we can estimate the ATE, ATT and ATC by

ÂTE = (1/n) Σ_{i=1}^n [ m̂1(Xi) − m̂0(Xi) ],

ÂTT = Σ_{i=1}^n [Yi − m̂0(Xi)] 1{Ti = 1} / Σ_{i=1}^n 1{Ti = 1},

ÂTC = Σ_{i=1}^n [m̂1(Xi) − Yi] 1{Ti = 0} / Σ_{i=1}^n 1{Ti = 0}.
We are simply taking the treatment and control averages for the subsample with Xi equal to a particular value. Then, taking a sample analog of the equation ATE = E[ATE(Xi)], we have

ÂTE = (1/n) Σ_{i=1}^n ÂTE(Xi).
A nice feature of this estimator is that we avoid making strong assumptions on the form of E[Yi | Ti, Xi]. However, if Xi takes on many values, there will be relatively few observations with any particular value of Xi, leading to high variance for ÂTE(x) and ÂTE.
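With a discrete covariate, these estimators are just within-cell averages. A minimal simulation sketch (the data-generating process, the cell propensities, and the constant effect of 2 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
x = rng.integers(0, 3, size=n)                 # discrete covariate with 3 cells
p = np.array([0.2, 0.5, 0.8])[x]               # true propensity in each cell
t = (rng.random(n) < p).astype(int)
y = 1.0 + 2.0 * t + x + rng.normal(size=n)     # constant treatment effect of 2

# m1_hat(x), m0_hat(x): treated/control averages within each cell of x
cells = np.unique(x)
m1 = {c: y[(x == c) & (t == 1)].mean() for c in cells}
m0 = {c: y[(x == c) & (t == 0)].mean() for c in cells}

ate_hat = np.mean([m1[c] - m0[c] for c in x])
att_hat = (y[t == 1] - np.array([m0[c] for c in x[t == 1]])).mean()
print(ate_hat, att_hat)  # both close to the true effect 2
```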
Suppose E(Yi | Ti, Xi) = a + b Ti + Xi c + Ti Xi d, so that

E(Yi | Ti = 1, Xi) = a + b + Xi (c + d),
E(Yi | Ti = 0, Xi) = a + Xi c.

So the regression line for the Ti = 0 subgroup could have a different slope and intercept from the regression line for the Ti = 1 subgroup. We could include transformations of Xi (such as powers of Xi) as well, and get a fairly general regression specification.
We could then estimate this regression function by OLS, and then estimate ATE(x) by

ÂTE(x) = b̂_OLS + x d̂_OLS.
where X̄|treatment is the average of Xi for the treated and X̄|control is the average of Xi for the
untreated.
9.5 Parametric Methods under Strong Ignorability
We can get an even nicer expression for ÂTE if we run the regression with X̃i := Xi − X̄ in place of Xi. Then we will have

β̂_ATE = b̂_OLS,alt,

where b̂_OLS,alt is the OLS estimator of b in the regression of Yi on a constant, Ti, X̃i, and Ti X̃i.
Notice that the unconfoundedness assumption provides a link between conventional re-
gression parameters and the causal parameter. Thus, it is possible to interpret regression
parameters as causal, but only under somewhat strong assumptions about the selection into
treatment and control groups.
The OLS estimators of a, b, c, d can be obtained by running two separate regressions:

( \widehat{a+b}, \widehat{c+d} ) = arg min Σ_{i: Ti = 1} [Yi − (a + b) − Xi (c + d)]²,

( â, ĉ ) = arg min Σ_{i: Ti = 0} [Yi − a − Xi c]².
So

ÂTT = [ \widehat{a+b} + X̄|treatment \widehat{c+d} ] − [ â + X̄|treatment ĉ ]
    = ( \widehat{a+b} − â ) + X̄|treatment ( \widehat{c+d} − ĉ ).
Similarly, we have

ÂTC = [ Ȳ|treatment − X̄|treatment \widehat{c+d} ] − [ Ȳ|control − X̄|control ĉ ] + X̄|control ( \widehat{c+d} − ĉ )

(the first bracket equals \widehat{a+b} and the second equals â)

    = Ȳ|treatment − ( X̄|treatment − X̄|control ) \widehat{c+d} − Ȳ|control.
These two expressions give us a clear idea of what the parametric assumption entails. As an example, consider the ATT. In this case, the estimate is the difference between Ȳ|treatment and the adjusted average of the control group,

Ȳ|adj_control = Ȳ|control + ( X̄|treatment − X̄|control ) ĉ.

The magnitude of the adjustment depends on the extent to which the two groups are balanced in terms of the covariate averages. Under completely randomized experiments, we have E X̄|treatment = E X̄|control = E X̄ if Ti does not cause Xi. Then

ÂTT ≈ Ȳ|treatment − Ȳ|control.
Ȳ |treatment − Ȳ |control is exactly the same as the estimate of the ATE in the context of com-
pletely randomized experiments.
Remark 71 The method can also be viewed from an imputation perspective. Suppose for individual i0, Ti0 = 1, so that we observe Yi0(1) but Yi0(0) is missing. We want to impute the missing value Yi0(0). The above procedure amounts to first running the regression

( â, ĉ ) = arg min Σ_{j: Tj = 0} [Yj − a − Xj c]²

based on the subsample with Tj = 0, and then imputing the missing value by

Ŷi0(0) = â + Xi0 ĉ.

Similarly, the missing value Yj0(1) for an individual j0 with Tj0 = 0 is imputed by

Ŷj0(1) = \widehat{a+b} + \widehat{c+d} Xj0.

With the imputed values, we can then proceed as if there were no missing values. This is equivalent to the OLS regression adjustment above.
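A sketch of this imputation view of the ATT estimator (the simulated design and the linear control-group model are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
x = rng.normal(size=n)
t = (rng.random(n) < 1.0 / (1.0 + np.exp(-x))).astype(int)  # selection on x
y = 1.0 + 2.0 * t + 0.5 * x + rng.normal(size=n)            # true ATT = 2

# Fit (a_hat, c_hat) on the control subsample by OLS ...
Xc = np.column_stack([np.ones((t == 0).sum()), x[t == 0]])
a_hat, c_hat = np.linalg.lstsq(Xc, y[t == 0], rcond=None)[0]

# ... impute the missing Y_i(0) for every treated unit, then average
y0_imputed = a_hat + c_hat * x[t == 1]
att_hat = (y[t == 1] - y0_imputed).mean()
print(att_hat)  # close to the true effect 2
```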
Remark 72 The linear parametric assumption implies that the adjustment is linear. Even if the adjustment should be nonlinear, the linear specification will not lead to a large bias when X̄|treatment is approximately equal to X̄|control. Otherwise, the bias due to the misspecification of the functional form can be large. Consider ÂTT as an example, in which case we essentially use the control cases to impute the missing values of Yi(0) for the treated. If X̄|treatment is very different from X̄|control, we rely on extrapolation to make the imputation. Linear extrapolation may not be reliable when the averages of the covariates across the two groups are not close to each other.
Remark 73 Bias from the global linear regression is a function of (i) the covariate distributions and (ii) nonlinearity. One way to assess the potential bias is to calculate the normalized difference:

normdiff = ( X̄|treatment − X̄|control ) / sqrt[ ( S²|treatment + S²|control ) / 2 ],

where S²|treatment and S²|control are the sample variances of Xi for the two groups. Imbens and Wooldridge (2009) suggest a rule of thumb: if normdiff > 0.25, then don't use global linear regression.
Note that normdiff is not the t-statistic for testing equal means. With really large samples, small values of X̄|treatment − X̄|control can have a large t-stat, but inference doesn't get harder as the sample size increases.
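A small helper for this diagnostic (the example arrays are made up):

```python
import numpy as np

def normalized_difference(x_treat, x_control):
    """Imbens-Wooldridge normalized difference for a single covariate."""
    return (x_treat.mean() - x_control.mean()) / np.sqrt(
        (x_treat.var(ddof=1) + x_control.var(ddof=1)) / 2.0
    )

x_treat = np.array([2.0, 3.0, 4.0, 5.0])
x_control = np.array([1.0, 2.0, 3.0, 4.0])
nd = normalized_difference(x_treat, x_control)
print(nd)  # about 0.775, above the 0.25 rule of thumb
```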
Remark 74 Suppose that we have a completely randomized experiment, so that Ti ⊥ (Yi(1), Yi(0)), but we are not aware of this. Instead, we proceed to estimate the treatment effect under the assumption that

Ti ⊥ (Yi(1), Yi(0)) | Xi.

If Ti causes Xi, so that E X̄|treatment ≠ E X̄|control, and if Yi causes Xi, then using X to adjust the treatment and control averages would introduce a bias. This is a case of having a bad control. For the ATT estimation, the bias is reflected in ( X̄|treatment − X̄|control ) ĉ. This can be regarded as a selection bias, as it depends on the difference of the averages of Xi across the two subgroups.
Ti ──→ Xi                 Ti ──→ Xi
 ↘      ↑                  ↘      ↓
  Yi ←── Ẍi                 Yi ←── Ẍi

Left: bad control. Right: unnecessary control; “controlling” for Xi leads to a different interpretation of the estimated causal effect. It is now the causal effect that is not mediated via Xi.
If

E{Yi(0) | Xi} = δ0 + Xi δ1,  E{Yi(1) | Xi} = γ0 + Xi γ1,

then

E{ [Yi(0) − EYi(0)] | Xi } = (Xi − EXi) δ1

and

E{ [Yi(1) − EYi(1)] | Xi } = (Xi − EXi) γ1.
9.6 Nonparametric Methods under Strong Ignorability
Note that we are taking each observed value of Xi, evaluating m̂1 and m̂0 at each such value, and then averaging.
An advantage of the series method is that inference can be made by pretending that we have two (pseudo-)parametric regressions.
As a second example, we can use the kernel smoothing method to estimate m1(x) and m0(x) by

m̂0(x) = Σ_{j: Tj = 0} Yj Kh(Xj − x) / Σ_{j: Tj = 0} Kh(Xj − x),

m̂1(x) = Σ_{j: Tj = 1} Yj Kh(Xj − x) / Σ_{j: Tj = 1} Kh(Xj − x),

where

Kh(u) = (1/h) K(u/h),

K(·) is a kernel function, and h is the bandwidth.
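A sketch of this kernel (Nadaraya-Watson) estimator with a Gaussian kernel (the kernel choice and the test function are illustrative; applying kernel_mean to the Tj = 0 and Tj = 1 subsamples yields m̂0(x) and m̂1(x)):

```python
import numpy as np

def kernel_mean(x0, x, y, h):
    """Kernel-weighted average of y, i.e. an estimate of E[Y | X = x0]."""
    k = np.exp(-0.5 * ((x - x0) / h) ** 2)   # Gaussian kernel (constants cancel)
    return (y * k).sum() / k.sum()

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=2000)
y = x ** 2 + 0.1 * rng.normal(size=2000)

print(kernel_mean(0.5, x, y, h=0.1))  # close to 0.25 = E[Y | X = 0.5]
```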
The first term E(Yi | Ti = 1) can be estimated simply by taking the sample average of the treated observations:

Ê(Yi | Ti = 1) = Σ_{i=1}^n Yi 1{Ti = 1} / Σ_{i=1}^n 1{Ti = 1} = Σ_{i=1}^n Yi Ti / Σ_{i=1}^n Ti.
The problem is how to estimate the second, “counterfactual” term. For an individual with Ti = 1, Yi(0) is the counterfactual outcome, which by definition is not observable. Here we have to use the unconfoundedness assumption:

E(Yi(0) | Ti = 1) = E[ E(Yi(0) | Ti = 1, Xi) | Ti = 1 ]   (by LIE)
                 = E[ E(Yi(0) | Ti = 0, Xi) | Ti = 1 ]   (by unconfoundedness)
                 = E[ E(Yi | Ti = 0, Xi) | Ti = 1 ]      (by the definition of Yi).
We can estimate E(Yi | Ti = 0, Xi) by m̂0(Xi) and estimate E(Yi(0) | Ti = 1) by

Ê(Yi(0) | Ti = 1) = Σ_{i=1}^n m̂0(Xi) 1{Ti = 1} / Σ_{i=1}^n 1{Ti = 1}.
As in the case of the ATE, we can use any of the nonparametric methods to estimate m0(x) and m1(x). For example, suppose we use 1-nearest-neighbor matching for m̂0. Basically, what we are doing is taking each treated outcome and finding a control unit that has the closest value of X. We subtract the “matched” control outcome from each treated outcome, and average. This is a type of matching estimator, which we will discuss in some detail later.
These assumptions may be weakened when AT T or ATC is the object of interest. Under the
above two assumptions, we look at the propensity score approach to estimate the ATE and
ATT. Similar ideas apply to ATC estimation.
9.7 Propensity Score
The propensity score is defined as

p(x) = P(T = 1 | X = x),

or equivalently,

p(x) = E(T | X = x).

Theorem 77 If T ⊥ (Y(1), Y(0)) | X and p(x) ∈ (0, 1) for all x in the support of X, then T ⊥ (Y(1), Y(0)) | p(X).
Second,

P[T = 1 | p(X)] = E{ E[T | p(X), X] | p(X) }
              = E{ E[T | X] | p(X) }
              = E{ p(X) | p(X) } = p(X).

Hence

P[T = 1 | Y(1), Y(0), p(X)] = P[T = 1 | p(X)],

as desired.
In the SCM framework, we have
Theorem 78 If T ⊥ Ẍ | X and p(x) ∈ (0, 1) for all x in the support of X, then T ⊥ Ẍ | p(X).
The proof is the same but with (Y(1), Y(0)) replaced by Ẍ.
      p(Xi)
        ↑
        Xi
       ↙  ↘
     Ti    Ẍi
    ↗  ↘  ↙
  Ui     Yi
The above results imply that we can replace the covariate X with the scalar p(X) in our
previous regression-based approaches.
For the parametric case, recall that

Yi = α + Ti × ATE + ei,

where
ei = [Yi (0) − EYi (0)] + Ti {[Yi (1) − EYi (1)] − [Yi (0) − EYi (0)]} .
So

E(Yi | Ti, P(Xi)) = α + Ti × ATE + E(ei | Ti, P(Xi)),

where

E(ei | Ti, P(Xi)) = E{ [Yi(0) − EYi(0)] | Ti, P(Xi) } + Ti E[ { [Yi(1) − EYi(1)] − [Yi(0) − EYi(0)] } | Ti, P(Xi) ]
                 = E{ [Yi(0) − EYi(0)] | P(Xi) } + Ti E[ { [Yi(1) − EYi(1)] − [Yi(0) − EYi(0)] } | P(Xi) ].
If

E{Yi(0) | P(Xi)} = δ0 + P(Xi) δ1,  E{Yi(1) | P(Xi)} = γ0 + P(Xi) γ1,

then

E{ [Yi(0) − EYi(0)] | P(Xi) } = ( P(Xi) − EP(Xi) ) δ1

and

E{ [Yi(1) − EYi(1)] | P(Xi) } = ( P(Xi) − EP(Xi) ) γ1,

which imply that

E(Yi | Ti, P(Xi)) = α + Ti × ATE + E(ei | Ti, P(Xi))
               = α + Ti × ATE + ( P(Xi) − EP(Xi) ) δ1 + Ti × ( P(Xi) − EP(Xi) ) ( γ1 − δ1 ).

So we can estimate the ATE by regressing Yi on a constant, Ti, the centered propensity score, and their interaction. In the presence of constant treatment effects, we have

E(Yi | Ti, P(Xi)) = α + Ti × ATE + ( P(Xi) − EP(Xi) ) δ1,
and we can estimate the ATE by using P (Xi ) as the control variable. This is entirely analogous
to the control variable approach under the conditional mean independence assumption.
For the nonparametric approach, letting X̃ = p(X), we can estimate the ATE by

ÂTE = (1/n) Σ_{i=1}^n [ ŝ1(X̃i) − ŝ0(X̃i) ],

where ŝ1(x) is a nonparametric estimator of s1(x) = E(Yi | Ti = 1, X̃i = x) and ŝ0(x) is a nonparametric estimator of s0(x) = E(Yi | Ti = 0, X̃i = x). Similarly, we can estimate the ATT by

ÂTT = Σ_{i=1}^n [ Yi − ŝ0(X̃i) ] 1{Ti = 1} / Σ_{i=1}^n 1{Ti = 1}.
In the parametric method, we may specify a probit model for the propensity score:

p(x) = Φ(xδ).

In the (semi-)nonparametric method, we relax the above assumption slightly and may take

p(x) = Φ( δ0 + x δ1 + x² δ2 + ... + x^J δJ ).
How should X enter the probit or logit model? If X is multivariate and we include high-order terms and their interactions, then the number of terms can explode. Imbens and Rubin (2015, textbook) propose using stepwise regression, a sensible way to limit the number of X's:
1. Start with covariates that are expected a priori to matter. This requires some subject matter knowledge.
2. Add the covariate that has the largest test statistic (e.g., likelihood ratio), one at a time. If the test statistic exceeds a pre-specified threshold, then include the corresponding covariate. Iterate this procedure until no more test statistics exceed the threshold.
3. Do this for the first-order terms and then the second-order terms, with potentially different thresholds. We probably don't want to use fourth-order terms, and maybe not even third-order terms.
We could also use Lasso or other shrinkage methods. For Lasso, we choose the model that minimizes

− Σ_{i=1}^n log Li + λ Σ_{j=1}^J |δj|.
It is often argued in the literature that the propensity score approach reduces the dimensionality of the problem. This seems to be true on the surface. In the ATT estimation, if we use the single-nearest-neighbor estimator ŝ0, we essentially match a treated individual with an untreated individual with the closest propensity score. Originally we have to match according to a multidimensional vector X, and now we only have to match according to a scalar. We have seemingly achieved dimension reduction. However, the high-dimension problem is still there, as the propensity score still depends on the multidimensional vector X. So the dimension reduction has not really been achieved; it is just hidden a little deeper. Nevertheless, it is still insightful to know that it is sufficient to use the propensity score under the unconfoundedness assumption.
9.7.3 Matching
Note that the ATT can be estimated by

ÂTT = Σ_{i=1}^n [ Yi − m̂0(Xi) ] 1{Ti = 1} / Σ_{i=1}^n 1{Ti = 1}      (9.1)

or

ÂTT = Σ_{i=1}^n [ Yi − ŝ0(X̃i) ] 1{Ti = 1} / Σ_{i=1}^n 1{Ti = 1}.
As we discussed before, when m̂0 and ŝ0 are estimated by the kNN method with k = 1, the above estimators are matching estimators that entail matching a treated individual with the untreated individual that has the closest x value or the closest propensity score. Matching involves finding direct comparisons, that is, matches, for each individual.
For the kNN matching based on the values of the X variable, matching involves a distance measure and the number of points k we choose to match. For instance, we can use the Euclidean distance:

‖Xi − Xj‖ = sqrt( Σ_ℓ ( Xiℓ − Xjℓ )² ).
In practice, we need to standardize the covariates somehow (e.g., using the inverse of the
variances) in order to avoid the situation that some covariates dominate other covariates.
Selecting the value of k represents a subtle tradeoff. Matching more control cases to each treatment case results in a lower asymptotic variance of the treatment effect estimator, but it also tends to increase the bias, because the probability of making poorer matches increases with the number of matches.
A danger of selecting a fixed number of matches, such as 5, is that it may lead to some poor matches for some treatment cases. A version of nearest-neighbor matching, known as ‘radius’ or ‘caliper’ matching, is designed to remedy this drawback by restricting the matches to a chosen maximum distance. A hybrid approach is to keep the single nearest match and throw away the kNN matches whose distances are larger than a given maximum distance.
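The nearest-neighbor-with-caliper idea can be sketched as follows (a toy implementation for a scalar matching variable; the data are made up):

```python
import numpy as np

def att_matching_1nn(y, t, x, caliper=np.inf):
    """ATT by 1-nearest-neighbor matching with replacement on a scalar x,
    discarding treated cases whose best match is farther than the caliper."""
    y_c, x_c = y[t == 0], x[t == 0]
    diffs = []
    for yi, xi in zip(y[t == 1], x[t == 1]):
        d = np.abs(x_c - xi)
        j = d.argmin()
        if d[j] <= caliper:               # caliper: drop poor matches
            diffs.append(yi - y_c[j])
    return np.mean(diffs)

y = np.array([5.0, 4.0, 1.0, 3.0, 2.0, 0.0])
t = np.array([1, 1, 0, 0, 1, 0])
x = np.array([0.1, 0.9, 0.15, 0.95, 0.5, 0.45])
print(att_matching_1nn(y, t, x))  # (5-1 + 4-3 + 2-0)/3 = 7/3
```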
For the kernel matching based on the X variable, all control cases are used as counterfactuals for each treatment case, but each control case is weighted based on its distance from the treatment case. Note that

m̂0(Xi) = Σ_{j: Tj = 0} Yj Kh(Xj − Xi) / Σ_{j: Tj = 0} Kh(Xj − Xi)
       = Σ_{j=1}^n Yj Kh(Xj − Xi) 1{Tj = 0} / Σ_{j=1}^n Kh(Xj − Xi) 1{Tj = 0}
       = Σ_{j=1}^n Yj Wji 1{Tj = 0},

where

Wji = Kh(Xj − Xi) 1{Tj = 0} / Σ_{j=1}^n Kh(Xj − Xi) 1{Tj = 0}.
The weight attached to the jth control case for the ith treatment case is Wji .
A common diagnostic compares Pn(Xi | Ti = 1), the empirical distribution of X for the treatment group, with Pn(Xi | Ti = 0), the empirical distribution of X for the matched dataset, i.e., the set of control cases that are matched to any of the treatment cases. The recent matching literature has developed different algorithms to achieve “balance” as much as possible.
At the minimum, we should check whether the propensity scores are balanced across the
treatment group and matched control group.
The inverse probability weighting (IPW) estimator of the ATE is

ÃTE = (1/n) Σ_{i=1}^n [ Ti Yi / p(Xi) − (1 − Ti) Yi / (1 − p(Xi)) ] = (1/n) Σ_{i: Ti = 1} λi Yi − (1/n) Σ_{i: Ti = 0} λi Yi,

where

λi = 1 / [ p(Xi)^{Ti} (1 − p(Xi))^{1 − Ti} ] =
  1/p(Xi)         if Ti = 1,
  1/(1 − p(Xi))   if Ti = 0.

ÃTE is the difference between two weighted averages.
Remark 80 Let us compare the IPW estimator with the regression-based estimator

ÂTE = (1/n) Σ_{i=1}^n [ m̂1(Xi) − m̂0(Xi) ],

where

m̂1(Xi) = Ê[Yi | Ti = 1, Xi],  m̂0(Xi) = Ê[Yi | Ti = 0, Xi],

and Ê(·|·) stands for an estimator of the conditional mean E(·|·). Note that

E[Yi | Ti = 1, Xi] = E[Yi Ti | Xi] / E[Ti | Xi]  and  E[Yi | Ti = 0, Xi] = E[Yi (1 − Ti) | Xi] / E[1 − Ti | Xi].

If we estimate E[Yi | Ti = 1, Xi] by Ê[Yi Ti | Xi] / Ê[Ti | Xi] and E[Yi | Ti = 0, Xi] by Ê[Yi (1 − Ti) | Xi] / Ê[1 − Ti | Xi], then

ÂTE = (1/n) Σ_{i=1}^n [ Ê[Ti Yi | Xi] / p̂(Xi) − Ê[(1 − Ti) Yi | Xi] / (1 − p̂(Xi)) ].

This estimator is close to ÃTE, which is

(1/n) Σ_{i=1}^n [ Ti Yi / p̂(Xi) − (1 − Ti) Yi / (1 − p̂(Xi)) ]

when the estimated propensity score is plugged in. The difference appears to be whether Ê[Ti Yi | Xi] or Ti Yi is used.
Inverse probability weighting was proposed by Horvitz and Thompson (1952) in a somewhat different setting. One problem in practice with this estimator is that the weights do not necessarily add up to 1. We could modify the estimator by normalizing the weights so they add up to one:

ÃTE_renorm = [ Σ_{i=1}^n Ti Yi / p(Xi) ] / [ Σ_{i=1}^n Ti / p(Xi) ] − [ Σ_{i=1}^n (1 − Ti) Yi / (1 − p(Xi)) ] / [ Σ_{i=1}^n (1 − Ti) / (1 − p(Xi)) ].
Example 81 To provide some intuition behind inverse probability weighting, consider the following hypothetical data set with a binary X and binary treatment T:

i      1    2    3    4    5    6    7    8    9    10
Xi     0    0    0    0    0    1    1    1    1    1
Ti     0    1    1    1    1    0    0    1    1    1
Yi     3    4    5    1    2    6    0    3    9    7
p(Xi)  4/5  4/5  4/5  4/5  4/5  3/5  3/5  3/5  3/5  3/5
In order to avoid small-sample problems, we can think of each i as representing a large number of individuals, say 10,000 individuals.
We have

ÃTE = (1/10) Σ_{i=1}^{10} [ Ti Yi / p(Xi) − (1 − Ti) Yi / (1 − p(Xi)) ]
    = (1/10) [ (4 + 5 + 1 + 2)/(4/5) + (3 + 9 + 7)/(3/5) − ( 3/(1 − 4/5) + (6 + 0)/(1 − 3/5) ) ] = 5/3.

Note that ÃTE can be rewritten, grouping the terms by the value of Xi, as

ÃTE = (1/10) [ (4 + 5 + 1 + 2)/(4/5) − 3/(1 − 4/5) ] + (1/10) [ (3 + 9 + 7)/(3/5) − (6 + 0)/(1 − 3/5) ] = 5/3.
Consider first the X = 0 subsample:

i    1    2    3    4    5
Xi   0    0    0    0    0
Ti   0    1    1    1    1
Yi   3    4    5    1    2
For i = 1, we observe Yi(0) = 3. We ask: what would have happened to individuals 2, 3, 4, 5 had they chosen T = 0? Because of the randomization within the subsample, we expect each of Yi(0) for i = 2, 3, 4, 5 to be 3. So individual 1 can be regarded as representing all five individuals in the counterfactual world where everyone chooses the control state. Similarly, each of individuals i = 2, 3, 4, 5 can be regarded as representing 1.25 individuals in the counterfactual world where everyone chooses the treatment state. We have effectively generated a pseudo-subsample with 10 individuals: five copies of individual 1, each with outcome 3, in the control state, and the four treated individuals 2, 3, 4, 5 plus one extra pseudo-individual with the average treated outcome (4 + 5 + 1 + 2)/4 = 3. ATE(0) is then given by

ATE(0) = [ (4 + 5 + 1 + 2)/4 + 4 + 5 + 1 + 2 ] / 5 − 3 = 0.
Another way to estimate the ATE is to take an average of ATE(x) for x = 0 and 1, where

ATE(0) = (4 + 5 + 1 + 2)/4 − 3 = 0,
ATE(1) = (3 + 9 + 7)/3 − (6 + 0)/2 = 10/3,
leading to

ÂTE = (1/2) × 0 + (1/2) × (10/3) = 5/3.
Clearly the two estimators are the same. This is not a coincidence! We can prove this rigor-
ously.
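The computation in Example 81 can be replicated directly (the data are exactly those in the table above):

```python
import numpy as np

x = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
t = np.array([0, 1, 1, 1, 1, 0, 0, 1, 1, 1])
y = np.array([3, 4, 5, 1, 2, 6, 0, 3, 9, 7], dtype=float)
p = np.where(x == 0, 4 / 5, 3 / 5)       # known propensity scores

ate_ipw = np.mean(t * y / p - (1 - t) * y / (1 - p))

# Average of the within-cell estimates ATE(0) = 0 and ATE(1) = 10/3
ate_by_cell = 0.5 * 0.0 + 0.5 * (10.0 / 3.0)
print(ate_ipw, ate_by_cell)  # both equal 5/3
```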
In practice, we do not know the propensity score and have to use an estimated version p̂(Xi), leading to

ÃTE_renorm = (1/n) Σ_{i: Ti = 1} λ̂i Yi − (1/n) Σ_{i: Ti = 0} λ̂i Yi,

where

λ̂i = n [p̂(Xi)]^{−1} / Σ_{j: Tj = 1} [p̂(Xj)]^{−1}        if Ti = 1,
λ̂i = n [1 − p̂(Xi)]^{−1} / Σ_{j: Tj = 0} [1 − p̂(Xj)]^{−1}  if Ti = 0.
In a completely random experiment, \(\hat p(X_i)\) is a constant, so \(\hat\lambda_i = (n/n_1)^{T_i}(n/n_0)^{1-T_i}\)
and \(\widehat{ATE}_{renorm}\) is then equal to \(\bar Y_{treatment} - \bar Y_{control}\).
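This collapse can be illustrated numerically. In the sketch below the estimated propensity is set to a hypothetical constant 0.7 (an illustrative choice, not from the example above), and the renormalized estimator reduces to the difference of sample means:

```python
# Renormalized IPW estimator with a constant estimated propensity score.
import numpy as np

T = np.array([0, 1, 1, 1, 1, 0, 0, 1, 1, 1])
Y = np.array([3, 4, 5, 1, 2, 6, 0, 3, 9, 7], dtype=float)
n = len(Y)
p_hat = np.full(n, 0.7)  # hypothetical constant estimated propensity

# lambda_i hat: inverse-propensity weights, renormalized to sum to n
# within each treatment arm.
lam = np.where(
    T == 1,
    n * (1 / p_hat) / np.sum(1 / p_hat[T == 1]),
    n * (1 / (1 - p_hat)) / np.sum(1 / (1 - p_hat[T == 0])),
)
ate_renorm = (lam[T == 1] * Y[T == 1]).sum() / n - (lam[T == 0] * Y[T == 0]).sum() / n
print(ate_renorm)  # equals mean(Y | T=1) - mean(Y | T=0) = 31/7 - 3 = 10/7
```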
Now consider the ATT. We can construct a weighting estimator for this quantity as well,
building upon our approach for the ATE. First, note that
\[
\begin{aligned}
ATT &= \int ATE(x)\, f_X(x|T=1)\,dx\\
&= \int [m_1(x)-m_0(x)]\, f_X(x|T=1)\,dx\\
&= \int [m_1(x)-m_0(x)]\, \frac{f_X(x)\,P(T=1|X=x)}{\int f_X(v)\,P(T=1|X=v)\,dv}\,dx\\
&= \int [m_1(x)-m_0(x)]\, \frac{f_X(x)\,p(x)}{\int f_X(v)\,p(v)\,dv}\,dx\\
&= \frac{E\left\{[m_1(X)-m_0(X)]\,p(X)\right\}}{E\,p(X)}.
\end{aligned}
\]
The sample analogue of the last expression is
\[
\widehat{ATT} = \frac{\frac{1}{n}\sum_{i=1}^n \left[\frac{Y_i T_i}{p(X_i)} - \frac{Y_i(1-T_i)}{1-p(X_i)}\right] p(X_i)}{\frac{1}{n}\sum_{i=1}^n p(X_i)}.
\]
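As a check, the sketch below evaluates this ATT estimator on the 10-observation example with the known propensity scores (numpy assumed available):

```python
# Weighted ATT estimator on the 10-observation example.
import numpy as np

T = np.array([0, 1, 1, 1, 1, 0, 0, 1, 1, 1])
Y = np.array([3, 4, 5, 1, 2, 6, 0, 3, 9, 7], dtype=float)
p = np.array([4/5] * 5 + [3/5] * 5)

# Numerator: the IPW ATE summand reweighted by p(X_i);
# denominator: the sample mean of p(X_i).
num = np.mean((T * Y / p - (1 - T) * Y / (1 - p)) * p)
att = num / np.mean(p)
print(att)  # 10/7, which matches (4/7)*ATE(0) + (3/7)*ATE(1)
```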
9.8 DOUBLY ROBUST ESTIMATOR 218
The problem is that we do not observe both \(Y_i(1)\) and \(Y_i(0)\). Instead, we only observe
\[
Y_i = T_i\, Y_i(1) + (1 - T_i)\, Y_i(0).
\]
\(\widehat{ATE}_R\) is a consistent estimator of the ATE only if \(p(X_i, \theta)\) is correctly specified (and
hence \(p(X_i, \hat\theta)\) is a consistent estimator of the true propensity score).
It is possible to combine the two estimators so that the new estimator is less subject to
model misspecification. Consider
\[
\widehat{ATE}_{DR} = \frac{1}{n}\sum_{i=1}^n\left[m_1(X_i,\hat\alpha_1) + \frac{T_i\,(Y_i - m_1(X_i,\hat\alpha_1))}{p(X_i,\hat\theta)}\right]
- \frac{1}{n}\sum_{i=1}^n\left[m_0(X_i,\hat\alpha_0) + \frac{(1-T_i)\,(Y_i - m_0(X_i,\hat\alpha_0))}{1-p(X_i,\hat\theta)}\right],
\]
where \(m_0(X,\alpha_0)\) and \(m_1(X,\alpha_1)\) are the postulated models for the true regression functions \(E(Y|T=0,X)\)
and \(E(Y|T=1,X)\) (fitted by OLS). \(\widehat{ATE}_{DR}\) may be viewed as taking the regression
estimator and "augmenting" it with an adjustment term.
We can rewrite \(\widehat{ATE}_{DR}\) as
\[
\widehat{ATE}_{DR} = \frac{1}{n}\sum_{i=1}^n\left[\frac{T_i Y_i}{p(X_i,\hat\theta)} - \frac{T_i - p(X_i,\hat\theta)}{p(X_i,\hat\theta)}\, m_1(X_i,\hat\alpha_1)\right]
- \frac{1}{n}\sum_{i=1}^n\left[\frac{(1-T_i)\,Y_i}{1-p(X_i,\hat\theta)} + \frac{T_i - p(X_i,\hat\theta)}{1-p(X_i,\hat\theta)}\, m_0(X_i,\hat\alpha_0)\right],
\]
so \(\widehat{ATE}_{DR}\) may also be viewed as taking the IPW estimator and "augmenting" it with an
adjustment term.
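The equivalence of the two forms is an algebraic identity that holds for any fitted values. The sketch below checks it numerically with arbitrary placeholder values for the fitted objects (placeholder data, numpy assumed available):

```python
# Check that the "regression + augmentation" and "IPW + augmentation"
# forms of the DR estimator agree for arbitrary fitted values.
import numpy as np

rng = np.random.default_rng(0)
n = 50
T = rng.integers(0, 2, n)
Y = rng.normal(size=n)
p_hat = rng.uniform(0.2, 0.8, n)   # placeholder fitted propensities
m1_hat = rng.normal(size=n)        # placeholder fitted E(Y|T=1,X)
m0_hat = rng.normal(size=n)        # placeholder fitted E(Y|T=0,X)

# Form 1: regression estimator plus an IPW correction term.
dr1 = np.mean(m1_hat + T * (Y - m1_hat) / p_hat) \
    - np.mean(m0_hat + (1 - T) * (Y - m0_hat) / (1 - p_hat))

# Form 2: IPW estimator with (T_i - p_hat) augmentation terms.
mu1 = np.mean(T * Y / p_hat - (T - p_hat) / p_hat * m1_hat)
mu0 = np.mean((1 - T) * Y / (1 - p_hat) + (T - p_hat) / (1 - p_hat) * m0_hat)
dr2 = mu1 - mu0

print(abs(dr1 - dr2))  # zero up to floating-point error
```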
Let
\[
\hat\mu_{1,DR} = \frac{1}{n}\sum_{i=1}^n\left[\frac{T_i Y_i}{p(X_i,\hat\theta)} - \frac{T_i - p(X_i,\hat\theta)}{p(X_i,\hat\theta)}\, m_1(X_i,\hat\alpha_1)\right],
\]
\[
\hat\mu_{0,DR} = \frac{1}{n}\sum_{i=1}^n\left[\frac{(1-T_i)\,Y_i}{1-p(X_i,\hat\theta)} + \frac{T_i - p(X_i,\hat\theta)}{1-p(X_i,\hat\theta)}\, m_0(X_i,\hat\alpha_0)\right].
\]
In the derivation of the probability limit of \(\hat\mu_{1,DR}\), the last line follows because \(m_1(X_i, \alpha_1) = E(Y_i|T_i=1, X_i) = E(Y_i(1)|T_i=1, X_i) = E(Y_i(1)|X_i)\).
So, as long as \(m_1(X_i, \alpha_1)\) is correctly specified, even if \(p(X_i, \theta)\) is not, \(\hat\mu_{1,DR}\) is a consistent
estimator of \(E\,Y_i(1)\). Similarly, \(\hat\mu_{0,DR}\) is a consistent estimator of \(E\,Y_i(0)\). Therefore,
\(\hat\mu_{1,DR} - \hat\mu_{0,DR}\) is a consistent estimator of the ATE.
To sum up, \(\hat\mu_{1,DR} - \hat\mu_{0,DR}\) is a consistent estimator of the ATE if either the propensity score
function or the conditional mean function is correctly specified. This property of \(\hat\mu_{1,DR} - \hat\mu_{0,DR}\)
is referred to as double robustness. The double robustness offers some protection against
model misspecification.
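Double robustness can be seen in a small simulation sketch. The design below is illustrative: outcomes are linear in X with a true ATE of 2, the outcome models are correctly specified (OLS on a constant and X within each arm), while the propensity model is badly misspecified (a constant 1/2). The DR estimate still recovers the true ATE:

```python
# Simulation sketch of double robustness: correct outcome models,
# misspecified propensity model, yet a consistent DR estimate.
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
X = rng.normal(size=n)
p_true = 1 / (1 + np.exp(-X))          # true propensity: logistic in X
T = (rng.uniform(size=n) < p_true).astype(float)
Y = X + 2 * T + rng.normal(size=n)     # Y(1) - Y(0) = 2 for everyone

def ols_fit(mask):
    """OLS of Y on (1, X) within one arm; return fitted values at all X_i."""
    Z = np.column_stack([np.ones(mask.sum()), X[mask]])
    beta, *_ = np.linalg.lstsq(Z, Y[mask], rcond=None)
    return beta[0] + beta[1] * X

m1_hat = ols_fit(T == 1)               # correctly specified outcome models
m0_hat = ols_fit(T == 0)
p_hat = np.full(n, 0.5)                # misspecified propensity model

mu1 = np.mean(m1_hat + T * (Y - m1_hat) / p_hat)
mu0 = np.mean(m0_hat + (1 - T) * (Y - m0_hat) / (1 - p_hat))
print(mu1 - mu0)  # close to the true ATE of 2
```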
For the ATT estimation, we may use
\[
\widehat{ATT}_{DR} = \frac{\sum_{i=1}^n Y_i T_i}{\sum_{i=1}^n T_i}
- \left\{\frac{\sum_{i=1}^n T_i\, m_0(X_i,\hat\alpha_0)}{\sum_{i=1}^n T_i}
+ \frac{1}{\sum_{i=1}^n T_i}\sum_{i=1}^n (1-T_i)\,\frac{[Y_i - m_0(X_i,\hat\alpha_0)]\, p(X_i,\hat\theta)}{1-p(X_i,\hat\theta)}\right\}.
\]
It suffices to show that the second term in the above difference converges to
\(E\left(Y_i(0)\,|\,T_i = 1\right)\).
The limit should be
\[
A = \frac{E\,T_i\, m_0(X_i,\alpha_0)}{E\,T_i} + \frac{1}{E\,T_i}\cdot E\,\frac{(1-T_i)\,[Y_i - m_0(X_i,\alpha_0)]\,p(X_i,\theta)}{1-p(X_i,\theta)},
\]
where
\[
\begin{aligned}
&\frac{1}{E\,T_i}\cdot E\,\frac{(1-T_i)\,[Y_i - m_0(X_i,\alpha_0)]\,p(X_i,\theta)}{1-p(X_i,\theta)}\\
&= \frac{1}{E\,T_i}\cdot E\,\frac{(1-T_i)\,[Y_i(0) - m_0(X_i,\alpha_0)]\,p(X_i,\theta)}{1-p(X_i,\theta)}\\
&= \frac{1}{E\,T_i}\cdot E\,\frac{(1-T_i)\,[Y_i(0) - m_0(X_i,\alpha_0)]\,[p(X_i,\theta)-T_i]}{1-p(X_i,\theta)}\\
&= \frac{1}{E\,T_i}\cdot E\,\frac{(1-T_i)\,[Y_i(0) - m_0(X_i,\alpha_0)]\,[p(X_i,\theta)-1-(T_i-1)]}{1-p(X_i,\theta)}\\
&= \frac{1}{E\,T_i}\left\{ -E\,(1-T_i)\,[Y_i(0) - m_0(X_i,\alpha_0)] + E\,\frac{(1-T_i)\,[Y_i(0) - m_0(X_i,\alpha_0)]}{1-p(X_i,\theta)}\right\}\\
&= \frac{1}{E\,T_i}\left\{ -E\,(1-T_i)\,[Y_i(0) - m_0(X_i,\alpha_0)]
+ E\,\frac{\{1-p(X_i,\theta) - [T_i - p(X_i,\theta)]\}\,[Y_i(0) - m_0(X_i,\alpha_0)]}{1-p(X_i,\theta)}\right\}\\
&= \frac{E\,T_i\,[Y_i(0) - m_0(X_i,\alpha_0)]}{E\,T_i} - \frac{1}{E\,T_i}\, E\,\frac{[T_i - p(X_i,\theta)]\,[Y_i(0) - m_0(X_i,\alpha_0)]}{1-p(X_i,\theta)}.
\end{aligned}
\]
So
\[
\begin{aligned}
A &= \frac{E\,T_i\, Y_i(0)}{E\,T_i} - \frac{1}{E\,T_i}\, E\,\frac{[T_i - p(X_i,\theta)]\,[Y_i(0) - m_0(X_i,\alpha_0)]}{1-p(X_i,\theta)}\\
&= E\left[Y_i(0)\,|\,T_i = 1\right] - \frac{1}{E\,T_i}\, E\,\frac{[T_i - p(X_i,\theta)]\,[Y_i(0) - m_0(X_i,\alpha_0)]}{1-p(X_i,\theta)}.
\end{aligned}
\]
For the same reason as above, the second term in A is zero as long as either \(p(X_i, \theta)\) or
\(m_0(X_i, \alpha_0)\) is correctly specified. In this case, \(A = E\left[Y_i(0)\,|\,T_i = 1\right]\) and \(\widehat{ATT}_{DR}\) converges to
ATT.
Can you design a doubly robust estimator for the ATC?