Econometric models based on observational data are often endogenous owing to measurement error, autocorrelated errors, simultaneity, omitted variables, non-random sampling, self-selection, and similar problems. Without corrective measures, parameter estimates of these models may be inconsistent. The potentially high-dimensional nature of these models (where the dimension of the parameters of interest is comparable to or even larger than the sample size) further complicates statistical estimation and inference. My dissertation studies two different types of high-dimensional endogenous econometric problems in depth and develops statistical tools together with their theoretical guarantees.
The first essay in this dissertation explores the validity of the two-stage regularized least squares estimation procedure for sparse linear models in high-dimensional settings with possibly many endogenous regressors. The second essay focuses on the semiparametric sample selection model in high-dimensional settings under a weak nonparametric restriction on the form of the selection correction, for which a multi-stage projection-based regularized procedure is proposed. The number of regressors in the main equation, p, and the number of regressors in the first-stage equation, d, can grow with and exceed the sample size n in the respective models. The analysis considers the sparse case, in which the number of non-zero components in the coefficient vectors is bounded above by some integer that is allowed to grow with n, but slowly relative to n, as well as the case in which the coefficient vectors can be approximated by exactly sparse vectors. Simulations are conducted to gain insight into the small-sample performance of these high-dimensional multi-stage estimators. The estimators proposed in the second essay are also applied to study the pricing decisions of gasoline retailers in the Greater Saint Louis area.
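As a rough illustration of the two-stage idea described above, and not the dissertation's actual implementation, the following Python sketch fits an l1-penalized (Lasso) regression of each endogenous regressor on the first-stage regressors and then runs a Lasso of the outcome on the fitted values. The choice of the Lasso as the regularizer, the coordinate-descent solver, the function names (soft_threshold, lasso_cd, two_stage_lasso), the simulated design, and the tuning values lam1 and lam2 are all assumptions made for this example. The toy dimensions keep n larger than p and d so the example runs quickly.

```python
import numpy as np


def soft_threshold(a, lam):
    """Soft-thresholding operator: sign(a) * max(|a| - lam, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)


def lasso_cd(X, y, lam, n_iter=100):
    """Lasso by cyclic coordinate descent (illustrative solver).

    Minimizes (1 / (2n)) * ||y - X b||_2^2 + lam * ||b||_1.
    """
    n, k = X.shape
    b = np.zeros(k)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()  # equals y - X b, since b starts at zero
    for _ in range(n_iter):
        for j in range(k):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual.
            rho = X[:, j] @ resid / n + col_sq[j] * b[j]
            b_new = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (b[j] - b_new)
            b[j] = b_new
    return b


def two_stage_lasso(y, X, Z, lam1, lam2):
    """Stage 1: Lasso of each column of X on Z. Stage 2: Lasso of y on fitted X."""
    X_hat = np.column_stack(
        [Z @ lasso_cd(Z, X[:, j], lam1) for j in range(X.shape[1])]
    )
    beta_hat = lasso_cd(X_hat, y, lam2)
    return beta_hat, X_hat


# Hypothetical simulated design (all values are assumptions for this sketch).
rng = np.random.default_rng(0)
n, d, p = 200, 30, 20
Z = rng.standard_normal((n, d))            # first-stage regressors
Pi = np.zeros((d, p))
Pi[np.arange(p), np.arange(p)] = 1.0       # sparse first-stage coefficients
eps = rng.standard_normal(n)               # main-equation error
eta = 0.5 * eps[:, None] + rng.standard_normal((n, p))  # correlated with eps
X = Z @ Pi + eta                           # endogenous regressors
beta = np.zeros(p)
beta[:3] = [1.0, -1.0, 0.5]                # sparse main-equation coefficients
y = X @ beta + eps

beta_hat, X_hat = two_stage_lasso(y, X, Z, lam1=0.1, lam2=0.1)
print(np.round(beta_hat, 2))
```

In practice the tuning parameters would be chosen according to the scaling conditions or by a data-driven rule. The fixed value 0.1 is used here only to keep the sketch short.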
The main theoretical results of both essays are finite-sample bounds, which yield sufficient scaling conditions on the sample size for estimation consistency and variable selection consistency (i.e., the multi-stage high-dimensional estimation procedures correctly select the non-zero coefficients in the main equation with high probability). A technical issue regarding the so-called “restricted eigenvalue (RE) condition” for estimation consistency and the “mutual incoherence (MI) condition” for selection consistency arises in these multi-stage estimation procedures from allowing the number of regressors in the main equation to exceed n, and this dissertation provides analysis to verify these RE and MI conditions. In particular, for the semiparametric sample selection model, these verifications also provide a finite-sample guarantee of the population identification condition that the model requires.
In the second essay, the statistical efficiency of the proposed estimators is studied via lower bounds on minimax risks. The result shows that, for a family of models with an exactly sparse coefficient vector in the main equation, one of the proposed estimators attains the smallest estimation error, up to the (n, d, p)-scaling, among a class of procedures in worst-case scenarios. Two inference procedures for the coefficients of the main equation are discussed: one based on a pivotal Dantzig selector to construct non-asymptotic confidence sets, and one based on a post-selection strategy (when perfect or near-perfect selection of the high-dimensional coefficients is achieved). Other theoretical contributions of this essay include establishing the non-asymptotic counterpart of the familiar asymptotic “oracle” type of results from previous literature: the estimator of the coefficients in the main equation behaves as if the unknown nonparametric component were known, provided the nonparametric component is sufficiently smooth.