Econometrics II Handout For Students
A dummy variable is one that takes the value 0 or 1 to indicate the absence or presence of
some categorical effect that may be expected to shift the outcome. Dummy variables:
• represent attributes that are qualitative by nature, i.e. not directly quantified or measured;
• are "proxy" (substitute) variables, i.e. numeric stand-ins for qualitative facts in a regression model.
In regression analysis, the dependent variable can be influenced by qualitative variables
(gender, religion, geographic region, etc.) as well as quantitative variables (income, age, consumption, price, etc.).
Qualitative (categorical, non-quantified) data are therefore often represented by dummy variables.
1.2. Dummy as Independent Variables
In regression analysis the dependent variable is also affected by variables that are essentially
qualitative in nature (e.g., sex, race, color, religion, nationality, wars, earthquakes, strikes,
political upheavals, and changes in government economic policy).
For example, holding all other factors constant, female professors are found to earn less
than their male counterparts, and nonwhites are found to earn less than whites.
This pattern may result from sex or racial discrimination, but whatever the reason, qualitative
variables such as sex and race do influence the dependent variable and clearly should be included
among the explanatory variables.
Such qualitative variables usually indicate the presence or absence of a “quality” or an attribute,
such as male or female, black or white, or Christian or Muslim. One method of “quantifying”
such attributes is to construct artificial variables that take on the values 1 or 0,
with 0 indicating the absence and 1 indicating the presence (possession) of that attribute.
E.g., 1 may indicate that a person is male and 0 that the person is female; or 1 may indicate that
a person is a college graduate and 0 that he or she is not, and so on.
Variables that assume such 0 and 1 values are called dummy (or indicator, binary, categorical, or
dichotomous) variables.
Dummy variables can be used in regression models just as easily as quantitative variables.
A regression model that contains only dummy explanatory variables is called an Analysis of
Variance (ANOVA) model.
Example:
$Y_i = \alpha + \beta D_i + u_i$ ------------------------------------------ (1.1)
Where: $Y_i$ = annual salary of a college professor
$D_i$ = 1 if male college professor
      = 0 otherwise (i.e., female professor)
NB: we shall denote all dummy variables by the letter D.
Assuming that the disturbance term satisfies the usual assumptions of the classical linear regression
model (in particular, that the mean of the error term is zero), we obtain from (1.1):
Mean salary of female college professor: $E(Y_i \mid D_i = 0) = \alpha$ ….………………….(1.2)
Mean salary of male college professor: $E(Y_i \mid D_i = 1) = \alpha + \beta$
β tells by how much the mean salary of a male college professor differs from the
mean salary of his female counterpart, with α + β reflecting the mean salary of the male
college professor. The difference is meaningful only where β is statistically significant.
Consider the following hypothetical data on starting salaries of college teachers by sex:
Starting salary (Y)    Sex (1 = male, 0 = female)
22,000                 1
19,000                 0
18,000                 0
21,700                 1
18,500                 0
21,000                 1
20,500                 1
17,000                 0
17,500                 0
21,200                 1
The estimated mean salary of females is
$\dfrac{19{,}000 + 18{,}000 + 18{,}500 + 17{,}000 + 17{,}500}{5} = 18{,}000$
The estimated mean salary of males is
$\dfrac{22{,}000 + 21{,}700 + 21{,}000 + 20{,}500 + 21{,}200}{5} = 21{,}280$
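Therefore, the results of the regression analysis are presented as follows:
$\hat{Y}_i = 18{,}000 + 3{,}280 D_i$
se = (0.32) (0.44)   (apparently expressed in thousands of birr, given the t ratios)
t = (57.74) (7.439)
$R^2 = 0.8737$
The above results show that the estimated mean salary of a female college instructor is birr
18,000 (the intercept), and that of a male instructor is birr 21,280 (= 18,000 + 3,280).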
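The link between the group means and the dummy regression can be checked numerically. The following is a minimal sketch in plain Python (the helper name `ols_simple` is an illustrative choice, not part of the handout); it fits model (1.1) to the salary table above and compares the coefficients with the two group means.

```python
# Starting salaries (birr) and sex dummy (1 = male, 0 = female) from the table above.
salary = [22000, 19000, 18000, 21700, 18500, 21000, 20500, 17000, 17500, 21200]
male = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]


def ols_simple(x, y):
    """Return (intercept, slope) of the OLS regression of y on a single regressor x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


alpha_hat, beta_hat = ols_simple(male, salary)

# Group means computed directly from the data.
female_mean = sum(s for s, d in zip(salary, male) if d == 0) / male.count(0)
male_mean = sum(s for s, d in zip(salary, male) if d == 1) / male.count(1)

print("alpha_hat (female mean):", alpha_hat, female_mean)
print("alpha_hat + beta_hat (male mean):", alpha_hat + beta_hat, male_mean)
```

The intercept reproduces the female mean and the intercept plus the dummy coefficient reproduces the male mean, which is the interpretation given to α and α + β above.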
A/ Regression on one quantitative variable and one qualitative variable with two classes
Consider the model:
$Y_i = \alpha_1 + \alpha_2 D_i + \beta X_i + u_i$ ---------------------------- (1.3)
Where:
$Y_i$ = annual salary of a college professor
$X_i$ = years of teaching experience
$D_i$ = 1 if male and $D_i$ = 0 otherwise
other two categories differ from the intercept of the base category, which can be readily
checked as follows. Assuming $E(u_i) = 0$, we obtain from equation (1.6):
$E(Y_i \mid D_2 = 0, D_3 = 0, X_i) = \alpha_1 + \beta X_i$
$E(Y_i \mid D_2 = 1, D_3 = 0, X_i) = (\alpha_1 + \alpha_2) + \beta X_i$
$E(Y_i \mid D_2 = 0, D_3 = 1, X_i) = (\alpha_1 + \alpha_3) + \beta X_i$
which are, respectively, the mean health care expenditure functions for the three levels of
education. Geometrically, the situation is shown in fig 1.2 (for illustrative purposes it is assumed
that $\alpha_3 > \alpha_2$).
Until now, in the models considered in this chapter we assumed that the qualitative variables
affect the intercept but not the slope coefficient of the various subgroup regressions. But what if
the slopes are also different? If the slopes are in fact different, testing for differences in the
intercepts may be of little practical significance. Therefore, we need to develop a general
methodology to find out whether two (or more) regressions are different, where the difference
may be in the intercepts or the slopes or both.
Implicit in this model is the assumption that the differential effect of the sex dummy $D_2$ is
constant across the two levels of education and the differential effect of the education dummy
$D_3$ is also constant across the two sexes. That is, if, say, the mean expenditure on clothing is
higher for females than for males, this is so whether they are college graduates or not.
Likewise, if, say, college graduates on average spend more on clothing than non-college
graduates, this is so whether they are females or males.
In practice, however, a female college graduate may spend more on clothing than a male graduate.
In other words, the effect of $D_2$ and $D_3$ on mean Y may not be simply additive as in (1.8) but
multiplicative as well, as in the following model:
$Y_i = \alpha_1 + \alpha_2 D_{2i} + \alpha_3 D_{3i} + \alpha_4 (D_{2i} D_{3i}) + \beta X_i + u_i$ ----------------- (1.9)
This shows that the mean clothing expenditure of graduate females differs (by $\alpha_4$) from
what the additive effects of sex and education alone would imply.
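As a worked illustration of (1.9), the four conditional means can be written out. This sketch assumes the coding $D_2 = 1$ for female and $D_3 = 1$ for college graduate (the coding is inferred from the surrounding text, not stated explicitly) and $E(u_i) = 0$:

```latex
\begin{aligned}
E(Y_i \mid D_2 = 0, D_3 = 0, X_i) &= \alpha_1 + \beta X_i
  && \text{(male non-graduate, base category)} \\
E(Y_i \mid D_2 = 1, D_3 = 0, X_i) &= (\alpha_1 + \alpha_2) + \beta X_i
  && \text{(female non-graduate)} \\
E(Y_i \mid D_2 = 0, D_3 = 1, X_i) &= (\alpha_1 + \alpha_3) + \beta X_i
  && \text{(male graduate)} \\
E(Y_i \mid D_2 = 1, D_3 = 1, X_i) &= (\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4) + \beta X_i
  && \text{(female graduate)}
\end{aligned}
```

Under these assumptions, $\alpha_4$ is the part of the female-graduate intercept that is not explained by adding the separate sex and education effects; if $\alpha_4 = 0$ the model collapses to the purely additive case.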
In this situation, considering a Qualitative Response Model (QRM) is very important. QRMs
are models in which the dependent variable is a discrete outcome. There are two broad
categories of QRM:
A. Binomial models: the choice is between two alternatives.
B. Multinomial models: the choice is between more than two alternatives.
Example: Y = 1, occupation is farming
           = 2, occupation is carpentry
           = 3, occupation is fishing
Binary variables are variables that have two categories and are often used to indicate that an
event has occurred or that some characteristic is present.
Example: the decision to participate or not to participate in the labor force.
Multinomial variables occur when there are more than two possible outcomes.
The types of binomial models are:
1. The linear probability model (LPM)
2. The logit model
3. The probit model
The linear probability model (LPM) is written as:
$Y_i = \beta_0 + \beta_1 X_i + U_i$ ……………………………(1)
$E(Y_i \mid X_i) = \beta_0 + \beta_1 X_i$ …………………………………….(2)
Now, letting $P_i$ = probability that $Y_i = 1$ (that is, that the event occurs) and $1 - P_i$ = probability
that $Y_i = 0$ (that is, that the event does not occur), the variable $Y_i$ has the following distribution:
$Y_i$     Probability
0         $1 - P_i$
1         $P_i$
Total     1
By the definition of expectation, $E(Y_i) = 0(1 - P_i) + 1(P_i) = P_i$, so that
$E(Y_i \mid X_i) = \beta_0 + \beta_1 X_i = P_i$ ……………………………………(4)
Since the probability $P_i$ must lie between 0 and 1, we have the restriction $0 \le E(Y_i \mid X_i) \le 1$; that is,
the conditional expectation, or conditional probability, must lie between 0 and 1.
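The restriction above is not enforced by OLS itself. As a small illustration, the sketch below fits the LPM to hypothetical data (the income and ownership values are invented for this example and do not come from the handout) and lists the fitted "probabilities" that fall outside the unit interval.

```python
# Hypothetical data (an assumption for illustration only):
# income = family income in arbitrary units, owns_house = 1 if the family owns a house.
income = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
owns_house = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]


def ols_simple(x, y):
    """Return (intercept, slope) of the OLS regression of y on a single regressor x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


b0, b1 = ols_simple(income, owns_house)

# Fitted values of the LPM are interpreted as estimated probabilities P_i.
fitted = [b0 + b1 * x for x in income]
outside = [round(p, 3) for p in fitted if p < 0 or p > 1]

print("intercept, slope:", round(b0, 4), round(b1, 4))
print("fitted values outside [0, 1]:", outside)
```

With these invented numbers the fitted value is negative at the lowest income and exceeds 1 at the highest, which is the usual motivation for moving from the LPM to models built on a CDF.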
[Figure: S-shaped CDF curve, rising from 0 to 1 as X runs from −∞ to +∞]
The above S-shaped curve closely resembles the cumulative distribution function (CDF)
of a random variable. (Note that the CDF of a random variable X is simply the probability that it
takes a value less than or equal to x0, where x0 is some specified numerical value of X. In short,
F(X), the CDF of X, is $F(x_0) = P(X \le x_0)$. Please refer to your Statistics for Economists text.)
Therefore, one can easily use the CDF to model regressions where the response variable is
dichotomous, taking 0-1 values.
The CDFs commonly chosen to represent the 0-1 response models are:
a) the logistic, which gives rise to the logit model
b) the normal, which gives rise to the probit (or normit) model
Now let us see how one can estimate and interpret the logit model.
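Recall that the LPM was (for home ownership)
$P_i = E(Y = 1 \mid X_i) = \beta_0 + \beta_1 X_i$
Now consider instead the following representation of home ownership:
$P_i = \dfrac{1}{1 + e^{-Z_i}} = \dfrac{e^{Z_i}}{1 + e^{Z_i}}$, where $Z_i = \beta_0 + \beta_1 X_i$
This equation represents what is known as the (cumulative) logistic distribution function. Since
it is nonlinear in both X and the β's, we cannot use the familiar
OLS procedure to estimate the parameters. It can, however, be linearized as follows.
$1 - P_i = \dfrac{1}{1 + e^{Z_i}}$
$\dfrac{P_i}{1 - P_i} = \dfrac{1 + e^{Z_i}}{1 + e^{-Z_i}} = e^{Z_i}$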
Now $P_i / (1 - P_i)$ is simply the odds ratio in favor of owning a house: the ratio of the probability that a
family will own a house to the probability that it will not own a house.
Taking the natural log of the odds ratio, we obtain
$L_i = \ln\left(\dfrac{P_i}{1 - P_i}\right) = Z_i = \beta_0 + \beta_1 X_i$
L (the log of the odds ratio) is linear in X as well as β (the parameters). L is called the logit and
hence the name logit model is given to it. The interpretation of the logit model is as follows:
β 1 – The slope measures the change in L for a unit change in X.
β 0 – The intercept tells the value of the log-odds in favor of owning a house if income is
zero. Like most interpretations of intercepts, this interpretation may not have any physical
meaning.
Now, for estimation purposes, let us write the logit model as
$L_i = \ln\left(\dfrac{P_i}{1 - P_i}\right) = \beta_0 + \beta_1 X_i + U_i$
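The relationship between the probability, the odds and the logit can be traced numerically. The sketch below uses hypothetical coefficients (`beta0 = -4.0`, `beta1 = 0.5` are invented for illustration) and checks that the log of the odds equals $Z_i$, i.e. that the logit is linear in X even though $P_i$ is not.

```python
import math


def logistic(z):
    """Cumulative logistic distribution function: P = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameter values, chosen only for illustration.
beta0, beta1 = -4.0, 0.5

rows = []
for x in (2, 8, 14):
    z = beta0 + beta1 * x          # Z_i = beta0 + beta1 * X_i
    p = logistic(z)                # probability of the event, always in (0, 1)
    odds = p / (1.0 - p)           # odds in favor of the event, equal to exp(z)
    logit = math.log(odds)         # L_i, which recovers z
    rows.append((x, z, p, odds, logit))
    print(f"X={x:>2}  Z={z:+.2f}  P={p:.4f}  odds={odds:.4f}  logit={logit:+.4f}")
```

Equal steps in X move the logit by equal amounts (here 0.5 per unit of X), while the implied change in the probability depends on where on the S-shaped curve one starts.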
The estimating model that emerges from the normal CDF is popularly known as the probit model.
Here the observed dependent variable Y takes on one of the values 0 and 1 according to the following
criterion. Define a latent variable Y* such that $Y_i^* = X_i \beta + \varepsilon_i$, and
$Y_i = 1$ if $Y_i^* > 0$
$Y_i = 0$ if $Y_i^* \le 0$
The latent variable Y* is continuous ($-\infty < Y^* < \infty$). It generates the observed binary variable Y.
An observed variable, Y can be observed in two states:
- Probit function
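The latent-variable rule can be illustrated by simulation. The sketch below assumes that the error term is standard normal (the usual probit normalization, not stated explicitly above) and uses invented values `beta0 = -0.5`, `beta1 = 1.0` and `x = 1.0`; it compares the simulated share of Y = 1 with the normal CDF evaluated at the index.

```python
import random
from statistics import NormalDist

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical values, chosen only for illustration.
beta0, beta1 = -0.5, 1.0
x = 1.0
n_draws = 20000

# Latent variable: Y* = beta0 + beta1 * x + error, with error ~ N(0, 1) (assumed).
latent = [beta0 + beta1 * x + rng.gauss(0.0, 1.0) for _ in range(n_draws)]

# Observation rule: Y = 1 if Y* > 0, otherwise Y = 0.
y = [1 if y_star > 0 else 0 for y_star in latent]

simulated_share = sum(y) / n_draws
probit_probability = NormalDist().cdf(beta0 + beta1 * x)  # P(Y = 1 | x) under the assumption

print("simulated share of Y = 1:", round(simulated_share, 4))
print("normal CDF at the index: ", round(probit_probability, 4))
```

The two numbers should be close, which is why, under the normality assumption, the probability of observing Y = 1 is modeled as the standard normal CDF of the index.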
CHAPTER TWO
In the previous chapters, we focused exclusively on the
problems and estimation of single-equation regression models. In such models, a dependent
variable is expressed as a linear function of one or more explanatory variables. The cause-and-effect
relationship in such models between the dependent and independent variables is
unidirectional. That is, the explanatory variables are the cause and the dependent variable is
the effect. But there are situations where such one-way or unidirectional causation in the function
is not meaningful. This occurs if, for instance, Y (the dependent variable) is not only a function of the X’s
(explanatory variables) but all or some of the X’s are, in turn, determined by Y. There is,
therefore, a two-way flow of influence between Y and (some of) the X’s, which in turn makes the
distinction between dependent and independent variables a little doubtful. Under such
If one applies OLS to estimate the parameters of each equation disregarding other equations of
the model, the estimates so obtained are not only biased but also inconsistent; i.e. even if the
sample size increases indefinitely, the estimators do not converge to their true values. The bias
arising from application of such procedure of estimation which treats each equation of the
simultaneous equations model as though it were a single model is known as simultaneity bias or
simultaneous equation bias. To avoid this bias we will use other methods of estimation, such as:
What happens to the parameters of the relationship if we estimate by applying OLS to each
equation without taking into account the information provided by the other equations in the
system? One of the crucial assumptions of OLS is that the explanatory variables and the
disturbance term are independent, i.e. the explanatory variables are truly exogenous. Symbolically:
$E[X_i U_i] = 0$. As a result, the linear model could be interpreted as describing the conditional
expectation of the dependent variable (Y) given a set of explanatory variables. In the
$X = \dfrac{\beta_0 + \alpha_0 \beta_1}{1 - \alpha_1 \beta_1} + \dfrac{\beta_2}{1 - \alpha_1 \beta_1} Z + \dfrac{\beta_1 U + V}{1 - \alpha_1 \beta_1}$ --------------------- (11)
Applying OLS to the first equation of the above structural model will result in biased estimator
endogenous. For instance, $X_t$ and $X_{t-1}$ depict the current and lagged exogenous variables and
$Y_{t-1}$ depicts the lagged endogenous variable. This is on the assumption that X’s symbolize the
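$C = \dfrac{\alpha}{1 - \beta} + \left(\dfrac{\beta}{1 - \beta}\right) Z + \dfrac{U}{1 - \beta}$ ---------------------------------- (18)
Substituting again (18) into (17) we get:
$Y = \dfrac{\alpha}{1 - \beta} + \left(\dfrac{1}{1 - \beta}\right) Z + \dfrac{U}{1 - \beta}$ -------------------------------- (19)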
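Equations (18) and (19) are called the reduced form of the above structural model. We can
write this more formally as:
Structural form equations      Reduced form equations
$C = \alpha + \beta Y + U$     $C = \dfrac{\alpha}{1 - \beta} + \left(\dfrac{\beta}{1 - \beta}\right) Z + \dfrac{U}{1 - \beta}$
$Y = C + Z$                    $Y = \dfrac{\alpha}{1 - \beta} + \left(\dfrac{1}{1 - \beta}\right) Z + \dfrac{U}{1 - \beta}$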
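Simultaneity bias in this consumption model can be made visible with a small simulation. The sketch below is only an illustration under invented values (`alpha = 10`, `beta = 0.8`, normally distributed Z and U); it generates data from the reduced form, applies OLS directly to the structural consumption function, and then recovers β indirectly from the reduced-form coefficient on Z.

```python
import random


def ols_simple(x, y):
    """Return (intercept, slope) of the OLS regression of y on a single regressor x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical structural parameters (assumptions for illustration only).
alpha, beta = 10.0, 0.8
n = 5000

Z = [rng.gauss(50.0, 10.0) for _ in range(n)]  # exogenous non-consumption expenditure
U = [rng.gauss(0.0, 10.0) for _ in range(n)]   # structural disturbance

# Generate the endogenous variables from the reduced form (18) and the identity Y = C + Z.
C = [(alpha + beta * z + u) / (1.0 - beta) for z, u in zip(Z, U)]
Y = [c + z for c, z in zip(C, Z)]

# OLS applied directly to the structural equation C = alpha + beta*Y + U.
_, beta_ols = ols_simple(Y, C)

# Reduced-form slope pi = beta / (1 - beta), so beta = pi / (1 + pi).
_, pi_hat = ols_simple(Z, C)
beta_indirect = pi_hat / (1.0 + pi_hat)

print("true beta:               ", beta)
print("OLS on structural form:  ", round(beta_ols, 4))
print("from reduced-form slope: ", round(beta_indirect, 4))
```

Because Y depends on U through the reduced form, the direct OLS estimate stays above the true β even in a large sample, whereas the estimate backed out of the reduced form is close to it.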
Parameters of the reduced form measure the total effect (direct and indirect) of a change in
exogenous variables on the endogenous variable. For instance, in the above reduced form
equation (18), $\dfrac{\beta}{1 - \beta}$ measures the total effect of a unit change in the non-consumption
Note: $\pi_{11}, \pi_{12}, \pi_{21}, \pi_{22}, \pi_{31}$, and $\pi_{32}$ are reduced form parameters.
Recursive models
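OLS is not applicable if there is interdependence between the explanatory variables and the error
term. In simultaneous equation models, the endogenous variables may depend on the error
terms of the model; hence the OLS technique is not appropriate for estimating an equation in a
simultaneous equations model. Consider, however, the following system:
$$
\begin{aligned}
Y_1 &= \alpha_1 + \alpha_2 X_2 + \alpha_3 X_3 + U_1 \\
Y_2 &= \alpha_4 + \beta_1 Y_1 + \alpha_5 X_4 + U_2 \\
Y_3 &= \alpha_6 + \beta_2 Y_2 + \alpha_7 X_5 + U_3
\end{aligned}
$$
In the first equation, there are only exogenous variables, which are assumed to be independent of
$U_1$. In the second equation, the causal relation between $Y_1$ and $Y_2$ is in one direction. Also, $Y_1$ is
independent of $U_2$ and can be treated just like an exogenous variable. Similarly, since $Y_2$ is
independent of $U_3$, OLS can be applied to the third equation. Thus, we can rewrite the above
equations as follows:
$$
\begin{aligned}
Y_1 - \alpha_1 - \alpha_2 X_2 - \alpha_3 X_3 &= U_1 \\
-\beta_1 Y_1 + Y_2 - \alpha_4 - \alpha_5 X_4 &= U_2 \\
-\beta_2 Y_2 + Y_3 - \alpha_6 - \alpha_7 X_5 &= U_3
\end{aligned}
$$
We can again rewrite this in matrix form (with $X_1 = 1$ carrying the intercept terms) as follows:
$$
\underbrace{\begin{bmatrix} 1 & 0 & 0 \\ -\beta_1 & 1 & 0 \\ 0 & -\beta_2 & 1 \end{bmatrix}}_{\text{coefficient matrix of endogenous variables}}
\begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \end{bmatrix}
+
\underbrace{\begin{bmatrix} -\alpha_1 & -\alpha_2 & -\alpha_3 & 0 & 0 \\ -\alpha_4 & 0 & 0 & -\alpha_5 & 0 \\ -\alpha_6 & 0 & 0 & 0 & -\alpha_7 \end{bmatrix}}_{\text{coefficient matrix of exogenous variables}}
\begin{bmatrix} X_1 \\ X_2 \\ X_3 \\ X_4 \\ X_5 \end{bmatrix}
=
\begin{bmatrix} U_1 \\ U_2 \\ U_3 \end{bmatrix}
$$
The coefficient matrix of endogenous variables is thus a triangular one; hence recursive models
are also called triangular models.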
In applying the identification rules, we should either ignore the constant term, or, if we want to
retain it, we must include in the set of variables a dummy variable (say X 0) which would always
take on the value 1. Let’s ignore the constant intercept. There are two formal rules for
identification
CHAPTER THREE
Time-series data are a set of observations on a quantitative variable collected over discrete
intervals of time.
Examples include the annual price of wheat in the United States and the daily price of General
Electric stock shares. Macroeconomic data are usually reported in monthly, quarterly, or
annual terms.
Financial data, such as stock prices, can be recorded daily, or at even higher frequencies.
The key feature of time-series data is that the same economic quantity is recorded at regular time
intervals. A time-series data set consists of observations on one or several variables over
time. In time-series analysis, we analyze the past behavior of a variable in order to predict its
future behavior.
Some Time Series Terms
• Stationary - a time series variable exhibiting no significant upward or downward trend over time.
• Nonstationary - a time series variable exhibiting a significant upward or downward trend over time.
• Seasonal Data - a time series variable exhibiting a repeating pattern at regular intervals over time.
A second distinguishing feature of time-series data is its natural ordering according to time. With
cross-section data there is no particular ordering of the observations that is better or more natural
than another. To show the dynamic nature of relationships:
Given that the effects of changes in variables are not always instantaneous, we need to ask how to
model the dynamic nature of relationships. We can use a dynamic model with lagged values of both the
dependent and explanatory variables, such as
$Y_t = f(Y_{t-1}, X_t, X_{t-1}, X_{t-2})$ -------------------------------(3.1)
Such models are called autoregressive distributed lag (ARDL) models, with ‘‘autoregressive’’
meaning a regression of $Y_t$ on its own lag or lags.
Examples: Consider the following:
$Y_t = 1.2 Y_{t-1} + \varepsilon_t$ --------------------------------- the AR(1) model
or $Y_t = \delta + 1.2 Y_{t-1} + \varepsilon_t$ ------------------------- the AR(1) model with a constant
$Y_t = 1.2 Y_{t-1} - 0.32 Y_{t-2} + \varepsilon_t$ -------------------------- the AR(2) model
or $Y_t = \delta + 1.2 Y_{t-1} - 0.32 Y_{t-2} + \varepsilon_t$ -------------------- the AR(2) model with a constant
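Whether such autoregressive examples are stationary can be checked from their coefficients. The sketch below applies the standard condition that all roots of the characteristic equation $z^2 - \phi_1 z - \phi_2 = 0$ lie inside the unit circle; the function names are illustrative, and an AR(1) is treated as an AR(2) with $\phi_2 = 0$.

```python
import cmath


def ar2_characteristic_roots(phi1, phi2):
    """Roots of z**2 - phi1*z - phi2 = 0 for Y_t = phi1*Y_{t-1} + phi2*Y_{t-2} + e_t."""
    disc = cmath.sqrt(phi1 ** 2 + 4.0 * phi2)
    return (phi1 + disc) / 2.0, (phi1 - disc) / 2.0


def is_stationary(phi1, phi2=0.0):
    """True when every characteristic root has modulus strictly below 1."""
    return all(abs(root) < 1.0 for root in ar2_characteristic_roots(phi1, phi2))


# The two examples above: AR(1) with coefficient 1.2, AR(2) with coefficients 1.2 and -0.32.
print("AR(1), phi = 1.2          ->", is_stationary(1.2))
print("AR(2), phi = (1.2, -0.32) ->", is_stationary(1.2, -0.32))
print("AR(2) roots:", ar2_characteristic_roots(1.2, -0.32))
```

Under this condition the AR(1) example with coefficient 1.2 is explosive rather than stationary, while the AR(2) example has roots of about 0.8 and 0.4 and is stationary.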
A stochastic process is said to be stationary if its mean and variance are constant over time (do not
depend on time) and the value of the covariance
between two time periods depends only on the lag between them and not on
the actual time at which it is computed. A non-stationary time series will have a time-varying mean or a time-varying
variance or both.
Non-stationary Stochastic Processes
In practical research one often encounters non-stationary time series. The classic example is the
Random Walk Model (RWM). We distinguish two types of random walks:
The series $Y_t$ is said to be a random walk without drift if
$Y_t = Y_{t-1} + u_t$
where $u_t$ is a white noise error term (an error term with mean 0 and constant variance $\sigma^2$).
This model says that the value of Y at time t (i.e., $Y_t$) is equal to its value at time (t−1) plus a
random shock ($u_t$). It is an AR(1) model, because $Y_t$ is regressed on itself lagged one period.
The series $Y_t$ is said to be a random walk with drift if
$Y_t = \delta + Y_{t-1} + u_t$
The model shows that $Y_t$ drifts upward or downward, depending on whether δ is positive
or negative. Note that the RWM with drift is also an AR model. Therefore, in general we can conclude
that the Random Walk Model (with or without drift) is a non-stationary stochastic process.
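The non-stationarity of a random walk without drift shows up as a variance that grows with time. The simulation below is a sketch under assumed settings (standard normal shocks, $Y_0 = 0$, 2000 simulated paths); it compares the cross-path variance of $Y_t$ at t = 10 and t = 100.

```python
import random
from statistics import pvariance

rng = random.Random(1)  # fixed seed so the sketch is reproducible

n_paths, n_periods = 2000, 100
values_at_10, values_at_100 = [], []

for _ in range(n_paths):
    y = 0.0  # assumed starting value Y_0 = 0
    for t in range(1, n_periods + 1):
        y += rng.gauss(0.0, 1.0)  # Y_t = Y_{t-1} + u_t with u_t ~ N(0, 1) (assumed)
        if t == 10:
            values_at_10.append(y)
    values_at_100.append(y)

var_10 = pvariance(values_at_10)
var_100 = pvariance(values_at_100)

print("variance of Y_t across paths at t = 10: ", round(var_10, 2))
print("variance of Y_t across paths at t = 100:", round(var_100, 2))
```

For a random walk started at zero the variance of $Y_t$ equals $t\sigma^2$, so the second figure should be roughly ten times the first; a stationary series would show no such growth.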
The random walk model is an example of what is known as a unit root process. Let us write the
RWM as:
In practical research, it is important to find out whether a time series has a unit root (i.e., whether it
is non-stationary). Note that the term unit root process is often used interchangeably with non-stationary process.
4.1. Introduction
Panel data combine cross-section and time-series data. In panel data the
same cross-sectional unit (industry, firm, country) is surveyed over time, so we have data
which are pooled over space as well as time.
Reasons for using Panel Data
• Panel data can take explicit account of individual-specific heterogeneity (“individual”
here means related to the micro unit).
• By combining data in two dimensions, panel data give more data variation, less
collinearity and more degrees of freedom.
• Panel data are better suited than cross-sectional data for studying the dynamics of change.
For example, they are well suited to understanding transition behaviour such as
company bankruptcy or merger.
Autocorrelation
Although autocorrelation in panel models differs from autocorrelation in the usual OLS models, a version
of the Durbin-Watson test can be used in the usual way (EViews reports this). To remedy autocorrelation we
can use the usual methods, such as the Error Correction Model. ‘Dynamic Models’ are also often
used, which basically involves adding a lagged dependent variable. Recently the use of a method
Balanced panel data: every sampling unit is observed in every time period.
E.g. year1 = year2 = year3 = 300 households
Unbalanced panel data: the number of observations may differ across units or across
points in time.
E.g. year1 = 300, year2 = 295, year3 = 270
– Households move to other places, or members/household heads die
– Firms go out of business
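The balanced/unbalanced distinction can be checked mechanically from the list of (unit, period) pairs in long format. The following is a minimal sketch; the function name `is_balanced` and the household identifiers are hypothetical.

```python
from collections import defaultdict


def is_balanced(observations):
    """Return True if every unit is observed in every period that appears in the data.

    `observations` is an iterable of (unit_id, period) pairs, i.e. long-format panel rows.
    """
    periods_by_unit = defaultdict(set)
    for unit, period in observations:
        periods_by_unit[unit].add(period)
    all_periods = set().union(*periods_by_unit.values())
    return all(periods == all_periods for periods in periods_by_unit.values())


# Hypothetical household panel: three households observed in years 1 to 3.
balanced = [(hid, year) for hid in ("h1", "h2", "h3") for year in (1, 2, 3)]

# Same panel after household h3 drops out in year 3 (e.g. it moved away).
unbalanced = [row for row in balanced if row != ("h3", 3)]

print("balanced panel?  ", is_balanced(balanced))
print("after attrition? ", is_balanced(unbalanced))
```

A single missing unit-period pair is enough to make the panel unbalanced, which is why attrition of households or firms is the typical source.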
Panel data in Stata
• Use the data file ‘Epanel’.
• Check in what format it is presented: make sure that it is in long format as
opposed to wide format.
• Make sure that the data have two identifiers: the entity id (hid) and the panel period (year).
• Make sure that each entity id appears only once per panel period.
• Declare to Stata that your data are panel data using this command: xtset hid year
• The key issue with panel data is that Yit (outcome in period t) and Yis (outcome in
period s) tend to be correlated even conditional on the covariates Xit and Xis.
Covariance Model
Within Estimator
Individual Dummy Variable Model
Least Squares Dummy Variable Model.
• Each entity has its own individual characteristics that may or may not influence the
predictor variables.
The fixed-effects (FE) model removes the effect of those time-invariant characteristics from the predictor
variables so we can assess the predictors’ net effect. Each entity is different; therefore the
entity’s error term and the constant (which captures individual characteristics) should not be
correlated with those of the others. If the error terms are correlated, then FE is not suitable, since
inferences may not be correct, and you need to model that relationship (probably using random
effects); this is the main rationale for the Hausman test (presented later on in this document).
Another important assumption of the FE model is that those time-invariant characteristics are
unique to the individual and should not be correlated with other individual characteristics.
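How the within (fixed-effects) estimator removes time-invariant characteristics can be seen on a tiny example. The sketch below uses invented data for two entities whose outcome is exactly `y = a_i + 2*x`, with entity effects `a_A = 10` and `a_B = 0` that are correlated with x; all names and numbers are assumptions for illustration.

```python
def demean(values):
    """Subtract the mean from every element of a list."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def slope_on_demeaned(x, y):
    """OLS slope for variables that have already been demeaned."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


# Hypothetical panel: y = a_i + 2*x exactly, with a_A = 10 and a_B = 0.
panel = {
    "A": {"x": [1, 2, 3], "y": [12, 14, 16]},
    "B": {"x": [4, 5, 6], "y": [8, 10, 12]},
}

x_all, y_all = [], []
x_within, y_within = [], []
for entity in panel.values():
    x_all += entity["x"]
    y_all += entity["y"]
    # Within transformation: demean each variable entity by entity,
    # which wipes out the time-invariant entity effect a_i.
    x_within += demean(entity["x"])
    y_within += demean(entity["y"])

pooled_slope = slope_on_demeaned(demean(x_all), demean(y_all))
within_slope = slope_on_demeaned(x_within, y_within)

print("pooled OLS slope (ignores entity effects):", round(pooled_slope, 4))
print("within (fixed-effects) slope:             ", round(within_slope, 4))
```

In this constructed case pooled OLS even gets the sign wrong because the entity effects are correlated with x, while the within estimator recovers the slope of 2 used to build the data.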
4.3. Estimation of Panel Data Regression Model: Random Effect Estimation
This is a very strong assumption to make in empirical analysis. The rationale behind random
effects model is that, unlike the fixed effects model, the variation across entities is assumed to be
random and uncorrelated with the predictor or independent variables included in the model. If
you have reason to believe that differences across entities have some influence on your
To decide between fixed and random effects you can run a Hausman test, where the null hypothesis
is that the preferred model is random effects and the alternative is fixed effects (see Greene,
2008, chapter 9). It basically tests whether the unique errors (unobserved individual
characteristics) are correlated with the regressors.
Conclusion:
• Panel data methods are used to estimate models on data that are both time series and cross
sectional.
• They have both advantages and disadvantages relative to OLS estimation.
• They apply to many different techniques, such as tests for stationarity.
Assignment of Econometrics II for 3rd Yr Regular Students in 2022 (20%)
INSTRUCTION
a. Work in groups of 10 students (assigned alphabetically).
b. Marks will be deducted for unreadable handwriting and for copied work.
c. Submit it on 30/08/2014 E.C.
d. Don’t use more than three pages.
e. You must write your name, ID no. and section correctly.
f. Don’t use a computer to write the assignment.