Econometrics II Handout For Students


CHAPTER ONE

1. REGRESSION ANALYSIS WITH QUALITATIVE INFORMATION: BINARY/DUMMY VARIABLES

1.1 Describing Qualitative Information

In regression analysis, the dependent variable can be influenced by qualitative variables (gender, religion, geographic region, etc.) as well as by quantitative variables (income, age, consumption, price, etc.).
A dummy variable:
- takes the value 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome;
- represents a fact that is by nature qualitative, i.e. not measured on a numeric scale;
- serves as a "proxy" (numeric stand-in) for that qualitative fact in a regression model.
Qualitative (categorical) information is therefore usually brought into a regression through dummy variables.
1.2. Dummy as Independent Variables

In regression analysis the dependent variable is often affected by variables that are essentially qualitative in nature (e.g., sex, race, color, religion, nationality, wars, earthquakes, strikes, political upheavals, and changes in government economic policy).
For example, holding all other factors constant, female professors are found to earn less than their male counterparts, and nonwhites are found to earn less than whites. This pattern may result from sex or racial discrimination, but whatever the reason, qualitative variables such as sex and race do influence the dependent variable and clearly should be included among the explanatory variables.
Such qualitative variables usually indicate the presence or absence of a "quality" or an attribute, such as male or female, black or white, or Christian or Muslim. One method of "quantifying" such attributes is to construct artificial variables that take on values of 1 or 0, with 0 indicating the absence and 1 indicating the presence (possession) of the attribute.
E.g., 1 may indicate that a person is male and 0 that the person is female; or 1 may indicate that a person is a college graduate and 0 that he or she is not, and so on.
Variables that assume such 0 and 1 values are called dummy, indicator, binary, categorical, or dichotomous variables.
Dummy variables can be used in regression models just as easily as quantitative variables.
A regression model that contains only dummy explanatory variables is called an Analysis of Variance (ANOVA) model.
Example: $Y_i = \alpha + \beta D_i + u_i$ ------------------------------------------ (1.1)
Where: $Y_i$ = annual salary of a college professor
$D_i$ = 1 if male college professor
$D_i$ = 0 otherwise (i.e., female professor)
NB: we shall denote all dummy variables by the letter D.
Assuming that the disturbance term satisfies the usual assumptions of the classical linear regression model (in particular, that the mean of the error term is zero), we obtain from (1.1):
Mean salary of female college professor: $E(Y_i \mid D_i = 0) = \alpha$ ---------------------- (1.2)

Econometrics II Lecture Note Efa W. Page 1


Mean salary of male college professor: $E(Y_i \mid D_i = 1) = \alpha + \beta$
- The intercept term $\alpha$ gives the mean salary of female college professors.
- $\beta$ tells by how much the mean salary of a male college professor differs from the mean salary of his female counterpart.
- $\alpha + \beta$ gives the mean salary of male college professors. The two means differ only if $\beta$ is statistically significant, which can be checked with the usual t test.
Consider the following hypothetical data on starting salaries of college teachers by sex:

Starting salary (Y)    Sex (1 = male, 0 = female)
22,000                 1
19,000                 0
18,000                 0
21,700                 1
18,500                 0
21,000                 1
20,500                 1
17,000                 0
17,500                 0
21,200                 1

The estimated mean salary of females is (19,000 + 18,000 + 18,500 + 17,000 + 17,500) / 5 = 18,000.
The estimated mean salary of males is (22,000 + 21,700 + 21,000 + 20,500 + 21,200) / 5 = 21,280.
Therefore, the results of the regression analysis are presented as follows:
$\hat{Y}_i = 18,000 + 3,280 D_i$
se = (0.32) (0.44)   (standard errors expressed in thousands of birr)
t = (57.74) (7.439)
$R^2 = 0.8737$
The above results show that the estimated mean salary of female college instructors is birr 18,000 ($= \hat{\alpha}$) and that of male instructors is birr 21,280 ($= \hat{\alpha} + \hat{\beta}$).
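The regression above can be reproduced from the ten observations in the table. The following is a minimal sketch using NumPy; the variable names (`salary`, `male`, `alpha_hat`, `beta_hat`) are chosen here for illustration and are not part of the handout.

```python
import numpy as np

# Starting salaries (birr) and sex dummy (1 = male, 0 = female) from the table
salary = np.array([22000, 19000, 18000, 21700, 18500,
                   21000, 20500, 17000, 17500, 21200], dtype=float)
male = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 1], dtype=float)

# Design matrix: a column of ones (intercept) and the dummy D
X = np.column_stack([np.ones_like(male), male])

# OLS estimates of alpha and beta in Y = alpha + beta*D + u
coef = np.linalg.lstsq(X, salary, rcond=None)[0]
alpha_hat, beta_hat = coef

# Residual variance, standard errors, t ratios and R^2
resid = salary - X @ coef
n, k = X.shape
sigma2 = resid @ resid / (n - k)
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
t_ratios = coef / se
r2 = 1.0 - (resid @ resid) / np.sum((salary - salary.mean()) ** 2)

print("alpha_hat, beta_hat:", alpha_hat, beta_hat)
print("standard errors (birr):", se)
print("t ratios:", t_ratios)
print("R^2:", r2)
```

The intercept equals the female group mean and the intercept plus the slope equals the male group mean, which is exactly what (1.2) and the expression for the male mean state.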

A/ Regression on one quantitative variable and one qualitative variable with two classes
Consider the model:
$Y_i = \alpha_1 + \alpha_2 D_i + \beta X_i + u_i$ ---------------------------- (1.3)
Where:
$Y_i$ = annual salary of a college professor
$X_i$ = years of teaching experience
$D_i$ = 1 if male and 0 otherwise
Model (1.3) contains one quantitative variable (years of teaching experience) and one qualitative variable (sex) that has two classes, namely male and female.
Assuming, as usual, that $E(u_i) = 0$, we see that
Mean salary of female college professor: $E(Y_i \mid X_i, D_i = 0) = \alpha_1 + \beta X_i$ --------- (1.4)
Mean salary of male college professor: $E(Y_i \mid X_i, D_i = 1) = (\alpha_1 + \alpha_2) + \beta X_i$ ------ (1.5)
A test of the hypothesis that (1.4) and (1.5) have the same intercept (i.e., that there is no sex discrimination) can be made easily by running regression (1.3) and checking, with the usual t test, whether the estimate of $\alpha_2$ is statistically significant.
B. Regression on one quantitative and one qualitative variable with more than two classes
Suppose that, on the basis of cross-sectional data, we want to regress the annual expenditure on health care by an individual on the individual's income and education.
Education is a qualitative variable with more than two categories (primary school, high school, and college). Following the rule that the number of dummies be one less than the number of categories of the variable, we use the following model:
$Y_i = \alpha_1 + \alpha_2 D_{2i} + \alpha_3 D_{3i} + \beta X_i + u_i$ -------------------------- (1.6)
Where: $Y_i$ = annual expenditure on health care
$X_i$ = annual income
$D_2$ = 1 if high school education and 0 otherwise
$D_3$ = 1 if college education and 0 otherwise
NB: in defining the dummy variables above we are arbitrarily treating the "primary school education" category as the base category. Therefore, the intercept $\alpha_1$ reflects the intercept for the base category. The differential intercepts $\alpha_2$ and $\alpha_3$ tell by how much the intercepts of the other two categories differ from the intercept of the base category, which can be readily checked as follows. Assuming $E(u_i) = 0$, we obtain from equation (1.6):
$E(Y_i \mid D_2 = 0, D_3 = 0, X_i) = \alpha_1 + \beta X_i$
$E(Y_i \mid D_2 = 1, D_3 = 0, X_i) = (\alpha_1 + \alpha_2) + \beta X_i$
$E(Y_i \mid D_2 = 0, D_3 = 1, X_i) = (\alpha_1 + \alpha_3) + \beta X_i$
which are, respectively, the mean health care expenditure functions for the three levels of education. Geometrically, the situation is shown in fig 1.2 (for illustrative purposes it is assumed that $\alpha_3 > \alpha_2$).



C/ Regression on one quantitative variable and two qualitative variables
The technique of dummy variables can be easily extended to handle more than one qualitative variable.
$Y_i = \alpha_1 + \alpha_2 D_{2i} + \alpha_3 D_{3i} + \beta X_i + u_i$ ------------------------------------------- (1.7)
Where: $Y_i$ = annual salary
$X_i$ = years of teaching experience
$D_2$ = 1 if male and 0 otherwise
$D_3$ = 1 if white and 0 otherwise
NB: each of the two qualitative variables, sex and color, has two categories and hence needs one dummy variable. The omitted, or base, category is now "black female professor."
Assuming $E(u_i) = 0$, we can obtain the following regressions from (1.7):
Mean salary for black female professor: $E(Y_i \mid D_2 = 0, D_3 = 0, X_i) = \alpha_1 + \beta X_i$
Mean salary for black male professor: $E(Y_i \mid D_2 = 1, D_3 = 0, X_i) = (\alpha_1 + \alpha_2) + \beta X_i$
Mean salary for white female professor: $E(Y_i \mid D_2 = 0, D_3 = 1, X_i) = (\alpha_1 + \alpha_3) + \beta X_i$
Mean salary for white male professor: $E(Y_i \mid D_2 = 1, D_3 = 1, X_i) = (\alpha_1 + \alpha_2 + \alpha_3) + \beta X_i$
Once again, it is assumed that the preceding regressions differ only in the intercept coefficients ($\alpha_1$, $\alpha_2$, $\alpha_3$) but not in the slope coefficient $\beta$. The model can be extended in the same way to include more than one quantitative variable and more than two qualitative variables.
Features of the dummy variable regression model:
1. One dummy variable is enough to distinguish two categories. The general rule is this: if a qualitative variable has m categories, introduce only m-1 dummy variables. If this rule is not followed (in a model that includes an intercept), we fall into what is called the dummy variable trap, that is, the situation of perfect multicollinearity.
2. With two categories, the assignment is arbitrary: we could equally have assigned D = 1 for female and D = 0 for male.
3. The group or category that is assigned the value of 0 is often referred to as the base, benchmark, control, comparison, reference, or omitted category. It is the base in the sense that comparisons are made with that category.
4. The coefficient $\alpha_2$ attached to the dummy variable D can be called the differential intercept coefficient because it tells by how much the intercept of the category that receives the value of 1 differs from the intercept coefficient of the base category.
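The dummy variable trap in feature 1 can be seen directly from the rank of the regressor matrix. The sketch below reuses the sex dummy from the starting-salary example; the names `X_trap` and `X_ok` are illustrative assumptions, not notation from the handout.

```python
import numpy as np

# Sex dummy from the starting-salary example (1 = male, 0 = female)
male = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 1], dtype=float)
female = 1.0 - male            # a second dummy for the same two-category variable
ones = np.ones_like(male)      # intercept column

# Intercept plus m = 2 dummies: male + female = ones, so the columns are
# perfectly collinear (the dummy variable trap)
X_trap = np.column_stack([ones, male, female])

# Intercept plus m - 1 = 1 dummy: full column rank
X_ok = np.column_stack([ones, male])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)

print("trap: rank", rank_trap, "with", X_trap.shape[1], "columns")
print("ok:   rank", rank_ok, "with", X_ok.shape[1], "columns")
```

When the rank is smaller than the number of columns, X'X cannot be inverted and the OLS coefficients are not uniquely determined.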



1.3 Testing for structural stability of regression models

Until now, in the models considered in this chapter we assumed that the qualitative variables
affect the intercept but not the slope coefficient of the various subgroup regressions. But what if
the slopes are also different? If the slopes are in fact different, testing for differences in the
intercepts may be of little practical significance. Therefore, we need to develop a general
methodology to find out whether two (or more) regressions are different, where the difference
may be in the intercepts or the slopes or both.

Interaction effects: Consider the following model:

$Y_i = \alpha_1 + \alpha_2 D_{2i} + \alpha_3 D_{3i} + \beta X_i + u_i$ --------------------------------- (1.8)
Where: $Y_i$ = annual expenditure on clothing
$X_i$ = income
$D_2$ = 1 if female and 0 if male
$D_3$ = 1 if college graduate and 0 otherwise

Implicit in this model is the assumption that the differential effect of the sex dummy $D_2$ is constant across the two levels of education and that the differential effect of the education dummy $D_3$ is constant across the two sexes. That is, if, say, the mean expenditure on clothing is higher for females than for males, this is so whether or not they are college graduates. Likewise, if, say, college graduates on average spend more on clothing than non-graduates, this is so whether they are female or male.
This assumption may not hold in practice. For example, a female college graduate may spend more on clothing than a male graduate. In other words, there may be interaction between the two qualitative variables $D_2$ and $D_3$, so that their effect on mean Y may not be simply additive as in (1.8) but multiplicative as well, as in the following model:
$Y_i = \alpha_1 + \alpha_2 D_{2i} + \alpha_3 D_{3i} + \alpha_4 (D_{2i} D_{3i}) + \beta X_i + u_i$ ----------------- (1.9)

From (1.9) we obtain: $E(Y_i \mid D_2 = 1, D_3 = 1, X_i) = (\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4) + \beta X_i$ ------------ (1.10)


which is the mean clothing expenditure of graduate females. Here:
$\alpha_2$ = differential effect of being a female
$\alpha_3$ = differential effect of being a college graduate
$\alpha_4$ = differential effect of being a female graduate

This shows that the mean clothing expenditure of graduate females differs (by $\alpha_4$) from what the separate effects of being female and of being a college graduate would give. If $\alpha_2$, $\alpha_3$, and $\alpha_4$ are all positive, the average clothing expenditure of females is higher than that of the base category (here, male non-graduates), and much more so if the females also happen to be graduates. Similarly, the average expenditure on clothing by a college graduate tends to be higher than that of the base category, but much more so if the graduate happens to be a female. This shows how the interaction dummy modifies the effect of the two attributes considered individually. Whether the coefficient of the interaction dummy is statistically significant can be tested by the usual t test. If it turns out to be significant, the simultaneous presence of the two attributes will attenuate or reinforce the individual effects of these attributes. Needless to say, incorrectly omitting a significant interaction term will lead to a specification bias.
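For reference, setting the two dummies in the interaction model to each of their four combinations gives the following group-specific mean functions. This is a worked restatement of the interaction model with the term $\alpha_4 D_2 D_3$, written out in LaTeX; only the last line is stated explicitly in the text as (1.10).

```latex
\begin{align*}
E(Y_i \mid D_2 = 0, D_3 = 0, X_i) &= \alpha_1 + \beta X_i
  && \text{(male non-graduate, base category)} \\
E(Y_i \mid D_2 = 1, D_3 = 0, X_i) &= (\alpha_1 + \alpha_2) + \beta X_i
  && \text{(female non-graduate)} \\
E(Y_i \mid D_2 = 0, D_3 = 1, X_i) &= (\alpha_1 + \alpha_3) + \beta X_i
  && \text{(male graduate)} \\
E(Y_i \mid D_2 = 1, D_3 = 1, X_i) &= (\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4) + \beta X_i
  && \text{(female graduate)}
\end{align*}
```

The interaction coefficient $\alpha_4$ appears only in the last line, because the product $D_2 D_3$ equals 1 only when both attributes are present.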

1.4. Dummy as Dependent Variable


Here the dependent variable is qualitative. Suppose we want to study the labor-force
participation of adult males as a function of the unemployment rate, average wage rate, family
income, education, etc. A person either is in the labor force or not. Hence, the dependent
variable, labor-force participation, can take only two values: 1 if the person is in the labor force
and 0 if he or she is not. We can consider another example. A family may or may not own a
house. If it owns a house, it takes a value 1 and 0 if it does not.

In this situation a Qualitative Response Model (QRM) is needed. QRMs are models in which the dependent variable is a discrete outcome. There are two broad categories of QRM:
A. Binomial models: the choice is between two alternatives.
B. Multinomial models: the choice is between more than two alternatives.
Example: Y = 1, occupation is farming
= 2, occupation is carpentry
= 3, occupation is fishing
Binary variables are variables that have two categories and are often used to indicate that an event has occurred or that some characteristic is present.
Example: the decision to participate or not to participate in the labor force.
Multinomial variables occur when there are more than two possible outcomes.
The types of binomial models are:
1. The linear probability model
2. The logit model
3. The probit model

1.4.1. The Linear Probability Model (LPM)



The linear probability model is the linear regression model applied to a binary dependent variable. To fix ideas, consider the following simple model:

$Y_i = \beta_0 + \beta_1 X_i + U_i$ ……………………………(1)

Where: $X_i$ = family income
$Y_i$ = 1 if the family owns a house
= 0 if the family does not own a house
$U_i$ is the disturbance term
The independent variable $X_i$ can be a discrete or a continuous variable. The model can be extended to include additional explanatory variables.
The above model expresses the dichotomous $Y_i$ as a linear function of the explanatory variable $X_i$. Such models are called linear probability models (LPM) since $E(Y_i \mid X_i)$, the conditional expectation of $Y_i$ given $X_i$, can be interpreted as the conditional probability that the event will occur given $X_i$; that is, $\Pr(Y_i = 1 \mid X_i)$. Thus, in the preceding case, $E(Y_i \mid X_i)$ gives the probability that a family whose income is the given amount $X_i$ owns a house. The justification of the name LPM can be seen as follows.
Assuming $E(U_i) = 0$, as usual (to obtain unbiased estimators), we obtain

$E(Y_i \mid X_i) = \beta_0 + \beta_1 X_i$ …………………………………….(2)

Now, letting $P_i$ = probability that $Y_i = 1$ (that is, that the event occurs) and $1 - P_i$ = probability that $Y_i = 0$ (that is, that the event does not occur), the variable $Y_i$ has the following distribution:

Y_i      Probability
0        1 - P_i
1        P_i
Total    1

Therefore, by the definition of mathematical expectation, we obtain

$E(Y_i) = 0(1 - P_i) + 1(P_i) = P_i$ ……………………………………..(3)
Now, comparing (2) with (3), we can equate

$E(Y_i \mid X_i) = \beta_0 + \beta_1 X_i = P_i$ ……………………………………(4)

Since the probability $P_i$ must lie between 0 and 1, we have the restriction $0 \le E(Y_i \mid X_i) \le 1$; that is, the conditional expectation, or conditional probability, must lie between 0 and 1.
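OLS does not enforce the restriction $0 \le E(Y_i \mid X_i) \le 1$ by itself. The sketch below fits the LPM to a small hypothetical data set (the income and ownership values are invented here purely for illustration) and shows that some fitted "probabilities" fall outside the 0-1 range.

```python
import numpy as np

# Hypothetical data (an assumption for illustration, not from the handout):
# family income X and home ownership Y (1 = owns a house, 0 = does not)
income = np.arange(1, 11, dtype=float)
owns = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)

# LPM: regress the 0/1 variable on income by OLS
X = np.column_stack([np.ones_like(income), income])
b0, b1 = np.linalg.lstsq(X, owns, rcond=None)[0]

# Fitted values are interpreted as estimated P(Y = 1 | X)
p_hat = b0 + b1 * income

print("b0, b1:", b0, b1)
print("fitted probabilities:", np.round(p_hat, 3))
print("smallest and largest fitted value:", p_hat.min(), p_hat.max())
```

In this example the fitted value is negative at the lowest income and exceeds 1 at the highest income, even though every increase in income changes the fitted probability by the same constant amount b1.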



1.4.2. The Logit Model
We have seen that the LPM has many problems, such as non-normality of $U_i$, heteroscedasticity of $U_i$, the possibility of $\hat{Y}_i$ lying outside the 0-1 range, and generally lower $R^2$ values. But these problems are surmountable. The fundamental problem with the LPM is that it is not logically a very attractive model, because it assumes that $P_i = E(Y = 1 \mid X)$ increases linearly with X, that is, the marginal or incremental effect of X remains constant throughout.
Geometrically, the model we want would look something like Fig A below.

[Fig A: A cumulative distribution function (CDF). Vertical axis: probability, from 0 to 1; horizontal axis: X, from -∞ to +∞.]

The above S-shaped curve closely resembles the cumulative distribution function (CDF) of a random variable. (Note that the CDF of a random variable X is simply the probability that it takes a value less than or equal to $x_0$, where $x_0$ is some specified numerical value of X. In short, F(X), the CDF of X, is $F(x_0) = P(X \le x_0)$. Please refer to your Statistics for Economists text.) Therefore, one can easily use the CDF to model regressions where the response variable is dichotomous, taking 0-1 values.

The CDFs commonly chosen to represent the 0-1 response models are:
a) the logistic, which gives rise to the logit model (used to overcome the logical problem of the LPM)
b) the normal, which gives rise to the probit (or normit) model
Now let us see how one can estimate and interpret the logit model.
Recall that the LPM was (for home ownership)

$P_i = E(Y = 1 \mid X_i) = \beta_0 + \beta_1 X_i$

where X is income and Y = 1 means the family owns a house.

Now consider the following representation of home ownership:

$P_i = E(Y = 1 \mid X_i) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_i)}}$

or

$P_i = \frac{1}{1 + e^{-Z_i}}$, where $Z_i = \beta_0 + \beta_1 X_i$

This equation represents what is known as the (cumulative) logistic distribution function. Since the above equation is nonlinear in both X and the $\beta$'s, we cannot use the familiar OLS procedure to estimate the parameters. It can, however, be linearized as follows.

$1 - P_i = \frac{1}{1 + e^{Z_i}}$

$\frac{P_i}{1 - P_i} = \frac{1 + e^{Z_i}}{1 + e^{-Z_i}} = e^{Z_i}$

Now $\frac{P_i}{1 - P_i}$ is simply the odds ratio in favor of owning a house: the ratio of the probability that a family will own a house to the probability that it will not own a house.
Taking the natural log of the odds ratio we obtain

$L_i = \ln\left(\frac{P_i}{1 - P_i}\right) = Z_i = \beta_0 + \beta_1 X_i$

L (the log of the odds ratio) is linear in X as well as in the $\beta$'s (the parameters). L is called the logit, and hence the name logit model. The interpretation of the logit model is as follows:
$\beta_1$ - The slope measures the change in L for a unit change in X.
$\beta_0$ - The intercept tells the value of the log-odds in favor of owning a house if income is zero. Like most interpretations of intercepts, this interpretation may not have any physical meaning.
Now for estimation purposes, let us write the logit model as

$L_i = \ln\left(\frac{P_i}{1 - P_i}\right) = \beta_0 + \beta_1 X_i + U_i$
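The algebra linking the probability, the odds and the logit can be checked numerically. The following sketch uses hypothetical parameter values (`b0 = -4.0`, `b1 = 0.5`) and hypothetical income levels, chosen here only for illustration.

```python
import numpy as np

def logistic(z):
    """Cumulative logistic distribution function: P = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and income levels (assumptions for illustration)
b0, b1 = -4.0, 0.5
income = np.array([2.0, 6.0, 8.0, 10.0, 14.0])

z = b0 + b1 * income          # Z_i = b0 + b1 * X_i
p = logistic(z)               # P_i, always strictly between 0 and 1
odds = p / (1.0 - p)          # odds in favor of owning a house, equals exp(Z_i)
logit = np.log(odds)          # L_i = ln(P_i / (1 - P_i)), equals Z_i

# Unlike in the LPM, the effect of X on P is not constant: dP/dX = b1 * P * (1 - P)
marginal = b1 * p * (1.0 - p)

for x, pi, li, mi in zip(income, p, logit, marginal):
    print(f"X = {x:5.1f}  P = {pi:.4f}  logit = {li:6.2f}  dP/dX = {mi:.4f}")
```

The logit column is linear in X with slope b1, while the marginal effect on the probability is largest where P = 0.5 and shrinks toward zero in both tails.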

1.4.3. The Probit Model

The estimating model that emerges from the normal CDF is popularly known as the probit model. Here the observed dependent variable Y takes on one of the values 0 and 1 according to the following criterion. Define a latent variable Y* such that
$Y_i^* = X_i'\beta + \varepsilon_i$
$Y_i = 1$ if $Y_i^* > 0$
$Y_i = 0$ if $Y_i^* \le 0$
The latent variable Y* is continuous ($-\infty < Y^* < \infty$). It generates the observed binary variable Y.
The observed variable Y can be in one of two states:
i) if an event occurs it takes a value of 1
ii) if an event does not occur it takes a value of 0
The latent variable is assumed to be a linear function of the observed X's through the structural model.
- In the probit model, it is assumed that $Var(\varepsilon_i \mid X_i) = 1$.
- In the logit model, it is assumed that $Var(\varepsilon_i \mid X_i) = \pi^2/3$.
Summary
- Logit function

$P(Y = 1 \mid X) = \frac{e^{\alpha + \beta X_i}}{1 + e^{\alpha + \beta X_i}} = \frac{1}{1 + e^{-\alpha - \beta X_i}}$

- Probit function

$P(Y = 1 \mid X) = 1 - \Phi(-\alpha - \beta X_i) = \Phi(\alpha + \beta X_i)$

Where: $\Phi(\cdot)$ is the standard normal cumulative distribution function
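The two summary formulas can be evaluated side by side. This is a minimal sketch using only the Python standard library; the values of `alpha` and `beta` and the X values are hypothetical, and the same index is plugged into both CDFs only to compare their shapes (in practice the two models would give different coefficient estimates because their error variances differ).

```python
import math

def normal_cdf(z):
    """Standard normal CDF, Phi(z), written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic_cdf(z):
    """Cumulative logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameter values (assumptions for illustration)
alpha, beta = -4.0, 0.5

for x in (2.0, 8.0, 14.0):
    z = alpha + beta * x
    probit_p = normal_cdf(z)      # P(Y = 1 | X) under the probit model
    logit_p = logistic_cdf(z)     # P(Y = 1 | X) under the logit model
    print(f"X = {x:5.1f}  index = {z:5.2f}  probit P = {probit_p:.4f}  logit P = {logit_p:.4f}")
```

Both functions equal 0.5 at an index of zero and both stay between 0 and 1; the normal CDF approaches 0 and 1 faster than the logistic CDF for the same index value.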

CHAPTER TWO

3. INTRODUCTION TO SIMULTANEOUS EQUATION MODELS

3.1. The Nature of Simultaneous Equation Model

In the previous chapters we have focused exclusively on the problems and estimation of single-equation regression models. In such models, a dependent variable is expressed as a linear function of one or more explanatory variables. The cause-and-effect relationship in such models between the dependent and independent variables is unidirectional. That is, the explanatory variables are the cause and the dependent variable is the effect. But there are situations where such one-way or unidirectional causation is not meaningful. This occurs if, for instance, Y (the dependent variable) is not only a function of the X's (explanatory variables) but all or some of the X's are, in turn, determined by Y. There is, therefore, a two-way flow of influence between Y and (some of) the X's, which in turn makes the distinction between dependent and independent variables doubtful. Under such circumstances, we need to consider more than one regression equation, one for each of the interdependent variables, to understand the multi-directional flow of influence among the variables. This is precisely what is done in simultaneous equation models.
Some examples of SEMs in Economics
1. Demand-Supply Model
2. Keynesian Model of Income Determination
3. Wage–Price Models
4. The IS-LM Model of Macroeconomics
A system describing the joint dependence of variables is called a simultaneous equations model. The number of equations in such models is equal to the number of jointly dependent or endogenous variables involved in the phenomenon under analysis. Unlike in single-equation models, in simultaneous equation models it is not usually possible to estimate a single equation of the model without taking into account the information provided by the other equations of the system.

3.2. Simultaneity bias

If one applies OLS to estimate the parameters of each equation while disregarding the other equations of the model, the estimates so obtained are not only biased but also inconsistent; i.e., even if the sample size increases indefinitely, the estimators do not converge to their true values. The bias arising from an estimation procedure that treats each equation of the simultaneous equations model as though it were a single-equation model is known as simultaneity bias or simultaneous equation bias. To avoid this bias we use other methods of estimation, such as:

- Indirect Least Squares (ILS),
- Two Stage Least Squares (2SLS),
- Three Stage Least Squares (3SLS),
- Maximum Likelihood methods, and
- the Method of Instrumental Variables (IV).

What happens to the parameters of the relationship if we estimate by applying OLS to each equation without taking into account the information provided by the other equations in the system? One of the crucial assumptions of OLS is that the explanatory variables and the disturbance term are independent, i.e., the explanatory variables are truly exogenous. Symbolically: $E[X_i U_i] = 0$. As a result, the linear model can be interpreted as describing the conditional expectation of the dependent variable (Y) given a set of explanatory variables. In simultaneous equation models, this independence of explanatory variables and disturbance term is violated, i.e., $E[X_i U_i] \neq 0$. When this assumption is violated, the OLS estimator is biased and inconsistent.

Simultaneity bias of OLS estimators: The two-way causation in a relationship leads to violation of an important assumption of the linear regression model: a variable can be the dependent variable in one equation and also an explanatory variable in other equations of the simultaneous-equation model. In this case $E[X_i U_i]$ may be different from zero. To show simultaneity bias, let's consider the following simple simultaneous equation model:

$Y = \alpha_0 + \alpha_1 X + U$
$X = \beta_0 + \beta_1 Y + \beta_2 Z + V$ -------------------------------------------------- (10)

Suppose that the following assumptions hold:
$E(U) = 0$, $E(V) = 0$
$E(U^2) = \sigma_u^2$, $E(V^2) = \sigma_v^2$
$E(U_i U_j) = 0$, $E(V_i V_j) = 0$ for $i \neq j$, and also $E(U_i V_i) = 0$
where X and Y are endogenous variables and Z is an exogenous variable. The reduced form of X in the above model is obtained by substituting Y into the equation for X:
$X = \beta_0 + \beta_1(\alpha_0 + \alpha_1 X + U) + \beta_2 Z + V$

$X = \frac{\beta_0 + \alpha_0 \beta_1}{1 - \alpha_1 \beta_1} + \frac{\beta_2}{1 - \alpha_1 \beta_1} Z + \frac{\beta_1 U + V}{1 - \alpha_1 \beta_1}$ --------------------- (11)

Applying OLS to the first equation of the above structural model will result in a biased estimator because, from (11) and the assumptions above, $cov(X_i, U_i) = E(X_i U_i) = \frac{\beta_1 \sigma_u^2}{1 - \alpha_1 \beta_1} \neq 0$ whenever $\beta_1 \neq 0$.
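The inconsistency of OLS in this two-equation model can be illustrated by simulation. The sketch below generates data from the reduced form (11) with hypothetical parameter values and unit variances for Z, U and V (all of these numbers are assumptions made here for illustration). It compares the OLS slope with the true $\alpha_1$ and with a simple instrumental-variable estimate that uses the exogenous Z as the instrument, one of the alternative methods named earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical structural parameters (assumptions for illustration)
a0, a1 = 1.0, 0.5            # Y = a0 + a1*X + U
b0, b1, b2 = 2.0, 0.5, 1.0   # X = b0 + b1*Y + b2*Z + V

Z = rng.normal(size=n)       # exogenous variable, variance 1
U = rng.normal(size=n)       # disturbance of the Y equation, variance 1
V = rng.normal(size=n)       # disturbance of the X equation, variance 1

# Reduced form (11) for X, then the structural equation for Y
d = 1.0 - a1 * b1
X = (b0 + a0 * b1 + b2 * Z + b1 * U + V) / d
Y = a0 + a1 * X + U

# OLS slope of Y on X: cov(X, Y) / var(X)
c = np.cov(X, Y)
a1_ols = c[0, 1] / c[0, 0]

# Value OLS converges to: a1 + cov(X, U) / var(X), using unit variances
cov_xu = b1 / d
var_x = (b2**2 + b1**2 + 1.0) / d**2
a1_plim = a1 + cov_xu / var_x

# IV estimate with Z as the instrument: cov(Z, Y) / cov(Z, X)
a1_iv = np.cov(Z, Y)[0, 1] / np.cov(Z, X)[0, 1]

print("true a1:", a1)
print("OLS estimate:", a1_ols, " (OLS limit:", a1_plim, ")")
print("IV estimate:", a1_iv)
```

Even with a very large sample the OLS estimate stays away from the true value, because the gap reflects $cov(X, U) \neq 0$ rather than sampling noise, while the IV estimate settles near the true $\alpha_1$.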

The Definitions of Some Concepts

- Endogenous and exogenous variables
In simultaneous equation models variables are classified as endogenous and exogenous.
Endogenous variables are variables that are determined by the economic model (within the system). Exogenous variables are those determined outside the system. Predetermined variables fall into two groups, both of which are in general treated like exogenous variables: exogenous variables (current and lagged) and lagged endogenous variables. For instance, $X_t$ and $X_{t-1}$ denote current and lagged exogenous variables and $Y_{t-1}$ denotes a lagged endogenous variable, on the assumption that X's symbolize exogenous variables and Y's symbolize endogenous variables. Thus, $X_t$, $X_{t-1}$ and $Y_{t-1}$ are regarded as predetermined (exogenous) variables.
Since the exogenous variables are predetermined, they are supposed to be independent of the error terms in the model.
Consider the demand and supply functions.
$Q^d = \beta_0 + \beta_1 P + \beta_2 Y + U_1$ ------------------ (14)
$Q^s = \alpha_0 + \alpha_1 P + \alpha_2 R + U_2$ ------------------ (15)
Where: Q = quantity, Y = income, P = price, R = rainfall, and $U_1$ and $U_2$ are error terms.
Here P and Q are endogenous variables and Y and R are exogenous variables.
- Structural models
A structural model describes the complete structure of the relationships among the economic variables. Structural equations of the model may be expressed in terms of endogenous variables, exogenous variables and disturbances (random variables). The parameters of a structural model express the direct effect of each explanatory variable on the dependent variable. Variables not appearing explicitly in a function may have an indirect effect, which is taken into account by the simultaneous solution of the system. For instance, a change in consumption affects investment indirectly and is not considered in the consumption function. The effect of consumption on investment cannot be measured directly by any structural parameter, but is measured indirectly by considering the system as a whole.
Example: The following simple Keynesian model of income determination can be considered as a structural model.
$C = \alpha + \beta Y + U$ ----------------------------------------------- (16)
$Y = C + Z$ ---------------------------------------------------- (17)
for $\alpha > 0$ and $0 < \beta < 1$
Where: C = consumption expenditure
Z = non-consumption expenditure
Y = national income
C and Y are endogenous variables while Z is an exogenous variable.
- Reduced form of the model:
The reduced form of a structural model is the model in which the endogenous variables are expressed as a function of the predetermined variables and the error term only.



Illustration: Find the reduced form of the above structural model. Since C and Y are endogenous variables and only Z is exogenous, we have to express C and Y in terms of Z. To do this, substitute Y = C + Z into equation (16):
$C = \alpha + \beta(C + Z) + U$
$C = \alpha + \beta C + \beta Z + U$
$C - \beta C = \alpha + \beta Z + U$
$C(1 - \beta) = \alpha + \beta Z + U$

$C = \frac{\alpha}{1 - \beta} + \frac{\beta}{1 - \beta} Z + \frac{U}{1 - \beta}$ ---------------------------------- (18)
Substituting (18) into (17), we get:

$Y = \frac{\alpha}{1 - \beta} + \frac{1}{1 - \beta} Z + \frac{U}{1 - \beta}$ -------------------------------- (19)
Equations (18) and (19) are called the reduced form of the above structural model. We can write this more formally as:

Structural form equations      Reduced form equations
$C = \alpha + \beta Y + U$     $C = \frac{\alpha}{1 - \beta} + \frac{\beta}{1 - \beta} Z + \frac{U}{1 - \beta}$
$Y = C + Z$                    $Y = \frac{\alpha}{1 - \beta} + \frac{1}{1 - \beta} Z + \frac{U}{1 - \beta}$

Parameters of the reduced form measure the total effect (direct and indirect) of a change in exogenous variables on the endogenous variable. For instance, in the above reduced form equation (18), $\frac{\beta}{1 - \beta}$ measures the total effect of a unit change in non-consumption expenditure on consumption. This total effect is $\beta$, the direct effect of income on consumption, times $\frac{1}{1 - \beta}$, the indirect effect, i.e. the multiplier through which a unit change in Z changes income. The reduced form equations can be obtained in two ways:
1) By expressing the endogenous variables directly as a function of the predetermined variables.
2) By solving the structural system for the endogenous variables in terms of the predetermined variables, the structural parameters, and the disturbance terms.
Consider the following simple model for a closed economy.
$C_t = a_1 Y_t + U_1$ --------------------------------------------------------- (i)
$I_t = b_1 Y_t + b_2 Y_{t-1} + U_2$ ----------------------------------------------- (ii)
$Y_t = C_t + I_t + G_t$ ------------------------------------------------------- (iii)
This model has three equations in three endogenous variables ($C_t$, $I_t$, and $Y_t$) and two predetermined variables ($G_t$ and $Y_{t-1}$).
To obtain the reduced form of this model, we may use two methods (the direct method and the method of solving the structural model).
Direct method: Express the three endogenous variables ($C_t$, $I_t$, and $Y_t$) as functions of the two predetermined variables ($G_t$ and $Y_{t-1}$) directly, using $\pi$'s as the parameters of the reduced form model, as follows.

$C_t = \pi_{11} Y_{t-1} + \pi_{12} G_t + V_1$ ------------------------------------ (iv)
$I_t = \pi_{21} Y_{t-1} + \pi_{22} G_t + V_2$ ------------------------------------- (v)
$Y_t = \pi_{31} Y_{t-1} + \pi_{32} G_t + V_3$ ------------------------------------ (vi)

Note: $\pi_{11}$, $\pi_{12}$, $\pi_{21}$, $\pi_{22}$, $\pi_{31}$, and $\pi_{32}$ are reduced form parameters.

Solving the structural model: By solving the structural system for the endogenous variables in terms of the predetermined variables, structural parameters and disturbances, the expressions for the reduced form parameters can be obtained easily. For instance, the third structural equation (iii) can be expressed in reduced form as follows:
$Y_t = \frac{b_2}{1 - a_1 - b_1} Y_{t-1} + \frac{1}{1 - a_1 - b_1} G_t + \frac{U_1 + U_2}{1 - a_1 - b_1}$
This equation is obtained by simply substituting structural equations (i) and (ii) into (iii). From this expression:
$\pi_{31} = b_2 / (1 - a_1 - b_1)$
$\pi_{32} = 1 / (1 - a_1 - b_1)$
Exercise
a) Determine the reduced form equations for the structural equations (i) and (ii).
b) Indicate the expressions for $\pi_{11}$, $\pi_{12}$, $\pi_{21}$, and $\pi_{22}$ from (a) above.
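Answers to the exercise can be checked numerically by solving the three-equation closed-economy model as a linear system. The sketch below writes the structural equations (i)-(iii) in matrix form, A y = B x + u with y = (C_t, I_t, Y_t) and x = (Y_{t-1}, G_t), and computes the matrix of reduced form coefficients as the inverse of A times B. The parameter values are hypothetical, and the matrix names `A`, `B` and `Pi` are introduced here for illustration only.

```python
import numpy as np

# Hypothetical structural parameters (assumptions for illustration)
a1, b1, b2 = 0.6, 0.2, 0.1

# Structural form with endogenous variables ordered as (C_t, I_t, Y_t):
#   C_t - a1*Y_t        = U1              (i)
#   I_t - b1*Y_t        = b2*Y_{t-1} + U2 (ii)
#   Y_t - C_t - I_t     = G_t             (iii)
A = np.array([[1.0, 0.0, -a1],
              [0.0, 1.0, -b1],
              [-1.0, -1.0, 1.0]])

# Coefficients on the predetermined variables ordered as (Y_{t-1}, G_t)
B = np.array([[0.0, 0.0],
              [b2, 0.0],
              [0.0, 1.0]])

# Reduced form coefficients: row i holds (pi_i1, pi_i2)
Pi = np.linalg.solve(A, B)

d = 1.0 - a1 - b1
print("numerical (pi_31, pi_32):", Pi[2])
print("formula   (pi_31, pi_32):", b2 / d, 1.0 / d)
print("full reduced form matrix:")
print(Pi)
```

The third row reproduces the expressions for $\pi_{31}$ and $\pi_{32}$ obtained from equation (iii); the first two rows give numerical values against which hand-derived formulas for the other coefficients can be compared.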
How to estimate the reduced form parameters?
The estimates of the reduced form coefficients ($\pi$'s) may be obtained in two ways.
1) Direct estimation of the reduced form coefficients by applying OLS.
2) Indirect estimation of the reduced form coefficients:
Steps:
i) Solve the system for the endogenous variables so that each equation contains only predetermined explanatory variables. In this way we obtain the system of parameters' relations (relations between the $\pi$'s and the structural parameters).
ii) Obtain the estimates of the structural parameters by any appropriate econometric method.
iii) Substitute the estimates of the structural coefficients into the system of parameters' relations to find the estimates of the reduced form coefficients.

 Recursive models



A model is called recursive if its structural equations can be ordered in such a way that the first
equation includes only the predetermined variables in the right hand side; the second equation
contains predetermined variables and the first endogenous variable (of the first equation) in the
right hand side and so on. The special feature of recursive model is that its equations may be
estimated, one at a time, by OLS without simultaneous equations bias.

OLS is not applicable if there is interdependence between the explanatory variables and the error term. In simultaneous equation models, the endogenous variables may depend on the error terms of the model; hence the OLS technique is not appropriate for estimating an equation in a simultaneous equations model.

However, in a special type of simultaneous equations model called the recursive, triangular or causal model, the use of the OLS procedure of estimation is appropriate. Consider the following three-equation system to understand the nature of such models:

Y1=α10+β1 X1+β12 X2+U1 ¿}Y 2=α20+α21Y1+β21 X1+β2 X2+U2 ¿}¿¿¿


In the above illustration, as usual, the X’s and Y’s are exogenous and endogenous variables
respectively. The disturbance terms satisfy the following assumption:
$$E(U_1U_2) = E(U_1U_3) = E(U_2U_3) = 0$$
This is the most crucial assumption that defines the recursive model. If it does not hold,
the above system is no longer recursive and OLS is no longer valid. The first
equation of the system contains only exogenous variables on the right-hand side. Since,
by assumption, the exogenous variables are independent of U1, the first equation satisfies the
critical assumption of OLS. Hence OLS can be applied straightforwardly to this
equation.
Consider the second equation. It contains the endogenous variable Y1 as one of the explanatory
variables along with the non-stochastic X’s. OLS can be applied to this equation only if it can be
shown that Y1 and U2 are independent of each other. This is true because U1, which affects Y1, is
by assumption uncorrelated with U2, i.e. E(U1U2) = 0. Y1 acts as a predetermined variable in so
far as Y2 is concerned. Hence OLS can be applied to this equation. A similar argument can be
extended to the third equation because Y1 and Y2 are independent of U3. In this way, in the
recursive system OLS can be applied to each equation separately.
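The equation-by-equation argument can be checked by simulation. The following is a minimal sketch, not part of the handout: it assumes a simplified two-equation recursive system with one exogenous variable, independently drawn errors, and hypothetical true coefficients (1, 2) and (3, 0.5). The helper names `ols_simple` and `simulate_recursive` are illustrative only.

```python
import random


def ols_simple(x, y):
    """Return (intercept, slope) from a one-regressor OLS fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def simulate_recursive(n=5000, seed=0):
    """Simulate Y1 = 1 + 2*X + U1 and Y2 = 3 + 0.5*Y1 + U2 (hypothetical values)."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    u1 = [rng.gauss(0, 1) for _ in range(n)]
    u2 = [rng.gauss(0, 1) for _ in range(n)]  # drawn independently of u1
    y1 = [1 + 2 * xi + e for xi, e in zip(x, u1)]
    y2 = [3 + 0.5 * yi + e for yi, e in zip(y1, u2)]
    return x, y1, y2


x, y1, y2 = simulate_recursive()
eq1 = ols_simple(x, y1)   # first equation: only the exogenous X on the right
eq2 = ols_simple(y1, y2)  # second equation: Y1 behaves as a predetermined variable
print("equation 1 (intercept, slope):", eq1)
print("equation 2 (intercept, slope):", eq2)
```

Because U1 and U2 are drawn independently, the OLS estimates of both equations should land close to the values used in the simulation. If the two errors were correlated, Y1 would be correlated with U2 and the second estimate would no longer be reliable.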
Let us build a hypothetical recursive model for an agricultural commodity, say wheat. The
production of wheat, Y1, may be assumed to depend on exogenous factors: X2 = climatic
conditions and X3 = last season’s price. The retail price, Y2, may be assumed to be a function of
the production level Y1 and the exogenous factor X4 = disposable income. Finally, the price obtained
by the producer, Y3, can be expressed in terms of the retail price Y2 and the exogenous factor X5 =
the cost of marketing the product.
The relevant equations of the model may be described as under:

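$$
\begin{aligned}
Y_1 &= \alpha_1 + \alpha_2X_2 + \alpha_3X_3 + U_1 \\
Y_2 &= \alpha_4 + \beta_1Y_1 + \alpha_5X_4 + U_2 \\
Y_3 &= \alpha_6 + \beta_2Y_2 + \alpha_7X_5 + U_3
\end{aligned}
$$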
In the first equation, there are only exogenous variables, and they are assumed to be independent of
U1. In the second equation, the causal relation between Y1 and Y2 runs in one direction. Also, Y1 is
independent of U2 and can be treated just like an exogenous variable. Similarly, since Y2 is
independent of U3, OLS can be applied to the third equation. Thus, we can rewrite the above
equations as follows:

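$$
\begin{aligned}
Y_1 - \alpha_1 - \alpha_2X_2 - \alpha_3X_3 &= U_1 \\
-\beta_1Y_1 + Y_2 - \alpha_4 - \alpha_5X_4 &= U_2 \\
-\beta_2Y_2 + Y_3 - \alpha_6 - \alpha_7X_5 &= U_3
\end{aligned}
$$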
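We can again rewrite this in matrix form as follows:

$$
\underbrace{\begin{bmatrix} 1 & 0 & 0 \\ -\beta_1 & 1 & 0 \\ 0 & -\beta_2 & 1 \end{bmatrix}}_{\text{coefficient matrix of endogenous variables}}
\begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \end{bmatrix}
+
\underbrace{\begin{bmatrix} -\alpha_1 & -\alpha_2 & -\alpha_3 & 0 & 0 \\ -\alpha_4 & 0 & 0 & -\alpha_5 & 0 \\ -\alpha_6 & 0 & 0 & 0 & -\alpha_7 \end{bmatrix}}_{\text{coefficient matrix of exogenous variables}}
\begin{bmatrix} X_1 \\ X_2 \\ X_3 \\ X_4 \\ X_5 \end{bmatrix}
=
\begin{bmatrix} U_1 \\ U_2 \\ U_3 \end{bmatrix}
$$
Here X1 equals 1 for every observation, so its coefficients are the intercepts.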
The coefficient matrix of the endogenous variables is thus a triangular one; hence recursive models
are also called triangular models.

2.3 The Order and Rank Condition of identification problem


In simultaneous equation models, the problem of identification is a problem of model
formulation; it is not concerned with the estimation of the model. The estimation of the model
depends on the empirical data and the form of the model. If the model is not in the proper
statistical form, its parameters may not be uniquely estimated even though
adequate and relevant data are available. In the language of econometrics, a model is said to be
identified only when it is in a unique statistical form that enables us to obtain unique estimates of its
parameters from the sample data.
There are three possible situations of identification:
 Exactly identified
 Over identified
 Under identified
By observing the correspondence between reduced-form and structural-form coefficients, it is
possible to determine whether a given equation is exactly identified, over identified or under
identified, as follows:
• An equation is exactly identified if there is a one-to-one correspondence
between reduced-form coefficients and structural-form coefficients. We get a unique solution
in this case.
• If the number of reduced-form coefficients exceeds that of structural-form
coefficients, we have over identification (no unique solution): more than
sufficient information is available.
• If the number of reduced-form coefficients is less than that of structural-form
coefficients, we have under identification (no solution can be found): not enough
information is available.
Formal Rules (Conditions) for Identification

In applying the identification rules, we should either ignore the constant term, or, if we want to
retain it, we must include in the set of variables a dummy variable (say X 0) which would always
take on the value 1. Let’s ignore the constant intercept. There are two formal rules for
identification

i) Order condition and


ii) Rank Condition for identification.
Here we shall discuss the order condition.
1. The order condition for identification
This condition is based on a counting rule of the variables included and excluded from the
particular equation. It is a necessary but not sufficient condition for the identification of an
equation. The order condition may be stated as follows.

For an equation to be identified, the total number of variables (endogenous and exogenous)
excluded from it must be equal to or greater than the number of endogenous variables in the
model less one. Given that in a complete model the number of endogenous variables is equal to
the number of equations of the model, the order condition for identification is sometimes stated
in the following equivalent form.
Let G be the number of endogenous variables (equations) in the system, K the total number of
variables in the model (endogenous and predetermined), and M the number of variables,
endogenous and exogenous, included in the equation under consideration. Then k = K − M is the
total number of variables missing from that equation, and if:
a) k = G−1, the equation is exactly identified
b) k > G−1, the equation is over identified
c) k < G−1, the equation is under identified
The order condition for identification may then be expressed symbolically as:
$$\underbrace{(K - M)}_{\text{excluded variables}} \;\ge\; \underbrace{(G - 1)}_{\text{total number of equations less one}}$$
Examples: State the identifiability status of each of the following equations using the order
condition stated above.
1) A system contains 10 equations with 15 variables, ten endogenous and five exogenous, and
the equation under consideration contains 11 variables. For this equation we have
G = 10, K = 15, M = 11
Order condition: (K − M) ≥ (G − 1)
(15 − 11) < (10 − 1); that is, the order condition is not satisfied, and thus the equation is
not identified (under identified).
2) A system contains 10 equations with 15 variables, ten endogenous and five exogenous, and
the equation under consideration contains 5 variables. Here G = 10, K = 15, M = 5, so
(15 − 5) > (10 − 1); the order condition is satisfied, and by rule (b) the equation is over identified.
The order condition is necessary for a relation to be identified, but it is not
sufficient; that is, it may be fulfilled in a particular equation and yet the relation may not be
identified.
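The counting rule can be written as a small function. This is a sketch of the order condition only; the function name `order_condition` is hypothetical, and because the order condition is necessary but not sufficient, the label it returns reflects the counting rule alone.

```python
def order_condition(G, K, M):
    """Classify an equation by the order condition.

    G: number of endogenous variables (equations) in the system
    K: total number of variables in the model
    M: number of variables included in the equation under consideration
    """
    excluded = K - M  # variables missing from the equation
    if excluded == G - 1:
        return "exactly identified"
    if excluded > G - 1:
        return "over identified"
    return "under identified"


print(order_condition(G=10, K=15, M=11))  # under identified, since 4 < 9
print(order_condition(G=10, K=15, M=5))   # over identified, since 10 > 9
```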


CHAPTER THREE

3. INTRODUCTION TO BASIC REGRESSION ANALYSIS WITH TIME SERIES ECONOMETRICS

3.1. The Nature of Time Series Data

Time-series data are a set of observations on a quantitative variable collected over discrete
intervals of time.
• Examples include the annual price of wheat in the United States and the daily price of General
Electric stock shares. Macroeconomic data are usually reported in monthly, quarterly, or
annual terms.
• Financial data, such as stock prices, can be recorded daily, or at even higher frequencies.
The key feature of time-series data is that the same economic quantity is recorded at regular time
intervals. A time-series data set consists of observations on one or several variables over
time. In time series analysis, we analyze the past behavior of a variable in order to predict its
future behavior.
Some Time Series Terms
• Stationary - a time series variable exhibiting no significant upward or downward trend over time.
• Nonstationary - a time series variable exhibiting a significant upward or downward trend over time.
• Seasonal Data - a time series variable exhibiting a repeating pattern at regular intervals over time.

• Univariate time-series analysis- analysis of single sequence of data describing the behavior of one
variable in terms of its own past values.
[Figure: a stationary time series]

[Figure: a non-stationary time series (upward trend)]

Time Series Models:

It is important to distinguish between cross-section data (data on a number of economic units at a
particular point in time) and time-series data (data collected over time on one particular economic
unit). When we say ‘‘economic units’’ we could be referring to individuals, households, firms,
geographical regions, countries, or some other entity on which data are collected.

Cross-sectional observations are collected at a given point in time. Time-series observations, on
the other hand, are collected on a given economic unit over a number of time periods.

A second distinguishing feature of time-series data is its natural ordering according to time. With
cross-section data there is no particular ordering of the observations that is better or more natural
than another.
Given that the effects of changes in variables are not always instantaneous, we need to ask how to
model the dynamic nature of relationships. One option is a dynamic model with lagged values of both the
dependent and explanatory variables, such as
Yt = f(Yt−1, Xt, Xt−1, Xt−2) -------------------------------(3.1)
Such models are called autoregressive distributed lag (ARDL) models, with ‘‘autoregressive’’
meaning a regression of Yt on its own lag or lags.
Examples: Consider the following:
Yt = 1.2Yt−1 + εt ------------------------------------ the AR(1) model
or Yt = δ + 1.2Yt−1 + εt ----------------------------- the AR(1) model with intercept
Yt = 1.2Yt−1 − 0.32Yt−2 + εt ------------------------- the AR(2) model
or Yt = δ + 1.2Yt−1 − 0.32Yt−2 + εt ------------------ the AR(2) model with intercept
In general:
Yt = δ + θ1Yt−1 + θ2Yt−2 + εt ------------------------ the AR(2) model
Yt = δ + θ1Yt−1 + θ2Yt−2 + θ3Yt−3 + εt --------------- the AR(3) model
...
Yt = δ + θ1Yt−1 + θ2Yt−2 + ··· + θpYt−p + εt --------- the AR(p) model
Additional autoregressive models:
Ut = ρUt−1 + εt -------------------------------------- first-order autoregressive, or
Yt = ρ1Yt−1 + ρ2Yt−2 + εt ---------------------------- second-order autoregressive
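To see what "a regression of Yt on its own lags" means in practice, the sketch below simulates the AR(2) example with coefficients 1.2 and −0.32 and then recovers them by OLS. It is an illustration under assumptions that are not in the handout: standard normal shocks, no intercept, and the hypothetical helper names `simulate_ar2` and `fit_ar2`.

```python
import random


def simulate_ar2(theta1, theta2, n, seed=0, burn_in=200):
    """Simulate Y_t = theta1*Y_{t-1} + theta2*Y_{t-2} + e_t with N(0, 1) shocks."""
    rng = random.Random(seed)
    y = [0.0, 0.0]
    for _ in range(n + burn_in):
        y.append(theta1 * y[-1] + theta2 * y[-2] + rng.gauss(0, 1))
    return y[-n:]  # drop the start-up values


def fit_ar2(y):
    """OLS estimates (theta1, theta2) of an AR(2) without intercept."""
    target = y[2:]    # Y_t
    lag1 = y[1:-1]    # Y_{t-1}
    lag2 = y[:-2]     # Y_{t-2}
    s11 = sum(a * a for a in lag1)
    s22 = sum(b * b for b in lag2)
    s12 = sum(a * b for a, b in zip(lag1, lag2))
    s1y = sum(a * c for a, c in zip(lag1, target))
    s2y = sum(b * c for b, c in zip(lag2, target))
    # Solve the 2x2 normal equations by Cramer's rule.
    det = s11 * s22 - s12 * s12
    theta1 = (s1y * s22 - s12 * s2y) / det
    theta2 = (s11 * s2y - s12 * s1y) / det
    return theta1, theta2


series = simulate_ar2(1.2, -0.32, n=5000)
print(fit_ar2(series))
```

With a long simulated series the two estimates should be close to 1.2 and −0.32, the values used to generate the data.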
Analysis of several sets of data (variables) for the same sequence of time periods is called
multivariate time-series analysis. An example is the analysis of the relationships among the
price level, money supply and GDP on the basis of, say, quarterly or annual data.
The main purpose of time-series analysis is to study the dynamics or temporal structure of the
data.

3.2. Stationary and Non-stationary stochastic processes

A collection of random variables Yt ordered in time is called a stochastic process or random
process. There are two classes of stochastic process:
• Stationary stochastic process: gives rise to a stationary time series.
• Non-stationary stochastic process: gives rise to a non-stationary time series.
Stationary Stochastic Processes

A stochastic process is said to be stationary if its mean and variance are constant over time (they do
not depend on time and do not change as time changes). Moreover, the value of the covariance
between two time periods depends only on the lag between the two periods and not on
the actual time at which it is computed. A non-stationary time series has a time-varying mean or a
time-varying variance, or both.
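The three conditions in this definition can be written compactly. The symbols μ, σ² and γ_k below are notation assumed here for the constant mean, the constant variance and the covariance at lag k; they do not appear in the handout itself.

```latex
\begin{aligned}
\text{Mean:}       \quad & E(Y_t) = \mu \\
\text{Variance:}   \quad & \operatorname{Var}(Y_t) = E(Y_t - \mu)^2 = \sigma^2 \\
\text{Covariance:} \quad & \operatorname{Cov}(Y_t, Y_{t+k}) = E\big[(Y_t - \mu)(Y_{t+k} - \mu)\big] = \gamma_k
\end{aligned}
\qquad \text{for all } t .
```

None of μ, σ² or γ_k carries a t subscript: the covariance depends on the lag k only.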
Non-stationary Stochastic Processes

In practical research one often encounters non-stationary time series. The classic example is the
Random Walk Model (RWM). We distinguish two types of random walks:
• Random walk model without drift (no intercept term)
• Random walk model with drift (a constant term is present).

Random Walk without Drift

The process Yt is said to be a random walk without drift if

Yt = Yt−1 + ut
where ut is a white noise error term (an error term with mean 0 and variance σ2).
This model says that the value of Y at time period t (i.e., Yt) is equal to its value at time (t−1) plus a
random shock (ut). It is an AR(1) model, because Yt is regressed on itself lagged one period.

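We can write the above model as:

$$
\begin{aligned}
Y_1 &= Y_0 + u_1 \\
Y_2 &= Y_1 + u_2 = Y_0 + u_1 + u_2 \\
Y_3 &= Y_2 + u_3 = Y_0 + u_1 + u_2 + u_3
\end{aligned}
$$
In general, if the process started at some time 0 with a value of Y0, we have:
$$
Y_t = Y_0 + \sum u_t, \qquad E(Y_t) = E\Big(Y_0 + \sum u_t\Big) = Y_0 \;\; \text{(Why?)}, \qquad \operatorname{Var}(Y_t) = t\sigma^2
$$
The mean of Y is constant, but its variance increases with t, which violates a condition of
stationarity. In short, the RWM without drift is a non-stationary stochastic process.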
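A simulation makes the source of non-stationarity visible: across many simulated random walks the mean stays near Y0, while the variance at time t grows roughly like tσ². The sketch below assumes Y0 = 0, σ = 1 and normally distributed shocks; the function names are hypothetical.

```python
import random
from statistics import mean, pvariance


def random_walk_paths(n_paths, n_steps, y0=0.0, sigma=1.0, seed=0):
    """Simulate n_paths independent random walks Y_t = Y_{t-1} + u_t."""
    rng = random.Random(seed)
    paths = []
    for _ in range(n_paths):
        y = [y0]
        for _ in range(n_steps):
            y.append(y[-1] + rng.gauss(0, sigma))
        paths.append(y)
    return paths


def cross_section_moments(paths, t):
    """Mean and variance of Y_t computed across the simulated paths."""
    values = [path[t] for path in paths]
    return mean(values), pvariance(values)


paths = random_walk_paths(n_paths=2000, n_steps=100)
for t in (10, 50, 100):
    m, v = cross_section_moments(paths, t)
    print(f"t={t:3d}  mean={m:6.2f}  variance={v:7.2f}")
```

The printed variances should be close to 10, 50 and 100, while the means stay close to zero.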

Random Walk with Drift (with intercept)


Let us modify the above RWM as follows:

$$Y_t = \delta + Y_{t-1} + u_t$$

where δ is known as the drift parameter.
Why do we call it drift? Because if we write the preceding equation as

$$Y_t - Y_{t-1} = \Delta Y_t = \delta + u_t$$

it shows that Yt drifts upward or downward, depending on whether δ is positive
or negative. Note that the RWM with drift is also an AR model. In general, therefore, we can conclude
that the Random Walk Model (with or without drift) is a non-stationary stochastic process.
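The conclusion for the drift case can be checked by repeated substitution. This is a sketch that assumes, as in the no-drift case, a fixed starting value Y0 and a white noise error ut with variance σ²:

```latex
\begin{aligned}
Y_t &= \delta + Y_{t-1} + u_t
     = 2\delta + Y_{t-2} + u_{t-1} + u_t
     = \dots
     = Y_0 + t\delta + \sum_{s=1}^{t} u_s \\
E(Y_t) &= Y_0 + t\delta \\
\operatorname{Var}(Y_t) &= t\sigma^2
\end{aligned}
```

With drift, both the mean and the variance change with t, so the process violates the conditions of stationarity.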

3.3. The Unit Root Stochastic Process

The random walk model is an example of what is known as a unit root process. Let us write the
RWM as:

$$Y_t = \rho Y_{t-1} + u_t, \qquad -1 \le \rho \le 1$$

• If ρ = 1, the model becomes a RWM (that is, a RWM without drift).
• If ρ is in fact equal to 1, we face what is known as the unit root problem, that is, a
situation of non-stationarity, because in this case the variance of Yt is not constant over time.
The name unit root is due to the fact that ρ = 1. Thus, the terms non-stationarity, random walk,
and unit root can be treated as synonymous.
• If, however, |ρ| < 1, that is, if the absolute value of ρ is less than one, then the time series Yt is
stationary in the sense we have defined it.

In practical research, it is important to find out whether a time series has a unit root (that is,
whether it is non-stationary).
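As a rough illustration, ρ can be estimated by regressing Yt on Yt−1. The sketch below simulates one stationary series (ρ = 0.5) and one unit root series (ρ = 1) and estimates ρ by OLS without an intercept. The helper names and the normal shocks are assumptions, and the sketch only illustrates estimation: it is not a formal unit root test.

```python
import random


def simulate_ar1(rho, n, seed=0):
    """Simulate Y_t = rho*Y_{t-1} + u_t with N(0, 1) shocks, starting at Y_0 = 0."""
    rng = random.Random(seed)
    y = [0.0]
    for _ in range(n):
        y.append(rho * y[-1] + rng.gauss(0, 1))
    return y


def estimate_rho(y):
    """OLS slope from regressing Y_t on Y_{t-1} without an intercept."""
    num = sum(prev * cur for prev, cur in zip(y[:-1], y[1:]))
    den = sum(prev * prev for prev in y[:-1])
    return num / den


for rho in (0.5, 1.0):
    series = simulate_ar1(rho, n=2000, seed=1)
    print(f"true rho = {rho}, estimated rho = {estimate_rho(series):.3f}")
```

An estimate near 1 suggests that a unit root may be present, while an estimate well inside (−1, 1) is what a stationary series would produce.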

CHAPTER FOUR

4. INTRODUCTION TO PANEL DATA ANALYSIS

4.1. Introduction
Panel data combine cross-section and time-series data. In panel data the
same cross-sectional unit (industry, firm, country) is surveyed over time, so we have data
which are pooled over space as well as time.
Reasons for using Panel Data
• Panel data can take explicit account of individual-specific heterogeneity (“individual”
here means related to the micro unit).
• By combining data in two dimensions, panel data give more data variation, less
collinearity and more degrees of freedom.
• Panel data are better suited than cross-sectional data for studying the dynamics of change.
For example, they are well suited to understanding transition behaviour such as
company bankruptcy or merger.
Autocorrelation
Although autocorrelation in panel models differs from that in the usual OLS models, a version of
the Durbin-Watson test can be used in the usual way (EViews reports this). To remedy
autocorrelation we can use the usual methods, such as the Error Correction Model. ‘Dynamic
Models’, which basically involve adding a lagged dependent variable, are also often used.
Recently, adjusting the standard errors has become popular; the most common method is the
‘Newey-West’ adjusted standard errors.
Heteroskedasticity
Given that there is a cross-section component to panel data, there will always be a potential for
heteroskedasticity. Although there are various tests for heteroskedasticity, as with autocorrelation
there is a tendency to automatically use adjusted standard errors, which make inference robust to
the problem. With heteroskedasticity, it is usually White’s adjusted standard errors that are used.
Example: the data consist of 20 countries observed over 10 years of annual data, giving 200
observations in all (N = 20, T = 10, N × T = 200), and stock prices are
regressed against expenditure on research (r). The results are interpreted in the usual way;
however, you would need to decide whether you wished to use fixed or random effects in this model.
Panel or longitudinal data sets consist of repeated observations for the same units, firms,
individuals or other economic agents. Typically the observations are at different points in time.
Let Yit denote the outcome for unit i in period t, and Xit a vector of explanatory variables. The
index i denotes the unit and runs from 1 to N, and the index t denotes time and runs from 1 to T.
There are two types of panel data:

• Balanced panel data: every sampling unit is observed in the same time periods, so the number
of observations is the same in each period.
E.g. year1 = year2 = year3 = 300 households
• Unbalanced panel data: the number of observations differs across units or across
points in time.
E.g. year1 = 300, year2 = 295, year3 = 270
– Households move to other places, or members/household heads die
– Firms go out of business
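Whether a panel is balanced can be checked mechanically from the pairs of unit and period identifiers. The sketch below is an illustration using hypothetical household ids and years rather than the handout's data file; the function name `is_balanced` is an assumption.

```python
from collections import defaultdict


def is_balanced(records):
    """Return True if every unit is observed in every period.

    records: iterable of (unit_id, period) pairs in long format.
    """
    periods_by_unit = defaultdict(set)
    for unit, period in records:
        periods_by_unit[unit].add(period)
    if not periods_by_unit:
        return True  # an empty panel is trivially balanced
    all_periods = set().union(*periods_by_unit.values())
    return all(periods == all_periods for periods in periods_by_unit.values())


# Three hypothetical households observed in three years.
balanced = [(hid, year) for hid in (1, 2, 3) for year in (2010, 2011, 2012)]
# Household 3 drops out in the last year.
unbalanced = [rec for rec in balanced if rec != (3, 2012)]

print(is_balanced(balanced))    # True
print(is_balanced(unbalanced))  # False
```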
Panel data in Stata
• Use the data file ‘Epanel’.
• Check the format in which it is presented: make sure that it is in long format as
opposed to wide format.
• Make sure that the data have two identifiers: the entity id (hid) and the panel period (year).
• Make sure that the entity id is unique within a panel period.
• Declare to Stata that your data are panel data using this command: xtset hid year
The key issue with panel data is that Yit (the outcome in period t) and Yis (the outcome in
period s) tend to be correlated even conditional on the covariates Xit and Xis.

Let us look at this in a linear model:
$$Y_{it} = X_{it}\beta + c_i + u_{it}$$
• What is “Mr. C” (the term ci)?
It is called the unobserved individual effect.
– It is unobserved, e.g. genetic make-up
– It is individual-specific
– It is time-invariant: it stays the same over time
– It is random.
It creates the correlation between Yit and Yis even when the error term is uncorrelated over time
and across units.
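A short derivation shows where this correlation comes from. It is a sketch under stated assumptions: the linear model is taken to be Yit = Xit β + ci + uit, the effect ci has variance σ_c² (notation assumed here), and uit is uncorrelated with ci and with uis for t ≠ s.

```latex
\begin{aligned}
\operatorname{Cov}(Y_{it}, Y_{is} \mid X_i)
  &= \operatorname{Cov}(c_i + u_{it},\; c_i + u_{is}) \\
  &= \operatorname{Var}(c_i) + \operatorname{Cov}(c_i, u_{is})
     + \operatorname{Cov}(u_{it}, c_i) + \operatorname{Cov}(u_{it}, u_{is}) \\
  &= \sigma_c^2 \neq 0 \qquad (t \neq s)
\end{aligned}
```

The last three covariances are zero by assumption, so the whole correlation between two periods for the same unit comes from the shared term ci.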
4.2. Estimation of Panel Data Regression Model: The Fixed Effect Approach
Use fixed effects provided the following assumptions are fulfilled:
• Assumption1: Strict Exogeneity:
• Assumption2: Uncorrelated Effects:
Fixed effects estimation is also known as:
• the covariance model
• the within estimator
• the individual dummy variable model
• the least squares dummy variable (LSDV) model.
Each entity has its own individual characteristics that may or may not influence the
predictor variables. Fixed effects removes the effect of those time-invariant characteristics from
the predictor variables so that we can assess the predictors’ net effect. Each entity is different;
therefore the entity’s error term and the constant (which captures individual characteristics)
should not be correlated with those of other entities. If the error terms are correlated, then FE is
not suitable, since inferences may not be correct, and you need to model that relationship
(probably using random effects); this is the main rationale for the Hausman test (presented later
on in this document).
Another important assumption of the FE model is that those time-invariant characteristics are
unique to the individual and should not be correlated with other individual characteristics.
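The way fixed effects removes time-invariant characteristics can be illustrated with the within transformation: subtract each entity's own mean from its observations, then run OLS on the demeaned data. The sketch below is an assumption-laden illustration with one regressor, simulated data and a true slope of 2; the names `within_estimator`, `pooled_ols_slope` and `simulate_panel` are hypothetical.

```python
import random
from collections import defaultdict


def within_estimator(ids, x, y):
    """Fixed effects (within) slope for one regressor: OLS on entity-demeaned data."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])  # per entity: sum of x, sum of y, count
    for i, xi, yi in zip(ids, x, y):
        totals[i][0] += xi
        totals[i][1] += yi
        totals[i][2] += 1
    num = den = 0.0
    for i, xi, yi in zip(ids, x, y):
        sum_x, sum_y, n = totals[i]
        dx = xi - sum_x / n  # deviation from the entity mean of x
        dy = yi - sum_y / n  # deviation from the entity mean of y
        num += dx * dy
        den += dx * dx
    return num / den


def pooled_ols_slope(x, y):
    """Slope of a pooled OLS regression of y on x that ignores the panel structure."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate_panel(n_units=200, n_periods=5, beta=2.0, seed=0):
    """Simulate Y_it = beta*X_it + c_i + u_it, with X_it correlated with c_i."""
    rng = random.Random(seed)
    ids, x, y = [], [], []
    for i in range(n_units):
        c = rng.gauss(0, 1)            # unobserved, time-invariant effect
        for _ in range(n_periods):
            xi = c + rng.gauss(0, 1)   # regressor correlated with the effect
            ids.append(i)
            x.append(xi)
            y.append(beta * xi + c + rng.gauss(0, 1))
    return ids, x, y


ids, x, y = simulate_panel()
print("pooled OLS slope:", round(pooled_ols_slope(x, y), 3))
print("within (FE) slope:", round(within_estimator(ids, x, y), 3))
```

Because the simulated regressor is correlated with the unobserved effect, pooled OLS overstates the slope, while the within estimate should be close to the value of 2 used in the simulation.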
4.3. Estimation of Panel Data Regression Model: Random Effect Estimation

The rationale behind the random effects model is that, unlike the fixed effects model, the variation
across entities is assumed to be random and uncorrelated with the predictor or independent
variables included in the model. This is a very strong assumption to make in empirical analysis. If
you have reason to believe that differences across entities have some influence on your
dependent variable, then you should use random effects. An advantage of random effects is that
you can include time-invariant variables (e.g. gender). In the fixed effects model these variables
are absorbed by the intercept.

To decide between fixed and random effects you can run a Hausman test, where the null hypothesis
is that the preferred model is random effects and the alternative is fixed effects (see Greene,
2008, chapter 9). It basically tests whether the unique errors (unobserved individual
characteristics) are correlated with the regressors.
Conclusion:
• Panel data methods estimate models on data that are both time series and cross
sectional.
• They have both advantages and disadvantages relative to OLS estimation.
• They apply to many different techniques, such as tests for stationarity.
Assignment of Econometrics II for 3rd Yr Regular Students in 2022 (20%)
INSTRUCTION
a. Work in groups of 10 students (assigned alphabetically).
b. Unreadable handwriting and copy-paste will lose points.
c. Submit it on 30/08/2014 E.C.
d. Don’t use more than three pages.
e. You must write your name, Id no and section correctly.
f. Don’t use a computer for writing the assignment.

1. Define the panel data regression model. (3%)
2. Why do econometricians use panel data? (3%)
3. List and justify the two types of panel data. (4%)
4. Define balanced panel data and unbalanced panel data. (4%)
5. Explain the assumptions which should be fulfilled in estimating a panel data
regression model by fixed effects. (3%)
6. Write the advantage of using the random effect approach in estimating a
panel data regression model. (3%)



By: Dugasa J. & Efa W.
