Econ 471 Notes 1


Chapter 1

The Two Variable Linear Model

1.1 The Basic Linear Model


The goal of this section is to build a simple model for the non-exact relationship between
two variables Y and X, related by some economic theory. For example, consumption and
income, quantity consumed and price, etc.
The proposed model:

Y_i = \alpha + \beta X_i + u_i,    i = 1, \dots, n    (1.1)

where α and β are unknown parameters which are the objects of estimation. What we
will call data are the n realizations of (X_i, Y_i). We are abusing notation a bit by using the
same letters to refer to random variables and their realizations.
u_i is an unobserved random variable which represents the fact that the relationship
between Y and X is not exactly linear. We will momentarily assume that u_i has expected
value zero. Note that if u_i = 0, then the relationship between Y_i and X_i would be exactly
linear, so it is the presence of u_i that breaks the exact nature of the relationship. Y is
usually referred to as the explained or dependent variable; X is the explanatory or independent
variable.
We will refer to u_i as the error term, a terminology more appropriate in
the experimental sciences, where a cause x (say, the dose of a drug) is administered to
different subjects and then an effect y is measured (say, body temperature). In this case u_i
might be a measurement error due to the erratic behavior of a measurement instrument (for
example, a thermometer). In a social science like economics, u_i represents a broader notion
of ignorance: whatever is not observed (through ignorance, omission, etc.) that
affects y besides x.

[ FIGURE 1: SCATTER DIAGRAM ]

The first goal will be to find reasonable estimates for α and β based solely on the data,
that is, (X_i, Y_i), i = 1, . . . , n.


1.2 The Least Squares Method


Let us denote by α̂ and β̂ the estimates of α and β in the simple linear model. Let us
also define the following quantities. The first one is an estimate of Y:

\hat{Y}_i \equiv \hat{\alpha} + \hat{\beta} X_i

Intuitively, we have replaced α and β by their estimates and treated u_i as if the relationship
were exactly linear, i.e., as if u_i were zero. Ŷ_i will be understood as an estimate of Y_i.
Then it is natural to define a notion of estimation error as follows:

e_i \equiv Y_i - \hat{Y}_i

which measures the difference between Y_i and its estimate.


A natural goal is to find α̂ and β̂ so that the e_i's are small in some sense. It is interesting to
see how the problem works from a graphical perspective. Data will correspond to n points
scattered in the (X, Y) plane. The presence of a linear relationship like (1.1) is consistent
with points scattered around an imaginary straight line. Note that if u_i were indeed zero,
all points would lie along the same line, consistent with an exact linear relationship. As
mentioned above, it is the presence of u_i that breaks this exact relationship.
Now note that for any given values of α̂ and β̂, the points determined by the fitted
model:

\hat{Y} = \hat{\alpha} + \hat{\beta} X

correspond to a line in the (X, Y) plane. Hence different values of α̂ and β̂ correspond
to different estimated lines, which implies that choosing particular values is equivalent to
choosing a specific line on the plane. For the i-th observation, the estimation error e_i can
be seen graphically as the vertical distance between the point (X_i, Y_i) and (X_i, Ŷ_i), that
is, between (X_i, Y_i) and the fitted line. So, intuitively, we want values of α̂ and β̂ such that the
fitted line they induce passes as close as possible to all the points in the scatter, so that errors
are as small as possible.

[ FIGURE 2: SCATTER DIAGRAM WITH CANDIDATE LINE]

Note that if we had only two observations, the problem has a very simple solution: it
reduces to finding the only two values of α̂ and β̂ that make the estimation errors exactly equal
to zero. Graphically, this is possible since it is equivalent to finding the only straight
line that passes through the two available observations. Trivially, in this extreme case all
estimation errors will be zero.
The more realistic case appears when we have more than two observations, not all of
them lying on a single line. Obviously, a line cannot pass through more than two non-
aligned points, so we cannot make all errors equal to zero. So now the problem is to find
values of α̂ and β̂ that determine a line that passes as close as possible to all the points,
so that estimation errors are, in the aggregate, small. For this we need to introduce a criterion
of what we mean by the line being close to or far from the points.

Let us define a penalty
function, which consists of adding all the squared estimation errors, so that positive and
negative errors matter alike. For any α̂ and β̂, this will give us an idea of how large the
aggregate estimation error is:

SSR(\hat{\alpha}, \hat{\beta}) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{\alpha} - \hat{\beta} X_i)^2

SSR stands for sum of squared residuals. Note that given the observations Y_i and X_i,
this is a function that depends on α̂ and β̂; that is, different values of α̂ and β̂ correspond
to different lines that pass through the data points, implying different estimation errors. It
is now natural to look for α̂ and β̂ so as to make this aggregate error as small as possible.
The values of α̂ and β̂ that minimize the sum of squared residuals are:
\hat{\beta} = \frac{\sum X_i Y_i - n \bar{Y} \bar{X}}{\sum X_i^2 - n \bar{X}^2}

and

\hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X}

which are known as the least squares estimators of α and β.
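
To make the formulas concrete, here is a minimal numerical sketch in Python; the data values, variable names, and the use of numpy are illustrative assumptions, not part of the original notes.

import numpy as np

# Illustrative data: n observations of X (e.g., income) and Y (e.g., consumption)
X = np.array([10.0, 12.0, 15.0, 18.0, 20.0, 23.0, 25.0, 28.0])
Y = np.array([8.0, 9.5, 11.0, 13.5, 14.0, 17.0, 18.5, 20.0])
n = len(Y)

# Least squares estimates, using the closed-form expressions above
beta_hat = (np.sum(X * Y) - n * Y.mean() * X.mean()) / (np.sum(X**2) - n * X.mean()**2)
alpha_hat = Y.mean() - beta_hat * X.mean()

print(alpha_hat, beta_hat)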

Derivation of the Least Squares Estimators


The next paragraphs show how to obtain these estimators. Fortunately, it is easy to
show that SSR(α̂, β̂) is globally convex and differentiable, so the first order conditions for
a minimum are:

\frac{\partial SSR(\hat{\alpha}, \hat{\beta})}{\partial \hat{\alpha}} = 0

\frac{\partial SSR(\hat{\alpha}, \hat{\beta})}{\partial \hat{\beta}} = 0

The first order condition with respect to α̂ is:

\frac{\partial \sum e_i^2}{\partial \hat{\alpha}} = -2 \sum (Y_i - \hat{\alpha} - \hat{\beta} X_i) = 0    (1.2)


Dividing by minus 2 and distributing the summations:
\sum Y_i = n \hat{\alpha} + \hat{\beta} \sum X_i    (1.3)
This last expression is very important, and we will return to it frequently. From the
second first order condition:

\frac{\partial \sum e_i^2}{\partial \hat{\beta}} = -2 \sum X_i (Y_i - \hat{\alpha} - \hat{\beta} X_i) = 0    (1.4)


Dividing by -2 and distributing the summations:
\sum X_i Y_i = \hat{\alpha} \sum X_i + \hat{\beta} \sum X_i^2    (1.5)

Equations (1.3) and (1.5) form a system of two linear equations in the two unknowns α̂ and β̂,
known as the normal equations.
Dividing (1.3) by n and solving for α̂ we get:

\hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X}    (1.6)

Replacing in (1.5):

\sum X_i Y_i = (\bar{Y} - \hat{\beta} \bar{X}) \sum X_i + \hat{\beta} \sum X_i^2

\sum X_i Y_i = \bar{Y} \sum X_i - \hat{\beta} \bar{X} \sum X_i + \hat{\beta} \sum X_i^2

\sum X_i Y_i - \bar{Y} \sum X_i = \hat{\beta} \left( \sum X_i^2 - \bar{X} \sum X_i \right)

\hat{\beta} = \frac{\sum X_i Y_i - \bar{Y} \sum X_i}{\sum X_i^2 - \bar{X} \sum X_i}

Note that since \bar{X} = \sum X_i / n, we have \sum Z_i = n \bar{Z} for any variable Z. Replacing, we get:

\hat{\beta} = \frac{\sum X_i Y_i - n \bar{Y} \bar{X}}{\sum X_i^2 - n \bar{X}^2}    (1.7)

It will be useful to adopt the following notation: x_i = X_i - \bar{X} and y_i = Y_i - \bar{Y}, so
lowercase letters denote the observations as deviations from their sample means.
Using this notation:
\sum x_i y_i = \sum (X_i - \bar{X})(Y_i - \bar{Y})

= \sum (X_i Y_i - X_i \bar{Y} - \bar{X} Y_i + \bar{X} \bar{Y})

= \sum X_i Y_i - \bar{Y} \sum X_i - \bar{X} \sum Y_i + n \bar{X} \bar{Y}

= \sum X_i Y_i - n \bar{Y} \bar{X} - n \bar{X} \bar{Y} + n \bar{X} \bar{Y}

= \sum X_i Y_i - n \bar{Y} \bar{X}


The last expression corresponds to the numerator of (1.7). Performing a similar operation on
the denominator of (1.7), we get the following alternative expression for the least squares estimator of β:

\hat{\beta} = \frac{\sum x_i y_i}{\sum x_i^2}

[ FIGURE 3: SCATTER DIAGRAM AND OLS LINE ]
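
As a quick check that the deviations form agrees with (1.7), the following short Python snippet compares both expressions; it reuses the same assumed example data introduced above.

import numpy as np

X = np.array([10.0, 12.0, 15.0, 18.0, 20.0, 23.0, 25.0, 28.0])
Y = np.array([8.0, 9.5, 11.0, 13.5, 14.0, 17.0, 18.5, 20.0])
n = len(Y)

x = X - X.mean()  # deviations from the sample mean
y = Y - Y.mean()

beta_dev = np.sum(x * y) / np.sum(x**2)  # deviations form
beta_raw = (np.sum(X * Y) - n * Y.mean() * X.mean()) / (np.sum(X**2) - n * X.mean()**2)  # expression (1.7)

print(np.isclose(beta_dev, beta_raw))  # True: both expressions coincide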

1.3 Algebraic Properties of Least Squares Estimators


By algebraic properties of the estimators we mean those that are a direct consequence of
the minimization process, stressing the difference with statistical properties, which will be
studied in the next section.
Property 1: \sum e_i = 0. From the first order condition (1.2), dividing by minus 2 and
using the definition of e_i, we easily verify that, as a consequence of minimizing
the sum of squared residuals, the sum of the residuals, and consequently their average,
is equal to zero.
Property 2: \sum X_i e_i = 0. This can be checked by dividing the second first order
condition (1.4) by minus 2. The sample covariance between X and e is given by:

Cov(X, e) = \frac{1}{n-1} \sum (X_i - \bar{X})(e_i - \bar{e})

= \frac{1}{n-1} \left[ \sum X_i e_i - \bar{e} \sum X_i - \bar{X} \sum e_i + n \bar{X} \bar{e} \right]

= \frac{1}{n-1} \sum X_i e_i

since from the previous property \sum e_i, and hence \bar{e}, are equal to zero. Then this
property says that, as a consequence of using the method of least squares, the sample
covariance between the explanatory variable X and the residual e is zero, or, what
is the same, the residuals are linearly unrelated to the explanatory variable.
Property 3: The estimated regression line corresponds to the function Ŷ(X) = α̂ + β̂X,
where we take α̂ and β̂ as parameters, so that Ŷ is a function of X. Consider
what happens when we evaluate this function at X̄, the mean of X:

\hat{Y}(\bar{X}) = \hat{\alpha} + \hat{\beta} \bar{X}

But from (1.6):

\hat{\alpha} + \hat{\beta} \bar{X} = \bar{Y}

Then Ŷ(X̄) = Ȳ; that is, the regression line estimated by the method of least squares
passes through the point of sample means.
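
The following short Python check illustrates Properties 1 through 3 numerically; it is a sketch using the same assumed example data as before.

import numpy as np

X = np.array([10.0, 12.0, 15.0, 18.0, 20.0, 23.0, 25.0, 28.0])
Y = np.array([8.0, 9.5, 11.0, 13.5, 14.0, 17.0, 18.5, 20.0])

x, y = X - X.mean(), Y - Y.mean()
beta_hat = np.sum(x * y) / np.sum(x**2)
alpha_hat = Y.mean() - beta_hat * X.mean()

e = Y - (alpha_hat + beta_hat * X)          # residuals

print(np.isclose(e.sum(), 0.0))             # Property 1: residuals sum to zero
print(np.isclose(np.sum(X * e), 0.0))       # Property 2: residuals orthogonal to X
print(np.isclose(alpha_hat + beta_hat * X.mean(), Y.mean()))  # Property 3: line passes through the means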

Property 4: Relationship between regression and correlation. Remember that the
sample correlation coefficient between X and Y for a sample of n observations (X_i, Y_i),
i = 1, 2, . . . , n, is defined as:

r_{XY} = \frac{Cov(X, Y)}{S_X S_Y}

The following result establishes the relationship between r_{XY} and β̂:
\hat{\beta} = \frac{\sum x_i y_i}{\sum x_i^2}

= \frac{\sum x_i y_i}{\sqrt{\sum x_i^2} \sqrt{\sum x_i^2}}

= \frac{\sum x_i y_i \sqrt{\sum y_i^2}}{\sqrt{\sum x_i^2} \sqrt{\sum x_i^2} \sqrt{\sum y_i^2}}

= \frac{\sum x_i y_i}{\sqrt{\sum x_i^2} \sqrt{\sum y_i^2}} \cdot \frac{\sqrt{\sum y_i^2 / n}}{\sqrt{\sum x_i^2 / n}}

= r \, \frac{S_Y}{S_X}

If r = 0 then β̂ = 0. Note that if both variables have the same sample variance, then
the correlation coefficient is equal to the regression coefficient β̂. We can also see
that, unlike the correlation coefficient, β̂ is not invariant to changes in scale or units
of measurement.
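
A quick numerical illustration of Property 4, again on the assumed example data (a sketch, not part of the original notes):

import numpy as np

X = np.array([10.0, 12.0, 15.0, 18.0, 20.0, 23.0, 25.0, 28.0])
Y = np.array([8.0, 9.5, 11.0, 13.5, 14.0, 17.0, 18.5, 20.0])

x, y = X - X.mean(), Y - Y.mean()
beta_hat = np.sum(x * y) / np.sum(x**2)

r = np.corrcoef(X, Y)[0, 1]                 # sample correlation coefficient
S_X, S_Y = X.std(ddof=1), Y.std(ddof=1)     # sample standard deviations

print(np.isclose(beta_hat, r * S_Y / S_X))  # True: beta_hat = r * S_Y / S_X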

Property 5: The sample means of Y_i and Ŷ_i are the same. By definition, Y_i = Ŷ_i + e_i
for i = 1, . . . , n. Then, summing over all i:

\sum Y_i = \sum \hat{Y}_i + \sum e_i

and dividing by n:

\frac{\sum Y_i}{n} = \frac{\sum \hat{Y}_i}{n}

since \sum e_i = 0 from the first order conditions. Then:

\bar{Y} = \bar{\hat{Y}}

which is the desired result.



Property 6: β̂ is a linear function of the Y_i's. That is, β̂ can be written as β̂ = \sum w_i Y_i,
where the w_i's are real numbers, not all of them equal to zero.

This is easy to prove. Let us start by writing β̂ as follows:

\hat{\beta} = \sum \frac{x_i}{\sum x_i^2} \, y_i

and call w_i = x_i / \sum x_i^2. Note that:

\sum x_i = \sum (X_i - \bar{X}) = \sum X_i - n \bar{X} = 0

which implies \sum w_i = 0. From the previous result:

\hat{\beta} = \sum w_i y_i

= \sum w_i (Y_i - \bar{Y})

= \sum w_i Y_i - \bar{Y} \sum w_i

= \sum w_i Y_i

which gives the desired result.

This does not have much intuitive meaning so far, but it will be useful for later
results.
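
A brief numerical check of Property 6 with the same assumed data (an illustrative sketch):

import numpy as np

X = np.array([10.0, 12.0, 15.0, 18.0, 20.0, 23.0, 25.0, 28.0])
Y = np.array([8.0, 9.5, 11.0, 13.5, 14.0, 17.0, 18.5, 20.0])

x = X - X.mean()
w = x / np.sum(x**2)                          # weights w_i = x_i / sum(x_i^2)

beta_direct = np.sum(x * (Y - Y.mean())) / np.sum(x**2)
beta_linear = np.sum(w * Y)                   # beta_hat as a linear combination of the Y_i's

print(np.isclose(beta_direct, beta_linear))   # True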

1.4 The Two-Variable Linear Model under the Classical Assumptions

Y_i = \alpha + \beta X_i + u_i,    i = 1, \dots, n

In addition to the linear relationship between Y and X we will assume:

1. E(u_i) = 0, i = 1, 2, . . . , n. On average, the relationship between Y and X is linear.

2. Var(u_i) = E[(u_i - E(u_i))^2] = E(u_i^2) = \sigma^2, i = 1, 2, . . . , n. The variance of the error
term is constant across observations. We will say that the error term is homoskedastic.

3. Cov(u_i, u_j) = 0 for all i ≠ j. The error term for an observation i is not linearly related
to the error term of any other observation j. If variables are measured over
time, i.e., i = 1980, 1981, . . . , 1997, we will say that there is no autocorrelation. In
general, we will say that there is no serial correlation. Note that since E(u_i) = 0,
assuming Cov(u_i, u_j) = 0 is equivalent to assuming E(u_i u_j) = 0.

4. The values of Xi are non-stochastic and not all of them equal.



The classical assumptions provide a basic probabilistic structure to study the linear
model. Most assumptions are of a pedagogic nature and we will study later on how they
can be relaxed. Nevertheless, they provide a simple framework to explore the nature of the
least squares estimator.

1.5 Statistical Properties of Least Squares Estimators


Actually, the problem is to find good estimates of α, β, and σ². The previous section
presents estimates of the first two based on the principle of least squares, so, trivially, these
estimates are good in the sense that they minimize a certain criterion of fit: they make the
sum of squared residuals as small as possible. It is relevant to remark that in obtaining the
least squares estimators we have made no use of the classical assumptions described above.
Hence, the natural step is to explore whether we can deduce additional properties satisfied
by the least squares estimator, so we can say that it is good in a sense that goes beyond
that implicit in the least squares criterion. The following are called statistical properties
since they arise as a consequence of the statistical structure of the model.
We will use repeatedly the following expressions for the LS estimators:
\hat{\beta} = \frac{\sum x_i y_i}{\sum x_i^2}

\hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X}

We will first explore the main properties of β̂ in detail, and leave the analysis of α̂ as
exercises. The starting conceptual point is to see that β̂ depends explicitly on the Y_i's
which, in turn, depend on the u_i's, which are, by construction, random variables. Then β̂ is
a random variable, and so it makes sense to talk about its moments (mean and variance,
for example) and its distribution.
It is easy to verify that:

y_i = \beta x_i + u_i^*

where u_i^* = u_i - \bar{u}, and, according to the classical assumptions, E(u_i^*) = 0 and, consequently,
E(y_i) = \beta x_i. This is known as the classical two-variable linear model in deviations from
the means.

β̂ is an unbiased estimator, that is: E(β̂) = β.

To prove the result, from the linearity property of the previous section:

\hat{\beta} = \sum w_i y_i

E(\hat{\beta}) = \sum w_i E(y_i)    (the w_i's are non-stochastic)

= \sum w_i \beta x_i

= \beta \sum w_i x_i

= \beta \sum x_i^2 / \left( \sum x_i^2 \right)

= \beta
The variance of β̂ is \sigma^2 / \sum x_i^2.

From the linearity property, \hat{\beta} = \sum w_i Y_i, then

V(\hat{\beta}) = V\left( \sum w_i Y_i \right)

Now note two things. First:

V(Y_i) = V(\alpha + \beta X_i + u_i) = V(u_i) = \sigma^2

since X_i is non-stochastic. Second, note that E(Y_i) = \alpha + \beta X_i, so

Cov(Y_i, Y_j) = E[(Y_i - E(Y_i))(Y_j - E(Y_j))] = E(u_i u_j) = 0
by the no serial correlation assumption. Then V(\sum w_i Y_i) is the variance of a
(weighted) sum of uncorrelated terms. Hence

V(\hat{\beta}) = V\left( \sum w_i Y_i \right)

= \sum w_i^2 V(Y_i)

= \sigma^2 \sum w_i^2

= \sigma^2 \sum x_i^2 / \left[ \sum x_i^2 \right]^2

= \sigma^2 / \sum x_i^2
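
A small Monte Carlo sketch in Python can illustrate both results. The data-generating values (α = 1, β = 0.5, σ = 2), the fixed X values, and the number of replications are assumptions made for the illustration only.

import numpy as np

rng = np.random.default_rng(0)

alpha, beta, sigma = 1.0, 0.5, 2.0           # assumed "true" parameters
X = np.linspace(1.0, 20.0, 30)               # fixed (non-stochastic) regressor
x = X - X.mean()
R = 20_000                                   # number of Monte Carlo replications

beta_hats = np.empty(R)
for r in range(R):
    u = rng.normal(0.0, sigma, size=X.size)  # classical errors: mean 0, constant variance
    Y = alpha + beta * X + u
    beta_hats[r] = np.sum(x * (Y - Y.mean())) / np.sum(x**2)

print(beta_hats.mean())                      # close to beta = 0.5 (unbiasedness)
print(beta_hats.var())                       # close to sigma^2 / sum(x_i^2)
print(sigma**2 / np.sum(x**2))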

Gauss-Markov Theorem: under the classical assumptions, β̂, the LS estimator of β,
has the smallest variance among the class of linear and unbiased estimators. More
formally, if β̃ is any linear and unbiased estimator of β, then:

V(\tilde{\beta}) \geq V(\hat{\beta})

The proof of a more general version of this result will be postponed until Chapter 3.

Discussion: β̂ is BLUE (best linear unbiased estimator), but "best" does not mean "good":
ideally we would want the minimum variance unbiased estimator (without the restriction
to linear estimators), and the linear class is not an especially interesting one in itself.
Moreover, if we drop any of the assumptions, the OLS estimator is no longer guaranteed to
be BLUE. This justifies the use of OLS when all the assumptions are correct.

Estimation of σ²

So far we have concentrated the analysis on α and β. As an estimator for σ² we will propose:

S^2 = \frac{\sum e_i^2}{n - 2}

We will later show that S² provides an unbiased estimator of σ².
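
As a sketch (extending the assumed Monte Carlo setup above, which is not part of the original notes), S² can be computed from the residuals, and its average across replications is close to σ²:

import numpy as np

rng = np.random.default_rng(1)
alpha, beta, sigma = 1.0, 0.5, 2.0
X = np.linspace(1.0, 20.0, 30)
x = X - X.mean()
n = X.size

s2_draws = []
for _ in range(5_000):
    u = rng.normal(0.0, sigma, size=n)
    Y = alpha + beta * X + u
    b = np.sum(x * (Y - Y.mean())) / np.sum(x**2)
    a = Y.mean() - b * X.mean()
    e = Y - (a + b * X)
    s2_draws.append(np.sum(e**2) / (n - 2))   # S^2 = sum(e_i^2) / (n - 2)

print(np.mean(s2_draws), sigma**2)            # the average of S^2 is close to sigma^2 = 4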

1.6 Goodness of fit


After estimating the parameters of the regression line, it is interesting to check how well
the estimated model fits the data. We want a measure of how well the fitted line
represents the observations of the variables of the model.
To look for such a measure of goodness of fit, we start from the definition of the residual
e_i = Y_i - Ŷ_i, solve for Y_i, and subtract the sample mean of Y_i from both sides to obtain:

Y_i - \bar{Y} = \hat{Y}_i - \bar{Y} + e_i

y_i = \hat{y}_i + e_i

using the notation defined before and noting that, from Property 5, \bar{Y} = \bar{\hat{Y}}. Taking the
square of both sides and summing over all the observations:

y_i^2 = (\hat{y}_i + e_i)^2 = \hat{y}_i^2 + e_i^2 + 2 \hat{y}_i e_i

\sum y_i^2 = \sum \hat{y}_i^2 + \sum e_i^2 + 2 \sum \hat{y}_i e_i
The next step is to show that \sum \hat{y}_i e_i = 0:

\sum \hat{y}_i e_i = \sum (\hat{\alpha} + \hat{\beta} X_i - \bar{Y}) e_i

= \hat{\alpha} \sum e_i + \hat{\beta} \sum X_i e_i - \bar{Y} \sum e_i

= 0

from the first order conditions. Then we get the following important decomposition:
\sum y_i^2 = \sum \hat{y}_i^2 + \sum e_i^2

TSS = ESS + RSS

This is a key result indicating that when we use the least squares method, the total
variability of the dependent variable (TSS) around its sample mean can be decomposed
into the sum of two factors. The first one corresponds to the variability of Ŷ (ESS) and
represents the variability explained by the fitted model. The second term represents the
variability not explained by the model (RSS), associated with the error term.

For a given model, the best situation arises when all the errors are zero, in which case
the total variability (TSS) coincides with the explained variability (ESS). The worst case
corresponds to the situation in which the fitted model does not explain anything of the total
variability, in which case TSS coincides with RSS. From this observation, it is natural to
suggest the following goodness of fit measure, known as R², or coefficient of determination:

R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}
It can be shown (we will do it in the exercises) that R² = r². Consequently, 0 ≤ R² ≤ 1.
When R² = 1, |r| = 1, which corresponds to the case in which the relationship between
Y and X is exactly linear. On the other hand, R² = 0 is equivalent to r = 0, which
corresponds to the case in which Y and X are linearly unrelated. It is interesting to note
that TSS does not depend on the estimated model, that is, it does not depend on α̂ nor
β̂. Then, if α̂ and β̂ are chosen so as to minimize SSR, they automatically maximize
R². This implies that, for a given model, the least squares estimates maximize R².
The R2 is, arguably, the most used and abused measure of quality of a regression model.
A detailed analysis of the extent to which a high R2 can be taken as representative of a
good model will be undertaken in Chapter 4.
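
As an illustration, the decomposition and R² can be computed directly; the snippet below is a sketch using the assumed example data from earlier sections.

import numpy as np

X = np.array([10.0, 12.0, 15.0, 18.0, 20.0, 23.0, 25.0, 28.0])
Y = np.array([8.0, 9.5, 11.0, 13.5, 14.0, 17.0, 18.5, 20.0])

x = X - X.mean()
beta_hat = np.sum(x * (Y - Y.mean())) / np.sum(x**2)
alpha_hat = Y.mean() - beta_hat * X.mean()
Y_hat = alpha_hat + beta_hat * X
e = Y - Y_hat

TSS = np.sum((Y - Y.mean())**2)
ESS = np.sum((Y_hat - Y.mean())**2)
RSS = np.sum(e**2)

R2 = ESS / TSS
print(np.isclose(TSS, ESS + RSS))                   # the decomposition TSS = ESS + RSS
print(np.isclose(R2, np.corrcoef(X, Y)[0, 1]**2))   # R^2 equals the squared correlation
print(R2)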

1.7 Inference in the two-variable linear model


The methods discussed so far provide reasonably good point estimates of the parameters
of interest α, β, and σ², but usually we will be interested in evaluating hypotheses involving
the parameters, or in constructing confidence intervals for them. For example, consider the
case of a simple consumption function where consumption is specified as a linear
function of income. We could be interested in evaluating whether the marginal propensity
to consume is equal to, say, 0.75, or whether autonomous consumption is equal to zero.
In general terms, a hypothesis about a parameter of the model is a conjecture about
it, which can be either false or true. The central problem is that in order to check whether
such a statement is true or false we do not have the chance to observe the parameter itself.
Instead, based on the available data, we have an estimate of it. As an example, suppose
we are interested in evaluating the, rather strong, null hypothesis that income is not an
explanatory factor of consumption, against the hypothesis that it is a relevant factor. In
our simple setup this corresponds to H_0: β = 0 against H_A: β ≠ 0. The logic we will use is
the following: if the null hypothesis were in fact true, β would be exactly zero. Realizations
of β̂ can potentially take any value, since β̂ is, by construction, a random variable. But if
β̂ is a good estimator of β, when the null hypothesis is true it should take values close
to zero. On the other hand, if the null hypothesis were false, the realizations of β̂ should
be significantly different from zero. Then, the procedure consists in computing β̂ from the
data, and rejecting the null if the obtained value is significantly different from zero, or accepting
it otherwise.

Of course, the central issue behind this procedure lies in specifying what we mean
by very close or very far, given that β̂ is a random variable. More specifically, we need
to know the distribution of β̂ under the null hypothesis so we can define precisely the
notion of significantly different from zero. In this context such a statement is necessarily
probabilistic; that is, we will take as the rejection region a set of values that lie far away
from zero, or, equivalently, a set of values that under the null hypothesis appear with very low probability.
The properties discussed in the previous section are informative about certain moments
of α̂ and β̂ (for example, their means and variances), but they are not enough for the purpose
of knowing their distributions. Consequently, we need to introduce an additional assumption.
We will assume that u_i is normally distributed, for i = 1, . . . , n. Given that we have
already assumed that u_i has zero mean and constant variance equal to σ², we have:

u_i \sim N(0, \sigma^2)

Given that Y_i = \alpha + \beta X_i + u_i and that the X_i's are non-stochastic, we immediately
see that the Y_i's are also normally distributed, since linear transformations of normal
random variables are also normal. In particular, given that the normal distribution can be
characterized by its mean and variance only, we get:

Y_i \sim N(\alpha + \beta X_i, \ \sigma^2)

for every i = 1, . . . , n. In a similar fashion, β̂ is also normally distributed since, by Property
6, it is a linear combination of the Y_i's, that is:

\hat{\beta} \sim N\left( \beta, \ \sigma^2 / \sum x_i^2 \right)

If σ² were known, we could use this result to test simple hypotheses like:

H_0: \beta = \beta_0 \quad \text{vs.} \quad H_A: \beta \neq \beta_0

Subtracting from β̂ its expected value and dividing by its standard deviation we get:

z = \frac{\hat{\beta} - \beta_0}{\sigma / \sqrt{\sum x_i^2}} \sim N(0, 1)

Hence, if the null hypothesis is true, z should take values that are small in absolute value, and
large otherwise. As you should remember from a basic statistics course, this is accomplished
by defining a rejection region and an acceptance region as follows. The acceptance region
includes values that lie close to the one corresponding to the null hypothesis. Let c < 1 and let
z_c be a number such that:

Pr(-z_c \leq z \leq z_c) = 1 - c

Replacing z by its definition:



Pr\left( \beta_0 - z_c \, \sigma / \sqrt{\sum x_i^2} \ \leq \ \hat{\beta} \ \leq \ \beta_0 + z_c \, \sigma / \sqrt{\sum x_i^2} \right) = 1 - c

Then the acceptance region is given by the interval:


\beta_0 \pm z_c \left( \sigma / \sqrt{\sum x_i^2} \right)

so we accept the null hypothesis if the observed realization of β̂ lies within this interval, and
reject it otherwise. The number c is specified in advance and is usually small. It
is called the significance level of the test. Note that it gives the probability of rejecting the
null hypothesis when it is correct. Under the normality assumption, the value z_c can be
easily obtained from a table of percentiles of the standard normal distribution.
As you should also remember from a basic statistics class, a similar logic can be applied
to construct a confidence interval for β. Note that:

Pr\left( \hat{\beta} - z_c \, \sigma / \sqrt{\sum x_i^2} \ \leq \ \beta \ \leq \ \hat{\beta} + z_c \, \sigma / \sqrt{\sum x_i^2} \right) = 1 - c

Then a 1 - c confidence interval for β will be given by:

\hat{\beta} \pm z_c \, \sigma / \sqrt{\sum x_i^2}

The practical problem with the previous procedures is that they require knowledge of
σ², which is usually not available. Instead, we can compute its estimated version S². Define
t as:

t = \frac{\hat{\beta} - \beta_0}{S / \sqrt{\sum x_i^2}}

t is simply z where we have replaced σ² by S². A very important result is that by making
this replacement we have:

t \sim t_{n-2}

that is, the t-statistic has the so-called t distribution with n - 2 degrees of freedom.
Hence, when we use the estimated version of the variance we obtain a different distribution
for the statistic used to test simple hypotheses and construct confidence intervals.
Consequently, applying once again the same logic, in order to test the null hypothesis
H_0: \beta = \beta_0 against H_A: \beta \neq \beta_0 we use the t-statistic:

t = \frac{\hat{\beta} - \beta_0}{S / \sqrt{\sum x_i^2}} \sim t_{n-2}

and a 1 - c confidence interval for β will be given by:

\hat{\beta} \pm t_c \left( S / \sqrt{\sum x_i^2} \right)

where now t_c is a percentile of the t distribution with n - 2 degrees of freedom, which is
usually tabulated in basic statistics and econometrics textbooks.
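
The following Python sketch computes the t-statistic and a 95% confidence interval for β on the assumed example data; the hypothesized value β_0 = 0.75 is illustrative, and scipy is used only to obtain the t-distribution percentile.

import numpy as np
from scipy import stats

X = np.array([10.0, 12.0, 15.0, 18.0, 20.0, 23.0, 25.0, 28.0])
Y = np.array([8.0, 9.5, 11.0, 13.5, 14.0, 17.0, 18.5, 20.0])
n = len(Y)

x = X - X.mean()
beta_hat = np.sum(x * (Y - Y.mean())) / np.sum(x**2)
alpha_hat = Y.mean() - beta_hat * X.mean()
e = Y - (alpha_hat + beta_hat * X)

S2 = np.sum(e**2) / (n - 2)                 # estimate of sigma^2
se_beta = np.sqrt(S2 / np.sum(x**2))        # standard error of beta_hat

beta_0 = 0.75                               # hypothesized value (illustrative)
t_stat = (beta_hat - beta_0) / se_beta

c = 0.05
t_c = stats.t.ppf(1 - c / 2, df=n - 2)      # percentile of the t distribution with n-2 df
ci = (beta_hat - t_c * se_beta, beta_hat + t_c * se_beta)

print(t_stat, ci)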
An important particular case is the insignificance hypothesis, that is, H_0: \beta = 0 against
H_A: \beta \neq 0. Under the null, X does not help explain Y, and under the alternative, X is
linearly related to Y. Replacing \beta_0 by 0 above we get:

t_I = \frac{\hat{\beta}}{S / \sqrt{\sum x_i^2}} \sim t_{n-2}

which is usually reported as a standard outcome in most regression packages.


Another alternative to check for the significance of the linear relationship is to look at
how large the explained sum of squares ESS is. Recall that if the model has an intercept
we have:

TSS = ESS + RSS

If there is no linear relationship between Y and X, ESS should be very close to zero.
Consider the following statistic, which is just a standardized version of the ESS:

F = \frac{ESS}{RSS / (n - 2)}
It can be shown that under the normality assumption, F has the F distribution with
1 degree of freedom in the numerator and n - 2 degrees of freedom in the denominator,
which is usually labeled F(1, n - 2). Note that if X does not help explain Y in a linear
sense, ESS should be very small, which would make F very small. Then, we should reject
the null hypothesis that X does not help explain Y if the F statistic computed from the
data takes a large value, and accept it otherwise.
Note that by definition R² = ESS/TSS = 1 - RSS/TSS. Dividing both the numerator and the
denominator of the F statistic by TSS, solving for ESS and RSS and replacing above, we can write the
F statistic in terms of the R² coefficient as:

F = \frac{R^2}{(1 - R^2)/(n - 2)}
Then, the F test is actually looking at whether the R² is significantly high. As expected,
there is a close relationship between the F statistic and the t statistic for the insignificance
hypothesis (t_I): when there is no linear relationship between Y and X, ESS is zero,
or β = 0. In fact, it can be easily shown that:

F = t_I^2

We will leave the proof as an exercise.
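
A quick numerical check of this last relationship, again a sketch on the assumed example data:

import numpy as np

X = np.array([10.0, 12.0, 15.0, 18.0, 20.0, 23.0, 25.0, 28.0])
Y = np.array([8.0, 9.5, 11.0, 13.5, 14.0, 17.0, 18.5, 20.0])
n = len(Y)

x = X - X.mean()
beta_hat = np.sum(x * (Y - Y.mean())) / np.sum(x**2)
alpha_hat = Y.mean() - beta_hat * X.mean()
Y_hat = alpha_hat + beta_hat * X
e = Y - Y_hat

ESS = np.sum((Y_hat - Y.mean())**2)
RSS = np.sum(e**2)

F = ESS / (RSS / (n - 2))
t_I = beta_hat / np.sqrt((RSS / (n - 2)) / np.sum(x**2))

print(np.isclose(F, t_I**2))                 # True: F equals the square of the insignificance t statistic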
