Multiple Linear Regression


The population model


In a simple linear regression model, a single response measurement \(Y\) is related to a single
predictor (covariate, regressor) \(X\) for each observation. The critical assumption of the model
is that the conditional mean function is linear: \(E(Y \mid X) = \alpha + \beta X\).
In most problems, more than one predictor variable will be available. This leads to the
following multiple regression mean function:
\[E(Y \mid X) = \alpha + \beta_1 X_1 + \cdots + \beta_p X_p,\]
where \(\alpha\) is called the intercept and the \(\beta_j\) are called slopes or coefficients.
For example, if Y is annual income ($1000/year), X1 is educational level (number of years
of schooling), X2 is number of years of work experience, and X3 is gender (X3 = 0 is male,
X3 = 1 is female), then the population mean function may be
\[E(Y \mid X) = 15 + 0.8 X_1 + 0.5 X_2 - 3 X_3.\]
Based on this mean function, we can determine the expected income for any person as long
as we know his or her educational level, work experience, and gender.
For example, according to this mean function, a female with 12 years of schooling and 10
years of work experience would expect to earn $26,600 annually. A male with 16 years of
schooling and 5 years of work experience would expect to earn $30,300 annually.
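The two expected incomes above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function name `expected_income` is made up for illustration.

```python
def expected_income(schooling_years, experience_years, female):
    """Mean income in $1000/year under the example mean function.

    `female` is the indicator X3: 1 for female, 0 for male.
    """
    return 15 + 0.8 * schooling_years + 0.5 * experience_years - 3 * female


# Female, 12 years of schooling, 10 years of work experience.
income_a = expected_income(12, 10, 1)
# Male, 16 years of schooling, 5 years of work experience.
income_b = expected_income(16, 5, 0)
print(round(income_a, 1), round(income_b, 1))
```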
Going one step further, we can specify how the responses vary around their mean values.
This leads to a model of the form
\[Y_i = \alpha + \beta_1 X_{i,1} + \cdots + \beta_p X_{i,p} + \epsilon_i,\]
which is equivalent to writing \(Y_i = E(Y \mid X_i) + \epsilon_i\).
We write \(X_{i,j}\) for the \(j\)th predictor variable measured for the \(i\)th observation.
The main assumptions for the errors \(\epsilon_i\) are that \(E\epsilon_i = 0\) and \(\mathrm{var}(\epsilon_i) = \sigma^2\) (all variances are
equal). Also, the \(\epsilon_i\) should be independent of each other.
For small sample sizes, it is also important that the \(\epsilon_i\) approximately have a normal distribution.

For example, if we have the population model
\[Y = 15 + 0.8 X_1 + 0.5 X_2 - 3 X_3 + \epsilon\]
as above, and we know that \(\sigma = 9\), we can answer questions like: what is the probability that
a female with 16 years of education and no work experience will earn more than $40,000/year?
The mean for such a person is 24.8, so standardizing yields the probability:
\[P(Y > 40) = P((Y - 24.8)/9 > (40 - 24.8)/9) = P(Z > 1.69) \approx 0.05.\]
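The standardization can be reproduced numerically. The sketch below assumes normally distributed errors (as the use of a Z table implies) and uses only the standard library; the helper name `normal_upper_tail` is illustrative.

```python
import math


def normal_upper_tail(z):
    """P(Z > z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Female (X3 = 1) with 16 years of education and no work experience.
mean = 15 + 0.8 * 16 + 0.5 * 0 - 3 * 1
sigma = 9
z = (40 - mean) / sigma
prob = normal_upper_tail(z)
print(round(mean, 1), round(z, 2), round(prob, 3))
```

The exact tail probability is a little below 0.05, which the text rounds to 0.05.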
Another way to interpret the mean function
\[E(Y \mid X) = 15 + 0.8 X_1 + 0.5 X_2 - 3 X_3\]
is that for each additional year of schooling that you have, you can expect to earn an
additional $800 per year, and for each additional year of work experience, you can expect to
earn an additional $500 per year.
This is a very strong assumption. For example, it may not be realistic that the gain in
income when moving from \(X_2 = 20\) to \(X_2 = 21\) would be equal to the gain in income
when moving from \(X_2 = 1\) to \(X_2 = 2\).
We will discuss ways to address this later.
The gender variable \(X_3\) is an indicator variable, since it only takes on the values 0/1 (as
opposed to \(X_1\) and \(X_2\), which are quantitative).
The slope of an indicator variable (i.e. \(\beta_3\)) is the average gain for observations possessing
the characteristic measured by \(X_3\) over observations lacking that characteristic. When the
slope is negative, the negative gain is a loss.
Multiple regression in linear algebra notation
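We can pack all response values for all observations into an \(n\)-dimensional vector called the
response vector:
\[Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}.\]
We can pack all predictors into an \(n \times (p+1)\) matrix called the design matrix:
\[X = \begin{pmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1p} \\ 1 & X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{np} \end{pmatrix}.\]
Note the initial column of 1s. The reason for this will become clear shortly.
We can pack the intercept and slopes into a \((p+1)\)-dimensional vector called the slope vector,
denoted \(\beta\):
\[\beta = \begin{pmatrix} \alpha \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}.\]
Finally, we can pack all the error terms into an \(n\)-dimensional vector called the error vector:
\[\epsilon = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}.\]
Using linear algebra notation, the model
\[Y_i = \alpha + \beta_1 X_{i,1} + \cdots + \beta_p X_{i,p} + \epsilon_i\]
can be compactly written:
\[Y = X\beta + \epsilon,\]
where \(X\beta\) is the matrix-vector product.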

In order to estimate \(\beta\), we take a least squares approach that is analogous to what we did
in the simple linear regression case. That is, we want to minimize
\[\sum_i (Y_i - \alpha - \beta_1 X_{i,1} - \cdots - \beta_p X_{i,p})^2\]
over all possible values of the intercept and slopes.

It is a fact that this is minimized by setting
\[\hat\beta = (X'X)^{-1} X'Y.\]
\(X'X\) and \((X'X)^{-1}\) are \((p+1) \times (p+1)\) symmetric matrices.
\(X'Y\) is a \((p+1)\)-dimensional vector.
The fitted values are
\[\hat Y = X\hat\beta = X(X'X)^{-1}X'Y,\]
and the residuals are
\[r = Y - \hat Y = (I - X(X'X)^{-1}X')Y.\]
The error standard deviation is estimated as
\[\hat\sigma = \sqrt{\sum_i r_i^2/(n - p - 1)}.\]
The variances of \(\hat\alpha, \hat\beta_1, \ldots, \hat\beta_p\) are the diagonal elements of the standard error matrix:
\[\hat\sigma^2 (X'X)^{-1}.\]
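The matrix formulas above translate almost line for line into NumPy. The following is a minimal sketch on simulated data; the sample size, the coefficient values, and the error standard deviation are illustrative assumptions, not values from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3

# Simulated predictors and a design matrix with a leading column of 1s.
predictors = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), predictors])

# Illustrative "true" slope vector (alpha, beta_1, ..., beta_p) and errors.
beta_true = np.array([15.0, 0.8, 0.5, -3.0])
Y = X @ beta_true + rng.normal(scale=2.0, size=n)

# Least squares estimate: (X'X)^{-1} X'Y.
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y

# Fitted values, residuals, and the estimated error standard deviation.
fitted = X @ beta_hat
resid = Y - fitted
sigma_hat = np.sqrt(np.sum(resid**2) / (n - p - 1))

# Standard error matrix; square roots of its diagonal are the SDs.
cov_beta = sigma_hat**2 * XtX_inv
se = np.sqrt(np.diag(cov_beta))
print(beta_hat, sigma_hat, se)
```

Forming the explicit inverse mirrors the formula in the text. In practice a solver such as `np.linalg.lstsq` is usually preferred for numerical stability.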
We can verify that these formulas agree with the formulas that we worked out for simple
linear regression (p = 1). In that case, the design matrix can be written:

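\[X = \begin{pmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{pmatrix}.\]
So
\[X'X = \begin{pmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2 \end{pmatrix}, \qquad (X'X)^{-1} = \frac{1}{n\sum X_i^2 - (\sum X_i)^2} \begin{pmatrix} \sum X_i^2 & -\sum X_i \\ -\sum X_i & n \end{pmatrix}.\]
Equivalently, we can write
\[(X'X)^{-1} = \frac{1/(n-1)}{\mathrm{var}(X)} \begin{pmatrix} \sum X_i^2/n & -\bar X \\ -\bar X & 1 \end{pmatrix}\]
and
\[X'Y = \begin{pmatrix} \sum Y_i \\ \sum Y_i X_i \end{pmatrix} = \begin{pmatrix} n\bar Y \\ (n-1)\mathrm{Cov}(X, Y) + n\bar Y \bar X \end{pmatrix},\]
so that
\[(X'X)^{-1} X'Y = \begin{pmatrix} \bar Y - \bar X\,\mathrm{Cov}(X, Y)/\mathrm{Var}(X) \\ \mathrm{Cov}(X, Y)/\mathrm{Var}(X) \end{pmatrix} = \begin{pmatrix} \bar Y - \hat\beta \bar X \\ \hat\beta \end{pmatrix}.\]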
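Thus we get the same values for \(\hat\alpha\) and \(\hat\beta\).
Moreover, from the matrix approach the standard deviations of \(\hat\alpha\) and \(\hat\beta\) are
\[SD(\hat\alpha) = \frac{\sigma \sqrt{\sum X_i^2/n}}{\sqrt{n-1}\,\hat\sigma_X}, \qquad SD(\hat\beta) = \frac{\sigma}{\sqrt{n-1}\,\hat\sigma_X},\]
where \(\hat\sigma_X = \sqrt{\mathrm{var}(X)}\), which agree with what we derived earlier.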
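The agreement between the matrix formulas and the simple linear regression formulas can also be checked numerically. The sketch below uses simulated data (the sample size, intercept, and slope are arbitrary choices) and compares the matrix solution with the covariance-based slope and intercept.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(size=n)

# Matrix approach with the two-column design matrix.
X = np.column_stack([np.ones(n), x])
alpha_hat, beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Simple linear regression formulas (sample covariance and variance).
slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()

# Closed form for (X'X)^{-1} in terms of var(X) and the mean of X.
inv_formula = (1.0 / ((n - 1) * np.var(x, ddof=1))) * np.array(
    [[np.sum(x**2) / n, -x.mean()], [-x.mean(), 1.0]]
)
print(alpha_hat, intercept, beta_hat, slope)
```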


Example: \(Y_i\) are the average maximum daily temperatures at \(n = 1070\) weather stations in
the U.S. during March 2001. The predictors are: latitude (\(X_1\)), longitude (\(X_2\)), and elevation
(\(X_3\)).
Here is the fitted model:
\[E(Y \mid X) = 101 - 2 X_1 + 0.3 X_2 - 0.003 X_3.\]
Average temperature decreases as latitude and elevation increase, but it increases as longitude increases.
For example, when moving from Miami (latitude 25°) to Detroit (latitude 42°), an increase
in latitude of 17°, according to the model average temperature decreases by 2 · 17 = 34°.
In the actual data, Miami's temperature was 83° and Detroit's temperature was 45°, so the
actual difference was 38°.

The sum of squares of the residuals is \(\sum_i r_i^2 = 25301\), so the estimate of the standard
deviation of \(\epsilon\) is
\[\hat\sigma = \sqrt{25301/1066} \approx 4.9.\]

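The standard error matrix \(\hat\sigma^2 (X'X)^{-1}\) is:
\[\begin{pmatrix}
2.4 & 3.2 \times 10^{-2} & 1.3 \times 10^{-2} & 2.1 \times 10^{-4} \\
3.2 \times 10^{-2} & 7.9 \times 10^{-4} & 3.3 \times 10^{-5} & 2.1 \times 10^{-6} \\
1.3 \times 10^{-2} & 3.3 \times 10^{-5} & 1.3 \times 10^{-4} & 1.8 \times 10^{-6} \\
2.1 \times 10^{-4} & 2.1 \times 10^{-6} & 1.8 \times 10^{-6} & 1.2 \times 10^{-7}
\end{pmatrix}\]
The square roots of the diagonal elements give the standard deviations of the parameter estimates, so
\(SD(\hat\alpha) = 1.55\), \(SD(\hat\beta_1) = 0.03\), etc.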


One of the main goals of fitting a regression model is to determine which predictor variables
are truly related to the response. This can be formulated as a set of hypothesis tests.
For each predictor variable \(X_i\), we may test the null hypothesis \(\beta_i = 0\) against the alternative
\(\beta_i \ne 0\).
To obtain the p-value, first standardize the slope estimates:
\[\hat\beta_1/SD(\hat\beta_1) \approx -72, \qquad \hat\beta_2/SD(\hat\beta_2) \approx 29, \qquad \hat\beta_3/SD(\hat\beta_3) \approx -9.\]
Then look up the result in a Z table. In this case the p-values are all extremely small, so all
three predictors are significantly related to the response.
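Instead of a Z table, the two-sided p-value for a standardized slope can be computed directly. This sketch uses the normal approximation described above and only the standard library; the function name is illustrative, and the inputs are the magnitudes 72, 29, and 9 of the standardized slopes reported in the text.

```python
import math


def two_sided_p_value(z):
    """P(|Z| > |z|) for a standard normal Z."""
    return math.erfc(abs(z) / math.sqrt(2))


# Magnitudes of the standardized slopes for latitude, longitude, elevation.
p_values = [two_sided_p_value(z) for z in (72, 29, 9)]
print(p_values)
```

Only the magnitude of the standardized slope matters for a two-sided test, which is why the function takes an absolute value.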
Sums of squares
Just as with the simple linear model, the residuals and fitted values are uncorrelated:
\[\sum (Y_i - \hat Y_i)(\hat Y_i - \bar Y) = 0.\]
Thus we continue to have the SSTO = SSE + SSR decomposition
\[\sum (Y_i - \bar Y)^2 = \sum (Y_i - \hat Y_i)^2 + \sum (\hat Y_i - \bar Y)^2.\]

Here are the sums of squares with degrees of freedom (DF):

Source   Formula                            DF
SSTO     \(\sum (Y_i - \bar Y)^2\)          \(n - 1\)
SSE      \(\sum (Y_i - \hat Y_i)^2\)        \(n - p - 1\)
SSR      \(\sum (\hat Y_i - \bar Y)^2\)     \(p\)

Each mean square is a sum of squares divided by its degrees of freedom:
\[MSTO = \frac{SSTO}{n - 1}, \qquad MSE = \frac{SSE}{n - p - 1}, \qquad MSR = \frac{SSR}{p}.\]
The F statistic
\[F = \frac{MSR}{MSE}\]
is used to test the hypothesis "all \(\beta_i = 0\)" against the alternative "at least one \(\beta_i \ne 0\)."
Larger values of F indicate more evidence for the alternative.
The F statistic has \(p, n - p - 1\) degrees of freedom. P-values can be obtained from an F table
or from a computer program.
Example: (cont.) The sums of squares, mean squares, and F statistic for the temperature
analysis are given below:

Source       Sum square   DF     Mean square
Total        181439       1069   170
Error        25301        1066   24
Regression   156138       3      52046

\(F = 52046/24 \approx 2169\) on 3, 1066 DF. The p-value is extremely small.


The proportion of explained variation (PVE) is SSR/SSTO. The PVE is always between 0
and 1. Values of the PVE close to 1 indicate a closer fit to the data.
For the temperature analysis the PVE is 0.86.
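The table entries and the PVE follow from SSTO, SSE, \(n\), and \(p\) alone. The sketch below redoes that arithmetic with the numbers reported for the temperature analysis. Because it carries the unrounded MSE, the F value it prints is slightly larger than the hand calculation, which rounds the MSE to 24.

```python
n, p = 1070, 3
ssto, sse = 181439.0, 25301.0

# Regression sum of squares from the SSTO = SSE + SSR decomposition.
ssr = ssto - sse

# Mean squares: each sum of squares divided by its degrees of freedom.
msto = ssto / (n - 1)
mse = sse / (n - p - 1)
msr = ssr / p

f_stat = msr / mse
pve = ssr / ssto
print(ssr, round(msto), round(mse), round(msr), round(f_stat), round(pve, 2))
```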

If the sample size is large, all variables are likely to be significantly different from zero. Yet
not all are equally important.
The relative importance of the variables can be assessed based on the PVEs for various
submodels:
Predictors                       PVE    F
Latitude                         0.75   1601
Longitude                        0.10   59
Elevation                        0.02   9
Longitude, Elevation             0.19   82
Latitude, Elevation              0.75   1080
Latitude, Longitude              0.85   2000
Latitude, Longitude, Elevation   0.86   1645

Latitude is by far the most important predictor, with longitude a distant second.
Interactions
Up to this point, each predictor variable has been incorporated into the regression function
through an additive term \(\beta_i X_i\). Such a term is called a main effect.
For a main effect, a variable increases the average response by \(\beta_i\) for each unit increase in
\(X_i\), regardless of the levels of the other variables.
An interaction between two variables \(X_i\) and \(X_j\) is an additive term of the form \(\beta_{ij} X_i X_j\) in
the regression function.
For example, if there are two variables, the main effects and interactions give the following
regression function:
\[E(Y \mid X) = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_{12} X_1 X_2.\]
With an interaction, the slope of \(X_1\) depends on the level of \(X_2\), and vice versa. For example,
holding \(X_2\) fixed, the regression function can be written
\[E(Y \mid X) = (\alpha + \beta_2 X_2) + (\beta_1 + \beta_{12} X_2) X_1,\]
so for a given level of \(X_2\) the response increases by \(\beta_1 + \beta_{12} X_2\) for each unit increase in \(X_1\).
Similarly, when holding \(X_1\) fixed, the regression function can be written
\[E(Y \mid X) = (\alpha + \beta_1 X_1) + (\beta_2 + \beta_{12} X_1) X_2,\]
so for a given level of \(X_1\) the response increases by \(\beta_2 + \beta_{12} X_1\) for each unit increase in \(X_2\).

Example: (cont.) For the temperature data, each of the three possible interactions was
added (individually) to the model along with the three main effects. PVEs and F statistics
are given below:
Interaction             PVE    F
Latitude × Longitude    0.88   1514
Latitude × Elevation    0.86   1347
Longitude × Elevation   0.88   1519

The improvements in fit (PVE) are small. Nevertheless, we may learn something from the
coefficients.
The coefficients for the model including the latitude × longitude interaction are:
\[E(Y \mid X) = 188 - 4.25\,\mathrm{Latitude} + 0.61\,\mathrm{Longitude} - 0.003\,\mathrm{Elevation} + 0.02\,\mathrm{Latitude} \times \mathrm{Longitude}.\]
Longitude ranges from 68° to 125° in this data set. Thus in the eastern US, the model can
be approximated as
\[E(Y \mid X) \approx 229 - 2.89\,\mathrm{Latitude} - 0.003\,\mathrm{Elevation},\]
while in the western US the model can be approximated as
\[E(Y \mid X) \approx 264 - 1.75\,\mathrm{Latitude} - 0.003\,\mathrm{Elevation}.\]
This tells us that the effect of latitude was stronger in the eastern US than in the western
US.
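The eastern and western approximations come from plugging the two extreme longitudes into the interaction model and collecting the latitude terms. A small sketch of that substitution (the function name `latitude_line` is made up for illustration):

```python
def latitude_line(longitude):
    """Intercept and latitude slope of the interaction model at a fixed longitude.

    The elevation term (-0.003 * Elevation) is unaffected and is left out here.
    """
    intercept = 188 + 0.61 * longitude
    slope = -4.25 + 0.02 * longitude
    return intercept, slope


east = latitude_line(68)    # eastern edge of the data
west = latitude_line(125)   # western edge of the data
print([round(v, 2) for v in east], [round(v, 2) for v in west])
```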

This scatterplot compares the relationships between latitude and temperature in the eastern
and western US (divided at the median longitude of 93°).
The slope in the western stations is seen to be slightly closer to 0, but more notably, latitude
has much less predictive power in the west compared to the east.

[Figure: scatterplot of temperature against latitude, with western stations and eastern stations shown separately.]

Polynomial regression
The term "linear" in linear regression means that the regression function is linear in the
coefficients \(\alpha\) and \(\beta_j\). It is not required that the \(X_i\) appear as linear terms in the regression
function.
For example, we may include power transforms of the form \(X_i^q\) for integer values of \(q\). This
allows us to fit regression functions that are polynomials in the covariates.
For example, we could fit the following cubic model:
\[E(Y \mid X) = \alpha + \beta_1 X + \beta_2 X^2 + \beta_3 X^3.\]
This is a bivariate model, as \(Y\) and \(X\) are the only measured quantities. But we must use
multiple regression to fit the model since \(X\) occurs under several power transforms.

The following data come from the population regression function \(E(Y \mid X) = X^3 - X\), with
\(\mathrm{var}(Y \mid X) = 4\). The fitted regression function is
\[\hat E(Y \mid X) = 0.54 - 1.30X - 0.18X^2 + 1.15X^3.\]

[Figure: plot of the simulated data.]
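A cubic fit of this kind is an ordinary multiple regression whose design matrix has columns \(1, X, X^2, X^3\). The sketch below simulates data from \(E(Y \mid X) = X^3 - X\) with error variance 4 and fits the cubic by least squares. The sample size, the range of \(X\), and the random seed are assumptions, so the fitted coefficients will not match the ones quoted above.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.uniform(-2, 2, size=n)

# Population mean function X^3 - X with var(Y|X) = 4, i.e. error SD 2.
y = x**3 - x + rng.normal(scale=2.0, size=n)

# Design matrix with power transforms of the single predictor.
X = np.column_stack([np.ones(n), x, x**2, x**3])

# Least squares coefficients (alpha, beta_1, beta_2, beta_3).
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef
print(coef)
```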

If more than one predictor is observed, we can include polynomial terms for any of the
predictors.
For example, in the temperature data we could include the three main effects along with a
quadratic term for any one of the three predictors:

Quadratic term   PVE    F
Latitude         0.86   1320
Longitude        0.89   1680
Elevation        0.86   1319

The strongest quadratic effect occurs for longitude.

The fitted model with quadratic longitude effect is
\[E(Y \mid X) = 197 - 2.09\,\mathrm{Latitude} - 1.62\,\mathrm{Longitude} - 0.002\,\mathrm{Elevation} + 0.01\,\mathrm{Longitude}^2.\]
Recall that a quadratic function \(ax^2 + bx + c\) has a minimum if \(a > 0\), a maximum if \(a < 0\),
and either value falls at \(x = -b/(2a)\).
Thus the longitude effect \(0.01\,\mathrm{Longitude}^2 - 1.62\,\mathrm{Longitude}\) has a minimum at 81°, which is
around the 20th percentile of our data (roughly Cleveland, OH, or Columbia, SC).
The longitude effect decreases from the east coast as one moves west to around 81°, but then
increases again as one continues to move further west.
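The location of the minimum, and the decrease-then-increase pattern, can be checked by evaluating the longitude effect directly. A short sketch using the fitted coefficients quoted above (the function name is illustrative):

```python
a, b = 0.01, -1.62   # coefficients of Longitude^2 and Longitude


def longitude_effect(longitude):
    """Contribution of longitude to the fitted mean temperature."""
    return a * longitude**2 + b * longitude


# Vertex of the parabola; a minimum because a > 0.
vertex = -b / (2 * a)

# Effect at the eastern edge, at the vertex, and at the western edge.
effects = [longitude_effect(v) for v in (68, vertex, 125)]
print(round(vertex, 1), [round(e, 2) for e in effects])
```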
This plot shows the longitude effect for the linear fit (green), and the longitude effect for the
quadratic fit (red).

[Figure: longitude effect plotted against longitude for the quadratic and linear fits.]

Model building
Suppose you measure a response variable \(Y\) and several predictor variables \(X_1, X_2, \ldots, X_p\).
We can directly fit the full model
\[E(Y \mid X) = \alpha + \beta_1 X_1 + \cdots + \beta_p X_p,\]
but what if we are not certain that all of the variables are informative about the response?
Model building, or variable selection, is the process of building a model that aims to include
only the relevant predictors.
One approach is all subsets regression, in which all possible models are fit (if there are \(p\)
predictors then there are \(2^p\) different models).
A critical issue is that if more variables are included, the fit never gets worse: the SSE never
increases and the PVE never decreases. Thus if we select the model with the highest PVE, we
will always select the full model.
Therefore we adjust by penalizing models with many variables that don't fit much better
than models with fewer variables.
One way to do this is using the Akaike Information Criterion (AIC):
\[AIC = n \log(SSE/n) + 2(p + 1).\]
Lower AIC values indicate a better model.
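All subsets regression with the AIC can be sketched in a few lines. The example below runs on simulated data in which only the first two of three predictors matter; the data, the helper names, and the seed are all illustrative assumptions, and the AIC is computed from the formula above using the natural logarithm.

```python
import itertools

import numpy as np


def sse_of_fit(X, y):
    """Residual sum of squares of the least squares fit of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)


def aic(sse, n, p):
    """AIC = n log(SSE/n) + 2(p + 1) for a model with p predictors."""
    return n * np.log(sse / n) + 2 * (p + 1)


def all_subsets(predictors, y):
    """Map each subset of predictor columns (as a tuple) to its AIC."""
    n, k = predictors.shape
    results = {}
    for size in range(k + 1):
        for subset in itertools.combinations(range(k), size):
            cols = [np.ones(n)] + [predictors[:, j] for j in subset]
            results[subset] = aic(sse_of_fit(np.column_stack(cols), y), n, size)
    return results


rng = np.random.default_rng(3)
n = 200
predictors = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * predictors[:, 0] - 1.5 * predictors[:, 1] + rng.normal(size=n)

results = all_subsets(predictors, y)
best = min(results, key=results.get)
print(best, results[best])
```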
Here are the all subsets results for the temperature data:
Predictors                       AIC    PVE    F
None                             5499   0      0
Latitude                         4016   0.75   1601
Longitude                        5388   0.10   59
Elevation                        5484   0.02   9
Longitude, Elevation             5281   0.19   82
Latitude, Elevation              4010   0.75   1080
Latitude, Longitude              3479   0.85   2000
Latitude, Longitude, Elevation   3397   0.86   1645

So based on the AIC we would select the full model.


As an illustration, suppose we simulate random (standard normal) predictor variables and
add these to the temperature dataset alongside the three genuine variables.
These are the AIC, PVE, and F values:

Simulated variables added   1      10     50     100    200
AIC                         3398   3399   3406   3490   3594
PVE                         0.86   0.86   0.87   0.87   0.88
F                           1316   474    128    64     33

The PVE continues to climb (suggesting better fit) as meaningless variables are added. The
AIC increases (suggesting worse fit).
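The same experiment can be imitated on simulated data. In the sketch below, three genuine predictors are joined by increasing numbers of pure-noise predictors, and the PVE and AIC are recomputed each time. Everything here (sample size, coefficients, seed) is an assumption for illustration; only the qualitative behaviour of the PVE is guaranteed to carry over.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
genuine = rng.normal(size=(n, 3))
y = 1.0 + genuine @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)
noise = rng.normal(size=(n, 100))   # predictors unrelated to y


def pve_and_aic(X, y):
    """PVE and AIC for the least squares fit of y on design matrix X."""
    n_obs, n_cols = X.shape          # n_cols = p + 1 (intercept included)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = float(np.sum((y - X @ coef) ** 2))
    ssto = float(np.sum((y - y.mean()) ** 2))
    return 1 - sse / ssto, n_obs * np.log(sse / n_obs) + 2 * n_cols


added_counts = (0, 1, 10, 50, 100)
summary = {}
for k in added_counts:
    # Nested models: the first k noise columns are appended each time.
    X = np.column_stack([np.ones(n), genuine, noise[:, :k]])
    summary[k] = pve_and_aic(X, y)
    print(k, summary[k])
```

Because the models are nested, the PVE cannot go down as noise columns are added, which is exactly why it is a poor selection criterion on its own.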

If \(p\) is large, then it is not practical to investigate all \(2^p\) distinct submodels. In this case we
can apply forward selection.
First find the best one-variable model based on AIC:
Predictors   AIC    PVE    F
Latitude     4016   0.75   1601
Longitude    5388   0.10   59
Elevation    5484   0.02   9

The best model includes latitude only.


Then select the best two-variable model, where one of the variables must be latitude:

Predictors            AIC    PVE    F
Latitude, Elevation   4010   0.75   1080
Latitude, Longitude   3479   0.85   2000

The best two-variable model includes latitude and longitude.


If this model has worse (higher) AIC than the one-variable model, stop here. Otherwise
continue to a three-variable model.
There is only one three-variable model:

Predictors                       AIC    PVE    F
Latitude, Longitude, Elevation   3397   0.86   1645

Since this has lower AIC than the best two-variable model, this is our final model.
Note that in order to arrive at this model, we never considered the longitude and elevation
model.
In general, around \(p^2/2\) models must be checked in forward selection. For large \(p\), this is far
fewer than the \(2^p\) models that must be checked for the all subsets approach (e.g. if \(p = 10\) then
\(p^2/2 = 50\) while \(2^{10} = 1024\)).
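Forward selection by AIC can be sketched as a greedy loop. The version below is one reasonable reading of the procedure, with one stated difference: it starts from the intercept-only model, so it may also stop before adding any variable. The data are simulated and all names are illustrative.

```python
import numpy as np


def aic_for(predictors, y, subset):
    """AIC = n log(SSE/n) + 2(p + 1) for the model using the given columns."""
    n = len(y)
    cols = [np.ones(n)] + [predictors[:, j] for j in subset]
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = float(np.sum((y - X @ coef) ** 2))
    return n * np.log(sse / n) + 2 * (len(subset) + 1)


def forward_selection(predictors, y):
    """Add the variable that lowers AIC most, until no addition lowers it."""
    k = predictors.shape[1]
    selected = []
    best_aic = aic_for(predictors, y, selected)
    while len(selected) < k:
        candidates = [j for j in range(k) if j not in selected]
        scores = {j: aic_for(predictors, y, selected + [j]) for j in candidates}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best_aic:
            break   # the larger model has worse (higher) AIC, so stop
        selected.append(j_best)
        best_aic = scores[j_best]
    return selected, best_aic


rng = np.random.default_rng(5)
n = 300
predictors = rng.normal(size=(n, 4))
y = 3.0 * predictors[:, 0] + 1.0 * predictors[:, 2] + rng.normal(size=n)

selected, best_aic = forward_selection(predictors, y)
print(selected, best_aic)
```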


A similar idea is backward selection. Start with the full model:

Predictors                       AIC    PVE    F
Latitude, Longitude, Elevation   3397   0.86   1645

Then consider all models obtained by dropping one variable:

Predictors             AIC    PVE    F
Longitude, Elevation   5281   0.19   82
Latitude, Elevation    4010   0.75   1080
Latitude, Longitude    3479   0.85   2000

The best of these is the latitude and longitude model. Since it has higher AIC than the full
model, we stop here and use the full model as our final model. If one of the two-variable
models had lower AIC than the full model, then we would continue by looking at one-variable
models.
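Backward selection mirrors the forward procedure: start from the full model and drop variables while doing so lowers the AIC. The following is a hedged sketch on simulated data, with illustrative names and an arbitrary seed.

```python
import numpy as np


def aic_for(predictors, y, subset):
    """AIC = n log(SSE/n) + 2(p + 1) for the model using the given columns."""
    n = len(y)
    cols = [np.ones(n)] + [predictors[:, j] for j in subset]
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = float(np.sum((y - X @ coef) ** 2))
    return n * np.log(sse / n) + 2 * (len(subset) + 1)


def backward_selection(predictors, y):
    """Drop the variable whose removal lowers AIC most, until none does."""
    selected = list(range(predictors.shape[1]))
    best_aic = aic_for(predictors, y, selected)
    while selected:
        scores = {
            j: aic_for(predictors, y, [i for i in selected if i != j])
            for j in selected
        }
        j_drop = min(scores, key=scores.get)
        if scores[j_drop] >= best_aic:
            break   # every smaller model has worse (higher) AIC, so stop
        selected.remove(j_drop)
        best_aic = scores[j_drop]
    return selected, best_aic


rng = np.random.default_rng(6)
n = 300
predictors = rng.normal(size=(n, 4))
y = 3.0 * predictors[:, 0] + 1.0 * predictors[:, 2] + rng.normal(size=n)

selected, best_aic = backward_selection(predictors, y)
print(selected, best_aic)
```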
Diagnostics
The plot of residuals against fitted values should show no pattern:

[Figure: residuals plotted against fitted values.]

The standardized residuals should be approximately normal:

[Figure: normal quantiles plotted against standardized residuals.]

There should be no pattern when plotting residuals against each predictor variable:

[Figure: residuals plotted against latitude.]

The plot of residuals against longitude strongly suggests that the longitude effect is quadratic:

[Figure: residuals plotted against longitude.]

[Figure: residuals plotted against elevation.]

Since two of the predictors are map coordinates, we can check whether large residuals cluster
regionally:

[Figure: map of the stations (latitude against negative longitude), with points grouped by residual percentile: 0-75, 75-90, and 90-100.]
