Multiple Linear Regression
$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}$$
We can pack all predictors into an $n \times (p+1)$ matrix called the design matrix:
$$X = \begin{pmatrix} 1 & X_{11} & \cdots & X_{1p} \\ 1 & X_{21} & \cdots & X_{2p} \\ \vdots & \vdots & & \vdots \\ 1 & X_{n1} & \cdots & X_{np} \end{pmatrix}$$
Note the initial column of 1s. The reason for this will become clear shortly.
We can pack the intercept and slopes into a $(p+1)$-dimensional vector called the slope vector, denoted $\beta$:

$$\beta = \begin{pmatrix} \alpha \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}$$
Finally, we can pack all the error terms into an $n$-dimensional vector called the error vector:

$$\epsilon = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}$$
In order to estimate $\beta$, we take a least squares approach that is analogous to what we did in the simple linear regression case. That is, we want to minimize the sum of squared residuals

$$\sum_i r_i^2,$$

where $r_i = Y_i - \hat{Y}_i$. As in the simple case, the error standard deviation $\sigma$ is estimated by

$$\hat{\sigma} = \sqrt{\sum_i r_i^2 \,/\, (n-p-1)}.$$
The variances of $\hat{\alpha}, \hat{\beta}_1, \ldots, \hat{\beta}_p$ are the diagonal elements of the standard error matrix

$$\sigma^2 (X'X)^{-1}.$$
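As an illustration of these matrix formulas, here is a minimal Python sketch using numpy and simulated data; the variable names and values are illustrative only, not from the temperature example:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3

# Simulated predictors and response (illustrative values only)
X0 = rng.normal(size=(n, p))
beta_true = np.array([2.0, 1.0, -0.5, 0.25])    # (alpha, beta_1, ..., beta_p)
X = np.column_stack([np.ones(n), X0])           # design matrix with a column of 1s
Y = X @ beta_true + rng.normal(scale=2.0, size=n)

# Least squares estimate: beta_hat = (X'X)^{-1} X'Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Residuals and the estimate of sigma
r = Y - X @ beta_hat
sigma_hat = np.sqrt(np.sum(r**2) / (n - p - 1))

# Standard errors: square roots of the diagonal of sigma_hat^2 (X'X)^{-1}
se = sigma_hat * np.sqrt(np.diag(np.linalg.inv(X.T @ X)))
print(beta_hat, se)
```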
We can verify that these formulas agree with the formulas that we worked out for simple
linear regression (p = 1). In that case, the design matrix can be written:
$$X = \begin{pmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{pmatrix}$$
So

$$X'X = \begin{pmatrix} n & \sum_i X_i \\ \sum_i X_i & \sum_i X_i^2 \end{pmatrix},$$

and therefore

$$(X'X)^{-1} = \frac{1}{n\sum_i X_i^2 - \left(\sum_i X_i\right)^2} \begin{pmatrix} \sum_i X_i^2 & -\sum_i X_i \\ -\sum_i X_i & n \end{pmatrix} = \frac{1/(n-1)}{\mathrm{var}(X)} \begin{pmatrix} \sum_i X_i^2/n & -\bar{X} \\ -\bar{X} & 1 \end{pmatrix},$$
and

$$X'Y = \begin{pmatrix} \sum_i Y_i \\ \sum_i Y_i X_i \end{pmatrix} = \begin{pmatrix} n\bar{Y} \\ (n-1)\mathrm{Cov}(X, Y) + n\bar{Y}\bar{X} \end{pmatrix},$$

so that

$$(X'X)^{-1}X'Y = \begin{pmatrix} \bar{Y} - \bar{X}\,\mathrm{Cov}(X, Y)/\mathrm{Var}(X) \\ \mathrm{Cov}(X, Y)/\mathrm{Var}(X) \end{pmatrix} = \begin{pmatrix} \hat{\alpha} \\ \hat{\beta} \end{pmatrix},$$

which matches the intercept and slope estimates from simple linear regression.
It also follows that

$$\mathrm{SD}(\hat{\alpha}) = \frac{\sigma\sqrt{\sum_i X_i^2/n}}{\sqrt{n-1}\;\hat{\sigma}_X}, \qquad \mathrm{SD}(\hat{\beta}) = \frac{\sigma}{\sqrt{n-1}\;\hat{\sigma}_X},$$

where $\hat{\sigma}_X$ is the sample standard deviation of $X$. These agree with the standard errors derived for simple linear regression.
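A quick numerical check of this agreement, as a sketch with simulated data (not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.normal(size=n)
y = 3.0 + 2.0 * x + rng.normal(size=n)

# Matrix formula: (X'X)^{-1} X'Y with a column of 1s in the design matrix
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Simple linear regression formulas
slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - x.mean() * slope

print(beta_hat)              # matrix-based (intercept, slope)
print(intercept, slope)      # Cov/Var formulas give the same values
```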
Example: (cont.) For the temperature data, the estimated error standard deviation is $\hat{\sigma} = \sqrt{\sum_i r_i^2 \,/\, (n-p-1)} = \sqrt{25301/1066} \approx 4.9$.
Dividing each estimated slope by its standard deviation gives a Z statistic for testing whether that coefficient is zero:

$$\hat{\beta}_1/\mathrm{SD}(\hat{\beta}_1) \approx 72, \qquad \hat{\beta}_2/\mathrm{SD}(\hat{\beta}_2) \approx 29, \qquad \hat{\beta}_3/\mathrm{SD}(\hat{\beta}_3) \approx 9.$$

Then look up the result in a Z table. In this case the p-values are all extremely small, so all three predictors are significantly related to the response.
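The Z table lookup can also be done numerically; a sketch using scipy (the values below are the magnitudes quoted above):

```python
import numpy as np
from scipy.stats import norm

z = np.array([72.0, 29.0, 9.0])      # Z statistics for the three predictors
p_values = 2 * norm.sf(np.abs(z))    # two-sided p-values; effectively 0 here
print(p_values)
```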
Sums of squares
Just as with the simple linear model, the residuals and fitted values are uncorrelated:

$$\sum_i (Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y}) = 0.$$

As a consequence, the sum of squares decomposition holds:

$$\sum_i (Y_i - \bar{Y})^2 = \sum_i (Y_i - \hat{Y}_i)^2 + \sum_i (\hat{Y}_i - \bar{Y})^2.$$
This gives the following sums of squares:

Sum of squares    Formula                              DF
SSTO              $\sum_i (Y_i - \bar{Y})^2$           $n-1$
SSE               $\sum_i (Y_i - \hat{Y}_i)^2$         $n-p-1$
SSR               $\sum_i (\hat{Y}_i - \bar{Y})^2$     $p$

The corresponding mean squares are

$$\mathrm{MSTO} = \frac{\mathrm{SSTO}}{n-1}, \qquad \mathrm{MSE} = \frac{\mathrm{SSE}}{n-p-1}, \qquad \mathrm{MSR} = \frac{\mathrm{SSR}}{p}.$$
The F statistic
$$F = \frac{\mathrm{MSR}}{\mathrm{MSE}}$$

is used to test the null hypothesis that all $\beta_i = 0$ against the alternative that at least one $\beta_i \neq 0$. Larger values of F indicate more evidence for the alternative.

The F statistic has $p$ and $n-p-1$ degrees of freedom; p-values can be obtained from an F table or from a computer program.
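A sketch of the F test in Python, written as a small helper that takes the observed responses, the fitted values, and the number of predictors (the function name is illustrative):

```python
import numpy as np
from scipy.stats import f as f_dist

def anova_f(y, yhat, p):
    """Sums of squares, F statistic, and p-value for a fitted linear model.

    y, yhat: 1-D numpy arrays of responses and fitted values; p: number of predictors.
    """
    n = len(y)
    ssto = np.sum((y - y.mean())**2)
    sse = np.sum((y - yhat)**2)
    ssr = np.sum((yhat - y.mean())**2)
    F = (ssr / p) / (sse / (n - p - 1))
    p_value = f_dist.sf(F, p, n - p - 1)    # upper tail of the F distribution
    return ssto, sse, ssr, F, p_value
```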
Example: (cont.) The sums of squares, mean squares, and F statistic for the temperature
analysis are given below:
Source        Sum square    DF      Mean square
Total         181439        1069    170
Error         25301         1066    24
Regression    156138        3       52046
If the sample size is large, all coefficients are likely to be significantly different from zero. Yet not all variables are equally important.
The relative importance of the variables can be assessed based on the PVEs for various
submodels:
Predictors                          PVE     F
Latitude                            0.75    1601
Longitude                           0.10    59
Elevation                           0.02    9
Longitude, Elevation                0.19    82
Latitude, Elevation                 0.75    1080
Latitude, Longitude                 0.85    2000
Latitude, Longitude, Elevation      0.86    1645
Latitude is by far the most important predictor, with longitude a distant second.
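Taking PVE to be the proportion of variance explained, SSR/SSTO, a sketch of computing it for any submodel (illustrative helper only):

```python
import numpy as np

def pve(y, X_sub):
    """Proportion of variance explained (SSR/SSTO) for a submodel.

    y: 1-D response array; X_sub: 2-D array whose columns are the chosen
    predictors (the column of 1s is added here).
    """
    X = np.column_stack([np.ones(len(y)), X_sub])
    yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum((yhat - y.mean())**2) / np.sum((y - y.mean())**2)
```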
Interactions
Up to this point, each predictor variable has been incorporated into the regression function through an additive term $\beta_i X_i$. Such a term is called a main effect.
For a main effect, the variable increases the average response by $\beta_i$ for each unit increase in $X_i$, regardless of the levels of the other variables.
An interaction between two variables $X_i$ and $X_j$ is an additive term of the form $\beta_{ij} X_i X_j$ in the regression function.
For example, if there are two variables, the main effects and interactions give the following
regression function:
$$E(Y|X) = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_{12} X_1 X_2.$$
With an interaction, the slope of X1 depends on the level of X2 , and vice versa. For example,
holding X2 fixed, the regression function can be written
$$E(Y|X) = (\alpha + \beta_2 X_2) + (\beta_1 + \beta_{12} X_2) X_1,$$

so for a given level of $X_2$ the response increases by $\beta_1 + \beta_{12} X_2$ for each unit increase in $X_1$.
Similarly, when holding $X_1$ fixed, the regression function can be written

$$E(Y|X) = (\alpha + \beta_1 X_1) + (\beta_2 + \beta_{12} X_1) X_2,$$

so for a given level of $X_1$ the response increases by $\beta_2 + \beta_{12} X_1$ for each unit increase in $X_2$.
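A sketch of fitting a two-variable model with an interaction on simulated data (the interaction is simply an extra column $X_1 X_2$ in the design matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x1 * x2 + rng.normal(size=n)

# Design matrix with both main effects and the interaction term
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
alpha, b1, b2, b12 = np.linalg.lstsq(X, y, rcond=None)[0]

# Holding x2 fixed, the estimated slope of x1 is b1 + b12 * x2
print(alpha, b1, b2, b12)
```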
Example: (cont.) For the temperature data, each of the three possible interactions was
added (individually) to the model along with the three main effects. PVEs and F statistics
are given below:
Interaction                  PVE     F
Latitude × Longitude         0.88    1514
Latitude × Elevation         0.86    1347
Longitude × Elevation        0.88    1519
The improvements in fit (PVE) are small; nevertheless, we may learn something from the coefficients.
The coefficients for the model including the latitude × longitude interaction are:
This scatterplot compares the relationships between latitude and temperature in the eastern
and western US (divided at the median longitude of 93°).
The slope in the western stations is seen to be slightly closer to 0, but more notably, latitude
has much less predictive power in the west compared to the east.
[Figure: temperature versus latitude, plotted separately for western and eastern stations.]
Polynomial regression
The term linear in linear regression means that the regression function is linear in the coefficients $\alpha$ and $\beta_j$. It is not required that the $X_i$ appear as linear terms in the regression function.
For example, we may include power transforms of the form $X_i^q$ for integer values of $q$. This allows us to fit regression functions that are polynomials in the covariates.
For example, we could fit the following cubic model:
$$E(Y|X) = \alpha + \beta_1 X + \beta_2 X^2 + \beta_3 X^3.$$
This is a bivariate model, as Y and X are the only measured quantities. But we must use
multiple regression to fit the model since X occurs under several power transforms.
The following data come from the population regression function $E(Y|X) = X^3 - X$, with $\mathrm{var}(Y|X) = 4$. The fitted regression function is

$$\hat{E}(Y|X) = 0.54 - 1.30X - 0.18X^2 + 1.15X^3.$$
[Figure: simulated data and the fitted cubic regression function.]
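A sketch of how such a simulation and cubic fit could be carried out (a different random sample, so the fitted coefficients will not exactly match the values above):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x = rng.uniform(-2, 2, size=n)
y = x**3 - x + rng.normal(scale=2.0, size=n)    # var(Y|X) = 4

# Cubic polynomial regression: columns 1, X, X^2, X^3
X = np.column_stack([np.ones(n), x, x**2, x**3])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)    # estimates of (alpha, beta_1, beta_2, beta_3)
```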
If more than one predictor is observed, we can include polynomial terms for any of the
predictors.
For example, in the temperature data we could include the three main effects along with a quadratic term for any one of the three predictors:
Quadratic term    PVE     F
Latitude          0.86    1320
Longitude         0.89    1680
Elevation         0.86    1319
[Figure: fitted longitude effect, linear versus quadratic.]
Model building
Suppose you measure a response variable $Y$ and several predictor variables $X_1, X_2, \ldots, X_p$.
We can directly fit the full model
$$E(Y|X) = \alpha + \beta_1 X_1 + \cdots + \beta_p X_p,$$
but what if we are not certain that all of the variables are informative about the response?
Model building, or variable selection, is the process of building a model that aims to include only the relevant predictors.
One approach is all subsets regression, in which all possible models are fit (if there are $p$ predictors then there are $2^p$ different models).
A critical issue is that if more variables are included, the fit will always be better. Thus if
we select the model with the highest F statistic or PVE, we will always select the full model.
Therefore we adjust by penalizing models with many variables that don't fit much better than models with fewer variables.
One way to do this is using the Akaike Information Criterion (AIC):
$$\mathrm{AIC} = n\log(\mathrm{SSE}/n) + 2(p + 1).$$
Lower AIC values indicate a better model.
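Using this definition, AIC can be computed directly from a model's SSE; a small sketch (hypothetical helper name):

```python
import numpy as np

def aic(y, yhat, p):
    """AIC = n log(SSE/n) + 2(p + 1) for a model with p predictors."""
    n = len(y)
    sse = np.sum((y - yhat)**2)
    return n * np.log(sse / n) + 2 * (p + 1)
```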
Here are the all subsets results for the temperature data:
Predictors                          AIC     PVE     F
None                                5499    0       0
Latitude                            4016    0.75    1601
Longitude                           5388    0.10    59
Elevation                           5484    0.02    9
Longitude, Elevation                5281    0.19    82
Latitude, Elevation                 4010    0.75    1080
Latitude, Longitude                 3479    0.85    2000
Latitude, Longitude, Elevation      3397    0.86    1645
When meaningless variables are added to the model (for example 10, 50, 100, or 200 of them), the PVE continues to climb (suggesting better fit), while the AIC increases (suggesting a worse model).
If $p$ is large, then it is not practical to investigate all $2^p$ distinct submodels. In this case we
can apply forward selection.
First find the best one-variable model based on AIC:
Predictors    AIC     PVE     F
Latitude      4016    0.75    1601
Longitude     5388    0.10    59
Elevation     5484    0.02    9
Latitude gives the lowest AIC, so it is selected first. Next, find the best two-variable model that includes latitude:

Predictors             AIC     PVE     F
Latitude, Elevation    4010    0.75    1080
Latitude, Longitude    3479    0.85    2000
The latitude and longitude model has the lowest AIC among these. Finally, consider the full three-variable model:

Predictors                         AIC     PVE     F
Latitude, Longitude, Elevation     3397    0.86    1645
Since this has lower AIC than the best two-variable model, this is our final model.
Note that in order to arrive at this model, we never considered the longitude and elevation
model.
In general, around $p^2/2$ models must be checked in forward selection. For large $p$, this is far less than the $2^p$ models that must be checked for the all subsets approach (e.g. if $p = 10$ then $p^2/2 = 50$ while $2^{10} = 1024$).
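A compact sketch of AIC-based forward selection, reusing the aic helper above (the fit helper and dict-of-arrays interface are illustrative choices, not the exact procedure used to produce the tables):

```python
import numpy as np

def fit(y, cols):
    """Fitted values from least squares on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y))] + cols)
    return X @ np.linalg.lstsq(X, y, rcond=None)[0]

def forward_select(y, predictors):
    """predictors: dict mapping names to 1-D arrays. Returns selected names."""
    selected, remaining = [], dict(predictors)
    best_aic = aic(y, np.full(len(y), y.mean()), 0)    # intercept-only model
    while remaining:
        scores = {name: aic(y, fit(y, [predictors[s] for s in selected] + [col]),
                            len(selected) + 1)
                  for name, col in remaining.items()}
        best = min(scores, key=scores.get)
        if scores[best] >= best_aic:
            break                       # no candidate lowers the AIC; stop
        best_aic = scores[best]
        selected.append(best)
        del remaining[best]
    return selected
```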
An alternative is backward selection: begin with the full model,

Predictors                         AIC     PVE     F
Latitude, Longitude, Elevation     3397    0.86    1645

then consider every model that can be formed by dropping one variable:

Predictors              AIC     PVE     F
Longitude, Elevation    5281    0.19    82
Latitude, Elevation     4010    0.75    1080
Latitude, Longitude     3479    0.85    2000
The best of these is the latitude and longitude model. Since it has higher AIC than the full
model, we stop here and use the full model as our final model. If one of the two-variable
models had lower AIC than the full model, then we would continue by looking at one-variable
models.
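Backward selection can be sketched in the same style, again assuming the aic and fit helpers above:

```python
def backward_select(y, predictors):
    """Start from the full model and drop variables while this lowers the AIC."""
    selected = list(predictors)
    best_aic = aic(y, fit(y, [predictors[s] for s in selected]), len(selected))
    while selected:
        scores = {name: aic(y, fit(y, [predictors[s] for s in selected if s != name]),
                            len(selected) - 1)
                  for name in selected}
        best = min(scores, key=scores.get)
        if scores[best] >= best_aic:
            break                       # dropping any variable makes AIC worse; stop
        best_aic = scores[best]
        selected.remove(best)
    return selected
```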
Diagnostics
A plot of the residuals against the fitted values should show no pattern:
[Figure: residuals versus fitted values.]
[Figure: normal quantile plot of the standardized residuals.]
There should be no pattern when plotting residuals against each predictor variable:
[Figures: residuals plotted against latitude, longitude, and elevation.]
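A sketch of producing these diagnostic plots with matplotlib (r should be a numpy array of residuals, yhat the fitted values, and predictors a dict mapping names to arrays; all names here are illustrative):

```python
import matplotlib.pyplot as plt
from scipy.stats import probplot

def residual_plots(r, yhat, predictors):
    """Residuals vs. fitted values, a normal quantile plot, and residuals
    vs. each predictor."""
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    axes[0].scatter(yhat, r, s=5)
    axes[0].axhline(0, color="gray")
    axes[0].set_xlabel("Fitted values")
    axes[0].set_ylabel("Residuals")
    probplot(r / r.std(ddof=1), plot=axes[1])    # normal quantile plot

    fig2, axes2 = plt.subplots(1, len(predictors), figsize=(12, 4), squeeze=False)
    for ax, (name, x) in zip(axes2[0], predictors.items()):
        ax.scatter(x, r, s=5)
        ax.axhline(0, color="gray")
        ax.set_xlabel(name)
        ax.set_ylabel("Residual")
    plt.show()
```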
Since two of the predictors are map coordinates, we can check whether large residuals cluster
regionally:
[Figure: station locations (longitude versus latitude), with points marked by residual percentile group: 0-75, 75-90, and 90-100.]