
Chapter 3

Multiple Regression

3.1 Multiple Linear Regression Model

A fitted linear regression model always leaves some residual variation. There
might be another systematic cause for the variability in the observations yi. If we
have data on other explanatory variables, we can ask whether they can be used to
explain some of the residual variation in Y. If this is the case, we should take it into
account in the model, so that the errors are purely random. We could write
$$
Y_i = \beta_0 + \beta_1 x_i + \underbrace{\beta_2 z_i + \varepsilon_i^{\star}}_{\text{previously } \varepsilon_i}.
$$

Here Z is another explanatory variable. Usually, we denote all explanatory variables
(there may be more than two of them) using the letter X with an index to distinguish
between them, i.e., X1, X2, . . . , Xp−1.

Example 3.1. (Neter et al., 1996) Dwine Studios, Inc.


The company operates portrait studios in 21 cities of medium size. These studios
specialize in portraits of children. The company is considering an expansion into
other cities of medium size and wishes to investigate whether sales (Y) in a
community can be predicted from the number of persons aged 16 or younger in the
community (X1) and the per capita disposable personal income in the community
(X2).
If we use just X2 (per capita disposable personal income in the community) to
model Y (sales in the community) we obtain the following model fit.


The regression equation is

Y = - 352.5 + 31.17 X2

S = 20.3863   R-Sq = 69.9%   R-Sq(adj) = 68.3%

Analysis of Variance
Source       DF       SS       MS      F      P
Regression    1  18299.8  18299.8  44.03  0.000
Error        19   7896.4    415.6
Total        20  26196.2

Figure 3.1: (a) Fitted line plot for Dwine Studios versus per capita disposable personal income
in the community. (b) Residual plots.

The regression is highly significant, but R² is rather small. This suggests that there
could be other factors which are also important for the sales. We have data on the
number of persons aged 16 or younger in the community, so we will examine whether
the residuals of the above fit are related to this variable. If so, then including it in
the model may improve the fit.
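As an aside, a minimal sketch of this check in Python with NumPy (code is not part of the original notes); the array names y, x1 and x2 are hypothetical placeholders for the 21 observations.

```python
import numpy as np

# Hypothetical arrays for the 21 observations (not listed in the notes):
# y  - sales, x1 - persons aged 16 or younger, x2 - per capita disposable income.
# y, x1, x2 = ...

def fit_on_x2(y, x2):
    """Fit Y on X2 alone by least squares; return estimates, fitted values, residuals."""
    X = np.column_stack([np.ones(len(y)), x2])        # design matrix [1, x2]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)      # (intercept, slope)
    fitted = X @ beta
    return beta, fitted, y - fitted

# beta, fitted, resid = fit_on_x2(y, x2)
# A clear pattern in a plot of resid against x1, or a large value of
# np.corrcoef(resid, x1)[0, 1], suggests that X1 explains some of the
# residual variation and is worth adding to the model.
```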

Indeed, the residuals show a possible relationship with the number of persons aged
16 or younger in the community, as shown in Figure 3.2.

Figure 3.2: The dependence of the residuals on X1.

We will fit the model with both variables, X1 and X2, included, that is
$$
Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i, \qquad i = 1, \ldots, n.
$$
The model fit is as follows.

The regression equation is


Y = - 68.9 + 1.45 X1 + 9.37 X2

Predictor      Coef  SE Coef      T      P
Constant     -68.86    60.02  -1.15  0.266
X1           1.4546   0.2118   6.87  0.000
X2            9.366    4.064   2.30  0.033

S = 11.0074   R-Sq = 91.7%   R-Sq(adj) = 90.7%

Analysis of Variance
Source          DF     SS     MS      F      P
Regression       2  24015  12008  99.10  0.000
Residual Error  18   2181    121
Total           20  26196

Here we see that the intercept parameter is not significantly different from zero
(p = 0.266), and so the model without the intercept was fitted. R² is now close to
100% and both parameters are highly significant.

Regression Equation
Y = 1.62 X1 + 4.75 X2

Coefficients
Term Coef SE Coef T P
X1 1.62175 0.154948 10.4664 0.000
X2 4.75042 0.583246 8.1448 0.000

S = 11.0986 R-Sq = 99.68% R-Sq(adj) = 99.64%

Analysis of Variance
Source DF Seq SS Adj SS Adj MS F P
Regression 2 718732 718732 359366 2917.42 0.000
Error 19 2340 2340 123
Total 21 721072
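For readers who want to reproduce fits of this kind in software, here is a minimal least squares sketch in Python/NumPy, again assuming hypothetical arrays y, x1, x2 holding the 21 observations; the commented values refer to the output shown above.

```python
import numpy as np

# As before, y, x1, x2 are hypothetical arrays holding the 21 observations.

def fit_ls(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# With intercept (compare with the first output above):
# X_full = np.column_stack([np.ones(len(y)), x1, x2])
# fit_ls(X_full, y)            # approximately (-68.9, 1.45, 9.37)

# Without intercept (compare with the second output above):
# X_noint = np.column_stack([x1, x2])
# fit_ls(X_noint, y)           # approximately (1.62, 4.75)
```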

Figure 3.3: Fitted surface plot and the Dwine Studios observations.

A Multiple Linear Regression (MLR) model for a response variable Y and explanatory
variables X1, X2, . . . , Xp−1 is
$$
\begin{aligned}
E(Y \mid X_1 = x_{1i}, \ldots, X_{p-1} = x_{p-1,i}) &= \beta_0 + \beta_1 x_{1i} + \cdots + \beta_{p-1} x_{p-1,i} \\
\operatorname{var}(Y \mid X_1 = x_{1i}, \ldots, X_{p-1} = x_{p-1,i}) &= \sigma^2, \qquad i = 1, \ldots, n \\
\operatorname{cov}(Y \mid X_1 = x_{1i}, \ldots, X_{p-1} = x_{p-1,i},\; Y \mid X_1 = x_{1j}, \ldots, X_{p-1} = x_{p-1,j}) &= 0, \qquad i \neq j.
\end{aligned}
$$
As in the SLR model we denote
$$
Y_i = (Y \mid X_1 = x_{1i}, \ldots, X_{p-1} = x_{p-1,i})
$$
and we usually omit the condition on Xs and write
$$
\begin{aligned}
\mu_i = E(Y_i) &= \beta_0 + \beta_1 x_{1i} + \cdots + \beta_{p-1} x_{p-1,i} \\
\operatorname{var}(Y_i) &= \sigma^2, \qquad i = 1, \ldots, n \\
\operatorname{cov}(Y_i, Y_j) &= 0, \qquad i \neq j,
\end{aligned}
$$
or
$$
\begin{aligned}
Y_i &= \beta_0 + \beta_1 x_{1i} + \cdots + \beta_{p-1} x_{p-1,i} + \varepsilon_i \\
E(\varepsilon_i) &= 0 \\
\operatorname{var}(\varepsilon_i) &= \sigma^2, \qquad i = 1, \ldots, n \\
\operatorname{cov}(\varepsilon_i, \varepsilon_j) &= 0, \qquad i \neq j.
\end{aligned}
$$
For testing we need the assumption of Normality, i.e., we assume that
$$
Y_i \overset{\text{ind}}{\sim} N(\mu_i, \sigma^2)
$$
or
$$
\varepsilon_i \overset{\text{ind}}{\sim} N(0, \sigma^2).
$$
To simplify the notation we write the MLR model in a matrix form
$$
Y = X\beta + \varepsilon, \tag{3.1}
$$
that is,
$$
\underbrace{\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}}_{:=\,Y}
=
\underbrace{\begin{pmatrix}
1 & x_{1,1} & \cdots & x_{p-1,1} \\
1 & x_{1,2} & \cdots & x_{p-1,2} \\
\vdots & \vdots &  & \vdots \\
1 & x_{1,n} & \cdots & x_{p-1,n}
\end{pmatrix}}_{:=\,X}
\underbrace{\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{pmatrix}}_{:=\,\beta}
+
\underbrace{\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}}_{:=\,\varepsilon}
$$

Here Y is the vector of responses, X is often called the design matrix, β is the
vector of unknown, constant parameters and ε is the vector of random errors.
The εi are independent and identically distributed, that is
$$
\varepsilon \sim N_n(0_n, \sigma^2 I_n).
$$
Note that the properties of the errors give
$$
Y \sim N_n(X\beta, \sigma^2 I_n).
$$
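The matrix form translates directly into code. The following Python/NumPy sketch (illustrative only, not part of the original notes) simulates one data set from the model, with an arbitrarily chosen β and σ.

```python
import numpy as np

rng = np.random.default_rng(1)

n, p = 50, 3                          # n observations, p parameters (incl. intercept)
beta = np.array([2.0, 1.5, -0.5])     # illustrative, arbitrarily chosen parameters
sigma = 1.0

# Design matrix: a column of ones followed by p - 1 explanatory variables.
X = np.column_stack([np.ones(n), rng.uniform(size=(n, p - 1))])

# Y = X beta + eps,  with eps ~ N_n(0, sigma^2 I_n)
eps = rng.normal(scale=sigma, size=n)
Y = X @ beta + eps
```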

3.2 Least Squares Estimation

To derive the least squares estimator (LSE) for the parameter vector β we minimise
the sum of squares of the errors, that is
$$
\begin{aligned}
S(\beta) &= \sum_{i=1}^{n} \left[ Y_i - \{\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_{p-1} x_{p-1,i}\} \right]^2 \\
&= \sum_{i=1}^{n} \varepsilon_i^2 \\
&= \varepsilon^T \varepsilon \\
&= (Y - X\beta)^T (Y - X\beta) \\
&= (Y^T - \beta^T X^T)(Y - X\beta) \\
&= Y^T Y - Y^T X\beta - \beta^T X^T Y + \beta^T X^T X\beta \\
&= Y^T Y - 2\beta^T X^T Y + \beta^T X^T X\beta.
\end{aligned}
$$

Theorem 3.1. The LSE β̂ of β is given by
$$
\hat{\beta} = (X^T X)^{-1} X^T Y
$$
if X^T X is non-singular. If X^T X is singular there is no unique LSE of β.

Proof. Let β₀ be any solution of X^T Xβ = X^T Y. Then X^T Xβ₀ = X^T Y and
$$
\begin{aligned}
S(\beta) - S(\beta_0)
&= Y^T Y - 2\beta^T X^T Y + \beta^T X^T X\beta - Y^T Y + 2\beta_0^T X^T Y - \beta_0^T X^T X\beta_0 \\
&= -2\beta^T X^T X\beta_0 + \beta^T X^T X\beta + 2\beta_0^T X^T X\beta_0 - \beta_0^T X^T X\beta_0 \\
&= \beta^T X^T X\beta - 2\beta^T X^T X\beta_0 + \beta_0^T X^T X\beta_0 \\
&= \beta^T X^T X\beta - \beta^T X^T X\beta_0 - \beta^T X^T X\beta_0 + \beta_0^T X^T X\beta_0 \\
&= \beta^T X^T X\beta - \beta^T X^T X\beta_0 - \beta_0^T X^T X\beta + \beta_0^T X^T X\beta_0 \\
&= \beta^T (X^T X\beta - X^T X\beta_0) - \beta_0^T (X^T X\beta - X^T X\beta_0) \\
&= (\beta^T - \beta_0^T)(X^T X\beta - X^T X\beta_0) \\
&= (\beta - \beta_0)^T X^T X (\beta - \beta_0) \\
&= \{X(\beta - \beta_0)\}^T \{X(\beta - \beta_0)\} \geq 0,
\end{aligned}
$$
since it is a sum of squares of the elements of the vector X(β − β₀). (In the fifth
equality we used that β^T X^T Xβ₀ is a scalar and so equals its transpose β₀^T X^T Xβ.)

We have shown that S(β) − S(β₀) ≥ 0. Hence β₀ minimises S(β), i.e. any solution
of X^T Xβ = X^T Y minimises S(β).

If X^T X is non-singular the unique solution is β̂ = (X^T X)^{-1} X^T Y.

If X^T X is singular there is no unique solution. □



Note that, as we did for the SLM in Chapter 2, it is possible to obtain this result
by differentiating S(β) with respect to β and setting it equal to 0.
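A minimal computational sketch of Theorem 3.1 in Python/NumPy (illustrative only): rather than forming (X^T X)^{-1} explicitly, it is usual to solve the normal equations; np.linalg.lstsq returns the same estimate via a more numerically stable SVD-based routine.

```python
import numpy as np

def lse(X, Y):
    """Least squares estimate of beta, i.e. the solution of X'X b = X'Y."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Using the simulated X and Y from the earlier sketch:
# beta_hat = lse(X, Y)
# np.linalg.lstsq(X, Y, rcond=None)[0]   # same estimate, SVD-based
```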

3.2.1 Properties of the least squares estimator

Theorem 3.2. If
$$
Y = X\beta + \varepsilon, \qquad \varepsilon \sim N_n(0, \sigma^2 I),
$$
then
$$
\hat{\beta} \sim N_p(\beta, \sigma^2 (X^T X)^{-1}).
$$

Proof. Each element of β̂ is a linear function of Y1, . . . , Yn. We assume that the
Yi, i = 1, . . . , n, are normally distributed, hence β̂ is also normally distributed.
The expectation and variance-covariance matrix can be obtained in the same way
as in Theorem 2.7. □
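The covariance matrix in Theorem 3.2 can be estimated by plugging in the unbiased estimator of σ² used later, MS_E = SS_E/(n − p). A sketch, assuming X and Y are NumPy arrays as in the earlier simulation:

```python
import numpy as np

def beta_hat_covariance(X, Y):
    """Estimate sigma^2 (X'X)^{-1}, plugging in sigma2_hat = SSE / (n - p)."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y
    resid = Y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - p)      # MSE
    return sigma2_hat * XtX_inv

# Estimated standard errors of the individual coefficients
# (the "SE Coef" column in output such as that shown in Example 3.1):
# np.sqrt(np.diag(beta_hat_covariance(X, Y)))
```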
Remark 3.1. The vector of fitted values is given by
$$
\hat{\mu} = \hat{Y} = X\hat{\beta} = X(X^T X)^{-1} X^T Y = HY.
$$
The matrix H = X(X^T X)^{-1} X^T is called the hat matrix.

Note that
$$
H^T = H
$$
and also
$$
HH = X(X^T X)^{-1} \underbrace{X^T X (X^T X)^{-1}}_{=I} X^T = X(X^T X)^{-1} X^T = H.
$$

A matrix A which satisfies the condition AA = A is called an idempotent matrix.
Note that if A is idempotent, then (I − A) is also idempotent.
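These properties are easy to verify numerically. A small Python/NumPy sketch (illustrative only, for any full-rank design matrix X):

```python
import numpy as np

def hat_matrix(X):
    """H = X (X'X)^{-1} X', the matrix mapping Y onto the fitted values."""
    return X @ np.linalg.solve(X.T @ X, X.T)

# Numerical check of the properties above:
# H = hat_matrix(X)
# n = X.shape[0]
# np.allclose(H, H.T)                                              # H is symmetric
# np.allclose(H @ H, H)                                            # H is idempotent
# np.allclose((np.eye(n) - H) @ (np.eye(n) - H), np.eye(n) - H)    # so is I - H
```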

We now prove some results about the residual vector
$$
e = Y - \hat{Y} = Y - HY = (I - H)Y.
$$

As in Theorem 2.8, here we have


Lemma 3.1. E(e) = 0.
Proof.
$$
E(e) = (I - H)E(Y) = (I - X(X^T X)^{-1} X^T)X\beta = X\beta - X\beta = 0. \qquad \square
$$


Lemma 3.2. Var(e) = σ 2 (I − H).
Proof.
$$
\operatorname{Var}(e) = (I - H)\operatorname{var}(Y)(I - H)^T = (I - H)\,\sigma^2 I\,(I - H) = \sigma^2(I - H). \qquad \square
$$


Lemma 3.3. The sum of squares of the residuals is Y^T (I − H)Y.
Proof.
$$
\sum_{i=1}^{n} e_i^2 = e^T e = Y^T (I - H)^T (I - H) Y = Y^T (I - H) Y,
$$
using the fact that I − H is symmetric and idempotent. □


Lemma 3.4. The elements of the residual vector e sum to zero, i.e.
$$
\sum_{i=1}^{n} e_i = 0.
$$

Proof. We will prove this by contradiction.

Assume that Σ e_i = nc where c ≠ 0. Then
$$
\begin{aligned}
\sum e_i^2 &= \sum \{(e_i - c) + c\}^2 \\
&= \sum (e_i - c)^2 + 2c \sum (e_i - c) + nc^2 \\
&= \sum (e_i - c)^2 + 2c\big(\underbrace{\textstyle\sum e_i}_{=nc} - nc\big) + nc^2 \\
&= \sum (e_i - c)^2 + nc^2 \\
&> \sum (e_i - c)^2.
\end{aligned}
$$
Note that e_i − c = Y_i − (Ŷ_i + c), so Σ(e_i − c)² is the value of S(β) obtained by
replacing the estimated intercept β̂₀ by β̂₀ + c. But we know that Σ e_i² is the
minimum value of S(β), so there cannot exist a parameter vector with a smaller
sum of squares, and this gives the required contradiction. So c = 0. □

Corollary 3.1.
$$
\frac{1}{n} \sum_{i=1}^{n} \hat{Y}_i = \bar{Y}.
$$

Proof. The residuals are e_i = Y_i − Ŷ_i, so Σ e_i = Σ(Y_i − Ŷ_i), but Σ e_i = 0. Hence
Σ Y_i = Σ Ŷ_i and the result follows. □
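A numerical check of Lemma 3.4 and Corollary 3.1 (a sketch, not part of the original notes); both rely on the design matrix containing the column of ones, i.e. an intercept term, as in the model above. The arrays X and Y are assumed to be as in the earlier sketches.

```python
import numpy as np

def residual_checks(X, Y):
    """Return the sum of the residuals, the mean fitted value and the mean of Y."""
    H = X @ np.linalg.solve(X.T @ X, X.T)
    fitted = H @ Y
    resid = Y - fitted
    return resid.sum(), fitted.mean(), Y.mean()

# For a design matrix whose first column is the column of ones, the sum of the
# residuals is zero up to rounding error and the two means agree:
# residual_checks(X, Y)
```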

3.3 Analysis of Variance

We begin this section by proving the basic Analysis of Variance identity.


Theorem 3.3. The total sum of squares splits into the regression sum of squares
and the residual sum of squares, that is

SST = SSR + SSE .

Proof.
$$
\begin{aligned}
SS_T &= \sum (Y_i - \bar{Y})^2 \\
&= \sum Y_i^2 - n\bar{Y}^2 \\
&= Y^T Y - n\bar{Y}^2.
\end{aligned}
$$

$$
\begin{aligned}
SS_R &= \sum (\hat{Y}_i - \bar{Y})^2 \\
&= \sum \hat{Y}_i^2 - 2\bar{Y} \underbrace{\textstyle\sum \hat{Y}_i}_{=n\bar{Y}} + n\bar{Y}^2 \\
&= \sum \hat{Y}_i^2 - n\bar{Y}^2 \\
&= \hat{Y}^T \hat{Y} - n\bar{Y}^2 \\
&= \hat{\beta}^T X^T X \hat{\beta} - n\bar{Y}^2 \\
&= Y^T X (X^T X)^{-1} \underbrace{X^T X (X^T X)^{-1}}_{=I} X^T Y - n\bar{Y}^2 \\
&= Y^T H Y - n\bar{Y}^2.
\end{aligned}
$$

We have seen (Lemma 3.3) that
$$
SS_E = Y^T (I - H) Y
$$
and so
$$
SS_R + SS_E = Y^T H Y - n\bar{Y}^2 + Y^T (I - H) Y = Y^T Y - n\bar{Y}^2 = SS_T. \qquad \square
$$
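The identity can also be verified numerically. A Python/NumPy sketch (illustrative only, assuming a full-rank design matrix X whose first column is ones and a response vector Y):

```python
import numpy as np

def anova_sums(X, Y):
    """Return (SST, SSR, SSE) for the least squares fit of Y on X."""
    n = len(Y)
    H = X @ np.linalg.solve(X.T @ X, X.T)
    sst = Y @ Y - n * Y.mean() ** 2
    ssr = Y @ H @ Y - n * Y.mean() ** 2
    sse = Y @ (np.eye(n) - H) @ Y
    return sst, ssr, sse

# sst, ssr, sse = anova_sums(X, Y)
# np.isclose(sst, ssr + sse)    # the identity SST = SSR + SSE
```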

F-test for the Overall Significance of Regression

Suppose we wish to test the hypothesis

H0 : β1 = β2 = . . . = βp−1 = 0,

i.e. all coefficients except β0 are zero, versus

H1 : ¬H0 ,

which means that at least one of the coefficients is non-zero. Under H0 , the model
reduces to the null model
Y = 1β0 + ε,

where 1 is a vector of ones.

In testing H0 we are asking if there is sufficient evidence to reject the null model.

The Analysis of variance table is given by

Source               d.f.    SS                MS             VR
Overall regression   p − 1   Y^T H Y − n Ȳ²    SS_R/(p − 1)   MS_R/MS_E
Residual             n − p   Y^T (I − H) Y     SS_E/(n − p)
Total                n − 1   Y^T Y − n Ȳ²

As in the SLM we have n − 1 total degrees of freedom. Fitting a linear model with
p parameters (β0, β1, . . . , βp−1) leaves n − p residual d.f. Then the regression d.f.
are n − 1 − (n − p) = p − 1.

It can be shown that E(SS_E) = (n − p)σ², that is, MS_E is an unbiased estimator
of σ². Also,
$$
\frac{SS_E}{\sigma^2} \sim \chi^2_{n-p}
$$
and if β1 = . . . = βp−1 = 0, then
$$
\frac{SS_R}{\sigma^2} \sim \chi^2_{p-1}.
$$
The two statistics are independent, hence, under H0,
$$
\frac{MS_R}{MS_E} \sim F_{p-1,\,n-p}.
$$
This is the test statistic for the null hypothesis

H0 : β1 = β2 = . . . = βp−1 = 0,

versus
H1 : ¬H0 .
We reject H0 at the 100α% level of significance if

Fobs > Fα;p−1,n−p ,

where Fα;p−1,n−p is such that P (F > Fα;p−1,n−p ) = α.
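A sketch of the overall F-test in Python (NumPy and SciPy; illustrative only, not part of the original notes):

```python
import numpy as np
from scipy import stats

def overall_f_test(X, Y):
    """F statistic and p-value for H0: beta_1 = ... = beta_{p-1} = 0."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)
    ssr = Y @ H @ Y - n * Y.mean() ** 2
    sse = Y @ (np.eye(n) - H) @ Y
    f_obs = (ssr / (p - 1)) / (sse / (n - p))
    p_value = stats.f.sf(f_obs, p - 1, n - p)     # P(F_{p-1, n-p} > f_obs)
    return f_obs, p_value

# Reject H0 at level alpha when f_obs > stats.f.ppf(1 - alpha, p - 1, n - p),
# or equivalently when p_value < alpha.
```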
