
Chapter 3

Multiple Regression

3.1 Multiple Linear Regression Model

A fitted linear regression model always leaves some residual variation. There
might be another systematic cause for the variability in the observations yi. If we
have data on other explanatory variables, we can ask whether they can be used to
explain some of the residual variation in Y. If this is the case, we should take it into
account in the model, so that the errors are purely random. We could write
$$
Y_i = \beta_0 + \beta_1 x_i + \underbrace{\beta_2 z_i + \varepsilon_i^{\star}}_{\text{previously } \varepsilon_i}.
$$

Here Z is another explanatory variable. Usually, we denote all explanatory variables
(there may be more than two of them) using the letter X with an index to distinguish
between them, i.e., X1, X2, . . . , Xp−1.

Example 3.1. (Neter et al., 1996) Dwine Studios, Inc.


The company operates portrait studios in 21 cities of medium size. These studios
specialize in portraits of children. The company is considering an expansion into
other cities of medium size and wishes to investigate whether sales (Y) in a
community can be predicted from the number of persons aged 16 or younger in the
community (X1) and the per capita disposable personal income in the community
(X2).
If we use just X2 (per capita disposable personal income in the community) to
model Y (sales in the community) we obtain the following model fit.


The regression equation is

Y = - 352.5 + 31.17 X2

S = 20.3863   R-Sq = 69.9%   R-Sq(adj) = 68.3%

Analysis of Variance
Source       DF       SS       MS      F      P
Regression    1  18299.8  18299.8  44.03  0.000
Error        19   7896.4    415.6
Total        20  26196.2

Figure 3.1: (a) Fitted line plot for Dwine Studios versus per capita disposable personal income
in the community. (b) Residual plots.

The regression is highly significant, but R² is rather small. This suggests that there
could be other factors which are also important for the sales. We have data on the
number of persons aged 16 or younger in the community, so we will examine whether
the residuals of the above fit are related to this variable. If so, then including it in
the model may improve the fit.
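As an aside, a minimal sketch of this check in Python with NumPy (code is not part of the original notes); the array names y, x1 and x2 are hypothetical placeholders for the 21 observations.

```python
import numpy as np

# Hypothetical arrays for the 21 observations (not listed in the notes):
# y  - sales, x1 - persons aged 16 or younger, x2 - per capita disposable income.
# y, x1, x2 = ...

def fit_on_x2(y, x2):
    """Fit Y on X2 alone by least squares; return estimates, fitted values, residuals."""
    X = np.column_stack([np.ones(len(y)), x2])        # design matrix [1, x2]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)      # (intercept, slope)
    fitted = X @ beta
    return beta, fitted, y - fitted

# beta, fitted, resid = fit_on_x2(y, x2)
# A clear pattern in a plot of resid against x1, or a large value of
# np.corrcoef(resid, x1)[0, 1], suggests that X1 explains some of the
# residual variation and is worth adding to the model.
```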

Indeed, the residuals show a possible relationship with the number of persons aged
16 or younger in the community, as shown in Figure 3.2.

Figure 3.2: The dependence of the residuals on X1.

We will fit the model with both variables, X1 and X2, included, that is
$$
Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i, \qquad i = 1, \ldots, n.
$$
The model fit is as follows.

The regression equation is


Y = - 68.9 + 1.45 X1 + 9.37 X2

Predictor      Coef  SE Coef      T      P
Constant     -68.86    60.02  -1.15  0.266
X1           1.4546   0.2118   6.87  0.000
X2            9.366    4.064   2.30  0.033

S = 11.0074   R-Sq = 91.7%   R-Sq(adj) = 90.7%

Analysis of Variance
Source          DF     SS     MS      F      P
Regression       2  24015  12008  99.10  0.000
Residual Error  18   2181    121
Total           20  26196

Here we see that the intercept parameter is not significantly different from zero
(p = 0.266), and so the model without the intercept was fitted. R² is now close to
100% and both parameters are highly significant.

Regression Equation
Y = 1.62 X1 + 4.75 X2

Coefficients
Term Coef SE Coef T P
X1 1.62175 0.154948 10.4664 0.000
X2 4.75042 0.583246 8.1448 0.000

S = 11.0986 R-Sq = 99.68% R-Sq(adj) = 99.64%

Analysis of Variance
Source DF Seq SS Adj SS Adj MS F P
Regression 2 718732 718732 359366 2917.42 0.000
Error 19 2340 2340 123
Total 21 721072
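For readers who want to reproduce fits of this kind in software, here is a minimal least squares sketch in Python/NumPy, again assuming hypothetical arrays y, x1, x2 holding the 21 observations; the commented values refer to the output shown above.

```python
import numpy as np

# As before, y, x1, x2 are hypothetical arrays holding the 21 observations.

def fit_ls(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# With intercept (compare with the first output above):
# X_full = np.column_stack([np.ones(len(y)), x1, x2])
# fit_ls(X_full, y)            # approximately (-68.9, 1.45, 9.37)

# Without intercept (compare with the second output above):
# X_noint = np.column_stack([x1, x2])
# fit_ls(X_noint, y)           # approximately (1.62, 4.75)
```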

Figure 3.3: Fitted surface plot and the Dwine Studios observations.

A Multiple Linear Regression (MLR) model for a response variable Y and explanatory
variables X1, X2, . . . , Xp−1 is
$$
\begin{aligned}
E(Y \mid X_1 = x_{1i}, \ldots, X_{p-1} = x_{p-1,i}) &= \beta_0 + \beta_1 x_{1i} + \cdots + \beta_{p-1} x_{p-1,i} \\
\operatorname{var}(Y \mid X_1 = x_{1i}, \ldots, X_{p-1} = x_{p-1,i}) &= \sigma^2, \qquad i = 1, \ldots, n \\
\operatorname{cov}(Y \mid X_1 = x_{1i}, \ldots, X_{p-1} = x_{p-1,i},\; Y \mid X_1 = x_{1j}, \ldots, X_{p-1} = x_{p-1,j}) &= 0, \qquad i \neq j.
\end{aligned}
$$
As in the SLR model we denote
$$
Y_i = (Y \mid X_1 = x_{1i}, \ldots, X_{p-1} = x_{p-1,i})
$$
and we usually omit the condition on Xs and write
$$
\begin{aligned}
\mu_i = E(Y_i) &= \beta_0 + \beta_1 x_{1i} + \cdots + \beta_{p-1} x_{p-1,i} \\
\operatorname{var}(Y_i) &= \sigma^2, \qquad i = 1, \ldots, n \\
\operatorname{cov}(Y_i, Y_j) &= 0, \qquad i \neq j,
\end{aligned}
$$
or
$$
\begin{aligned}
Y_i &= \beta_0 + \beta_1 x_{1i} + \cdots + \beta_{p-1} x_{p-1,i} + \varepsilon_i \\
E(\varepsilon_i) &= 0 \\
\operatorname{var}(\varepsilon_i) &= \sigma^2, \qquad i = 1, \ldots, n \\
\operatorname{cov}(\varepsilon_i, \varepsilon_j) &= 0, \qquad i \neq j.
\end{aligned}
$$
For testing we need the assumption of Normality, i.e., we assume that
$$
Y_i \overset{\text{ind}}{\sim} N(\mu_i, \sigma^2)
$$
or
$$
\varepsilon_i \overset{\text{ind}}{\sim} N(0, \sigma^2).
$$
To simplify the notation we write the MLR model in a matrix form
$$
Y = X\beta + \varepsilon, \tag{3.1}
$$
that is,
$$
\underbrace{\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}}_{:=\,Y}
=
\underbrace{\begin{pmatrix}
1 & x_{1,1} & \cdots & x_{p-1,1} \\
1 & x_{1,2} & \cdots & x_{p-1,2} \\
\vdots & \vdots &  & \vdots \\
1 & x_{1,n} & \cdots & x_{p-1,n}
\end{pmatrix}}_{:=\,X}
\underbrace{\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{pmatrix}}_{:=\,\beta}
+
\underbrace{\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}}_{:=\,\varepsilon}
$$

Here Y is the vector of responses, X is often called the design matrix, β is the
vector of unknown, constant parameters and ε is the vector of random errors.
The εi are independent and identically distributed, that is
$$
\varepsilon \sim N_n(0_n, \sigma^2 I_n).
$$
Note that the properties of the errors give
$$
Y \sim N_n(X\beta, \sigma^2 I_n).
$$
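The matrix form translates directly into code. The following Python/NumPy sketch (illustrative only, not part of the original notes) simulates one data set from the model, with an arbitrarily chosen β and σ.

```python
import numpy as np

rng = np.random.default_rng(1)

n, p = 50, 3                          # n observations, p parameters (incl. intercept)
beta = np.array([2.0, 1.5, -0.5])     # illustrative, arbitrarily chosen parameters
sigma = 1.0

# Design matrix: a column of ones followed by p - 1 explanatory variables.
X = np.column_stack([np.ones(n), rng.uniform(size=(n, p - 1))])

# Y = X beta + eps,  with eps ~ N_n(0, sigma^2 I_n)
eps = rng.normal(scale=sigma, size=n)
Y = X @ beta + eps
```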

3.2 Least Squares Estimation

To derive the least squares estimator (LSE) for the parameter vector β we minimise
the sum of squares of the errors, that is
$$
\begin{aligned}
S(\beta) &= \sum_{i=1}^{n} \left[ Y_i - \{\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_{p-1} x_{p-1,i}\} \right]^2 \\
&= \sum_{i=1}^{n} \varepsilon_i^2 \\
&= \varepsilon^T \varepsilon \\
&= (Y - X\beta)^T (Y - X\beta) \\
&= (Y^T - \beta^T X^T)(Y - X\beta) \\
&= Y^T Y - Y^T X\beta - \beta^T X^T Y + \beta^T X^T X\beta \\
&= Y^T Y - 2\beta^T X^T Y + \beta^T X^T X\beta.
\end{aligned}
$$

Theorem 3.1. The LSE β̂ of β is given by
$$
\hat{\beta} = (X^T X)^{-1} X^T Y
$$
if X^T X is non-singular. If X^T X is singular there is no unique LSE of β.

Proof. Let β₀ be any solution of X^T Xβ = X^T Y. Then X^T Xβ₀ = X^T Y and
$$
\begin{aligned}
S(\beta) - S(\beta_0)
&= Y^T Y - 2\beta^T X^T Y + \beta^T X^T X\beta - Y^T Y + 2\beta_0^T X^T Y - \beta_0^T X^T X\beta_0 \\
&= -2\beta^T X^T X\beta_0 + \beta^T X^T X\beta + 2\beta_0^T X^T X\beta_0 - \beta_0^T X^T X\beta_0 \\
&= \beta^T X^T X\beta - 2\beta^T X^T X\beta_0 + \beta_0^T X^T X\beta_0 \\
&= \beta^T X^T X\beta - \beta^T X^T X\beta_0 - \beta^T X^T X\beta_0 + \beta_0^T X^T X\beta_0 \\
&= \beta^T X^T X\beta - \beta^T X^T X\beta_0 - \beta_0^T X^T X\beta + \beta_0^T X^T X\beta_0 \\
&= \beta^T (X^T X\beta - X^T X\beta_0) - \beta_0^T (X^T X\beta - X^T X\beta_0) \\
&= (\beta^T - \beta_0^T)(X^T X\beta - X^T X\beta_0) \\
&= (\beta - \beta_0)^T X^T X (\beta - \beta_0) \\
&= \{X(\beta - \beta_0)\}^T \{X(\beta - \beta_0)\} \geq 0,
\end{aligned}
$$
since it is a sum of squares of the elements of the vector X(β − β₀). (In the fifth
equality we used that β^T X^T Xβ₀ is a scalar and so equals its transpose β₀^T X^T Xβ.)

We have shown that S(β) − S(β₀) ≥ 0. Hence β₀ minimises S(β), i.e. any solution
of X^T Xβ = X^T Y minimises S(β).

If X^T X is non-singular the unique solution is β̂ = (X^T X)^{-1} X^T Y.

If X^T X is singular there is no unique solution. □



Note that, as we did for the SLM in Chapter 2, it is possible to obtain this result
by differentiating S(β) with respect to β and setting it equal to 0.
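A minimal computational sketch of Theorem 3.1 in Python/NumPy (illustrative only): rather than forming (X^T X)^{-1} explicitly, it is usual to solve the normal equations; np.linalg.lstsq returns the same estimate via a more numerically stable SVD-based routine.

```python
import numpy as np

def lse(X, Y):
    """Least squares estimate of beta, i.e. the solution of X'X b = X'Y."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Using the simulated X and Y from the earlier sketch:
# beta_hat = lse(X, Y)
# np.linalg.lstsq(X, Y, rcond=None)[0]   # same estimate, SVD-based
```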

3.2.1 Properties of the least squares estimator

Theorem 3.2. If
$$
Y = X\beta + \varepsilon, \qquad \varepsilon \sim N_n(0, \sigma^2 I),
$$
then
$$
\hat{\beta} \sim N_p(\beta, \sigma^2 (X^T X)^{-1}).
$$

Proof. Each element of β̂ is a linear function of Y1, . . . , Yn. We assume that the
Yi, i = 1, . . . , n, are normally distributed, hence β̂ is also normally distributed.
The expectation and variance-covariance matrix can be obtained in the same way
as in Theorem 2.7. □
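The covariance matrix in Theorem 3.2 can be estimated by plugging in the unbiased estimator of σ² used later, MS_E = SS_E/(n − p). A sketch, assuming X and Y are NumPy arrays as in the earlier simulation:

```python
import numpy as np

def beta_hat_covariance(X, Y):
    """Estimate sigma^2 (X'X)^{-1}, plugging in sigma2_hat = SSE / (n - p)."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y
    resid = Y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - p)      # MSE
    return sigma2_hat * XtX_inv

# Estimated standard errors of the individual coefficients
# (the "SE Coef" column in output such as that shown in Example 3.1):
# np.sqrt(np.diag(beta_hat_covariance(X, Y)))
```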
Remark 3.1. The vector of fitted values is given by
$$
\hat{\mu} = \hat{Y} = X\hat{\beta} = X(X^T X)^{-1} X^T Y = HY.
$$
The matrix H = X(X^T X)^{-1} X^T is called the hat matrix.

Note that
$$
H^T = H
$$
and also
$$
HH = X(X^T X)^{-1} \underbrace{X^T X (X^T X)^{-1}}_{=I} X^T = X(X^T X)^{-1} X^T = H.
$$

A matrix A which satisfies the condition AA = A is called an idempotent matrix.
Note that if A is idempotent, then (I − A) is also idempotent.
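These properties are easy to verify numerically. A small Python/NumPy sketch (illustrative only, for any full-rank design matrix X):

```python
import numpy as np

def hat_matrix(X):
    """H = X (X'X)^{-1} X', the matrix mapping Y onto the fitted values."""
    return X @ np.linalg.solve(X.T @ X, X.T)

# Numerical check of the properties above:
# H = hat_matrix(X)
# n = X.shape[0]
# np.allclose(H, H.T)                                              # H is symmetric
# np.allclose(H @ H, H)                                            # H is idempotent
# np.allclose((np.eye(n) - H) @ (np.eye(n) - H), np.eye(n) - H)    # so is I - H
```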

We now prove some results about the residual vector
$$
e = Y - \hat{Y} = Y - HY = (I - H)Y.
$$

As in Theorem 2.8, here we have


Lemma 3.1. E(e) = 0.
Proof.
$$
E(e) = (I - H)E(Y) = (I - X(X^T X)^{-1} X^T)X\beta = X\beta - X\beta = 0. \qquad \square
$$


Lemma 3.2. Var(e) = σ 2 (I − H).
Proof.
$$
\operatorname{Var}(e) = (I - H)\operatorname{var}(Y)(I - H)^T = (I - H)\,\sigma^2 I\,(I - H) = \sigma^2(I - H). \qquad \square
$$


Lemma 3.3. The sum of squares of the residuals is Y^T (I − H)Y.
Proof.
$$
\sum_{i=1}^{n} e_i^2 = e^T e = Y^T (I - H)^T (I - H) Y = Y^T (I - H) Y,
$$
using the fact that I − H is symmetric and idempotent. □


Lemma 3.4. The elements of the residual vector e sum to zero, i.e.
$$
\sum_{i=1}^{n} e_i = 0.
$$

Proof. We will prove this by contradiction.

Assume that Σ e_i = nc where c ≠ 0. Then
$$
\begin{aligned}
\sum e_i^2 &= \sum \{(e_i - c) + c\}^2 \\
&= \sum (e_i - c)^2 + 2c \sum (e_i - c) + nc^2 \\
&= \sum (e_i - c)^2 + 2c\big(\underbrace{\textstyle\sum e_i}_{=nc} - nc\big) + nc^2 \\
&= \sum (e_i - c)^2 + nc^2 \\
&> \sum (e_i - c)^2.
\end{aligned}
$$
Note that e_i − c = Y_i − (Ŷ_i + c), so Σ(e_i − c)² is the value of S(β) obtained by
replacing the estimated intercept β̂₀ by β̂₀ + c. But we know that Σ e_i² is the
minimum value of S(β), so there cannot exist a parameter vector with a smaller
sum of squares, and this gives the required contradiction. So c = 0. □

Corollary 3.1.
$$
\frac{1}{n} \sum_{i=1}^{n} \hat{Y}_i = \bar{Y}.
$$

Proof. The residuals are e_i = Y_i − Ŷ_i, so Σ e_i = Σ(Y_i − Ŷ_i), but Σ e_i = 0. Hence
Σ Y_i = Σ Ŷ_i and the result follows. □
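A numerical check of Lemma 3.4 and Corollary 3.1 (a sketch, not part of the original notes); both rely on the design matrix containing the column of ones, i.e. an intercept term, as in the model above. The arrays X and Y are assumed to be as in the earlier sketches.

```python
import numpy as np

def residual_checks(X, Y):
    """Return the sum of the residuals, the mean fitted value and the mean of Y."""
    H = X @ np.linalg.solve(X.T @ X, X.T)
    fitted = H @ Y
    resid = Y - fitted
    return resid.sum(), fitted.mean(), Y.mean()

# For a design matrix whose first column is the column of ones, the sum of the
# residuals is zero up to rounding error and the two means agree:
# residual_checks(X, Y)
```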

3.3 Analysis of Variance

We begin this section by proving the basic Analysis of Variance identity.


Theorem 3.3. The total sum of squares splits into the regression sum of squares
and the residual sum of squares, that is

SST = SSR + SSE .

Proof.
$$
\begin{aligned}
SS_T &= \sum (Y_i - \bar{Y})^2 \\
&= \sum Y_i^2 - n\bar{Y}^2 \\
&= Y^T Y - n\bar{Y}^2.
\end{aligned}
$$

$$
\begin{aligned}
SS_R &= \sum (\hat{Y}_i - \bar{Y})^2 \\
&= \sum \hat{Y}_i^2 - 2\bar{Y} \underbrace{\textstyle\sum \hat{Y}_i}_{=n\bar{Y}} + n\bar{Y}^2 \\
&= \sum \hat{Y}_i^2 - n\bar{Y}^2 \\
&= \hat{Y}^T \hat{Y} - n\bar{Y}^2 \\
&= \hat{\beta}^T X^T X \hat{\beta} - n\bar{Y}^2 \\
&= Y^T X (X^T X)^{-1} \underbrace{X^T X (X^T X)^{-1}}_{=I} X^T Y - n\bar{Y}^2 \\
&= Y^T H Y - n\bar{Y}^2.
\end{aligned}
$$

We have seen (Lemma 3.3) that
$$
SS_E = Y^T (I - H) Y
$$
and so
$$
SS_R + SS_E = Y^T H Y - n\bar{Y}^2 + Y^T (I - H) Y = Y^T Y - n\bar{Y}^2 = SS_T. \qquad \square
$$
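The identity can also be verified numerically. A Python/NumPy sketch (illustrative only, assuming a full-rank design matrix X whose first column is ones and a response vector Y):

```python
import numpy as np

def anova_sums(X, Y):
    """Return (SST, SSR, SSE) for the least squares fit of Y on X."""
    n = len(Y)
    H = X @ np.linalg.solve(X.T @ X, X.T)
    sst = Y @ Y - n * Y.mean() ** 2
    ssr = Y @ H @ Y - n * Y.mean() ** 2
    sse = Y @ (np.eye(n) - H) @ Y
    return sst, ssr, sse

# sst, ssr, sse = anova_sums(X, Y)
# np.isclose(sst, ssr + sse)    # the identity SST = SSR + SSE
```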

F-test for the Overall Significance of Regression

Suppose we wish to test the hypothesis

H0 : β1 = β2 = . . . = βp−1 = 0,

i.e. all coefficients except β0 are zero, versus

H1 : ¬H0 ,

which means that at least one of the coefficients is non-zero. Under H0 , the model
reduces to the null model
Y = 1β0 + ε,

where 1 is a vector of ones.

In testing H0 we are asking if there is sufficient evidence to reject the null model.

The Analysis of variance table is given by

Source               d.f.    SS                MS             VR
Overall regression   p − 1   Y^T H Y − n Ȳ²    SS_R/(p − 1)   MS_R/MS_E
Residual             n − p   Y^T (I − H) Y     SS_E/(n − p)
Total                n − 1   Y^T Y − n Ȳ²

As in the SLM we have n − 1 total degrees of freedom. Fitting a linear model with
p parameters (β0, β1, . . . , βp−1) leaves n − p residual d.f. Then the regression d.f.
are n − 1 − (n − p) = p − 1.

It can be shown that E(SS_E) = (n − p)σ², that is, MS_E is an unbiased estimator
of σ². Also,
$$
\frac{SS_E}{\sigma^2} \sim \chi^2_{n-p}
$$
and if β1 = . . . = βp−1 = 0, then
$$
\frac{SS_R}{\sigma^2} \sim \chi^2_{p-1}.
$$
The two statistics are independent, hence, under H0,
$$
\frac{MS_R}{MS_E} \sim F_{p-1,\,n-p}.
$$
This is the test statistic for the null hypothesis

H0 : β1 = β2 = . . . = βp−1 = 0,

versus
H1 : ¬H0 .
We reject H0 at the 100α% level of significance if

Fobs > Fα;p−1,n−p ,

where Fα;p−1,n−p is such that P (F > Fα;p−1,n−p ) = α.
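A sketch of the overall F-test in Python (NumPy and SciPy; illustrative only, not part of the original notes):

```python
import numpy as np
from scipy import stats

def overall_f_test(X, Y):
    """F statistic and p-value for H0: beta_1 = ... = beta_{p-1} = 0."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)
    ssr = Y @ H @ Y - n * Y.mean() ** 2
    sse = Y @ (np.eye(n) - H) @ Y
    f_obs = (ssr / (p - 1)) / (sse / (n - p))
    p_value = stats.f.sf(f_obs, p - 1, n - p)     # P(F_{p-1, n-p} > f_obs)
    return f_obs, p_value

# Reject H0 at level alpha when f_obs > stats.f.ppf(1 - alpha, p - 1, n - p),
# or equivalently when p_value < alpha.
```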
