
Decomposing Variance

Kerby Shedden

Department of Statistics, University of Michigan

October 9, 2019

1 / 35
Law of total variation

For any regression model involving a response Y ∈ R and a covariate
vector X ∈ R^p, we can decompose the marginal variance of Y as follows:

var(Y) = var_X E(Y|X) + E_X var(Y|X).

- If the population is homoscedastic, var(Y|X) does not depend on X, so
  we can simply write var(Y|X) = σ^2, and we get
  var(Y) = var_X E(Y|X) + σ^2.
- If the population is heteroscedastic, var(Y|X) is a function σ^2(X)
  with expected value σ^2 = E_X σ^2(X), and again we get
  var(Y) = var_X E(Y|X) + σ^2.

If we write Y = f(X) + ε with E(ε|X) = 0, then E(Y|X) = f(X), and
var_X E(Y|X) summarizes the variation of f(X) over the marginal
distribution of X.
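
The decomposition is easy to confirm numerically. The following sketch (not part of the original slides) simulates a hypothetical homoscedastic model Y = X + ε and checks that var(Y) matches var_X E(Y|X) + E_X var(Y|X):

```python
import numpy as np

# Hypothetical model for illustration: X ~ N(0, 4), Y = X + eps, eps ~ N(0, 1).
# Then var_X E(Y|X) = var(X) = 4 and E_X var(Y|X) = 1, so var(Y) should be 5.
rng = np.random.default_rng(0)
n = 1_000_000
x = rng.normal(0.0, 2.0, size=n)
y = x + rng.normal(0.0, 1.0, size=n)

print(np.var(y))        # close to 5
print(np.var(x) + 1.0)  # var_X E(Y|X) + E_X var(Y|X)
```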

2 / 35
Law of total variation
[Figure omitted; horizontal axis: X; the conditional mean curve E(Y|X) is labeled in the plot.]
Orange curves: conditional distributions of Y given X
Purple curve: marginal distribution of Y
Black dots: conditional means of Y given X

3 / 35
Pearson correlation

The population Pearson correlation coefficient of two jointly distributed
scalar-valued random variables X and Y is

ρ_XY ≡ cov(X, Y) / (σ_X σ_Y).

Given data y = (y_1, ..., y_n)' and x = (x_1, ..., x_n)', the Pearson
correlation coefficient is estimated by

ρ̂_xy = ĉov(x, y) / (σ̂_x σ̂_y)
     = Σ_i (x_i − x̄)(y_i − ȳ) / sqrt(Σ_i (x_i − x̄)^2 · Σ_i (y_i − ȳ)^2)
     = (x − x̄)'(y − ȳ) / (‖x − x̄‖ · ‖y − ȳ‖).

When we write y − ȳ here, this means y − ȳ·1, where 1 is a vector of
1's, and ȳ is a scalar.
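
A minimal numpy sketch (not from the slides) of this estimator, written with the centered-vector formula above and checked against np.corrcoef:

```python
import numpy as np

def pearson_r(x, y):
    """Sample correlation via the centered-vector formula."""
    xc = x - x.mean()
    yc = y - y.mean()
    return (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)
print(pearson_r(x, y), np.corrcoef(x, y)[0, 1])  # the two values should agree
```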

4 / 35
Pearson correlation

By the Cauchy-Schwarz inequality,

−1 ≤ ρ_XY ≤ 1 and −1 ≤ ρ̂_xy ≤ 1.

The sample correlation coefficient is slightly biased, but the bias is so
small that it is usually ignored.

5 / 35
Pearson correlation and simple linear regression slopes

For the simple linear regression model

Y = α + βX + ε,

if we view X as a random variable that is uncorrelated with ε, then

cov(X, Y) = β σ_X^2

and the correlation is

ρ_XY ≡ cor(X, Y) = β / sqrt(β^2 + σ^2/σ_X^2).

The sample correlation coefficient for data x = (x_1, ..., x_n) and
y = (y_1, ..., y_n) is related to the least squares slope estimate:

β̂ = ĉov(x, y) / σ̂_x^2 = ρ̂_xy · σ̂_y / σ̂_x.
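
A quick numerical check of the slope/correlation relationship (illustrative only; the simulated model and the use of np.polyfit for the least squares slope are choices made for this sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 1.5 + 2.0 * x + rng.normal(size=500)

slope = np.polyfit(x, y, 1)[0]          # least squares slope
r = np.corrcoef(x, y)[0, 1]             # sample correlation
print(slope, r * y.std() / x.std())     # the two values should match
```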

6 / 35
Orthogonality between fitted values and residuals
Recall that the fitted values are

ŷ = Xβ̂ = Py,

where y ∈ R^n is the vector of observed responses, and P ∈ R^{n×n} is the
projection matrix onto col(X).

The residuals are

r = y − ŷ = (I − P)y ∈ R^n.

Since P(I − P) = 0_{n×n}, it follows that ŷ'r = 0.

Since r̄ = 0 (when the model contains an intercept), it is equivalent to
state that the sample correlation between r and ŷ is zero, i.e.

ĉor(r, ŷ) = 0.
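
These orthogonality facts are easy to verify numerically. A sketch with simulated data (not from the slides); the projection matrix is formed explicitly with a pseudo-inverse purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

P = X @ np.linalg.pinv(X)            # projection onto col(X)
yhat = P @ y
r = y - yhat

print(yhat @ r)                      # ~ 0: fitted values orthogonal to residuals
print(r.mean())                      # ~ 0: residuals sum to zero (intercept included)
print(np.corrcoef(r, yhat)[0, 1])    # ~ 0: sample correlation is zero
```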

7 / 35
Coefficient of determination
A descriptive summary of the explanatory power of x for y is given by the
coefficient of determination, also known as the proportion of explained
variance, or multiple R^2. This is the quantity

R^2 ≡ 1 − ‖y − ŷ‖^2 / ‖y − ȳ‖^2 = ‖ŷ − ȳ‖^2 / ‖y − ȳ‖^2 = v̂ar(ŷ) / v̂ar(y).

The equivalence between the two expressions follows from the identity

‖y − ȳ‖^2 = ‖y − ŷ + ŷ − ȳ‖^2
          = ‖y − ŷ‖^2 + ‖ŷ − ȳ‖^2 + 2(y − ŷ)'(ŷ − ȳ)
          = ‖y − ŷ‖^2 + ‖ŷ − ȳ‖^2,

where the cross term vanishes because the residual y − ŷ is orthogonal to
both ŷ and the vector of 1's.

It should be clear that R^2 = 0 iff ŷ = ȳ and R^2 = 1 iff ŷ = y.
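
A sketch (simulated data, not from the slides) confirming that the three expressions for R^2 coincide when the model contains an intercept:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.8, -0.5]) + rng.normal(size=n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
yhat = X @ beta

r2_resid = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
r2_fit = np.sum((yhat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
r2_var = np.var(yhat) / np.var(y)
print(r2_resid, r2_fit, r2_var)  # all three agree
```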

8 / 35
Coefficient of determination
The coefficient of determination is equal to ĉor(ŷ, y)^2.

To see this, note that

ĉor(ŷ, y) = (ŷ − ȳ)'(y − ȳ) / (‖ŷ − ȳ‖ · ‖y − ȳ‖)
          = (ŷ − ȳ)'(y − ŷ + ŷ − ȳ) / (‖ŷ − ȳ‖ · ‖y − ȳ‖)
          = [(ŷ − ȳ)'(y − ŷ) + (ŷ − ȳ)'(ŷ − ȳ)] / (‖ŷ − ȳ‖ · ‖y − ȳ‖)
          = ‖ŷ − ȳ‖ / ‖y − ȳ‖.

9 / 35
Coefficient of determination in simple linear regression
In general,

R^2 = ĉor(y, ŷ)^2 = ĉov(y, ŷ)^2 / (v̂ar(y) · v̂ar(ŷ)).

In the case of simple linear regression,

ĉov(y, ŷ) = ĉov(y, α̂ + β̂x) = β̂ ĉov(y, x),

and

v̂ar(ŷ) = v̂ar(α̂ + β̂x) = β̂^2 v̂ar(x).

Thus for simple linear regression, R^2 = ĉor(y, x)^2 = ĉor(y, ŷ)^2.
10 / 35
Relationship to the F statistic

The F-statistic for the null hypothesis

β_1 = ... = β_p = 0

is

(‖ŷ − ȳ‖^2 / ‖y − ŷ‖^2) · (n − p − 1)/p = (R^2 / (1 − R^2)) · (n − p − 1)/p,

which is an increasing function of R^2.
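
A sketch (simulated data, not from the slides) showing that the two forms of the F statistic agree; the design and coefficients are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 150, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.5, 0.0, -0.3, 0.2]) + rng.normal(size=n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
yhat = X @ beta
r2 = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

f_from_ss = (np.sum((yhat - y.mean()) ** 2) / np.sum((y - yhat) ** 2)) * (n - p - 1) / p
f_from_r2 = (r2 / (1 - r2)) * (n - p - 1) / p
print(f_from_ss, f_from_r2)  # identical up to rounding
```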

11 / 35
Adjusted R^2

The sample R^2 is an estimate of the population R^2:

1 − E_X var(Y|X) / var(Y).

Since it is a ratio, the plug-in estimate R^2 is biased, although the bias is
not large unless the sample size is small or the number of covariates is
large. The adjusted R^2 is an approximately unbiased estimate of the
population R^2:

1 − (1 − R^2) · (n − 1)/(n − p − 1).

The adjusted R^2 is always less than the unadjusted R^2. The adjusted R^2
is always less than or equal to one, but can be negative.
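
The adjustment is a one-line formula; the inputs below are arbitrary illustrative values:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 from the unadjusted R^2, sample size n, and p covariates
    (not counting the intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r2(0.30, n=50, p=10))  # noticeably smaller than 0.30
print(adjusted_r2(0.05, n=20, p=10))  # can even be negative
```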

12 / 35
The unique variation in one covariate

How much “information” about y is present in a covariate x_k? This
question is not straightforward when the covariates are non-orthogonal,
since several covariates may contain overlapping information about y.

Let x_k^⊥ ∈ R^n be the residual of the kth covariate, x_k ∈ R^n, after
regressing it against all other covariates (including the intercept). If P_{−k}
is the projection onto span({x_j, j ≠ k}), then

x_k^⊥ = (I − P_{−k}) x_k.

We could use v̂ar(x_k^⊥) / v̂ar(x_k) to assess how much of the variation in x_k
is “unique” in that it is not also captured by other predictors.

But this measure doesn't involve y, so it can't tell us whether the unique
variation in x_k is useful in the regression analysis.
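
A sketch (simulated, not from the slides) of residualizing one covariate against the others to obtain x_k^⊥ and the resulting unique-variation ratio:

```python
import numpy as np

rng = np.random.default_rng(6)
n, rho = 500, 0.9
x1 = rng.normal(size=n)
x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)  # correlated with x1

# Residualize x1 against the other covariates (here just the intercept and x2).
Z = np.column_stack([np.ones(n), x2])
x1_perp = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]

print(np.var(x1_perp) / np.var(x1))  # "unique" variation in x1, roughly 1 - rho^2
```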

13 / 35
The unique regression information in one covariate

To learn how x_k contributes “uniquely” to the regression, we can consider
how introducing x_k to a working regression model affects the R^2.

Let ŷ_{−k} = P_{−k} y be the fitted values in the model omitting covariate k.
Let R^2 denote the multiple R^2 for the full model, and let R_{−k}^2 be the
multiple R^2 for the regression omitting covariate x_k. The value of

R^2 − R_{−k}^2

quantifies how much information about y is contained in x_k but not
captured by the other covariates. This is called the semi-partial R^2.

14 / 35
Identity involving norms of fitted values and residuals

Before we continue, we will need a simple identity that is often useful.

In general, if a and b are orthogonal, then ‖a + b‖^2 = ‖a‖^2 + ‖b‖^2.

If a and b − a are orthogonal, then

‖b‖^2 = ‖b − a + a‖^2 = ‖b − a‖^2 + ‖a‖^2.

Thus in this setting we have ‖b‖^2 − ‖a‖^2 = ‖b − a‖^2.

Applying this fact to regression, we know that the fitted values and
residuals are orthogonal. Thus for the regression omitting variable k, ŷ_{−k}
and y − ŷ_{−k} are orthogonal, so ‖y − ŷ_{−k}‖^2 = ‖y‖^2 − ‖ŷ_{−k}‖^2.

By the same argument, ‖y − ŷ‖^2 = ‖y‖^2 − ‖ŷ‖^2.

15 / 35
Improvement in R^2 due to one covariate

Now we can obtain a simple, direct expression for the semi-partial R^2.

Since x_k^⊥ is orthogonal to the other covariates,

ŷ = ŷ_{−k} + (⟨y, x_k^⊥⟩ / ⟨x_k^⊥, x_k^⊥⟩) · x_k^⊥,

and

‖ŷ‖^2 = ‖ŷ_{−k}‖^2 + ⟨y, x_k^⊥⟩^2 / ‖x_k^⊥‖^2.

16 / 35
Improvement in R^2 due to one covariate

Thus we have

R^2 = 1 − ‖y − ŷ‖^2 / ‖y − ȳ‖^2
    = 1 − (‖y‖^2 − ‖ŷ‖^2) / ‖y − ȳ‖^2
    = 1 − (‖y‖^2 − ‖ŷ_{−k}‖^2 − ⟨y, x_k^⊥⟩^2/‖x_k^⊥‖^2) / ‖y − ȳ‖^2
    = 1 − ‖y − ŷ_{−k}‖^2/‖y − ȳ‖^2 + (⟨y, x_k^⊥⟩^2/‖x_k^⊥‖^2) / ‖y − ȳ‖^2
    = R_{−k}^2 + (⟨y, x_k^⊥⟩^2/‖x_k^⊥‖^2) / ‖y − ȳ‖^2.

17 / 35
Semi-partial R^2

Thus the semi-partial R^2 is

R^2 − R_{−k}^2 = (⟨y, x_k^⊥⟩^2/‖x_k^⊥‖^2) / ‖y − ȳ‖^2 = ⟨y, x_k^⊥/‖x_k^⊥‖⟩^2 / ‖y − ȳ‖^2.

Since x_k^⊥/‖x_k^⊥‖ is centered and has length 1, it follows that

R^2 − R_{−k}^2 = ĉor(y, x_k^⊥)^2.

Thus the semi-partial R^2 for covariate k has two interpretations:

- It is the improvement in R^2 resulting from including covariate k in a
  working regression model that already contains the other covariates.
- It is the R^2 for a simple linear regression of y on x_k^⊥ = (I − P_{−k})x_k.
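
A sketch (simulated data, not from the slides) verifying both interpretations at once:

```python
import numpy as np

def r2(X, y):
    """Multiple R^2 for regressing y on the columns of X (X includes the intercept)."""
    yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(7)
n = 400
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)
y = 1 + x1 + 0.5 * x2 + rng.normal(size=n)

X_full = np.column_stack([np.ones(n), x1, x2])
X_m1 = np.column_stack([np.ones(n), x2])           # model omitting x1

# Interpretation 1: improvement in R^2 when x1 is added.
semi_partial = r2(X_full, y) - r2(X_m1, y)

# Interpretation 2: squared correlation between y and the residualized x1.
x1_perp = x1 - X_m1 @ np.linalg.lstsq(X_m1, x1, rcond=None)[0]
print(semi_partial, np.corrcoef(y, x1_perp)[0, 1] ** 2)  # should agree
```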

18 / 35
Partial R^2

The partial R^2 is

(R^2 − R_{−k}^2) / (1 − R_{−k}^2) = ⟨y, x_k^⊥⟩^2 / (‖x_k^⊥‖^2 · ‖y − ŷ_{−k}‖^2).

The partial R^2 for covariate k is the fraction of the maximum possible
improvement in R^2 that is contributed by covariate k.

Let ŷ_{−k} be the fitted values for regressing y on all covariates except x_k.
Since ŷ_{−k}'x_k^⊥ = 0,

⟨y, x_k^⊥⟩^2 / (‖y − ŷ_{−k}‖^2 · ‖x_k^⊥‖^2) = ⟨y − ŷ_{−k}, x_k^⊥⟩^2 / (‖y − ŷ_{−k}‖^2 · ‖x_k^⊥‖^2).

The expression on the right is the usual R^2 that would be obtained when
regressing y − ŷ_{−k} on x_k^⊥. Thus the partial R^2 is the same as the usual
R^2 for (I − P_{−k})y regressed on (I − P_{−k})x_k.
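
A sketch (simulated data, not from the slides) checking both characterizations of the partial R^2:

```python
import numpy as np

def fit(X, y):
    return X @ np.linalg.lstsq(X, y, rcond=None)[0]

def r2(X, y):
    yhat = fit(X, y)
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(8)
n = 400
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)
y = 1 + x1 + 0.5 * x2 + rng.normal(size=n)

X_full = np.column_stack([np.ones(n), x1, x2])
X_m1 = np.column_stack([np.ones(n), x2])          # omits x1

# Definition: (R^2 - R_{-k}^2) / (1 - R_{-k}^2).
partial_a = (r2(X_full, y) - r2(X_m1, y)) / (1 - r2(X_m1, y))

# Equivalent: R^2 of the residualized response on the residualized covariate.
y_res = y - fit(X_m1, y)
x1_res = x1 - fit(X_m1, x1)
partial_b = r2(np.column_stack([np.ones(n), x1_res]), y_res)
print(partial_a, partial_b)  # should agree
```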

19 / 35
Decomposition of projection matrices
Suppose P ∈ R^{n×n} is a rank-d projection matrix, and U is an n × d
matrix with orthonormal columns that span col(P). If we partition U by
columns as

U = [ U_1  U_2  ···  U_d ],

then P = UU', so we can write

P = Σ_{j=1}^d U_j U_j'.

Note that this representation is not unique, since there are different
orthonormal bases for col(P).

Each summand U_j U_j' ∈ R^{n×n} is a rank-1 projection matrix onto ⟨U_j⟩,
the span of U_j.

20 / 35
Decomposition of R^2

Question: In a multiple regression model, how much of the variance in y
is explained by a particular covariate?

Orthogonal case: If the design matrix X is orthogonal (X'X = I), the
projection P onto col(X) can be decomposed as

P = Σ_{j=0}^p P_j = 11'/n + Σ_{j=1}^p x_j x_j',

where x_j is the jth column of the design matrix (assuming here that the
first column of X is an intercept).

21 / 35
Decomposition of R^2 (orthogonal case)

The n × n rank-1 matrix P_j = x_j x_j' is the projection onto span(x_j) (and
P_0 is the projection onto the span of the vector of 1's). Furthermore, by
orthogonality, P_j P_k = 0 unless j = k.

Since

ŷ − ȳ = Σ_{j=1}^p P_j y,

by orthogonality

‖ŷ − ȳ‖^2 = Σ_{j=1}^p ‖P_j y‖^2.

Here we are using the fact that if U_1, ..., U_m are orthogonal, then

‖U_1 + ··· + U_m‖^2 = ‖U_1‖^2 + ··· + ‖U_m‖^2.

22 / 35
Decomposition of R^2 (orthogonal case)

The R^2 for simple linear regression of y on x_j is

R_j^2 ≡ ‖ŷ_j − ȳ‖^2/‖y − ȳ‖^2 = ‖P_j y‖^2/‖y − ȳ‖^2,

where ŷ_j denotes the fitted values from that simple regression. So we see
that for orthogonal design matrices,

R^2 = Σ_{j=1}^p R_j^2.

That is, the overall coefficient of determination is the sum of the univariate
coefficients of determination for all the explanatory variables.
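
A sketch confirming the additivity for an orthogonal design; the design is manufactured here with a QR factorization so that the columns are orthonormal and orthogonal to the intercept (an illustrative construction, not from the slides):

```python
import numpy as np

def r2(X, y):
    yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(9)
n, p = 300, 4

# Orthonormal columns, each orthogonal to the vector of 1's.
Q, _ = np.linalg.qr(np.column_stack([np.ones(n), rng.normal(size=(n, p))]))
cols = Q[:, 1:]
X = np.column_stack([np.ones(n), cols])
y = X @ np.concatenate([[1.0], rng.normal(size=p)]) + rng.normal(size=n)

total = r2(X, y)
univariate_sum = sum(r2(np.column_stack([np.ones(n), cols[:, j]]), y) for j in range(p))
print(total, univariate_sum)  # should agree for an orthogonal design
```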

23 / 35
Decomposition of R^2

Non-orthogonal case: If X is not orthogonal, the overall R^2 will not be
the sum of single-covariate R^2's.

If we let R_j^2 be as above (the R^2 values for regressing Y on each X_j),
then there are two different situations: Σ_j R_j^2 > R^2, and Σ_j R_j^2 < R^2.

24 / 35
Decomposition of R^2

Case 1: Σ_j R_j^2 > R^2

It's not surprising that Σ_j R_j^2 can be bigger than R^2. For example,
suppose that the data generating model is

Y = X_1 + ε,

and X_2 is highly correlated with X_1 (but is not part of the data
generating model).

For the regression of Y on both X_1 and X_2, the multiple R^2 will be
1 − σ^2/var(Y) (since E(Y|X_1, X_2) = E(Y|X_1) = X_1).

The R^2 values for Y regressed on either X_1 or X_2 separately will also be
approximately 1 − σ^2/var(Y).

Thus R_1^2 + R_2^2 ≈ 2R^2.

25 / 35
Decomposition of R^2

Case 2: Σ_j R_j^2 < R^2

This is more surprising, and is sometimes called enhancement.

As an example, suppose the data generating model is

Y = Z + ε,

but we don't observe Z (for simplicity assume EZ = 0). Instead, we
observe a value X_1 that satisfies

X_1 = Z + X_2,

where X_2 has mean 0 and is independent of Z and ε.

Since X_2 is independent of Z and ε, it is also independent of Y, thus
R_2^2 ≈ 0 for large n.

26 / 35
Decomposition of R^2 (enhancement example)

The multiple R^2 of Y on X_1 and X_2 is approximately σ_Z^2/(σ_Z^2 + σ^2) for
large n, since the fitted values will converge to Ŷ = X_1 − X_2 = Z.

To calculate R_1^2, first note that for the regression of y on x_1, where
y, x_1 ∈ R^n are data vectors,

β̂ = ĉov(y, x_1) / v̂ar(x_1) → σ_Z^2 / (σ_Z^2 + σ_{X_2}^2)

and

α̂ → 0.

27 / 35
Decomposition of R^2 (enhancement example)

Therefore for large n,

n^{-1}‖y − ŷ‖^2 ≈ n^{-1}‖z + ε − σ_Z^2 x_1/(σ_Z^2 + σ_{X_2}^2)‖^2
               = n^{-1}‖σ_{X_2}^2 z/(σ_Z^2 + σ_{X_2}^2) + ε − σ_Z^2 x_2/(σ_Z^2 + σ_{X_2}^2)‖^2
               ≈ σ_{X_2}^4 σ_Z^2/(σ_Z^2 + σ_{X_2}^2)^2 + σ^2 + σ_Z^4 σ_{X_2}^2/(σ_Z^2 + σ_{X_2}^2)^2
               = σ_{X_2}^2 σ_Z^2/(σ_Z^2 + σ_{X_2}^2) + σ^2.

Therefore

R_1^2 = 1 − n^{-1}‖y − ŷ‖^2 / (n^{-1}‖y − ȳ‖^2)
      ≈ 1 − [σ_{X_2}^2 σ_Z^2/(σ_Z^2 + σ_{X_2}^2) + σ^2] / (σ_Z^2 + σ^2)
      = σ_Z^2 / [(σ_Z^2 + σ^2)(1 + σ_{X_2}^2/σ_Z^2)].
28 / 35
Decomposition of R^2 (enhancement example)

Thus

R_1^2 / R^2 ≈ 1 / (1 + σ_{X_2}^2/σ_Z^2),

which is strictly less than one if σ_{X_2}^2 > 0.

Since R_2^2 ≈ 0, it follows that R^2 > R_1^2 + R_2^2.

The reason for this is that while X_2 contains no directly useful
information about Y (hence R_2^2 ≈ 0), it can remove the “measurement
error” in X_1, making X_1 a better predictor of Z.
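
A simulation sketch of the enhancement example (the unit variances below are arbitrary illustrative choices); with σ_Z = σ_{X_2} = σ = 1 the limits above give R^2 ≈ 0.5, R_1^2 ≈ 0.25, and R_2^2 ≈ 0:

```python
import numpy as np

def r2(X, y):
    yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(10)
n = 200_000
z = rng.normal(size=n)          # unobserved Z, variance 1
x2 = rng.normal(size=n)         # independent "noise" covariate, variance 1
eps = rng.normal(size=n)
y = z + eps
x1 = z + x2                     # observed proxy for Z

one = np.ones(n)
R2_full = r2(np.column_stack([one, x1, x2]), y)   # ~ 0.5
R2_1 = r2(np.column_stack([one, x1]), y)          # ~ 0.25
R2_2 = r2(np.column_stack([one, x2]), y)          # ~ 0
print(R2_full, R2_1, R2_2)      # enhancement: R2_full > R2_1 + R2_2
```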

29 / 35
Decomposition of R^2 (enhancement example)

We can now calculate the limiting partial R^2 for adding X_2 to a model
that already contains X_1:

(R^2 − R_1^2) / (1 − R_1^2) ≈ σ_{X_2}^2 / (σ_{X_2}^2 + σ^2(1 + σ_{X_2}^2/σ_Z^2)).

30 / 35
Partial R^2 example 2

Suppose the design matrix satisfies

          [ 1  0  0 ]
X'X/n =   [ 0  1  r ]
          [ 0  r  1 ]

and the data generating model is

Y = X_1 + X_2 + ε

with var(ε) = σ^2.

31 / 35
Partial R^2 example 2

We will calculate the partial R^2 for X_1, using the fact that the partial R^2
is the regular R^2 for regressing

(I − P_{−1})y

on

(I − P_{−1})x_1,

where y, x_1, x_2 ∈ R^n are data vectors distributed like Y, X_1, and X_2, and
P_{−1} is the projection onto span({1, x_2}).

Since this is a simple linear regression, the partial R^2 can be expressed as

ĉor((I − P_{−1})y, (I − P_{−1})x_1)^2.

32 / 35
Partial R^2 example 2

We will calculate the partial R^2 in a setting where all conditional means
are linear. This would hold if the data are jointly Gaussian (but this is
not a necessary condition for conditional means to be linear).

The numerator of the partial R^2 is the square of

ĉov((I − P_{−1})y, (I − P_{−1})x_1) = y'(I − P_{−1})x_1/n
                                    ≈ (x_1 + x_2 + ε)'(x_1 − rx_2)/n
                                    → 1 − r^2.

33 / 35
Partial R^2 example 2

The denominator contains two factors. The first is

‖(I − P_{−1})x_1‖^2/n = x_1'(I − P_{−1})x_1/n
                      ≈ x_1'(x_1 − rx_2)/n
                      → 1 − r^2.

34 / 35
Partial R^2 example 2

The other factor in the denominator is y'(I − P_{−1})y/n:

y'(I − P_{−1})y/n = (x_1 + x_2)'(I − P_{−1})(x_1 + x_2)/n + ε'(I − P_{−1})ε/n
                    + 2ε'(I − P_{−1})(x_1 + x_2)/n
                  ≈ (x_1 + x_2)'(x_1 − rx_2)/n + σ^2
                  → 1 − r^2 + σ^2.

Thus we get that the partial R^2 is approximately equal to

(1 − r^2) / (1 − r^2 + σ^2).

If r = 1 then the result is zero (X_1 has no unique explanatory power),
and if r = 0, the result is 1/(1 + σ^2), indicating that after controlling for
X_2, around a fraction 1/(1 + σ^2) of the remaining variance is explained by
X_1 (the rest is due to ε).
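
A simulation sketch of this example (r and σ below are arbitrary illustrative values), comparing the sample partial R^2 for X_1 with the limiting value (1 − r^2)/(1 − r^2 + σ^2):

```python
import numpy as np

def fit(X, y):
    return X @ np.linalg.lstsq(X, y, rcond=None)[0]

def r2(X, y):
    yhat = fit(X, y)
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(11)
n, r, sigma = 200_000, 0.6, 1.0

x1 = rng.normal(size=n)
x2 = r * x1 + np.sqrt(1 - r**2) * rng.normal(size=n)   # cor(x1, x2) ~ r
y = x1 + x2 + sigma * rng.normal(size=n)

one = np.ones(n)
X_full = np.column_stack([one, x1, x2])
X_m1 = np.column_stack([one, x2])                      # omits x1

partial = (r2(X_full, y) - r2(X_m1, y)) / (1 - r2(X_m1, y))
print(partial, (1 - r**2) / (1 - r**2 + sigma**2))     # should be close
```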

35 / 35
Summary

Each of the three R^2 values can be expressed either in terms of variance
ratios, or as a squared correlation coefficient:

                      Multiple R^2              Semi-partial R^2     Partial R^2
Variance ratio (VR)   ‖Ŷ − Ȳ‖^2/‖Y − Ȳ‖^2      R^2 − R_{−k}^2       (R^2 − R_{−k}^2)/(1 − R_{−k}^2)
Correlation           ĉor(Ŷ, Y)^2               ĉor(Y, X_k^⊥)^2      ĉor((I − P_{−k})Y, X_k^⊥)^2

36 / 35
