Decomposing Variance

Kerby Shedden

October 9, 2019
Law of total variation

The variance of Y decomposes into within-group and between-group components:

$$\mathrm{var}(Y) = E_X\,\mathrm{var}(Y\mid X) + \mathrm{var}_X\,E(Y\mid X).$$
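As a quick numerical illustration (not part of the original slides), the following numpy sketch simulates a discrete X and checks that the overall variance of Y equals the within-group term plus the between-group term; the simulated model and variable names are illustrative only.

```python
# A minimal numpy sketch illustrating the law of total variation:
# var(Y) = E[ var(Y|X) ] + var( E(Y|X) ).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

x = rng.integers(1, 5, size=n)        # X takes values 1..4
y = 2.0 * x + rng.normal(size=n)      # Y | X ~ N(2X, 1)

levels, counts = np.unique(x, return_counts=True)
w = counts / n                                        # empirical P(X = k)
cond_means = np.array([y[x == k].mean() for k in levels])
cond_vars = np.array([y[x == k].var() for k in levels])

within = np.sum(w * cond_vars)                        # E[ var(Y|X) ]
between = np.sum(w * (cond_means - y.mean()) ** 2)    # var( E(Y|X) )

print(y.var(), within + between)                      # identical up to rounding
```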
Law of total variation

[Figure: conditional and marginal distributions of Y as X varies over 1–4.
Orange curves: conditional distributions of Y given X.
Purple curve: marginal distribution of Y.
Black dots: conditional means of Y given X, E(Y|X).]
Pearson correlation

The population Pearson correlation between two random variables X and Y is

$$\rho_{XY} \equiv \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}.$$

For data $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$, the sample correlation is

$$\hat\rho_{xy} = \frac{\widehat{\mathrm{cov}}(x, y)}{\hat\sigma_x \hat\sigma_y}
= \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum_i (x_i - \bar x)^2 \cdot \sum_i (y_i - \bar y)^2}}
= \frac{(x - \bar x)'(y - \bar y)}{\|x - \bar x\| \cdot \|y - \bar y\|}.$$
Pearson correlation

By the Cauchy–Schwarz inequality, the population and sample correlations both lie in $[-1, 1]$:

$$-1 \le \rho_{XY} \le 1, \qquad -1 \le \hat\rho_{xy} \le 1.$$
Pearson correlation and simple linear regression slopes

For the simple linear regression model

$$Y = \alpha + \beta X + \epsilon,$$

if we view X as a random variable that is uncorrelated with $\epsilon$, then

$$\mathrm{cov}(X, Y) = \beta \sigma_X^2,$$

and the correlation is

$$\rho_{XY} \equiv \mathrm{cor}(X, Y) = \frac{\beta}{\sqrt{\beta^2 + \sigma^2/\sigma_X^2}},$$

where $\sigma^2 = \mathrm{var}(\epsilon)$.

The sample correlation coefficient for data $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$ is related to the least squares slope estimate:

$$\hat\beta = \frac{\widehat{\mathrm{cov}}(x, y)}{\hat\sigma_x^2} = \hat\rho_{xy}\,\frac{\hat\sigma_y}{\hat\sigma_x}.$$
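A small numpy sketch (not from the slides) checking the slope/correlation identity on simulated data; the data generating values are arbitrary.

```python
# Check that beta_hat = rho_hat * sigma_y / sigma_x in simple linear regression.
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)
y = 0.5 + 2.0 * x + rng.normal(scale=1.5, size=n)

rho_hat = np.corrcoef(x, y)[0, 1]
beta_hat = np.polyfit(x, y, 1)[0]              # least squares slope

print(beta_hat, rho_hat * y.std() / x.std())   # the two values agree
```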
Orthogonality between fitted values and residuals

Recall that the fitted values are

$$\hat y = X\hat\beta = Py,$$

and the residuals are

$$r = y - \hat y = (I - P)y \in \mathbb{R}^n.$$

Since r is orthogonal to every column of X, including the intercept column of 1's, r has sample mean zero and is orthogonal to $\hat y$, so

$$\widehat{\mathrm{cor}}(r, \hat y) = 0.$$
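A minimal numpy sketch (not from the slides) that fits least squares by hand and confirms the residuals are uncorrelated with the fitted values; the design and coefficients are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # design with intercept
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ beta_hat
r = y - yhat

print(np.corrcoef(r, yhat)[0, 1])   # numerically zero
print(r.mean())                     # also numerically zero (intercept in the model)
```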
Coefficient of determination

A descriptive summary of the explanatory power of x for y is given by the coefficient of determination, also known as the proportion of explained variance, or multiple $R^2$. This is the quantity

$$R^2 \equiv 1 - \frac{\|y - \hat y\|^2}{\|y - \bar y\|^2} = \frac{\|\hat y - \bar y\|^2}{\|y - \bar y\|^2} = \frac{\widehat{\mathrm{var}}(\hat y)}{\widehat{\mathrm{var}}(y)},$$

where $\bar y$ denotes the vector whose entries all equal the sample mean of y.

The equivalence between the two expressions follows from the identity

$$\|y - \bar y\|^2 = \|y - \hat y + \hat y - \bar y\|^2
= \|y - \hat y\|^2 + \|\hat y - \bar y\|^2 + 2(y - \hat y)'(\hat y - \bar y)
= \|y - \hat y\|^2 + \|\hat y - \bar y\|^2,$$

since the residual $y - \hat y$ is orthogonal to both $\hat y$ and the vector of 1's, hence to $\hat y - \bar y$.
Coefficient of determination

The coefficient of determination is also equal to $\widehat{\mathrm{cor}}(\hat y, y)^2$, since

$$\begin{aligned}
\widehat{\mathrm{cor}}(\hat y, y) &= \frac{(\hat y - \bar y)'(y - \bar y)}{\|\hat y - \bar y\| \cdot \|y - \bar y\|} \\
&= \frac{(\hat y - \bar y)'(y - \hat y + \hat y - \bar y)}{\|\hat y - \bar y\| \cdot \|y - \bar y\|} \\
&= \frac{(\hat y - \bar y)'(y - \hat y) + (\hat y - \bar y)'(\hat y - \bar y)}{\|\hat y - \bar y\| \cdot \|y - \bar y\|} \\
&= \frac{\|\hat y - \bar y\|}{\|y - \bar y\|},
\end{aligned}$$

where $(\hat y - \bar y)'(y - \hat y) = 0$ by orthogonality of the residuals to the fitted values and to the intercept.
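A numpy sketch (not from the slides) computing $R^2$ three equivalent ways on simulated data: $1 - \mathrm{SSE}/\mathrm{SST}$, $\|\hat y - \bar y\|^2/\|y - \bar y\|^2$, and $\widehat{\mathrm{cor}}(y, \hat y)^2$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 1.5, -0.5]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ beta_hat
ybar = y.mean()

r2_a = 1 - np.sum((y - yhat) ** 2) / np.sum((y - ybar) ** 2)
r2_b = np.sum((yhat - ybar) ** 2) / np.sum((y - ybar) ** 2)
r2_c = np.corrcoef(y, yhat)[0, 1] ** 2

print(r2_a, r2_b, r2_c)   # all three agree
```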
Coefficient of determination in simple linear regression

In general,

$$R^2 = \widehat{\mathrm{cor}}(y, \hat y)^2 = \frac{\widehat{\mathrm{cov}}(y, \hat y)^2}{\widehat{\mathrm{var}}(y)\cdot\widehat{\mathrm{var}}(\hat y)}.$$

In simple linear regression,

$$\widehat{\mathrm{cov}}(y, \hat y) = \widehat{\mathrm{cov}}(y, \hat\alpha + \hat\beta x) = \hat\beta\, \widehat{\mathrm{cov}}(y, x),$$

and

$$\widehat{\mathrm{var}}(\hat y) = \widehat{\mathrm{var}}(\hat\alpha + \hat\beta x) = \hat\beta^2\, \widehat{\mathrm{var}}(x),$$

so

$$R^2 = \frac{\hat\beta^2\,\widehat{\mathrm{cov}}(y, x)^2}{\widehat{\mathrm{var}}(y)\cdot\hat\beta^2\,\widehat{\mathrm{var}}(x)} = \frac{\widehat{\mathrm{cov}}(x, y)^2}{\widehat{\mathrm{var}}(x)\,\widehat{\mathrm{var}}(y)} = \hat\rho_{xy}^2.$$
R² and the F statistic

The F statistic for testing the null hypothesis

$$\beta_1 = \cdots = \beta_p = 0$$

is

$$\frac{\|\hat y - \bar y\|^2}{\|y - \hat y\|^2} \cdot \frac{n - p - 1}{p} = \frac{R^2}{1 - R^2} \cdot \frac{n - p - 1}{p},$$

so for fixed n and p the F statistic is an increasing function of $R^2$.
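A numpy sketch (not from the slides) checking that the overall F statistic computed from sums of squares matches the $R^2$-based expression; the simulated design is illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([0.5, 1.0, 0.0, -2.0]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ beta_hat

sse = np.sum((y - yhat) ** 2)          # residual sum of squares
ssr = np.sum((yhat - y.mean()) ** 2)   # regression sum of squares
r2 = ssr / (ssr + sse)

f_direct = (ssr / p) / (sse / (n - p - 1))
f_from_r2 = (r2 / (1 - r2)) * (n - p - 1) / p

print(f_direct, f_from_r2)             # identical
```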
Adjusted R²

The population analogue of $R^2$ is the proportion of variance explained by the regression,

$$1 - \frac{E_X\,\mathrm{var}(Y \mid X)}{\mathrm{var}(Y)}.$$

The sample $R^2$ tends to overestimate this quantity, especially when p is large relative to n. The adjusted $R^2$ reduces this bias by using degree-of-freedom corrected variance estimates:

$$1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}.$$
The unique variation in one covariate

Let $P_{-k}$ denote the projection onto the span of the intercept and all covariates except $x_k$, and let

$$x_k^\perp = (I - P_{-k})\,x_k$$

be the part of $x_k$ that is orthogonal to the other covariates.

We could use $\widehat{\mathrm{var}}(x_k^\perp)/\widehat{\mathrm{var}}(x_k)$ to assess how much of the variation in $x_k$ is "unique" in that it is not also captured by the other predictors.

But this measure doesn't involve y, so it can't tell us whether the unique variation in $x_k$ is useful in the regression analysis.
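A numpy sketch (not from the slides) computing $x_k^\perp$ and the unique-variation ratio for a covariate that shares most of its variation with another; the helper `resid_on` and the simulated variables are illustrative.

```python
import numpy as np

def resid_on(A, v):
    """Residual of v after projecting onto the column span of A."""
    coef, *_ = np.linalg.lstsq(A, v, rcond=None)
    return v - A @ coef

rng = np.random.default_rng(5)
n = 500
z = rng.normal(size=n)
x1 = z + 0.3 * rng.normal(size=n)            # x1 and x2 share the factor z
x2 = z + 0.3 * rng.normal(size=n)

X_minus_k = np.column_stack([np.ones(n), x2])    # everything except x1
x1_perp = resid_on(X_minus_k, x1)

print(x1_perp.var() / x1.var())   # small: most of x1's variation is shared with x2
```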
The unique regression information in one covariate

Let $R^2_{-k}$ denote the multiple $R^2$ for the regression of y on all covariates except $x_k$. The difference

$$R^2 - R^2_{-k}$$

measures how much the explained variance improves when $x_k$ is added to a model that already contains the other covariates.
Identity involving norms of fitted values and residuals

Since the residual vector $y - \hat y$ is orthogonal to $\hat y$,

$$\|y\|^2 = \|\hat y\|^2 + \|y - \hat y\|^2,$$

so $\|y - \hat y\|^2 = \|y\|^2 - \|\hat y\|^2$. This identity is used below.
Improvement in R² due to one covariate

The fitted values from the full model can be written

$$\hat y = \hat y_{-k} + \frac{\langle y, x_k^\perp\rangle}{\langle x_k^\perp, x_k^\perp\rangle}\, x_k^\perp,$$

where $\hat y_{-k}$ is the vector of fitted values from the model omitting $x_k$. Since $\hat y_{-k} \perp x_k^\perp$, it follows that

$$\|\hat y\|^2 = \|\hat y_{-k}\|^2 + \frac{\langle y, x_k^\perp\rangle^2}{\|x_k^\perp\|^2}.$$
Improvement in R² due to one covariate

Thus we have

$$\begin{aligned}
R^2 &= 1 - \frac{\|y - \hat y\|^2}{\|y - \bar y\|^2} \\
&= 1 - \frac{\|y\|^2 - \|\hat y\|^2}{\|y - \bar y\|^2} \\
&= 1 - \frac{\|y\|^2 - \|\hat y_{-k}\|^2 - \langle y, x_k^\perp\rangle^2/\|x_k^\perp\|^2}{\|y - \bar y\|^2} \\
&= 1 - \frac{\|y - \hat y_{-k}\|^2}{\|y - \bar y\|^2} + \frac{\langle y, x_k^\perp\rangle^2/\|x_k^\perp\|^2}{\|y - \bar y\|^2} \\
&= R^2_{-k} + \frac{\langle y, x_k^\perp\rangle^2/\|x_k^\perp\|^2}{\|y - \bar y\|^2}.
\end{aligned}$$
Semi-partial R²

The quantity

$$R^2 - R^2_{-k} = \widehat{\mathrm{cor}}(y, x_k^\perp)^2$$

is called the semi-partial $R^2$ for covariate k: the proportion of the total variance of y that is uniquely explained by $x_k$.
Partial R²

The partial $R^2$ is

$$\frac{R^2 - R^2_{-k}}{1 - R^2_{-k}} = \frac{\langle y, x_k^\perp\rangle^2/\|x_k^\perp\|^2}{\|y - \hat y_{-k}\|^2}.$$

The expression on the right is the usual $R^2$ that would be obtained when regressing $y - \hat y_{-k}$ on $x_k^\perp$. Thus the partial $R^2$ is the same as the usual $R^2$ for $(I - P_{-k})y$ regressed on $(I - P_{-k})x_k$.
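A numpy sketch (not from the slides) checking the semi-partial and partial $R^2$ identities numerically for one covariate; the helpers `fit` and `r2` and the simulated model are illustrative.

```python
import numpy as np

def fit(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ b

def r2(X, y):
    yhat = fit(X, y)
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(6)
n = 400
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)            # correlated covariates
y = 1.0 + x1 + 0.5 * x2 + rng.normal(size=n)

X_full = np.column_stack([np.ones(n), x1, x2])
X_minus_k = np.column_stack([np.ones(n), x1])  # drop x_k = x2

r2_full, r2_mk = r2(X_full, y), r2(X_minus_k, y)
xk_perp = x2 - fit(X_minus_k, x2)              # (I - P_{-k}) x_k
y_resid = y - fit(X_minus_k, y)                # (I - P_{-k}) y

semi_partial = np.corrcoef(y, xk_perp)[0, 1] ** 2
partial = np.corrcoef(y_resid, xk_perp)[0, 1] ** 2

print(r2_full - r2_mk, semi_partial)                # equal
print((r2_full - r2_mk) / (1 - r2_mk), partial)     # equal
```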
Decomposition of projection matrices

Suppose $P \in \mathbb{R}^{n\times n}$ is a rank-d projection matrix, and U is an $n \times d$ orthogonal matrix whose columns span col(P). If we partition U by columns,

$$U = \begin{pmatrix} | & | & \cdots & | \\ U_1 & U_2 & \cdots & U_d \\ | & | & \cdots & | \end{pmatrix},$$

then

$$P = \sum_{j=1}^d U_j U_j'.$$

Note that this representation is not unique, since there are different orthogonal bases for col(P).

Each summand $U_j U_j' \in \mathbb{R}^{n\times n}$ is a rank-1 projection matrix onto $\langle U_j\rangle$, the span of $U_j$.
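A numpy sketch (not from the slides): build the projection onto col(X), take an orthonormal basis via QR, and confirm the rank-1 decomposition.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 8, 3
X = rng.normal(size=(n, d))

P = X @ np.linalg.solve(X.T @ X, X.T)      # projection onto col(X), rank d
U, _ = np.linalg.qr(X)                     # orthonormal basis for col(X) = col(P)

P_sum = sum(np.outer(U[:, j], U[:, j]) for j in range(d))
print(np.allclose(P, P_sum))               # True
```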
Decomposition of R²

Suppose the covariate columns of the design matrix are centered, mutually orthogonal, and scaled to have unit length. Then the projection onto col(X) can be written

$$P = \sum_{j=0}^p P_j = \frac{\mathbf{1}\mathbf{1}'}{n} + \sum_{j=1}^p x_j x_j',$$

where $x_j$ is the j-th column of the design matrix (assuming here that the first column of X is an intercept).
Decomposition of R² (orthogonal case)

The $n \times n$ rank-1 matrix

$$P_j = x_j x_j'$$

is the projection onto span($x_j$) (and $P_0$ is the projection onto the span of the vector of 1's). Furthermore, by orthogonality, $P_j P_k = 0$ unless $j = k$.

Since

$$\hat y - \bar y = \sum_{j=1}^p P_j y,$$

by orthogonality

$$\|\hat y - \bar y\|^2 = \sum_{j=1}^p \|P_j y\|^2.$$
Decomposition of R² (orthogonal case)

Dividing both sides by $\|y - \bar y\|^2$ gives

$$R^2 = \sum_{j=1}^p R_j^2,$$

where $R_j^2$ is the $R^2$ from the simple regression of y on $x_j$ alone. So when the covariates are mutually orthogonal, the multiple $R^2$ decomposes exactly into the single-covariate $R^2$ values.
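A numpy sketch (not from the slides): with mutually orthogonal, centered covariates, the multiple $R^2$ equals the sum of the single-covariate $R^2$ values; the orthogonalization via QR is one convenient way to construct such a design.

```python
import numpy as np

def r2(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    yhat = X @ b
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(8)
n, p = 200, 3
Z = rng.normal(size=(n, p))
Q, _ = np.linalg.qr(np.column_stack([np.ones(n), Z]))   # orthonormalize with intercept
Xcov = Q[:, 1:]                                          # centered, orthogonal covariates
y = Xcov @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

intercept = np.ones((n, 1))
R2_full = r2(np.column_stack([intercept, Xcov]), y)
R2_each = [r2(np.column_stack([intercept, Xcov[:, j]]), y) for j in range(p)]

print(R2_full, sum(R2_each))   # equal
```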
Decomposition of R²

When the covariates are not orthogonal, the single-covariate values $R_j^2$ need not sum to $R^2$: the sum $\sum_j R_j^2$ can be either larger or smaller than $R^2$, as the following two cases illustrate.
Decomposition of R²

Case 1: $\sum_j R_j^2 > R^2$

It's not surprising that $\sum_j R_j^2$ can be bigger than $R^2$. For example, suppose that

$$Y = X_1 + \epsilon$$

is the data generating model, and $X_2$ is highly correlated with $X_1$ (but is not part of the data generating model).

For the regression of Y on both $X_1$ and $X_2$, the multiple $R^2$ will be approximately $1 - \sigma^2/\mathrm{var}(Y)$ (since $E(Y \mid X_1, X_2) = E(Y \mid X_1) = X_1$).

The $R^2$ values for Y regressed on either $X_1$ or $X_2$ separately will also be approximately $1 - \sigma^2/\mathrm{var}(Y)$.

Thus $R_1^2 + R_2^2 \approx 2R^2$.
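A numpy sketch (not from the slides) of Case 1: two nearly identical covariates, only $X_1$ in the data generating model, and $R_1^2 + R_2^2$ comes out close to twice the multiple $R^2$.

```python
import numpy as np

def r2(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    yhat = X @ b
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(9)
n = 5000
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)      # X2 nearly identical to X1
y = x1 + rng.normal(size=n)              # Y = X1 + eps

ones = np.ones((n, 1))
R2_full = r2(np.column_stack([ones, x1, x2]), y)
R2_1 = r2(np.column_stack([ones, x1]), y)
R2_2 = r2(np.column_stack([ones, x2]), y)

print(R2_full, R2_1 + R2_2)              # the sum is roughly 2 * R2_full
```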
Decomposition of R²

Case 2: $\sum_j R_j^2 < R^2$

For an example where the sum falls short of $R^2$, suppose the data generating model is

$$Y = Z + \epsilon, \qquad X_1 = Z + X_2,$$

where $Z$, $X_2$, and $\epsilon$ are independent with mean zero and variances $\sigma_Z^2$, $\sigma_{X_2}^2$, and $\sigma^2$, and we observe $X_1$ and $X_2$ but not $Z$. Although Y is uncorrelated with $X_2$, the covariate $X_2$ is useful in the joint regression because $X_1 - X_2 = Z$ recovers the signal exactly (a situation sometimes called enhancement).
Decomposition of R² (enhancement example)

First consider regressing Y on $X_1$ alone. For large n, the least squares coefficients satisfy

$$\hat\beta = \frac{\widehat{\mathrm{cov}}(y, x_1)}{\widehat{\mathrm{var}}(x_1)} \to \frac{\sigma_Z^2}{\sigma_Z^2 + \sigma_{X_2}^2}$$

and

$$\hat\alpha \to 0.$$
Decomposition of R² (enhancement example)

Therefore, for large n,

$$\begin{aligned}
R_1^2 &= 1 - \frac{n^{-1}\|y - \hat y\|^2}{n^{-1}\|y - \bar y\|^2} \\
&\approx 1 - \frac{\sigma_{X_2}^2\sigma_Z^2/(\sigma_Z^2 + \sigma_{X_2}^2) + \sigma^2}{\sigma_Z^2 + \sigma^2} \\
&= \frac{\sigma_Z^2}{(\sigma_Z^2 + \sigma^2)(1 + \sigma_{X_2}^2/\sigma_Z^2)}.
\end{aligned}$$
Decomposition of R² (enhancement example)

Thus $R_1^2$ is strictly smaller than $\sigma_Z^2/(\sigma_Z^2 + \sigma^2)$, which is approximately the multiple $R^2$ for the regression of Y on $X_1$ and $X_2$ jointly, since $E(Y \mid X_1, X_2) = X_1 - X_2 = Z$. Moreover, $\mathrm{cov}(Y, X_2) = 0$, so $R_2^2 \approx 0$. Hence

$$R_1^2 + R_2^2 < R^2.$$
Decomposition of R² (enhancement example)

The partial $R^2$ for $X_2$ (given $X_1$) in this example works out to

$$\frac{\sigma_{X_2}^2}{\sigma_{X_2}^2 + \sigma^2(1 + \sigma_{X_2}^2/\sigma_Z^2)},$$

which can be large even though Y and $X_2$ are uncorrelated.
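A numpy sketch (not from the slides) of the enhancement example: with $Z$ unobserved, the sum $R_1^2 + R_2^2$ falls well short of the multiple $R^2$; the chosen variances are arbitrary.

```python
import numpy as np

def r2(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    yhat = X @ b
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(10)
n = 20000
z = rng.normal(size=n)                    # unobserved
x2 = rng.normal(size=n)
x1 = z + x2
y = z + rng.normal(scale=0.5, size=n)     # Y = Z + eps

ones = np.ones((n, 1))
R2_full = r2(np.column_stack([ones, x1, x2]), y)
R2_1 = r2(np.column_stack([ones, x1]), y)
R2_2 = r2(np.column_stack([ones, x2]), y)   # near zero: Y and X2 are uncorrelated

print(R2_1 + R2_2, R2_full)               # the sum is well below the multiple R^2
```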
Partial R² example 2

Suppose the design matrix X has an intercept as its first column, the two covariates are centered with unit variance, and

$$X'X/n = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & r \\ 0 & r & 1 \end{pmatrix}.$$

Suppose also that the data generating model is

$$Y = X_1 + X_2 + \epsilon,$$

with $\mathrm{var}(\epsilon) = \sigma^2$.
Partial R² example 2

We will calculate the partial $R^2$ for $X_1$, using the fact that the partial $R^2$ is the regular $R^2$ for regressing

$$(I - P_{-1})y$$

on

$$(I - P_{-1})x_1,$$

which in turn equals

$$\widehat{\mathrm{cor}}\big((I - P_{-1})y,\ (I - P_{-1})x_1\big)^2.$$
Partial R² example 2

For large n, $x_1'(I - P_{-1})x_1/n \approx 1 - r^2$, since $x_1'x_1/n = 1$, $x_1'x_2/n = r$, and $x_1'\mathbf{1}/n = 0$. Also, because $y = x_1 + x_2 + \epsilon$ and $(I - P_{-1})x_2 = 0$,

$$x_1'(I - P_{-1})y/n \approx x_1'(I - P_{-1})x_1/n = 1 - r^2.$$
Partial R² example 2

The other factor in the denominator is $y'(I - P_{-1})y/n$, which for large n is approximately $1 - r^2 + \sigma^2$. Combining these results, the partial $R^2$ for $X_1$ is approximately

$$\frac{1 - r^2}{1 - r^2 + \sigma^2}.$$
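A numpy sketch (not from the slides) checking the formula $(1 - r^2)/(1 - r^2 + \sigma^2)$ by simulating the correlated design; the construction of $x_2$ and the helper `resid_on` are illustrative.

```python
import numpy as np

def resid_on(A, v):
    b, *_ = np.linalg.lstsq(A, v, rcond=None)
    return v - A @ b

rng = np.random.default_rng(11)
n, r, sigma = 50000, 0.7, 1.0

# Standardized covariates with correlation r.
x1 = rng.normal(size=n)
x2 = r * x1 + np.sqrt(1 - r ** 2) * rng.normal(size=n)
y = x1 + x2 + rng.normal(scale=sigma, size=n)

X_minus_1 = np.column_stack([np.ones(n), x2])
partial = np.corrcoef(resid_on(X_minus_1, y), resid_on(X_minus_1, x1))[0, 1] ** 2

print(partial, (1 - r ** 2) / (1 - r ** 2 + sigma ** 2))   # close for large n
```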
Summary

The variance of Y decomposes into explained and unexplained components. The Pearson correlation, the multiple $R^2$, and the adjusted, semi-partial, and partial $R^2$ all quantify explained variation, and when the covariates are correlated the single-covariate $R_j^2$ values need not add up to the overall $R^2$.