Dimensionality Reduction

Machine Learning splits into Supervised Learning, Unsupervised Learning and Reinforcement Learning.
Supervised Learning includes Classification and Regression.
What are the examples of high-dimensional objects?

A taxi ride from the NYC data set (pickup, drop-off, fare of $3.50, ...): an object from 7D.
A LiDAR recording: an object from a few million D.
An MNIST digit (28px × 28px): an object from 784D.
An ImageNet cat (256px × 256px × 3 colour channels): an object from 196608D.
Human DNA (a sequence of A, C, G, T bases along the genes): an object from 3.6 billion D.

What is the problem with high-dimensional things?

Hard to visualise. Algorithms tend to get slow.

Nearest Neighbour Classifier: the nearest neighbour is found by calculating distances to all existing examples.

Euclidean distance in 2D: d = √((x₂ − x₁)² + (y₂ − y₁)²)
In 3D: d = √((x₂ − x₁)² + (y₂ − y₁)² + (z₂ − z₁)²)
In nD: d = √((x₂ − x₁)² + (y₂ − y₁)² + (z₂ − z₁)² + … + (n₂ − n₁)²)

Nearest Neighbour search is O(n²), but with the number of dimensions approaching the number of samples it becomes O(n³).
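To see why the brute-force search slows down, here is a minimal sketch (my own illustration, not from the slides) that times a nearest-neighbour query as the number of dimensions grows; the random data and dimension counts are made up.

```python
import time
import numpy as np

rng = np.random.default_rng(0)

def nearest_neighbour(X, query):
    # Euclidean distance from the query to every stored example:
    # d = sqrt((x2-x1)^2 + (y2-y1)^2 + ... + (n2-n1)^2) for each row of X
    distances = np.sqrt(((X - query) ** 2).sum(axis=1))
    return int(np.argmin(distances))

n_samples = 2000
for n_dims in (2, 200, 2000):
    X = rng.normal(size=(n_samples, n_dims))
    query = rng.normal(size=n_dims)
    start = time.perf_counter()
    nearest_neighbour(X, query)
    print(f"{n_dims:>5} dimensions: {time.perf_counter() - start:.4f} s per query")
```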
To summarise the problems with high-dimensional things: they are hard to visualise, algorithms tend to get slow, and methods trained on high-dimensional data suffer from the curse of dimensionality.

The Curse of Dimensionality
What is the curse of dimensionality?

Take points spread along a single axis X (values 1 to 5) and split them with a small decision tree:

X > 2?  False -> Blue (75%), 4 points
        True  -> X > 4?  False -> Blue (50%), 6 points
                         True  -> Red (75%), 4 points

The splits at X = 2 and X = 4 carve the axis into regions that are 75%/25%, 50%/50% and 25%/75% pure.
What is the curse of dimensionality?

Now apply the same splits (at 2 and 4) in 2D, along both the X and y axes. The same points fall into a grid of cells, and the per-cell class proportions become highly unbalanced (100%/0% in some cells, 50%/50% in others), while in some cells nothing is going on at all.

With this amount of data, on average 55.5% of the 2D cells will be either empty or singletons.
In 3D (splitting X, y and z) it is worse: on average 92.5% of cells will be either empty or singletons.

In order to keep a high-dimensional space reasonably covered you need a lot more data.
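A small simulation of the same effect (my own illustration; the slide's 55.5% and 92.5% come from counting cells in its specific example, so exact numbers will differ): drop 10 random points into a 3-bin-per-axis grid and count the cells that end up empty or with a single point.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, bins, trials = 10, 3, 2000   # 3 bins per axis, as with splits at 2 and 4

for n_dims in (1, 2, 3):
    fractions = []
    for _ in range(trials):
        # which cell does each random point land in?
        cells = rng.integers(0, bins, size=(n_points, n_dims))
        _, counts = np.unique(cells, axis=0, return_counts=True)
        n_cells = bins ** n_dims
        empty = n_cells - len(counts)
        singletons = int((counts == 1).sum())
        fractions.append((empty + singletons) / n_cells)
    print(f"{n_dims}D: on average {np.mean(fractions):.0%} of cells are empty or singletons")
```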
What is the curse of dimensionality? (part II)

Distances become similar in high-dimensional space.
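A quick way to see this (an illustrative sketch, not from the slides): sample random points and compare the nearest and farthest pairwise distances as the dimensionality grows; the ratio creeps towards 1.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for n_dims in (2, 10, 100, 1000):
    X = rng.random((200, n_dims))          # 200 random points
    d = pdist(X)                           # all pairwise Euclidean distances
    print(f"{n_dims:>4}D   nearest/farthest distance ratio: {d.min() / d.max():.2f}")
```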


What is the problem with high-dimensional things?

Hard to visualise. Algorithms tend to get slow.
You need more data, and objects become closer (their distances become more similar) in high-dimensional space.

A lot of high-dimensional things (images, gene sequences, DNA, ...)
VS
High-dimensional things are hard to work with: many algorithms cost O(n²), O(n³), O(2ⁿ) or even O(n!).
Is there a way to break the curse?
Feature extraction vs feature elimination

Feature elimination: keep only a few of the original features and remove all the rest.


Feature extraction vs feature elimination

Feature extraction example: plot Circumference against Diameter. The two features are almost perfectly related, so a single combined feature carries nearly all of the information.

Principal Component Analysis

The same plot with new axes: Principal Component #1 (PC1) runs along the main spread of the data, and Principal Component #2 (PC2) is perpendicular to it.


1-Dimensional data: values along a single axis X (1 to 5).

2-Dimensional data:

x: 1  2  3  4  5
y: 2  4  5  4  5

3-Dimensional data:

x: 1  2    3  4  5
y: 2  4    5  4  5
z: 2  0.5  1  1  0.5
200-Dimensional data? Are all of these dimensions equally useful?
2-D example revisited

Main variation is from left to right; not so much from top to bottom.
We can keep only one dimension: projecting onto X, the projected data does not seem to lose much information.
2-D example revisited

Now the data seem to be spread more equally along the X and y axes.
Still, the data is mostly spread along one diagonal line, and only a little bit along the perpendicular line.

How about we make new axes from these lines?

These new axes are called principal components.
PC #1 is a new vector which spans along most of the variation in the data.
PC #2 is another new vector which spans along the direction of the second most variation.

Principal components are not additional axes/dimensions.
They are the old dimensions rearranged: the first axis now spans along the most variation, the second along the second most variation, and so on.
Principal components are not additional axes/dimensions.

How many PCs will there be in 3D space? As many as there were original dimensions, hence 3 PCs.
How many PCs will be formed in 200D space? No exceptions, 200 PCs.

But what is the benefit of having PCs?

Projecting the data onto PC #1 alone takes us from 2D to 1D without losing much information.
In general, the first few PCs are often enough to capture the important information.
Computational example of PCA

x: 1  2  3  4  5      x̄ = 3
y: 2  4  5  4  5      ȳ = 4

Subtract the means from each coordinate:

x − x̄: -2  -1  0  1  2
y − ȳ: -2   0  1  0  1
The centred coordinates form the matrix Z (5 rows, one per point; 2 columns, one per dimension).

Transpose the matrix of coordinates:

Z:               Z⊤:
-2  -2           -2  -1  0  1  2
-1   0           -2   0  1  0  1
 0   1
 1   0
 2   1

What are the dimensions of the transposed matrix? Z is 5 × 2, so Z⊤ is 2 × 5.

S = (Z⊤ × Z) / (n − 1)

We divide by n − 1 because we compute the empirical covariance matrix (i.e. from data).

Matrix multiplication beautifully animated http://matrixmultiplication.xyz/


Z⊤ × Z:

| -2  -1  0  1  2 |   ×   | -2  -2 |   =   | 10   6 |
| -2   0  1  0  1 |       | -1   0 |       |  6   6 |
                          |  0   1 |
                          |  1   0 |
                          |  2   1 |

Divide by n − 1 = 4:

S = | 2.5  1.5 |
    | 1.5  1.5 |

This is the covariance matrix.
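The same computation in a few lines of NumPy (a sketch reproducing the numbers above):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

Z = np.column_stack([x - x.mean(), y - y.mean()])   # centred data, 5 x 2
S = Z.T @ Z / (len(x) - 1)                          # empirical covariance matrix
print(S)              # [[2.5 1.5]
                      #  [1.5 1.5]]
print(np.cov(x, y))   # same result with NumPy's built-in covariance
```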
How to interpret values in covariance matrix?

S = | 2.5  1.5 |
    | 1.5  1.5 |

Collect all the values projected onto the X axis: [-2, -1, 0, 1, 2], with mean x̄ = 0.

Variance: σ² = Σ(xᵢ − x̄)² / (n − 1)

where n is the number of points, xᵢ is the value of each point and x̄ is the mean of all points. Variance is the expected value of the squared deviation from the mean.

σ² = ((−2)² + (−1)² + 0² + 1² + 2²) / 4 = 10 / 4 = 2.5

So the top-left entry, 2.5, is the variance along the first axis.

The same for the second axis: the values projected onto y are [-2, 0, 1, 0, 1], with ȳ = 0, and

σ² = Σ(yᵢ − ȳ)² / (n − 1) = 6 / 4 = 1.5

So the bottom-right entry, 1.5, is the variance along the second axis.
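Checking the diagonal entries with NumPy (illustrative; ddof=1 gives the n − 1 denominator used above):

```python
import numpy as np

x_centred = np.array([-2, -1, 0, 1, 2])
y_centred = np.array([-2, 0, 1, 0, 1])

print(np.var(x_centred, ddof=1))   # 2.5 -> variance along the first axis
print(np.var(y_centred, ddof=1))   # 1.5 -> variance along the second axis
```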
How to interpret values in covariance matrix?

The off-diagonal entries are covariances.

Covariance indicates how two variables are related. A positive covariance means the variables are positively related, while a negative covariance means the variables are inversely related.

cov(x, y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

For our data, x = [-2, -1, 0, 1, 2] and y = [-2, 0, 1, 0, 1], both with mean 0:

cov(x, y) = ((−2)(−2) + (−1)(0) + (0)(1) + (1)(0) + (2)(1)) / 4 = 6 / 4 = 1.5

So the off-diagonal entries, 1.5, are the covariance between the two axes.

If instead y were [0, 0, 1, -2, 1] (still with mean 0):

cov(x, y) = ((−2)(0) + (−1)(0) + (0)(1) + (1)(−2) + (2)(1)) / 4 = 0 / 4 = 0

Covariance 0 means that there is no relationship between the two variables. Knowing something about the value of one does not say anything about the value of the other.
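The zero-covariance example can be checked the same way (illustrative):

```python
import numpy as np

x = np.array([-2, -1, 0, 1, 2])
y = np.array([0, 0, 1, -2, 1])

# off-diagonal entry of the covariance matrix
print(np.cov(x, y)[0, 1])   # 0.0 -> no linear relationship between x and y
```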
How to interpret values in covariance matrix?

S = | 2.5  1.5 |      diagonal: variance along the first and second axes
    | 1.5  1.5 |      off-diagonal: covariances between the axes

Aren't they supposed to be between [-1, 1]? No — that property belongs to correlation, which is covariance normalised by the standard deviations.

Covariance vs Correlation: https://en.wikipedia.org/wiki/Covariance_and_correlation


Recall that Z⊤ × Z = S × 4, i.e. S = (Z⊤ × Z) / 4.

Next, factor the covariance matrix with an eigendecomposition: S = P × D × P⊤

| 2.5  1.5 |   =   | -0.81   0.58 |   ×   | 3.58  0    |   ×   | -0.81  -0.58 |
| 1.5  1.5 |       | -0.58  -0.81 |       | 0     0.42 |       |  0.58  -0.81 |

The columns of P are the eigenvectors and the diagonal entries of D are the eigenvalues.

For a worked example: https://www.scss.tcd.ie/Rozenn.Dahyot/CS1BA1/SolutionEigen.pdf
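These eigenvalues and eigenvectors can be checked with NumPy (illustrative; note that eigh returns eigenvalues in ascending order, and eigenvector signs may be flipped relative to the slides):

```python
import numpy as np

S = np.array([[2.5, 1.5],
              [1.5, 1.5]])

eigenvalues, eigenvectors = np.linalg.eigh(S)   # S is symmetric
order = np.argsort(eigenvalues)[::-1]           # largest eigenvalue first
print(eigenvalues[order])                       # ~[3.58 0.42]
print(eigenvectors[:, order])                   # columns ~(0.81, 0.58) and (0.58, -0.81), up to sign
```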


Eigenvectors

The columns of P are the eigenvectors:

| -0.81   0.58 |
| -0.58  -0.81 |

Drawn from the origin (0, 0), eigenvector #1 points to (-0.81, -0.58) and eigenvector #2 points to (0.58, -0.81). They are perpendicular and together define a new coordinate system: eigenvector #1 runs along the main spread of the data, eigenvector #2 across it.

To move the data from the old coordinate system into this new one we need the transpose of the eigenvector matrix:

Eigenvectors⊤ = | -0.81  -0.58 |
                |  0.58  -0.81 |


Project the data into the new coordinate system: multiply Eigenvectors⊤ by Z⊤.

| -0.81  -0.58 |   ×   | -2  -1  0  1  2 |
|  0.58  -0.81 |       | -2   0  1  0  1 |

Point (-2, -2):  -2·(-0.81) + (-2)·(-0.58) = 2.78   and   -2·(0.58) + (-2)·(-0.81) = 0.46
Point (-1,  0):  -1·(-0.81) +  (0)·(-0.58) = 0.81   and   -1·(0.58) +  (0)·(-0.81) = -0.58
Point ( 0,  1):   0·(-0.81) +  (1)·(-0.58) = -0.58  and    0·(0.58) +  (1)·(-0.81) = -0.81
Point ( 1,  0):   1·(-0.81) +  (0)·(-0.58) = -0.81  and    1·(0.58) +  (0)·(-0.81) = 0.58
Point ( 2,  1):   2·(-0.81) +  (1)·(-0.58) = -2.2   and    2·(0.58) +  (1)·(-0.81) = 0.35

New coordinates:

| 2.78   0.81  -0.58  -0.81  -2.2  |
| 0.46  -0.58  -0.81   0.58   0.35 |

These eigenvectors are called Principal Components: eigenvector #1 is called PC1 and eigenvector #2 is called PC2. In the new coordinate system the axes are PC1 and PC2.


The whole recipe:

1. Centre the data and compute the covariance matrix: S = (Z⊤ × Z) / (n − 1)
2. Perform the eigendecomposition: S = P × D × P⊤
3. Obtain the new coordinates: P⊤ × Z⊤
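Putting the three steps together for the toy data set (a sketch; sign flips relative to the slide numbers are possible because eigenvector direction is arbitrary):

```python
import numpy as np

X = np.array([[1, 2], [2, 4], [3, 5], [4, 4], [5, 5]], dtype=float)

# 1. centre the data
Z = X - X.mean(axis=0)

# 2. covariance matrix and its eigendecomposition
S = Z.T @ Z / (len(Z) - 1)
eigenvalues, P = np.linalg.eigh(S)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, P = eigenvalues[order], P[:, order]

# 3. new coordinates: project the centred data onto the eigenvectors
new_coords = Z @ P          # row i = point i, column 0 = PC1 score, column 1 = PC2 score
print(np.round(new_coords, 2))
```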
Do you still remember what it was all about? We want to reduce the dimensionality!


We can ignore the second eigenvector because it does not contain much information.
How much information does the second eigenvector contain?


S = | 2.5  1.5 |          D = | 3.58  0    |
    | 1.5  1.5 |              | 0     0.42 |

Covariance matrix              New covariance matrix (the eigenvalues)

In S, the diagonal entries are the variances along the X and Y axes and the off-diagonal entries are covariances.
In D, the diagonal entries (the eigenvalues) are the variances along PC1 and PC2, and the covariances are 0.

Both the old and the new axes explain 4 units of variance in total:
2.5 + 1.5 = 4 and 3.58 + 0.42 = 4

Out of these 4, the X axis explains 2.5/4 = 62.5% and the Y axis explains 1.5/4 = 37.5%.
In the new coordinate system, PC1 explains 3.58/4 = 89.5% and PC2 explains 0.42/4 = 10.5%.
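The "variance explained" numbers above are just the eigenvalues divided by their sum (illustrative):

```python
import numpy as np

eigenvalues = np.array([3.58, 0.42])
print(eigenvalues / eigenvalues.sum())   # [0.895 0.105] -> PC1 explains 89.5%, PC2 explains 10.5%
```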
We can ignore the second PC because it explains only 10.5% of the variation.


How many PCs will be formed in 200D space? No exceptions, 200 PCs.

How many PCs should we keep?

Variance explained is a good criterion for choosing the number of PCs to keep:
keep as many PCs as it takes to explain 90% of the total variance.
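scikit-learn's PCA can apply this rule directly: passing a float between 0 and 1 as n_components keeps just enough components to reach that fraction of variance (a sketch with made-up random data):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(500, 200))   # made-up 200-D data

pca = PCA(n_components=0.90)        # keep just enough PCs to explain 90% of the variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape[1], "PCs kept")
print(f"{pca.explained_variance_ratio_.sum():.1%} of variance explained")
```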



Principal Component Analysis (PCA)

PCA

Can be used as part of a supervised learning pipeline.

Supervised Learning pipeline:
1. Acquire data (e.g. 200D raw data)
2. Preprocessing: normalisation (subtract the mean)
3. Train/test split — the test set goes to a safe place
4. PCA: compute 200 PCs, keep the few PCs that explain 90% of the variance
5. Find the best model using cross-validation
6. Evaluate the final model on the test set. Profit!

A code sketch of this pipeline follows below.
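One way to wire PCA into such a pipeline with scikit-learn (an illustrative sketch with synthetic data; the classifier choice and the dataset are assumptions, not prescribed by the slides):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# made-up 200-D data standing in for "200D raw data"
X, y = make_classification(n_samples=1000, n_features=200, random_state=0)

# 3. split first, so the test set stays in a "safe place"
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2 + 4 + 5: normalise, keep the PCs explaining 90% of variance, fit a model with CV
model = make_pipeline(StandardScaler(),
                      PCA(n_components=0.90),
                      LogisticRegression(max_iter=1000))
print("CV accuracy:", cross_val_score(model, X_train, y_train, cv=5).mean())

# 6. evaluate the final model on the held-out test set
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```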

PCA has an "undo" button: the transformation can be reversed, so you can recover the original features back!
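In scikit-learn the "undo" button is inverse_transform, which maps the kept PCs back to the original feature space (a lossy reconstruction when some PCs were dropped; illustrative sketch):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1, 2], [2, 4], [3, 5], [4, 4], [5, 5]], dtype=float)

pca = PCA(n_components=1)               # keep only PC1
X_1d = pca.fit_transform(X)             # 2D -> 1D
X_back = pca.inverse_transform(X_1d)    # 1D -> 2D: approximate reconstruction of the originals
print(np.round(X_back, 2))
```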


t-distributed Stochastic Neighbour Embedding (t-SNE)
&
Uniform Manifold Approximation and Projection (UMAP)

t-SNE iteratively tries to make distances in low-dimensional space (e.g. 1D) similar to distances in high-dimensional space (e.g. 2D).

A bit more about t-SNE: https://distill.pub/2016/misread-tsne/

UMAP ultimately tries to achieve a similar thing, using slightly different mechanisms.

Both t-SNE and UMAP cannot "undo" their transformations.
Both t-SNE and UMAP are slower than PCA.
UMAP explained and compared to t-SNE: https://pair-code.github.io/understanding-umap/
TensorFlow Embedding Projector: https://projector.tensorflow.org/
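A minimal usage sketch (assuming scikit-learn for t-SNE and the third-party umap-learn package for UMAP; the parameters and random data are illustrative, not recommendations from the slides):

```python
import numpy as np
from sklearn.manifold import TSNE
import umap   # provided by the third-party "umap-learn" package

X = np.random.default_rng(0).normal(size=(300, 50))   # made-up high-dimensional data

X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)
X_umap = umap.UMAP(n_components=2, random_state=0).fit_transform(X)
print(X_tsne.shape, X_umap.shape)   # (300, 2) (300, 2)
```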
Recap
High-dimensional data is everywhere: images, DNA, taxi rides, sensor recordings.
Algorithms tend to get slow, and methods trained on high-dimensional data suffer from the curse of dimensionality: to keep a high-dimensional space reasonably covered you need a lot more data (on average 55.5% of 2D cells and 92.5% of 3D cells were empty or singletons in our example), and distances become similar.
Not all dimensions are equally useful. PCA forms as many principal components as there were original dimensions, but the first few PCs are often enough to capture the important information.
The main steps to compute PCA: centre the data, compute the covariance matrix S = (Z⊤ × Z) / (n − 1), perform the eigendecomposition S = P × D × P⊤, and obtain the new coordinates as P⊤ × Z⊤.
