Math Review For ML


18-661 Introduction to Machine Learning

Review of Mathematics for ML

Fall 2018
ECE – Carnegie Mellon University
Outline

1. Linear Algebra

2. Calculus and Optimization

3. Probability

4. Review of Statistics

Linear Algebra
Vector spaces – definition
Vector Space (V , +, ·) over a field F
Set of elements (vectors) with two operations:

• sum of elements: u + v, where u, v ∈ V


• and multiplication by a scalar: α · u, α ∈ F (F = R, C, . . .).

Satisfying:

1. ∃ 0 ∈ V : x + 0 = x,
2. ∀ x ∈ V ∃ −x ∈ V : x + (−x) = 0,
3. ∃ ζ ∈ F : ζx = x; we denote ζ = 1,
4. Commutativity: x + y = y + x,
5. Associativity: (x + y) + z = x + (y + z) and α(βx) = (αβ)x,
6. Distributivity: α(x + y) = αx + αy and (α + β)x = αx + βx,

for all x, y, z ∈ V and α, β ∈ F.
Vector spaces

Linear Independence
x1, x2, . . . , xn ∈ V are linearly independent if

∑_{i=1}^{n} αi xi = 0 =⇒ α1 = · · · = αn = 0.

Span
The span of x1, x2, . . . , xn ∈ V is

L{x1, x2, . . . , xn} = {x ∈ V : ∃ α1, . . . , αn ∈ F such that ∑_{i=1}^{n} αi xi = x}.
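A quick numerical check of linear independence (a sketch assuming numpy, which the slides themselves do not use): vectors are linearly independent exactly when the matrix having them as columns has full column rank.

import numpy as np

# Stack the candidate vectors as columns of a matrix.
X = np.column_stack([[1, 0, 0], [0, 1, 0], [1, 1, 0]])

# Linearly independent iff the rank equals the number of vectors.
print(np.linalg.matrix_rank(X) == X.shape[1])  # False: x3 = x1 + x2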
Vector spaces

Basis
B = {x1, . . . , xn} is a basis of a vector space V if

∀ x ∈ V ∃ α1, . . . , αn ∈ F : ∑_{i=1}^{n} αi xi = x,

and {x1, . . . , xn} are linearly independent.
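For a basis, the coefficients α1, . . . , αn of any x are unique, so they can be found by solving a linear system; a minimal sketch assuming numpy:

import numpy as np

# Columns of B form a basis of R^3; solve B @ alpha = x for the coordinates.
B = np.column_stack([[1, 0, 0], [1, 1, 0], [1, 1, 1]])
x = np.array([3.0, 2.0, 1.0])

alpha = np.linalg.solve(B, x)
print(alpha, np.allclose(B @ alpha, x))  # [1. 1. 1.] True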
Normed spaces

Norm
Let V be a real vector space. A norm is a function ‖·‖ : V → R that satisfies:

1. ‖x‖ ≥ 0, and ‖x‖ = 0 if and only if x = 0,
2. ‖αx‖ = |α| ‖x‖,
3. ‖x + y‖ ≤ ‖x‖ + ‖y‖ (triangle inequality).

Examples (norms in Rn):

• ‖x‖1 = ∑_{i=1}^{n} |xi|,
• ‖x‖p = (∑_{i=1}^{n} |xi|^p)^{1/p}, p ≥ 1,
• ‖x‖∞ = max_{1≤i≤n} |xi|.
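These norms can be evaluated directly; a small sketch assuming numpy:

import numpy as np

x = np.array([3.0, -4.0, 0.0])

print(np.linalg.norm(x, 1))       # l1 norm: 7.0
print(np.linalg.norm(x, 2))       # l2 (Euclidean) norm: 5.0
print(np.linalg.norm(x, np.inf))  # l-infinity norm: 4.0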
Inner product spaces

Inner product
An inner product on a real vector space V is a function
⟨·, ·⟩ : V × V → R satisfying:

1. ⟨x, x⟩ ≥ 0 and ⟨x, x⟩ = 0 iff x = 0,
2. ⟨x + y, z⟩ = ⟨x, z⟩ + ⟨y, z⟩ and ⟨αx, y⟩ = α⟨x, y⟩,
3. ⟨x, y⟩ = ⟨y, x⟩,

∀ x, y, z ∈ V and ∀ α ∈ R.

Example
Inner product in Rn:

⟨x, y⟩ = ∑_{i=1}^{n} xi yi = xᵀy.
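A sketch of the standard inner product in Rn assuming numpy; the vectors are illustrative, and ⟨x, y⟩ = 0 here anticipates orthogonality on the next slide:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, -5.0, 2.0])

print(x @ y)                              # x^T y = 4 - 10 + 6 = 0.0, so x ⊥ y
print(np.sqrt(x @ x), np.linalg.norm(x))  # the induced norm is the l2 norm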
Inner product spaces

Remark
Any inner product on V induces a norm on V: ‖x‖ = √⟨x, x⟩.

Orthogonality
Two vectors x, y ∈ V are orthogonal, written x ⊥ y, if ⟨x, y⟩ = 0.

Pythagorean Theorem
If x ⊥ y, then
‖x + y‖² = ‖x‖² + ‖y‖².

Cauchy–Schwarz Inequality

|⟨x, y⟩| ≤ ‖x‖ ‖y‖, ∀ x, y ∈ V.
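Both facts are easy to test numerically (a sketch assuming numpy; the vectors are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(5), rng.standard_normal(5)

# Cauchy–Schwarz: |<x, y>| <= ||x|| ||y||.
print(abs(x @ y) <= np.linalg.norm(x) * np.linalg.norm(y))  # True

# Pythagorean theorem for two orthogonal vectors.
u, v = np.array([1.0, 0.0]), np.array([0.0, 2.0])
print(np.isclose(np.linalg.norm(u + v) ** 2,
                 np.linalg.norm(u) ** 2 + np.linalg.norm(v) ** 2))  # True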
Singular value decomposition (SVD) I

Every matrix has the following decomposition.

SVD
Let A ∈ Rm×n. Then
A = UΣVᵀ,
where U ∈ Rm×m and V ∈ Rn×n are orthogonal matrices (i.e.,
UUᵀ = UᵀU = I and VVᵀ = VᵀV = I), and Σ ∈ Rm×n is a diagonal matrix with the
singular values of A, denoted σi, appearing in non-increasing order:
σ1 ≥ σ2 ≥ . . . ≥ σr > σr+1 = . . . = σmin(m,n) = 0,
where r = rank(A).
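A sketch of the SVD in numpy (the matrix is an arbitrary example):

import numpy as np

A = np.array([[3.0, 0.0],
              [4.0, 5.0]])

U, s, Vt = np.linalg.svd(A)                 # s holds the singular values

print(s)                                    # non-increasing, as promised
print(np.allclose(A, U @ np.diag(s) @ Vt))  # True: A = U Σ V^T
print(np.allclose(U.T @ U, np.eye(2)))      # True: U is orthogonal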
Calculus and Optimization
Gradient

Gradient
Consider a multivariate function f : Rd → R. The gradient of f is the vector

∇f = (∂f/∂x1, . . . , ∂f/∂xd)ᵀ,   i.e.,   [∇f]i = ∂f/∂xi   ∀ i ∈ {1, 2, . . . , d}.

∇f(x) points in the direction of the steepest ascent from x.
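A finite-difference sketch of the gradient, assuming numpy; the helper grad is hypothetical, not part of the slides:

import numpy as np

def grad(f, x, h=1e-6):
    # Central-difference approximation of the gradient of f at x.
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

f = lambda x: x[0] ** 2 + 3 * x[1]    # f(x) = x1^2 + 3 x2
print(grad(f, np.array([1.0, 2.0])))  # ≈ [2. 3.]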
Jacobian

Jacobian
The Jacobian of a vector field f : Rn → Rm is the m × n matrix whose rows
are the (transposed) gradients of the component functions f1, . . . , fm:

[Jf]ij = ∂fi/∂xj,   i ∈ {1, . . . , m}, j ∈ {1, . . . , n}.
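The same finite-difference idea extends to the Jacobian (again a numpy sketch with a hypothetical helper):

import numpy as np

def jacobian(f, x, h=1e-6):
    # Central-difference Jacobian: column j approximates ∂f/∂x_j.
    cols = []
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        cols.append((f(x + e) - f(x - e)) / (2 * h))
    return np.stack(cols, axis=1)  # shape m x n

f = lambda x: np.array([x[0] * x[1], x[0] + x[1] ** 2])
print(jacobian(f, np.array([2.0, 3.0])))  # ≈ [[3. 2.] [1. 6.]]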
Hessian

Hessian
The Hessian of a function f : Rn → R is the n × n matrix of second-order
partial derivatives:

[Hf]ij = ∂²f/∂xi∂xj.

Note that Hf(x) = J∇f(x)ᵀ, the transposed Jacobian of the gradient of f.
Hessian

Clairaut’s Theorem
If the second-order partial derivatives of f : Rd → R are continuous at
a point x, then

∂²f/∂xi∂xj (x) = ∂²f/∂xj∂xi (x),   ∀ i, j ∈ {1, . . . , d};

in this case the Hessian is symmetric: [Hf]ij(x) = [Hf]ji(x).
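A numerical illustration (a numpy sketch; hessian is a hypothetical helper): for a smooth f, the finite-difference Hessian comes out symmetric up to discretization error.

import numpy as np

def hessian(f, x, h=1e-4):
    # Finite-difference approximation of [Hf]_ij = d²f / dx_i dx_j.
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.zeros(n), np.zeros(n)
            ei[i], ej[j] = h, h
            H[i, j] = (f(x + ei + ej) - f(x + ei) - f(x + ej) + f(x)) / h ** 2
    return H

f = lambda x: x[0] ** 2 * x[1]         # smooth, so Clairaut applies
H = hessian(f, np.array([1.0, 2.0]))
print(np.allclose(H, H.T, atol=1e-3))  # True: the mixed partials agree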
Matrix Calculus

Much of the computation in optimization amounts to finding stationary
points (where the gradient vanishes) and optimal points (stationary points
satisfying an extra condition on the Hessian).

Differentiation rules for vectors and matrices
The most important rules for ML are

∇x(aᵀx) = a,
∇x(xᵀAx) = (A + Aᵀ)x,   which reduces to 2Ax if A is symmetric.
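A sanity check of both rules against finite differences (a numpy sketch; grad repeats the hypothetical helper from the gradient slide):

import numpy as np

def grad(f, x, h=1e-6):
    # Central-difference gradient approximation.
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

rng = np.random.default_rng(0)
a, x = rng.standard_normal(3), rng.standard_normal(3)
A = rng.standard_normal((3, 3))          # not symmetric in general

print(np.allclose(grad(lambda v: a @ v, x), a))                  # ∇(aᵀx) = a
print(np.allclose(grad(lambda v: v @ A @ v, x), (A + A.T) @ x))  # ∇(xᵀAx)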
Chain rule

For single-variable functions:

(f ∘ g)′(x) = f′(g(x)) g′(x).

Chain rule for multivariate functions
Let f : Rm → Rk and g : Rn → Rm. Then

Jf∘g(x) = Jf(g(x)) Jg(x).

If k = 1, we have f ∘ g : Rn → R and

∇(f ∘ g)(x) = Jg(x)ᵀ ∇f(g(x)).
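A numpy sketch verifying ∇(f ∘ g)(x) = Jg(x)ᵀ ∇f(g(x)) on a small example (the functions are illustrative choices):

import numpy as np

g = lambda x: np.array([x[0] * x[1], x[0] + x[1]])  # g : R^2 -> R^2
f = lambda u: u[0] ** 2 + np.sin(u[1])              # f : R^2 -> R

x = np.array([1.0, 2.0])
Jg = np.array([[x[1], x[0]],                        # Jacobian of g at x
               [1.0, 1.0]])
grad_f = lambda u: np.array([2 * u[0], np.cos(u[1])])

analytic = Jg.T @ grad_f(g(x))                      # chain rule

h = 1e-6                                            # finite-difference check
numeric = np.array([(f(g(x + e)) - f(g(x - e))) / (2 * h)
                    for e in (np.array([h, 0.0]), np.array([0.0, h]))])
print(np.allclose(analytic, numeric))               # True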
Convexity

• Convex set: A set X ⊆ Rd is convex if

tx + (1 − t)y ∈ X, for all x, y ∈ X and t ∈ [0, 1].

• Convex function: A function f : Rd → R is convex if

f(tx + (1 − t)y) ≤ t f(x) + (1 − t) f(y) for all x, y ∈ dom f and t ∈ [0, 1].
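The defining inequality can be spot-checked on random chords (a numpy sketch; a failed check disproves convexity, while passing checks only suggest it):

import numpy as np

def seems_convex(f, d, trials=1000, seed=0):
    # Randomized test of f(tx + (1-t)y) <= t f(x) + (1-t) f(y).
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x, y = rng.standard_normal(d), rng.standard_normal(d)
        t = rng.uniform()
        if f(t * x + (1 - t) * y) > t * f(x) + (1 - t) * f(y) + 1e-9:
            return False
    return True

print(seems_convex(lambda v: np.sum(v ** 2), d=3))   # True: ||v||^2 is convex
print(seems_convex(lambda v: np.sin(v).sum(), d=3))  # False: sin is not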
Probability
Setup

Sample Space: the set of all possible outcomes or realizations of some
random trial.
Example: Toss a coin twice; the sample space is
Ω = {HH, HT, TH, TT}.

Event: a subset of the sample space.
Example: the event that at least one toss is a head is
A = {HH, HT, TH}.

Probability: We assign a real number P(A) to each event A, called the
probability of A.

Probability Axioms: The probability P must satisfy three axioms:

1. P(A) ≥ 0 for every A;
2. P(Ω) = 1;
3. If A1, A2, . . . are disjoint, then P(∪_{i=1}^{∞} Ai) = ∑_{i=1}^{∞} P(Ai).
Random variables

Definition: A random variable is a function that maps from the sample
space to the reals (X : Ω → R), i.e., it assigns a real number X(ω) to
each outcome ω.

Example: X returns 1 if a coin is heads and 0 if a coin is tails. Y returns
the number of heads after 3 flips of a fair coin.

Random variables can take on many values, and we are often interested
in the distribution over the values of a random variable, e.g., P(Y = 0).
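A simulation sketch (assuming numpy) of the distribution of Y, the number of heads in 3 flips of a fair coin:

import numpy as np

rng = np.random.default_rng(0)

y = rng.integers(0, 2, size=(100_000, 3)).sum(axis=1)  # 1 = head, 0 = tail
print((y == 0).mean())  # ≈ 1/8 = P(Y = 0), since only TTT gives Y = 0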
Distribution function

Definition: Suppose X is a random variable and x is a specific value that it
can take.
The cumulative distribution function (CDF) is the function F : R → [0, 1]
where F(x) = P(X ≤ x).

If X is discrete ⇒ probability mass function: f(x) = P(X = x).

If X is continuous ⇒ f is a probability density function for X if
f(x) ≥ 0 for all x, ∫_{−∞}^{∞} f(x) dx = 1, and for every a ≤ b,

P(a ≤ X ≤ b) = ∫_{a}^{b} f(x) dx.

If F(x) is differentiable everywhere, f(x) = F′(x).
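In particular, P(a ≤ X ≤ b) = F(b) − F(a); a sketch assuming scipy is available:

from scipy.stats import norm

# Standard normal CDF: F(x) = P(X <= x).
print(norm.cdf(0.0))                   # 0.5
print(norm.cdf(1.0) - norm.cdf(-1.0))  # ≈ 0.6827 = P(-1 <= X <= 1)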
Example of distributions

Discrete variable                 Probability function              Mean      Variance

Uniform     X ∼ U[1, . . . , N]   1/N                               (N+1)/2   (N²−1)/12
Binomial    X ∼ Bin(n, p)         C(n, x) p^x (1−p)^{n−x}           np        np(1−p)
Geometric   X ∼ Geom(p)           (1−p)^{x−1} p                     1/p       (1−p)/p²
Poisson     X ∼ Poisson(λ)        e^{−λ} λ^x / x!                   λ         λ

Continuous variable               Probability density function      Mean      Variance

Uniform     X ∼ U(a, b)           1/(b−a)                           (a+b)/2   (b−a)²/12
Gaussian    X ∼ N(µ, σ²)          (1/(√(2π)σ)) exp(−(x−µ)²/(2σ²))   µ         σ²
Gamma       X ∼ Γ(α, β) (x ≥ 0)   x^{α−1} e^{−x/β} / (Γ(α)β^α)      αβ        αβ²
Exponential X ∼ Exp(β)            (1/β) e^{−x/β}                    β         β²
Expectation

Expected Values

• Discrete random variable X: E[g(X)] = ∑_{x∈X} g(x) f(x);
• Continuous random variable X: E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx.

Mean and Variance
µ = E[X] is the mean; var[X] = E[(X − µ)²] is the variance.
We also have var[X] = E[X²] − µ².
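A Monte Carlo sketch (assuming numpy) of the mean, the variance, and the identity var[X] = E[X²] − µ²:

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1_000_000)   # X ~ Exp(β), β = 2

print(x.mean())                                  # ≈ β = 2
print(x.var(), (x ** 2).mean() - x.mean() ** 2)  # both ≈ β² = 4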
Multivariate Distributions

Definition:
F_{X,Y}(x, y) := P(X ≤ x, Y ≤ y),
and
f_{X,Y}(x, y) := ∂²F_{X,Y}(x, y) / ∂x∂y.

Marginal Distribution of X (discrete case):

fX(x) = P(X = x) = ∑_y P(X = x, Y = y) = ∑_y f_{X,Y}(x, y),

or fX(x) = ∫_y f_{X,Y}(x, y) dy for a continuous variable.
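For discrete variables, marginalization is just summing the joint pmf table over the other variable (a numpy sketch with an illustrative joint pmf):

import numpy as np

# Joint pmf as a table: f[x, y] = P(X = x, Y = y).
f = np.array([[0.10, 0.20],
              [0.30, 0.40]])

print(f.sum(axis=1))  # marginal of X: sum over y -> [0.3 0.7]
print(f.sum(axis=0))  # marginal of Y: sum over x -> [0.4 0.6]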
Conditional Probability and Bayes Rule

The conditional probability of X given Y = y is

f_{X|Y}(x|y) = P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y) = f_{X,Y}(x, y) / fY(y).

Bayes Rule:

P(X|Y) = P(Y|X) P(X) / P(Y).
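A worked example of Bayes rule (the numbers are illustrative, not from the slides): a test that is 99% sensitive and 95% specific for a condition with 1% prevalence.

# P(D) = 0.01, P(+|D) = 0.99, P(+|not D) = 0.05.
p_d, p_pos_d, p_pos_nd = 0.01, 0.99, 0.05

# P(+) by the law of total probability, then Bayes rule for P(D|+).
p_pos = p_pos_d * p_d + p_pos_nd * (1 - p_d)
print(p_pos_d * p_d / p_pos)  # ≈ 0.167: most positives are false positives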
Independence

Independent Variables X and Y are independent if and only if:

P(X = x, Y = y ) = P(X = x)P(Y = y )

or fX ,Y (x, y ) = fX (x)fY (y ) for all values x and y .

IID variables: Independent and identically distributed (IID) random


variables are drawn from the same distribution and are all mutually
independent.

Linearity of Expectation: Even if X1, . . . , Xn are not independent,

E[∑_{i=1}^{n} Xi] = ∑_{i=1}^{n} E[Xi].
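A sketch (assuming numpy) showing that linearity of expectation holds even for strongly dependent variables:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = x + rng.normal(size=100_000)  # Y depends on X by construction

# E[X + Y] = E[X] + E[Y] regardless of the dependence.
print(np.mean(x + y), np.mean(x) + np.mean(y))  # the two agree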
Review of Statistics
Statistics

Suppose X1, . . . , XN are random variables.

Sample Mean:

X̄ = (1/N) ∑_{i=1}^{N} Xi.

Sample Variance:

S²_{N−1} = (1/(N−1)) ∑_{i=1}^{N} (Xi − X̄)².

If the Xi are iid:

E[X̄] = E[Xi] = µ,
Var(X̄) = σ²/N,
E[S²_{N−1}] = σ².
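In numpy these correspond to mean and var, where ddof=1 selects the N − 1 denominator (a sketch with illustrative parameters):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)  # iid, µ = 5, σ² = 4

print(x.mean())       # sample mean ≈ 5
print(x.var(ddof=1))  # unbiased sample variance S²_{N−1} ≈ 4
print(x.var(ddof=0))  # biased version S²_N (divides by N)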
Point Estimation

Definition: The point estimator θ̂N is a function of the samples X1, . . . , XN
that approximates a parameter θ of the distribution of the Xi.

Sample Bias: The bias of an estimator is

bias(θ̂N) = Eθ[θ̂N] − θ.

An estimator is unbiased if Eθ[θ̂N] = θ.
Example

Suppose we have observed N realizations of the random variable X:

x1, x2, . . . , xN.

Then,

• The sample mean X̄ = (1/N) ∑_n xn is an unbiased estimator of X’s mean.
• The sample variance S²_{N−1} = (1/(N−1)) ∑_n (xn − X̄)² is an unbiased
  estimator of X’s variance.
• The sample variance S²_N = (1/N) ∑_n (xn − X̄)² is not an unbiased
  estimator of X’s variance.
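A simulation sketch (assuming numpy) of this bias: averaging each estimator over many small samples recovers σ² for S²_{N−1} but only (N−1)/N · σ² for S²_N.

import numpy as np

rng = np.random.default_rng(0)
N, trials, sigma2 = 5, 200_000, 4.0

samples = rng.normal(scale=np.sqrt(sigma2), size=(trials, N))
print(samples.var(axis=1, ddof=1).mean())  # ≈ 4.0: S²_{N−1} is unbiased
print(samples.var(axis=1, ddof=0).mean())  # ≈ 3.2 = (N−1)/N · σ²: biased low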
