Math Review For ML


18-661 Introduction to Machine Learning

Review of Mathematics for ML

Fall 2018
ECE – Carnegie Mellon University
Outline

1. Linear Algebra

2. Calculus and Optimization

3. Probability

4. Review of Statistics

Linear Algebra
Vector spaces – definition
Vector Space (V , +, ·) over a field F
Set of elements (vectors) with two operations:

• sum of elements: u + v, where u, v ∈ V


• and multiplication by a scalar: α · u, α ∈ F (F = R, C, . . .).

Satisfying:

1. ∃ 0 ∈ V : x + 0 = x,
2. ∀ x ∈ V ∃ −x ∈ V : x + (−x) = 0,
3. ∃ ζ ∈ F : ζx = x; we denote ζ = 1,
4. Commutativity: x + y = y + x,
5. Associativity: (x + y) + z = x + (y + z) and α(βx) = (αβ)x,
6. Distributivity: α(x + y) = αx + αy and (α + β)x = αx + βx,

for all x, y, z ∈ V and α, β ∈ F.
Vector spaces

Linear Independence
x1, x2, . . . , xn ∈ V are linearly independent if

∑_{i=1}^{n} αi xi = 0 =⇒ α1 = · · · = αn = 0.

Span
The span of x1, x2, . . . , xn ∈ V is

L{x1, x2, . . . , xn} = {x ∈ V : ∃ α1, . . . , αn ∈ F such that ∑_{i=1}^{n} αi xi = x}.
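A quick numerical check of linear independence (a sketch assuming numpy, which the slides themselves do not use): vectors are linearly independent exactly when the matrix having them as columns has full column rank.

import numpy as np

# Stack the candidate vectors as columns of a matrix.
X = np.column_stack([[1, 0, 0], [0, 1, 0], [1, 1, 0]])

# Linearly independent iff the rank equals the number of vectors.
print(np.linalg.matrix_rank(X) == X.shape[1])  # False: x3 = x1 + x2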
Vector spaces

Basis
B = {x1, . . . , xn} is a basis of a vector space V if

∀ x ∈ V ∃ α1, . . . , αn ∈ F : ∑_{i=1}^{n} αi xi = x,

and {x1, . . . , xn} are linearly independent.
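For a basis, the coefficients α1, . . . , αn of any x are unique, so they can be found by solving a linear system; a minimal sketch assuming numpy:

import numpy as np

# Columns of B form a basis of R^3; solve B @ alpha = x for the coordinates.
B = np.column_stack([[1, 0, 0], [1, 1, 0], [1, 1, 1]])
x = np.array([3.0, 2.0, 1.0])

alpha = np.linalg.solve(B, x)
print(alpha, np.allclose(B @ alpha, x))  # [1. 1. 1.] True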
Normed spaces

Norm
Let V be a real vector space. A norm is a function ‖·‖ : V → R that satisfies:

1. ‖x‖ ≥ 0, and ‖x‖ = 0 if and only if x = 0,
2. ‖αx‖ = |α| ‖x‖,
3. ‖x + y‖ ≤ ‖x‖ + ‖y‖ (triangle inequality).

Examples (norms in Rn):

• ‖x‖1 = ∑_{i=1}^{n} |xi|,
• ‖x‖p = (∑_{i=1}^{n} |xi|^p)^{1/p}, p ≥ 1,
• ‖x‖∞ = max_{1≤i≤n} |xi|.
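These norms can be evaluated directly; a small sketch assuming numpy:

import numpy as np

x = np.array([3.0, -4.0, 0.0])

print(np.linalg.norm(x, 1))       # l1 norm: 7.0
print(np.linalg.norm(x, 2))       # l2 (Euclidean) norm: 5.0
print(np.linalg.norm(x, np.inf))  # l-infinity norm: 4.0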
Inner product spaces

Inner product
An inner product on a real vector space V is a function
⟨·, ·⟩ : V × V → R satisfying:

1. ⟨x, x⟩ ≥ 0 and ⟨x, x⟩ = 0 iff x = 0,
2. ⟨x + y, z⟩ = ⟨x, z⟩ + ⟨y, z⟩ and ⟨αx, y⟩ = α⟨x, y⟩,
3. ⟨x, y⟩ = ⟨y, x⟩,

∀ x, y, z ∈ V and ∀ α ∈ R.

Example
Inner product in Rn:

⟨x, y⟩ = ∑_{i=1}^{n} xi yi = xᵀy.
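A sketch of the standard inner product in Rn assuming numpy; the vectors are illustrative, and ⟨x, y⟩ = 0 here anticipates orthogonality on the next slide:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, -5.0, 2.0])

print(x @ y)                              # x^T y = 4 - 10 + 6 = 0.0, so x ⊥ y
print(np.sqrt(x @ x), np.linalg.norm(x))  # the induced norm is the l2 norm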
Inner product spaces

Remark
Any inner product on V induces a norm on V: ‖x‖ = √⟨x, x⟩.

Orthogonality
Two vectors x, y ∈ V are orthogonal, written x ⊥ y, if ⟨x, y⟩ = 0.

Pythagorean Theorem
If x ⊥ y, then
‖x + y‖² = ‖x‖² + ‖y‖².

Cauchy–Schwarz Inequality

|⟨x, y⟩| ≤ ‖x‖ ‖y‖, ∀ x, y ∈ V.
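Both facts are easy to test numerically (a sketch assuming numpy; the vectors are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(5), rng.standard_normal(5)

# Cauchy–Schwarz: |<x, y>| <= ||x|| ||y||.
print(abs(x @ y) <= np.linalg.norm(x) * np.linalg.norm(y))  # True

# Pythagorean theorem for two orthogonal vectors.
u, v = np.array([1.0, 0.0]), np.array([0.0, 2.0])
print(np.isclose(np.linalg.norm(u + v) ** 2,
                 np.linalg.norm(u) ** 2 + np.linalg.norm(v) ** 2))  # True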
Singular value decomposition (SVD) I

Every matrix has the following decomposition.

SVD
Let A ∈ Rm×n. Then
A = UΣVᵀ,
where U ∈ Rm×m and V ∈ Rn×n are orthogonal matrices (i.e.,
UUᵀ = UᵀU = I and VVᵀ = VᵀV = I), and Σ ∈ Rm×n is a diagonal matrix with the
singular values of A, denoted σi, appearing in non-increasing order:
σ1 ≥ σ2 ≥ . . . ≥ σr > σr+1 = . . . = σmin(m,n) = 0,
where r = rank(A).
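A sketch of the SVD in numpy (the matrix is an arbitrary example):

import numpy as np

A = np.array([[3.0, 0.0],
              [4.0, 5.0]])

U, s, Vt = np.linalg.svd(A)                 # s holds the singular values

print(s)                                    # non-increasing, as promised
print(np.allclose(A, U @ np.diag(s) @ Vt))  # True: A = U Σ V^T
print(np.allclose(U.T @ U, np.eye(2)))      # True: U is orthogonal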
Calculus and Optimization
Gradient

Gradient
Consider a multivariate function f : Rd → R. The gradient of f is the vector

∇f = (∂f/∂x1, . . . , ∂f/∂xd)ᵀ,   i.e.,   [∇f]i = ∂f/∂xi   ∀ i ∈ {1, 2, . . . , d}.

∇f(x) points in the direction of the steepest ascent from x.
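A finite-difference sketch of the gradient, assuming numpy; the helper grad is hypothetical, not part of the slides:

import numpy as np

def grad(f, x, h=1e-6):
    # Central-difference approximation of the gradient of f at x.
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

f = lambda x: x[0] ** 2 + 3 * x[1]    # f(x) = x1^2 + 3 x2
print(grad(f, np.array([1.0, 2.0])))  # ≈ [2. 3.]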
Jacobian

Jacobian
The Jacobian of a vector field f : Rn → Rm is the m × n matrix whose rows
are the (transposed) gradients of the component functions f1, . . . , fm:

[Jf]ij = ∂fi/∂xj,   i ∈ {1, . . . , m}, j ∈ {1, . . . , n}.
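The same finite-difference idea extends to the Jacobian (again a numpy sketch with a hypothetical helper):

import numpy as np

def jacobian(f, x, h=1e-6):
    # Central-difference Jacobian: column j approximates ∂f/∂x_j.
    cols = []
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        cols.append((f(x + e) - f(x - e)) / (2 * h))
    return np.stack(cols, axis=1)  # shape m x n

f = lambda x: np.array([x[0] * x[1], x[0] + x[1] ** 2])
print(jacobian(f, np.array([2.0, 3.0])))  # ≈ [[3. 2.] [1. 6.]]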
Hessian

Hessian
The Hessian of a function f : Rn → R is the n × n matrix of second-order
partial derivatives:

[Hf]ij = ∂²f/∂xi∂xj.

Note that Hf(x) = J∇f(x)ᵀ, the transposed Jacobian of the gradient of f.
Hessian

Clairaut’s Theorem
If the second-order partial derivatives of f : Rd → R are continuous at
a point x, then

∂²f/∂xi∂xj (x) = ∂²f/∂xj∂xi (x),   ∀ i, j ∈ {1, . . . , d};

in this case the Hessian is symmetric: [Hf]ij(x) = [Hf]ji(x).
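A numerical illustration (a numpy sketch; hessian is a hypothetical helper): for a smooth f, the finite-difference Hessian comes out symmetric up to discretization error.

import numpy as np

def hessian(f, x, h=1e-4):
    # Finite-difference approximation of [Hf]_ij = d²f / dx_i dx_j.
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.zeros(n), np.zeros(n)
            ei[i], ej[j] = h, h
            H[i, j] = (f(x + ei + ej) - f(x + ei) - f(x + ej) + f(x)) / h ** 2
    return H

f = lambda x: x[0] ** 2 * x[1]         # smooth, so Clairaut applies
H = hessian(f, np.array([1.0, 2.0]))
print(np.allclose(H, H.T, atol=1e-3))  # True: the mixed partials agree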
Matrix Calculus

Much of the computation in optimization amounts to finding stationary
points (where the gradient vanishes) and optimal points (stationary points
satisfying an extra condition on the Hessian).

Differentiation rules for vectors and matrices
The most important rules for ML are

∇x(aᵀx) = a,
∇x(xᵀAx) = (A + Aᵀ)x,   which reduces to 2Ax if A is symmetric.
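A sanity check of both rules against finite differences (a numpy sketch; grad repeats the hypothetical helper from the gradient slide):

import numpy as np

def grad(f, x, h=1e-6):
    # Central-difference gradient approximation.
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

rng = np.random.default_rng(0)
a, x = rng.standard_normal(3), rng.standard_normal(3)
A = rng.standard_normal((3, 3))          # not symmetric in general

print(np.allclose(grad(lambda v: a @ v, x), a))                  # ∇(aᵀx) = a
print(np.allclose(grad(lambda v: v @ A @ v, x), (A + A.T) @ x))  # ∇(xᵀAx)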
Chain rule

For single-variable functions:

(f ∘ g)′(x) = f′(g(x)) g′(x).

Chain rule for multivariate functions
Let f : Rm → Rk and g : Rn → Rm. Then

Jf∘g(x) = Jf(g(x)) Jg(x).

If k = 1, we have f ∘ g : Rn → R and

∇(f ∘ g)(x) = Jg(x)ᵀ ∇f(g(x)).
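A numpy sketch verifying ∇(f ∘ g)(x) = Jg(x)ᵀ ∇f(g(x)) on a small example (the functions are illustrative choices):

import numpy as np

g = lambda x: np.array([x[0] * x[1], x[0] + x[1]])  # g : R^2 -> R^2
f = lambda u: u[0] ** 2 + np.sin(u[1])              # f : R^2 -> R

x = np.array([1.0, 2.0])
Jg = np.array([[x[1], x[0]],                        # Jacobian of g at x
               [1.0, 1.0]])
grad_f = lambda u: np.array([2 * u[0], np.cos(u[1])])

analytic = Jg.T @ grad_f(g(x))                      # chain rule

h = 1e-6                                            # finite-difference check
numeric = np.array([(f(g(x + e)) - f(g(x - e))) / (2 * h)
                    for e in (np.array([h, 0.0]), np.array([0.0, h]))])
print(np.allclose(analytic, numeric))               # True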
Convexity

• Convex set: A set X ⊆ Rd is convex if

tx + (1 − t)y ∈ X, for all x, y ∈ X and t ∈ [0, 1].

• Convex function: A function f : Rd → R is convex if

f(tx + (1 − t)y) ≤ t f(x) + (1 − t) f(y) for all x, y ∈ dom f and t ∈ [0, 1].
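The defining inequality can be spot-checked on random chords (a numpy sketch; a failed check disproves convexity, while passing checks only suggest it):

import numpy as np

def seems_convex(f, d, trials=1000, seed=0):
    # Randomized test of f(tx + (1-t)y) <= t f(x) + (1-t) f(y).
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x, y = rng.standard_normal(d), rng.standard_normal(d)
        t = rng.uniform()
        if f(t * x + (1 - t) * y) > t * f(x) + (1 - t) * f(y) + 1e-9:
            return False
    return True

print(seems_convex(lambda v: np.sum(v ** 2), d=3))   # True: ||v||^2 is convex
print(seems_convex(lambda v: np.sin(v).sum(), d=3))  # False: sin is not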
Probability
Setup

Sample Space: the set of all possible outcomes or realizations of some
random trial.
Example: Toss a coin twice; the sample space is
Ω = {HH, HT, TH, TT}.

Event: a subset of the sample space.
Example: the event that at least one toss is a head is
A = {HH, HT, TH}.

Probability: We assign a real number P(A) to each event A, called the
probability of A.

Probability Axioms: The probability P must satisfy three axioms:

1. P(A) ≥ 0 for every A;
2. P(Ω) = 1;
3. If A1, A2, . . . are disjoint, then P(∪_{i=1}^{∞} Ai) = ∑_{i=1}^{∞} P(Ai).
Random variables

Definition: A random variable is a function that maps from the sample
space to the reals (X : Ω → R), i.e., it assigns a real number X(ω) to
each outcome ω.

Example: X returns 1 if a coin is heads and 0 if a coin is tails. Y returns
the number of heads after 3 flips of a fair coin.

Random variables can take on many values, and we are often interested
in the distribution over the values of a random variable, e.g., P(Y = 0).
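A simulation sketch (assuming numpy) of the distribution of Y, the number of heads in 3 flips of a fair coin:

import numpy as np

rng = np.random.default_rng(0)

y = rng.integers(0, 2, size=(100_000, 3)).sum(axis=1)  # 1 = head, 0 = tail
print((y == 0).mean())  # ≈ 1/8 = P(Y = 0), since only TTT gives Y = 0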
Distribution function

Definition: Suppose X is a random variable and x is a specific value that it
can take.
The cumulative distribution function (CDF) is the function F : R → [0, 1]
where F(x) = P(X ≤ x).

If X is discrete ⇒ probability mass function: f(x) = P(X = x).

If X is continuous ⇒ f is a probability density function for X if
f(x) ≥ 0 for all x, ∫_{−∞}^{∞} f(x) dx = 1, and for every a ≤ b,

P(a ≤ X ≤ b) = ∫_{a}^{b} f(x) dx.

If F(x) is differentiable everywhere, f(x) = F′(x).
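In particular, P(a ≤ X ≤ b) = F(b) − F(a); a sketch assuming scipy is available:

from scipy.stats import norm

# Standard normal CDF: F(x) = P(X <= x).
print(norm.cdf(0.0))                   # 0.5
print(norm.cdf(1.0) - norm.cdf(-1.0))  # ≈ 0.6827 = P(-1 <= X <= 1)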
Example of distributions

Discrete variable                 Probability function              Mean      Variance

Uniform     X ∼ U[1, . . . , N]   1/N                               (N+1)/2   (N²−1)/12
Binomial    X ∼ Bin(n, p)         C(n, x) p^x (1−p)^{n−x}           np        np(1−p)
Geometric   X ∼ Geom(p)           (1−p)^{x−1} p                     1/p       (1−p)/p²
Poisson     X ∼ Poisson(λ)        e^{−λ} λ^x / x!                   λ         λ

Continuous variable               Probability density function      Mean      Variance

Uniform     X ∼ U(a, b)           1/(b−a)                           (a+b)/2   (b−a)²/12
Gaussian    X ∼ N(µ, σ²)          (1/(√(2π)σ)) exp(−(x−µ)²/(2σ²))   µ         σ²
Gamma       X ∼ Γ(α, β) (x ≥ 0)   x^{α−1} e^{−x/β} / (Γ(α)β^α)      αβ        αβ²
Exponential X ∼ Exp(β)            (1/β) e^{−x/β}                    β         β²
Expectation

Expected Values

• Discrete random variable X: E[g(X)] = ∑_{x∈X} g(x) f(x);
• Continuous random variable X: E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx.

Mean and Variance
µ = E[X] is the mean; var[X] = E[(X − µ)²] is the variance.
We also have var[X] = E[X²] − µ².
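A Monte Carlo sketch (assuming numpy) of the mean, the variance, and the identity var[X] = E[X²] − µ²:

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1_000_000)   # X ~ Exp(β), β = 2

print(x.mean())                                  # ≈ β = 2
print(x.var(), (x ** 2).mean() - x.mean() ** 2)  # both ≈ β² = 4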
Multivariate Distributions

Definition:
F_{X,Y}(x, y) := P(X ≤ x, Y ≤ y),
and
f_{X,Y}(x, y) := ∂²F_{X,Y}(x, y) / ∂x∂y.

Marginal Distribution of X (discrete case):

fX(x) = P(X = x) = ∑_y P(X = x, Y = y) = ∑_y f_{X,Y}(x, y),

or fX(x) = ∫_y f_{X,Y}(x, y) dy for a continuous variable.
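For discrete variables, marginalization is just summing the joint pmf table over the other variable (a numpy sketch with an illustrative joint pmf):

import numpy as np

# Joint pmf as a table: f[x, y] = P(X = x, Y = y).
f = np.array([[0.10, 0.20],
              [0.30, 0.40]])

print(f.sum(axis=1))  # marginal of X: sum over y -> [0.3 0.7]
print(f.sum(axis=0))  # marginal of Y: sum over x -> [0.4 0.6]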
Conditional Probability and Bayes Rule

The conditional probability of X given Y = y is

f_{X|Y}(x|y) = P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y) = f_{X,Y}(x, y) / fY(y).

Bayes Rule:

P(X|Y) = P(Y|X) P(X) / P(Y).
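A worked example of Bayes rule (the numbers are illustrative, not from the slides): a test that is 99% sensitive and 95% specific for a condition with 1% prevalence.

# P(D) = 0.01, P(+|D) = 0.99, P(+|not D) = 0.05.
p_d, p_pos_d, p_pos_nd = 0.01, 0.99, 0.05

# P(+) by the law of total probability, then Bayes rule for P(D|+).
p_pos = p_pos_d * p_d + p_pos_nd * (1 - p_d)
print(p_pos_d * p_d / p_pos)  # ≈ 0.167: most positives are false positives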
Independence

Independent Variables X and Y are independent if and only if:

P(X = x, Y = y ) = P(X = x)P(Y = y )

or fX ,Y (x, y ) = fX (x)fY (y ) for all values x and y .

IID variables: Independent and identically distributed (IID) random


variables are drawn from the same distribution and are all mutually
independent.

Linearity of Expectation: Even if X1, . . . , Xn are not independent,

E[∑_{i=1}^{n} Xi] = ∑_{i=1}^{n} E[Xi].
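A sketch (assuming numpy) showing that linearity of expectation holds even for strongly dependent variables:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = x + rng.normal(size=100_000)  # Y depends on X by construction

# E[X + Y] = E[X] + E[Y] regardless of the dependence.
print(np.mean(x + y), np.mean(x) + np.mean(y))  # the two agree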
Review of Statistics
Statistics

Suppose X1, . . . , XN are random variables.

Sample Mean:

X̄ = (1/N) ∑_{i=1}^{N} Xi.

Sample Variance:

S²_{N−1} = (1/(N−1)) ∑_{i=1}^{N} (Xi − X̄)².

If the Xi are iid:

E[X̄] = E[Xi] = µ,
Var(X̄) = σ²/N,
E[S²_{N−1}] = σ².
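In numpy these correspond to mean and var, where ddof=1 selects the N − 1 denominator (a sketch with illustrative parameters):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)  # iid, µ = 5, σ² = 4

print(x.mean())       # sample mean ≈ 5
print(x.var(ddof=1))  # unbiased sample variance S²_{N−1} ≈ 4
print(x.var(ddof=0))  # biased version S²_N (divides by N)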
Point Estimation

Definition: The point estimator θ̂N is a function of the samples X1, . . . , XN
that approximates a parameter θ of the distribution of the Xi.

Sample Bias: The bias of an estimator is

bias(θ̂N) = Eθ[θ̂N] − θ.

An estimator is unbiased if Eθ[θ̂N] = θ.
Example

Suppose we have observed N realizations of the random variable X:

x1, x2, . . . , xN.

Then,

• The sample mean X̄ = (1/N) ∑_n xn is an unbiased estimator of X’s mean.
• The sample variance S²_{N−1} = (1/(N−1)) ∑_n (xn − X̄)² is an unbiased
  estimator of X’s variance.
• The sample variance S²_N = (1/N) ∑_n (xn − X̄)² is not an unbiased
  estimator of X’s variance.
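A simulation sketch (assuming numpy) of this bias: averaging each estimator over many small samples recovers σ² for S²_{N−1} but only (N−1)/N · σ² for S²_N.

import numpy as np

rng = np.random.default_rng(0)
N, trials, sigma2 = 5, 200_000, 4.0

samples = rng.normal(scale=np.sqrt(sigma2), size=(trials, N))
print(samples.var(axis=1, ddof=1).mean())  # ≈ 4.0: S²_{N−1} is unbiased
print(samples.var(axis=1, ddof=0).mean())  # ≈ 3.2 = (N−1)/N · σ²: biased low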
