
FIT5197, 2021 Semester 1, Formula Sheet

Daniel Schmidt, Wray Buntine, Levin Kuhlmann


May 31, 2021

1 Sample Statistics
• quartiles, percentiles, etc.: given n data points, rank them in increasing
  value to get x_1, ..., x_n (a Python sketch of these statistics follows at the
  end of this section)
  – median: if n is odd, given by x_{(n+1)/2}; if n is even, given by
    (1/2)(x_{n/2} + x_{n/2+1})
  – quartiles: Q_1 or Q_3 is given by Q_k = x_p + (q/4)(x_{p+1} − x_p), where
    p = floor(k(n + 1)/4) and q = k(n + 1) mod 4
  – percentiles: P_k = x_p + (q/100)(x_{p+1} − x_p), where
    p = floor(k(n + 1)/100) and q = k(n + 1) mod 100
• measures of spread for n data points, x = (x_1, ..., x_n)
  – sample variance, var(x) = s²_x = (1/(n − 1)) Σ_{i=1}^n (x_i − x̄)²
  – sample standard deviation, s_x = √(s²_x)
  – range = max_i x_i − min_i x_i
  – inter-quartile range, IQR = Q_3 − Q_1
• sample covariance
    q_xy = (1/(n − 1)) Σ_{i=1}^n (x_i − x̄)(y_i − ȳ)

• sample correlation coefficient
    r_xy = q_xy/(s_x s_y)
         = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / √( Σ_{i=1}^n (x_i − x̄)² · Σ_{i=1}^n (y_i − ȳ)² )


• boxplots
  – represent numerical data through quartiles
  – the lower hinge is Q_1, the upper hinge is Q_3
  – the lower whisker is drawn at the minimum data value greater than the
    lower inner fence, Q_1 − 1.5 × IQR (the fence itself is usually not drawn)
  – the upper whisker is drawn at the maximum data value less than the
    upper inner fence, Q_3 + 1.5 × IQR (the fence itself is usually not drawn)
  – outliers are highlighted outside these two fences
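
A minimal Python sketch of the statistics in this section, assuming numpy and
made-up data; the quartile function follows the rank-based rule above, which
can differ slightly from numpy's default percentile interpolation.

    import numpy as np

    def quartile(x, k):
        # Q_k = x_p + (q/4)(x_{p+1} - x_p), p = floor(k(n+1)/4), q = k(n+1) mod 4
        xs = np.sort(np.asarray(x, dtype=float))
        n = len(xs)
        p, q = divmod(k * (n + 1), 4)
        if p < 1:
            return xs[0]
        if p >= n:                      # rank falls past the last observation
            return xs[-1]
        return xs[p - 1] + (q / 4) * (xs[p] - xs[p - 1])  # 1-based rank -> index p-1

    x = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 11.0])
    y = np.array([1.0, 3.0, 5.0, 4.0, 8.0, 9.0, 14.0])

    q1, q3 = quartile(x, 1), quartile(x, 3)
    iqr = q3 - q1                                    # inter-quartile range
    fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)        # boxplot inner fences

    var_x = x.var(ddof=1)                            # sample variance, 1/(n-1) scaling
    cov_xy = np.cov(x, y)[0, 1]                      # sample covariance q_xy
    r_xy = cov_xy / (x.std(ddof=1) * y.std(ddof=1))  # sample correlation r_xy
    print(np.median(x), q1, q3, iqr, fences, var_x, cov_xy, r_xy)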

2 Probability
• probability axioms of Kolmogorov:
  1. for any event A, 0 ≤ p(A) ≤ 1
  2. p(Ω) = 1, where Ω is the universal set, the set of everything
  3. for mutually exclusive events A_1, ..., A_n:
     p(A_1 ∪ A_2 ∪ ... ∪ A_n) = Σ_{i=1}^n p(A_i)

• other probability identities for the domain X × Y where A, B are any
  events:
  – complement rule, p(Ā) = 1 − p(A)
  – product rule, p(B ∩ A) = p(B|A) p(A)
  – addition rule for 2 sets, p(A ∪ B) = p(A) + p(B) − p(A ∩ B)
  – sum rule, p(A) = Σ_{x∈X} p(x ∩ A)
  – conditional probability, p(B|A) = p(B ∩ A)/p(A) when p(A) > 0
  – Bayes theorem, p(x|A) = p(A|x) p(x) / Σ_{x∈X} p(A|x) p(x)
    (a numeric check follows at the end of this section)


• for continuous random variables a probability density function (PDF)
  p(x) on domain X satisfies
    p(x) ≥ 0 for all x ∈ X
  and
    ∫_X p(x) dx = 1
• for continuous random variables, the probability that X ∈ A, where A ⊂ X,
  is
    p(X ∈ A) = ∫_A p(x) dx

• for X a single dimension, define the cumulative distribution function
  (CDF), P(x), in terms of the PDF p(x) as
    P(x) = ∫_{y<x} p(y) dy
  and the quantile function Q(x) as
    Q(x) = P⁻¹(x)
  which is well defined when p(x) > 0.
• Let the random variable pair (X, Y ) be from domain X × Y. We say
X and Y are independent if any of the following three (equivalent)
conditions hold for all x ∈ X and y ∈ Y
(I) p(X=x|Y =y) = p(X=x) when p(Y =y) > 0
(II) p(Y =y|X=x) = p(Y =y) when p(X=x) > 0
(III) p(Y =y ∩ X=x) = p(X=x)p(Y =y)
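
A quick numeric check of the sum rule and Bayes theorem, using a made-up
two-value domain (all numbers are illustrative):

    # hypothetical domain: x in {0, 1} with prior p(x) and likelihood p(A|x)
    p_x = {0: 0.7, 1: 0.3}
    p_A_given_x = {0: 0.1, 1: 0.8}

    # sum rule: p(A) = sum over x of p(x ∩ A), with p(x ∩ A) = p(A|x) p(x)
    p_A = sum(p_A_given_x[x] * p_x[x] for x in p_x)

    # Bayes theorem: p(x|A) = p(A|x) p(x) / p(A)
    p_x_given_A = {x: p_A_given_x[x] * p_x[x] / p_A for x in p_x}

    assert abs(sum(p_x_given_A.values()) - 1.0) < 1e-12  # the posterior normalises
    print(p_A, p_x_given_A)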

3 Expected Values
• if X has domain X, the expectation and variance of f(X) are
    E[f(X)] = Σ_{x∈X} p(x) f(x)
    V[f(X)] = E[(f(X) − E[f(X)])²] = E[f(X)²] − E[f(X)]²
  with integrals replacing sums for continuous RVs


• some useful rules for RVs X, Y and constant c
  – E[f(X) + g(Y)] = E[f(X)] + E[g(Y)]
  – E[c f(X)] = c E[f(X)]
  – V[c f(X)] = c² V[f(X)]
• if X, Y are independent RVs
  – E[f(X) g(Y)] = E[f(X)] E[g(Y)]
  – V[f(X) + g(Y)] = V[f(X)] + V[g(Y)]
• Chebyshev's inequality: if X is a RV with mean µ and variance σ², then
  for any k > 0
    p(|X − µ|/σ ≥ k) ≤ 1/k²
• weak law of large numbers: let X_1, ..., X_n be independent RVs with
  E[X_i] = µ; then for any ε > 0
    p(|(X_1 + · · · + X_n)/n − µ| > ε) → 0 as n → ∞.
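
An illustrative simulation of Chebyshev's inequality and the weak law of large
numbers, assuming numpy; the Gaussian is an arbitrary choice of distribution:

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 2.0, 3.0

    # Chebyshev: the fraction of draws with |X - mu|/sigma >= k is at most 1/k^2
    x = rng.normal(mu, sigma, size=100_000)
    k = 2.0
    print((np.abs(x - mu) / sigma >= k).mean(), "<=", 1 / k**2)

    # weak law of large numbers: the sample mean concentrates around mu
    for n in (10, 1_000, 100_000):
        print(n, rng.normal(mu, sigma, size=n).mean())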

4 Distributions
• for the Gaussian or normal distribution, denoted N(µ, σ²),
    p(x | µ, σ²) = (1/(2πσ²))^{1/2} exp( −(x − µ)²/(2σ²) )
  which has the properties
  – E[x] = µ and V[x] = σ²
  – the mode and the median are the same as the mean
  – if the curve for p(x | 0, 1) is shifted to the right by µ and scaled by
    σ, one gets the curve for p(x | µ, σ²)
• the discrete uniform distribution, denoted U(a, b), models discrete RVs
  and follows
    P(X = k | a, b) = 1/(b − a + 1)
  where X ∈ {a, ..., b} with b ≥ a, and has properties
  – E[X] = (a + b)/2 and V[X] = ((b − a + 1)² − 1)/12

• the continuous uniform distribution, denoted U(a, b), models continuous
  RVs with pdf
    p(x | a, b) = 0 for x < a;  1/(b − a) for a ≤ x ≤ b;  0 for x > b
  where b > a and
  – the quantity a determines the start of the distribution
  – the quantity w = b − a is the width of the distribution
  – E[X] = (a + b)/2 = a + w/2 and V[X] = (b − a)²/12 = w²/12

• the Bernoulli distribution models discrete, binary RVs, i.e., X = {0, 1},
  denoted Be(θ), with
    p(X = 1 | θ) = θ,  θ ∈ [0, 1]
  so that the parametric probability distribution is
    p(x | θ) = θ^x (1 − θ)^(1−x)
  and has properties
  – E[x] = θ and V[x] = θ(1 − θ)

• the binomial distribution describes the probability of getting x successful
  outcomes in n Bernoulli trials with probability of success θ, denoted
  bin(θ, n); for x ∈ {0, 1, ..., n},
    p(x | n, θ) = (n choose x) θ^x (1 − θ)^(n−x)
  and has properties
  – E[x] = nθ and V[x] = nθ(1 − θ)


• the Poisson distribution with rate parameter λ, denoted Pois(λ), models
  the number of events x occurring, for X = {0} ∪ N,
    p(x | λ) = λ^x exp(−λ) / x!
  and has properties
  – E[x] = λ and V[x] = λ
  – if X ∼ Pois(λ_X) and Y ∼ Pois(λ_Y) then (X + Y) ∼ Pois(λ_X + λ_Y)
  – bin(θ, n) ≈ Pois(nθ) for n ≫ 1 and nθ small (a numeric check follows below)
• note: the Central Limit Theorem (CLT) has been moved to Section 6 of this
  document.
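
A sketch checking the binomial pmf formula and the Poisson approximation above,
assuming scipy is available for the cross-check:

    from math import comb, exp, factorial
    from scipy import stats

    n, theta, x = 100, 0.02, 3               # n >> 1 and n*theta small

    pmf_formula = comb(n, x) * theta**x * (1 - theta)**(n - x)
    pmf_scipy = stats.binom.pmf(x, n, theta)  # library cross-check

    lam = n * theta                           # approximation bin(theta, n) ~ Pois(n*theta)
    pois_formula = lam**x * exp(-lam) / factorial(x)

    print(pmf_formula, pmf_scipy, pois_formula)  # all close to 0.18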

5 Estimation
• have a sample x; let θ̂(x) be a point estimate for model parameter θ;
  then θ̂(x) is unbiased if E_x[θ̂(x)] = θ, where the expectation is taken
  over samples x
• the bias of the estimator is
    b_θ(θ̂) = E_x[θ̂(x)] − θ
• the variance of the estimator is
    V_θ[θ̂] = E_x[ (θ̂(x) − E_x[θ̂(x)])² ]
• the mean square error (MSE) of the estimator is
    MSE_θ[θ̂] = E_x[ (θ̂(x) − θ)² ] = b_θ(θ̂)² + V_θ[θ̂]
• for a sample x of size n distributed as N(µ, σ²), the sum of squared
  errors (SSE) of mean estimate µ is given by
    SSE(µ) = Σ_{i=1}^n (x_i − µ)²


  and the point estimate µ̂ minimising the SSE is the sample mean
    µ̂ = (1/n) Σ_{i=1}^n x_i

• the method of maximum likelihood says we should use the model that
  assigns the greatest probability to the data we have observed; formally,
  the maximum likelihood estimator (MLE) is found by solving
    Θ̂ = arg max_Θ { p(x | Θ) }
  where p(x | Θ) is called the likelihood function
• use L(x | Θ) to denote the negative log-likelihood, log 1/p(x | Θ)

• for a sample x of size n distributed as N(µ, σ²),
    L(x | µ, σ²) = (n/2) log(2πσ²) + SSE(µ)/(2σ²)
  from this we get
  – µ̂_ML is the sample mean, the same as when using the SSE
  – the MLE for the variance is
      σ̂²_ML = (1/n) Σ_{i=1}^n (x_i − µ̂)²
    this is however biased; an unbiased estimate is
      σ̂²_u = (1/(n − 1)) Σ_{i=1}^n (x_i − µ̂)²
    (a simulation of this bias follows at the end of this section)

• the MLE estimates for λ of the Poisson and θ of the Bernoulli are also
  the sample mean
• the MLE estimate for θ of the binomial, bin(θ, m), using a sample x of
  size n, is
    θ̂_ML = (1/(nm)) Σ_{i=1}^n x_i


• let x be a sample of size n from a Gaussian population with mean µ and
  variance σ², and let m be the sample mean and s² the sample variance:
  – m is Gaussian with mean µ and variance σ²/n
  – √n (m − µ)/s is Student's t with n − 1 degrees of freedom
  – these can be used to develop confidence bounds or hypothesis tests for
    µ and σ², respectively
• the Student's t distribution with n degrees of freedom, denoted
  Student-t(n), has the following properties:
  – it looks like a standard normal as n → ∞
  – it is symmetric about 0
  – it has mean E_{Stu_n}[X] = 0 for n > 1 (the mean is undefined for n = 1)
  – it has variance V_{Stu_n}[X] = n/(n − 2) for n > 2 (the variance is
    undefined for n ≤ 2)
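
A simulation sketch of the bias of σ̂²_ML versus the unbiased σ̂²_u, assuming
numpy; averaging over many samples approximates the expectation over x:

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma2, n, trials = 0.0, 4.0, 10, 50_000

    samples = rng.normal(mu, np.sqrt(sigma2), size=(trials, n))
    var_ml = samples.var(axis=1, ddof=0)   # MLE: divide by n (biased)
    var_u = samples.var(axis=1, ddof=1)    # unbiased: divide by n - 1

    # E[var_ml] ~ ((n-1)/n) * sigma2 = 3.6, while E[var_u] ~ sigma2 = 4.0
    print(var_ml.mean(), var_u.mean())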

6 CLT and Confidence Intervals
• Central Limit Theorem (CLT): take a distribution with mean µ and variance
  σ², and sample n identical RVs X_1, ..., X_n from it; then the sample mean
  (1/n) Σ_{i=1}^n X_i is approximately distributed as N(µ, σ²/n) for large n.
  Likewise the sample sum Σ_{i=1}^n X_i is approximately distributed as
  N(nµ, nσ²) for large n.
• examples of the CLT
  – it is exact in the case of the Gaussian
  – for the binomial, bin(θ, n) ≈ N(nθ, nθ(1 − θ)) for n ≫ 1 and θ not
    near 0 or 1
  – for the Poisson, Pois(λ) ≈ N(λ, λ) for λ ≫ 1
• let X have the CDF P(X), and let Q(p) = P⁻¹(p) be the corresponding
  quantile function; then the (1 − α) two-sided confidence interval for X
  is given by
    [Q(α/2), Q(1 − α/2)]


• consider the case for Z ∼ N(0, 1):
  – let Z_{1−α/2} denote the upper α/2 quantile for N(0, 1)
  – we are 1 − α confident that Z ∼ N(0, 1) falls inside (−Z_{1−α/2}, Z_{1−α/2})
  – [−Z_{1−α/2}, Z_{1−α/2}] is called a (two-sided) confidence interval for
    N(0, 1)
  this corresponds to the unshaded central part of the standard normal curve
  (figure not reproduced here)

• let X have the CDF P(X), and let Q(p) = P⁻¹(p) be the corresponding
  quantile function; then the (1 − α) one-sided lower confidence interval
  for X is given by
    [−∞, Q(1 − α)]
  and the (1 − α) one-sided upper confidence interval for X is given by
    [Q(α), ∞]

• assume a dataset of count n with mean X̄ and sample variance S²:

    assumptions            parameter   interval
    Gaussian, σ² known     µ           X̄ ± Z_{α/2} σ/√n
    Gaussian, σ² unknown   µ           X̄ ± t_{α/2,n−1} S/√n

• assume a dataset of count n with mean X̄ and sample variance S², and a
  second dataset of count m with mean Ȳ and sample variance T²:


    assumptions                    parameter   interval
    Gaussian, σ1², σ2² known       µ1 − µ2     X̄ − Ȳ ± Z_{α/2} √(σ1²/n + σ2²/m)
    Gaussian, σ1² = σ2²            µ1 − µ2     X̄ − Ȳ ± t_{α/2,n+m−2} S_P √(1/n + 1/m),
      unknown but equal                        where S_P² = ((n−1)S² + (m−1)T²)/(n+m−2)
    Gaussian, σ1² ≠ σ2²            µ1 − µ2     use the 1st case with σ1² = S², σ2² = T²,
      unknown, using CLT                       assuming n, m are large

• for the Poisson, assume a dataset of count n with mean X̂; for the
  Bernoulli, assume a dataset of count n with mean p̂, plus a 2nd dataset of
  count m with mean q̂:

    assumptions                parameter   interval
    Poisson, λ unknown,        λ           X̂ ± Z_{α/2} √(X̂/n)
      using CLT
    Bernoulli, θ unknown,      θ           p̂ ± Z_{α/2} √(p̂(1 − p̂)/n)
      using CLT
    Bernoulli, θ1, θ2          θ1 − θ2     p̂ − q̂ ± Z_{α/2} √(p̂(1 − p̂)/n + q̂(1 − q̂)/m)
      unknown, using CLT
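
A sketch computing two of the intervals above, assuming numpy/scipy and
made-up data:

    import numpy as np
    from scipy import stats

    x = np.array([4.1, 5.3, 3.8, 6.0, 5.2, 4.7, 5.5, 4.9])  # made-up Gaussian data
    n, alpha = len(x), 0.05
    xbar, s = x.mean(), x.std(ddof=1)

    # Gaussian, sigma^2 unknown: Xbar +/- t_{alpha/2,n-1} S/sqrt(n)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    print(xbar - t_crit * s / np.sqrt(n), xbar + t_crit * s / np.sqrt(n))

    # Bernoulli via the CLT: p_hat +/- Z_{alpha/2} sqrt(p_hat(1-p_hat)/n)
    p_hat, m = 0.4, 200                     # illustrative proportion and count
    z_crit = stats.norm.ppf(1 - alpha / 2)
    half = z_crit * np.sqrt(p_hat * (1 - p_hat) / m)
    print(p_hat - half, p_hat + half)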

7 Hypothesis Tests
• given an arbitrary test statistic x with CDF P(X) (i.e. x could be z or
  t), the p-value is given by
    p = 2 P(−|x|)   if the null hypothesis is an equality
    p = 1 − P(x)    if the null hypothesis involves ≤
    p = P(x)        if the null hypothesis involves ≥
• assume a dataset of count n with mean X̄ and sample variance S²:


    assumptions            null-hypo.   test statistic
    Gaussian, σ² known     µ0           Z = (X̄ − µ0)/(σ/√n)
    Gaussian, σ² unknown   µ0           t_{n−1} = (X̄ − µ0)/(S/√n)

• assume a dataset of count n with mean X̄ and sample variance S², and a
  second dataset of count m with mean Ȳ and sample variance T²:

    assumptions                  null-hypo.   test statistic
    Gaussian, σ1², σ2² known     ∆µ0          Z = (X̄ − Ȳ − ∆µ0)/√(σ1²/n + σ2²/m)
    Gaussian, σ1² = σ2²          ∆µ0          t_{n+m−2} = (X̄ − Ȳ − ∆µ0)/(S_P √(1/n + 1/m)),
      unknown but equal                       where S_P² = ((n−1)S² + (m−1)T²)/(n+m−2)
    Gaussian, σ1² ≠ σ2²          ∆µ0          use the 1st case with σ1² = S², σ2² = T²,
      unknown, using CLT                      assuming n, m are large

• for the Poisson, assume a dataset of count n with mean X̂; for the
  Bernoulli, assume a dataset of count n with mean p̂, plus a 2nd dataset of
  count m with mean q̂; all use the CLT so require large samples (n, m):

    assumptions                null-hypo.   test statistic
    Poisson, λ unknown,        λ0           Z = (X̂ − λ0)/√(λ0/n)
      using CLT
    Bernoulli, θ unknown,      θ0           Z = (p̂ − θ0)/√(θ0(1 − θ0)/n)
      using CLT
    Bernoulli, θ1, θ2          ∆θ0          Z = (p̂ − q̂ − ∆θ0)/√(p̂(1 − p̂)/n + q̂(1 − q̂)/m);
      unknown, using CLT                    if ∆θ0 = 0 this reduces to
                                            Z = (p̂ − q̂)/√(r̂(1 − r̂)(1/n + 1/m)),
                                            where r̂ = (np̂ + mq̂)/(n + m)
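
A sketch of the one-sample t-test from the first table, with the p-value
computed by the equality-null rule above; scipy's ttest_1samp serves as a
cross-check (the data are made up):

    import numpy as np
    from scipy import stats

    x = np.array([5.1, 4.9, 5.6, 5.3, 4.8, 5.4, 5.2, 5.0])
    mu0 = 5.0                                  # null hypothesis: mu = mu0

    n = len(x)
    t_stat = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))

    # equality null hypothesis => p = 2 P(-|t|) under Student-t(n-1)
    p_value = 2 * stats.t.cdf(-abs(t_stat), df=n - 1)
    print(t_stat, p_value)
    print(stats.ttest_1samp(x, mu0))           # should agree with the manual values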


8 Linear Regression
• the simple least squares model has E[y_i | x_i] = β0 + β1 x_i and a
  residual sum of squares of
    RSS(β0, β1) = Σ_{i=1}^n (y_i − β0 − β1 x_i)²
• various intermediate formulas are used to calculate quantities, including
  the sums
    SS_XX = Σ_{i=1}^n (x_i − X̄)² = n( avg(x²) − X̄² )
    SS_XY = Σ_{i=1}^n (x_i − X̄)(y_i − Ȳ) = n( avg(xy) − X̄ Ȳ )
    SS_YY = Σ_{i=1}^n (y_i − Ȳ)² = n( avg(y²) − Ȳ² )
  where avg(·) denotes a sample average, e.g. avg(xy) = (1/n) Σ_{i=1}^n x_i y_i

• with this, the RSS can be minimised using the solution for β1 of
    β̂1 = SS_XY/SS_XX = ( avg(xy) − X̄ Ȳ ) / ( avg(x²) − X̄² )
  and the solution for β0 of
    β̂0 = Ȳ − β̂1 X̄ = ( Ȳ avg(x²) − avg(xy) X̄ ) / ( avg(x²) − X̄² )
  (a numpy sketch follows at the end of this section)

• giving an RSS at the minimum of
    RSS(β̂0, β̂1) = ( SS_YY SS_XX − SS²_XY ) / SS_XX = SS_YY − SS_XX β̂1²

• if we use the probability model y_i ∼ N(β0 + β1 x_i, σ²) then the negative
  log-likelihood becomes
    L(x, y | β0, β1, σ²) = (n/2) log(2πσ²) + RSS(β0, β1)/(2σ²)


• minimising this gives the same solutions for β0, β1 as before, and an
  estimator for σ²
    σ̂²_ML = (1/n) RSS(β̂0, β̂1)
  plus an unbiased estimator of σ² given by
    σ̂²_u = (1/(n − 2)) RSS(β̂0, β̂1)

• moreover, the following statistics can be used to develop confidence
  intervals or hypothesis tests:
    (1/σ²) RSS(β̂0, β̂1) ∼ χ²_{n−2}
    (β̂0 − β0) / √( (RSS/(n(n−2))) · avg(x²)/( avg(x²) − X̄² ) ) ∼ Student-t(n − 2)
    (β̂1 − β1) / √( (RSS/(n(n−2))) · 1/( avg(x²) − X̄² ) ) ∼ Student-t(n − 2)

• a measure of quality for the linear regression is the R² value, computed
  as
    R² = 1 − RSS/SS_YY = SS²_XY/(SS_XX SS_YY) = r²_XY
  which is in [0, 1]: 1 for a perfect zero-error fit, 0 for pure noise, and
  higher for a better quality fit

• for multi-linear regression, the prediction instead becomes
    E[y_i | x_{i,1}, ..., x_{i,p}] = β0 + Σ_{j=1}^p β_j x_{i,j}
  with a residual sum of squares (RSS) of
    RSS(β0, β1, ..., β_p) = Σ_{i=1}^n ( y_i − β0 − Σ_{j=1}^p β_j x_{i,j} )²


• the design matrix X of predictors is given by
    X = (1, x′_1, x′_2, ..., x′_p) =
        [ 1  x_{1,1}  x_{1,2}  ···  x_{1,p} ]
        [ 1  x_{2,1}  x_{2,2}  ···  x_{2,p} ]
        [ ⋮     ⋮        ⋮     ···     ⋮    ]
        [ 1  x_{n,1}  x_{n,2}  ···  x_{n,p} ]
  with corresponding parameters β^T = (β0, β1, ..., β_p), yielding the
  prediction
    E[y | X] = Xβ

• minimising RSS(β) has solution
    β̂ = (X^T X)⁻¹ (X^T Y)
  with
    RSS(β̂) = Y^T Y − β̂^T (X^T Y)

• if we use the probability model y_i ∼ N(x_i β, σ²) then the negative
  log-likelihood becomes
    L(X, y | β, σ²) = (n/2) log(2πσ²) + RSS(β)/(2σ²)
• minimising this gives the same solution for β as before, and an estimator
  for σ²
    σ̂²_ML = (1/n) RSS(β̂)
  plus an unbiased estimator of σ² given by
    σ̂²_u = (1/(n − p − 1)) RSS(β̂)
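
A numpy sketch of both fits on synthetic data: the simple-regression
coefficients from the SS sums should match the matrix solution with p = 1
(np.linalg.lstsq is used rather than an explicit matrix inverse, for numerical
stability):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 50
    x = rng.uniform(0, 10, n)
    y = 1.5 + 2.0 * x + rng.normal(0, 1, n)     # true beta0 = 1.5, beta1 = 2.0

    # simple regression via the SS sums
    ss_xy = np.sum((x - x.mean()) * (y - y.mean()))
    ss_xx = np.sum((x - x.mean()) ** 2)
    b1 = ss_xy / ss_xx                          # beta1_hat = SS_XY / SS_XX
    b0 = y.mean() - b1 * x.mean()               # beta0_hat = Ybar - beta1_hat Xbar

    # matrix form: beta_hat = (X^T X)^{-1} X^T Y, solved by least squares
    X = np.column_stack([np.ones(n), x])        # design matrix with intercept column
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

    rss = np.sum((y - X @ beta_hat) ** 2)
    r2 = 1 - rss / np.sum((y - y.mean()) ** 2)  # R^2 = 1 - RSS/SS_YY
    print(b0, b1, beta_hat, r2)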

9 Classification and Clustering
• the probability prediction formula for the naïve Bayes classifier is
    P(y | x_1, ..., x_p) = P(y) Π_{j=1}^p P(x_j | y) / P(x_1, ..., x_p)
  where the denominator is a constant, so it can be found by normalising the
  numerator (a toy example follows at the end of this section)


• point estimation is done by estimating the probabilities P(Y = y) and
  P(X_j = x_j | Y = y) for all entries of the tables
• the probability prediction formula for the logistic regression classifier
  is expressed using the logistic function
    p(Y_i = 1 | x_{i,1}, ..., x_{i,p}) = 1 / (1 + exp(−η_i))
  where
    η_i = β0 + Σ_{j=1}^p β_j x_{i,j}
  so that the log-odds are given by
    log [ p(Y_i = 1 | x_{i,1}, ..., x_{i,p}) / p(Y_i = 0 | x_{i,1}, ..., x_{i,p}) ] = η_i
• the parameters (β0, β1, ..., β_p) are fit by applying optimisation routines
  to the log-likelihood
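
A toy naïve Bayes prediction, assuming made-up probability tables for a binary
class and two binary predictors:

    # hypothetical tables: P(y) and P(x_j | y), for y in {0,1} and x_j in {0,1}
    p_y = {0: 0.6, 1: 0.4}
    p_x1_given_y = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}  # p_x1_given_y[y][x1]
    p_x2_given_y = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.1, 1: 0.9}}

    x1, x2 = 1, 1                      # observed predictor values
    # unnormalised posterior: P(y) * prod_j P(x_j | y)
    scores = {y: p_y[y] * p_x1_given_y[y][x1] * p_x2_given_y[y][x2] for y in p_y}
    total = sum(scores.values())       # the constant denominator P(x1, x2)
    posterior = {y: s / total for y, s in scores.items()}
    print(posterior)                   # e.g. P(y = 1 | x1 = 1, x2 = 1)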

10 More Classification
• log₂(x) = log_c(x)/log_c(2), where c is any base
• define the entropy (to base 2 by default)
    H(X) = E[log₂ 1/p(X)]
• define the conditional entropy (to base 2 by default)
    H(X|Y) = Σ_y p(Y=y) H(X|Y=y)
  where H(X|Y=y) is the entropy of the conditional distribution p(X|Y=y)
• some properties of entropy, where X has discrete domain X:
  – if X has a finite domain of size K, then 0 ≤ H(X) ≤ log₂ K


  – if H(X) = 0 then p(X=x) = 1 for some x ∈ X
• some useful rules for RVs X, Y
  – H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
• if X, Y are independent RVs
  – H(X, Y) = H(X) + H(Y)
• the information gain for predictor RV X and target RV Y is defined as
  – I.G.(Y, X) = H(Y) − H(Y|X)
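
A sketch computing entropy, conditional entropy, and information gain for a
small made-up joint distribution, assuming numpy:

    import numpy as np

    # hypothetical joint distribution p(X=x, Y=y); rows index x, columns index y
    p_xy = np.array([[0.3, 0.1],
                     [0.1, 0.5]])

    def entropy(p):
        # H = sum of p * log2(1/p), skipping zero-probability entries
        p = p[p > 0]
        return float(np.sum(p * np.log2(1 / p)))

    p_y = p_xy.sum(axis=0)                       # marginal p(Y=y)
    h_y = entropy(p_y)

    # conditional entropy H(Y|X) = sum over x of p(x) H(Y | X=x)
    p_x = p_xy.sum(axis=1)
    h_y_given_x = sum(p_x[i] * entropy(p_xy[i] / p_x[i]) for i in range(len(p_x)))

    print("I.G.(Y, X) =", h_y - h_y_given_x)     # information gain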


11 Tables for Standard Normal


Tables from http://www.z-table.com/: one table for z-values less than 0 and
one for z-values greater than 0, to help you find p = F(z).

Table for z-values less than 0.


Table for z-values greater than 0.


12 Table for Student t


Table from http://www.ttable.org/. Provides critical t-values for specific
significance values for one- and two-sided t-tests.


13 Calculus

