
FIT5197, 2021 Semester 1, Formula Sheet

Daniel Schmidt, Wray Buntine, Levin Kuhlmann


May 31, 2021

1 Sample Statistics
• quartiles, percentiles, etc.: given n data points, rank them in increasing
  value to get x_1, ..., x_n (a Python sketch of these statistics follows at the
  end of this section)
  – median: if n is odd, given by x_{(n+1)/2}; if n is even, given by
    (1/2)(x_{n/2} + x_{n/2+1})
  – quartiles: Q_1 or Q_3 is given by Q_k = x_p + (q/4)(x_{p+1} − x_p), where
    p = floor(k(n + 1)/4) and q = k(n + 1) mod 4
  – percentiles: P_k = x_p + (q/100)(x_{p+1} − x_p), where
    p = floor(k(n + 1)/100) and q = k(n + 1) mod 100
• measures of spread for n data points, x = (x_1, ..., x_n)
  – sample variance, var(x) = s²_x = (1/(n − 1)) Σ_{i=1}^n (x_i − x̄)²
  – sample standard deviation, s_x = √(s²_x)
  – range = max_i x_i − min_i x_i
  – inter-quartile range, IQR = Q_3 − Q_1
• sample covariance
    q_xy = (1/(n − 1)) Σ_{i=1}^n (x_i − x̄)(y_i − ȳ)

• sample correlation coefficient
    r_xy = q_xy/(s_x s_y)
         = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / √( Σ_{i=1}^n (x_i − x̄)² · Σ_{i=1}^n (y_i − ȳ)² )


• boxplots
  – represent numerical data through quartiles
  – the lower hinge is Q_1, the upper hinge is Q_3
  – the lower whisker is drawn at the minimum data value greater than the
    lower inner fence, Q_1 − 1.5 × IQR (the fence itself is usually not drawn)
  – the upper whisker is drawn at the maximum data value less than the
    upper inner fence, Q_3 + 1.5 × IQR (the fence itself is usually not drawn)
  – outliers are highlighted outside these two fences
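
A minimal Python sketch of the statistics in this section, assuming numpy and
made-up data; the quartile function follows the rank-based rule above, which
can differ slightly from numpy's default percentile interpolation.

    import numpy as np

    def quartile(x, k):
        # Q_k = x_p + (q/4)(x_{p+1} - x_p), p = floor(k(n+1)/4), q = k(n+1) mod 4
        xs = np.sort(np.asarray(x, dtype=float))
        n = len(xs)
        p, q = divmod(k * (n + 1), 4)
        if p < 1:
            return xs[0]
        if p >= n:                      # rank falls past the last observation
            return xs[-1]
        return xs[p - 1] + (q / 4) * (xs[p] - xs[p - 1])  # 1-based rank -> index p-1

    x = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 11.0])
    y = np.array([1.0, 3.0, 5.0, 4.0, 8.0, 9.0, 14.0])

    q1, q3 = quartile(x, 1), quartile(x, 3)
    iqr = q3 - q1                                    # inter-quartile range
    fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)        # boxplot inner fences

    var_x = x.var(ddof=1)                            # sample variance, 1/(n-1) scaling
    cov_xy = np.cov(x, y)[0, 1]                      # sample covariance q_xy
    r_xy = cov_xy / (x.std(ddof=1) * y.std(ddof=1))  # sample correlation r_xy
    print(np.median(x), q1, q3, iqr, fences, var_x, cov_xy, r_xy)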

2 Probability
• probability axioms of Kolmogorov:
  1. for any event A, 0 ≤ p(A) ≤ 1
  2. p(Ω) = 1, where Ω is the universal set, the set of everything
  3. for mutually exclusive events A_1, ..., A_n:
     p(A_1 ∪ A_2 ∪ ... ∪ A_n) = Σ_{i=1}^n p(A_i)

• other probability identities for the domain X × Y where A, B are any
  events:
  – complement rule, p(Ā) = 1 − p(A)
  – product rule, p(B ∩ A) = p(B|A) p(A)
  – addition rule for 2 sets, p(A ∪ B) = p(A) + p(B) − p(A ∩ B)
  – sum rule, p(A) = Σ_{x∈X} p(x ∩ A)
  – conditional probability, p(B|A) = p(B ∩ A)/p(A) when p(A) > 0
  – Bayes theorem, p(x|A) = p(A|x) p(x) / Σ_{x∈X} p(A|x) p(x)
    (a numeric check follows at the end of this section)


• for continuous random variables a probability density function (PDF)
  p(x) on domain X satisfies
    p(x) ≥ 0 for all x ∈ X
  and
    ∫_X p(x) dx = 1
• for continuous random variables, the probability that X ∈ A, where A ⊂ X,
  is
    p(X ∈ A) = ∫_A p(x) dx

• for X a single dimension, define the cumulative distribution function
  (CDF), P(x), in terms of the PDF p(x) as
    P(x) = ∫_{y<x} p(y) dy
  and the quantile function Q(x) as
    Q(x) = P⁻¹(x)
  which is well defined when p(x) > 0.
• Let the random variable pair (X, Y ) be from domain X × Y. We say
X and Y are independent if any of the following three (equivalent)
conditions hold for all x ∈ X and y ∈ Y
(I) p(X=x|Y =y) = p(X=x) when p(Y =y) > 0
(II) p(Y =y|X=x) = p(Y =y) when p(X=x) > 0
(III) p(Y =y ∩ X=x) = p(X=x)p(Y =y)
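
A quick numeric check of the sum rule and Bayes theorem, using a made-up
two-value domain (all numbers are illustrative):

    # hypothetical domain: x in {0, 1} with prior p(x) and likelihood p(A|x)
    p_x = {0: 0.7, 1: 0.3}
    p_A_given_x = {0: 0.1, 1: 0.8}

    # sum rule: p(A) = sum over x of p(x ∩ A), with p(x ∩ A) = p(A|x) p(x)
    p_A = sum(p_A_given_x[x] * p_x[x] for x in p_x)

    # Bayes theorem: p(x|A) = p(A|x) p(x) / p(A)
    p_x_given_A = {x: p_A_given_x[x] * p_x[x] / p_A for x in p_x}

    assert abs(sum(p_x_given_A.values()) - 1.0) < 1e-12  # the posterior normalises
    print(p_A, p_x_given_A)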

3 Expected Values
• if X has domain X, the expectation and variance of f(X) are
    E[f(X)] = Σ_{x∈X} p(x) f(x)
    V[f(X)] = E[(f(X) − E[f(X)])²] = E[f(X)²] − E[f(X)]²
  with integrals replacing sums for continuous RVs


• some useful rules for RVs X, Y and constant c
  – E[f(X) + g(Y)] = E[f(X)] + E[g(Y)]
  – E[c f(X)] = c E[f(X)]
  – V[c f(X)] = c² V[f(X)]
• if X, Y are independent RVs
  – E[f(X) g(Y)] = E[f(X)] E[g(Y)]
  – V[f(X) + g(Y)] = V[f(X)] + V[g(Y)]
• Chebyshev's inequality: if X is a RV with mean µ and variance σ², then
  for any k > 0
    p(|X − µ|/σ ≥ k) ≤ 1/k²
• weak law of large numbers: let X_1, ..., X_n be independent RVs with
  E[X_i] = µ; then for any ε > 0
    p(|(X_1 + · · · + X_n)/n − µ| > ε) → 0 as n → ∞.
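
An illustrative simulation of Chebyshev's inequality and the weak law of large
numbers, assuming numpy; the Gaussian is an arbitrary choice of distribution:

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 2.0, 3.0

    # Chebyshev: the fraction of draws with |X - mu|/sigma >= k is at most 1/k^2
    x = rng.normal(mu, sigma, size=100_000)
    k = 2.0
    print((np.abs(x - mu) / sigma >= k).mean(), "<=", 1 / k**2)

    # weak law of large numbers: the sample mean concentrates around mu
    for n in (10, 1_000, 100_000):
        print(n, rng.normal(mu, sigma, size=n).mean())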

4 Distributions
• for the Gaussian or normal distribution, denoted N(µ, σ²),
    p(x | µ, σ²) = (1/(2πσ²))^{1/2} exp( −(x − µ)²/(2σ²) )
  which has the properties
  – E[x] = µ and V[x] = σ²
  – the mode and the median are the same as the mean
  – if the curve for p(x | 0, 1) is shifted to the right by µ and scaled by
    σ, one gets the curve for p(x | µ, σ²)
• the discrete uniform distribution, denoted U(a, b), models discrete RVs
  and follows
    P(X = k | a, b) = 1/(b − a + 1)
  where X ∈ {a, ..., b} with b ≥ a, and has properties
  – E[X] = (a + b)/2 and V[X] = ((b − a + 1)² − 1)/12

• the continuous uniform distribution, denoted U(a, b), models continuous
  RVs with pdf
    p(x | a, b) = 0 for x < a;  1/(b − a) for a ≤ x ≤ b;  0 for x > b
  where b > a and
  – the quantity a determines the start of the distribution
  – the quantity w = b − a is the width of the distribution
  – E[X] = (a + b)/2 = a + w/2 and V[X] = (b − a)²/12 = w²/12

• the Bernoulli distribution models discrete, binary RVs, i.e., X = {0, 1},
  denoted Be(θ), with
    p(X = 1 | θ) = θ,  θ ∈ [0, 1]
  so that the parametric probability distribution is
    p(x | θ) = θ^x (1 − θ)^(1−x)
  and has properties
  – E[x] = θ and V[x] = θ(1 − θ)

• the binomial distribution describes the probability of getting x successful
  outcomes in n Bernoulli trials with probability of success θ, denoted
  bin(θ, n); for x ∈ {0, 1, ..., n},
    p(x | n, θ) = (n choose x) θ^x (1 − θ)^(n−x)
  and has properties
  – E[x] = nθ and V[x] = nθ(1 − θ)


• the Poisson distribution with rate parameter λ, denoted Pois(λ), models
  the number of events x occurring, for X = {0} ∪ N,
    p(x | λ) = λ^x exp(−λ) / x!
  and has properties
  – E[x] = λ and V[x] = λ
  – if X ∼ Pois(λ_X) and Y ∼ Pois(λ_Y) then (X + Y) ∼ Pois(λ_X + λ_Y)
  – bin(θ, n) ≈ Pois(nθ) for n ≫ 1 and nθ small (a numeric check follows below)
• note: the Central Limit Theorem (CLT) has been moved to Section 6 of this
  document.
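
A sketch checking the binomial pmf formula and the Poisson approximation above,
assuming scipy is available for the cross-check:

    from math import comb, exp, factorial
    from scipy import stats

    n, theta, x = 100, 0.02, 3               # n >> 1 and n*theta small

    pmf_formula = comb(n, x) * theta**x * (1 - theta)**(n - x)
    pmf_scipy = stats.binom.pmf(x, n, theta)  # library cross-check

    lam = n * theta                           # approximation bin(theta, n) ~ Pois(n*theta)
    pois_formula = lam**x * exp(-lam) / factorial(x)

    print(pmf_formula, pmf_scipy, pois_formula)  # all close to 0.18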

5 Estimation
• have a sample x; let θ̂(x) be a point estimate for model parameter θ;
  then θ̂(x) is unbiased if E_x[θ̂(x)] = θ, where the expectation is taken
  over samples x
• the bias of the estimator is
    b_θ(θ̂) = E_x[θ̂(x)] − θ
• the variance of the estimator is
    V_θ[θ̂] = E_x[ (θ̂(x) − E_x[θ̂(x)])² ]
• the mean square error (MSE) of the estimator is
    MSE_θ[θ̂] = E_x[ (θ̂(x) − θ)² ] = b_θ(θ̂)² + V_θ[θ̂]
• for a sample x of size n distributed as N(µ, σ²), the sum of squared
  errors (SSE) of mean estimate µ is given by
    SSE(µ) = Σ_{i=1}^n (x_i − µ)²


  and the point estimate µ̂ minimising the SSE is the sample mean
    µ̂ = (1/n) Σ_{i=1}^n x_i

• the method of maximum likelihood says we should use the model that
  assigns the greatest probability to the data we have observed; formally,
  the maximum likelihood estimator (MLE) is found by solving
    Θ̂ = arg max_Θ { p(x | Θ) }
  where p(x | Θ) is called the likelihood function
• use L(x | Θ) to denote the negative log-likelihood, log 1/p(x | Θ)

• for a sample x of size n distributed as N(µ, σ²),
    L(x | µ, σ²) = (n/2) log(2πσ²) + SSE(µ)/(2σ²)
  from this we get
  – µ̂_ML is the sample mean, the same as when using the SSE
  – the MLE for the variance is
      σ̂²_ML = (1/n) Σ_{i=1}^n (x_i − µ̂)²
    this is however biased; an unbiased estimate is
      σ̂²_u = (1/(n − 1)) Σ_{i=1}^n (x_i − µ̂)²
    (a simulation of this bias follows at the end of this section)

• the MLE estimates for λ of the Poisson and θ of the Bernoulli are also
  the sample mean
• the MLE estimate for θ of the binomial, bin(θ, m), using a sample x of
  size n, is
    θ̂_ML = (1/(nm)) Σ_{i=1}^n x_i


• let x be a sample of size n from a Gaussian population with mean µ and
  variance σ², and let m be the sample mean and s² the sample variance:
  – m is Gaussian with mean µ and variance σ²/n
  – √n (m − µ)/s is Student's t with n − 1 degrees of freedom
  – these can be used to develop confidence bounds or hypothesis tests for
    µ and σ², respectively
• the Student's t distribution with n degrees of freedom, denoted
  Student-t(n), has the following properties:
  – it looks like a standard normal as n → ∞
  – it is symmetric about 0
  – it has mean E_{Stu_n}[X] = 0 for n > 1 (the mean is undefined for n = 1)
  – it has variance V_{Stu_n}[X] = n/(n − 2) for n > 2 (the variance is
    undefined for n ≤ 2)
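
A simulation sketch of the bias of σ̂²_ML versus the unbiased σ̂²_u, assuming
numpy; averaging over many samples approximates the expectation over x:

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma2, n, trials = 0.0, 4.0, 10, 50_000

    samples = rng.normal(mu, np.sqrt(sigma2), size=(trials, n))
    var_ml = samples.var(axis=1, ddof=0)   # MLE: divide by n (biased)
    var_u = samples.var(axis=1, ddof=1)    # unbiased: divide by n - 1

    # E[var_ml] ~ ((n-1)/n) * sigma2 = 3.6, while E[var_u] ~ sigma2 = 4.0
    print(var_ml.mean(), var_u.mean())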

6 CLT and Confidence Intervals
• Central Limit Theorem (CLT): take a distribution with mean µ and variance
  σ², and sample n identical RVs X_1, ..., X_n from it; then the sample mean
  (1/n) Σ_{i=1}^n X_i is approximately distributed as N(µ, σ²/n) for large n.
  Likewise the sample sum Σ_{i=1}^n X_i is approximately distributed as
  N(nµ, nσ²) for large n.
• examples of the CLT
  – it is exact in the case of the Gaussian
  – for the binomial, bin(θ, n) ≈ N(nθ, nθ(1 − θ)) for n ≫ 1 and θ not
    near 0 or 1
  – for the Poisson, Pois(λ) ≈ N(λ, λ) for λ ≫ 1
• let X have the CDF P(X), and let Q(p) = P⁻¹(p) be the corresponding
  quantile function; then the (1 − α) two-sided confidence interval for X
  is given by
    [Q(α/2), Q(1 − α/2)]


• consider the case for Z ∼ N(0, 1):
  – let Z_{1−α/2} denote the upper α/2 quantile for N(0, 1)
  – we are 1 − α confident that Z ∼ N(0, 1) falls inside (−Z_{1−α/2}, Z_{1−α/2})
  – [−Z_{1−α/2}, Z_{1−α/2}] is called a (two-sided) confidence interval for
    N(0, 1)
  this corresponds to the unshaded central part of the standard normal curve
  (figure not reproduced here)

• let X have the CDF P(X), and let Q(p) = P⁻¹(p) be the corresponding
  quantile function; then the (1 − α) one-sided lower confidence interval
  for X is given by
    [−∞, Q(1 − α)]
  and the (1 − α) one-sided upper confidence interval for X is given by
    [Q(α), ∞]

• assume a dataset of count n with mean X̄ and sample variance S²:

    assumptions            parameter   interval
    Gaussian, σ² known     µ           X̄ ± Z_{α/2} σ/√n
    Gaussian, σ² unknown   µ           X̄ ± t_{α/2,n−1} S/√n

• assume a dataset of count n with mean X̄ and sample variance S², and a
  second dataset of count m with mean Ȳ and sample variance T²:


    assumptions                    parameter   interval
    Gaussian, σ1², σ2² known       µ1 − µ2     X̄ − Ȳ ± Z_{α/2} √(σ1²/n + σ2²/m)
    Gaussian, σ1² = σ2²            µ1 − µ2     X̄ − Ȳ ± t_{α/2,n+m−2} S_P √(1/n + 1/m),
      unknown but equal                        where S_P² = ((n−1)S² + (m−1)T²)/(n+m−2)
    Gaussian, σ1² ≠ σ2²            µ1 − µ2     use the 1st case with σ1² = S², σ2² = T²,
      unknown, using CLT                       assuming n, m are large

• for the Poisson, assume a dataset of count n with mean X̂; for the
  Bernoulli, assume a dataset of count n with mean p̂, plus a 2nd dataset of
  count m with mean q̂:

    assumptions                parameter   interval
    Poisson, λ unknown,        λ           X̂ ± Z_{α/2} √(X̂/n)
      using CLT
    Bernoulli, θ unknown,      θ           p̂ ± Z_{α/2} √(p̂(1 − p̂)/n)
      using CLT
    Bernoulli, θ1, θ2          θ1 − θ2     p̂ − q̂ ± Z_{α/2} √(p̂(1 − p̂)/n + q̂(1 − q̂)/m)
      unknown, using CLT
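
A sketch computing two of the intervals above, assuming numpy/scipy and
made-up data:

    import numpy as np
    from scipy import stats

    x = np.array([4.1, 5.3, 3.8, 6.0, 5.2, 4.7, 5.5, 4.9])  # made-up Gaussian data
    n, alpha = len(x), 0.05
    xbar, s = x.mean(), x.std(ddof=1)

    # Gaussian, sigma^2 unknown: Xbar +/- t_{alpha/2,n-1} S/sqrt(n)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    print(xbar - t_crit * s / np.sqrt(n), xbar + t_crit * s / np.sqrt(n))

    # Bernoulli via the CLT: p_hat +/- Z_{alpha/2} sqrt(p_hat(1-p_hat)/n)
    p_hat, m = 0.4, 200                     # illustrative proportion and count
    z_crit = stats.norm.ppf(1 - alpha / 2)
    half = z_crit * np.sqrt(p_hat * (1 - p_hat) / m)
    print(p_hat - half, p_hat + half)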

7 Hypothesis Tests
• given an arbitrary test statistic x with CDF P(X) (i.e. x could be z or
  t), the p-value is given by
    p = 2 P(−|x|)   if the null hypothesis is an equality
    p = 1 − P(x)    if the null hypothesis involves ≤
    p = P(x)        if the null hypothesis involves ≥
• assume a dataset of count n with mean X̄ and sample variance S²:


    assumptions            null-hypo.   test statistic
    Gaussian, σ² known     µ0           Z = (X̄ − µ0)/(σ/√n)
    Gaussian, σ² unknown   µ0           t_{n−1} = (X̄ − µ0)/(S/√n)

• assume a dataset of count n with mean X̄ and sample variance S², and a
  second dataset of count m with mean Ȳ and sample variance T²:

    assumptions                  null-hypo.   test statistic
    Gaussian, σ1², σ2² known     ∆µ0          Z = (X̄ − Ȳ − ∆µ0)/√(σ1²/n + σ2²/m)
    Gaussian, σ1² = σ2²          ∆µ0          t_{n+m−2} = (X̄ − Ȳ − ∆µ0)/(S_P √(1/n + 1/m)),
      unknown but equal                       where S_P² = ((n−1)S² + (m−1)T²)/(n+m−2)
    Gaussian, σ1² ≠ σ2²          ∆µ0          use the 1st case with σ1² = S², σ2² = T²,
      unknown, using CLT                      assuming n, m are large

• for the Poisson, assume a dataset of count n with mean X̂; for the
  Bernoulli, assume a dataset of count n with mean p̂, plus a 2nd dataset of
  count m with mean q̂; all use the CLT so require large samples (n, m):

    assumptions                null-hypo.   test statistic
    Poisson, λ unknown,        λ0           Z = (X̂ − λ0)/√(λ0/n)
      using CLT
    Bernoulli, θ unknown,      θ0           Z = (p̂ − θ0)/√(θ0(1 − θ0)/n)
      using CLT
    Bernoulli, θ1, θ2          ∆θ0          Z = (p̂ − q̂ − ∆θ0)/√(p̂(1 − p̂)/n + q̂(1 − q̂)/m);
      unknown, using CLT                    if ∆θ0 = 0 this reduces to
                                            Z = (p̂ − q̂)/√(r̂(1 − r̂)(1/n + 1/m)),
                                            where r̂ = (np̂ + mq̂)/(n + m)
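
A sketch of the one-sample t-test from the first table, with the p-value
computed by the equality-null rule above; scipy's ttest_1samp serves as a
cross-check (the data are made up):

    import numpy as np
    from scipy import stats

    x = np.array([5.1, 4.9, 5.6, 5.3, 4.8, 5.4, 5.2, 5.0])
    mu0 = 5.0                                  # null hypothesis: mu = mu0

    n = len(x)
    t_stat = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))

    # equality null hypothesis => p = 2 P(-|t|) under Student-t(n-1)
    p_value = 2 * stats.t.cdf(-abs(t_stat), df=n - 1)
    print(t_stat, p_value)
    print(stats.ttest_1samp(x, mu0))           # should agree with the manual values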


8 Linear Regression
• the simple least squares model has E[y_i | x_i] = β0 + β1 x_i and a
  residual sum of squares of
    RSS(β0, β1) = Σ_{i=1}^n (y_i − β0 − β1 x_i)²
• various intermediate formulas are used to calculate quantities, including
  the sums
    SS_XX = Σ_{i=1}^n (x_i − X̄)² = n( avg(x²) − X̄² )
    SS_XY = Σ_{i=1}^n (x_i − X̄)(y_i − Ȳ) = n( avg(xy) − X̄ Ȳ )
    SS_YY = Σ_{i=1}^n (y_i − Ȳ)² = n( avg(y²) − Ȳ² )
  where avg(·) denotes a sample average, e.g. avg(xy) = (1/n) Σ_{i=1}^n x_i y_i

• with this, the RSS can be minimised using the solution for β1 of
    β̂1 = SS_XY/SS_XX = ( avg(xy) − X̄ Ȳ ) / ( avg(x²) − X̄² )
  and the solution for β0 of
    β̂0 = Ȳ − β̂1 X̄ = ( Ȳ avg(x²) − avg(xy) X̄ ) / ( avg(x²) − X̄² )
  (a numpy sketch follows at the end of this section)

• giving an RSS at the minimum of
    RSS(β̂0, β̂1) = ( SS_YY SS_XX − SS²_XY ) / SS_XX = SS_YY − SS_XX β̂1²

• if we use the probability model y_i ∼ N(β0 + β1 x_i, σ²) then the negative
  log-likelihood becomes
    L(x, y | β0, β1, σ²) = (n/2) log(2πσ²) + RSS(β0, β1)/(2σ²)


• minimising this gives the same solutions for β0, β1 as before, and an
  estimator for σ²
    σ̂²_ML = (1/n) RSS(β̂0, β̂1)
  plus an unbiased estimator of σ² given by
    σ̂²_u = (1/(n − 2)) RSS(β̂0, β̂1)

• moreover, the following statistics can be used to develop confidence
  intervals or hypothesis tests:
    (1/σ²) RSS(β̂0, β̂1) ∼ χ²_{n−2}
    (β̂0 − β0) / √( (RSS/(n(n−2))) · avg(x²)/( avg(x²) − X̄² ) ) ∼ Student-t(n − 2)
    (β̂1 − β1) / √( (RSS/(n(n−2))) · 1/( avg(x²) − X̄² ) ) ∼ Student-t(n − 2)

• a measure of quality for the linear regression is the R² value, computed
  as
    R² = 1 − RSS/SS_YY = SS²_XY/(SS_XX SS_YY) = r²_XY
  which is in [0, 1]: 1 for a perfect zero-error fit, 0 for pure noise, and
  higher for a better quality fit

• for multi-linear regression, the prediction instead becomes
    E[y_i | x_{i,1}, ..., x_{i,p}] = β0 + Σ_{j=1}^p β_j x_{i,j}
  with a residual sum of squares (RSS) of
    RSS(β0, β1, ..., β_p) = Σ_{i=1}^n ( y_i − β0 − Σ_{j=1}^p β_j x_{i,j} )²


• the design matrix X of predictors is given by
    X = (1, x′_1, x′_2, ..., x′_p) =
        [ 1  x_{1,1}  x_{1,2}  ···  x_{1,p} ]
        [ 1  x_{2,1}  x_{2,2}  ···  x_{2,p} ]
        [ ⋮     ⋮        ⋮     ···     ⋮    ]
        [ 1  x_{n,1}  x_{n,2}  ···  x_{n,p} ]
  with corresponding parameters β^T = (β0, β1, ..., β_p), yielding the
  prediction
    E[y | X] = Xβ

• minimising RSS(β) has solution
    β̂ = (X^T X)⁻¹ (X^T Y)
  with
    RSS(β̂) = Y^T Y − β̂^T (X^T Y)

• if we use the probability model y_i ∼ N(x_i β, σ²) then the negative
  log-likelihood becomes
    L(X, y | β, σ²) = (n/2) log(2πσ²) + RSS(β)/(2σ²)
• minimising this gives the same solution for β as before, and an estimator
  for σ²
    σ̂²_ML = (1/n) RSS(β̂)
  plus an unbiased estimator of σ² given by
    σ̂²_u = (1/(n − p − 1)) RSS(β̂)
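
A numpy sketch of both fits on synthetic data: the simple-regression
coefficients from the SS sums should match the matrix solution with p = 1
(np.linalg.lstsq is used rather than an explicit matrix inverse, for numerical
stability):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 50
    x = rng.uniform(0, 10, n)
    y = 1.5 + 2.0 * x + rng.normal(0, 1, n)     # true beta0 = 1.5, beta1 = 2.0

    # simple regression via the SS sums
    ss_xy = np.sum((x - x.mean()) * (y - y.mean()))
    ss_xx = np.sum((x - x.mean()) ** 2)
    b1 = ss_xy / ss_xx                          # beta1_hat = SS_XY / SS_XX
    b0 = y.mean() - b1 * x.mean()               # beta0_hat = Ybar - beta1_hat Xbar

    # matrix form: beta_hat = (X^T X)^{-1} X^T Y, solved by least squares
    X = np.column_stack([np.ones(n), x])        # design matrix with intercept column
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

    rss = np.sum((y - X @ beta_hat) ** 2)
    r2 = 1 - rss / np.sum((y - y.mean()) ** 2)  # R^2 = 1 - RSS/SS_YY
    print(b0, b1, beta_hat, r2)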

9 Classification and Clustering
• the probability prediction formula for the naïve Bayes classifier is
    P(y | x_1, ..., x_p) = P(y) Π_{j=1}^p P(x_j | y) / P(x_1, ..., x_p)
  where the denominator is a constant, so it can be found by normalising the
  numerator (a toy example follows at the end of this section)


• point estimation is done by estimating the probabilities P(Y = y) and
  P(X_j = x_j | Y = y) for all entries of the tables
• the probability prediction formula for the logistic regression classifier
  is expressed using the logistic function
    p(Y_i = 1 | x_{i,1}, ..., x_{i,p}) = 1 / (1 + exp(−η_i))
  where
    η_i = β0 + Σ_{j=1}^p β_j x_{i,j}
  so that the log-odds are given by
    log [ p(Y_i = 1 | x_{i,1}, ..., x_{i,p}) / p(Y_i = 0 | x_{i,1}, ..., x_{i,p}) ] = η_i
• the parameters (β0, β1, ..., β_p) are fit by applying optimisation routines
  to the log-likelihood
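
A toy naïve Bayes prediction, assuming made-up probability tables for a binary
class and two binary predictors:

    # hypothetical tables: P(y) and P(x_j | y), for y in {0,1} and x_j in {0,1}
    p_y = {0: 0.6, 1: 0.4}
    p_x1_given_y = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}  # p_x1_given_y[y][x1]
    p_x2_given_y = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.1, 1: 0.9}}

    x1, x2 = 1, 1                      # observed predictor values
    # unnormalised posterior: P(y) * prod_j P(x_j | y)
    scores = {y: p_y[y] * p_x1_given_y[y][x1] * p_x2_given_y[y][x2] for y in p_y}
    total = sum(scores.values())       # the constant denominator P(x1, x2)
    posterior = {y: s / total for y, s in scores.items()}
    print(posterior)                   # e.g. P(y = 1 | x1 = 1, x2 = 1)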

10 More Classification
• log₂(x) = log_c(x)/log_c(2), where c is any base
• define the entropy (to base 2 by default)
    H(X) = E[log₂ 1/p(X)]
• define the conditional entropy (to base 2 by default)
    H(X|Y) = Σ_y p(Y=y) H(X|Y=y)
  where H(X|Y=y) is the entropy of the conditional distribution p(X|Y=y)
• some properties of entropy, where X has discrete domain X:
  – if X has a finite domain of size K, then 0 ≤ H(X) ≤ log₂ K


  – if H(X) = 0 then p(X=x) = 1 for some x ∈ X
• some useful rules for RVs X, Y
  – H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
• if X, Y are independent RVs
  – H(X, Y) = H(X) + H(Y)
• the information gain for predictor RV X and target RV Y is defined as
  – I.G.(Y, X) = H(Y) − H(Y|X)
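
A sketch computing entropy, conditional entropy, and information gain for a
small made-up joint distribution, assuming numpy:

    import numpy as np

    # hypothetical joint distribution p(X=x, Y=y); rows index x, columns index y
    p_xy = np.array([[0.3, 0.1],
                     [0.1, 0.5]])

    def entropy(p):
        # H = sum of p * log2(1/p), skipping zero-probability entries
        p = p[p > 0]
        return float(np.sum(p * np.log2(1 / p)))

    p_y = p_xy.sum(axis=0)                       # marginal p(Y=y)
    h_y = entropy(p_y)

    # conditional entropy H(Y|X) = sum over x of p(x) H(Y | X=x)
    p_x = p_xy.sum(axis=1)
    h_y_given_x = sum(p_x[i] * entropy(p_xy[i] / p_x[i]) for i in range(len(p_x)))

    print("I.G.(Y, X) =", h_y - h_y_given_x)     # information gain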


11 Tables for Standard Normal


Tables from http://www.z-table.com/: one table for z-values less than 0 and
one for z-values greater than 0, to help you find p = F(z).

Table for z-values less than 0.


Table for z-values greater than 0.


12 Table for Student t


Table from http://www.ttable.org/. Provides critical t-values for specific
significance values for one- and two-sided t-tests.


13 Calculus

