
Probability and Statistics

Cheat Sheet

Copyright © Matthias Vallentin, 2011
[email protected]

6th March, 2011


This cheat sheet integrates a variety of topics in probability theory and statistics. It is based on literature [1, 6, 3] and in-class material from courses of the statistics department at the University of California, Berkeley, but is also influenced by other sources [4, 5]. If you find errors or have suggestions for further topics, I would appreciate it if you sent me an email. The most recent version of this document is available at http://bit.ly/probstat. To reproduce, please contact me.

Contents

1 Distribution Overview
  1.1 Discrete Distributions
  1.2 Continuous Distributions
2 Probability Theory
3 Random Variables
  3.1 Transformations
4 Expectation
5 Variance
6 Inequalities
7 Distribution Relationships
8 Probability and Moment Generating Functions
9 Multivariate Distributions
  9.1 Standard Bivariate Normal
  9.2 Bivariate Normal
  9.3 Multivariate Normal
10 Convergence
  10.1 Law of Large Numbers (LLN)
  10.2 Central Limit Theorem (CLT)
11 Statistical Inference
  11.1 Point Estimation
  11.2 Normal-based Confidence Interval
  11.3 Empirical Distribution Function
  11.4 Statistical Functionals
12 Parametric Inference
  12.1 Method of Moments
  12.2 Maximum Likelihood
    12.2.1 Delta Method
  12.3 Multiparameter Models
    12.3.1 Multiparameter Delta Method
  12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Bayesian Inference
  14.1 Credible Intervals
  14.2 Function of Parameters
  14.3 Priors
    14.3.1 Conjugate Priors
  14.4 Bayesian Testing
15 Exponential Family
16 Sampling Methods
  16.1 The Bootstrap
    16.1.1 Bootstrap Confidence Intervals
  16.2 Rejection Sampling
  16.3 Importance Sampling
17 Decision Theory
  17.1 Risk
  17.2 Admissibility
  17.3 Bayes Rule
  17.4 Minimax Rules
18 Linear Regression
  18.1 Simple Linear Regression
  18.2 Prediction
  18.3 Multiple Regression
  18.4 Model Selection
19 Non-parametric Function Estimation
  19.1 Density Estimation
    19.1.1 Histograms
    19.1.2 Kernel Density Estimator (KDE)
  19.2 Non-parametric Regression
  19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
  20.1 Markov Chains
  20.2 Poisson Processes
21 Time Series
  21.1 Stationary Time Series
  21.2 Estimation of Correlation
  21.3 Non-Stationary Time Series
    21.3.1 Detrending
  21.4 ARIMA Models
    21.4.1 Causality and Invertibility
  21.5 Spectral Analysis
22 Math
  22.1 Gamma Function
  22.2 Beta Function
  22.3 Series
  22.4 Combinatorics
1 Distribution Overview
1.1 Discrete Distributions
For each distribution we list the CDF F_X(x), PMF f_X(x), mean E[X], variance V[X], and MGF M_X(s).¹

Uniform Unif{a, ..., b}
  F_X(x): 0 (x < a), (⌊x⌋ − a + 1)/(b − a + 1) (a ≤ x ≤ b), 1 (x > b)
  f_X(x) = I(a ≤ x ≤ b)/(b − a + 1)
  E[X] = (a + b)/2, V[X] = ((b − a + 1)² − 1)/12
  M_X(s) = (e^{as} − e^{(b+1)s}) / ((b − a + 1)(1 − e^s))

Bernoulli Bern(p)
  F_X(x): 0 (x < 0), 1 − p (0 ≤ x < 1), 1 (x ≥ 1)
  f_X(x) = p^x (1 − p)^{1−x}, x ∈ {0, 1}
  E[X] = p, V[X] = p(1 − p)
  M_X(s) = 1 − p + p e^s

Binomial Bin(n, p)
  F_X(x) = I_{1−p}(n − x, x + 1)
  f_X(x) = C(n, x) p^x (1 − p)^{n−x}
  E[X] = np, V[X] = np(1 − p)
  M_X(s) = (1 − p + p e^s)^n

Multinomial Mult(n, p)
  f_X(x) = n!/(x_1! ... x_k!) p_1^{x_1} ... p_k^{x_k}, with Σ_{i=1}^k x_i = n
  E[X_i] = n p_i, V[X_i] = n p_i (1 − p_i)
  M_X(s) = (Σ_{i=1}^k p_i e^{s_i})^n

Hypergeometric Hyp(N, m, n)
  f_X(x) = C(m, x) C(N − m, n − x) / C(N, n)
  E[X] = nm/N, V[X] = nm(N − n)(N − m) / (N²(N − 1))
  M_X(s): no simple closed form

Negative Binomial NBin(r, p)
  F_X(x) = I_p(r, x + 1)
  f_X(x) = C(x + r − 1, r − 1) p^r (1 − p)^x
  E[X] = r(1 − p)/p, V[X] = r(1 − p)/p²
  M_X(s) = (p / (1 − (1 − p)e^s))^r

Geometric Geo(p)
  F_X(x) = 1 − (1 − p)^x, x ∈ N⁺
  f_X(x) = p(1 − p)^{x−1}, x ∈ N⁺
  E[X] = 1/p, V[X] = (1 − p)/p²
  M_X(s) = p e^s / (1 − (1 − p)e^s)

Poisson Po(λ)
  F_X(x) = e^{−λ} Σ_{i=0}^{⌊x⌋} λ^i / i!
  f_X(x) = λ^x e^{−λ} / x!
  E[X] = λ, V[X] = λ
  M_X(s) = e^{λ(e^s − 1)}

[Figure: PMFs of the discrete uniform, binomial (n = 40, p = 0.3; n = 30, p = 0.6; n = 25, p = 0.9), geometric (p = 0.2, 0.5, 0.8), and Poisson (λ = 1, 4, 10) distributions.]

¹ We use the notation γ(s, x) and Γ(x) to refer to the Gamma functions (see 22.1), and use B(x, y) and I_x to refer to the Beta functions (see 22.2).
1.2 Continuous Distributions
For each distribution we list the CDF F_X(x), PDF f_X(x), mean E[X], variance V[X], and MGF M_X(s) where available.

Uniform Unif(a, b)
  F_X(x): 0 (x < a), (x − a)/(b − a) (a < x < b), 1 (x > b)
  f_X(x) = I(a < x < b)/(b − a)
  E[X] = (a + b)/2, V[X] = (b − a)²/12
  M_X(s) = (e^{sb} − e^{sa}) / (s(b − a))

Normal N(μ, σ²)
  F_X(x) = Φ(x) = ∫_{−∞}^{x} φ(t) dt
  f_X(x) = φ(x) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²))
  E[X] = μ, V[X] = σ²
  M_X(s) = exp(μs + σ²s²/2)

Log-Normal ln N(μ, σ²)
  F_X(x) = 1/2 + (1/2) erf[(ln x − μ)/(√2 σ)]
  f_X(x) = (1/(x σ√(2π))) exp(−(ln x − μ)²/(2σ²))
  E[X] = e^{μ + σ²/2}, V[X] = (e^{σ²} − 1) e^{2μ + σ²}

Multivariate Normal MVN(μ, Σ)
  f_X(x) = (2π)^{−k/2} |Σ|^{−1/2} exp(−(1/2)(x − μ)^T Σ^{−1}(x − μ))
  E[X] = μ, V[X] = Σ
  M_X(s) = exp(μ^T s + (1/2) s^T Σ s)

Student's t Student(ν)
  f_X(x) = (Γ((ν + 1)/2)/(√(νπ) Γ(ν/2))) (1 + x²/ν)^{−(ν+1)/2}
  E[X] = 0 (ν > 1), V[X] = ν/(ν − 2) (ν > 2)

Chi-square χ²_k
  F_X(x) = γ(k/2, x/2)/Γ(k/2)
  f_X(x) = (1/(2^{k/2} Γ(k/2))) x^{k/2−1} e^{−x/2}
  E[X] = k, V[X] = 2k
  M_X(s) = (1 − 2s)^{−k/2} (s < 1/2)

F F(d₁, d₂)
  F_X(x) = I_{d₁x/(d₁x + d₂)}(d₁/2, d₂/2)
  f_X(x) = √((d₁x)^{d₁} d₂^{d₂} / (d₁x + d₂)^{d₁+d₂}) / (x B(d₁/2, d₂/2))
  E[X] = d₂/(d₂ − 2) (d₂ > 2), V[X] = 2d₂²(d₁ + d₂ − 2)/(d₁(d₂ − 2)²(d₂ − 4)) (d₂ > 4)

Exponential Exp(β)
  F_X(x) = 1 − e^{−x/β}
  f_X(x) = (1/β) e^{−x/β}
  E[X] = β, V[X] = β²
  M_X(s) = 1/(1 − βs) (s < 1/β)

Gamma Gamma(α, β)
  F_X(x) = γ(α, x/β)/Γ(α)
  f_X(x) = (1/(Γ(α) β^α)) x^{α−1} e^{−x/β}
  E[X] = αβ, V[X] = αβ²
  M_X(s) = (1/(1 − βs))^α (s < 1/β)

Inverse Gamma InvGamma(α, β)
  f_X(x) = (β^α/Γ(α)) x^{−α−1} e^{−β/x}
  E[X] = β/(α − 1) (α > 1), V[X] = β²/((α − 1)²(α − 2)) (α > 2)
  M_X(s) = (2(−βs)^{α/2}/Γ(α)) K_α(√(−4βs))

Dirichlet Dir(α)
  f_X(x) = (Γ(Σ_{i=1}^k α_i)/∏_{i=1}^k Γ(α_i)) ∏_{i=1}^k x_i^{α_i − 1}
  E[X_i] = α_i/Σ_j α_j, V[X_i] = E[X_i](1 − E[X_i])/(Σ_j α_j + 1)

Beta Beta(α, β)
  F_X(x) = I_x(α, β)
  f_X(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1}
  E[X] = α/(α + β), V[X] = αβ/((α + β)²(α + β + 1))
  M_X(s) = 1 + Σ_{k=1}^∞ (∏_{r=0}^{k−1} (α + r)/(α + β + r)) s^k/k!

Weibull Weibull(λ, k)
  F_X(x) = 1 − e^{−(x/λ)^k}
  f_X(x) = (k/λ)(x/λ)^{k−1} e^{−(x/λ)^k}
  E[X] = λ Γ(1 + 1/k), V[X] = λ²(Γ(1 + 2/k) − Γ(1 + 1/k)²)
  M_X(s) = Σ_{n=0}^∞ (s^n λ^n/n!) Γ(1 + n/k)

Pareto Pareto(x_m, α)
  F_X(x) = 1 − (x_m/x)^α (x ≥ x_m)
  f_X(x) = α x_m^α / x^{α+1} (x ≥ x_m)
  E[X] = α x_m/(α − 1) (α > 1), V[X] = x_m² α/((α − 1)²(α − 2)) (α > 2)
  M_X(s) = α(−x_m s)^α Γ(−α, −x_m s) (s < 0)
[Figure: PDFs of the continuous uniform, normal, log-normal, Student's t, χ², F, exponential, gamma, inverse gamma, beta, Weibull, and Pareto distributions for several parameter settings.]
2 Probability Theory

Definitions
  Sample space Ω
  Outcome (point or element) ω
  Event A ⊆ Ω
  σ-algebra 𝒜:
    1. ∅ ∈ 𝒜
    2. A₁, A₂, ... ∈ 𝒜 ⟹ ∪_{i=1}^∞ A_i ∈ 𝒜
    3. A ∈ 𝒜 ⟹ A^c ∈ 𝒜
  Probability distribution P:
    1. P[A] ≥ 0 for every A
    2. P[Ω] = 1
    3. P[⊔_{i=1}^∞ A_i] = Σ_{i=1}^∞ P[A_i] for disjoint A_i
  Probability space (Ω, 𝒜, P)

Properties
  P[∅] = 0
  B = Ω ∩ B = (A ∪ A^c) ∩ B = (A ∩ B) ∪ (A^c ∩ B)
  P[A^c] = 1 − P[A]
  P[B] = P[A ∩ B] + P[A^c ∩ B]
  P[Ω] = 1, P[∅] = 0
  (∪_n A_n)^c = ∩_n A_n^c and (∩_n A_n)^c = ∪_n A_n^c (DeMorgan)
  P[∪_n A_n] = 1 − P[∩_n A_n^c]
  P[A ∪ B] = P[A] + P[B] − P[A ∩ B]
  P[A ∩ B] ≥ P[A] + P[B] − 1
  P[A ∪ B] = P[A ∩ B^c] + P[A^c ∩ B] + P[A ∩ B]
  P[A ∩ B^c] = P[A] − P[A ∩ B]

Continuity of Probabilities
  A₁ ⊂ A₂ ⊂ ... ⟹ lim_{n→∞} P[A_n] = P[A] where A = ∪_{i=1}^∞ A_i
  A₁ ⊃ A₂ ⊃ ... ⟹ lim_{n→∞} P[A_n] = P[A] where A = ∩_{i=1}^∞ A_i

Independence
  A ⊥ B ⟺ P[A ∩ B] = P[A] P[B]

Conditional Probability
  P[A | B] = P[A ∩ B] / P[B]  if P[B] > 0

Law of Total Probability
  P[B] = Σ_{i=1}^n P[B | A_i] P[A_i]  where Ω = ⊔_{i=1}^n A_i

Bayes' Theorem
  P[A_i | B] = P[B | A_i] P[A_i] / Σ_{j=1}^n P[B | A_j] P[A_j]  where Ω = ⊔_{i=1}^n A_i

Inclusion-Exclusion Principle
  |∪_{i=1}^n A_i| = Σ_{r=1}^n (−1)^{r−1} Σ_{i₁ < ... < i_r ≤ n} |∩_{j=1}^r A_{i_j}|

3 Random Variables

Random Variable
  X : Ω → R

Probability Mass Function (PMF)
  f_X(x) = P[X = x] = P[{ω ∈ Ω : X(ω) = x}]

Probability Density Function (PDF)
  P[a ≤ X ≤ b] = ∫_a^b f(x) dx

Cumulative Distribution Function (CDF)
  F_X : R → [0, 1], F_X(x) = P[X ≤ x]
  1. Nondecreasing: x₁ < x₂ ⟹ F(x₁) ≤ F(x₂)
  2. Normalized: lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1
  3. Right-continuous: lim_{y↓x} F(y) = F(x)

Conditional density
  P[a ≤ Y ≤ b | X = x] = ∫_a^b f_{Y|X}(y | x) dy  (a ≤ b)
  f_{Y|X}(y | x) = f(x, y) / f_X(x)

Independence
  1. P[X ≤ x, Y ≤ y] = P[X ≤ x] P[Y ≤ y]
  2. f_{X,Y}(x, y) = f_X(x) f_Y(y)
3.1 Transformations

Transformation function
  Z = φ(X)

Discrete
  f_Z(z) = P[φ(X) = z] = P[{x : φ(x) = z}] = P[X ∈ φ^{−1}(z)] = Σ_{x ∈ φ^{−1}(z)} f(x)

Continuous
  F_Z(z) = P[φ(X) ≤ z] = ∫_{A_z} f(x) dx  with A_z = {x : φ(x) ≤ z}

Special case if φ strictly monotone
  f_Z(z) = f_X(φ^{−1}(z)) |d/dz φ^{−1}(z)| = f_X(x)/|φ'(x)| = f_X(x)/|J|

The Rule of the Lazy Statistician
  E[Z] = ∫ φ(x) dF_X(x)
  E[I_A(X)] = ∫ I_A(x) dF_X(x) = ∫_A dF_X(x) = P[X ∈ A]

Convolution
  Z := X + Y:  f_Z(z) = ∫_{−∞}^∞ f_{X,Y}(x, z − x) dx  (if X, Y ≥ 0: ∫_0^z)
  Z := |X − Y|:  f_Z(z) = 2 ∫_0^∞ f_{X,Y}(x, z + x) dx
  Z := X/Y:  f_Z(z) = ∫_{−∞}^∞ |x| f_{X,Y}(x, xz) dx

4 Expectation

Expectation
  E[X] = μ_X = ∫ x dF_X(x) = Σ_x x f_X(x) (X discrete), ∫ x f_X(x) dx (X continuous)
  P[X = c] = 1 ⟹ E[c] = c
  E[cX] = c E[X]
  E[X + Y] = E[X] + E[Y]
  E[XY] = ∫∫ xy f_{X,Y}(x, y) dF_X(x) dF_Y(y)
  E[φ(X)] ≠ φ(E[X]) in general (cf. Jensen inequality)
  P[X ≥ Y] = 1 ⟹ E[X] ≥ E[Y];  P[X = Y] = 1 ⟹ E[X] = E[Y]
  E[X] = Σ_{x=1}^∞ P[X ≥ x]  (X nonnegative integer-valued)

Sample mean
  X̄_n = (1/n) Σ_{i=1}^n X_i

Conditional Expectation
  E[Y | X = x] = ∫ y f(y | x) dy
  E[X] = E[E[X | Y]]
  E[φ(X, Y) | X = x] = ∫ φ(x, y) f_{Y|X}(y | x) dy
  E[φ(Y, Z) | X = x] = ∫∫ φ(y, z) f_{(Y,Z)|X}(y, z | x) dy dz
  E[Y + Z | X] = E[Y | X] + E[Z | X]
  E[φ(X) Y | X] = φ(X) E[Y | X]
  E[Y | X] = c ⟹ Cov[X, Y] = 0

5 Variance

Variance
  V[X] = σ²_X = E[(X − E[X])²] = E[X²] − E[X]²
  V[Σ_{i=1}^n X_i] = Σ_{i=1}^n V[X_i] + 2 Σ_{i≠j} Cov[X_i, X_j]
  V[Σ_{i=1}^n X_i] = Σ_{i=1}^n V[X_i]  if the X_i are independent

Standard deviation
  sd[X] = √V[X] = σ_X

Covariance
  Cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y]
  Cov[X, a] = 0
  Cov[X, X] = V[X]
  Cov[X, Y] = Cov[Y, X]
  Cov[aX, bY] = ab Cov[X, Y]
  Cov[X + a, Y + b] = Cov[X, Y]
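A minimal Monte Carlo check of the variance and covariance identities above (NumPy assumed; the distributions of X and Y are arbitrary choices for illustration):

```python
# Sketch: numerical check of V[X] = E[X^2] - E[X]^2 and Cov[aX, bY] = ab Cov[X, Y].
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.5, size=100_000)   # arbitrary distribution for X
y = 0.5 * x + rng.normal(size=x.size)    # Y correlated with X

# V[X] = E[X^2] - E[X]^2
print(np.var(x), np.mean(x**2) - np.mean(x)**2)

# Cov[aX, bY] = a * b * Cov[X, Y]
a, b = 3.0, -2.0
print(np.cov(a * x, b * y)[0, 1], a * b * np.cov(x, y)[0, 1])
```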

Xn m
X n X
X m limn Bin (n, p) = N (np, np(1 p)) (n large, p far from 0 and 1)
Cov Xi , Yj = Cov [Xi , Yj ]
Negative Binomial
i=1 j=1 i=1 j=1
X NBin (1, p) = Geo (p)
Correlation Pr
Cov [X, Y ] X NBin (r, p) = i=1 Geo (p)
[X, Y ] = p Xi NBin (ri , p) =
P P
Xi NBin ( ri , p)
V [X] V [Y ]
X NBin (r, p) . Y Bin (s + r, p) = P [X s] = P [Y r]
Independence
Poisson
X
Y = [X, Y ] = 0 Cov [X, Y ] = 0 E [XY ] = E [X] E [Y ] n
X n
X
!
Xi Po (i ) Xi Xj = Xi Po i
Sample variance i=1 i=1
n
1 X

S2 = (Xi Xn )2 n n
X X i
n 1 i=1 Xi Po (i ) Xi Xj = Xi Xj Bin Xj , Pn
j=1 j=1 j=1 j
Conditional Variance
Exponential
    2
V [Y | X] = E (Y E [Y | X])2 | X = E Y 2 | X E [Y | X] n
X
V [Y ] = E [V [Y | X]] + V [E [Y | X]] Xi Exp () Xi
Xj = Xi Gamma (n, )
i=1
Memoryless property: P [X > x + y | X > y] = P [X > x]
6 Inequalities Normal
 
X

Cauchy-Schwarz
2
X N , 2 = N (0, 1)

E [XY ] E X 2 E Y 2
   
 
X N , Z = aX + b = Z N a + b, a2 2
2
Markov
 
X N 1 , 12 Y N 2 , 22 = X + Y N 1 + 2 , 12 + 22

E [(X)]
P [(X) t]

Xi N i , i2 =
P
X N
P P 2

t  i i i i , i i

P [a < X b] = b a

Chebyshev
V [X]
P [|X E [X]| t] (x) = 1 (x) 0 (x) = x(x) 00 (x) = (x2 1)(x)
t2
1
Chernoff Upper quantile of N (0, 1): z = (1 )
e
 
P [X (1 + )] > 1 Gamma
(1 + )1+
X Gamma (, ) X/ Gamma (, 1)
Jensen P
Gamma (, ) i=1 Exp ()
E [(X)] (E [X]) convex P P
Xi Gamma (i , ) Xi
Xj = i Xi Gamma ( i i , )
Z
() 1 x
= x e dx
7 Distribution Relationships 0
Beta
Binomial
1 ( + ) 1
n x1 (1 x)1 = x (1 x)1
Xi Bern (p) =
X
Xi Bin (n, p) B(, ) ()()
  B( + k, ) +k1
E X k1
 
i=1 E Xk = =
X Bin (n, p) , Y Bin (m, p) = X + Y Bin (n + m, p) B(, ) ++k1
limn Bin (n, p) = Po (np) (n large, p small) Beta (1, 1) Unif (0, 1)
8 Probability and Moment Generating Functions Conditional mean and variance
X
E [X | Y ] = E [X] + (Y E [Y ])
 
GX (t) = E tX |t| < 1
" Y

#  
X (Xt)i X E Xi
ti
t
 Xt

MX (t) = GX (e ) = E e =E =
p
i! i! V [X | Y ] = X 1 2
i=0 i=0
P [X = 0] = GX (0)
P [X = 1] = G0X (0) 9.3 Multivariate Normal
(i)
GX (0) Covariance Matrix (Precision Matrix 1 )
P [X = i] =
i!
V [X1 ] Cov [X1 , Xk ]
E [X] = G0X (1 )
=
.. .. ..
  (k)
E X k = MX (0) . . .

X!
 Cov [Xk , X1 ] V [Xk ]
(k)
E = GX (1 )
(X k)! If X N (, ),
2
V [X] = G00X (1 ) + G0X (1 ) (G0X (1 ))  
1/2 1
GX (t) = GY (t) = X = Y
d
fX (x) = (2)n/2 || exp (x )T 1 (x )
2
Properties
9 Multivariate Distributions
Z N (0, 1) X = + 1/2 Z = X N (, )
9.1 Standard Bivariate Normal X N (, ) = 1/2 (X ) N (0, 1)

p X N (, ) = AX N A, AAT
Let X, Y N (0, 1) X
Z with Y = X + 1 2 Z 
X N (, ) a is vector of length k = aT X N aT , aT a
Joint density  2
x + y 2 2xy

1
f (x, y) = exp 10 Convergence
2(1 2 )
p
2 1 2
Conditionals Let {X1 , X2 , . . .} be a sequence of rvs and let X be another rv. Let Fn denote
the cdf of Xn and let F denote the cdf of X.
(Y | X = x) N x, 1 2 (X | Y = y) N y, 1 2
 
and
Types of Convergence
Independence D
X
Y = 0 1. In distribution (weakly, in law): Xn X

lim Fn (t) = F (t) t where F continuous


n
9.2 Bivariate Normal
  P
Let X N x , x2 and Y N y , y2 . 2. In probability: Xn X

1

z
 ( > 0) lim P [|Xn X| > ] = 0
n
f (x, y) = exp
2(1 2 )
p
2x y 1 2
as
3. Almost surely (strongly): Xn X
" 2  2   #
x x y y x x y y h i h i
z= + 2 P lim Xn = X = P : lim Xn () = X() = 1
x y x y n n
qm
4. In quadratic mean (L2 ): Xn X CLT Notations

lim E (Xn X)2 = 0


  Zn N (0, 1)
n
2
 
Xn N ,
Relationships n
2
 
qm
Xn X = Xn X = Xn X
P D Xn N 0,
n
as
Xn X = Xn X
P 2

D P
n(Xn ) N 0,
Xn X (c R) P [X = c] = 1 = Xn X
n(Xn )
Xn
P
X Yn
P
Y = Xn + Yn X + Y
P
N (0, 1)
qm qm qm
n
Xn X Yn Y = Xn + Yn X + Y
P P P
Xn X Yn Y = Xn Yn XY
Xn
P
X =
P
(Xn ) (X) Continuity Correction
x + 12
D D  
Xn X = (Xn ) (X)  
qm P Xn x
Xn b limn E [Xn ] = b limn V [Xn ] = 0 / n
qm
X1 , . . . , Xn iid E [X] = V [X] < Xn
x 12
 
 
P Xn x 1
Slutzkys Theorem / n
Delta Method
D P D
Xn X and Yn c = Xn + Yn X + c
2 2
   
D P D 0 2
Xn X and Yn c = Xn Yn cX Yn N , = (Yn ) N (), ( ())
D D D n n
In general: Xn X and Yn Y =
6 Xn + Yn X + Y

11 Statistical Inference
10.1 Law of Large Numbers (LLN) iid
Let X1 , , Xn F if not otherwise noted.
Let {X1 , . . . , Xn } be a sequence of iid rvs, E [X1 ] = , and V [X1 ] < .
11.1 Point Estimation
Weak (WLLN)
P Point estimator bn of is a rv: bn = g(X1 , . . . , Xn )
Xn as n h i
bias(bn ) = E bn
Strong (SLLN) P
as Consistency: bn
Xn as n
Sampling distribution: F (bn )
r h i
Standard error: se(n ) = V bn
b
10.2 Central Limit Theorem (CLT)
h i h i
Let {X1 , . . . , Xn } be a sequence of iid rvs, E [X1 ] = , and V [X1 ] = 2 . Mean squared error: mse = E (bn )2 = bias(bn )2 + V bn

limn bias(bn ) = 0 limn se(bn ) = 0 = bn is consistent


Xn n(Xn ) D
Zn := q   = Z where Z N (0, 1) bn D
V Xn Asymptotic normality: N (0, 1)
se
Slutzkys Theorem often lets us replace se(bn ) by some (weakly) consis-
lim P [Zn z] = (z) zR tent estimator
bn .
n 10
11.2 Normal-based Confidence Interval 11.4 Statistical Functionals
 
b 2 . Let z/2 = 1 (1 (/2)), i.e., P Z > z/2 = /2 Statistical functional: T (F )
 
Suppose bn N , se
 
and P z/2 < Z < z/2 = 1 where Z N (0, 1). Then Plug-in estimator of = T (F ) : bn = T (Fn )
R
Linear functional: T (F ) = (x) dFX (x)
Cn = bn z/2 se
b Plug-in estimator for linear functional:
Z n
1X
T (Fn ) =
(x) dFbn (x) = (Xi )
11.3 Empirical Distribution Function n i=1
 
Empirical Distribution Function (ECDF) b 2 = T (Fn ) z/2 se
Often: T (Fn ) N T (F ), se b
Pn
I(Xi x) pth quantile: F 1 (p) = inf{x : F (x) p}
i=1
Fbn (x) = = Xn
n
n
1 X
b2 =
(Xi Xn )2
n 1 i=1
(
1 Xi x
I(Xi x) = 1
Pn 3
0 Xi > x n i=1 (Xi )
=
b3 j

Pn
Properties (for any fixed x) (Xi Xn )(Yi Yn )
= qP i=1 qP
n 2 n
h i
i=1 (X i Xn ) i=1 (Yi Yn )
E Fn = F (x)
h i F (x)(1 F (x))
V Fn =
n
12 Parametric Inference
F (x)(1 F (x)) D 
mse = 0 Let F = f (x; : be a parametric model with parameter space Rk
n
P and parameter = (1 , . . . , k ).
Fn F (x)

Dvoretzky-Kiefer-Wolfowitz (DKW) Inequality (X1 , . . . , Xn F ) 12.1 Method of Moments


  j th moment Z
2
P sup F (x) Fn (x) > = 2e2n

j () = E X j = xj dFX (x)
 
x

Nonparametric 1 confidence band for F j th sample moment


n
1X j
j = X
L(x) = max{Fn n , 0} n i=1 i
U (x) = min{Fn + n , 1} Method of Moments Estimator (MoM)
s  
1 2
= log 1 () = 1
2n
2 () = 2
.. ..
.=.
P [L(x) F (x) U (x) x] 1 k () = k
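A sketch of the method of moments for a two-parameter model, matching the first two moments (NumPy assumed; the Gamma(α, β) parametrization with E[X] = αβ and V[X] = αβ² follows the table in Section 1.2, and the true parameter values are made up):

```python
# Sketch: method-of-moments estimates for Gamma(alpha, beta),
# solving m1 = alpha*beta and m2 - m1^2 = alpha*beta^2.
import numpy as np

rng = np.random.default_rng(1)
alpha_true, beta_true = 3.0, 2.0
x = rng.gamma(shape=alpha_true, scale=beta_true, size=10_000)

m1 = x.mean()                  # first sample moment
m2 = (x**2).mean()             # second sample moment
var_hat = m2 - m1**2           # implied variance

alpha_hat = m1**2 / var_hat
beta_hat = var_hat / m1
print(alpha_hat, beta_hat)
```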
Properties of the MoM estimator Equivariance: bn is the mle = (bn ) is the mle of ()
bn exists with probability tending to 1 Asymptotic normality:
P
p
Consistency: bn 1. se 1/In ()
Asymptotic normality: (bn ) D
N (0, 1)
D
se
n(b ) N (0, ) q
  b 1/In (bn )
2. se
where = gE Y Y T g T , Y = (X, X 2 , . . . , X k )T ,
1
(bn ) D
g = (g1 , . . . , gk ) and gj = j () N (0, 1)
se
b
Asymptotic optimality (or efficiency), i.e., smallest variance for large sam-
12.2 Maximum Likelihood ples. If en is any other estimator, the asymptotic relative efficiency is
Likelihood: Ln : [0, ) h i
V bn
n
Y are(en , bn ) = h i 1
Ln () = f (Xi ; ) V en
i=1
Approximately the Bayes estimator
Log-likelihood
n
`n () = log Ln () =
X
log f (Xi ; ) 12.2.1 Delta Method
i=1 b where is differentiable and 0 () 6= 0:
If = ()
Maximum Likelihood Estimator (mle)
n ) D
(b
N (0, 1)
Ln (bn ) = sup Ln () se(b
b )

Score Function where b = ()


b is the mle of and

s(X; ) = log f (X; )

b = 0 ()
se se(
b n )
b b

Fisher Information
I() = V [s(X; )] 12.3 Multiparameter Models
In () = nI() Let = (1 , . . . , k ) and b = (b1 , . . . , bk ) be the mle.
Fisher Information (exponential family)
2 `n 2 `n
  Hjj = Hjk =
2 j k
I() = E s(X; )
Fisher Information Matrix
Observed Fisher Information
E [H11 ] E [H1k ]

n
In () = .. .. ..
2 X

. . .
Inobs () =

log f (Xi ; )
2 i=1 E [Hk1 ] E [Hkk ]

Properties of the mle Under appropriate regularity conditions


P
Consistency: bn (b ) N (0, Jn )
12
with Jn () = In1 . Further, if bj is the j th component of , then 13 Hypothesis Testing
H0 : 0 versus H1 : 1
(bj j ) D
N (0, 1) Definitions
se
bj
Null hypothesis H0
h i Alternative hypothesis H1
b 2j = Jn (j, j) and Cov bj , bk = Jn (j, k)
where se Simple hypothesis = 0
Composite hypothesis > 0 or < 0
Two-sided test: H0 : = 0 versus H1 : 6= 0
One-sided test: H0 : 0 versus H1 : > 0
12.3.1 Multiparameter Delta Method Critical value c
Test statistic T
Let = (1 , . . . , k ) be a function and let the gradient of be Rejection Region R = {x : T (x) > c}
Power function () = P [X R]

Power of a test: 1 P [Type II error] = 1 = inf ()
1
1 Test size: = P [Type I error] = sup ()
.
..
= 0

Retain H0 Reject H0

k H0 true Type
I error ()
H1 true Type II error () (power)
p-value
Suppose =b 6= 0 and b = ().
b Then,

p-value = sup0 P [T (X) T (x)] = inf : T (x) R
P [T (X ? ) T (X)]

p-value = sup0 = inf : T (X) R
) D
(b
N (0, 1)
| {z }
se(b
b ) 1F (T (X)) since T (X ? )F

p-value evidence
where < 0.01 very strong evidence against H0
0.01 0.05 strong evidence against H0
r
 T   0.05 0.1 weak evidence against H0
se(b
b ) =
Jn
> 0.1 little or no evidence against H0
Wald Test
Two-sided test
and Jn = Jn () = b.

b and
= b 0
Reject H0 when |W | > z/2 where W =
  se
b
P |W | > z/2
p-value = P0 [|W | > |w|] P [|Z| > |w|] = 2(|w|)
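A minimal sketch of the two-sided Wald test above, taking the sample mean as the estimator and its plug-in standard error (NumPy/SciPy assumed; the data and μ₀ are hypothetical):

```python
# Sketch: Wald test for H0: mu = mu0 using the sample mean and its standard error.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(loc=0.3, scale=1.0, size=200)   # hypothetical data
mu0 = 0.0

theta_hat = x.mean()
se_hat = x.std(ddof=1) / np.sqrt(len(x))
w = (theta_hat - mu0) / se_hat                 # Wald statistic

alpha = 0.05
reject = abs(w) > norm.ppf(1 - alpha / 2)      # reject if |W| > z_{alpha/2}
p_value = 2 * norm.cdf(-abs(w))                # p-value = 2 * Phi(-|w|)
print(w, reject, p_value)
```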
12.4 Parametric Bootstrap
Likelihood Ratio Test (LRT)

Sample from f (x; bn ) instead of from Fn , where bn could be the mle or method sup Ln () Ln (bn )
T (X) = =
of moments estimator.
k
D
X iid xn = (x1 , . . . , xn )
(X) = 2 log T (X) 2rq where Zi2 2k with Z1 , . . . , Zk N (0, 1)
Prior density f ()
 i=1  Likelihood f (xn | ): joint density of the data
p-value = P0 [(X) > (x)] P 2rq > (x) n
Y
In particular, X n iid = f (xn | ) = f (xi | ) = Ln ()
Multinomial LRT
i=1
Posterior density f ( | xn )
 
X1 Xk
Let pn = ,..., be the mle
Normalizing constant cn = f (xn ) = f (x | )f () d
R
n n
k  Xj Kernel: part of a density that depends Ron
Ln (pn ) Y pj
T (X) = = Ln ()f ()
Posterior Mean n = f ( | xn ) d = R Ln ()f
R
Ln (p0 ) j=1
p0j () d
k  
X pj D
(X) = 2 Xj log 2k1 14.1 Credible Intervals
j=1
p 0j

The approximate size LRT rejects H0 when (X) 2k1, 1 Posterior Interval
2
Pearson Test Z b
n
P [ (a, b) | x ] = f ( | xn ) d = 1
k
X (Xj E [Xj ])2 a
T = where E [Xj ] = np0j under H0
j=1
E [Xj ] 1 Equal-tail Credible Interval
D
T 2k1 Z a Z
f ( | xn ) d = f ( | xn ) d = /2
 
p-value = P 2k1 > T (x)
D
b
2
Faster Xk1 than LRT, hence preferable for small n
1 Highest Posterior Density (HPD) region Rn
Independence Testing
1. P [ Rn ] = 1
I rows, J columns, X multinomial sample of size n = I J
X 2. Rn = { : f ( | xn ) > k} for some k
mles unconstrained: pij = nij
X
mles under H0 : p0ij = pi pj = Xni nj Rn is unimodal = Rn is an interval
 
PI PJ nX
LRT: = 2 i=1 j=1 Xij log Xi Xijj
PI PJ (X E[X ])2
Pearson 2 : T = i=1 j=1 ijE[Xij ]ij
14.2 Function of Parameters
D
LRT and Pearson 2k , where = (I 1)(J 1) Let = () and A = { : () }.
Posterior CDF for
Z
14 Bayesian Inference H(r | xn ) = P [() | xn ] = f ( | xn ) d
A
Bayes Theorem
Posterior Density
f (x | )f () f (x | )f () h( | xn ) = H 0 ( | xn )
f ( | x) = n
=R Ln ()f ()
f (x ) f (x | )f () d
Bayesian Delta Method
Definitions  
| X n N (),
b seb 0 ()
b
n
X = (X1 , . . . , Xn )

14.3 Priors Continuous likelihood (subscript c denotes constant)
Likelihood Conjugate Prior Posterior hyperparameters
Choice 
Uniform(0, ) Pareto(xm , k) max x(n) , xm , k + n
n
Subjective Bayesianism: prior should incorporate as much detail as possible Exponential() Gamma(, ) + n, +
X
xi
the researchs a priori knowledge via prior elicitation. i=1
Objective Bayesianism: prior should incorporate as little detail as possible  Pn   
0 i=1 xi 1 n
(non-informative prior). Normal(, c2 ) Normal(0 , 02 ) + / + 2 ,
2 2 02 c
Robust Bayesianism: consider various priors and determine sensitivity of  0 c1
1 n
our inferences to changes in the prior. + 2
02 c
Pn
02 + i=1 (xi )2
Types Normal(c , 2 ) Scaled Inverse Chi- + n,
+n
square(, 02 )
Flat: f () constant + nx n
R Normal(, 2 ) Normal- , + n, + ,
Proper: f () d = 1 +n 2
scaled Inverse n
(x )2
R
Improper: f () d = 1X 2
Gamma(, , , ) + (xi x) +
Jeffreys prior (transformation-invariant): 2 i=1 2(n + )
1
1 1
1 1

p p MVN(, c ) MVN(0 , 0 ) 0 + nc 0 0 + n x ,
f () I() f () det(I())
1 1
1

0 + nc
n
Conjugate: f () and f ( | xn ) belong to the same parametric family X
MVN(c , ) Inverse- n + , + (xi c )(xi c )T
Wishart(, ) i=1
n
X xi
14.3.1 Conjugate Priors Pareto(xmc , k) Gamma(, ) + n, + log
i=1
xm c
Discrete likelihood Pareto(xm , kc ) Pareto(x0 , k0 ) x0 , k0 kn where k0 > kn
Xn
Likelihood Conjugate Prior Posterior hyperparameters Gamma(c , ) Gamma(0 , 0 ) 0 + nc , 0 + xi
n n i=1
X X
Bernoulli(p) Beta(, ) + xi , + n xi
i=1
Xn n
X
i=1
n
X
14.4 Bayesian Testing
Binomial(p) Beta(, ) + xi , + Ni xi If H0 : 0 :
i=1 i=1 i=1
n
X
Z
Negative Binomial(p) Beta(, ) + rn, + xi Prior probability P [H0 ] = f () d
n
i=1 Z0
Posterior probability P [H0 | xn ] = f ( | xn ) d
X
Poisson() Gamma(, ) + xi , + n
0
i=1
n
X
Multinomial(p) Dirichlet() + x(i)
i=1 Let H0 , . . . , HK1 be K hypotheses. Suppose f ( | Hk ),
n
f (xn | Hk )P [Hk ]
X
Geometric(p) Beta(, ) + n, + xi
P [Hk | xn ] = PK ,
n
k=1 f (x | Hk )P [Hk ]
i=1
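A sketch of one row of the conjugate-prior table above, the Beta-Bernoulli update, together with an equal-tail credible interval as in 14.1 (NumPy/SciPy assumed; the data and prior hyperparameters are made-up illustrative values):

```python
# Sketch: Beta(alpha, beta) prior for Bernoulli(p) data gives the posterior
# Beta(alpha + sum x_i, beta + n - sum x_i).
import numpy as np
from scipy.stats import beta

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # hypothetical Bernoulli observations
alpha0, beta0 = 2.0, 2.0                  # prior hyperparameters (assumed)

alpha_n = alpha0 + x.sum()
beta_n = beta0 + len(x) - x.sum()

post = beta(alpha_n, beta_n)
print(post.mean())                        # posterior mean of p
print(post.ppf(0.025), post.ppf(0.975))   # equal-tail 95% credible interval
```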
Marginal Likelihood 1. Estimate VF [Tn ] with VFn [Tn ].
Z 2. Approximate VFn [Tn ] using simulation:
f (xn | Hi ) = f (xn | , Hi )f ( | Hi ) d

(a) Repeat the following B times to get Tn,1 , . . . , Tn,B , an iid sample from
Posterior Odds (of Hi relative to Hj ) the sampling distribution implied by Fn

P [Hi | xn ] f (xn | Hi ) P [Hi ] i. Sample uniformly X1 , . . . , Xn Fn .


= ii. Compute Tn = g(X1 , . . . , Xn ).
P [Hj | xn ] f (xn | Hj ) P [Hj ]
(b) Then
| {z } | {z }
Bayes Factor BFij prior odds

B B
!2
Bayes Factor 1 X 1 X
log10 BF10 BF10 evidence vboot = VFn = Tn,b T
B B r=1 n,r
b=1
0 0.5 1 1.5 Weak
0.5 1 1.5 10 Moderate
12 10 100 Strong
16.1.1 Bootstrap Confidence Intervals
>2 > 100 Decisive
p
1p BF 10 Normal-based Interval
p = p where p = P [H1 ] and p = P [H1 | xn ]
1 + 1p BF10
Tn z/2 se
boot

15 Exponential Family Pivotal Interval


Scalar parameter
1. Location parameter = T (F )
fX (x | ) = h(x) exp {()T (x) A()} 2. Pivot Rn = bn
= h(x)g() exp {()T (x)} 3. Let H(r) = P [Rn r] be the cdf of Rn

4. Let Rn,b = bn,b bn . Approximate H using bootstrap:
Vector parameter
( s
)
B
1 X
X
fX (x | ) = h(x) exp i ()Ti (x) A() H(r) =
I(Rn,b r)
i=1 B
b=1
= h(x) exp {() T (x) A()}
= h(x)g() exp {() T (x)} 5. Let denote the sample quantile of (bn,1

, . . . , bn,B )
Natural form 6. Let r denote the sample quantile of (Rn,1

, . . . , Rn,B ), i.e., r = bn
 
fX (x | ) = h(x) exp { T(x) A()} 7. Then, an approximate 1 confidence interval is Cn = a, b with
= h(x)g() exp { T(x)}  
= h(x)g() exp T T(x) a = bn H 1 1 =
bn r1/2 =
2bn 1/2

2

b = bn H 1 =
bn r/2 =
2bn /2
2
16 Sampling Methods
Percentile Interval
16.1 The Bootstrap  

Cn = /2 , 1/2
Let Tn = g(X1 , . . . , Xn ) be a statistic.
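A sketch of the nonparametric bootstrap from 16.1, estimating the standard error of a statistic and forming the normal-based and percentile intervals of 16.1.1 (NumPy assumed; the sample, the choice of the median as the statistic, and B are illustrative):

```python
# Sketch: bootstrap standard error and confidence intervals for the sample median.
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=100)     # hypothetical sample
t_hat = np.median(x)

B = 2000
t_boot = np.array([np.median(rng.choice(x, size=len(x), replace=True))
                   for _ in range(B)])       # T*_{n,1}, ..., T*_{n,B}

se_boot = t_boot.std(ddof=1)
z = 1.96                                     # z_{alpha/2} for alpha = 0.05
normal_ci = (t_hat - z * se_boot, t_hat + z * se_boot)
percentile_ci = tuple(np.quantile(t_boot, [0.025, 0.975]))
print(se_boot, normal_ci, percentile_ci)
```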
16.2 Rejection Sampling Decision rule: synonymous for an estimator b
Action a A: possible value of the decision rule. In the estimation
Setup
context, the action is just an estimate of , (x).
b
We can easily sample from g() Loss function L: consequences of taking action a when true state is or
We want to sample from h(), but it is difficult discrepancy between and , b L : A [k, ).
k()
We know h() up to proportional constant: h() = R Loss functions
k() d
Envelope condition: we can find M > 0 such that k() M g() Squared error loss: L(, a) = ( a)2
(
K1 ( a) a < 0
Algorithm Linear loss: L(, a) =
K2 (a ) a 0
1. Draw cand g() Absolute error loss: L(, a) = | a| (linear loss with K1 = K2 )
2. Generate u Unif (0, 1) Lp loss: L(, a) = | a|p
k(cand ) (
3. Accept cand if u 0 a=
M g(cand ) Zero-one loss: L(, a) =
1 a 6=
4. Repeat until B values of cand have been accepted
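A sketch of the rejection-sampling algorithm above (NumPy/SciPy assumed; the unnormalized target k(θ), the normal proposal g, and the envelope constant M are all made-up choices for illustration, with M picked large enough that k(θ) ≤ M g(θ) for this particular example):

```python
# Sketch: rejection sampling from a target known only up to a constant.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

def k(theta):                                  # unnormalized target (assumed example)
    return np.exp(-theta**2 / 2) * (1 + np.sin(3 * theta)**2)

def g_pdf(theta):                              # proposal density N(0, 1.5^2)
    return norm.pdf(theta, loc=0, scale=1.5)

M = 8.0                                        # envelope: k(theta) <= M * g_pdf(theta) here

samples = []
while len(samples) < 1000:
    cand = rng.normal(0, 1.5)                  # 1. draw candidate from g
    u = rng.uniform()                          # 2. u ~ Unif(0, 1)
    if u <= k(cand) / (M * g_pdf(cand)):       # 3. accept with probability k / (M g)
        samples.append(cand)                   # 4. repeat until enough accepted
samples = np.array(samples)
```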

Example 17.1 Risk


We can easily sample from the prior g() = f () Posterior Risk
Target is the posterior with h() k() = f (xn | )f () Z h i
Envelope condition: f (xn | ) f (xn | bn ) = Ln (bn ) M r(b | x) = L(, (x))f
b ( | x) d = E|X L(, (x))
b

Algorithm
(Frequentist) Risk
1. Draw cand f ()
Z
2. Generate u Unif (0, 1)
h i
R(, )
b = L(, (x))f
b (x | ) dx = EX| L(, (X))
b
Ln (cand )
3. Accept cand if u
Ln (bn ) Bayes Risk
ZZ
16.3 Importance Sampling
h i
r(f, )
b = L(, (x))f
b (x, ) dx d = E,X L(, (X))
b
Sample from an importance function g rather than target density h.
Algorithm to obtain an approximation to E [q() | xn ]:
h h ii h i
r(f, )
b = E EX| L(, (X)
b = E R(, )b
iid
1. Sample from the prior 1 , . . . , n f ()
h h ii h i
r(f, )
b = EX E|X L(, (X)
b = EX r(b | X)
Ln (i )
2. For each i = 1, . . . , B, calculate wi = PB
i=1 Ln (i )
n
PB 17.2 Admissibility
3. E [q() | x ] i=1 q(i )wi
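A sketch of the importance-sampling algorithm above for a posterior mean, sampling from the prior and weighting by normalized likelihoods (NumPy/SciPy assumed; the normal model, known σ = 1, and the N(0, 3²) prior are illustrative assumptions):

```python
# Sketch: self-normalized importance sampling of E[q(theta) | x^n] with q(theta) = theta.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
x = rng.normal(loc=1.0, scale=1.0, size=50)     # hypothetical data, sigma known = 1

B = 10_000
theta = rng.normal(0.0, 3.0, size=B)            # 1. draw from the prior

log_lik = np.array([norm.logpdf(x, loc=t, scale=1.0).sum() for t in theta])
w = np.exp(log_lik - log_lik.max())             # 2. likelihood weights (stabilized)
w /= w.sum()

post_mean = np.sum(theta * w)                   # 3. weighted average approximates the posterior mean
print(post_mean)
```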
b0 dominates b if
: R(, b0 ) R(, )
b
17 Decision Theory
: R(, b0 ) < R(, )
b
Definitions
b is inadmissible if there is at least one other estimator b0 that dominates
Unknown quantity affecting our decision: it. Otherwise it is called admissible.
17.3 Bayes Rule Residual Sums of Squares (rss)
Bayes Rule (or Bayes Estimator) n
X
rss(b0 , b1 ) = 2i
r(f, )
b = inf e r(f, )

e
i=1
R
(x) = inf r( | x) x = r(f, )
b b b = r(b | x)f (x) dx
Least Square Estimates
Theorems
bT = (b0 , b1 )T : min rss

b0 ,
b1
Squared error loss: posterior mean
Absolute error loss: posterior median
Zero-one loss: posterior mode b0 = Yn b1 Xn
Pn Pn
(Xi Xn )(Yi Yn ) i=1 Xi Yi nXY
17.4 Minimax Rules b1 = i=1 Pn 2
= n
(X X ) 2
P 2
i=1 i n i=1 Xi nX
Maximum Risk
 
0
h i
R()
b = sup R(, ) R(a) = sup R(, a) E b | X n =
b 1

2 n1 ni=1 Xi2 X n
h i  P 
Minimax Rule n
V |X =
b
e = inf sup R(, )
b = inf R()
sup R(, ) e nsX X n 1
e e
r Pn
2
i=1 Xi

b
se(
b b0 ) =
b = Bayes rule c : R(, )
b =c sX n n


b
Least Favorable Prior se(
b b1 ) =
sX n
bf = Bayes rule R(, bf ) r(f, bf ) Pn Pn
where s2X = n1 i=1 (Xi X n )2 and
b2 = 1
n2 2i
i=1  an (unbiased) estimate
of . Further properties:
18 Linear Regression
P P
Consistency: b0 0 and b1 1
Definitions
Asymptotic normality:
Response variable Y
Covariate X (aka predictor variable or feature) b0 0 D b1 1 D
N (0, 1) and N (0, 1)
se(
b b0 ) se(
b b1 )
18.1 Simple Linear Regression
Approximate 1 confidence intervals for 0 and 1 are
Model
Yi = 0 + 1 Xi + i E [i | Xi ] = 0, V [i | Xi ] = 2 b0 z/2 se(
b b0 ) and b1 z/2 se(
b b1 )
Fitted Line
rb(x) = b0 + b1 x The Wald test for testing H0 : 1 = 0 vs. H1 : 1 6= 0 is: reject H0 if
|W | > z/2 where W = b1 /se(
b b1 ).
Predicted (Fitted) Values
Ybi = rb(Xi ) R2
Pn b 2
Pn 2
Residuals i=1 (Yi Y )  rss
2
= 1 Pn i=1 i 2 = 1
 
i = Yi Ybi = Yi b0 + b1 Xi R = Pn 2
i=1 (Yi Y ) i=1 (Yi Y )
tss
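A sketch of the closed-form simple linear regression quantities from 18.1 (NumPy assumed; the simulated data and true coefficients are illustrative, and s_X² uses the 1/n convention stated above):

```python
# Sketch: least squares estimates, standard errors, Wald statistic, and R^2 for y = b0 + b1*x.
import numpy as np

rng = np.random.default_rng(6)
n = 200
x = rng.uniform(0, 10, size=n)
y = 1.5 + 0.8 * x + rng.normal(scale=2.0, size=n)    # hypothetical data

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()

resid = y - (b0 + b1 * x)
sigma2_hat = np.sum(resid**2) / (n - 2)              # unbiased estimate of sigma^2
sx2 = np.mean((x - x.mean())**2)                     # s_X^2 (1/n convention)

se_b1 = np.sqrt(sigma2_hat / (n * sx2))
se_b0 = np.sqrt(sigma2_hat * np.mean(x**2) / (n * sx2))

W = b1 / se_b1                                       # Wald statistic for H0: beta1 = 0
r2 = 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)
print(b0, b1, se_b0, se_b1, W, r2)
```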
Likelihood If the (k k) matrix X T X is invertible,
n n n
Y Y Y b = (X T X)1 X T Y
L= f (Xi , Yi ) = fX (Xi ) fY |X (Yi | Xi ) = L1 L2 h i
i=1 i=1 i=1 V b | X n = 2 (X T X)1
n
b N , 2 (X T X)1
Y 
L1 = fX (Xi )
i=1
n
(
2
) Estimate regression function
Y
n 1 X
L2 = fY |X (Yi | Xi ) exp 2 Yi (0 1 Xi ) k
2 i X
i=1 rb(x) = bj xj
j=1
Under the assumption of Normality, the least squares estimator is also the mle
2
Unbiased estimate for
n
1X 2 n
b2 =
 1 X 2
n i=1 i b2 =
  = X b Y
n k i=1 i

18.2 Prediction mle


nk 2

b = X b2 =

Observe X = x of the covarite and want to predict their outcome Y . n
1 Confidence Interval
Yb = b0 + b1 x bj z/2 se(
b bj )
h i h i h i h i
V Yb = V b0 + x2 V b1 + 2x Cov b0 , b1
18.4 Model Selection
Prediction Interval  Pn Consider predicting a new observation Y for covariates X and let S J
2

2 2 i=1 (Xi X ) denote a subset of the covariates in the model, where |S| = k and |J| = n.
n =
b P +1
n i (Xi X)2 j
b
Issues
Underfitting: too few covariates yields high bias
Yb z/2 bn
Overfitting: too many covariates yields high variance

18.3 Multiple Regression Procedure


1. Assign a score to each model
Y = X + 
2. Search through all models to find the one with the highest score
where Hypothesis Testing
X11 X1k 1 1
.. .. = ...
.. .. H0 : j = 0 vs. H1 : j 6= 0 j J
X= . =.

. .
Xn1 Xnk k n Mean Squared Prediction Error (mspe)
Likelihood h i

1
 mspe = E (Yb (S) Y )2
2 n/2
L(, ) = (2 ) exp 2 rss
2
Prediction Risk
N
X n
X n
X h i
rss = (y X)T (y X) = ||Y X||2 = (Yi xTi )2 R(S) = mspei = E (Ybi (S) Yi )2
i=1 i=1 i=1 19
Training Error 19 Non-parametric Function Estimation
n
X
R
btr (S) = (Ybi (S) Yi )2 19.1 Density Estimation
i=1 R
Estimate f (x), where f (x) = P [X A] = A f (x) dx.
2
R Integrated Square Error (ise)
Pn b 2
R i=1 (Yi (S) Y )
rss(S) btr (S) Z  2 Z
R2 (S) = 1 =1 =1 P n 2 L(f, fbn ) = f (x) fbn (x) dx = J(h) + f 2 (x) dx
i=1 (Yi Y )
tss tss

The training error is a downward-biased estimate of the prediction risk. Frequentist Risk
h i Z Z
h i R(f, fbn ) = E L(f, fbn ) = b2 (x) dx + v(x) dx
E R btr (S) < R(S)
h i
h
i n
X h i b(x) = E fbn (x) f (x)
bias(R btr (S) R(S) = 2
btr (S)) = E R Cov Ybi , Yi h i
i=1 v(x) = V fbn (x)

Adjusted R2
19.1.1 Histograms
2 n 1 rss
R (S) = 1
n k tss Definitions
Mallows Cp statistic Number of bins m
1
Binwidth h = m
R(S)
b =R 2 = lack of fit + complexity penalty
btr (S) + 2kb Bin Bj has j observations
R
Define pbj = j /n and pj = Bj f (u) du
Akaike Information Criterion (AIC)
Histogram Estimator
m
AIC(S) = bS2 )
`n (bS , k X pbj
fbn (x) = I(x Bj )
j=1
h
Bayesian Information Criterion (BIC) h i p
j
E fbn (x) =
k h
bS2 ) log n
BIC(S) = `n (bS , h i p (1 p )
j j
2 V fbn (x) =
nh2
h2
Z
Validation and Training 2 1
R(fn , f )
b (f 0 (u)) du +
12 nh
m
X n n !1/3
R
bV (S) = (Ybi (S) Yi )2 m = |{validation data}|, often or 1 6
i=1
4 2 h = 1/3 R 2 du
n (f 0 (u))
 2/3 Z 1/3
Leave-one-out Cross-validation C 3 2
R (fbn , f ) 2/3 C= (f 0 (u)) du
n n
!2 n 4
X
2
X Yi Ybi (S)
R
bCV (S) = (Yi Yb(i) ) = Cross-validation estimate of E [J(h)]
i=1 i=1
1 Uii (S) Z n m
2 2Xb 2 n+1 X 2
JCV (h) = fn (x) dx
b b f(i) (Xi ) = pb
U (S) = XS (XST XS )1 XS (hat matrix) n i=1 (n 1)h (n 1)h j=1 j
19.1.2 Kernel Density Estimator (KDE) k-nearest Neighbor Estimator
Kernel K 1 X
rb(x) = Yi where Nk (x) = {k values of x1 , . . . , xn closest to x}
k
i:xi Nk (x)
K(x) 0
Nadaraya-Watson Kernel Estimator
R
K(x) dx = 1
R
xK(x) dx = 0 n
X
rb(x) = wi (x)Yi
R 2 2
x K(x) dx K >0
i=1
xxi

KDE K
wi (x) = h  [0, 1]
n
Pn xxj
j=1 K
 
1X1 x Xi h
fbn (x) = K
n i=1 h h h4
Z 4 Z 
f 0 (x)
2
Z Z rn , r)
R(b x2 K 2 (x) dx r00 (x) + 2r0 (x) dx
1 4 00 2 1 4 f (x)
R(f, fn ) (hK )
b (f (x)) dx + K 2 (x) dx
4 nh 2 K 2 (x) dx
Z R
2/5 1/5 1/5 + dx
nhf (x)
Z Z
c c2 c3
h = 1 c1 = 2
K , c 2 = K 2
(x) dx, c 3 = (f 00 (x))2 dx c1
n1/5 h
Z 4/5 Z 1/5 n1/5
c4 5 2 2/5 c2
R (f, fbn ) = 4/5 c4 = (K ) K 2 (x) dx (f 00 )2 dx R (b
rn , r) 4/5
n 4 n
| {z }
C(K)

Cross-validation estimate of E [J(h)]


Epanechnikov Kernel
n n
X X (Yi rb(xi ))2
(Yi rb(i) (xi ))2 =
(
3
|x| < 5 JbCV (h) = !2
K(x) = 4 5(1x2 /5)
i=1 i=1 K(0)
0 otherwise 1 Pn  xx 
j
j=1 K h
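A sketch of the kernel density estimator from 19.1.2 with a Gaussian kernel (NumPy assumed; the bandwidth below is Silverman's rule of thumb, an assumed plug-in choice rather than the cross-validation bandwidth discussed here):

```python
# Sketch: KDE f_hat(x) = (1/n) * sum_i K((x - X_i)/h) / h with a Gaussian kernel.
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(size=500)                          # hypothetical sample

def kde(x, data, h):
    u = (x[:, None] - data[None, :]) / h             # (x - X_i) / h
    k = np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)       # Gaussian kernel
    return k.mean(axis=1) / h

h = 1.06 * data.std(ddof=1) * len(data) ** (-1 / 5)  # Silverman's rule (assumption)
grid = np.linspace(-4, 4, 200)
f_hat = kde(grid, data, h)
```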

Cross-validation estimate of E [J(h)]


19.3 Smoothing Using Orthogonal Functions
n n n  
1 X X Xi Xj
Z
2 2Xb 2 Approximation
JbCV (h) = fn (x) dx
b f(i) (Xi ) 2
K + K(0)
n i=1 hn i=1 j=1 h nh
X J
X
r(x) = j j (x) j j (x)
Z j=1 i=1
K (x) = K (2) (x) 2K(x) K (2) (x) = K(x y)K(y) dy Multivariate Regression
Y = +

19.2 Non-parametric Regression 0 (x1 ) J (x1 )
.. .. ..
where i = i and = . . .
Estimate f (x), where f (x) = E [Y | X = x]. Consider pairs of points
(x1 , Y1 ), . . . , (xn , Yn ) related by 0 (xn ) J (xn )
Least Squares Estimator
Yi = r(xi ) + i
b = (T )1 T Y
E [i ] = 0
1
V [i ] = 2 T Y (for equallly spaced observations only)
n
Cross-validation estimate of E [J(h)] 20.2 Poisson Processes
2
Xn J
X Poisson Process
R
bCV (J) = Yi j (xi )bj,(i)
i=1 j=1
{Xt : t [0, )} number of events up to and including time t
X0 = 0
20 Stochastic Processes Independent increments:
Stochastic Process
( t0 < < tn : Xt1 Xt0

Xtn Xtn1
{0, 1, . . . } = Z discrete
{Xt : t T } T =
[0, ) continuous
Intensity function (t)
Notations: Xt , X(t)
State space X P [Xt+h Xt = 1] = (t)h + o(h)
Index set T P [Xt+h Xt = 2] = o(h)
Rt
Xs+t Xs Po (m(s + t) m(s)) where m(t) = 0
(s) ds
20.1 Markov Chains
Markov Chain Homogeneous Poisson Process
P [Xn = x | X0 , . . . , Xn1 ] = P [Xn = x | Xn1 ] n T, x X
(t) = Xt Po (t) >0
Transition probabilities

pij P [Xn+1 = j | Xn = i] Waiting Times


pij (n) P [Xm+n = j | Xm = i] n-step
Wt := time at which Xt occurs
Transition matrix P (n-step: Pn )
(i, j) element is pij  
1
pij > 0 Wt Gamma t,
P
i pij = 1

Chapman-Kolmogorov Interarrival Times


X
pij (m + n) = pij (m)pkj (n) St = Wt+1 Wt
k

Pm+n = Pm Pn  
1
Pn = P P = Pn St Exp

Marginal probability

n = (n (1), . . . , n (N )) where i (i) = P [Xn = i]


St
0 , initial distribution
n = 0 Pn Wt1 Wt t
21 Time Series 21.1 Stationary Time Series
Mean function Z
Strictly stationary
xt = E [xt ] = xft (x) dx
P [xt1 c1 , . . . , xtk ck ] = P [xt1 +h c1 , . . . , xtk +h ck ]
Autocovariance function

x (s, t) = E [(xs s )(xt t )] = E [xs xt ] s t k N, tk , ck , h Z

x (t, t) = E (xt t )2 = V [xt ]


 
Weakly stationary
Autocorrelation function (ACF)  
E x2t < t Z
 2
Cov [xs , xt ] (s, t) E xt = m t Z
(s, t) = p =p
V [xs ] V [xt ] (s, s)(t, t) x (s, t) = x (s + r, t + r) r, s, t Z

Cross-covariance function (CCV) Autocovariance function


xy (s, t) = E [(xs xs )(yt yt )]
(h) = E [(xt+h )(xt )] h Z
 
Cross-correlation function (CCF) (0) = E (xt )2
xy (s, t) (0) 0
xy (s, t) = p (0) |(h)|
x (s, s)y (t, t)
(h) = (h)
Backshift operator
B k (xt ) = xtk Autocorrelation function (ACF)
Difference operator
d = (1 B)d Cov [xt+h , xt ] (t + h, t) (h)
x (h) = p =p =
V [xt+h ] V [xt ] (t + h, t + h)(t, t) (0)
White Noise
2
wt wn(0, w ) Jointly stationary time series
iid 2

Gaussian: wt N 0, w
xy (h) = E [(xt+h x )(yt y )]
E [wt ] = 0 t T
V [wt ] = 2 t T
w (s, t) = 0 s 6= t s, t T xy (h)
xy (h) = p
x (0)y (h)
Random Walk
Drift Linear Process
Pt
xt = t + j=1 wj
X
X
E [xt ] = t xt = + j wtj where |j | <
j= j=
Symmetric Moving Average
k
X k
X
X
2
mt = aj xtj where aj = aj 0 and aj = 1 (h) = w j+h j
j=k j=k j=
21.2 Estimation of Correlation 21.3.1 Detrending
Sample mean Least Squares
n
1X
x = xt 1. Choose trend model, e.g., t = 0 + 1 t + 2 t2
n t=1
2. Minimize rss to obtain trend estimate bt = b0 + b1 t + b2 t2
Sample variance 3. Residuals , noise wt
n  
1 X |h|
V [x] = 1 x (h) Moving average
n n
h=n
1
The low-pass filter vt is a symmetric moving average mt with aj = 2k+1 :
Sample autocovariance function
k
nh 1 X
1 X vt = xt1

b(h) = (xt+h x)(xt x) 2k + 1
n t=1 i=k

1
Pk
Sample autocorrelation function If 2k+1 i=k wtj 0, a linear trend function t = 0 + 1 t passes
without distortion

b(h)
b(h) = Differencing

b(0)
t = 0 + 1 t = xt = 1
Sample cross-variance function
nh
1 X 21.4 ARIMA models

bxy (h) = (xt+h x)(yt y)
n t=1 Autoregressive polynomial

Sample cross-correlation function (z) = 1 1 z p zp z C p 6= 0


bxy (h) Autoregressive operator
bxy (h) = p
bx (0)b
y (0)
(B) = 1 1 B p B p
Properties
Autoregressive model order p, AR (p)
1
bx (h) = if xt is white noise
n xt = 1 xt1 + + p xtp + wt (B)xt = wt
1
bxy (h) = if xt or yt is white noise AR (1)
n
k1
X k,||<1 X
21.3 Non-Stationary Time Series xt = k (xtk ) + j (wtj ) = j (wtj )
j=0 j=0
Classical decomposition model | {z }
linear process
P
xt = t + st + wt E [xt ] = j=0 j (E [wtj ]) = 0
2 h
w
t = trend (h) = Cov [xt+h , xt ] = 12
(h)
st = seasonal component (h) = (0) = h
wt = random noise term (h) = (h 1) h = 1, 2, . . .
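A sketch of the AR(1) model above, comparing the sample autocorrelation (using the 1/n convention of 21.2) with the theoretical ρ(h) = φ^h (NumPy assumed; φ and the series length are illustrative):

```python
# Sketch: simulate x_t = phi * x_{t-1} + w_t and compare sample ACF with phi^h.
import numpy as np

rng = np.random.default_rng(8)
phi, n = 0.7, 5000
w = rng.normal(size=n)

x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + w[t]

def sample_acf(x, max_lag):
    x = x - x.mean()
    n = len(x)
    c0 = np.sum(x * x) / n
    return np.array([np.sum(x[h:] * x[:n - h]) / n / c0 for h in range(max_lag + 1)])

print(sample_acf(x, 5))            # close to [1, 0.7, 0.49, 0.343, ...]
print(phi ** np.arange(6))
```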
Moving average polynomial Seasonal ARIMA
(z) = 1 + 1 z + + q zq z C q 6= 0 Denoted by ARIMA (p, d, q) (P, D, Q)s
Moving average operator P (B s )(B)D d s
s xt = + Q (B )(B)wt

(B) = 1 + 1 B + + p B p
21.4.1 Causality and Invertibility
MA (q) (moving average model order q) P
ARMA (p, q) is causal (future-independent) {j } : j=0 j < such that
xt = wt + 1 wt1 + + q wtq xt = (B)wt
q
X
xt = wtj = (B)wt
X
E [xt ] = j E [wtj ] = 0
j=0
j=0
( Pqh P
2
w j=0 j j+h 0hq ARMA (p, q) is invertible {j } : j=0 j < such that
(h) = Cov [xt+h , xt ] =
0 h>q

X
MA (1) (B)xt = Xtj = wt
xt = wt + wt1 j=0

2 2
(1 + )w h = 0

Properties
(h) = w 2
h=1


0 h>1 ARMA (p, q) causal roots of (z) lie outside the unit circle
(

2 h=1
(z)
(h) = (1+ )
X
0 h>1 (z) = j z j = |z| 1
j=0
(z)
ARMA (p, q)
xt = 1 xt1 + + p xtp + wt + 1 wt1 + + q wtq ARMA (p, q) invertible roots of (z) lie outside the unit circle

(B)xt = (B)wt X (z)
(z) = j z j = |z| 1
Partial autocorrelation function (PACF) j=0
(z)
xh1
i , regression of xi on {xh1 , xh2 , . . . , x1 }
Behavior of the ACF and PACF for causal and invertible ARMA models
hh = corr(xh xh1
h , x0 xh1
0 ) h2
E.g., 11 = corr(x1 , x0 ) = (1) AR (p) MA (q) ARMA (p, q)
ARIMA (p, d, q) ACF tails off cuts off after lag q tails off
d xt = (1 B)d xt is ARMA (p, q) PACF cuts off after lag p tails off q tails off
(B)(1 B)d xt = (B)wt
Exponentially Weighted Moving Average (EWMA) 21.5 Spectral Analysis
xt = xt1 + wt wt1 Periodic process

X xt = A cos(2t + )
xt = (1 )j1 xtj + wt when || < 1
j=1 = U1 cos(2t) + U2 sin(2t)
xn+1 = (1 )xn + xn
Frequency index (cycles per unit time), period 1/
Amplitude A Discrete Fourier Transform (DFT)
Phase n
X
U1 = A cos and U2 = A sin often normally distributed rvs d(j ) = n1/2 xt e2ij t
i=1
Periodic mixture
q
X Fourier/Fundamental frequencies
xt = (Uk1 cos(2k t) + Uk2 sin(2k t))
j = j/n
k=1

Uk1 , Uk2 , for k = 1, . . . , q, are independent zero-mean rvs with variances k2 Inverse DFT
n1
Pq X
(h) = k=1 k2 cos(2k h) xt = n1/2 d(j )e2ij t
  Pq
(0) = E x2t = k=1 k2 j=0

Spectral representation of a periodic process Periodogram


I(j/n) = |d(j/n)|2
(h) = 2 cos(20 h)
Scaled Periodogram
2 2i0 h 2 2i0 h
= e + e 4
2 2 P (j/n) = I(j/n)
Z 1/2 n
e2ih dF ()
!2 !2
= 2X
n
2X
n
1/2 = xt cos(2tj/n + xt sin(2tj/n
n t=1 n t=1
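A sketch of the periodogram at the Fourier frequencies j/n, computed via the FFT so that d(ω_j) matches the scaled DFT above and I(j/n) = |d(j/n)|² (NumPy assumed; the test signal with a cosine at frequency 0.1 plus white noise is illustrative):

```python
# Sketch: periodogram I(j/n) = |d(j/n)|^2 at the Fourier frequencies.
import numpy as np

rng = np.random.default_rng(9)
n = 256
t = np.arange(n)
x = 2 * np.cos(2 * np.pi * 0.1 * t) + rng.normal(size=n)

d = np.fft.fft(x) / np.sqrt(n)        # d(omega_j), j = 0, ..., n-1
I = np.abs(d) ** 2                    # periodogram
freqs = np.arange(n) / n              # Fourier frequencies omega_j = j/n

peak = freqs[np.argmax(I[1:n // 2]) + 1]
print(peak)                           # close to 0.1
```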
Spectral distribution function

0 < 0

22 Math
F () = 2 /2 < 0

2
0 22.1 Gamma Function
Z
F () = F (1/2) = 0 Ordinary: (s) = ts1 et dt
F () = F (1/2) = (0) 0 Z
Spectral density Upper incomplete: (s, x) = ts1 et dt
x
Z x

X 1 1 Lower incomplete: (s, x) = ts1 et dt
f () = (h)e2ih
2 2 0
h=
( + 1) = () >1
P R 1/2
Needs |(h)| < = (h) = e2ih f () d h = 0, 1, . . . (n) = (n 1)! nN
h= 1/2
f () 0 (1/2) =
f () = f ()
f () = f (1 ) 22.2 Beta Function
R 1/2 Z 1
(0) = V [xt ] = 1/2 f () d (x)(y)
Ordinary: B(x, y) = B(y, x) = tx1 (1 t)y1 dt =
2
White noise: fw () = w 0 (x + y)
Z x
ARMA (p, q) , (B)xt = (B)wt : Incomplete: B(x; a, b) = ta1 (1 t)b1 dt
0
|(e2i )|2
2 Regularized incomplete:
fx () = w a+b1
|(e2i )|2 B(x; a, b) a,bN X (a + b 1)!
Pp Pq Ix (a, b) = = xj (1 x)a+b1j
where (z) = 1 k=1 k z k and (z) = 1 + k=1 k z k B(a, b) j=a
j!(a + b 1 j)!
I0 (a, b) = 0 I1 (a, b) = 1 Stirling numbers, 2nd kind
Ix (a, b) = 1 I1x (b, a)         (
n n1 n1 n 1 n=0
=k + 1kn =
k k k1 0 0 else
22.3 Series
Finite Binomial Partitions
n n   n
n(n + 1) n
X
Pn+k,k = Pn,i k > n : Pn,k = 0 n 1 : Pn,0 = 0, P0,0 = 1
X X
n
k= =2
2 k i=1
k=1 k=0
n n 
f :BU D = distinguishable, D = indistinguishable.
  
X X r+k r+n+1 Balls and Urns
(2k 1) = n2 =
k n
k=1 k=0
n n    
X n(n + 1)(2n + 1) X k n+1 |B| = n, |U | = m f arbitrary f injective f surjective f bijective
k2 = =
6 m m+1
k=1 k=0 ( (
mn m n
 
n  2 Vandermondes Identity: n n n! m = n
X n(n + 1) B : D, U : D m m!
k3 = r  
m n
 
m+n

0 else m 0 else
2
X
k=1 =
n k rk r (
cn+1 1 k=0
     
X n+n1 m n1 1 m=n
ck = c 6= 1 Binomial Theorem: B : D, U : D
c1 n  
n nk k n n m1 0 else
k=0
X
a b = (a + b)n
k m  
(   (
k=0 X n 1 mn n 1 m=n
B : D, U : D
k 0 else m 0 else
Infinite k=1
m
( (
X 1 X p 1 mn 1 m=n
pk = pk =
X
, |p| < 1 B : D, U : D Pn,k Pn,m
1p 1p 0 else 0 else
k=0 k=1 k=1

!  
X d X d 1 1
kpk1 = pk
= = |p| < 1
dp dp 1 p 1 p2 References
k=0 k=0
 
X r+k1 k
x = (1 x)r r N+ [1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory.
k
k=0 Brooks Cole, 1972.
 
X k
p = (1 + p) |p| < 1 , C [2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships.
k
k=0 The American Statistician, 62(1):4553, 2008.

22.4 Combinatorics [3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications
With R Examples. Springer, 2006.
Sampling
[4] A. Steger. Diskrete Strukturen Band 1: Kombinatorik, Graphentheorie,
k out of n w/o replacement w/ replacement Algebra. Springer, 2001.
k1
Y n! [5] A. Steger. Diskrete Strukturen Band 2: Wahrscheinlichkeitstheorie und
ordered nk = (n i) = nk Statistik. Springer, 2002.
i=0
(n k)!
 
n nk n!

n1+r
 
n1+r
 [6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference.
unordered = = = Springer, 2003.
k k! k!(n k)! r n1
Univariate distribution relationships, courtesy of Leemis and McQueston [2].