
Probability and Statistics

Cheat Sheet

Copyright © Matthias Vallentin, 2011
[email protected]

6th March, 2011


This cheat sheet integrates a variety of topics in probability theory and statistics. It is based on literature [1, 6, 3] and in-class material from courses of the statistics department at the University of California, Berkeley, but is also influenced by other sources [4, 5]. If you find errors or have suggestions for further topics, I would appreciate it if you sent me an email. The most recent version of this document is available at http://bit.ly/probstat. To reproduce, please contact me.

Contents

1 Distribution Overview
  1.1 Discrete Distributions
  1.2 Continuous Distributions
2 Probability Theory
3 Random Variables
  3.1 Transformations
4 Expectation
5 Variance
6 Inequalities
7 Distribution Relationships
8 Probability and Moment Generating Functions
9 Multivariate Distributions
  9.1 Standard Bivariate Normal
  9.2 Bivariate Normal
  9.3 Multivariate Normal
10 Convergence
  10.1 Law of Large Numbers (LLN)
  10.2 Central Limit Theorem (CLT)
11 Statistical Inference
  11.1 Point Estimation
  11.2 Normal-based Confidence Interval
  11.3 Empirical Distribution Function
  11.4 Statistical Functionals
12 Parametric Inference
  12.1 Method of Moments
  12.2 Maximum Likelihood
    12.2.1 Delta Method
  12.3 Multiparameter Models
    12.3.1 Multiparameter Delta Method
  12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Bayesian Inference
  14.1 Credible Intervals
  14.2 Function of Parameters
  14.3 Priors
    14.3.1 Conjugate Priors
  14.4 Bayesian Testing
15 Exponential Family
16 Sampling Methods
  16.1 The Bootstrap
    16.1.1 Bootstrap Confidence Intervals
  16.2 Rejection Sampling
  16.3 Importance Sampling
17 Decision Theory
  17.1 Risk
  17.2 Admissibility
  17.3 Bayes Rule
  17.4 Minimax Rules
18 Linear Regression
  18.1 Simple Linear Regression
  18.2 Prediction
  18.3 Multiple Regression
  18.4 Model Selection
19 Non-parametric Function Estimation
  19.1 Density Estimation
    19.1.1 Histograms
    19.1.2 Kernel Density Estimator (KDE)
  19.2 Non-parametric Regression
  19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
  20.1 Markov Chains
  20.2 Poisson Processes
21 Time Series
  21.1 Stationary Time Series
  21.2 Estimation of Correlation
  21.3 Non-Stationary Time Series
    21.3.1 Detrending
  21.4 ARIMA Models
    21.4.1 Causality and Invertibility
  21.5 Spectral Analysis
22 Math
  22.1 Gamma Function
  22.2 Beta Function
  22.3 Series
  22.4 Combinatorics
1 Distribution Overview
1.1 Discrete Distributions
For each distribution we list the CDF F_X(x), PMF f_X(x), mean E[X], variance V[X], and MGF M_X(s).¹

Uniform Unif{a, ..., b}
  F_X(x): 0 (x < a), (⌊x⌋ − a + 1)/(b − a + 1) (a ≤ x ≤ b), 1 (x > b)
  f_X(x) = I(a ≤ x ≤ b)/(b − a + 1)
  E[X] = (a + b)/2, V[X] = ((b − a + 1)² − 1)/12
  M_X(s) = (e^{as} − e^{(b+1)s}) / ((b − a + 1)(1 − e^s))

Bernoulli Bern(p)
  F_X(x): 0 (x < 0), 1 − p (0 ≤ x < 1), 1 (x ≥ 1)
  f_X(x) = p^x (1 − p)^{1−x}, x ∈ {0, 1}
  E[X] = p, V[X] = p(1 − p)
  M_X(s) = 1 − p + p e^s

Binomial Bin(n, p)
  F_X(x) = I_{1−p}(n − x, x + 1)
  f_X(x) = C(n, x) p^x (1 − p)^{n−x}
  E[X] = np, V[X] = np(1 − p)
  M_X(s) = (1 − p + p e^s)^n

Multinomial Mult(n, p)
  f_X(x) = n!/(x_1! ... x_k!) p_1^{x_1} ... p_k^{x_k}, with Σ_{i=1}^k x_i = n
  E[X_i] = n p_i, V[X_i] = n p_i (1 − p_i)
  M_X(s) = (Σ_{i=1}^k p_i e^{s_i})^n

Hypergeometric Hyp(N, m, n)
  f_X(x) = C(m, x) C(N − m, n − x) / C(N, n)
  E[X] = nm/N, V[X] = nm(N − n)(N − m) / (N²(N − 1))
  M_X(s): no simple closed form

Negative Binomial NBin(r, p)
  F_X(x) = I_p(r, x + 1)
  f_X(x) = C(x + r − 1, r − 1) p^r (1 − p)^x
  E[X] = r(1 − p)/p, V[X] = r(1 − p)/p²
  M_X(s) = (p / (1 − (1 − p)e^s))^r

Geometric Geo(p)
  F_X(x) = 1 − (1 − p)^x, x ∈ N⁺
  f_X(x) = p(1 − p)^{x−1}, x ∈ N⁺
  E[X] = 1/p, V[X] = (1 − p)/p²
  M_X(s) = p e^s / (1 − (1 − p)e^s)

Poisson Po(λ)
  F_X(x) = e^{−λ} Σ_{i=0}^{⌊x⌋} λ^i / i!
  f_X(x) = λ^x e^{−λ} / x!
  E[X] = λ, V[X] = λ
  M_X(s) = e^{λ(e^s − 1)}

[Figure: PMFs of the discrete uniform, binomial (n = 40, p = 0.3; n = 30, p = 0.6; n = 25, p = 0.9), geometric (p = 0.2, 0.5, 0.8), and Poisson (λ = 1, 4, 10) distributions.]

¹ We use the notation γ(s, x) and Γ(x) to refer to the Gamma functions (see 22.1), and use B(x, y) and I_x to refer to the Beta functions (see 22.2).
1.2 Continuous Distributions
For each distribution we list the CDF F_X(x), PDF f_X(x), mean E[X], variance V[X], and MGF M_X(s) where available.

Uniform Unif(a, b)
  F_X(x): 0 (x < a), (x − a)/(b − a) (a < x < b), 1 (x > b)
  f_X(x) = I(a < x < b)/(b − a)
  E[X] = (a + b)/2, V[X] = (b − a)²/12
  M_X(s) = (e^{sb} − e^{sa}) / (s(b − a))

Normal N(μ, σ²)
  F_X(x) = Φ(x) = ∫_{−∞}^{x} φ(t) dt
  f_X(x) = φ(x) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²))
  E[X] = μ, V[X] = σ²
  M_X(s) = exp(μs + σ²s²/2)

Log-Normal ln N(μ, σ²)
  F_X(x) = 1/2 + (1/2) erf[(ln x − μ)/(√2 σ)]
  f_X(x) = (1/(x σ√(2π))) exp(−(ln x − μ)²/(2σ²))
  E[X] = e^{μ + σ²/2}, V[X] = (e^{σ²} − 1) e^{2μ + σ²}

Multivariate Normal MVN(μ, Σ)
  f_X(x) = (2π)^{−k/2} |Σ|^{−1/2} exp(−(1/2)(x − μ)^T Σ^{−1}(x − μ))
  E[X] = μ, V[X] = Σ
  M_X(s) = exp(μ^T s + (1/2) s^T Σ s)

Student's t Student(ν)
  f_X(x) = (Γ((ν + 1)/2)/(√(νπ) Γ(ν/2))) (1 + x²/ν)^{−(ν+1)/2}
  E[X] = 0 (ν > 1), V[X] = ν/(ν − 2) (ν > 2)

Chi-square χ²_k
  F_X(x) = γ(k/2, x/2)/Γ(k/2)
  f_X(x) = (1/(2^{k/2} Γ(k/2))) x^{k/2−1} e^{−x/2}
  E[X] = k, V[X] = 2k
  M_X(s) = (1 − 2s)^{−k/2} (s < 1/2)

F F(d₁, d₂)
  F_X(x) = I_{d₁x/(d₁x + d₂)}(d₁/2, d₂/2)
  f_X(x) = √((d₁x)^{d₁} d₂^{d₂} / (d₁x + d₂)^{d₁+d₂}) / (x B(d₁/2, d₂/2))
  E[X] = d₂/(d₂ − 2) (d₂ > 2), V[X] = 2d₂²(d₁ + d₂ − 2)/(d₁(d₂ − 2)²(d₂ − 4)) (d₂ > 4)

Exponential Exp(β)
  F_X(x) = 1 − e^{−x/β}
  f_X(x) = (1/β) e^{−x/β}
  E[X] = β, V[X] = β²
  M_X(s) = 1/(1 − βs) (s < 1/β)

Gamma Gamma(α, β)
  F_X(x) = γ(α, x/β)/Γ(α)
  f_X(x) = (1/(Γ(α) β^α)) x^{α−1} e^{−x/β}
  E[X] = αβ, V[X] = αβ²
  M_X(s) = (1/(1 − βs))^α (s < 1/β)

Inverse Gamma InvGamma(α, β)
  f_X(x) = (β^α/Γ(α)) x^{−α−1} e^{−β/x}
  E[X] = β/(α − 1) (α > 1), V[X] = β²/((α − 1)²(α − 2)) (α > 2)
  M_X(s) = (2(−βs)^{α/2}/Γ(α)) K_α(√(−4βs))

Dirichlet Dir(α)
  f_X(x) = (Γ(Σ_{i=1}^k α_i)/∏_{i=1}^k Γ(α_i)) ∏_{i=1}^k x_i^{α_i − 1}
  E[X_i] = α_i/Σ_j α_j, V[X_i] = E[X_i](1 − E[X_i])/(Σ_j α_j + 1)

Beta Beta(α, β)
  F_X(x) = I_x(α, β)
  f_X(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1}
  E[X] = α/(α + β), V[X] = αβ/((α + β)²(α + β + 1))
  M_X(s) = 1 + Σ_{k=1}^∞ (∏_{r=0}^{k−1} (α + r)/(α + β + r)) s^k/k!

Weibull Weibull(λ, k)
  F_X(x) = 1 − e^{−(x/λ)^k}
  f_X(x) = (k/λ)(x/λ)^{k−1} e^{−(x/λ)^k}
  E[X] = λ Γ(1 + 1/k), V[X] = λ²(Γ(1 + 2/k) − Γ(1 + 1/k)²)
  M_X(s) = Σ_{n=0}^∞ (s^n λ^n/n!) Γ(1 + n/k)

Pareto Pareto(x_m, α)
  F_X(x) = 1 − (x_m/x)^α (x ≥ x_m)
  f_X(x) = α x_m^α / x^{α+1} (x ≥ x_m)
  E[X] = α x_m/(α − 1) (α > 1), V[X] = x_m² α/((α − 1)²(α − 2)) (α > 2)
  M_X(s) = α(−x_m s)^α Γ(−α, −x_m s) (s < 0)
[Figure: PDFs of the continuous uniform, normal, log-normal, Student's t, χ², F, exponential, gamma, inverse gamma, beta, Weibull, and Pareto distributions for several parameter settings.]
2 Probability Theory

Definitions
  Sample space Ω
  Outcome (point or element) ω
  Event A ⊆ Ω
  σ-algebra 𝒜:
    1. ∅ ∈ 𝒜
    2. A₁, A₂, ... ∈ 𝒜 ⟹ ∪_{i=1}^∞ A_i ∈ 𝒜
    3. A ∈ 𝒜 ⟹ A^c ∈ 𝒜
  Probability distribution P:
    1. P[A] ≥ 0 for every A
    2. P[Ω] = 1
    3. P[⊔_{i=1}^∞ A_i] = Σ_{i=1}^∞ P[A_i] for disjoint A_i
  Probability space (Ω, 𝒜, P)

Properties
  P[∅] = 0
  B = Ω ∩ B = (A ∪ A^c) ∩ B = (A ∩ B) ∪ (A^c ∩ B)
  P[A^c] = 1 − P[A]
  P[B] = P[A ∩ B] + P[A^c ∩ B]
  P[Ω] = 1, P[∅] = 0
  (∪_n A_n)^c = ∩_n A_n^c and (∩_n A_n)^c = ∪_n A_n^c (DeMorgan)
  P[∪_n A_n] = 1 − P[∩_n A_n^c]
  P[A ∪ B] = P[A] + P[B] − P[A ∩ B]
  P[A ∩ B] ≥ P[A] + P[B] − 1
  P[A ∪ B] = P[A ∩ B^c] + P[A^c ∩ B] + P[A ∩ B]
  P[A ∩ B^c] = P[A] − P[A ∩ B]

Continuity of Probabilities
  A₁ ⊂ A₂ ⊂ ... ⟹ lim_{n→∞} P[A_n] = P[A] where A = ∪_{i=1}^∞ A_i
  A₁ ⊃ A₂ ⊃ ... ⟹ lim_{n→∞} P[A_n] = P[A] where A = ∩_{i=1}^∞ A_i

Independence
  A ⊥ B ⟺ P[A ∩ B] = P[A] P[B]

Conditional Probability
  P[A | B] = P[A ∩ B] / P[B]  if P[B] > 0

Law of Total Probability
  P[B] = Σ_{i=1}^n P[B | A_i] P[A_i]  where Ω = ⊔_{i=1}^n A_i

Bayes' Theorem
  P[A_i | B] = P[B | A_i] P[A_i] / Σ_{j=1}^n P[B | A_j] P[A_j]  where Ω = ⊔_{i=1}^n A_i

Inclusion-Exclusion Principle
  |∪_{i=1}^n A_i| = Σ_{r=1}^n (−1)^{r−1} Σ_{i₁ < ... < i_r ≤ n} |∩_{j=1}^r A_{i_j}|

3 Random Variables

Random Variable
  X : Ω → R

Probability Mass Function (PMF)
  f_X(x) = P[X = x] = P[{ω ∈ Ω : X(ω) = x}]

Probability Density Function (PDF)
  P[a ≤ X ≤ b] = ∫_a^b f(x) dx

Cumulative Distribution Function (CDF)
  F_X : R → [0, 1], F_X(x) = P[X ≤ x]
  1. Nondecreasing: x₁ < x₂ ⟹ F(x₁) ≤ F(x₂)
  2. Normalized: lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1
  3. Right-continuous: lim_{y↓x} F(y) = F(x)

Conditional density
  P[a ≤ Y ≤ b | X = x] = ∫_a^b f_{Y|X}(y | x) dy  (a ≤ b)
  f_{Y|X}(y | x) = f(x, y) / f_X(x)

Independence
  1. P[X ≤ x, Y ≤ y] = P[X ≤ x] P[Y ≤ y]
  2. f_{X,Y}(x, y) = f_X(x) f_Y(y)
3.1 Transformations

Transformation function
  Z = φ(X)

Discrete
  f_Z(z) = P[φ(X) = z] = P[{x : φ(x) = z}] = P[X ∈ φ^{−1}(z)] = Σ_{x ∈ φ^{−1}(z)} f(x)

Continuous
  F_Z(z) = P[φ(X) ≤ z] = ∫_{A_z} f(x) dx  with A_z = {x : φ(x) ≤ z}

Special case if φ strictly monotone
  f_Z(z) = f_X(φ^{−1}(z)) |d/dz φ^{−1}(z)| = f_X(x)/|φ'(x)| = f_X(x)/|J|

The Rule of the Lazy Statistician
  E[Z] = ∫ φ(x) dF_X(x)
  E[I_A(X)] = ∫ I_A(x) dF_X(x) = ∫_A dF_X(x) = P[X ∈ A]

Convolution
  Z := X + Y:  f_Z(z) = ∫_{−∞}^∞ f_{X,Y}(x, z − x) dx  (if X, Y ≥ 0: ∫_0^z)
  Z := |X − Y|:  f_Z(z) = 2 ∫_0^∞ f_{X,Y}(x, z + x) dx
  Z := X/Y:  f_Z(z) = ∫_{−∞}^∞ |x| f_{X,Y}(x, xz) dx

4 Expectation

Expectation
  E[X] = μ_X = ∫ x dF_X(x) = Σ_x x f_X(x) (X discrete), ∫ x f_X(x) dx (X continuous)
  P[X = c] = 1 ⟹ E[c] = c
  E[cX] = c E[X]
  E[X + Y] = E[X] + E[Y]
  E[XY] = ∫∫ xy f_{X,Y}(x, y) dF_X(x) dF_Y(y)
  E[φ(X)] ≠ φ(E[X]) in general (cf. Jensen inequality)
  P[X ≥ Y] = 1 ⟹ E[X] ≥ E[Y];  P[X = Y] = 1 ⟹ E[X] = E[Y]
  E[X] = Σ_{x=1}^∞ P[X ≥ x]  (X nonnegative integer-valued)

Sample mean
  X̄_n = (1/n) Σ_{i=1}^n X_i

Conditional Expectation
  E[Y | X = x] = ∫ y f(y | x) dy
  E[X] = E[E[X | Y]]
  E[φ(X, Y) | X = x] = ∫ φ(x, y) f_{Y|X}(y | x) dy
  E[φ(Y, Z) | X = x] = ∫∫ φ(y, z) f_{(Y,Z)|X}(y, z | x) dy dz
  E[Y + Z | X] = E[Y | X] + E[Z | X]
  E[φ(X) Y | X] = φ(X) E[Y | X]
  E[Y | X] = c ⟹ Cov[X, Y] = 0

5 Variance

Variance
  V[X] = σ²_X = E[(X − E[X])²] = E[X²] − E[X]²
  V[Σ_{i=1}^n X_i] = Σ_{i=1}^n V[X_i] + 2 Σ_{i≠j} Cov[X_i, X_j]
  V[Σ_{i=1}^n X_i] = Σ_{i=1}^n V[X_i]  if the X_i are independent

Standard deviation
  sd[X] = √V[X] = σ_X

Covariance
  Cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y]
  Cov[X, a] = 0
  Cov[X, X] = V[X]
  Cov[X, Y] = Cov[Y, X]
  Cov[aX, bY] = ab Cov[X, Y]
  Cov[X + a, Y + b] = Cov[X, Y]
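A minimal Monte Carlo check of the variance and covariance identities above (NumPy assumed; the distributions of X and Y are arbitrary choices for illustration):

```python
# Sketch: numerical check of V[X] = E[X^2] - E[X]^2 and Cov[aX, bY] = ab Cov[X, Y].
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.5, size=100_000)   # arbitrary distribution for X
y = 0.5 * x + rng.normal(size=x.size)    # Y correlated with X

# V[X] = E[X^2] - E[X]^2
print(np.var(x), np.mean(x**2) - np.mean(x)**2)

# Cov[aX, bY] = a * b * Cov[X, Y]
a, b = 3.0, -2.0
print(np.cov(a * x, b * y)[0, 1], a * b * np.cov(x, y)[0, 1])
```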

Xn m
X n X
X m limn Bin (n, p) = N (np, np(1 p)) (n large, p far from 0 and 1)
Cov Xi , Yj = Cov [Xi , Yj ]
Negative Binomial
i=1 j=1 i=1 j=1
X NBin (1, p) = Geo (p)
Correlation Pr
Cov [X, Y ] X NBin (r, p) = i=1 Geo (p)
[X, Y ] = p Xi NBin (ri , p) =
P P
Xi NBin ( ri , p)
V [X] V [Y ]
X NBin (r, p) . Y Bin (s + r, p) = P [X s] = P [Y r]
Independence
Poisson
X
Y = [X, Y ] = 0 Cov [X, Y ] = 0 E [XY ] = E [X] E [Y ] n
X n
X
!
Xi Po (i ) Xi Xj = Xi Po i
Sample variance i=1 i=1
n
1 X

S2 = (Xi Xn )2 n n
X X i
n 1 i=1 Xi Po (i ) Xi Xj = Xi Xj Bin Xj , Pn
j=1 j=1 j=1 j
Conditional Variance
Exponential
    2
V [Y | X] = E (Y E [Y | X])2 | X = E Y 2 | X E [Y | X] n
X
V [Y ] = E [V [Y | X]] + V [E [Y | X]] Xi Exp () Xi
Xj = Xi Gamma (n, )
i=1
Memoryless property: P [X > x + y | X > y] = P [X > x]
6 Inequalities Normal
 
X

Cauchy-Schwarz
2
X N , 2 = N (0, 1)

E [XY ] E X 2 E Y 2
   
 
X N , Z = aX + b = Z N a + b, a2 2
2
Markov
 
X N 1 , 12 Y N 2 , 22 = X + Y N 1 + 2 , 12 + 22

E [(X)]
P [(X) t]

Xi N i , i2 =
P
X N
P P 2

t  i i i i , i i

P [a < X b] = b a

Chebyshev
V [X]
P [|X E [X]| t] (x) = 1 (x) 0 (x) = x(x) 00 (x) = (x2 1)(x)
t2
1
Chernoff Upper quantile of N (0, 1): z = (1 )
e
 
P [X (1 + )] > 1 Gamma
(1 + )1+
X Gamma (, ) X/ Gamma (, 1)
Jensen P
Gamma (, ) i=1 Exp ()
E [(X)] (E [X]) convex P P
Xi Gamma (i , ) Xi
Xj = i Xi Gamma ( i i , )
Z
() 1 x
= x e dx
7 Distribution Relationships 0
Beta
Binomial
1 ( + ) 1
n x1 (1 x)1 = x (1 x)1
Xi Bern (p) =
X
Xi Bin (n, p) B(, ) ()()
  B( + k, ) +k1
E X k1
 
i=1 E Xk = =
X Bin (n, p) , Y Bin (m, p) = X + Y Bin (n + m, p) B(, ) ++k1
limn Bin (n, p) = Po (np) (n large, p small) Beta (1, 1) Unif (0, 1)
8 Probability and Moment Generating Functions Conditional mean and variance
X
E [X | Y ] = E [X] + (Y E [Y ])
 
GX (t) = E tX |t| < 1
" Y

#  
X (Xt)i X E Xi
ti
t
 Xt

MX (t) = GX (e ) = E e =E =
p
i! i! V [X | Y ] = X 1 2
i=0 i=0
P [X = 0] = GX (0)
P [X = 1] = G0X (0) 9.3 Multivariate Normal
(i)
GX (0) Covariance Matrix (Precision Matrix 1 )
P [X = i] =
i!
V [X1 ] Cov [X1 , Xk ]
E [X] = G0X (1 )
=
.. .. ..
  (k)
E X k = MX (0) . . .

X!
 Cov [Xk , X1 ] V [Xk ]
(k)
E = GX (1 )
(X k)! If X N (, ),
2
V [X] = G00X (1 ) + G0X (1 ) (G0X (1 ))  
1/2 1
GX (t) = GY (t) = X = Y
d
fX (x) = (2)n/2 || exp (x )T 1 (x )
2
Properties
9 Multivariate Distributions
Z N (0, 1) X = + 1/2 Z = X N (, )
9.1 Standard Bivariate Normal X N (, ) = 1/2 (X ) N (0, 1)

p X N (, ) = AX N A, AAT
Let X, Y N (0, 1) X
Z with Y = X + 1 2 Z 
X N (, ) a is vector of length k = aT X N aT , aT a
Joint density  2
x + y 2 2xy

1
f (x, y) = exp 10 Convergence
2(1 2 )
p
2 1 2
Conditionals Let {X1 , X2 , . . .} be a sequence of rvs and let X be another rv. Let Fn denote
the cdf of Xn and let F denote the cdf of X.
(Y | X = x) N x, 1 2 (X | Y = y) N y, 1 2
 
and
Types of Convergence
Independence D
X
Y = 0 1. In distribution (weakly, in law): Xn X

lim Fn (t) = F (t) t where F continuous


n
9.2 Bivariate Normal
  P
Let X N x , x2 and Y N y , y2 . 2. In probability: Xn X

1

z
 ( > 0) lim P [|Xn X| > ] = 0
n
f (x, y) = exp
2(1 2 )
p
2x y 1 2
as
3. Almost surely (strongly): Xn X
" 2  2   #
x x y y x x y y h i h i
z= + 2 P lim Xn = X = P : lim Xn () = X() = 1
x y x y n n
qm
4. In quadratic mean (L2 ): Xn X CLT Notations

lim E (Xn X)2 = 0


  Zn N (0, 1)
n
2
 
Xn N ,
Relationships n
2
 
qm
Xn X = Xn X = Xn X
P D Xn N 0,
n
as
Xn X = Xn X
P 2

D P
n(Xn ) N 0,
Xn X (c R) P [X = c] = 1 = Xn X
n(Xn )
Xn
P
X Yn
P
Y = Xn + Yn X + Y
P
N (0, 1)
qm qm qm
n
Xn X Yn Y = Xn + Yn X + Y
P P P
Xn X Yn Y = Xn Yn XY
Xn
P
X =
P
(Xn ) (X) Continuity Correction
x + 12
D D  
Xn X = (Xn ) (X)  
qm P Xn x
Xn b limn E [Xn ] = b limn V [Xn ] = 0 / n
qm
X1 , . . . , Xn iid E [X] = V [X] < Xn
x 12
 
 
P Xn x 1
Slutzkys Theorem / n
Delta Method
D P D
Xn X and Yn c = Xn + Yn X + c
2 2
   
D P D 0 2
Xn X and Yn c = Xn Yn cX Yn N , = (Yn ) N (), ( ())
D D D n n
In general: Xn X and Yn Y =
6 Xn + Yn X + Y

11 Statistical Inference
10.1 Law of Large Numbers (LLN) iid
Let X1 , , Xn F if not otherwise noted.
Let {X1 , . . . , Xn } be a sequence of iid rvs, E [X1 ] = , and V [X1 ] < .
11.1 Point Estimation
Weak (WLLN)
P Point estimator bn of is a rv: bn = g(X1 , . . . , Xn )
Xn as n h i
bias(bn ) = E bn
Strong (SLLN) P
as Consistency: bn
Xn as n
Sampling distribution: F (bn )
r h i
Standard error: se(n ) = V bn
b
10.2 Central Limit Theorem (CLT)
h i h i
Let {X1 , . . . , Xn } be a sequence of iid rvs, E [X1 ] = , and V [X1 ] = 2 . Mean squared error: mse = E (bn )2 = bias(bn )2 + V bn

limn bias(bn ) = 0 limn se(bn ) = 0 = bn is consistent


Xn n(Xn ) D
Zn := q   = Z where Z N (0, 1) bn D
V Xn Asymptotic normality: N (0, 1)
se
Slutzkys Theorem often lets us replace se(bn ) by some (weakly) consis-
lim P [Zn z] = (z) zR tent estimator
bn .
n 10
11.2 Normal-based Confidence Interval 11.4 Statistical Functionals
 
b 2 . Let z/2 = 1 (1 (/2)), i.e., P Z > z/2 = /2 Statistical functional: T (F )
 
Suppose bn N , se
 
and P z/2 < Z < z/2 = 1 where Z N (0, 1). Then Plug-in estimator of = T (F ) : bn = T (Fn )
R
Linear functional: T (F ) = (x) dFX (x)
Cn = bn z/2 se
b Plug-in estimator for linear functional:
Z n
1X
T (Fn ) =
(x) dFbn (x) = (Xi )
11.3 Empirical Distribution Function n i=1
 
Empirical Distribution Function (ECDF) b 2 = T (Fn ) z/2 se
Often: T (Fn ) N T (F ), se b
Pn
I(Xi x) pth quantile: F 1 (p) = inf{x : F (x) p}
i=1
Fbn (x) = = Xn
n
n
1 X
b2 =
(Xi Xn )2
n 1 i=1
(
1 Xi x
I(Xi x) = 1
Pn 3
0 Xi > x n i=1 (Xi )
=
b3 j

Pn
Properties (for any fixed x) (Xi Xn )(Yi Yn )
= qP i=1 qP
n 2 n
h i
i=1 (X i Xn ) i=1 (Yi Yn )
E Fn = F (x)
h i F (x)(1 F (x))
V Fn =
n
12 Parametric Inference
F (x)(1 F (x)) D 
mse = 0 Let F = f (x; : be a parametric model with parameter space Rk
n
P and parameter = (1 , . . . , k ).
Fn F (x)

Dvoretzky-Kiefer-Wolfowitz (DKW) Inequality (X1 , . . . , Xn F ) 12.1 Method of Moments


  j th moment Z
2
P sup F (x) Fn (x) > = 2e2n

j () = E X j = xj dFX (x)
 
x

Nonparametric 1 confidence band for F j th sample moment


n
1X j
j = X
L(x) = max{Fn n , 0} n i=1 i
U (x) = min{Fn + n , 1} Method of Moments Estimator (MoM)
s  
1 2
= log 1 () = 1
2n
2 () = 2
.. ..
.=.
P [L(x) F (x) U (x) x] 1 k () = k
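A sketch of the method of moments for a two-parameter model, matching the first two moments (NumPy assumed; the Gamma(α, β) parametrization with E[X] = αβ and V[X] = αβ² follows the table in Section 1.2, and the true parameter values are made up):

```python
# Sketch: method-of-moments estimates for Gamma(alpha, beta),
# solving m1 = alpha*beta and m2 - m1^2 = alpha*beta^2.
import numpy as np

rng = np.random.default_rng(1)
alpha_true, beta_true = 3.0, 2.0
x = rng.gamma(shape=alpha_true, scale=beta_true, size=10_000)

m1 = x.mean()                  # first sample moment
m2 = (x**2).mean()             # second sample moment
var_hat = m2 - m1**2           # implied variance

alpha_hat = m1**2 / var_hat
beta_hat = var_hat / m1
print(alpha_hat, beta_hat)
```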
Properties of the MoM estimator Equivariance: bn is the mle = (bn ) is the mle of ()
bn exists with probability tending to 1 Asymptotic normality:
P
p
Consistency: bn 1. se 1/In ()
Asymptotic normality: (bn ) D
N (0, 1)
D
se
n(b ) N (0, ) q
  b 1/In (bn )
2. se
where = gE Y Y T g T , Y = (X, X 2 , . . . , X k )T ,
1
(bn ) D
g = (g1 , . . . , gk ) and gj = j () N (0, 1)
se
b
Asymptotic optimality (or efficiency), i.e., smallest variance for large sam-
12.2 Maximum Likelihood ples. If en is any other estimator, the asymptotic relative efficiency is
Likelihood: Ln : [0, ) h i
V bn
n
Y are(en , bn ) = h i 1
Ln () = f (Xi ; ) V en
i=1
Approximately the Bayes estimator
Log-likelihood
n
`n () = log Ln () =
X
log f (Xi ; ) 12.2.1 Delta Method
i=1 b where is differentiable and 0 () 6= 0:
If = ()
Maximum Likelihood Estimator (mle)
n ) D
(b
N (0, 1)
Ln (bn ) = sup Ln () se(b
b )

Score Function where b = ()


b is the mle of and

s(X; ) = log f (X; )

b = 0 ()
se se(
b n )
b b

Fisher Information
I() = V [s(X; )] 12.3 Multiparameter Models
In () = nI() Let = (1 , . . . , k ) and b = (b1 , . . . , bk ) be the mle.
Fisher Information (exponential family)
2 `n 2 `n
  Hjj = Hjk =
2 j k
I() = E s(X; )
Fisher Information Matrix
Observed Fisher Information
E [H11 ] E [H1k ]

n
In () = .. .. ..
2 X

. . .
Inobs () =

log f (Xi ; )
2 i=1 E [Hk1 ] E [Hkk ]

Properties of the mle Under appropriate regularity conditions


P
Consistency: bn (b ) N (0, Jn )
12
with Jn () = In1 . Further, if bj is the j th component of , then 13 Hypothesis Testing
H0 : 0 versus H1 : 1
(bj j ) D
N (0, 1) Definitions
se
bj
Null hypothesis H0
h i Alternative hypothesis H1
b 2j = Jn (j, j) and Cov bj , bk = Jn (j, k)
where se Simple hypothesis = 0
Composite hypothesis > 0 or < 0
Two-sided test: H0 : = 0 versus H1 : 6= 0
One-sided test: H0 : 0 versus H1 : > 0
12.3.1 Multiparameter Delta Method Critical value c
Test statistic T
Let = (1 , . . . , k ) be a function and let the gradient of be Rejection Region R = {x : T (x) > c}
Power function () = P [X R]

Power of a test: 1 P [Type II error] = 1 = inf ()
1
1 Test size: = P [Type I error] = sup ()
.
..
= 0

Retain H0 Reject H0

k H0 true Type
I error ()
H1 true Type II error () (power)
p-value
Suppose =b 6= 0 and b = ().
b Then,

p-value = sup0 P [T (X) T (x)] = inf : T (x) R
P [T (X ? ) T (X)]

p-value = sup0 = inf : T (X) R
) D
(b
N (0, 1)
| {z }
se(b
b ) 1F (T (X)) since T (X ? )F

p-value evidence
where < 0.01 very strong evidence against H0
0.01 0.05 strong evidence against H0
r
 T   0.05 0.1 weak evidence against H0
se(b
b ) =
Jn
> 0.1 little or no evidence against H0
Wald Test
Two-sided test
and Jn = Jn () = b.

b and
= b 0
Reject H0 when |W | > z/2 where W =
  se
b
P |W | > z/2
p-value = P0 [|W | > |w|] P [|Z| > |w|] = 2(|w|)
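A minimal sketch of the two-sided Wald test above, taking the sample mean as the estimator and its plug-in standard error (NumPy/SciPy assumed; the data and μ₀ are hypothetical):

```python
# Sketch: Wald test for H0: mu = mu0 using the sample mean and its standard error.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(loc=0.3, scale=1.0, size=200)   # hypothetical data
mu0 = 0.0

theta_hat = x.mean()
se_hat = x.std(ddof=1) / np.sqrt(len(x))
w = (theta_hat - mu0) / se_hat                 # Wald statistic

alpha = 0.05
reject = abs(w) > norm.ppf(1 - alpha / 2)      # reject if |W| > z_{alpha/2}
p_value = 2 * norm.cdf(-abs(w))                # p-value = 2 * Phi(-|w|)
print(w, reject, p_value)
```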
12.4 Parametric Bootstrap
Likelihood Ratio Test (LRT)

Sample from f (x; bn ) instead of from Fn , where bn could be the mle or method sup Ln () Ln (bn )
T (X) = =
of moments estimator.
k
D
X iid xn = (x1 , . . . , xn )
(X) = 2 log T (X) 2rq where Zi2 2k with Z1 , . . . , Zk N (0, 1)
Prior density f ()
 i=1  Likelihood f (xn | ): joint density of the data
p-value = P0 [(X) > (x)] P 2rq > (x) n
Y
In particular, X n iid = f (xn | ) = f (xi | ) = Ln ()
Multinomial LRT
i=1
Posterior density f ( | xn )
 
X1 Xk
Let pn = ,..., be the mle
Normalizing constant cn = f (xn ) = f (x | )f () d
R
n n
k  Xj Kernel: part of a density that depends Ron
Ln (pn ) Y pj
T (X) = = Ln ()f ()
Posterior Mean n = f ( | xn ) d = R Ln ()f
R
Ln (p0 ) j=1
p0j () d
k  
X pj D
(X) = 2 Xj log 2k1 14.1 Credible Intervals
j=1
p 0j

The approximate size LRT rejects H0 when (X) 2k1, 1 Posterior Interval
2
Pearson Test Z b
n
P [ (a, b) | x ] = f ( | xn ) d = 1
k
X (Xj E [Xj ])2 a
T = where E [Xj ] = np0j under H0
j=1
E [Xj ] 1 Equal-tail Credible Interval
D
T 2k1 Z a Z
f ( | xn ) d = f ( | xn ) d = /2
 
p-value = P 2k1 > T (x)
D
b
2
Faster Xk1 than LRT, hence preferable for small n
1 Highest Posterior Density (HPD) region Rn
Independence Testing
1. P [ Rn ] = 1
I rows, J columns, X multinomial sample of size n = I J
X 2. Rn = { : f ( | xn ) > k} for some k
mles unconstrained: pij = nij
X
mles under H0 : p0ij = pi pj = Xni nj Rn is unimodal = Rn is an interval
 
PI PJ nX
LRT: = 2 i=1 j=1 Xij log Xi Xijj
PI PJ (X E[X ])2
Pearson 2 : T = i=1 j=1 ijE[Xij ]ij
14.2 Function of Parameters
D
LRT and Pearson 2k , where = (I 1)(J 1) Let = () and A = { : () }.
Posterior CDF for
Z
14 Bayesian Inference H(r | xn ) = P [() | xn ] = f ( | xn ) d
A
Bayes Theorem
Posterior Density
f (x | )f () f (x | )f () h( | xn ) = H 0 ( | xn )
f ( | x) = n
=R Ln ()f ()
f (x ) f (x | )f () d
Bayesian Delta Method
Definitions  
| X n N (),
b seb 0 ()
b
n
X = (X1 , . . . , Xn )

14.3 Priors Continuous likelihood (subscript c denotes constant)
Likelihood Conjugate Prior Posterior hyperparameters
Choice 
Uniform(0, ) Pareto(xm , k) max x(n) , xm , k + n
n
Subjective Bayesianism: prior should incorporate as much detail as possible Exponential() Gamma(, ) + n, +
X
xi
the researchs a priori knowledge via prior elicitation. i=1
Objective Bayesianism: prior should incorporate as little detail as possible  Pn   
0 i=1 xi 1 n
(non-informative prior). Normal(, c2 ) Normal(0 , 02 ) + / + 2 ,
2 2 02 c
Robust Bayesianism: consider various priors and determine sensitivity of  0 c1
1 n
our inferences to changes in the prior. + 2
02 c
Pn
02 + i=1 (xi )2
Types Normal(c , 2 ) Scaled Inverse Chi- + n,
+n
square(, 02 )
Flat: f () constant + nx n
R Normal(, 2 ) Normal- , + n, + ,
Proper: f () d = 1 +n 2
scaled Inverse n
(x )2
R
Improper: f () d = 1X 2
Gamma(, , , ) + (xi x) +
Jeffreys prior (transformation-invariant): 2 i=1 2(n + )
1
1 1
1 1

p p MVN(, c ) MVN(0 , 0 ) 0 + nc 0 0 + n x ,
f () I() f () det(I())
1 1
1

0 + nc
n
Conjugate: f () and f ( | xn ) belong to the same parametric family X
MVN(c , ) Inverse- n + , + (xi c )(xi c )T
Wishart(, ) i=1
n
X xi
14.3.1 Conjugate Priors Pareto(xmc , k) Gamma(, ) + n, + log
i=1
xm c
Discrete likelihood Pareto(xm , kc ) Pareto(x0 , k0 ) x0 , k0 kn where k0 > kn
Xn
Likelihood Conjugate Prior Posterior hyperparameters Gamma(c , ) Gamma(0 , 0 ) 0 + nc , 0 + xi
n n i=1
X X
Bernoulli(p) Beta(, ) + xi , + n xi
i=1
Xn n
X
i=1
n
X
14.4 Bayesian Testing
Binomial(p) Beta(, ) + xi , + Ni xi If H0 : 0 :
i=1 i=1 i=1
n
X
Z
Negative Binomial(p) Beta(, ) + rn, + xi Prior probability P [H0 ] = f () d
n
i=1 Z0
Posterior probability P [H0 | xn ] = f ( | xn ) d
X
Poisson() Gamma(, ) + xi , + n
0
i=1
n
X
Multinomial(p) Dirichlet() + x(i)
i=1 Let H0 , . . . , HK1 be K hypotheses. Suppose f ( | Hk ),
n
f (xn | Hk )P [Hk ]
X
Geometric(p) Beta(, ) + n, + xi
P [Hk | xn ] = PK ,
n
k=1 f (x | Hk )P [Hk ]
i=1
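A sketch of one row of the conjugate-prior table above, the Beta-Bernoulli update, together with an equal-tail credible interval as in 14.1 (NumPy/SciPy assumed; the data and prior hyperparameters are made-up illustrative values):

```python
# Sketch: Beta(alpha, beta) prior for Bernoulli(p) data gives the posterior
# Beta(alpha + sum x_i, beta + n - sum x_i).
import numpy as np
from scipy.stats import beta

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # hypothetical Bernoulli observations
alpha0, beta0 = 2.0, 2.0                  # prior hyperparameters (assumed)

alpha_n = alpha0 + x.sum()
beta_n = beta0 + len(x) - x.sum()

post = beta(alpha_n, beta_n)
print(post.mean())                        # posterior mean of p
print(post.ppf(0.025), post.ppf(0.975))   # equal-tail 95% credible interval
```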
Marginal Likelihood 1. Estimate VF [Tn ] with VFn [Tn ].
Z 2. Approximate VFn [Tn ] using simulation:
f (xn | Hi ) = f (xn | , Hi )f ( | Hi ) d

(a) Repeat the following B times to get Tn,1 , . . . , Tn,B , an iid sample from
Posterior Odds (of Hi relative to Hj ) the sampling distribution implied by Fn

P [Hi | xn ] f (xn | Hi ) P [Hi ] i. Sample uniformly X1 , . . . , Xn Fn .


= ii. Compute Tn = g(X1 , . . . , Xn ).
P [Hj | xn ] f (xn | Hj ) P [Hj ]
(b) Then
| {z } | {z }
Bayes Factor BFij prior odds

B B
!2
Bayes Factor 1 X 1 X
log10 BF10 BF10 evidence vboot = VFn = Tn,b T
B B r=1 n,r
b=1
0 0.5 1 1.5 Weak
0.5 1 1.5 10 Moderate
12 10 100 Strong
16.1.1 Bootstrap Confidence Intervals
>2 > 100 Decisive
p
1p BF 10 Normal-based Interval
p = p where p = P [H1 ] and p = P [H1 | xn ]
1 + 1p BF10
Tn z/2 se
boot

15 Exponential Family Pivotal Interval


Scalar parameter
1. Location parameter = T (F )
fX (x | ) = h(x) exp {()T (x) A()} 2. Pivot Rn = bn
= h(x)g() exp {()T (x)} 3. Let H(r) = P [Rn r] be the cdf of Rn

4. Let Rn,b = bn,b bn . Approximate H using bootstrap:
Vector parameter
( s
)
B
1 X
X
fX (x | ) = h(x) exp i ()Ti (x) A() H(r) =
I(Rn,b r)
i=1 B
b=1
= h(x) exp {() T (x) A()}
= h(x)g() exp {() T (x)} 5. Let denote the sample quantile of (bn,1

, . . . , bn,B )
Natural form 6. Let r denote the sample quantile of (Rn,1

, . . . , Rn,B ), i.e., r = bn
 
fX (x | ) = h(x) exp { T(x) A()} 7. Then, an approximate 1 confidence interval is Cn = a, b with
= h(x)g() exp { T(x)}  
= h(x)g() exp T T(x) a = bn H 1 1 =
bn r1/2 =
2bn 1/2

2

b = bn H 1 =
bn r/2 =
2bn /2
2
16 Sampling Methods
Percentile Interval
16.1 The Bootstrap  

Cn = /2 , 1/2
Let Tn = g(X1 , . . . , Xn ) be a statistic.
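A sketch of the nonparametric bootstrap from 16.1, estimating the standard error of a statistic and forming the normal-based and percentile intervals of 16.1.1 (NumPy assumed; the sample, the choice of the median as the statistic, and B are illustrative):

```python
# Sketch: bootstrap standard error and confidence intervals for the sample median.
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=100)     # hypothetical sample
t_hat = np.median(x)

B = 2000
t_boot = np.array([np.median(rng.choice(x, size=len(x), replace=True))
                   for _ in range(B)])       # T*_{n,1}, ..., T*_{n,B}

se_boot = t_boot.std(ddof=1)
z = 1.96                                     # z_{alpha/2} for alpha = 0.05
normal_ci = (t_hat - z * se_boot, t_hat + z * se_boot)
percentile_ci = tuple(np.quantile(t_boot, [0.025, 0.975]))
print(se_boot, normal_ci, percentile_ci)
```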
16.2 Rejection Sampling Decision rule: synonymous for an estimator b
Action a A: possible value of the decision rule. In the estimation
Setup
context, the action is just an estimate of , (x).
b
We can easily sample from g() Loss function L: consequences of taking action a when true state is or
We want to sample from h(), but it is difficult discrepancy between and , b L : A [k, ).
k()
We know h() up to proportional constant: h() = R Loss functions
k() d
Envelope condition: we can find M > 0 such that k() M g() Squared error loss: L(, a) = ( a)2
(
K1 ( a) a < 0
Algorithm Linear loss: L(, a) =
K2 (a ) a 0
1. Draw cand g() Absolute error loss: L(, a) = | a| (linear loss with K1 = K2 )
2. Generate u Unif (0, 1) Lp loss: L(, a) = | a|p
k(cand ) (
3. Accept cand if u 0 a=
M g(cand ) Zero-one loss: L(, a) =
1 a 6=
4. Repeat until B values of cand have been accepted
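A sketch of the rejection-sampling algorithm above (NumPy/SciPy assumed; the unnormalized target k(θ), the normal proposal g, and the envelope constant M are all made-up choices for illustration, with M picked large enough that k(θ) ≤ M g(θ) for this particular example):

```python
# Sketch: rejection sampling from a target known only up to a constant.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

def k(theta):                                  # unnormalized target (assumed example)
    return np.exp(-theta**2 / 2) * (1 + np.sin(3 * theta)**2)

def g_pdf(theta):                              # proposal density N(0, 1.5^2)
    return norm.pdf(theta, loc=0, scale=1.5)

M = 8.0                                        # envelope: k(theta) <= M * g_pdf(theta) here

samples = []
while len(samples) < 1000:
    cand = rng.normal(0, 1.5)                  # 1. draw candidate from g
    u = rng.uniform()                          # 2. u ~ Unif(0, 1)
    if u <= k(cand) / (M * g_pdf(cand)):       # 3. accept with probability k / (M g)
        samples.append(cand)                   # 4. repeat until enough accepted
samples = np.array(samples)
```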

Example 17.1 Risk


We can easily sample from the prior g() = f () Posterior Risk
Target is the posterior with h() k() = f (xn | )f () Z h i
Envelope condition: f (xn | ) f (xn | bn ) = Ln (bn ) M r(b | x) = L(, (x))f
b ( | x) d = E|X L(, (x))
b

Algorithm
(Frequentist) Risk
1. Draw cand f ()
Z
2. Generate u Unif (0, 1)
h i
R(, )
b = L(, (x))f
b (x | ) dx = EX| L(, (X))
b
Ln (cand )
3. Accept cand if u
Ln (bn ) Bayes Risk
ZZ
16.3 Importance Sampling
h i
r(f, )
b = L(, (x))f
b (x, ) dx d = E,X L(, (X))
b
Sample from an importance function g rather than target density h.
Algorithm to obtain an approximation to E [q() | xn ]:
h h ii h i
r(f, )
b = E EX| L(, (X)
b = E R(, )b
iid
1. Sample from the prior 1 , . . . , n f ()
h h ii h i
r(f, )
b = EX E|X L(, (X)
b = EX r(b | X)
Ln (i )
2. For each i = 1, . . . , B, calculate wi = PB
i=1 Ln (i )
n
PB 17.2 Admissibility
3. E [q() | x ] i=1 q(i )wi
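A sketch of the importance-sampling algorithm above for a posterior mean, sampling from the prior and weighting by normalized likelihoods (NumPy/SciPy assumed; the normal model, known σ = 1, and the N(0, 3²) prior are illustrative assumptions):

```python
# Sketch: self-normalized importance sampling of E[q(theta) | x^n] with q(theta) = theta.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
x = rng.normal(loc=1.0, scale=1.0, size=50)     # hypothetical data, sigma known = 1

B = 10_000
theta = rng.normal(0.0, 3.0, size=B)            # 1. draw from the prior

log_lik = np.array([norm.logpdf(x, loc=t, scale=1.0).sum() for t in theta])
w = np.exp(log_lik - log_lik.max())             # 2. likelihood weights (stabilized)
w /= w.sum()

post_mean = np.sum(theta * w)                   # 3. weighted average approximates the posterior mean
print(post_mean)
```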
b0 dominates b if
: R(, b0 ) R(, )
b
17 Decision Theory
: R(, b0 ) < R(, )
b
Definitions
b is inadmissible if there is at least one other estimator b0 that dominates
Unknown quantity affecting our decision: it. Otherwise it is called admissible.
17.3 Bayes Rule Residual Sums of Squares (rss)
Bayes Rule (or Bayes Estimator) n
X
rss(b0 , b1 ) = 2i
r(f, )
b = inf e r(f, )

e
i=1
R
(x) = inf r( | x) x = r(f, )
b b b = r(b | x)f (x) dx
Least Square Estimates
Theorems
bT = (b0 , b1 )T : min rss

b0 ,
b1
Squared error loss: posterior mean
Absolute error loss: posterior median
Zero-one loss: posterior mode b0 = Yn b1 Xn
Pn Pn
(Xi Xn )(Yi Yn ) i=1 Xi Yi nXY
17.4 Minimax Rules b1 = i=1 Pn 2
= n
(X X ) 2
P 2
i=1 i n i=1 Xi nX
Maximum Risk
 
0
h i
R()
b = sup R(, ) R(a) = sup R(, a) E b | X n =
b 1

2 n1 ni=1 Xi2 X n
h i  P 
Minimax Rule n
V |X =
b
e = inf sup R(, )
b = inf R()
sup R(, ) e nsX X n 1
e e
r Pn
2
i=1 Xi

b
se(
b b0 ) =
b = Bayes rule c : R(, )
b =c sX n n


b
Least Favorable Prior se(
b b1 ) =
sX n
bf = Bayes rule R(, bf ) r(f, bf ) Pn Pn
where s2X = n1 i=1 (Xi X n )2 and
b2 = 1
n2 2i
i=1  an (unbiased) estimate
of . Further properties:
18 Linear Regression
P P
Consistency: b0 0 and b1 1
Definitions
Asymptotic normality:
Response variable Y
Covariate X (aka predictor variable or feature) b0 0 D b1 1 D
N (0, 1) and N (0, 1)
se(
b b0 ) se(
b b1 )
18.1 Simple Linear Regression
Approximate 1 confidence intervals for 0 and 1 are
Model
Yi = 0 + 1 Xi + i E [i | Xi ] = 0, V [i | Xi ] = 2 b0 z/2 se(
b b0 ) and b1 z/2 se(
b b1 )
Fitted Line
rb(x) = b0 + b1 x The Wald test for testing H0 : 1 = 0 vs. H1 : 1 6= 0 is: reject H0 if
|W | > z/2 where W = b1 /se(
b b1 ).
Predicted (Fitted) Values
Ybi = rb(Xi ) R2
Pn b 2
Pn 2
Residuals i=1 (Yi Y )  rss
2
= 1 Pn i=1 i 2 = 1
 
i = Yi Ybi = Yi b0 + b1 Xi R = Pn 2
i=1 (Yi Y ) i=1 (Yi Y )
tss
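A sketch of the closed-form simple linear regression quantities from 18.1 (NumPy assumed; the simulated data and true coefficients are illustrative, and s_X² uses the 1/n convention stated above):

```python
# Sketch: least squares estimates, standard errors, Wald statistic, and R^2 for y = b0 + b1*x.
import numpy as np

rng = np.random.default_rng(6)
n = 200
x = rng.uniform(0, 10, size=n)
y = 1.5 + 0.8 * x + rng.normal(scale=2.0, size=n)    # hypothetical data

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()

resid = y - (b0 + b1 * x)
sigma2_hat = np.sum(resid**2) / (n - 2)              # unbiased estimate of sigma^2
sx2 = np.mean((x - x.mean())**2)                     # s_X^2 (1/n convention)

se_b1 = np.sqrt(sigma2_hat / (n * sx2))
se_b0 = np.sqrt(sigma2_hat * np.mean(x**2) / (n * sx2))

W = b1 / se_b1                                       # Wald statistic for H0: beta1 = 0
r2 = 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)
print(b0, b1, se_b0, se_b1, W, r2)
```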
Likelihood If the (k k) matrix X T X is invertible,
n n n
Y Y Y b = (X T X)1 X T Y
L= f (Xi , Yi ) = fX (Xi ) fY |X (Yi | Xi ) = L1 L2 h i
i=1 i=1 i=1 V b | X n = 2 (X T X)1
n
b N , 2 (X T X)1
Y 
L1 = fX (Xi )
i=1
n
(
2
) Estimate regression function
Y
n 1 X
L2 = fY |X (Yi | Xi ) exp 2 Yi (0 1 Xi ) k
2 i X
i=1 rb(x) = bj xj
j=1
Under the assumption of Normality, the least squares estimator is also the mle
2
Unbiased estimate for
n
1X 2 n
b2 =
 1 X 2
n i=1 i b2 =
  = X b Y
n k i=1 i

18.2 Prediction mle


nk 2

b = X b2 =

Observe X = x of the covarite and want to predict their outcome Y . n
1 Confidence Interval
Yb = b0 + b1 x bj z/2 se(
b bj )
h i h i h i h i
V Yb = V b0 + x2 V b1 + 2x Cov b0 , b1
18.4 Model Selection
Prediction Interval  Pn Consider predicting a new observation Y for covariates X and let S J
2

2 2 i=1 (Xi X ) denote a subset of the covariates in the model, where |S| = k and |J| = n.
n =
b P +1
n i (Xi X)2 j
b
Issues
Underfitting: too few covariates yields high bias
Yb z/2 bn
Overfitting: too many covariates yields high variance

18.3 Multiple Regression Procedure


1. Assign a score to each model
Y = X + 
2. Search through all models to find the one with the highest score
where Hypothesis Testing
X11 X1k 1 1
.. .. = ...
.. .. H0 : j = 0 vs. H1 : j 6= 0 j J
X= . =.

. .
Xn1 Xnk k n Mean Squared Prediction Error (mspe)
Likelihood h i

1
 mspe = E (Yb (S) Y )2
2 n/2
L(, ) = (2 ) exp 2 rss
2
Prediction Risk
N
X n
X n
X h i
rss = (y X)T (y X) = ||Y X||2 = (Yi xTi )2 R(S) = mspei = E (Ybi (S) Yi )2
i=1 i=1 i=1 19
Training Error 19 Non-parametric Function Estimation
n
X
R
btr (S) = (Ybi (S) Yi )2 19.1 Density Estimation
i=1 R
Estimate f (x), where f (x) = P [X A] = A f (x) dx.
2
R Integrated Square Error (ise)
Pn b 2
R i=1 (Yi (S) Y )
rss(S) btr (S) Z  2 Z
R2 (S) = 1 =1 =1 P n 2 L(f, fbn ) = f (x) fbn (x) dx = J(h) + f 2 (x) dx
i=1 (Yi Y )
tss tss

The training error is a downward-biased estimate of the prediction risk. Frequentist Risk
h i Z Z
h i R(f, fbn ) = E L(f, fbn ) = b2 (x) dx + v(x) dx
E R btr (S) < R(S)
h i
h
i n
X h i b(x) = E fbn (x) f (x)
bias(R btr (S) R(S) = 2
btr (S)) = E R Cov Ybi , Yi h i
i=1 v(x) = V fbn (x)

Adjusted R2
19.1.1 Histograms
2 n 1 rss
R (S) = 1
n k tss Definitions
Mallows Cp statistic Number of bins m
1
Binwidth h = m
R(S)
b =R 2 = lack of fit + complexity penalty
btr (S) + 2kb Bin Bj has j observations
R
Define pbj = j /n and pj = Bj f (u) du
Akaike Information Criterion (AIC)
Histogram Estimator
m
AIC(S) = bS2 )
`n (bS , k X pbj
fbn (x) = I(x Bj )
j=1
h
Bayesian Information Criterion (BIC) h i p
j
E fbn (x) =
k h
bS2 ) log n
BIC(S) = `n (bS , h i p (1 p )
j j
2 V fbn (x) =
nh2
h2
Z
Validation and Training 2 1
R(fn , f )
b (f 0 (u)) du +
12 nh
m
X n n !1/3
R
bV (S) = (Ybi (S) Yi )2 m = |{validation data}|, often or 1 6
i=1
4 2 h = 1/3 R 2 du
n (f 0 (u))
 2/3 Z 1/3
Leave-one-out Cross-validation C 3 2
R (fbn , f ) 2/3 C= (f 0 (u)) du
n n
!2 n 4
X
2
X Yi Ybi (S)
R
bCV (S) = (Yi Yb(i) ) = Cross-validation estimate of E [J(h)]
i=1 i=1
1 Uii (S) Z n m
2 2Xb 2 n+1 X 2
JCV (h) = fn (x) dx
b b f(i) (Xi ) = pb
U (S) = XS (XST XS )1 XS (hat matrix) n i=1 (n 1)h (n 1)h j=1 j
19.1.2 Kernel Density Estimator (KDE) k-nearest Neighbor Estimator
Kernel K 1 X
rb(x) = Yi where Nk (x) = {k values of x1 , . . . , xn closest to x}
k
i:xi Nk (x)
K(x) 0
Nadaraya-Watson Kernel Estimator
R
K(x) dx = 1
R
xK(x) dx = 0 n
X
rb(x) = wi (x)Yi
R 2 2
x K(x) dx K >0
i=1
xxi

KDE K
wi (x) = h  [0, 1]
n
Pn xxj
j=1 K
 
1X1 x Xi h
fbn (x) = K
n i=1 h h h4
Z 4 Z 
f 0 (x)
2
Z Z rn , r)
R(b x2 K 2 (x) dx r00 (x) + 2r0 (x) dx
1 4 00 2 1 4 f (x)
R(f, fn ) (hK )
b (f (x)) dx + K 2 (x) dx
4 nh 2 K 2 (x) dx
Z R
2/5 1/5 1/5 + dx
nhf (x)
Z Z
c c2 c3
h = 1 c1 = 2
K , c 2 = K 2
(x) dx, c 3 = (f 00 (x))2 dx c1
n1/5 h
Z 4/5 Z 1/5 n1/5
c4 5 2 2/5 c2
R (f, fbn ) = 4/5 c4 = (K ) K 2 (x) dx (f 00 )2 dx R (b
rn , r) 4/5
n 4 n
| {z }
C(K)

Cross-validation estimate of E [J(h)]


Epanechnikov Kernel
n n
X X (Yi rb(xi ))2
(Yi rb(i) (xi ))2 =
(
3
|x| < 5 JbCV (h) = !2
K(x) = 4 5(1x2 /5)
i=1 i=1 K(0)
0 otherwise 1 Pn  xx 
j
j=1 K h
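A sketch of the kernel density estimator from 19.1.2 with a Gaussian kernel (NumPy assumed; the bandwidth below is Silverman's rule of thumb, an assumed plug-in choice rather than the cross-validation bandwidth discussed here):

```python
# Sketch: KDE f_hat(x) = (1/n) * sum_i K((x - X_i)/h) / h with a Gaussian kernel.
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(size=500)                          # hypothetical sample

def kde(x, data, h):
    u = (x[:, None] - data[None, :]) / h             # (x - X_i) / h
    k = np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)       # Gaussian kernel
    return k.mean(axis=1) / h

h = 1.06 * data.std(ddof=1) * len(data) ** (-1 / 5)  # Silverman's rule (assumption)
grid = np.linspace(-4, 4, 200)
f_hat = kde(grid, data, h)
```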

Cross-validation estimate of E [J(h)]


19.3 Smoothing Using Orthogonal Functions
n n n  
1 X X Xi Xj
Z
2 2Xb 2 Approximation
JbCV (h) = fn (x) dx
b f(i) (Xi ) 2
K + K(0)
n i=1 hn i=1 j=1 h nh
X J
X
r(x) = j j (x) j j (x)
Z j=1 i=1
K (x) = K (2) (x) 2K(x) K (2) (x) = K(x y)K(y) dy Multivariate Regression
Y = +

19.2 Non-parametric Regression 0 (x1 ) J (x1 )
.. .. ..
where i = i and = . . .
Estimate f (x), where f (x) = E [Y | X = x]. Consider pairs of points
(x1 , Y1 ), . . . , (xn , Yn ) related by 0 (xn ) J (xn )
Least Squares Estimator
Yi = r(xi ) + i
b = (T )1 T Y
E [i ] = 0
1
V [i ] = 2 T Y (for equallly spaced observations only)
n
Cross-validation estimate of E [J(h)] 20.2 Poisson Processes
2
Xn J
X Poisson Process
R
bCV (J) = Yi j (xi )bj,(i)
i=1 j=1
{Xt : t [0, )} number of events up to and including time t
X0 = 0
20 Stochastic Processes Independent increments:
Stochastic Process
( t0 < < tn : Xt1 Xt0

Xtn Xtn1
{0, 1, . . . } = Z discrete
{Xt : t T } T =
[0, ) continuous
Intensity function (t)
Notations: Xt , X(t)
State space X P [Xt+h Xt = 1] = (t)h + o(h)
Index set T P [Xt+h Xt = 2] = o(h)
Rt
Xs+t Xs Po (m(s + t) m(s)) where m(t) = 0
(s) ds
20.1 Markov Chains
Markov Chain Homogeneous Poisson Process
P [Xn = x | X0 , . . . , Xn1 ] = P [Xn = x | Xn1 ] n T, x X
(t) = Xt Po (t) >0
Transition probabilities

pij P [Xn+1 = j | Xn = i] Waiting Times


pij (n) P [Xm+n = j | Xm = i] n-step
Wt := time at which Xt occurs
Transition matrix P (n-step: Pn )
(i, j) element is pij  
1
pij > 0 Wt Gamma t,
P
i pij = 1

Chapman-Kolmogorov Interarrival Times


X
pij (m + n) = pij (m)pkj (n) St = Wt+1 Wt
k

Pm+n = Pm Pn  
1
Pn = P P = Pn St Exp

Marginal probability

n = (n (1), . . . , n (N )) where i (i) = P [Xn = i]


St
0 , initial distribution
n = 0 Pn Wt1 Wt t
21 Time Series 21.1 Stationary Time Series
Mean function Z
Strictly stationary
xt = E [xt ] = xft (x) dx
P [xt1 c1 , . . . , xtk ck ] = P [xt1 +h c1 , . . . , xtk +h ck ]
Autocovariance function

x (s, t) = E [(xs s )(xt t )] = E [xs xt ] s t k N, tk , ck , h Z

x (t, t) = E (xt t )2 = V [xt ]


 
Weakly stationary
Autocorrelation function (ACF)  
E x2t < t Z
 2
Cov [xs , xt ] (s, t) E xt = m t Z
(s, t) = p =p
V [xs ] V [xt ] (s, s)(t, t) x (s, t) = x (s + r, t + r) r, s, t Z

Cross-covariance function (CCV) Autocovariance function


xy (s, t) = E [(xs xs )(yt yt )]
(h) = E [(xt+h )(xt )] h Z
 
Cross-correlation function (CCF) (0) = E (xt )2
xy (s, t) (0) 0
xy (s, t) = p (0) |(h)|
x (s, s)y (t, t)
(h) = (h)
Backshift operator
B k (xt ) = xtk Autocorrelation function (ACF)
Difference operator
d = (1 B)d Cov [xt+h , xt ] (t + h, t) (h)
x (h) = p =p =
V [xt+h ] V [xt ] (t + h, t + h)(t, t) (0)
White Noise
2
wt wn(0, w ) Jointly stationary time series
iid 2

Gaussian: wt N 0, w
xy (h) = E [(xt+h x )(yt y )]
E [wt ] = 0 t T
V [wt ] = 2 t T
w (s, t) = 0 s 6= t s, t T xy (h)
xy (h) = p
x (0)y (h)
Random Walk
Drift Linear Process
Pt
xt = t + j=1 wj
X
X
E [xt ] = t xt = + j wtj where |j | <
j= j=
Symmetric Moving Average
k
X k
X
X
2
mt = aj xtj where aj = aj 0 and aj = 1 (h) = w j+h j
j=k j=k j=
21.2 Estimation of Correlation 21.3.1 Detrending
Sample mean Least Squares
n
1X
x = xt 1. Choose trend model, e.g., t = 0 + 1 t + 2 t2
n t=1
2. Minimize rss to obtain trend estimate bt = b0 + b1 t + b2 t2
Sample variance 3. Residuals , noise wt
n  
1 X |h|
V [x] = 1 x (h) Moving average
n n
h=n
1
The low-pass filter vt is a symmetric moving average mt with aj = 2k+1 :
Sample autocovariance function
k
nh 1 X
1 X vt = xt1

b(h) = (xt+h x)(xt x) 2k + 1
n t=1 i=k

1
Pk
Sample autocorrelation function If 2k+1 i=k wtj 0, a linear trend function t = 0 + 1 t passes
without distortion

b(h)
b(h) = Differencing

b(0)
t = 0 + 1 t = xt = 1
Sample cross-variance function
nh
1 X 21.4 ARIMA models

bxy (h) = (xt+h x)(yt y)
n t=1 Autoregressive polynomial

Sample cross-correlation function (z) = 1 1 z p zp z C p 6= 0


bxy (h) Autoregressive operator
bxy (h) = p
bx (0)b
y (0)
(B) = 1 1 B p B p
Properties
Autoregressive model order p, AR (p)
1
bx (h) = if xt is white noise
n xt = 1 xt1 + + p xtp + wt (B)xt = wt
1
bxy (h) = if xt or yt is white noise AR (1)
n
k1
X k,||<1 X
21.3 Non-Stationary Time Series xt = k (xtk ) + j (wtj ) = j (wtj )
j=0 j=0
Classical decomposition model | {z }
linear process
P
xt = t + st + wt E [xt ] = j=0 j (E [wtj ]) = 0
2 h
w
t = trend (h) = Cov [xt+h , xt ] = 12
(h)
st = seasonal component (h) = (0) = h
wt = random noise term (h) = (h 1) h = 1, 2, . . .
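A sketch of the AR(1) model above, comparing the sample autocorrelation (using the 1/n convention of 21.2) with the theoretical ρ(h) = φ^h (NumPy assumed; φ and the series length are illustrative):

```python
# Sketch: simulate x_t = phi * x_{t-1} + w_t and compare sample ACF with phi^h.
import numpy as np

rng = np.random.default_rng(8)
phi, n = 0.7, 5000
w = rng.normal(size=n)

x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + w[t]

def sample_acf(x, max_lag):
    x = x - x.mean()
    n = len(x)
    c0 = np.sum(x * x) / n
    return np.array([np.sum(x[h:] * x[:n - h]) / n / c0 for h in range(max_lag + 1)])

print(sample_acf(x, 5))            # close to [1, 0.7, 0.49, 0.343, ...]
print(phi ** np.arange(6))
```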
Moving average polynomial Seasonal ARIMA
(z) = 1 + 1 z + + q zq z C q 6= 0 Denoted by ARIMA (p, d, q) (P, D, Q)s
Moving average operator P (B s )(B)D d s
s xt = + Q (B )(B)wt

(B) = 1 + 1 B + + p B p
21.4.1 Causality and Invertibility
MA (q) (moving average model order q) P
ARMA (p, q) is causal (future-independent) {j } : j=0 j < such that
xt = wt + 1 wt1 + + q wtq xt = (B)wt
q
X
xt = wtj = (B)wt
X
E [xt ] = j E [wtj ] = 0
j=0
j=0
( Pqh P
2
w j=0 j j+h 0hq ARMA (p, q) is invertible {j } : j=0 j < such that
(h) = Cov [xt+h , xt ] =
0 h>q

X
MA (1) (B)xt = Xtj = wt
xt = wt + wt1 j=0

2 2
(1 + )w h = 0

Properties
(h) = w 2
h=1


0 h>1 ARMA (p, q) causal roots of (z) lie outside the unit circle
(

2 h=1
(z)
(h) = (1+ )
X
0 h>1 (z) = j z j = |z| 1
j=0
(z)
ARMA (p, q)
xt = 1 xt1 + + p xtp + wt + 1 wt1 + + q wtq ARMA (p, q) invertible roots of (z) lie outside the unit circle

(B)xt = (B)wt X (z)
(z) = j z j = |z| 1
Partial autocorrelation function (PACF) j=0
(z)
xh1
i , regression of xi on {xh1 , xh2 , . . . , x1 }
Behavior of the ACF and PACF for causal and invertible ARMA models
hh = corr(xh xh1
h , x0 xh1
0 ) h2
E.g., 11 = corr(x1 , x0 ) = (1) AR (p) MA (q) ARMA (p, q)
ARIMA (p, d, q) ACF tails off cuts off after lag q tails off
d xt = (1 B)d xt is ARMA (p, q) PACF cuts off after lag p tails off q tails off
(B)(1 B)d xt = (B)wt
Exponentially Weighted Moving Average (EWMA) 21.5 Spectral Analysis
xt = xt1 + wt wt1 Periodic process

X xt = A cos(2t + )
xt = (1 )j1 xtj + wt when || < 1
j=1 = U1 cos(2t) + U2 sin(2t)
xn+1 = (1 )xn + xn
Frequency index (cycles per unit time), period 1/
Amplitude A Discrete Fourier Transform (DFT)
Phase n
X
U1 = A cos and U2 = A sin often normally distributed rvs d(j ) = n1/2 xt e2ij t
i=1
Periodic mixture
q
X Fourier/Fundamental frequencies
xt = (Uk1 cos(2k t) + Uk2 sin(2k t))
j = j/n
k=1

Uk1 , Uk2 , for k = 1, . . . , q, are independent zero-mean rvs with variances k2 Inverse DFT
n1
Pq X
(h) = k=1 k2 cos(2k h) xt = n1/2 d(j )e2ij t
  Pq
(0) = E x2t = k=1 k2 j=0

Spectral representation of a periodic process Periodogram


I(j/n) = |d(j/n)|2
(h) = 2 cos(20 h)
Scaled Periodogram
2 2i0 h 2 2i0 h
= e + e 4
2 2 P (j/n) = I(j/n)
Z 1/2 n
e2ih dF ()
!2 !2
= 2X
n
2X
n
1/2 = xt cos(2tj/n + xt sin(2tj/n
n t=1 n t=1
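A sketch of the periodogram at the Fourier frequencies j/n, computed via the FFT so that d(ω_j) matches the scaled DFT above and I(j/n) = |d(j/n)|² (NumPy assumed; the test signal with a cosine at frequency 0.1 plus white noise is illustrative):

```python
# Sketch: periodogram I(j/n) = |d(j/n)|^2 at the Fourier frequencies.
import numpy as np

rng = np.random.default_rng(9)
n = 256
t = np.arange(n)
x = 2 * np.cos(2 * np.pi * 0.1 * t) + rng.normal(size=n)

d = np.fft.fft(x) / np.sqrt(n)        # d(omega_j), j = 0, ..., n-1
I = np.abs(d) ** 2                    # periodogram
freqs = np.arange(n) / n              # Fourier frequencies omega_j = j/n

peak = freqs[np.argmax(I[1:n // 2]) + 1]
print(peak)                           # close to 0.1
```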
Spectral distribution function

0 < 0

22 Math
F () = 2 /2 < 0

2
0 22.1 Gamma Function
Z
F () = F (1/2) = 0 Ordinary: (s) = ts1 et dt
F () = F (1/2) = (0) 0 Z
Spectral density Upper incomplete: (s, x) = ts1 et dt
x
Z x

X 1 1 Lower incomplete: (s, x) = ts1 et dt
f () = (h)e2ih
2 2 0
h=
( + 1) = () >1
P R 1/2
Needs |(h)| < = (h) = e2ih f () d h = 0, 1, . . . (n) = (n 1)! nN
h= 1/2
f () 0 (1/2) =
f () = f ()
f () = f (1 ) 22.2 Beta Function
R 1/2 Z 1
(0) = V [xt ] = 1/2 f () d (x)(y)
Ordinary: B(x, y) = B(y, x) = tx1 (1 t)y1 dt =
2
White noise: fw () = w 0 (x + y)
Z x
ARMA (p, q) , (B)xt = (B)wt : Incomplete: B(x; a, b) = ta1 (1 t)b1 dt
0
|(e2i )|2
2 Regularized incomplete:
fx () = w a+b1
|(e2i )|2 B(x; a, b) a,bN X (a + b 1)!
Pp Pq Ix (a, b) = = xj (1 x)a+b1j
where (z) = 1 k=1 k z k and (z) = 1 + k=1 k z k B(a, b) j=a
j!(a + b 1 j)!
I0 (a, b) = 0 I1 (a, b) = 1 Stirling numbers, 2nd kind
Ix (a, b) = 1 I1x (b, a)         (
n n1 n1 n 1 n=0
=k + 1kn =
k k k1 0 0 else
22.3 Series
Finite Binomial Partitions
n n   n
n(n + 1) n
X
Pn+k,k = Pn,i k > n : Pn,k = 0 n 1 : Pn,0 = 0, P0,0 = 1
X X
n
k= =2
2 k i=1
k=1 k=0
n n 
f :BU D = distinguishable, D = indistinguishable.
  
X X r+k r+n+1 Balls and Urns
(2k 1) = n2 =
k n
k=1 k=0
n n    
X n(n + 1)(2n + 1) X k n+1 |B| = n, |U | = m f arbitrary f injective f surjective f bijective
k2 = =
6 m m+1
k=1 k=0 ( (
mn m n
 
n  2 Vandermondes Identity: n n n! m = n
X n(n + 1) B : D, U : D m m!
k3 = r  
m n
 
m+n

0 else m 0 else
2
X
k=1 =
n k rk r (
cn+1 1 k=0
     
X n+n1 m n1 1 m=n
ck = c 6= 1 Binomial Theorem: B : D, U : D
c1 n  
n nk k n n m1 0 else
k=0
X
a b = (a + b)n
k m  
(   (
k=0 X n 1 mn n 1 m=n
B : D, U : D
k 0 else m 0 else
Infinite k=1
m
( (
X 1 X p 1 mn 1 m=n
pk = pk =
X
, |p| < 1 B : D, U : D Pn,k Pn,m
1p 1p 0 else 0 else
k=0 k=1 k=1

!  
X d X d 1 1
kpk1 = pk
= = |p| < 1
dp dp 1 p 1 p2 References
k=0 k=0
 
X r+k1 k
x = (1 x)r r N+ [1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory.
k
k=0 Brooks Cole, 1972.
 
X k
p = (1 + p) |p| < 1 , C [2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships.
k
k=0 The American Statistician, 62(1):4553, 2008.

22.4 Combinatorics [3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications
With R Examples. Springer, 2006.
Sampling
[4] A. Steger. Diskrete Strukturen Band 1: Kombinatorik, Graphentheorie,
k out of n w/o replacement w/ replacement Algebra. Springer, 2001.
k1
Y n! [5] A. Steger. Diskrete Strukturen Band 2: Wahrscheinlichkeitstheorie und
ordered nk = (n i) = nk Statistik. Springer, 2002.
i=0
(n k)!
 
n nk n!

n1+r
 
n1+r
 [6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference.
unordered = = = Springer, 2003.
k k! k!(n k)! r n1
Univariate distribution relationships, courtesy of Leemis and McQueston [2].