Lecture Note Lectures All Lectures Prof D Wolfson
Ryan Ordille
These course notes are for MATH 323: Probability, offered at McGill University
in the Winter 2012 semester, taught by Professor David Wolfson. These notes are
simply what the instructor wrote on the board, and may contain errors. The notes
are written by me, Ryan Ordille, and no copyright infringement is intended. Please
don’t sell these notes, or pass them off as your own.
The notes up until 28 February (the first class after the break and midterm) are concise
– i.e. without review or obvious proofs. Most pre-midterm examples are left out
since, by the time these notes were typed up towards the end of the course, their
solutions were obvious.
These course notes can be found at https://github.com/ryanordille/m323w12.
Thank you to everyone who submitted corrections, and good luck!
1 17 January
1. P (E) ≥ 0
2. P (S) = 1
3. For pairwise disjoint events E_1, E_2, . . . (E_i ∩ E_j = ∅ for i ≠ j):
P(∪_{i=1}^∞ E_i) = Σ_{i=1}^∞ P(E_i)
4. Theorem 1: P (∅) = 0
5. Theorem 2: P (E c ) = 1 − P (E)
6. Theorem 3: P (E ∩ F c ) = P (E) − P (E ∩ F )
7. Theorem 4: If E ⊂ F , then P (E) ≤ P (F ).
8. Theorem 5: If E and F are any two events, then P (E ∪ F ) = P (E) + P (F ) −
P (E ∩ F ).
Word problem hints:
• “either/or” corresponds to a union of two events
• “at least one” – union
• “not” – complement
• “and” – intersection
• “proportion” – “a probability statement about an individual”
2 19 January
Theorem 6: Let S be a sample space (set of all possible outcomes) with finitely
many outcomes N . If all these outcomes are equally likely, then, if E is any
event,
P(E) = (number of ways that E can occur) / (total number of possible outcomes N)
Counting the number of ways that E can occur (and N ) can be difficult without the
following counting rules.
2.1.1 Factorial
By definition,
n! = n × (n − 1) × · · · × 2 × 1 where 0! = 1
2.1.3 n choose k
Suppose that a set contains n distinct objects. Then, the number of ways to select
k objects from these n objects, if we sample without replacement, is denoted by C(n, k)
(“n choose k”). It turns out that:
C(n, k) = n! / (k!(n − k)!)
Note that here the order of selections is not important, e.g. the selection (A, B) is
equivalent to the selection (B, A).
2.1.4 Permutations
If the order is important in the previous example, this is denoted by P^n_k, and
P^n_k = n! / (n − k)!
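These counting rules can be checked directly in Python (a quick sketch, not from the lecture; `math.comb` and `math.perm` require Python 3.8+):

```python
import math
from itertools import combinations, permutations

n, k = 5, 2

# n choose k: unordered selections without replacement
assert math.comb(n, k) == math.factorial(n) // (math.factorial(k) * math.factorial(n - k))
assert math.comb(n, k) == len(list(combinations("ABCDE", k)))  # (A,B) same as (B,A)

# P^n_k: ordered selections without replacement
assert math.perm(n, k) == math.factorial(n) // math.factorial(n - k)
assert math.perm(n, k) == len(list(permutations("ABCDE", k)))  # (A,B) distinct from (B,A)

print(math.comb(n, k), math.perm(n, k))  # 10 20
```

Note that P^n_k = C(n, k) × k!, since each unordered selection can be arranged in k! orders.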
3 24 January
This lecture just contained examples of the previous counting rules.
4 26 January
Very often, if we know that some event A has occurred, then this will affect the
probability that the event B will occur.
Definition: for two events A and B, with P(A) ≠ 0, we define the probability of
“B given A”, denoted by P(B | A), as
P(B | A) = P(A ∩ B) / P(A)
4.2 Notes
(1) Conditional probabilities satisfy the three axioms, just like unconditional prob-
abilities.
(2) (the multiplication rule for conditional probabilities) Let A and B be
two events. Then,
P(A ∩ B) = P(B | A)P(A) = P(A | B)P(B)
(3) When solving conditional probability problems, use the conditioning technique
to break a long intersection into a series of conditional and unconditional probabili-
ties.
5 31 January
Note that it’s not true in general that P (A | (B1 ∪B2 )) = P (A | B1 )+P (A | B2 ) even
if B1 ∩ B2 = ∅. However, it is true that P (B1 ∪ B2 |A) = P (B1 |A) + P (B2 |A).
2. ∪_{i=1}^m B_i = S (i.e. B_1, B_2, . . . , B_m form a partition of S)
Then,
P(A) = Σ_{i=1}^m P(A | B_i)P(B_i)
Notice here that the left-hand side (P (A)) may be difficult to find directly, while the
components of the right-hand side might be known or easy to find.
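As a sketch of how the right-hand side is assembled, consider a hypothetical setup (the machines and all numbers below are my own illustration, not from the course):

```python
# Hypothetical: a part comes from one of three machines (B_1, B_2, B_3, a
# partition of S), and A is the event that the part is defective.
P_B = [0.5, 0.3, 0.2]             # P(B_i); these must sum to 1
P_A_given_B = [0.01, 0.02, 0.05]  # P(A | B_i), known for each machine

# Law of total probability: P(A) = sum_i P(A | B_i) P(B_i)
P_A = sum(pa * pb for pa, pb in zip(P_A_given_B, P_B))
print(round(P_A, 6))  # 0.021
```

Each conditional probability is easy to state per machine, while P(A) alone would be hard to write down directly.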
6 02 February
6.1 Independence
Sometimes, knowing that an event A has occurred will not affect the probability
that B will occur. In such a situation, we say that A and B are independent. More
formally, two events A, B are said to be independent (A ⊥ B) if and only if
P (B | A) = P (B) or P (A | B) = P (A)
Theorem: if A and B are disjoint, then A and B can only be independent if either
P (A) = 0 or P (B) = 0.
There’s another important (albeit less intuitive) definition of independence: A ⊥ B
if and only if P (A ∩ B) = P (A)P (B).
More generally, events A1 , A2 , . . . , An are mutually independent if and only if, for
every subset Ai1 , Ai2 , . . . , Aik of A1 , A2 , . . . , An ,
P(∩_{j=1}^k A_{i_j}) = ∏_{j=1}^k P(A_{i_j})
7 07 February
8 09 February
9 14 February
The total probability (mass) is equally or uniformly spread out on each of the num-
bers ai .
10 28 February
Firstly, we have a sequence of n independent trials – that is, the outcomes of these
trials are mutually independent.
Secondly, each trial can result in exactly one of two possible outcomes: a “success”
(S) or a “failure” (F). We call such trials Bernoulli trials.
Thirdly, the probability of success at trial i is constant and equal to p for every
i = 1, 2, . . . , n. For example, in a coin toss, we cannot change the probability of
heads halfway through, so the probability of success is constant.
Theorem: Let X be the number of successes observed in these n trials. Then,
P(X = x) = C(n, x) p^x (1 − p)^{n−x} for x = 0, 1, 2, . . . , n.
Proof: (same idea for the genetic mutation example) First, note that the probability
of any particular configuration in which there are x successes (and n − x failures) is
just px (1 − p)n−x .
E.g.:
P(S_1 ∩ S_2 ∩ . . . ∩ S_x ∩ F_{x+1} ∩ F_{x+2} ∩ . . . ∩ F_n) = P(S_1)P(S_2) . . . P(S_x)P(F_{x+1}) . . . P(F_n)
= p × p × . . . × p × (1 − p) × . . . × (1 − p) = p^x (1 − p)^{n−x}
But,
P(X = x) = Σ_{all configurations i} P(configuration i) = Σ_{all configurations} p^x (1 − p)^{n−x}
Since each configuration is defined by a choice of x objects from n total objects, then
for every x = 0, 1, . . . , n:
P(X = x) = C(n, x) p^x (1 − p)^{n−x}
Suppose that the five year survival probability for lung cancer is .10. If thirty people
with lung cancer are sampled, what is the probability that at least three will survive
five years or longer?
Solution: Let Y = the number out of thirty who will survive five or more years.
We shall reasonably assume the binomial setup, with n = 30 and P (Si ) = .10, where
Si is the event where the ith subject survives five or more years. Therefore,
P(Y ≥ 3) = Σ_{y=3}^{30} C(30, y)(.10)^y (1 − .10)^{30−y} = 1 − Σ_{y=0}^{2} C(30, y)(.10)^y (1 − .10)^{30−y}
Important note: do not immediately have the “burning desire” to use the binomial
distribution as soon as you see a bunch of trials, each of which can result in exactly
one of two outcomes. You must check if the trials are independent – they will not
be if you are sampling without replacement!
Observe that the Bernoulli distribution is a special case of the binomial distribution,
where n = 1. So, P_X(x) = p^x (1 − p)^{1−x} for x = 0, 1.
The random variable X is said to have a Poisson distribution with parameter λ > 0
if
P(X = x) = λ^x e^{−λ} / x! for x = 0, 1, 2, . . .
(Note the infinite range of x)
Check that:
1. P_X(x) ≥ 0 – obvious
2. Σ_{x=0}^∞ P_X(x) = 1 – Taylor series expansion for e^λ
The Poisson distribution arises as an approximation to the binomial for “large n and
small p”.
P(X = x) ≈ λ^x e^{−λ} / x! for x = 0, 1, 2, . . . , where λ = np
(e.g. p = 6/n: np = n(6/n) = 6)
Proof: We shall use the following result in proving our theorem.
lim_{n→∞} (1 + a/n)^n = e^a
P(X = x) = C(n, x) p^x (1 − p)^{n−x} for x = 0, 1, . . . , n
(∗) = [n! / (x!(n − x)!)] p^x (1 − p)^n (1 − p)^{−x}
Now since λ = np, we have p = λ/n. Then, (∗) becomes:
[n! / (x!(n − x)!)] (λ^x / n^x) (1 − λ/n)^n (1 − λ/n)^{−x}
After some algebraic manipulation, this becomes:
(1/x!) [n(n − 1)(n − 2) . . . (n − x + 1) / (n · n · · · n)] λ^x (1 − λ/n)^n (1 − λ/n)^{−x}
Notice that the second term has x terms on both the top and the bottom. Now, let
n go to infinity to get the limit:
λ^x e^{−λ} / x!
11 01 March
P(X = x) = λ^x e^{−λ} / x! where x = 0, 1, 2, . . .
Suppose that in a book of 1000 pages, on any particular page, there can be either
zero errors or one error. Suppose further that the probability of an error on any
particular page is .002. What is the approximate probability that there will be at
most three errors in the book?
Exact solution: There is a binomial setup – errors are likely to occur independently
amongst the n = 1000 trials. Each trial can result in either a success (an error is
found) or a failure (an error is not found). There is also a constant probability of
success (.002) for every trial. Let X be the number of successes in 1000 pages. Then
X ∼ Bin(1000, .002).
P(X ≤ 3) = Σ_{x=0}^{3} C(1000, x)(.002)^x (1 − .002)^{1000−x}
Approximate solution: Since n is large and p is small, we can use the Poisson
approximation with λ = n × p = 1000 × .002 = 2. Therefore:
P(X ≤ 3) ≈ Σ_{x=0}^{3} P(X = x) = Σ_{x=0}^{3} 2^x e^{−2} / x!
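A sketch comparing the exact binomial answer to the Poisson(2) approximation (my own check, not from the lecture):

```python
import math

n, p = 1000, 0.002
lam = n * p  # λ = np = 2

# Exact Bin(1000, .002) probability of at most 3 errors
exact = sum(math.comb(n, x) * p**x * (1 - p)**(n - x) for x in range(4))
# Poisson(2) approximation of the same quantity
approx = sum(lam**x * math.exp(-lam) / math.factorial(x) for x in range(4))
print(round(exact, 4), round(approx, 4))
```

Both come out near 0.857, so the approximation is excellent here despite being vastly easier to compute by hand.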
The geometric random variable is used to describe or model the trial number at
which the first success occurs in a sequence of independent Bernoulli trials, each
with the probability of success p.
The probability distribution of a random variable provides the complete story about
the random variable. There is nothing more to know about a random variable once we
know its probability distribution. However, we often wish to summarize a probability
distribution. The two most common summaries are:
1. a parameter that discusses the “centre of distribution” and
2. a parameter that discusses how spread out the values are from the centre.
To this end, we give the following definition:
Expected value: let Y be a discrete random variable with the probability function
P_Y(y). Then, we define the expected value of Y, denoted by E(Y), to be
E(Y) = Σ_{all y} y P_Y(y) = Σ_{all y} y P(Y = y)
12 06 March
12.1 Review
12.2 Theory
Notes:
1. E(Y ) is often denoted by µY , and also is termed the mean of Y (also called
the “population mean”).
2. Interpretation: E(Y ) is a weighted average or mean of the possible values of
the random variable Y , where the weights are the probabilities of these values.
For example, in the special case where Y has a discrete uniform distribution
a1 , a2 , . . . , aN , then
E(Y) = Σ_{i=1}^N a_i P(Y = a_i) = (1/N) Σ_{i=1}^N a_i
The point of this: by definition, E(g(Y)) = E(X) = Σ_{all x} x P(X = x). Thus,
in order to find E(g(Y)), it would appear that we first need to find the prob-
ability distribution of X = g(Y). We’ll see such transformations later – they
can be difficult. However, you do not have to first find the distribution of X – you can
use the distribution of the original Y and sum g(y)P(Y = y).
In particular, if g(Y) = Y^k, then E(g(Y)) = E(Y^k) is called the kth moment
of Y. µ is the first moment of Y.
E(Y^k) = Σ_{all y} y^k P(Y = y)
12.3 Examples
12.3.1 Example 1
Suppose a company insures a computer worth $1000, and with probability .05 the
computer will be stolen. What premium should the insurance company charge so
that its expected gain is 0?
Solution: Let c be the required premium. Let Y be the gain of the company in a
given year. We need to find the value of c such that E(Y ) = 0.
First, we need PY (y) for all y. We have P (Y = c) = .95 (where the computer
was not stolen) and P (Y = (c − 1000)) = .05 (where the computer was stolen).
Therefore,
E(Y) = c × .95 + (c − 1000) × .05
Setting E(Y ) = 0 and solving for c, we get c = 50. Thus, if the company charges
$50 for the policy, on average, over a large number of clients, they would neither lose
nor gain money.
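The break-even premium can be verified with one line of arithmetic (a sketch of the check, not course code):

```python
# Gain Y is c with prob .95 (not stolen) and c - 1000 with prob .05 (stolen).
# E(Y) = c(.95) + (c - 1000)(.05) = c - 50, which is 0 exactly at c = 50.
c = 50
expected_gain = c * 0.95 + (c - 1000) * 0.05
print(abs(expected_gain) < 1e-9)  # True
```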
12.3.2 Example 2
“Nuts and Bolts” Example: suppose that the random variable X has the proba-
bility distribution:
P (X = −1.2) = .32
P (X = 2.6) = .40
P (X = 0) = .28
13 08 March
13.1 Summary
Centre – E(X) = µ
Spread – Var(X) = σ²
Standard Deviation – √(σ²) = σ
Var(X) = Σ_{all x} (x − µ)² P(X = x) = Σ_{all x} x² P(X = x) − µ²
13.2.1 Binomial
E(X) = µ = Σ_{x=1}^n x C(n, x) p^x (1 − p)^{n−x}
= Σ [n(n − 1)! / ((x − 1)!(n − 1 − (x − 1))!)] p · p^{x−1} (1 − p)^{n−1−(x−1)}   as (n − x)! = (n − 1 − (x − 1))!
= np Σ C(n − 1, x − 1) p^{x−1} (1 − p)^{n−1−(x−1)}
= np Σ_{y=0}^{n−1} C(n − 1, y) p^y (1 − p)^{n−1−y}
= np
Note that x² will not cancel with the leading terms of x! as before, so we have to use
a trick. We first calculate E(X(X − 1)), which is easy. Then, notice that
E(X(X − 1)) = Σ_{x=2}^n x(x − 1) C(n, x) p^x (1 − p)^{n−x}
= n(n − 1)p² Σ [(n − 2)! / ((x − 2)!(n − 2 − (x − 2))!)] p^{x−2} (1 − p)^{n−2−(x−2)}
= n(n − 1)p² Σ C(n − 2, x − 2) p^{x−2} (1 − p)^{n−2−(x−2)}
= n(n − 1)p²
Then Var(X) = E(X(X − 1)) + E(X) − µ² = n(n − 1)p² + np − (np)² = np(1 − p).
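Both moments can be sanity-checked by brute-force summation over the p.m.f. (a sketch with my own choice of n and p):

```python
import math

n, p = 10, 0.3

# The full Bin(10, 0.3) probability mass function
pmf = [math.comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]

mean = sum(x * pmf[x] for x in range(n + 1))                      # should be np
factorial_moment = sum(x * (x - 1) * pmf[x] for x in range(n + 1))  # n(n-1)p^2
var = factorial_moment + mean - mean**2                           # np(1-p)

print(round(mean, 10), round(var, 10))  # 3.0 2.1
```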
13.2.2 Bernoulli
In particular, the mean and variance of a Bernoulli random variable are p and p(1−p)
respectively (because n = 1).
13.2.3 Poisson
E(X) = Σ_{x=1}^∞ x λ^x e^{−λ} / x!
= e^{−λ} λ Σ_{x−1=0}^∞ λ^{x−1} / (x − 1)!
= λ
For the variance, we use the trick we used for the variance of a binomial distribution.
First, we solve for E(X(X − 1)):
E(X(X − 1)) = Σ_{x=2}^∞ x(x − 1) λ^x e^{−λ} / x!
= λ² Σ x(x − 1) λ^{x−2} e^{−λ} / (x(x − 1)(x − 2)!)
= λ² Σ λ^{x−2} e^{−λ} / (x − 2)!
= λ², as the sum is over a Poisson distribution in y = x − 2 and so equals 1
13.2.4 Geometric
E(X) = Σ_{x=1}^∞ x p(1 − p)^{x−1}
using our trick... = p Σ_{x=1}^∞ x(1 − p)^{x−1}
= p Σ_{x=1}^∞ −(d/dp)(1 − p)^x
= −p (d/dp) Σ_{x=1}^∞ (1 − p)^x   (we can interchange the derivative and the sum)
= −p (d/dp) [(1 − p) / (1 − (1 − p))]   (geometric series with r = 1 − p)
= −p (d/dp)(1/p − 1)
= 1/p
For the variance, first find E(X(X − 1)) (2 derivatives), and then you can find the
variance.
Var(X) = (1 − p)/p²
P (X = x) = FX (x) − P (X < x)
14 13 March
In short, F_X(x) = ∫_{−∞}^x f_X(y) dy.
(2) Conversely, we can recover the p.d.f. from the c.d.f. by the Fundamental Theo-
rem of Calculus, since:
(d/dx) F_X(x) = F′_X(x) = f_X(x) ∀x
Moreover, for small ∆x,
[F_X(x + ∆x) − F_X(x)] / ∆x ≈ f_X(x)
So it follows that:
FX (x + ∆x) − FX (x) ≈ ∆xfX (x)
But the left hand side is just P (x < X ≤ x + ∆x). Finally, we have that fX (x)∆x
is approximately the probability that X lies in (x, x + ∆x].
Note that, because the p.d.f. does not represent a probability on its own, it can be
greater than 1 (though it can never be negative). When multiplied by a small ∆x, it gives an approximate
probability.
Any nonnegative function whose total area equals 1 can qualify as a p.d.f., even if some points
are larger than 1.
(4) For continuous random variables, we define (replacing the discrete sum Σ_{all x} g(x)P_X(x)):
E(g(X)) = ∫_{−∞}^∞ g(x) f_X(x) dx
As before, when g(X) = X k , then we refer to the E(g(X)) = E(X k ) as the kth
moment. Of particular importance are the first moment (k = 1) and the second
moment (k = 2). Again, as before, we call the first moment the mean or the expected
value of X.
So, by definition:
E(X) = µ_X = ∫_{−∞}^∞ x f_X(x) dx
and
E(X²) = ∫_{−∞}^∞ x² f_X(x) dx
Finally, as before:
Var(X) = E((X − µ_X)²) = ∫_{−∞}^∞ (x − µ_X)² f_X(x) dx = E(X²) − µ²_X
and
Var(X) = ∫_{−∞}^∞ x² f_X(x) dx − (∫_{−∞}^∞ x f_X(x) dx)²
Note: watch out for a p.d.f. that may change its form in different ranges of (−∞, ∞)
when carrying out an integration.
14.3 Examples
14.3.1 Example 1
Let fX (x) = c(x2 + 1) for 0 < x < 1 and fX (x) = 0 elsewhere (c is a constant).
1. Find c.
2. Find P (.25 < X ≤ .50).
3. Find P (.25 < X < .50).
4. Find FX .
5. Find E(X) and σX .
(1) Since ∫_{−∞}^∞ f_X(x) dx = 1, we must have:
∫_{−∞}^0 0 dx + ∫_0^1 c(x² + 1) dx + ∫_1^∞ 0 dx = 1
so c(1/3 + 1) = 1, giving c = 3/4.
(3) For a continuous p.d.f., these two values are the same (by rules of integra-
tion).
P(.25 < X < .50) = P(.25 < X ≤ .50) = 55/256
(4)
F_X(x) = 0 ∀x ≤ 0
F_X(x) = ∫_{−∞}^x f_X(y) dy ∀ 0 < x < 1
= ∫_{−∞}^0 0 dy + ∫_0^x (.75)(y² + 1) dy
= (.75)(x³/3 + x)
F_X(x) = 1 ∀x ≥ 1
(5)
E(X) = ∫_{−∞}^∞ x f_X(x) dx
= ∫_0^1 x(.75)(x² + 1) dx
= (.75)(x⁴/4 + x²/2) |_0^1
= (.75)(.25 + .50)
= 9/16
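All of the answers in this example can be double-checked by numerical integration (a sketch; the midpoint-rule helper is mine, not from the course):

```python
def f(x, c=0.75):
    """f_X(x) = c(x^2 + 1) on (0, 1), using c = 3/4 from part (1)."""
    return c * (x**2 + 1) if 0 < x < 1 else 0.0

def integrate(g, a, b, steps=100000):
    """Simple midpoint-rule numerical integration."""
    h = (b - a) / steps
    return sum(g(a + (i + 0.5) * h) for i in range(steps)) * h

print(round(integrate(f, 0, 1), 4))                    # total area: 1.0
print(round(integrate(f, 0.25, 0.50), 4))              # P(.25 < X < .50) = 55/256
print(round(integrate(lambda x: x * f(x), 0, 1), 4))   # E(X) = 9/16 = 0.5625
```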
15 15 March
1. The p.d.f. is constant on [a, b]. On the graph, the height is constantly 1/(b − a), so
the area under the curve is 1. The probability is uniformly spread out in the
interval.
2. The uniform distribution is often used to model situations in which we believe
outcomes occur completely at random.
3. An important special case is the uniform distribution [0, 1].
4. Notation – we write X ∼ U (a, b) to mean that X has a uniform distribution
on the interval [a, b].
5. The c.d.f. of X is:
F_X(x) = 0 for x < a
= (x − a)/(b − a) for a ≤ x ≤ b
= 1 for b < x
The graph of the c.d.f. is 0 up to a, then grows linearly up to b, then is
constantly 1 after.
6.
µ = E(X) = ∫_{−∞}^∞ x f_X(x) dx
= 0 + ∫_a^b x · 1/(b − a) dx + 0
= (b² − a²) / (2(b − a))
= (a + b)/2
It is then easy to see that the variance of a uniform distribution is Var(X) = (b − a)²/12.
F_X(x) = 0 ∀x < 0
= 0 + ∫_0^x (1/β) e^{−y/β} dy for x ≥ 0
= 1 − e^{−x/β}
5. µ = E(X) = ∫_{−∞}^∞ x f_X(x) dx = 0 + ∫_0^∞ x (1/β) e^{−x/β} dx = β. You can do this using
integration by parts, or we’ll see a trick for this later. Also, σ² = Var(X) = β².
6. Sometimes, the exponential distribution will be “parameterized” in a different
way, i.e. the parameter will be written in a different form. The alternative
form for the p.d.f. is:
f_X(x) = λe^{−λx} for x ≥ 0 (with λ = 1/β)
Watch out for how the writer is writing in the parameter for the distribution! In
this case, E(X) = 1/λ and Var(X) = 1/λ².
7. The memoryless property:
Theorem: Let X ∼ Exp(β). Then, P(x ≤ X < x + h | X ≥ x) = P(0 ≤ X < h).
In other words, the information that X ≥ x is “forgotten”.
Important note: the memoryless property does not assert that P (0 ≤ X <
h) = P (x ≤ X < x + h)!
Proof: Let x ≤ X < x + h be B, and X ≥ x be A. Then,
P(B | A) = P(A ∩ B)/P(A) = [F_X(x + h) − F_X(x)] / (1 − F_X(x)) = [e^{−x/β} − e^{−(x+h)/β}] / e^{−x/β} = 1 − e^{−h/β}
= P(0 ≤ X < h)
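The memoryless identity can be checked numerically straight from the c.d.f. (a sketch; β, x, and h are arbitrary choices of mine):

```python
import math

beta = 2.0
F = lambda x: 1 - math.exp(-x / beta)  # c.d.f. of Exp(beta)

x, h = 3.0, 0.5
# Left side: P(x <= X < x+h | X >= x) = [F(x+h) - F(x)] / (1 - F(x))
lhs = (F(x + h) - F(x)) / (1 - F(x))
# Right side: P(0 <= X < h)
rhs = F(h) - F(0)
print(round(lhs, 10) == round(rhs, 10))  # True
```

Trying other values of x leaves lhs unchanged, which is exactly the sense in which the information X ≥ x is “forgotten”.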
16 20 March
Before defining the gamma distribution, we need to define the gamma function.
1. Γ(1/2) = √π
2. Γ(α + 1) = αΓ(α)
Set y = x/β and dx = β dy to get the same answer.
This just proves that this given formula is a density, as it integrates out to 1.
(2) The gamma distribution is said to be “flexible”, meaning that many different shapes
for the p.d.f. can be induced by changing the two parameters α and β.
For α > 1, the p.d.f. rises rapidly immediately after x > 0, then drops off with a
tail as x grows larger. f_X is said to be skewed to the right.
For α = 1, f_X decays exponentially, matching the exponential distribu-
tion.
For α < 1, the p.d.f. also decays, but more steeply.
(3) The gamma density can be used to model the waiting time for the nth event if
the times between events are independent exponential random variables.
(4) There are two important special cases of the gamma distribution:
For x ≥ 0 (0 otherwise).
This particular p.d.f. plays an important role in statistics, and is called a Chi-
square p.d.f. “with ν degrees of freedom”. ν is just a parameter with this
peculiar name.
We write X ∼ χ2ν to mean “X has a Chi-square distribution with ν degrees of
freedom”.
(5) It is not too difficult to derive E(X) and V ar(X) from the definition. It will be
easier, however, once we know about moment-generating functions.
In the end, E(X) = αβ. Write x^α as x^{α+1−1}, let y = x/β, and carry out the
integration.
We get, similarly, Var(X) = αβ².
In particular, if X ∼ χ²_ν, then E(X) = ν (set α = ν/2 and β = 2) and Var(X) = 2ν.
(6) The c.d.f. F_X is not known in closed form: F_X(x) = 0 for x < 0 and:
F_X(x) = ∫_0^x [1/(Γ(α)β^α)] y^{α−1} e^{−y/β} dy
for x > 0.
(7) Notation:
We write X ∼ Gamma(α, β) to mean “X has a gamma distribution with parameters
α, β”.
16.2.1 Definition
The random variable X has a normal distribution with parameters µ, σ 2 if its p.d.f.
is given by
f_X(x) = [1/(√(2π) σ)] e^{−(1/2)((x−µ)/σ)²}
for −∞ < x < ∞.
16.2.2 Notes
Background: if I give you any random variable with mean µ and standard deviation
σ, then
Y = (X − µ)/σ
has E(Y) = 0 and Var(Y) = 1.
We are said to have standardized X.
However, if X ∼ N (µ, σ 2 ), we have the following:
Z = (X − µ)/σ ∼ N(0, 1)
This is called a standard normal random variable or distribution.
17 22 March
Z = (X − µ)/σ ∼ N(0, 1)
Note: For any random variable with mean µ and variance σ², (X − µ)/σ will have mean 0
and variance 1. The proof of this is simple – just plug the fraction in for X in E(X)
and Var(X).
The point of the main result is that, after standardizing, we still get a normal random
variable. If you standardize a random variable, you don’t always get a random
variable of the same type (unless you’re standardizing a random variable with a
normal distribution).
17.1.1 Example 1
Solution: The idea is to reduce the problem to a N (0, 1) problem, and then use
N (0, 1) tables.
Step 1: (do not draw a sketch now)
Step 2: draw a sketch: (draw a standard normal distribution with mean = 0, shade
in area A between −.35 and 1.7)
Tables will give you areas to the right of a value z. Areas to the right of z = 3
is essentially 0, so tables will usually not give values of z > 3. Recall that the
areas will be the probabilities – e.g. the area to the right of z = 1.78 will equal
P (Z ≥ 1.78).
We’ll call A1 the area between −.35 and 0, and A2 the area between 0 and 1.7. We’ll
get these values by using the symmetry of N (0, 1) about µ = 0. Tables only give
positive values of z, so to get A1 , subtract P (Z ≥ .35) from .5.
From the table values, A2 = .5 − .0446 and A1 = .5 − .3632, so A = A1 + A2 =
.5922.
17.1.2 Example 2
Problem: Use the N (0, 1) tables inversely here. Suppose that a car battery is
known to have a lifetime that is approximately normally distributed with a mean of
36 months and a standard deviation of 6 months. What should the warranty period
be set at so that only 5% of batteries will need to be replaced?
Notice: Batteries cannot have a negative lifetime, so a normal distribution cannot
be exactly right here. However, the mean lifetime is so far to the right of 0 that even
2σ below the mean is still well above 0, so we can use the model anyway. Strictly
speaking, modelling anything that cannot take negative values with a normal
distribution is not correct, but in almost all such cases it will work just fine.
Solution: We have that X ∼ N(36, 6²). Let x_0 be the required warranty period.
We want that x_0 such that P(X < x_0) = .05, i.e. such that P(X ≥ x_0) = .95.
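Using the tables inversely means finding the z with Φ(z) = .05 (about −1.645) and un-standardizing. A sketch without tables, building Φ from `math.erf` and inverting it by bisection (both helpers are mine):

```python
import math

def norm_cdf(z):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Bisection for z0 with P(Z < z0) = .05, then un-standardize: x0 = mu + sigma*z0
lo, hi = -10.0, 10.0
while hi - lo > 1e-12:
    mid = (lo + hi) / 2
    if norm_cdf(mid) < 0.05:
        lo = mid
    else:
        hi = mid
z0 = (lo + hi) / 2      # about -1.645
x0 = 36 + 6 * z0        # warranty period in months
print(round(x0, 1))  # 26.1
```

So a warranty of roughly 26 months leaves only 5% of batteries needing replacement.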
17.2.1 Definition
The moment generating function (m.g.f.) of a random variable X is M_X(t) = E(e^{tX}).
17.2.2 Notes
(3) For some distributions, the m.g.f. does not exist because the integral (or sum)
does not converge. We say that the m.g.f. exists if it exists in some interval containing
0.
(4) If the m.g.f. exists, then it is possible to recover the p.d.f. or probability
function, i.e. there is a one-to-one correspondence between a p.d.f. (or p.f.) and an m.g.f.
Recovering it, although possible, is a bit complicated, and we will not be expected
to do so.
(5) Uses of the m.g.f. – the m.g.f. can be used to find the moments of a ran-
dom variable, and is often easier than finding the moments (E(X k )) by using the
definition.
Theorem: E(X^k) = M_X^{(k)}(0).
Proof (continuous case): (discrete case – replace integral by sum) We have, by
definition:
M_X(t) = E(e^{tX}) = ∫_{−∞}^∞ e^{tx} f_X(x) dx
M_X^{(1)}(t) = (d/dt) ∫ e^{tx} f_X(x) dx
= ∫ (d/dt) e^{tx} f_X(x) dx
= ∫ x e^{tx} f_X(x) dx
now set t = 0:
M_X^{(1)}(0) = ∫_{−∞}^∞ x f_X(x) dx
= E(X)
In general, we get
M^{(k)}(t) = ∫_{−∞}^∞ x^k e^{tx} f_X(x) dx
18 27 March
18.1 Recall
M_X(t) = E(e^{tX})
M_X^{(k)}(0) = E(X^k)
X ∼ Gamma(α, β):
M_X(t) = ∫_0^∞ e^{tx} [x^{α−1} / (Γ(α)β^α)] e^{−x/β} dx
= [1/(Γ(α)β^α)] ∫_0^∞ x^{α−1} e^{−x(1/β − t)} dx
M_X(t) = 1/(1 − βt)^α for |βt| < 1
M″_X(0) = αβ² + α²β²
X ∼ Bin(n, p):
M_X(t) = Σ_{x=0}^n e^{tx} C(n, x) p^x (1 − p)^{n−x}
= Σ_{x=0}^n C(n, x) (pe^t)^x (1 − p)^{n−x}
= (pe^t + (1 − p))^n by the Binomial theorem
= (1 − p + pe^t)^n ∀t ∈ (−∞, ∞)
Note: the Binomial theorem is Σ_{x=0}^n C(n, x) a^x b^{n−x} = (a + b)^n.
We get M′_X(0) = np and M″_X(0) = np(1 − p) + n²p². Therefore, Var(X) = np(1 − p).
X ∼ Po(λ):
M_X(t) = E(e^{tX})
= Σ_{x=0}^∞ e^{tx} λ^x e^{−λ} / x!
= e^{−λ} Σ_{x=0}^∞ (e^t λ)^x / x!
= e^{−λ} e^{e^t λ} by Taylor series of e^x
= e^{λ(e^t − 1)}
To get the m.g.f. of a N(µ, σ²) random variable for arbitrary µ and σ², we use the
following property of an m.g.f.: let a and b be constants. Then,
M_{aX+b}(t) = e^{bt} M_X(at)
and
M_X(t) = e^{µt + σ²t²/2} ∀t ∈ (−∞, ∞)
We find M′_X(0) = µ, M″_X(0) = σ² + µ², and Var(X) = σ².
Often, we’re given the distribution of a random variable X, but we’re more inter-
ested in some function Y = g(X) of this random variable, e.g. maybe we have the
distribution of the velocity V, and we are interested in the distribution of the kinetic
energy Y = (1/2)mV². In general, we’re concerned with finding the distribution of g(X)
knowing the distribution of X.
First, consider the following two examples to illustrate the eventual formula for the
continuous case.
18.3.1 Example 1
Let X ∼ N(µ, σ²). Find the p.d.f. of Z = (X − µ)/σ. We know the answer to this, but we
don’t know the proof for it.
Step 1: write down the c.d.f. of Z.
F_Z(z) = P(Z ≤ z)
= P((X − µ)/σ ≤ z)
= P(X ≤ σz + µ) = F_X(σz + µ)
Finally, differentiating (chain rule):
f_Z(z) = σ f_X(σz + µ) = σ · [1/(√(2π) σ)] e^{−(1/2)((σz+µ−µ)/σ)²}
= (1/√(2π)) e^{−(1/2)z²} ∀z ∈ (−∞, ∞)
F_Y(y) = P(Y ≤ y)
= P(Z² ≤ y)
= P(|Z| ≤ √y)
= P(−√y ≤ Z ≤ √y)
= F_Z(√y) − F_Z(−√y) for y ≥ 0
Finally,
f_Y(y) = (d/dy) F_Y(y)
= (1/2) y^{−1/2} f_Z(√y) + (1/2) y^{−1/2} f_Z(−√y) ∀y ≥ 0
= y^{−1/2} f_Z(√y) (N(0, 1) density symmetric about 0)
19 29 March
Finally, we use the following facts (with α = ν/2 = 1/2 and β = 2):
f_Z(u) = (1/√(2π)) e^{−(1/2)u²} for −∞ < u < ∞
If W ∼ χ²_1, then
f_W(w) = [1/(√π · 2^{1/2})] w^{−1/2} e^{−w/2} for w ≥ 0
with f_W(w) = 0 elsewhere.
We have
f_Y(z) = z^{−1/2} [1/(√2 √π)] e^{−(1/2)z} for z > 0
= 0 for z ≤ 0
We’re done!
We can now give a general formula that allows us to go from the p.d.f. of a given
random variable X to the p.d.f. of a transformed random variable Y = g(X).
Theorem: Let X have p.d.f. fX . Let y = g(x) be either strictly increasing or
strictly decreasing as a function of x, and X is continuous. Define Y = g(X). Then,
the p.d.f. of Y is
f_Y(y) = f_X(g^{−1}(y)) |dx/dy| for the appropriate range of values of Y.
Note that dx/dy = 1/(dy/dx).
Proof: First,
F_Y(y) = P(g(X) ≤ y)
= P(X ≤ g^{−1}(y)) (if g increasing)
while = P(X ≥ g^{−1}(y)) (if g decreasing)
Thus, we have,
F_Y(y) = F_X(g^{−1}(y)) (if g increasing)
= 1 − F_X(g^{−1}(y)) (if g decreasing)
The following is a very important result that allows one to simulate observations
from a given probability distribution by knowing only how to simulate observations
from a U (0, 1) distribution. This famous result is called the probability integral trans-
formation.
Theorem: Let X be a continuous random variable with a strictly increasing FX .
Let Y = FX (X). Then Y ∼ U (0, 1).
Note: here, our g is FX , i.e. Y = g(X) = FX (X). Also, you must use FX and not
some other F .
Proof: by our formula,
f_Y(y) = f_X(F_X^{−1}(y)) dx/dy
(|dx/dy| = dx/dy since F_X is increasing)
f_Y(y) = f_X(F_X^{−1}(y)) · 1/(dy/dx)
Recall y = F_X(x), dy/dx = f_X(x), and x = F_X^{−1}(y). Therefore,
f_Y(y) = f_X(F_X^{−1}(y)) / f_X(F_X^{−1}(y)) = 1 for 0 < y < 1
Very often, we’re interested in the simultaneous behaviour of several random vari-
ables, rather than one at a time, as we have considered up until now, e.g. if X =
number of kilometres traveled by a tire and Y = tread depth, we may wish to know
about the simultaneous or joint distribution of X and Y . This leads to so-called
multivariate distributions.
We shall consider bivariate (i.e. pairs of random variables) distributions, and indicate
the general extensions at the end.
19.2.1 Definition
F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y)
19.2.2 Notes
(1) This definition holds for both continuous and discrete random variables.
(2) It is possible to show (in an advanced probability course) that the joint c.d.f.
uniquely determines the probability distribution in two-dimensional space.
20 03 April
(1) F_{X,Y}(x, y) uniquely determines the joint probability distribution in two-dimensional
space, i.e. in theory, given any event B in R², once we know F_{X,Y}(x, y) for all
−∞ < x < ∞, −∞ < y < ∞, then P((X, Y) ∈ B) is uniquely determined.
(2) We define the so-called marginal c.d.f.s F_X and F_Y of F_{X,Y} as follows:
F_X(x) = F_{X,Y}(x, +∞)
and, similarly,
F_Y(y) = F_{X,Y}(+∞, y)
(3) FX,Y (x, y) is non-decreasing in x and y (e.g. FX,Y (x, y) ≤ FX,Y (x′ , y) for x <
x′ ).
(4) FX,Y (−∞, −∞) = 0 (c.f. FX (−∞) = 0) and FX,Y (∞, ∞) = 1.
(5) FX,Y (x, y) is jointly continuous from the right (c.f. FX (x) is continuous from the
right).
It follows that
and that
PX,Y (x, y) = P (X = x, Y = y).
Given f_{X,Y}(x, y), to find the marginal p.d.f., integrate out the variable you wish to
get rid of:
f_X(x) = ∫_{−∞}^∞ f_{X,Y}(x, y) dy
and
f_Y(y) = ∫_{−∞}^∞ f_{X,Y}(x, y) dx
(beware of times when the p.d.f. changes its form – see next example for details)
20.2.1 Example
Let
f_{X,Y}(x, y) = (2/3)(x + 2y) for 0 < x < 1, 0 < y < 1
and f_{X,Y}(x, y) = 0 elsewhere.
(1) Find the marginal p.d.f.s fX and fY .
f_X(x) = ∫_{−∞}^∞ f_{X,Y}(x, y) dy
= 0 + ∫_0^1 (2/3)(x + 2y) dy + 0
= (2/3)(x + 1) for 0 < x < 1
= 0 elsewhere
and
f_Y(y) = 0 + ∫_0^1 (2/3)(x + 2y) dx
= (1/3)(1 + 4y) for 0 < y < 1
= 0 elsewhere
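The marginals can be sanity-checked numerically (a sketch; the midpoint-rule integrator and the test point x = 0.4 are my own choices):

```python
def f_joint(x, y):
    """f_{X,Y}(x,y) = (2/3)(x + 2y) on the unit square, 0 elsewhere."""
    return (2/3) * (x + 2*y) if (0 < x < 1 and 0 < y < 1) else 0.0

def integrate(g, a, b, steps=20000):
    """Simple midpoint-rule numerical integration."""
    h = (b - a) / steps
    return sum(g(a + (i + 0.5) * h) for i in range(steps)) * h

# Integrating y out at x = 0.4 should reproduce f_X(0.4) = (2/3)(0.4 + 1)
x = 0.4
fx = integrate(lambda y: f_joint(x, y), 0, 1)
print(round(fx, 6), round((2/3) * (x + 1), 6))  # both 0.933333
```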
For x ≥ 1 and y ≥ 1:
FX,Y (x, y) = 1
Note: identify the part of the range of FX,Y (x, y) where 0 < x < 1 and where you
can let y → +∞. In this case, the part is 0 < x < 1 and y ≥ 1. Finally, for such y,
FX,Y (x, y) does not change with y.
Thus, we get the same marginal c.d.f. for X by two different but equivalent meth-
ods.
21 05 April
PY | X=x (y) = P (Y = y | X = x)
Again, by the definition of conditional probability, the right hand side is equal
to
P(X = x, Y = y)/P(X = x) = P_{X,Y}(x, y)/P_X(x)
(provided that P(X = x) ≠ 0).
f_{Y|X=x}(y | x) = f_{X,Y}(x, y)/f_X(x)
F_{Y|X=x}(y | x) = P(Y ≤ y | X = x) = ∫_{−∞}^y f_{Y|X=x}(u | x) du
It follows that
E(Y | X = x) = ∫_{−∞}^∞ y f_{Y|X=x}(y | x) dy
Note that this is the usual definition of expected value, except that we use the
conditional p.d.f..
In the discrete case:
P(Y ≤ y | X = x) = Σ_{u ≤ y} P_{Y|X=x}(u | x)
It’s only in the continuous case where things get a bit more complicated.
The following theorems are very useful analogues of the Law of Total Probability for
events.
Recall:
P(A) = Σ_{i=1}^n P(A | B_i)P(B_i)
21.3 Example
By (1), we have:
For 0 < y < 1 and 0 < x < 1:
f_{Y|X=x}(y | x) = [(2/3)(x + 2y)] / [(2/3)(x + 1)] = (x + 2y)/(x + 1)
Otherwise, for y ∉ (0, 1),
f_{Y|X=x}(y | x) = 0
(5) Find E(Y | X = x) for 0 < x < 1 and, in particular, find E(Y | X = 1/2).
By definition,
E(Y | X = x) = ∫_{−∞}^∞ y f_{Y|X=x}(y | x) dy
= ∫_0^1 [y(x + 2y)/(x + 1)] dy
= (x/2 + 2/3)/(x + 1) = (3x + 4)/(6(x + 1)) for 0 < x < 1
In particular,
E(Y | X = 1/2) = (3(1/2) + 4)/(6(1/2 + 1)) = (11/2)/9 = 11/18
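This conditional expectation can be checked by numerical integration of the conditional p.d.f. (a sketch; the midpoint-rule integrator is mine, not from the course):

```python
def integrate(g, a, b, steps=100000):
    """Simple midpoint-rule numerical integration."""
    h = (b - a) / steps
    return sum(g(a + (i + 0.5) * h) for i in range(steps)) * h

x = 0.5
cond_pdf = lambda y: (x + 2*y) / (x + 1)  # f_{Y|X=x}(y|x) on 0 < y < 1

print(round(integrate(cond_pdf, 0, 1), 6))                   # 1.0: valid density
print(round(integrate(lambda y: y * cond_pdf(y), 0, 1), 6))  # 11/18 = 0.611111
```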
21.4.1 Covariance
21.4.2 Notes
(1) Cov(X, Y) is a measure of how X and Y vary about their means simultaneously.
If Cov(X, Y) > 0, then this tells us that as X increases, Y tends to increase, and as X
decreases, Y tends to decrease. Conversely, if Cov(X, Y) < 0, then as X increases, Y tends to
decrease, and vice versa.
(2) “little theorem”
Cov(X, Y ) = E(XY ) − E(X)E(Y )
Proof: by definition,
22 10 April
While the sign of Cov(X, Y) tells you whether or not X and Y tend to vary in the
same direction together (positive if they do, negative if in opposite directions), the
magnitude of Cov(X, Y) depends on the scale of measurement. Thus, we define the
scale-free correlation ρ(X, Y) = Cov(X, Y)/(σ_X σ_Y).
Note that the sign of ρ(X, Y) is the same as the sign of Cov(X, Y), and
|ρ(aX, bY)| = |Cov(aX, bY)| / √(Var(aX)Var(bY)) = |ab||Cov(X, Y)| / (|ab|√(Var(X)Var(Y))) = |ρ(X, Y)|
22.1.3 Example
f_{X,Y}(x, y) = (2/3)(x + 2y) for 0 < x < 1, 0 < y < 1
and f_{X,Y}(x, y) = 0 elsewhere.
(6) Find Cov(X, Y ).
We need µ_X and µ_Y. We then need E(XY).
µ_X = ∫_{−∞}^∞ x f_X(x) dx = ∫_0^1 x (2/3)(x + 1) dx = 5/9
µ_Y = ∫_{−∞}^∞ y f_Y(y) dy = ∫_0^1 y (1/3)(1 + 4y) dy = 11/18
E(XY) = ∫_{−∞}^∞ ∫_{−∞}^∞ xy f_{X,Y}(x, y) dx dy = ∫_0^1 ∫_0^1 xy (2/3)(x + 2y) dx dy = 1/3
Cov(X, Y) = 1/3 − (5/9)(11/18) = −3/486 = −1/162
E(X²) = ∫_0^1 x² (2/3)(x + 1) dx = 7/18
Var(X) = 7/18 − (5/9)² = 13/162 ≈ .0802
σ_X = √.0802 ≈ .2833
Similarly, E(Y²) = ∫_0^1 y² (1/3)(1 + 4y) dy = 4/9, so
Var(Y) = 4/9 − (11/18)² = 23/324 ≈ .0710
σ_Y = √.0710 ≈ .2664
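These moment computations can all be verified by numerical double integration over the unit square (a sketch; the 2-D midpoint-rule helper is my own):

```python
def f_joint(x, y):
    """f_{X,Y}(x,y) = (2/3)(x + 2y) on the unit square."""
    return (2/3) * (x + 2*y)

def integrate2d(g, steps=400):
    """Midpoint-rule double integral of g over (0,1) x (0,1)."""
    h = 1.0 / steps
    pts = [(i + 0.5) * h for i in range(steps)]
    return sum(g(x, y) for x in pts for y in pts) * h * h

EX  = integrate2d(lambda x, y: x * f_joint(x, y))      # 5/9
EY  = integrate2d(lambda x, y: y * f_joint(x, y))      # 11/18
EXY = integrate2d(lambda x, y: x * y * f_joint(x, y))  # 1/3
EY2 = integrate2d(lambda x, y: y * y * f_joint(x, y))  # 4/9

print(round(EXY - EX * EY, 6))  # Cov(X,Y) = 1/3 - (5/9)(11/18) = -1/162
print(round(EY2 - EY**2, 6))    # Var(Y) = 4/9 - (11/18)^2 = 23/324
```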
22.2.1 Definition
Note: This definition is valid whether the random variables are continuous or dis-
crete.
The random variables X1 , X2 , . . . , Xn are said to be independent if and only if
FX1 ,X2 ,...,Xn (x1 , x2 , . . . , xn ) = FX1 (x1 )FX2 (x2 ) . . . FXn (xn )
for all −∞ < xi < ∞.
If X1 , X2 , . . . , Xn are jointly continuous, then it is easy to see that they are indepen-
dent if and only if
fX1 ,X2 ,...,Xn (x1 , x2 , . . . , xn ) = fX1 (x1 )fX2 (x2 ) . . . fXn (xn )
for all −∞ < xi < ∞.
If X1 , X2 , . . . , Xn are jointly discrete, then they are independent if and only if
PX1 ,X2 ,...,Xn (x1 , x2 , . . . , xn ) = PX1 (x1 )PX2 (x2 ) . . . PXn (xn )
for all −∞ < xi < ∞.
f_{X,Y}(1/4, 1/4) = (2/3)(1/4 + 2/4) = 1/2
f_X(1/4) = (2/3)(1/4 + 1) = 5/6
f_Y(1/4) = (1/3)(1 + 4/4) = 2/3
So that
f_X(1/4) f_Y(1/4) = 5/9 ≠ 1/2 = f_{X,Y}(1/4, 1/4)
Therefore, X and Y are not independent.
23 12 April
First off, use the m.g.f. method to find the exact distribution of a sum of independent
random variables, under certain circumstances.
Secondly, use the Central Limit Theorem to find the approximate distribution of
a sum of a “large” number of independent random variables under general condi-
tions.
Thirdly, we’ll discuss the Weak Law of Large Numbers, which enables us to interpret
probability as a limiting relative frequency.
The moment generating function method for finding the distribution of a
sum of independent random variables: Recall from last class that if X ⊥ Y,
then E(XY) = E(X)E(Y) (i.e. Cov(X, Y) = 0). In general, if X_1, X_2, . . . , X_n are
independent, then
E(∏_{i=1}^n X_i) = ∏_{i=1}^n E(X_i).
The following extended result is true: if X ⊥ Y, then g_1(X) ⊥ g_2(Y) for all functions
g_1, g_2.
23.1.1 Setup
The last step above is valid because functions of independent random variables are
themselves independent, and E(g(X_1)g(X_2)) = E(g(X_1))E(g(X_2)). Then,
M_{S_n}(t) = ∏_{i=1}^n M_{X_i}(t)
Remark: if the Xi ’s all have the same distribution (termed identically distributed ),
then
M_{S_n}(t) = (M_{X_i}(t))^n.
Step 3: having found ∏_{i=1}^n M_{X_i}(t), we hope to recognize its form as the m.g.f. of
a familiar distribution. If so, then by the Uniqueness Theorem of m.g.f.s, that
distribution must be the distribution of S_n.
23.1.2 In practice
(a) We have M_{X_i}(t) = e^{λ_i(e^t − 1)}. Therefore,
M_{S_n}(t) = ∏_{i=1}^n e^{λ_i(e^t − 1)} = e^{(Σ_{i=1}^n λ_i)(e^t − 1)}
which is the m.g.f. of a Po(Σ_{i=1}^n λ_i) random variable.
(b)
M_{S_n}(t) = ∏_{i=1}^n e^{µ_i t + σ_i²t²/2} = e^{(Σ_{i=1}^n µ_i)t + (Σ_{i=1}^n σ_i²)t²/2}
which we recognize as the m.g.f. of a N(Σ_{i=1}^n µ_i, Σ_{i=1}^n σ_i²) random variable.
(c) (try it yourself)
(d)
M_{X_i}(t) = 1/(1 − 2t)^{ν_i/2}
Therefore,
M_{S_n}(t) = 1/(1 − 2t)^{Σ_{i=1}^n ν_i/2}
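The additivity in case (a) (independent Poissons summing to a Poisson with λ = Σλ_i) can be illustrated by simulation (a sketch; the Knuth-style sampler, seed, and trial count are my choices):

```python
import math
import random

random.seed(1)

def poisson_sample(lam):
    """Knuth's method: multiply uniforms until the product drops below e^{-lam}."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

# S_n = Po(1) + Po(2) + Po(3) should behave like Po(6): mean 6 and variance 6
lams = [1.0, 2.0, 3.0]
trials = 20000
sums = [sum(poisson_sample(l) for l in lams) for _ in range(trials)]
m = sum(sums) / trials
v = sum((s - m)**2 for s in sums) / trials
print(round(m, 1), round(v, 1))  # both near 6
```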
Roughly, the Central Limit Theorem says: sums of a large number of independent
random variables are approximately normally distributed.
Theorem: Let X1 , X2 , . . . be independent and identically distributed (i.i.d.) ran-
dom variables with mean µ and variance σ 2 . Then,
P((S_n − nµ)/(√n σ) ≤ x) → P(Z ≤ x) ∀x as n → ∞
23.2.1 Notes
(1) Var(S_n) = Σ_{i=1}^n Var(X_i) = nσ², while E(S_n) = nµ. Therefore, the l.h.s. of
the Central Limit Theorem is just S_n standardized to have mean 0 and standard
deviation 1.
(2) The C.L.T. can be written in the form
P((X̄ − µ)/(σ/√n) ≤ x) → P(Z ≤ x)
23.2.2 Application
Suppose that it is known that the survival time for patients with Alzheimer’s disease
from onset of symptoms has a mean of 8 years and a standard deviation of 4 years. If
a sample of 30 patients with the disease is taken, what is the approximate probability
that their average survival will be less than seven years?
Solution: the C.L.T. generally works well with n ≥ 30. We’ll let X̄ = (Σ_{i=1}^{30} X_i)/30. We