Lecture Note Lectures All Lectures Prof D Wolfson
Ryan Ordille
These course notes are for MATH 323: Probability, offered at McGill University
in the Winter 2012 semester, taught by Professor David Wolfson. These notes are
simply what the instructor wrote on the board, and may contain errors. The notes
are written by me, Ryan Ordille, and no copyright infringement is intended. Please
don’t sell these notes, or pass them off as your own.
The notes up until 28 February (the first class after the break and midterm) are concise
– i.e. without review or obvious proofs. Most pre-midterm examples are left out
since, by the time these notes were typed up towards the end of the course, their
solutions were obvious.
These course notes can be found at https://github.com/ryanordille/m323w12.
Thank you to everyone who submitted corrections, and good luck!
1 17 January
1. P (E) ≥ 0
2. P (S) = 1
3. For pairwise disjoint events E_1, E_2, . . . (E_i ∩ E_j = ∅ for i ≠ j):
P(∪_{i=1}^∞ E_i) = Σ_{i=1}^∞ P(E_i)
4. Theorem 1: P (∅) = 0
5. Theorem 2: P (E c ) = 1 − P (E)
6. Theorem 3: P (E ∩ F c ) = P (E) − P (E ∩ F )
7. Theorem 4: If E ⊂ F , then P (E) ≤ P (F ).
8. Theorem 5: If E and F are any two events, then P (E ∪ F ) = P (E) + P (F ) −
P (E ∩ F ).
Word problem hints:
• “either/or” corresponds to a union of two events
• “at least one” – union
• “not” – complement
• “and” – intersection
• “proportion” – “a probability statement about an individual”
2 19 January
Theorem 6: Let S be a sample space (set of all possible outcomes) with finitely
many outcomes N . If all these outcomes are equally likely, then, if E is any
event,
P(E) = (number of ways that E can occur) / (total number of possible outcomes N)
Counting the number of ways that E can occur (and N ) can be difficult without the
following counting rules.
2.1.1 Factorial
By definition,
n! = n × (n − 1) × · · · × 2 × 1 where 0! = 1
2.1.3 n choose k
Suppose that a set contains n distinct objects. Then, the number of ways to select
k objects from these n objects, if we sample without replacement, is denoted by C(n, k)
(“n choose k”). It turns out that:
C(n, k) = n! / (k!(n − k)!)
Note that here the order of selections is not important, e.g. the selection (A, B) is
equivalent to the selection (B, A).
2.1.4 Permutations
If the order is important in the previous example, this is denoted by P^n_k, and
P^n_k = n! / (n − k)!
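These counting rules can be checked directly in Python (a quick sketch, not from the lecture; `math.comb` and `math.perm` require Python 3.8+):

```python
import math
from itertools import combinations, permutations

n, k = 5, 2

# n choose k: unordered selections without replacement
assert math.comb(n, k) == math.factorial(n) // (math.factorial(k) * math.factorial(n - k))
assert math.comb(n, k) == len(list(combinations("ABCDE", k)))  # (A,B) same as (B,A)

# P^n_k: ordered selections without replacement
assert math.perm(n, k) == math.factorial(n) // math.factorial(n - k)
assert math.perm(n, k) == len(list(permutations("ABCDE", k)))  # (A,B) distinct from (B,A)

print(math.comb(n, k), math.perm(n, k))  # 10 20
```

Note that P^n_k = C(n, k) × k!, since each unordered selection can be arranged in k! orders.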
3 24 January
This lecture just contained examples of the previous counting rules.
4 26 January
Very often, if we know that some event A has occurred, then this will affect the
probability that the event B will occur.
Definition: for two events A and B, with P(A) ≠ 0, we define the probability of
“B given A”, denoted by P(B | A), as
P(B | A) = P(A ∩ B) / P(A)
4.2 Notes
(1) Conditional probabilities satisfy the three axioms, just like unconditional prob-
abilities.
(2) (the multiplication rule for conditional probabilities) Let A and B be
two events. Then,
P(A ∩ B) = P(B | A)P(A) = P(A | B)P(B)
(3) When solving conditional probability problems, use the conditioning technique
to break a long intersection into a series of conditional and unconditional probabili-
ties.
5 31 January
Note that it’s not true in general that P (A | (B1 ∪B2 )) = P (A | B1 )+P (A | B2 ) even
if B1 ∩ B2 = ∅. However, it is true that P (B1 ∪ B2 |A) = P (B1 |A) + P (B2 |A).
2. ∪_{i=1}^m B_i = S (i.e. B_1, B_2, . . . , B_m form a partition of S)
Then,
P(A) = Σ_{i=1}^m P(A | B_i)P(B_i)
Notice here that the left-hand side (P (A)) may be difficult to find directly, while the
components of the right-hand side might be known or easy to find.
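As a sketch of how the right-hand side is assembled, consider a hypothetical setup (the machines and all numbers below are my own illustration, not from the course):

```python
# Hypothetical: a part comes from one of three machines (B_1, B_2, B_3, a
# partition of S), and A is the event that the part is defective.
P_B = [0.5, 0.3, 0.2]             # P(B_i); these must sum to 1
P_A_given_B = [0.01, 0.02, 0.05]  # P(A | B_i), known for each machine

# Law of total probability: P(A) = sum_i P(A | B_i) P(B_i)
P_A = sum(pa * pb for pa, pb in zip(P_A_given_B, P_B))
print(round(P_A, 6))  # 0.021
```

Each conditional probability is easy to state per machine, while P(A) alone would be hard to write down directly.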
6 02 February
6.1 Independence
Sometimes, knowing that an event A has occurred will not affect the probability
that B will occur. In such a situation, we say that A and B are independent. More
formally, two events A, B are said to be independent (A ⊥ B) if and only if
P (B | A) = P (B) or P (A | B) = P (A)
Theorem: if A and B are disjoint, then A and B can only be independent if either
P (A) = 0 or P (B) = 0.
There’s another important (albeit less intuitive) definition of independence: A ⊥ B
if and only if P (A ∩ B) = P (A)P (B).
More generally, events A1 , A2 , . . . , An are mutually independent if and only if, for
every subset Ai1 , Ai2 , . . . , Aik of A1 , A2 , . . . , An ,
P(∩_{j=1}^k A_{i_j}) = ∏_{j=1}^k P(A_{i_j})
7 07 February
8 09 February
9 14 February
The total probability (mass) is equally or uniformly spread out on each of the num-
bers ai .
10 28 February
Firstly, we have a sequence of n independent trials – that is, the outcomes of these
trials are mutually independent.
Secondly, each trial can result in exactly one of two possible outcomes: a “success”
(S) or a “failure” (F). We call such trials Bernoulli trials.
Thirdly, the probability of success at trial i is constant and equal to p for every
i = 1, 2, . . . , n. For example, in a coin toss, we cannot change the probability of
heads halfway through, so the probability of success is constant.
Theorem: Let X be the number of successes observed in these n trials. Then,
P(X = x) = C(n, x) p^x (1 − p)^{n−x} for x = 0, 1, 2, . . . , n.
Proof: (same idea for the genetic mutation example) First, note that the probability
of any particular configuration in which there are x successes (and n − x failures) is
just px (1 − p)n−x .
E.g.:
P(S_1 ∩ S_2 ∩ . . . ∩ S_x ∩ F_{x+1} ∩ F_{x+2} ∩ . . . ∩ F_n) = P(S_1)P(S_2) . . . P(S_x)P(F_{x+1}) . . . P(F_n)
= p × p × . . . × p × (1 − p) × . . . × (1 − p) = p^x (1 − p)^{n−x}
But,
P(X = x) = Σ_{all configurations i} P(configuration i) = Σ_{all configurations} p^x (1 − p)^{n−x}
Since each configuration is defined by a choice of x objects from n total objects, then
for every x = 0, 1, . . . , n:
P(X = x) = C(n, x) p^x (1 − p)^{n−x}
Suppose that the five year survival probability for lung cancer is .10. If thirty people
with lung cancer are sampled, what is the probability that at least three will survive
five years or longer?
Solution: Let Y = the number out of thirty who will survive five or more years.
We shall reasonably assume the binomial setup, with n = 30 and P (Si ) = .10, where
Si is the event where the ith subject survives five or more years. Therefore,
P(Y ≥ 3) = Σ_{y=3}^{30} C(30, y)(.10)^y (1 − .10)^{30−y} = 1 − Σ_{y=0}^{2} C(30, y)(.10)^y (1 − .10)^{30−y}
Important note: do not immediately have the “burning desire” to use the binomial
distribution as soon as you see a bunch of trials, each of which can result in exactly
one of two outcomes. You must check if the trials are independent – they will not
be if you are sampling without replacement!
Observe that the Bernoulli distribution is a special case of the binomial distribution,
where n = 1. So, P_X(x) = p^x (1 − p)^{1−x} for x = 0, 1.
The random variable X is said to have a Poisson distribution with parameter λ > 0
if
P(X = x) = λ^x e^{−λ} / x! for x = 0, 1, 2, . . .
(Note the infinite range of x)
Check that:
1. P_X(x) ≥ 0 – obvious
2. Σ_{x=0}^∞ P_X(x) = 1 – Taylor series expansion for e^λ
The Poisson distribution arises as an approximation to the binomial for “large n and
small p”.
P(X = x) ≈ λ^x e^{−λ} / x! for x = 0, 1, 2, . . . , where λ = np
(e.g. p = 6/n: np = n(6/n) = 6)
Proof: We shall use the following result in proving our theorem.
lim_{n→∞} (1 + a/n)^n = e^a
P(X = x) = C(n, x) p^x (1 − p)^{n−x} for x = 0, 1, . . . , n
(∗) = [n! / (x!(n − x)!)] p^x (1 − p)^n (1 − p)^{−x}
Now since λ = np, we have p = λ/n. Then, (∗) becomes:
[n! / (x!(n − x)!)] (λ^x / n^x) (1 − λ/n)^n (1 − λ/n)^{−x}
After some algebraic manipulation, this becomes:
(1/x!) [n(n − 1)(n − 2) . . . (n − x + 1) / (n · n · · · n)] λ^x (1 − λ/n)^n (1 − λ/n)^{−x}
Notice that the second term has x terms on both the top and the bottom. Now, let
n go to infinity to get the limit:
λ^x e^{−λ} / x!
11 01 March
P(X = x) = λ^x e^{−λ} / x! where x = 0, 1, 2, . . .
Suppose that in a book of 1000 pages, on any particular page, there can be either
zero errors or one error. Suppose further that the probability of an error on any
particular page is .002. What is the approximate probability that there will be at
most three errors in the book?
Exact solution: There is a binomial setup – errors are likely to occur independently
amongst the n = 1000 trials. Each trial can result in either a success (an error is
found) or a failure (an error is not found). There is also a constant probability of
success (.002) for every trial. Let X be the number of successes in 1000 pages. Then
X ∼ Bin(1000, .002).
P(X ≤ 3) = Σ_{x=0}^{3} C(1000, x)(.002)^x (1 − .002)^{1000−x}
Approximate solution: Since n is large and p is small, we can use the Poisson
approximation with λ = n × p = 1000 × .002 = 2. Therefore:
P(X ≤ 3) ≈ Σ_{x=0}^{3} P(X = x) = Σ_{x=0}^{3} 2^x e^{−2} / x!
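A sketch comparing the exact binomial answer to the Poisson(2) approximation (my own check, not from the lecture):

```python
import math

n, p = 1000, 0.002
lam = n * p  # λ = np = 2

# Exact Bin(1000, .002) probability of at most 3 errors
exact = sum(math.comb(n, x) * p**x * (1 - p)**(n - x) for x in range(4))
# Poisson(2) approximation of the same quantity
approx = sum(lam**x * math.exp(-lam) / math.factorial(x) for x in range(4))
print(round(exact, 4), round(approx, 4))
```

Both come out near 0.857, so the approximation is excellent here despite being vastly easier to compute by hand.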
The geometric random variable is used to describe or model the trial number at
which the first success occurs in a sequence of independent Bernoulli trials, each
with the probability of success p.
The probability distribution of a random variable provides the complete story about
the random variable. There is nothing more to know about a random variable once we
know its probability distribution. However, we often wish to summarize a probability
distribution. The two most common summaries are:
1. a parameter that discusses the “centre of distribution” and
2. a parameter that discusses how spread out the values are from the centre.
To this end, we give the following definition:
Expected value: let Y be a discrete random variable with the probability function
P_Y(y). Then, we define the expected value of Y, denoted by E(Y), to be
E(Y) = Σ_{all y} y P_Y(y) = Σ_{all y} y P(Y = y)
12 06 March
12.1 Review
12.2 Theory
Notes:
1. E(Y ) is often denoted by µY , and also is termed the mean of Y (also called
the “population mean”).
2. Interpretation: E(Y ) is a weighted average or mean of the possible values of
the random variable Y , where the weights are the probabilities of these values.
For example, in the special case where Y has a discrete uniform distribution
a1 , a2 , . . . , aN , then
E(Y) = Σ_{i=1}^N a_i P(Y = a_i) = (1/N) Σ_{i=1}^N a_i
The point of this: by definition, E(g(Y)) = E(X) = Σ_{all x} x P(X = x). Thus,
in order to find E(g(Y)), it would appear that we first need to find the prob-
ability distribution of X = g(Y). We’ll see such transformations later – they
can be difficult. However, you do not have to first find the distribution of X – you can
use the distribution of the original Y and sum g(y)P(Y = y).
In particular, if g(Y) = Y^k, then E(g(Y)) = E(Y^k) is called the kth moment
of Y. µ is the first moment of Y.
E(Y^k) = Σ_{all y} y^k P(Y = y)
12.3 Examples
12.3.1 Example 1
Suppose a company insures a computer worth $1000, and with probability .05 the
computer will be stolen. What premium should the insurance company charge so
that its expected gain is 0?
Solution: Let c be the required premium. Let Y be the gain of the company in a
given year. We need to find the value of c such that E(Y ) = 0.
First, we need PY (y) for all y. We have P (Y = c) = .95 (where the computer
was not stolen) and P (Y = (c − 1000)) = .05 (where the computer was stolen).
Therefore,
E(Y) = c × .95 + (c − 1000) × .05
Setting E(Y ) = 0 and solving for c, we get c = 50. Thus, if the company charges
$50 for the policy, on average, over a large number of clients, they would neither lose
nor gain money.
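The break-even premium can be verified with one line of arithmetic (a sketch of the check, not course code):

```python
# Gain Y is c with prob .95 (not stolen) and c - 1000 with prob .05 (stolen).
# E(Y) = c(.95) + (c - 1000)(.05) = c - 50, which is 0 exactly at c = 50.
c = 50
expected_gain = c * 0.95 + (c - 1000) * 0.05
print(abs(expected_gain) < 1e-9)  # True
```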
12.3.2 Example 2
“Nuts and Bolts” Example: suppose that the random variable X has the proba-
bility distribution:
P (X = −1.2) = .32
P (X = 2.6) = .40
P (X = 0) = .28
13 08 March
13.1 Summary
Centre – E(X) = µ
Spread – Var(X) = σ²
Standard Deviation – √(σ²) = σ
Var(X) = Σ_{all x} (x − µ)² P(X = x) = Σ_{all x} x² P(X = x) − µ²
13.2.1 Binomial
E(X) = µ = Σ_{x=1}^n x C(n, x) p^x (1 − p)^{n−x}
= Σ [n(n − 1)! / ((x − 1)!(n − 1 − (x − 1))!)] p · p^{x−1} (1 − p)^{n−1−(x−1)}   as (n − x)! = (n − 1 − (x − 1))!
= np Σ C(n − 1, x − 1) p^{x−1} (1 − p)^{n−1−(x−1)}
= np Σ_{y=0}^{n−1} C(n − 1, y) p^y (1 − p)^{n−1−y}
= np
Note that x² will not cancel with the leading terms of x! as before, so we have to use
a trick. We first calculate E(X(X − 1)), which is easy. Then, notice that
E(X(X − 1)) = Σ_{x=2}^n x(x − 1) C(n, x) p^x (1 − p)^{n−x}
= n(n − 1)p² Σ [(n − 2)! / ((x − 2)!(n − 2 − (x − 2))!)] p^{x−2} (1 − p)^{n−2−(x−2)}
= n(n − 1)p² Σ C(n − 2, x − 2) p^{x−2} (1 − p)^{n−2−(x−2)}
= n(n − 1)p²
Then Var(X) = E(X(X − 1)) + E(X) − µ² = n(n − 1)p² + np − (np)² = np(1 − p).
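Both moments can be sanity-checked by brute-force summation over the p.m.f. (a sketch with my own choice of n and p):

```python
import math

n, p = 10, 0.3

# The full Bin(10, 0.3) probability mass function
pmf = [math.comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]

mean = sum(x * pmf[x] for x in range(n + 1))                      # should be np
factorial_moment = sum(x * (x - 1) * pmf[x] for x in range(n + 1))  # n(n-1)p^2
var = factorial_moment + mean - mean**2                           # np(1-p)

print(round(mean, 10), round(var, 10))  # 3.0 2.1
```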
13.2.2 Bernoulli
In particular, the mean and variance of a Bernoulli random variable are p and p(1−p)
respectively (because n = 1).
13.2.3 Poisson
E(X) = Σ_{x=1}^∞ x λ^x e^{−λ} / x!
= e^{−λ} λ Σ_{x−1=0}^∞ λ^{x−1} / (x − 1)!
= λ
For the variance, we use the trick we used for the variance of a binomial distribution.
First, we solve for E(X(X − 1)):
E(X(X − 1)) = Σ_{x=2}^∞ x(x − 1) λ^x e^{−λ} / x!
= λ² Σ x(x − 1) λ^{x−2} e^{−λ} / (x(x − 1)(x − 2)!)
= λ² Σ λ^{x−2} e^{−λ} / (x − 2)!
= λ², as the sum is over a Poisson distribution in y = x − 2 and so equals 1
13.2.4 Geometric
E(X) = Σ_{x=1}^∞ x p(1 − p)^{x−1}
using our trick... = p Σ_{x=1}^∞ x(1 − p)^{x−1}
= p Σ_{x=1}^∞ −(d/dp)(1 − p)^x
= −p (d/dp) Σ_{x=1}^∞ (1 − p)^x   (we can interchange the derivative and the sum)
= −p (d/dp) [(1 − p) / (1 − (1 − p))]   (geometric series with r = 1 − p)
= −p (d/dp)(1/p − 1)
= 1/p
For the variance, first find E(X(X − 1)) (2 derivatives), and then you can find the
variance.
Var(X) = (1 − p)/p²
P (X = x) = FX (x) − P (X < x)
14 13 March
In short, F_X(x) = ∫_{−∞}^x f_X(y) dy.
(2) Conversely, we can recover the p.d.f. from the c.d.f. by the Fundamental Theo-
rem of Calculus, since:
(d/dx) F_X(x) = F′_X(x) = f_X(x) ∀x
Moreover, for small ∆x,
[F_X(x + ∆x) − F_X(x)] / ∆x ≈ f_X(x)
So it follows that:
FX (x + ∆x) − FX (x) ≈ ∆xfX (x)
But the left hand side is just P (x < X ≤ x + ∆x). Finally, we have that fX (x)∆x
is approximately the probability that X lies in (x, x + ∆x].
Note that, because the p.d.f. does not represent a probability on its own, it can be
greater than 1 (though it can never be negative). When multiplied by a small ∆x, it gives an approximate
probability.
Any nonnegative function whose total area equals 1 can qualify as a p.d.f., even if some points
are larger than 1.
(4) For continuous random variables, we define (replacing the discrete sum Σ_{all x} g(x)P_X(x)):
E(g(X)) = ∫_{−∞}^∞ g(x) f_X(x) dx
As before, when g(X) = X k , then we refer to the E(g(X)) = E(X k ) as the kth
moment. Of particular importance are the first moment (k = 1) and the second
moment (k = 2). Again, as before, we call the first moment the mean or the expected
value of X.
So, by definition:
E(X) = µ_X = ∫_{−∞}^∞ x f_X(x) dx
and
E(X²) = ∫_{−∞}^∞ x² f_X(x) dx
Finally, as before:
Var(X) = E((X − µ_X)²) = ∫_{−∞}^∞ (x − µ_X)² f_X(x) dx = E(X²) − µ²_X
and
Var(X) = ∫_{−∞}^∞ x² f_X(x) dx − (∫_{−∞}^∞ x f_X(x) dx)²
Note: watch out for a p.d.f. that may change its form in different ranges of (−∞, ∞)
when carrying out an integration.
14.3 Examples
14.3.1 Example 1
Let fX (x) = c(x2 + 1) for 0 < x < 1 and fX (x) = 0 elsewhere (c is a constant).
1. Find c.
2. Find P (.25 < X ≤ .50).
3. Find P (.25 < X < .50).
4. Find FX .
5. Find E(X) and σX .
(1) Since ∫_{−∞}^∞ f_X(x) dx = 1, we must have:
∫_{−∞}^0 0 dx + ∫_0^1 c(x² + 1) dx + ∫_1^∞ 0 dx = 1
so c(1/3 + 1) = 1, giving c = 3/4.
(3) For a continuous p.d.f., these two values are the same (by rules of integra-
tion).
P(.25 < X < .50) = P(.25 < X ≤ .50) = 55/256
(4)
F_X(x) = 0 ∀x ≤ 0
F_X(x) = ∫_{−∞}^x f_X(y) dy ∀ 0 < x < 1
= ∫_{−∞}^0 0 dy + ∫_0^x (.75)(y² + 1) dy
= (.75)(x³/3 + x)
F_X(x) = 1 ∀x ≥ 1
(5)
E(X) = ∫_{−∞}^∞ x f_X(x) dx
= ∫_0^1 x(.75)(x² + 1) dx
= (.75)(x⁴/4 + x²/2) |_0^1
= (.75)(.25 + .50)
= 9/16
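All of the answers in this example can be double-checked by numerical integration (a sketch; the midpoint-rule helper is mine, not from the course):

```python
def f(x, c=0.75):
    """f_X(x) = c(x^2 + 1) on (0, 1), using c = 3/4 from part (1)."""
    return c * (x**2 + 1) if 0 < x < 1 else 0.0

def integrate(g, a, b, steps=100000):
    """Simple midpoint-rule numerical integration."""
    h = (b - a) / steps
    return sum(g(a + (i + 0.5) * h) for i in range(steps)) * h

print(round(integrate(f, 0, 1), 4))                    # total area: 1.0
print(round(integrate(f, 0.25, 0.50), 4))              # P(.25 < X < .50) = 55/256
print(round(integrate(lambda x: x * f(x), 0, 1), 4))   # E(X) = 9/16 = 0.5625
```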
15 15 March
1. The p.d.f. is constant on [a, b]. On the graph, the height is constantly 1/(b − a), so
the area under the curve is 1. The probability is uniformly spread out in the
interval.
2. The uniform distribution is often used to model situations in which we believe
outcomes occur completely at random.
3. An important special case is the uniform distribution [0, 1].
4. Notation – we write X ∼ U (a, b) to mean that X has a uniform distribution
on the interval [a, b].
5. The c.d.f. of X is:
F_X(x) = 0 for x < a
= (x − a)/(b − a) for a ≤ x ≤ b
= 1 for b < x
The graph of the c.d.f. is 0 up to a, then grows linearly up to b, then is
constantly 1 after.
6.
µ = E(X) = ∫_{−∞}^∞ x f_X(x) dx
= 0 + ∫_a^b x · 1/(b − a) dx + 0
= (b² − a²) / (2(b − a))
= (a + b)/2
It is then easy to see that the variance of a uniform distribution is Var(X) = (b − a)²/12.
F_X(x) = 0 ∀x < 0
= 0 + ∫_0^x (1/β) e^{−y/β} dy for x ≥ 0
= 1 − e^{−x/β}
5. µ = E(X) = ∫_{−∞}^∞ x f_X(x) dx = 0 + ∫_0^∞ x (1/β) e^{−x/β} dx = β. You can do this using
integration by parts, or we’ll see a trick for this later. Also, σ² = Var(X) = β².
6. Sometimes, the exponential distribution will be “parameterized” in a different
way, i.e. the parameter will be written in a different form. The alternative
form for the p.d.f. is:
f_X(x) = λe^{−λx} for x ≥ 0 (with λ = 1/β)
Watch out for how the writer is writing in the parameter for the distribution! In
this case, E(X) = 1/λ and Var(X) = 1/λ².
7. The memoryless property:
Theorem: Let X ∼ Exp(β). Then, P(x ≤ X < x + h | X ≥ x) = P(0 ≤ X < h).
In other words, the information that X ≥ x is “forgotten”.
Important note: the memoryless property does not assert that P (0 ≤ X <
h) = P (x ≤ X < x + h)!
Proof: Let x ≤ X < x + h be B, and X ≥ x be A. Then,
P(B | A) = P(A ∩ B)/P(A) = [F_X(x + h) − F_X(x)] / (1 − F_X(x)) = [e^{−x/β} − e^{−(x+h)/β}] / e^{−x/β} = 1 − e^{−h/β}
= P(0 ≤ X < h)
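The memoryless identity can be checked numerically straight from the c.d.f. (a sketch; β, x, and h are arbitrary choices of mine):

```python
import math

beta = 2.0
F = lambda x: 1 - math.exp(-x / beta)  # c.d.f. of Exp(beta)

x, h = 3.0, 0.5
# Left side: P(x <= X < x+h | X >= x) = [F(x+h) - F(x)] / (1 - F(x))
lhs = (F(x + h) - F(x)) / (1 - F(x))
# Right side: P(0 <= X < h)
rhs = F(h) - F(0)
print(round(lhs, 10) == round(rhs, 10))  # True
```

Trying other values of x leaves lhs unchanged, which is exactly the sense in which the information X ≥ x is “forgotten”.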
16 20 March
Before defining the gamma distribution, we need to define the gamma function.
1. Γ(1/2) = √π
2. Γ(α + 1) = αΓ(α)
Set y = x/β and dx = β dy to get the same answer.
This just proves that this given formula is a density, as it integrates out to 1.
(2) The gamma distribution is said to be “flexible”, meaning that many different shapes
for the p.d.f. can be induced by changing the two parameters α and β.
For α > 1, the p.d.f. rises rapidly immediately after x > 0, then drops off with a
tail as x grows larger. f_X is said to be skewed to the right.
For α = 1, f_X decays exponentially, matching the exponential distribu-
tion.
For α < 1, the p.d.f. also decays, but more steeply.
(3) The gamma density can be used to model the waiting time for the nth event if
the times between events are independent exponential random variables.
(4) There are two important special cases of the gamma distribution:
For x ≥ 0 (0 otherwise).
This particular p.d.f. plays an important role in statistics, and is called a Chi-
square p.d.f. “with ν degrees of freedom”. ν is just a parameter with this
peculiar name.
We write X ∼ χ2ν to mean “X has a Chi-square distribution with ν degrees of
freedom”.
(5) It is not too difficult to derive E(X) and V ar(X) from the definition. It will be
easier, however, once we know about moment-generating functions.
In the end, E(X) = αβ. Write x^α as x^{α+1−1}, let y = x/β, and carry out the
integration.
We get, similarly, Var(X) = αβ².
In particular, if X ∼ χ²_ν, then E(X) = ν (set α = ν/2 and β = 2) and Var(X) = 2ν.
(6) The c.d.f. F_X is not known in closed form: F_X(x) = 0 for x < 0 and:
F_X(x) = ∫_0^x [1/(Γ(α)β^α)] y^{α−1} e^{−y/β} dy
for x > 0.
(7) Notation:
We write X ∼ Gamma(α, β) to mean “X has a gamma distribution with parameters
α, β”.
16.2.1 Definition
The random variable X has a normal distribution with parameters µ, σ 2 if its p.d.f.
is given by
f_X(x) = [1/(√(2π) σ)] e^{−(1/2)((x−µ)/σ)²}
for −∞ < x < ∞.
16.2.2 Notes
Background: if I give you any random variable with mean µ and standard deviation
σ, then
Y = (X − µ)/σ
has E(Y) = 0 and Var(Y) = 1.
We are said to have standardized X.
However, if X ∼ N (µ, σ 2 ), we have the following:
Z = (X − µ)/σ ∼ N(0, 1)
This is called a standard normal random variable or distribution.
17 22 March
Z = (X − µ)/σ ∼ N(0, 1)
Note: For any random variable with mean µ and variance σ², (X − µ)/σ will have mean 0
and variance 1. The proof of this is simple – just plug the fraction in for X in E(X)
and Var(X).
The point of the main result is that, after standardizing, we still get a normal random
variable. If you standardize a random variable, you don’t always get a random
variable of the same type (unless you’re standardizing a random variable with a
normal distribution).
17.1.1 Example 1
Solution: The idea is to reduce the problem to a N (0, 1) problem, and then use
N (0, 1) tables.
Step 1: (do not draw a sketch now)
Step 2: draw a sketch: (draw a standard normal distribution with mean = 0, shade
in area A between −.35 and 1.7)
Tables will give you areas to the right of a value z. Areas to the right of z = 3
is essentially 0, so tables will usually not give values of z > 3. Recall that the
areas will be the probabilities – e.g. the area to the right of z = 1.78 will equal
P (Z ≥ 1.78).
We’ll call A1 the area between −.35 and 0, and A2 the area between 0 and 1.7. We’ll
get these values by using the symmetry of N (0, 1) about µ = 0. Tables only give
positive values of z, so to get A1 , subtract P (Z ≥ .35) from .5.
From the table values, A2 = .5 − .0446 and A1 = .5 − .3632, so A = A1 + A2 =
.5922.
17.1.2 Example 2
Problem: Use the N (0, 1) tables inversely here. Suppose that a car battery is
known to have a lifetime that is approximately normally distributed with a mean of
36 months and a standard deviation of 6 months. What should the warranty period
be set at so that only 5% of batteries will need to be replaced?
Notice: Batteries cannot have a negative lifetime, so a normal distribution cannot
be exactly right here. However, the mean lifetime is so far to the right of 0 that even
2σ below the mean is still well above 0, so we can use the model anyway. Strictly
speaking, modelling anything that cannot take negative values with a normal
distribution is not correct, but in almost all such cases it will work just fine.
Solution: We have that X ∼ N(36, 6²). Let x_0 be the required warranty period.
We want that x_0 such that P(X < x_0) = .05, i.e. such that P(X ≥ x_0) = .95.
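Using the tables inversely means finding the z with Φ(z) = .05 (about −1.645) and un-standardizing. A sketch without tables, building Φ from `math.erf` and inverting it by bisection (both helpers are mine):

```python
import math

def norm_cdf(z):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Bisection for z0 with P(Z < z0) = .05, then un-standardize: x0 = mu + sigma*z0
lo, hi = -10.0, 10.0
while hi - lo > 1e-12:
    mid = (lo + hi) / 2
    if norm_cdf(mid) < 0.05:
        lo = mid
    else:
        hi = mid
z0 = (lo + hi) / 2      # about -1.645
x0 = 36 + 6 * z0        # warranty period in months
print(round(x0, 1))  # 26.1
```

So a warranty of roughly 26 months leaves only 5% of batteries needing replacement.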
17.2.1 Definition
The moment generating function (m.g.f.) of a random variable X is M_X(t) = E(e^{tX}).
17.2.2 Notes
(3) For some distributions, the m.g.f. does not exist because the integral (or sum)
does not converge. We say that the m.g.f. exists if it exists in some interval containing
0.
(4) If the m.g.f. exists, then it is possible to recover the p.d.f. or probability
function, i.e. there is a one-to-one correspondence between a p.d.f. (or p.f.) and an m.g.f.
Recovering it, although possible, is a bit complicated, and we will not be expected
to do so.
(5) Uses of the m.g.f. – the m.g.f. can be used to find the moments of a ran-
dom variable, and is often easier than finding the moments (E(X k )) by using the
definition.
Theorem: E(X^k) = M_X^{(k)}(0).
Proof (continuous case): (discrete case – replace integral by sum) We have, by
definition:
M_X(t) = E(e^{tX}) = ∫_{−∞}^∞ e^{tx} f_X(x) dx
M_X^{(1)}(t) = (d/dt) ∫ e^{tx} f_X(x) dx
= ∫ (d/dt) e^{tx} f_X(x) dx
= ∫ x e^{tx} f_X(x) dx
now set t = 0:
M_X^{(1)}(0) = ∫_{−∞}^∞ x f_X(x) dx
= E(X)
In general, we get
M^{(k)}(t) = ∫_{−∞}^∞ x^k e^{tx} f_X(x) dx
18 27 March
18.1 Recall
M_X(t) = E(e^{tX})
M_X^{(k)}(0) = E(X^k)
X ∼ Gamma(α, β):
M_X(t) = ∫_0^∞ e^{tx} [x^{α−1} / (Γ(α)β^α)] e^{−x/β} dx
= [1/(Γ(α)β^α)] ∫_0^∞ x^{α−1} e^{−x(1/β − t)} dx
M_X(t) = 1/(1 − βt)^α for |βt| < 1
M″_X(0) = αβ² + α²β²
X ∼ Bin(n, p):
M_X(t) = Σ_{x=0}^n e^{tx} C(n, x) p^x (1 − p)^{n−x}
= Σ_{x=0}^n C(n, x) (pe^t)^x (1 − p)^{n−x}
= (pe^t + (1 − p))^n by the Binomial theorem
= (1 − p + pe^t)^n ∀t ∈ (−∞, ∞)
Note: the Binomial theorem is Σ_{x=0}^n C(n, x) a^x b^{n−x} = (a + b)^n.
We get M′_X(0) = np and M″_X(0) = np(1 − p) + n²p². Therefore, Var(X) = np(1 − p).
X ∼ Po(λ):
M_X(t) = E(e^{tX})
= Σ_{x=0}^∞ e^{tx} λ^x e^{−λ} / x!
= e^{−λ} Σ_{x=0}^∞ (e^t λ)^x / x!
= e^{−λ} e^{e^t λ} by Taylor series of e^x
= e^{λ(e^t − 1)}
To get the m.g.f. of a N(µ, σ²) random variable for arbitrary µ and σ², we use the
following property of an m.g.f.: let a and b be constants. Then,
M_{aX+b}(t) = e^{bt} M_X(at)
and
M_X(t) = e^{µt + σ²t²/2} ∀t ∈ (−∞, ∞)
We find M′_X(0) = µ, M″_X(0) = σ² + µ², and Var(X) = σ².
Often, we’re given the distribution of a random variable X, but we’re more inter-
ested in some function Y = g(X) of this random variable, e.g. maybe we have the
distribution of the velocity V, and we are interested in the distribution of the kinetic
energy Y = (1/2)mV². In general, we’re concerned with finding the distribution of g(X)
knowing the distribution of X.
First, consider the following two examples to illustrate the eventual formula for the
continuous case.
18.3.1 Example 1
Let X ∼ N(µ, σ²). Find the p.d.f. of Z = (X − µ)/σ. We know the answer to this, but we
don’t know the proof for it.
Step 1: write down the c.d.f. of Z.
F_Z(z) = P(Z ≤ z)
= P((X − µ)/σ ≤ z)
= P(X ≤ σz + µ) = F_X(σz + µ)
Finally, differentiating (chain rule):
f_Z(z) = σ f_X(σz + µ) = σ · [1/(√(2π) σ)] e^{−(1/2)((σz+µ−µ)/σ)²}
= (1/√(2π)) e^{−(1/2)z²} ∀z ∈ (−∞, ∞)
F_Y(y) = P(Y ≤ y)
= P(Z² ≤ y)
= P(|Z| ≤ √y)
= P(−√y ≤ Z ≤ √y)
= F_Z(√y) − F_Z(−√y) for y ≥ 0
Finally,
f_Y(y) = (d/dy) F_Y(y)
= (1/2) y^{−1/2} f_Z(√y) + (1/2) y^{−1/2} f_Z(−√y) ∀y ≥ 0
= y^{−1/2} f_Z(√y) (N(0, 1) density symmetric about 0)
19 29 March
Finally, we use the following facts (with α = ν/2 = 1/2 and β = 2):
f_Z(u) = (1/√(2π)) e^{−(1/2)u²} for −∞ < u < ∞
If W ∼ χ²_1, then
f_W(w) = [1/(√π · 2^{1/2})] w^{−1/2} e^{−w/2} for w ≥ 0
with f_W(w) = 0 elsewhere.
We have
f_Y(z) = z^{−1/2} [1/(√2 √π)] e^{−(1/2)z} for z > 0
= 0 for z ≤ 0
We’re done!
We can now give a general formula that allows us to go from the p.d.f. of a given
random variable X to the p.d.f. of a transformed random variable Y = g(X).
Theorem: Let X have p.d.f. fX . Let y = g(x) be either strictly increasing or
strictly decreasing as a function of x, and X is continuous. Define Y = g(X). Then,
the p.d.f. of Y is
f_Y(y) = f_X(g^{−1}(y)) |dx/dy| for the appropriate range of values of Y.
Note that dx/dy = 1/(dy/dx).
Proof: First,
F_Y(y) = P(g(X) ≤ y)
= P(X ≤ g^{−1}(y)) (if g increasing)
while = P(X ≥ g^{−1}(y)) (if g decreasing)
Thus, we have,
F_Y(y) = F_X(g^{−1}(y)) (if g increasing)
= 1 − F_X(g^{−1}(y)) (if g decreasing)
The following is a very important result that allows one to simulate observations
from a given probability distribution by knowing only how to simulate observations
from a U (0, 1) distribution. This famous result is called the probability integral trans-
formation.
Theorem: Let X be a continuous random variable with a strictly increasing FX .
Let Y = FX (X). Then Y ∼ U (0, 1).
Note: here, our g is FX , i.e. Y = g(X) = FX (X). Also, you must use FX and not
some other F .
Proof: by our formula,
f_Y(y) = f_X(F_X^{−1}(y)) dx/dy
(|dx/dy| = dx/dy since F_X is increasing)
f_Y(y) = f_X(F_X^{−1}(y)) · 1/(dy/dx)
Recall y = F_X(x), dy/dx = f_X(x), and x = F_X^{−1}(y). Therefore,
f_Y(y) = f_X(F_X^{−1}(y)) / f_X(F_X^{−1}(y)) = 1 for 0 < y < 1
Very often, we’re interested in the simultaneous behaviour of several random vari-
ables, rather than one at a time, as we have considered up until now, e.g. if X =
number of kilometres traveled by a tire and Y = tread depth, we may wish to know
about the simultaneous or joint distribution of X and Y . This leads to so-called
multivariate distributions.
We shall consider bivariate (i.e. pairs of random variables) distributions, and indicate
the general extensions at the end.
19.2.1 Definition
F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y)
19.2.2 Notes
(1) This definition holds for both continuous and discrete random variables.
(2) It is possible to show (in an advanced probability course) that the joint c.d.f.
uniquely determines the probability distribution in two-dimensional space.
20 03 April
(1) F_{X,Y}(x, y) uniquely determines the joint probability distribution in two-dimensional
space, i.e. in theory, given any event B in R², once we know F_{X,Y}(x, y) for all
−∞ < x < ∞, −∞ < y < ∞, then P((X, Y) ∈ B) is uniquely determined.
(2) We define the so-called marginal c.d.f.s F_X and F_Y of F_{X,Y} as follows:
F_X(x) = F_{X,Y}(x, +∞)
and, similarly,
F_Y(y) = F_{X,Y}(+∞, y)
(3) FX,Y (x, y) is non-decreasing in x and y (e.g. FX,Y (x, y) ≤ FX,Y (x′ , y) for x <
x′ ).
(4) FX,Y (−∞, −∞) = 0 (c.f. FX (−∞) = 0) and FX,Y (∞, ∞) = 1.
(5) FX,Y (x, y) is jointly continuous from the right (c.f. FX (x) is continuous from the
right).
It follows that
and that
PX,Y (x, y) = P (X = x, Y = y).
Given f_{X,Y}(x, y), to find the marginal p.d.f., integrate out the variable you wish to
get rid of:
f_X(x) = ∫_{−∞}^∞ f_{X,Y}(x, y) dy
and
f_Y(y) = ∫_{−∞}^∞ f_{X,Y}(x, y) dx
(beware of times when the p.d.f. changes its form – see next example for details)
20.2.1 Example
Let
f_{X,Y}(x, y) = (2/3)(x + 2y) for 0 < x < 1, 0 < y < 1
and f_{X,Y}(x, y) = 0 elsewhere.
(1) Find the marginal p.d.f.s fX and fY .
f_X(x) = ∫_{−∞}^∞ f_{X,Y}(x, y) dy
= 0 + ∫_0^1 (2/3)(x + 2y) dy + 0
= (2/3)(x + 1) for 0 < x < 1
= 0 elsewhere
and
f_Y(y) = 0 + ∫_0^1 (2/3)(x + 2y) dx
= (1/3)(1 + 4y) for 0 < y < 1
= 0 elsewhere
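The marginals can be sanity-checked numerically (a sketch; the midpoint-rule integrator and the test point x = 0.4 are my own choices):

```python
def f_joint(x, y):
    """f_{X,Y}(x,y) = (2/3)(x + 2y) on the unit square, 0 elsewhere."""
    return (2/3) * (x + 2*y) if (0 < x < 1 and 0 < y < 1) else 0.0

def integrate(g, a, b, steps=20000):
    """Simple midpoint-rule numerical integration."""
    h = (b - a) / steps
    return sum(g(a + (i + 0.5) * h) for i in range(steps)) * h

# Integrating y out at x = 0.4 should reproduce f_X(0.4) = (2/3)(0.4 + 1)
x = 0.4
fx = integrate(lambda y: f_joint(x, y), 0, 1)
print(round(fx, 6), round((2/3) * (x + 1), 6))  # both 0.933333
```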
For x ≥ 1 and y ≥ 1:
FX,Y (x, y) = 1
Note: identify the part of the range of FX,Y (x, y) where 0 < x < 1 and where you
can let y → +∞. In this case, the part is 0 < x < 1 and y ≥ 1. Finally, for such y,
FX,Y (x, y) does not change with y.
Thus, we get the same marginal c.d.f. for X by two different but equivalent meth-
ods.
21 05 April
PY | X=x (y) = P (Y = y | X = x)
Again, by the definition of conditional probability, the right hand side is equal
to
P(X = x, Y = y)/P(X = x) = P_{X,Y}(x, y)/P_X(x)
(provided that P(X = x) ≠ 0).
f_{Y|X=x}(y | x) = f_{X,Y}(x, y)/f_X(x)
F_{Y|X=x}(y | x) = P(Y ≤ y | X = x) = ∫_{−∞}^y f_{Y|X=x}(u | x) du
It follows that
E(Y | X = x) = ∫_{−∞}^∞ y f_{Y|X=x}(y | x) dy
Note that this is the usual definition of expected value, except that we use the
conditional p.d.f..
In the discrete case:
P(Y ≤ y | X = x) = Σ_{u ≤ y} P_{Y|X=x}(u | x)
It’s only in the continuous case where things get a bit more complicated.
The following theorems are very useful analogues of the Law of Total Probability for
events.
Recall:
P(A) = Σ_{i=1}^n P(A | B_i)P(B_i)
21.3 Example
By (1), we have:
For 0 < y < 1 and 0 < x < 1:
f_{Y|X=x}(y | x) = [(2/3)(x + 2y)] / [(2/3)(x + 1)] = (x + 2y)/(x + 1)
Otherwise, for y ∉ (0, 1),
f_{Y|X=x}(y | x) = 0
(5) Find E(Y | X = x) for 0 < x < 1 and, in particular, find E(Y | X = 1/2).
By definition,
E(Y | X = x) = ∫_{−∞}^∞ y f_{Y|X=x}(y | x) dy
= ∫_0^1 [y(x + 2y)/(x + 1)] dy
= (x/2 + 2/3)/(x + 1) = (3x + 4)/(6(x + 1)) for 0 < x < 1
In particular,
E(Y | X = 1/2) = (3(1/2) + 4)/(6(1/2 + 1)) = (11/2)/9 = 11/18
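This conditional expectation can be checked by numerical integration of the conditional p.d.f. (a sketch; the midpoint-rule integrator is mine, not from the course):

```python
def integrate(g, a, b, steps=100000):
    """Simple midpoint-rule numerical integration."""
    h = (b - a) / steps
    return sum(g(a + (i + 0.5) * h) for i in range(steps)) * h

x = 0.5
cond_pdf = lambda y: (x + 2*y) / (x + 1)  # f_{Y|X=x}(y|x) on 0 < y < 1

print(round(integrate(cond_pdf, 0, 1), 6))                   # 1.0: valid density
print(round(integrate(lambda y: y * cond_pdf(y), 0, 1), 6))  # 11/18 = 0.611111
```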
21.4.1 Covariance
21.4.2 Notes
(1) Cov(X, Y) is a measure of how X and Y vary about their means simultaneously.
If Cov(X, Y) > 0, then this tells us that as X increases, Y tends to increase, and as X
decreases, Y tends to decrease. Conversely, if Cov(X, Y) < 0, then as X increases, Y tends to
decrease, and vice versa.
(2) “little theorem”
Cov(X, Y ) = E(XY ) − E(X)E(Y )
Proof: by definition,
22 10 April
While the sign of Cov(X, Y) tells you whether or not X and Y tend to vary in the
same direction together (positive if they do, negative if in opposite directions), the
magnitude of Cov(X, Y) depends on the scale of measurement. Thus, we define the
scale-free correlation ρ(X, Y) = Cov(X, Y)/(σ_X σ_Y).
Note that the sign of ρ(X, Y) is the same as the sign of Cov(X, Y), and
|ρ(aX, bY)| = |Cov(aX, bY)| / √(Var(aX)Var(bY)) = |ab||Cov(X, Y)| / (|ab|√(Var(X)Var(Y))) = |ρ(X, Y)|
22.1.3 Example
f_{X,Y}(x, y) = (2/3)(x + 2y) for 0 < x < 1, 0 < y < 1
and f_{X,Y}(x, y) = 0 elsewhere.
(6) Find Cov(X, Y ).
We need µ_X and µ_Y. We then need E(XY).
µ_X = ∫_{−∞}^∞ x f_X(x) dx = ∫_0^1 x (2/3)(x + 1) dx = 5/9
µ_Y = ∫_{−∞}^∞ y f_Y(y) dy = ∫_0^1 y (1/3)(1 + 4y) dy = 11/18
E(XY) = ∫_{−∞}^∞ ∫_{−∞}^∞ xy f_{X,Y}(x, y) dx dy = ∫_0^1 ∫_0^1 xy (2/3)(x + 2y) dx dy = 1/3
Cov(X, Y) = 1/3 − (5/9)(11/18) = −3/486 = −1/162
E(X²) = ∫_0^1 x² (2/3)(x + 1) dx = 7/18
Var(X) = 7/18 − (5/9)² = 13/162 ≈ .0802
σ_X = √.0802 ≈ .2833
Similarly, E(Y²) = ∫_0^1 y² (1/3)(1 + 4y) dy = 4/9, so
Var(Y) = 4/9 − (11/18)² = 23/324 ≈ .0710
σ_Y = √.0710 ≈ .2664
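These moment computations can all be verified by numerical double integration over the unit square (a sketch; the 2-D midpoint-rule helper is my own):

```python
def f_joint(x, y):
    """f_{X,Y}(x,y) = (2/3)(x + 2y) on the unit square."""
    return (2/3) * (x + 2*y)

def integrate2d(g, steps=400):
    """Midpoint-rule double integral of g over (0,1) x (0,1)."""
    h = 1.0 / steps
    pts = [(i + 0.5) * h for i in range(steps)]
    return sum(g(x, y) for x in pts for y in pts) * h * h

EX  = integrate2d(lambda x, y: x * f_joint(x, y))      # 5/9
EY  = integrate2d(lambda x, y: y * f_joint(x, y))      # 11/18
EXY = integrate2d(lambda x, y: x * y * f_joint(x, y))  # 1/3
EY2 = integrate2d(lambda x, y: y * y * f_joint(x, y))  # 4/9

print(round(EXY - EX * EY, 6))  # Cov(X,Y) = 1/3 - (5/9)(11/18) = -1/162
print(round(EY2 - EY**2, 6))    # Var(Y) = 4/9 - (11/18)^2 = 23/324
```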
22.2.1 Definition
Note: This definition is valid whether the random variables are continuous or dis-
crete.
The random variables X1 , X2 , . . . , Xn are said to be independent if and only if
FX1 ,X2 ,...,Xn (x1 , x2 , . . . , xn ) = FX1 (x1 )FX2 (x2 ) . . . FXn (xn )
for all −∞ < xi < ∞.
If X1 , X2 , . . . , Xn are jointly continuous, then it is easy to see that they are indepen-
dent if and only if
fX1 ,X2 ,...,Xn (x1 , x2 , . . . , xn ) = fX1 (x1 )fX2 (x2 ) . . . fXn (xn )
for all −∞ < xi < ∞.
If X1 , X2 , . . . , Xn are jointly discrete, then they are independent if and only if
PX1 ,X2 ,...,Xn (x1 , x2 , . . . , xn ) = PX1 (x1 )PX2 (x2 ) . . . PXn (xn )
for all −∞ < xi < ∞.
f_{X,Y}(1/4, 1/4) = (2/3)(1/4 + 2/4) = 1/2
f_X(1/4) = (2/3)(1/4 + 1) = 5/6
f_Y(1/4) = (1/3)(1 + 4/4) = 2/3
So that
f_X(1/4) f_Y(1/4) = 5/9 ≠ 1/2 = f_{X,Y}(1/4, 1/4)
Therefore, X and Y are not independent.
23 12 April
First off, use the m.g.f. method to find the exact distribution of a sum of independent
random variables, under certain circumstances.
Secondly, use the Central Limit Theorem to find the approximate distribution of
a sum of a “large” number of independent random variables under general condi-
tions.
Thirdly, we’ll discuss the Weak Law of Large Numbers, which enables us to interpret
probability as a limiting relative frequency.
The moment generating function method for finding the distribution of a
sum of independent random variables: Recall from last class that if X ⊥ Y,
then E(XY) = E(X)E(Y) (i.e. Cov(X, Y) = 0). In general, if X_1, X_2, . . . , X_n are
independent, then
E(∏_{i=1}^n X_i) = ∏_{i=1}^n E(X_i).
The following extended result is true: if X ⊥ Y, then g_1(X) ⊥ g_2(Y) for all functions
g_1, g_2.
23.1.1 Setup
The last step above is valid because functions of independent random variables are
themselves independent, and E(g(X_1)g(X_2)) = E(g(X_1))E(g(X_2)). Then,
M_{S_n}(t) = ∏_{i=1}^n M_{X_i}(t)
Remark: if the Xi ’s all have the same distribution (termed identically distributed ),
then
M_{S_n}(t) = (M_{X_i}(t))^n.
Step 3: having found ∏_{i=1}^n M_{X_i}(t), we hope to recognize its form as the m.g.f. of
a familiar distribution. If so, then by the Uniqueness Theorem of m.g.f.s, that
distribution must be the distribution of S_n.
23.1.2 In practice
(a) We have M_{X_i}(t) = e^{λ_i(e^t − 1)}. Therefore,
M_{S_n}(t) = ∏_{i=1}^n e^{λ_i(e^t − 1)} = e^{(Σ_{i=1}^n λ_i)(e^t − 1)}
which is the m.g.f. of a Po(Σ_{i=1}^n λ_i) random variable.
(b)
M_{S_n}(t) = ∏_{i=1}^n e^{µ_i t + σ_i²t²/2} = e^{(Σ_{i=1}^n µ_i)t + (Σ_{i=1}^n σ_i²)t²/2}
which we recognize as the m.g.f. of a N(Σ_{i=1}^n µ_i, Σ_{i=1}^n σ_i²) random variable.
(c) (try it yourself)
(d)
M_{X_i}(t) = 1/(1 − 2t)^{ν_i/2}
Therefore,
M_{S_n}(t) = 1/(1 − 2t)^{Σ_{i=1}^n ν_i/2}
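The additivity in case (a) (independent Poissons summing to a Poisson with λ = Σλ_i) can be illustrated by simulation (a sketch; the Knuth-style sampler, seed, and trial count are my choices):

```python
import math
import random

random.seed(1)

def poisson_sample(lam):
    """Knuth's method: multiply uniforms until the product drops below e^{-lam}."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

# S_n = Po(1) + Po(2) + Po(3) should behave like Po(6): mean 6 and variance 6
lams = [1.0, 2.0, 3.0]
trials = 20000
sums = [sum(poisson_sample(l) for l in lams) for _ in range(trials)]
m = sum(sums) / trials
v = sum((s - m)**2 for s in sums) / trials
print(round(m, 1), round(v, 1))  # both near 6
```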
Roughly, the Central Limit Theorem says: sums of a large number of independent
random variables are approximately normally distributed.
Theorem: Let X1 , X2 , . . . be independent and identically distributed (i.i.d.) ran-
dom variables with mean µ and variance σ 2 . Then,
P((S_n − nµ)/(√n σ) ≤ x) → P(Z ≤ x) ∀x as n → ∞
23.2.1 Notes
(1) Var(S_n) = Σ_{i=1}^n Var(X_i) = nσ², while E(S_n) = nµ. Therefore, the l.h.s. of
the Central Limit Theorem is just S_n standardized to have mean 0 and standard
deviation 1.
(2) The C.L.T. can be written in the form
P((X̄ − µ)/(σ/√n) ≤ x) → P(Z ≤ x)
23.2.2 Application
Suppose that it is known that the survival time for patients with Alzheimer’s disease
from onset of symptoms has a mean of 8 years and a standard deviation of 4 years. If
a sample of 30 patients with the disease is taken, what is the approximate probability
that their average survival will be less than seven years?
Solution: the C.L.T. generally works well with n ≥ 30. We’ll let X̄ = (Σ_{i=1}^{30} X_i)/30. We