
MATH 323: Lecture Notes

Ryan Ordille

April 16, 2012

These course notes are for MATH 323: Probability, offered at McGill University
in the Winter 2012 semester, taught by Professor David Wolfson. These notes are
simply what the instructor wrote on the board, and may contain errors. The notes
are written by me, Ryan Ordille, and no copyright infringement is intended. Please
don’t sell these notes, or pass them off as your own.
The notes up until 28 February (the first class after the break and midterm) are concise
– i.e. without review or obvious proofs. Most pre-midterm examples are left out
since, by the time these notes were typed up towards the end of the course, their
solutions were obvious.
These course notes can be found at https://github.com/ryanordille/m323w12.
Thank you to everyone who submitted corrections, and good luck!

1 17 January

1.1 Axioms and Theorems

1. P (E) ≥ 0
2. P (S) = 1
3. For pairwise disjoint events E1, E2, . . . (Ei ∩ Ej = ∅ for all i ≠ j):

   P(∪_{i=1}^{∞} Ei) = Σ_{i=1}^{∞} P(Ei)


4. Theorem 1: P (∅) = 0
5. Theorem 2: P (E c ) = 1 − P (E)
6. Theorem 3: P (E ∩ F c ) = P (E) − P (E ∩ F )
7. Theorem 4: If E ⊂ F , then P (E) ≤ P (F ).
8. Theorem 5: If E and F are any two events, then P (E ∪ F ) = P (E) + P (F ) −
P (E ∩ F ).
Word problem hints:
• “either/or” corresponds to a union of two events
• “at least one” – union
• “not” – complement
• “and” – intersection
• “proportion” – “a probability statement about an individual”

2 19 January

2.1 Using counting methods to compute probabilities

Theorem 6: Let S be a sample space (the set of all possible outcomes) with finitely
many outcomes, N. If all these outcomes are equally likely, then, if E is any
event,

P(E) = (number of ways that E can occur) / (total number of possible outcomes, N)

Counting the number of ways that E can occur (and N ) can be difficult without the
following counting rules.

2.1.1 Factorial

By definition,
n! = n × (n − 1) × · · · × 2 × 1 where 0! = 1


2.1.2 Multiplication rule for counting

Suppose you have k sets of distinct objects, of sizes n1, n2, . . . , nk respectively. Then
the number of ways to choose one object from each set is the product n1 × n2 × · · · × nk.

2.1.3 n choose k

Suppose that a set contains n distinct objects. Then, the number of ways to select
k objects from these n objects, if we sample without replacement, is denoted by C(n, k)
(“n choose k”). It turns out that:

C(n, k) = n! / (k!(n − k)!)

Note that here the order of selection is not important, e.g. the selection (A, B) is
equivalent to the selection (B, A).

2.1.4 Permutations

If the order is important in the previous example, the count is denoted by P(n, k), and

P(n, k) = n! / (n − k)!
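These counting functions are built into Python's standard library (math.comb and math.perm), which makes the rules easy to sanity-check:

```python
import math

# "n choose k": unordered selections without replacement
assert math.comb(5, 2) == 10   # 5! / (2! * 3!)

# Permutations P(n, k): ordered selections without replacement
assert math.perm(5, 2) == 20   # 5! / 3!

# The two counts differ by a factor of k! (the orderings of each selection)
n, k = 10, 4
assert math.perm(n, k) == math.comb(n, k) * math.factorial(k)
```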

3 24 January
This lecture just contained examples of the previous counting rules.

4 26 January

4.1 Conditional Probabilities

Very often, if we know that some event A has occurred, then this will affect the
probability that the event B will occur.


Definition: for two events A and B, with P(A) ≠ 0, we define the probability of
“B given A”, denoted by P(B | A), as

P(B | A) = P(A ∩ B) / P(A)

If P(A) = 0, then P(B | A) can be defined arbitrarily, so we consider it to be
undefined.
Notice that the right-hand side of the definition is given in terms of already-defined
quantities – straightforward, unconditional probabilities.

4.2 Notes

(1) Conditional probabilities satisfy the three axioms, just like unconditional prob-
abilities.
(2) (the multiplication rule for conditional probabilities) Let A and B be
two events. Then,

P (A ∩ B) = P (B | A)P (A) = P (A | B)P (B)

(3) When solving conditional probability problems, use the conditioning technique
to break a long intersection into a series of conditional and unconditional probabili-
ties.

5 31 January
Note that it’s not true in general that P (A | (B1 ∪B2 )) = P (A | B1 )+P (A | B2 ) even
if B1 ∩ B2 = ∅. However, it is true that P (B1 ∪ B2 |A) = P (B1 |A) + P (B2 |A).

5.1 The Law of Total Probability

Let A be any event and let B1, B2, . . . , Bm be a collection of m events satisfying:

1. Bi ∩ Bj = ∅ for all i ≠ j


2. ∪_{i=1}^{m} Bi = S (i.e. B1, B2, . . . , Bm form a partition of S)

Then,

P(A) = Σ_{i=1}^{m} P(A | Bi) P(Bi)

Notice here that the left-hand side (P (A)) may be difficult to find directly, while the
components of the right-hand side might be known or easy to find.

5.2 Bayes’ Theorem

Let A be any event. Let B1, B2, . . . , Bm form a partition of S. Then, for every
k = 1, 2, . . . , m,

P(Bk | A) = P(A | Bk) P(Bk) / P(A) = P(A | Bk) P(Bk) / [Σ_{i=1}^{m} P(A | Bi) P(Bi)]
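As an illustration, a quick computation in Python (the numbers here are hypothetical, not from the lecture: a condition with prevalence 1% and a diagnostic test with sensitivity 95% and false-positive rate 5%):

```python
# B1, B2 partition S; A is the event "test positive"
p_B1 = 0.01            # P(B1): has the condition
p_B2 = 0.99            # P(B2): does not
p_A_given_B1 = 0.95    # P(A | B1)
p_A_given_B2 = 0.05    # P(A | B2)

# Law of Total Probability gives the denominator P(A)
p_A = p_A_given_B1 * p_B1 + p_A_given_B2 * p_B2

# Bayes' theorem gives P(B1 | A)
p_B1_given_A = p_A_given_B1 * p_B1 / p_A
```

Note how small the posterior is (about .161) despite the accurate test, because the prior P(B1) is small.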

6 02 February

6.1 Independence

Sometimes, knowing that an event A has occurred will not affect the probability
that B will occur. In such a situation, we say that A and B are independent. More
formally, two events A, B are said to be independent (A ⊥ B) if and only if
P (B | A) = P (B) or P (A | B) = P (A)

Theorem: if A and B are disjoint, then A and B can only be independent if either
P (A) = 0 or P (B) = 0.
There’s another important (albeit less intuitive) definition of independence: A ⊥ B
if and only if P (A ∩ B) = P (A)P (B).
More generally, events A1, A2, . . . , An are mutually independent if and only if, for
every subset Ai1, Ai2, . . . , Aik of A1, A2, . . . , An,

P(∩_{j=1}^{k} Aij) = Π_{j=1}^{k} P(Aij)


It follows that, if A ⊥ B, then P (A ∪ B) = P (A) + P (B) − P (A)P (B).


In general, sampling without replacement assumes dependence, while sampling with
replacement assumes independence. Disjointness is entirely a set property, and
should not be confused with independence.
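The distinction can be checked by brute force on a small finite sample space. A sketch (the two-dice setup and the events are chosen here for illustration, not taken from the lecture):

```python
from fractions import Fraction
from itertools import product

# Sample space: two fair dice, 36 equally likely outcomes
S = list(product(range(1, 7), repeat=2))

def P(event):
    # exact probability of an event (a predicate on outcomes)
    return Fraction(sum(1 for w in S if event(w)), len(S))

A = lambda w: w[0] % 2 == 0      # first die is even
B = lambda w: w[0] + w[1] == 7   # the sum is seven

# A and B satisfy the product definition of independence...
assert P(lambda w: A(w) and B(w)) == P(A) * P(B)
# ...yet they are certainly not disjoint:
assert P(lambda w: A(w) and B(w)) > 0
```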

7 07 February

7.1 Random variables

Often, we are not so interested in the outcomes of an experiment themselves, but
rather in numerical values that can be associated with these outcomes.
Definition: a function X that maps the sample space S to the real line in such a
way that, for every ω ∈ S, X(ω) is a real number, is called a random variable (or
r.v. for short).
Note that two or more distinct ω’s can give the same value of X(ω). However, one
value of ω is not allowed to give two different values of X(ω), as this would not make
X a function.
The term “random variable” comes about because the outcomes of the experiment
are random or uncertain before we perform our experiment, and hence the value of
X will also be uncertain before the experiment.
We denote random variables with capitals, and the values of random variables after
experiments with lowercase letters.

8 09 February

8.1 Random variables continued

8.1.1 The cumulative distribution function

We define P(X ∈ B) to be P({ω ∈ S : X(ω) ∈ B}) = P(X⁻¹(B)) (i.e. we refer back to the
events of S to find the probabilities of events on the real line).


Definition: P(X ∈ (−∞, x]) = P(X ≤ x) is a function of x called the cumulative
distribution function (or c.d.f.) of the random variable X, and

F_X(x) = P(X ≤ x) for all x ∈ (−∞, ∞).

8.1.2 Properties of the c.d.f.

(1) The c.d.f. is a real-valued function of x.
(2) In order to specify F_X, we need to specify F_X(x) for all x ∈ (−∞, ∞).
(3) All c.d.f.s are non-decreasing and right-continuous.
(4) FX (−∞) = limx→−∞ FX (x) = 0 and FX (∞) = limx→∞ FX (x) = 1.

8.1.3 Continuous vs discrete random variables

Definition: We call a random variable continuous if its c.d.f. is a continuous
function of x (i.e. it has no jumps). If a random variable is not continuous, the
random variable is said to be discrete, and can assume at most a countable number
of distinct values.
Definition: For a discrete random variable X, the real-valued function of x specified
by PX (x) = P (X = x) ∀x that X can assume is called the probability function of
X.
Theorem: If we know PX (x) ∀x that X can assume, then we can find FX (x) ∀x ∈
(−∞, ∞). Conversely, if we’re given the c.d.f., we can find the probability func-
tion.

9 14 February

9.1 Named probability distributions

9.1.1 Discrete uniform distribution

Definition: X has a discrete uniform distribution on the set of N real numbers
a1 < a2 < · · · < aN if P(X = ai) = 1/N for all i = 1, 2, . . . , N.


The total probability (mass) is equally or uniformly spread out on each of the num-
bers ai .

9.1.2 Bernoulli distribution

Definition: a random variable X has a Bernoulli distribution with parameter p if
P(X = 1) = p and P(X = 0) = 1 − p = q.
This is used as a “building block” for more complicated random variables.
Starting from after the midterm and break:

10 28 February

10.1 Random variable distributions

Cumulative distribution function (c.d.f.): F_X(x) = P(X ≤ x)
For discrete random variables: P(X = x) = P_X(x)

10.1.1 Binomial distribution

The random variable X has a binomial distribution with parameters n and p if

P(X = x) = C(n, x) p^x (1 − p)^(n−x)

for x = 0, 1, . . . , n, where C(n, x) is “n choose x”, n is a non-negative integer, and 0 ≤ p ≤ 1.


Note that we should check that:
1. P_X(x) ≥ 0 (obvious)
2. Σ_{all x in range} P_X(x) = 1 (easy to check)

We write X ∼ Bin(n, p) to mean “X has the binomial distribution”.


How the binomial distribution arises: (the binomial setup)


Firstly, we have a sequence of n independent trials – that is, the outcomes of these
trials are mutually independent.
Secondly, each trial can result in exactly one of two possible outcomes: a “success”
(S) or a “failure” (F). We call such trials Bernoulli trials.
Thirdly, the probability of success at trial i is constant and equal to p for every
i = 1, 2, . . . , n. For example, in a coin toss, we cannot change the probability of
heads halfway through, so the probability of success is constant.
Theorem: Let X be the number of successes observed in these n trials. Then,

P(X = x) = C(n, x) p^x (1 − p)^(n−x) for x = 0, 1, 2, . . . , n.

Proof: (same idea for the genetic mutation example) First, note that the probability
of any particular configuration in which there are x successes (and n − x failures) is
just px (1 − p)n−x .
E.g.:

P(S1 ∩ S2 ∩ · · · ∩ Sx ∩ F_{x+1} ∩ F_{x+2} ∩ · · · ∩ Fn) = P(S1) P(S2) · · · P(Sx) P(F_{x+1}) · · · P(Fn),

assuming the trials are all independent. This is equal to

p × p × · · · × p × (1 − p) × · · · × (1 − p) = p^x (1 − p)^(n−x)

But,

{X = x} = ∪all possible configurations {configuration i with x successes}

Note that the configurations are disjoint, so, by axiom 3:

P(X = x) = Σ_{all configurations i} P(configuration i) = Σ_{all configurations i} p^x (1 − p)^(n−x)

Since each configuration is determined by a choice of x trial positions from the n total,
there are C(n, x) configurations, so for every x = 0, 1, . . . , n:

P(X = x) = C(n, x) p^x (1 − p)^(n−x)


10.1.2 Binomial distribution example

Suppose that the five year survival probability for lung cancer is .10. If thirty people
with lung cancer are sampled, what is the probability that at least three will survive
five years or longer?
Solution: Let Y = the number out of thirty who will survive five or more years.
We shall reasonably assume the binomial setup, with n = 30 and P (Si ) = .10, where
Si is the event where the ith subject survives five or more years. Therefore,
P(Y ≥ 3) = Σ_{y=3}^{30} C(30, y) (.10)^y (1 − .10)^(30−y) = 1 − Σ_{y=0}^{2} C(30, y) (.10)^y (1 − .10)^(30−y)
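A quick numerical evaluation of this complement sum (an added check, not part of the original notes):

```python
import math

n, p = 30, 0.10

def binom_pmf(y, n, p):
    # P(Y = y) for Y ~ Bin(n, p)
    return math.comb(n, y) * p**y * (1 - p)**(n - y)

# P(Y >= 3) via the complement
p_at_least_3 = 1 - sum(binom_pmf(y, n, p) for y in range(3))  # ≈ 0.5886
```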

Important note: do not immediately have the “burning desire” to use the binomial
distribution as soon as you see a bunch of trials, each of which can result in exactly
one of two outcomes. You must check if the trials are independent – they will not
be if you are sampling without replacement!
Observe that the Bernoulli distribution is a special case of the binomial distribution,
where n = 1. So, P_X(x) = p^x (1 − p)^(1−x) for x = 0, 1.

10.1.3 Poisson distribution

The random variable X is said to have a Poisson distribution with parameter λ > 0
if

P(X = x) = λ^x e^(−λ) / x! for x = 0, 1, 2, . . .

(Note the infinite range of x.)
Check that:
1. P_X(x) ≥ 0 – obvious
2. Σ_{x=0}^{∞} P_X(x) = 1 – by the Taylor series expansion of e^λ

The Poisson distribution arises as an approximation to the binomial for “large n and
small p”.


Theorem: Let X ∼ Bin(n, p). The limit of P(X = x) as n tends to infinity and p
tends to 0, in such a way that n × p is constant (= λ), is

λ^x e^(−λ) / x! for x = 0, 1, 2, . . .

(e.g. p = 6/n gives np = n(6/n) = 6)
Proof: We shall use the following result in proving our theorem:

lim_{n→∞} (1 + a/n)^n = e^a

P(X = x) = C(n, x) p^x (1 − p)^(n−x) for x = 0, 1, . . . , n
         = [n! / (x!(n − x)!)] p^x (1 − p)^n (1 − p)^(−x)   (∗)

Now, since λ = np, we have p = λ/n. Then, (∗) becomes:

[n! / (x!(n − x)!)] (λ^x / n^x) (1 − λ/n)^n (1 − λ/n)^(−x)

After some algebraic manipulation, this becomes:

(λ^x / x!) [n(n − 1)(n − 2) · · · (n − x + 1) / (n · n · · · n)] (1 − λ/n)^n (1 − λ/n)^(−x)

Notice that the second factor has x terms in both the numerator and the denominator,
so it tends to 1; also (1 − λ/n)^n → e^(−λ) and (1 − λ/n)^(−x) → 1. Now let n go to
infinity to get the limit:

(λ^x / x!) e^(−λ) □

11 01 March

11.1 Poisson distribution continued

11.1.1 Poisson distribution example

P(X = x) = λ^x e^(−λ) / x! where x = 0, 1, 2, . . .


Suppose that in a book of 1000 pages, on any particular page, there can be either
zero errors or one error. Suppose further that the probability of an error on any
particular page is .002. What is the approximate probability that there will be at
most three errors in the book?
Exact solution: There is a binomial setup – errors are likely to occur independently
amongst the n = 1000 trials. Each trial can result in either a success (an error is
found) or a failure (an error is not found). There is also a constant probability of
success (.002) for every trial. Let X be the number of successes in 1000 pages. Then
X ∼ Bin(1000, .002).

P(X ≤ 3) = Σ_{x=0}^{3} C(1000, x) (.002)^x (1 − .002)^(1000−x)

Approximate solution: Since n is large and p is small, we can use the Poisson
approximation with λ = n × p = 1000 × .002 = 2. Therefore:

P(X ≤ 3) ≈ Σ_{x=0}^{3} P(X = x) = Σ_{x=0}^{3} 2^x e^(−2) / x!

If X has a Poisson distribution with parameter λ, we write X ∼ Po(λ).
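The quality of the approximation can be checked numerically (an added sketch, not from the lecture):

```python
import math

n, p = 1000, 0.002
lam = n * p  # λ = n × p = 2

# Exact: P(X <= 3) for X ~ Bin(1000, .002)
exact = sum(math.comb(n, x) * p**x * (1 - p)**(n - x) for x in range(4))

# Poisson approximation: P(X <= 3) for X ~ Po(2)
approx = sum(lam**x * math.exp(-lam) / math.factorial(x) for x in range(4))

# The two cumulative probabilities agree closely for this n and p
assert abs(exact - approx) < 5e-3
```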

11.2 The Hypergeometric Distribution

The random variable X has a hypergeometric distribution with parameters N, a, and
n if:

P(X = x) = C(a, x) C(N − a, n − x) / C(N, n)

(for x ≤ a and n − x ≤ N − a).


The hypergeometric distribution applies when there are a objects of some type in a
set of N objects: it gives the probability of drawing x objects of this type in a
sample of size n, drawn without replacement.
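A short sketch of the hypergeometric probability function (the numbers N = 20, a = 5, n = 4 are hypothetical, chosen only to illustrate):

```python
import math

def hypergeom_pmf(x, N, a, n):
    # a objects of the special type among N; sample n without replacement
    return math.comb(a, x) * math.comb(N - a, n - x) / math.comb(N, n)

N, a, n = 20, 5, 4
total = sum(hypergeom_pmf(x, N, a, n) for x in range(0, min(a, n) + 1))
assert abs(total - 1.0) < 1e-12  # the probabilities sum to 1 (Vandermonde)
```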


11.3 The Geometric Distribution

X is said to have a geometric distribution with parameter p if

P(X = x) = P_X(x) = (1 − p)^(x−1) p for x = 1, 2, . . .

Check that the total probability is 1 (a geometric series):

Σ_{x=1}^{∞} p(1 − p)^(x−1) = p / (1 − (1 − p)) = 1

The geometric random variable is used to describe or model the trial number at
which the first success occurs in a sequence of independent Bernoulli trials, each
with the probability of success p.
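A small check of the geometric probability function (p = 0.3 is a hypothetical choice):

```python
p = 0.3  # hypothetical success probability

def geometric_pmf(x, p):
    # P(X = x): failures on trials 1 .. x-1, then a success on trial x
    return (1 - p) ** (x - 1) * p

# The probabilities sum to 1 (a geometric series), up to numerical precision
assert abs(sum(geometric_pmf(x, p) for x in range(1, 200)) - 1) < 1e-12

# P(X > k) = (1 - p)^k: the first k trials must all be failures
k = 5
tail = 1 - sum(geometric_pmf(x, p) for x in range(1, k + 1))
assert abs(tail - (1 - p) ** k) < 1e-12
```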

11.4 Expected values

The probability distribution of a random variable provides the complete story about
the random variable: there is no further information about a random variable beyond
its probability distribution. However, we often wish to summarize a probability
distribution. The two most common summaries are:
1. a parameter that discusses the “centre of distribution” and
2. a parameter that discusses how spread out the values are from the centre.
To this end, we give the following definition:
Expected value: let Y be a discrete random variable with probability function
P_Y(y). Then we define the expected value of Y, denoted by E(Y), to be

E(Y) = Σ_{all y} y P_Y(y) = Σ_{all y} y P(Y = y)


12 06 March

12.1 Review

Recall: given a random variable Y, we define the expected value or expectation of
Y, denoted by E(Y), to be:

E(Y) = Σ_{all y} y P(Y = y)

provided that the sum is finite.

12.2 Theory

Notes:
1. E(Y ) is often denoted by µY , and also is termed the mean of Y (also called
the “population mean”).
2. Interpretation: E(Y ) is a weighted average or mean of the possible values of
the random variable Y , where the weights are the probabilities of these values.
For example, in the special case where Y has a discrete uniform distribution
on a1, a2, . . . , aN, then

E(Y) = Σ_{i=1}^{N} ai P(Y = ai) = (1/N) Σ_{i=1}^{N} ai

So think of µ as the “average value” of Y .


3. µ is a constant, a parameter specific to a given probability distribution. You
need the distribution to compute µ.
4. E(cY) = c E(Y) where c is a constant, and E(Σ_{i=1}^{n} Yi) = Σ_{i=1}^{n} E(Yi) (the
proof will come later).
5. Note, however, that in general, E(XY) ≠ E(X)E(Y).
6. Let g be some real-valued function of a random variable Y. Then, if X = g(Y),
we have

E(X) = E(g(Y)) = Σ_{all y} g(y) P(Y = y)


The point of this: by definition, E(g(Y)) = E(X) = Σ_{all x} x P(X = x). Thus,
in order to find E(g(Y)), it would appear that we first need to find the prob-
ability distribution of X = g(Y). We'll see such transformations later – they
can be difficult. But you do not have to first find the distribution of X – you can
use the distribution of the original Y and sum g(y) P(Y = y).

In particular, if g(Y) = Y^k, then E(g(Y)) = E(Y^k) is called the kth moment
of Y; µ is the first moment of Y.

E(Y^k) = Σ_{all y} y^k P(Y = y)

Of particular importance is a special function of Y:

g(Y) = (Y − µY)²

In this case, E(g(Y)) = E((Y − µY)²) is called the variance of Y, denoted by
Var(Y) or σY². This gives the average squared distance between the values of Y
and its mean; it is a measure of the spread or variation of Y (i.e. of its distribution).
We call σY = √σY² the standard deviation of Y. It is often more convenient than
Var(Y), since it is in the same units as Y, unlike Var(Y).

Result:

Var(Y) = E((Y − µY)²) = E(Y²) − µY²

Proof:

E((Y − µY)²) = E(Y² − 2µY·Y + µY²)
             = E(Y²) − E(2µY·Y) + E(µY²)
             = E(Y²) − 2µY E(Y) + µY²
             = E(Y²) − 2µY² + µY²
             = E(Y²) − µY² □

Final note: Var(cY) ≠ c × Var(Y); rather, Var(cY) = c² × Var(Y).

12.3 Examples

12.3.1 Example 1

The calculation of insurance premiums: An insurance company will insure
your computer against theft for $1000. It is known that, with probability .05, your

computer will be stolen. What premium should the insurance company charge so
that its expected gain is 0?
Solution: Let c be the required premium. Let Y be the gain of the company in a
given year. We need to find the value of c such that E(Y ) = 0.
First, we need PY (y) for all y. We have P (Y = c) = .95 (where the computer
was not stolen) and P (Y = (c − 1000)) = .05 (where the computer was stolen).
Therefore,
E(Y) = c × .95 + (c − 1000) × .05
Setting E(Y ) = 0 and solving for c, we get c = 50. Thus, if the company charges
$50 for the policy, on average, over a large number of clients, they would neither lose
nor gain money.
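The break-even computation can be verified in a couple of lines:

```python
# E(Y) = c(1 - p) + (c - payout) * p = c - p * payout,
# so setting E(Y) = 0 gives c = p * payout
p_theft = 0.05
payout = 1000

c = p_theft * payout  # the break-even premium: $50 here
expected_gain = c * (1 - p_theft) + (c - payout) * p_theft
assert abs(expected_gain) < 1e-9
```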

12.3.2 Example 2

“Nuts and Bolts” Example: suppose that the random variable X has the probability
distribution:

P (X = −1.2) = .32
P (X = 2.6) = .40
P (X = 0) = .28

Find E(X) and Var(X).

E(X) = −1.2 × .32 + 0 × .28 + 2.6 × .40 = .656 = µX

Var(X) = E(X²) − µX², where E(X²) = Σ x² P(X = x):

E(X²) = (−1.2)² × .32 + 0² × .28 + (2.6)² × .40 = 3.1648

∴ Var(X) = 3.1648 − (.656)² = 2.7345 (to four decimal places)

Also, σ = √2.7345 ≈ 1.6536
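A direct recomputation of these summaries (note the exact mean is .656, which hand calculations sometimes round to .66):

```python
dist = {-1.2: 0.32, 0: 0.28, 2.6: 0.40}  # P(X = x) for each value x

mean = sum(x * p for x, p in dist.items())             # 0.656 exactly
second_moment = sum(x**2 * p for x, p in dist.items()) # 3.1648
var = second_moment - mean**2                          # 2.734464
sigma = var ** 0.5                                     # ≈ 1.6536
```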


13 08 March

13.1 Summary

Centre – E(X) = µ
Spread – Var(X) = σ²
Standard deviation – √σ² = σ

σ² = Σ (x − µ)² P(X = x) = Σ x² P(X = x) − µ²

13.2 The Mean and Variance of Some Named Distributions

13.2.1 Binomial

E(X) = µ = Σ_{x=1}^{n} x C(n, x) p^x (1 − p)^(n−x)
     = Σ [n(n − 1)! / ((x − 1)!(n − 1 − (x − 1))!)] p · p^(x−1) (1 − p)^(n−1−(x−1))   as (n − x)! = (n − 1 − (x − 1))!
     = np Σ C(n − 1, x − 1) p^(x−1) (1 − p)^(n−1−(x−1))
     = np Σ_{y=0}^{n−1} C(n − 1, y) p^y (1 − p)^(n−1−y)
     = np

The last sum is equal to 1 since it is just the sum of all the Bin(n − 1, p) probabilities.
To find Var(X), we first have to find E(X²). By definition:

E(X²) = Σ_{x=1}^{n} x² C(n, x) p^x (1 − p)^(n−x)


Note that x² will not cancel with the leading terms of x! as x did before, so we have to use
a trick. We first calculate E(X(X − 1)), which is easy. Then, notice that

E(X(X − 1)) = E(X²) − E(X) = E(X²) − µ

So, E(X²) = E(X(X − 1)) + µ. Finally, Var(X) = E(X(X − 1)) + µ − µ².

E(X(X − 1)) = Σ_{x=2}^{n} x(x − 1) C(n, x) p^x (1 − p)^(n−x)
            = n(n − 1)p² Σ [(n − 2)! / ((x − 2)!(n − 2 − (x − 2))!)] p^(x−2) (1 − p)^(n−2−(x−2))
            = n(n − 1)p² Σ C(n − 2, x − 2) p^(x−2) (1 − p)^(n−2−(x−2))
            = n(n − 1)p²

Var(X) = n(n − 1)p² + np − n²p² = np(1 − p)
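These closed forms can be checked against a brute-force summation of the p.m.f. (n = 12 and p = 0.3 are hypothetical parameters chosen for the check):

```python
import math

n, p = 12, 0.3  # hypothetical parameters for X ~ Bin(n, p)

pmf = [math.comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]

mean = sum(x * q for x, q in enumerate(pmf))
var = sum(x**2 * q for x, q in enumerate(pmf)) - mean**2

assert abs(mean - n * p) < 1e-9            # E(X) = np
assert abs(var - n * p * (1 - p)) < 1e-9   # Var(X) = np(1 − p)
```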

13.2.2 Bernoulli

In particular, the mean and variance of a Bernoulli random variable are p and p(1−p)
respectively (because n = 1).

13.2.3 Poisson


E(X) = Σ_{x=1}^{∞} x λ^x e^(−λ) / x!
     = e^(−λ) λ Σ_{x=1}^{∞} λ^(x−1) / (x − 1)!
     = e^(−λ) λ e^λ   (Taylor series for e^λ)
     = λ

So, the mean of a Poisson random variable is just its parameter λ.


For the variance, we use the trick we used for the variance of a binomial distribution.
First, we solve for E(X(X − 1)):

E(X(X − 1)) = Σ_{x=2}^{∞} x(x − 1) λ^x e^(−λ) / x!
            = λ² Σ x(x − 1) λ^(x−2) e^(−λ) / [x(x − 1)(x − 2)!]
            = λ² Σ λ^(x−2) e^(−λ) / (x − 2)!
            = λ²   (the last sum is over a Poisson distribution, reindexed with y = x − 2)

Since E(X²) = E(X(X − 1)) + µ, Var(X) = E(X(X − 1)) + µ − µ².

∴ Var(X) = λ² + λ − λ² = λ
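The same brute-force check works for the Poisson moments (λ = 4 is hypothetical; the infinite sum is truncated where the tail is negligible):

```python
import math

lam = 4.0  # hypothetical parameter for X ~ Po(λ)

# Truncate the infinite sum; the tail beyond x = 100 is negligible for λ = 4
pmf = [lam**x * math.exp(-lam) / math.factorial(x) for x in range(101)]

mean = sum(x * q for x, q in enumerate(pmf))
var = sum(x**2 * q for x, q in enumerate(pmf)) - mean**2

assert abs(mean - lam) < 1e-9   # E(X) = λ
assert abs(var - lam) < 1e-9    # Var(X) = λ
```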

13.2.4 Geometric


E(X) = Σ_{x=1}^{∞} x p (1 − p)^(x−1)
     = p Σ_{x=1}^{∞} x (1 − p)^(x−1)        (using our trick:)
     = p Σ_{x=1}^{∞} −(d/dp)(1 − p)^x
     = −p (d/dp) Σ_{x=1}^{∞} (1 − p)^x      (we can interchange the derivative and the sum)
     = −p (d/dp) [(1 − p) / (1 − (1 − p))]  (geometric series with ratio r = 1 − p)
     = −p (d/dp) (1/p − 1)
     = −p (−1/p²)
     = 1/p


For the variance, first find E(X(X − 1)) (two derivatives), and then you can find the
variance:

Var(X) = (1 − p) / p²

13.3 Continuous probability distributions

Definition: a random variable X with c.d.f. F_X is said to be continuous if F_X is
continuous for all −∞ < x < ∞.
The continuous c.d.f.s split into two types:
1. the so-called “absolutely continuous” c.d.f.s (essentially, they are differentiable)
and
2. the so-called “singular” c.d.f.s (without derivatives).
From now on in this course, we'll assume that continuous c.d.f.s F_X are
differentiable for all −∞ < x < ∞.
It follows that if X is continuous, then P(X = x) = 0 for every x. Consider the
“area under a curve” analogy.
Remember (as FX (x) = P (X ≤ x)):

P (X = x) = FX (x) − P (X < x)

Hence, we cannot specify a continuous random variable by giving the values
P(X = x) for all x that X can assume, as we did in the discrete case. Instead,
we introduce an analogue of the probability function known as the probability den-
sity function (p.d.f.). It will turn out that the p.d.f. also uniquely determines the
probability distribution.
Definition: A real-valued function f_X is said to be the probability density function
of a random variable X if:
1. f_X(x) ≥ 0 for all −∞ < x < ∞ and
2. P(X ∈ A) = ∫_A f_X(x) dx (i.e. f_X has the property that when you integrate it
over a set, you get the probability of the set).


14 13 March

14.1 The probability density function

Definition: the probability density function (p.d.f.) of a random variable X is any
function f_X(x) such that:
1. f_X(x) ≥ 0 for all −∞ < x < ∞ and
2. P(X ∈ A) = ∫_A f_X(x) dx for all events A ⊆ ℝ

A p.d.f. gives P(X ∈ A) by integrating over A.

14.2 Notes on the p.d.f.

(1) In particular, if A is of the form (−∞, x], then

P(X ∈ A) = P(X ≤ x) = ∫_{−∞}^{x} f_X(y) dy

In short, F_X(x) = ∫_{−∞}^{x} f_X(y) dy.

(2) Conversely, we can recover the p.d.f. from the c.d.f. by the Fundamental Theo-
rem of Calculus, since:

(d/dx) F_X(x) = F_X′(x) = f_X(x) for all x
dx

Thus, in particular, if f_X is a p.d.f., then:

∫_{−∞}^{∞} f_X(x) dx = 1

(3) Interpretation of a p.d.f.: We have, for small ∆x, that:

[F_X(x + ∆x) − F_X(x)] / ∆x ≈ f_X(x)

So it follows that:

F_X(x + ∆x) − F_X(x) ≈ f_X(x) ∆x


But the left hand side is just P (x < X ≤ x + ∆x). Finally, we have that fX (x)∆x
is approximately the probability that X lies in (x, x + ∆x].
Note that, because the p.d.f. does not represent a probability on its own, it can be
greater than 1 (though never less than 0). When multiplied by a small ∆x, it gives
an approximate probability. Any non-negative function whose total area equals 1
qualifies as a p.d.f., even if it exceeds 1 at some points.
(4) For continuous random variables, we define (replacing the discrete sum
Σ_{all x} g(x) P_X(x)):

E(g(X)) = ∫_{−∞}^{∞} g(x) f_X(x) dx

As before, when g(X) = X k , then we refer to the E(g(X)) = E(X k ) as the kth
moment. Of particular importance are the first moment (k = 1) and the second
moment (k = 2). Again, as before, we call the first moment the mean or the expected
value of X.
So, by definition:

E(X) = µX = ∫_{−∞}^{∞} x f_X(x) dx

and

E(X²) = ∫_{−∞}^{∞} x² f_X(x) dx

Finally, as before:

Var(X) = E((X − µX)²) = ∫_{−∞}^{∞} (x − µX)² f_X(x) dx = E(X²) − µX²

and

Var(X) = ∫_{−∞}^{∞} x² f_X(x) dx − (∫_{−∞}^{∞} x f_X(x) dx)²

Note: watch out for a p.d.f. that may change its form in different ranges of (−∞, ∞)
when carrying out an integration.


14.3 Examples

14.3.1 Example 1

Let f_X(x) = c(x² + 1) for 0 < x < 1 and f_X(x) = 0 elsewhere (c is a constant).
1. Find c.
2. Find P (.25 < X ≤ .50).
3. Find P (.25 < X < .50).
4. Find FX .
5. Find E(X) and σX .
(1) Since ∫_{−∞}^{∞} f_X(x) dx = 1, we must have:

∫_{−∞}^{0} 0 dx + ∫_{0}^{1} c(x² + 1) dx + ∫_{1}^{∞} 0 dx = 1

i.e. c(1/3 + 1) = 1, which gives c = .75.


(2)

P(.25 < X ≤ .50) = ∫_{.25}^{.50} (.75)(x² + 1) dx = 55/256

(3) For a continuous random variable, these two probabilities are the same (by rules
of integration, a single point contributes zero area):

P(.25 < X < .50) = P(.25 < X ≤ .50) = 55/256
(4)

F_X(x) = 0 for all x ≤ 0

For 0 < x < 1:

F_X(x) = ∫_{−∞}^{x} f_X(y) dy = ∫_{−∞}^{0} 0 dy + ∫_{0}^{x} (.75)(y² + 1) dy = (.75)(x³/3 + x)

F_X(x) = 1 for all x ≥ 1


(5)

E(X) = ∫_{−∞}^{∞} x f_X(x) dx
     = ∫_{0}^{1} x(.75)(x² + 1) dx
     = (.75)(x⁴/4 + x²/2) evaluated from 0 to 1
     = (.75)(.25 + .50)
     = 9/16

To find σX, first find:

E(X²) = ∫_{−∞}^{0} 0 dx + ∫_{0}^{1} x²(.75)(x² + 1) dx + ∫_{1}^{∞} 0 dx
      = (.75)(x⁵/5 + x³/3) evaluated from 0 to 1 = (.75)(1/5 + 1/3) = 2/5

σ² = Var(X) = E(X²) − (9/16)²
σ = √σ²
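These answers can be verified from the antiderivatives:

```python
def F(x):
    # c.d.f. on (0, 1): the antiderivative of .75(x^2 + 1) is .75(x^3/3 + x)
    return 0.75 * (x**3 / 3 + x)

# Total probability is 1, confirming c = .75
assert abs(F(1.0) - 1.0) < 1e-12

# Part (2): P(.25 < X <= .50) = 55/256
assert abs(F(0.50) - F(0.25) - 55 / 256) < 1e-12

# Part (5): E(X) from the antiderivative .75(x^4/4 + x^2/2) at x = 1
E = 0.75 * (1 / 4 + 1 / 2)
assert E == 9 / 16
```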

15 15 March

15.1 Named continuous probability distributions

15.1.1 The uniform distribution

Definition: The random variable X is said to be uniformly distributed on the
interval [a, b] if

f_X(x) = 1/(b − a) for a ≤ x ≤ b

and f_X(x) = 0 elsewhere.


Notes:


1. The p.d.f. is constant on [a, b]. On the graph, the height is constantly 1/(b − a), so
the area under the curve is 1. The probability is uniformly spread out over the
interval.
2. The uniform distribution is often used to model situations in which we believe
outcomes occur completely at random.
3. An important special case is the uniform distribution [0, 1].
4. Notation – we write X ∼ U (a, b) to mean that X has a uniform distribution
on the interval [a, b].
5. The c.d.f. of X is:

   F_X(x) = 0 for x < a
          = (x − a)/(b − a) for a ≤ x ≤ b
          = 1 for b < x

   The graph of the c.d.f. is 0 up to a, then grows linearly up to b, then is
   constantly 1 after.
6.

   µ = E(X) = ∫_{−∞}^{∞} x f_X(x) dx
            = 0 + ∫_{a}^{b} x · [1/(b − a)] dx + 0
            = (b² − a²)/(2(b − a))
            = (a + b)/2

   It is then easy to see that the variance of a uniform distribution is Var(X) =
   (b − a)²/12.

15.1.2 The exponential distribution

Definition: X has an exponential distribution with parameter β > 0 if:

f_X(x) = (1/β) e^(−x/β) for x ≥ 0


The density equals 0 elsewhere.


Notes:
1. we write X ∼ Exp(β) to describe this distribution.
2. If X is exponential, then X is a non-negative random variable, meaning P (X ≥
0) = 1.
3. The p.d.f. f_X(x) equals 1/β at x = 0 and decays exponentially as x grows
larger. The probability of an interval of length L decreases as the interval moves
to the right (e.g. the probability between 2 and 4 is greater than the probability
between 4 and 6): the probability is concentrated towards the origin.
4. The c.d.f. of X is:

   F_X(x) = 0 for all x < 0
   F_X(x) = 0 + ∫_{0}^{x} (1/β) e^(−y/β) dy = 1 − e^(−x/β) for x ≥ 0

5. µ = E(X) = ∫_{−∞}^{∞} x f_X(x) dx = 0 + ∫_{0}^{∞} x (1/β) e^(−x/β) dx = β. You can do this
using integration by parts, or we'll see a trick for this later. Also, σ² = Var(X) = β².
6. Sometimes, the exponential distribution will be “parameterized” in a different
way, i.e. the parameter will be written in a different form. The alternative
form of the p.d.f. is:

   f_X(x) = λ e^(−λx) for x ≥ 0
          = 0 elsewhere

   Watch how the writer parameterizes the distribution! In this form,
   E(X) = 1/λ and Var(X) = 1/λ².
7. The memoryless property:
Theorem: Let X ∼ Exp(β). Then, P (x ≤ X < x + h | X ≥ x) = P (0 ≤ X <
h).
In other words, the information that X ≥ x is “forgotten”.
Important note: the memoryless property does not assert that P (0 ≤ X <


h) = P (x ≤ X < x + h)!
Proof: Let B be the event {x ≤ X < x + h} and let A be the event {X ≥ x}. Then,

P(x ≤ X < x + h | X ≥ x) = P(B ∩ A)/P(A)
   = P(B)/P(A)   since B is a subset of A
   = [F_X(x + h) − F_X(x)] / [1 − P(X < x)]
   = [(1 − e^(−(x+h)/β)) − (1 − e^(−x/β))] / [1 − (1 − e^(−x/β))]
   = 1 − e^(−h/β)
   = P(0 ≤ X < h) □

There is an interesting converse – the only continuous distribution with the
memoryless property is the exponential. The geometric distribution is the discrete
distribution with this property.
8. The exponential distribution is used when you believe that X has a constant “hazard” (i.e. P(x ≤ X < x + h | X ≥ x) is roughly constant in x for small h). It is also used to model the times between events that occur according to a Poisson process (explained in later statistics courses).
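The memoryless property can be checked directly from the c.d.f. derived above. A small sketch (the choice β = 2 is arbitrary):

```python
import math

# X ~ Exp(beta): c.d.f. F(x) = 1 - e^(-x/beta), derived above.
beta = 2.0

def F(x):
    return 1.0 - math.exp(-x / beta) if x >= 0 else 0.0

def cond_prob(x, h):
    # P(x <= X < x + h | X >= x) = (F(x + h) - F(x)) / (1 - F(x))
    return (F(x + h) - F(x)) / (1 - F(x))

# The conditional probability does not depend on x: it always equals F(h).
for x in [0.0, 1.0, 5.0, 20.0]:
    assert abs(cond_prob(x, 0.7) - F(0.7)) < 1e-9
```

The same check would fail for most other continuous distributions – only the exponential passes it for every x and h.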

16 20 March

16.1 The Gamma Distribution

Before defining the gamma distribution, we need to define the gamma function.

16.1.1 The Gamma Function

Let α > 0. We denote the gamma function by Γ(α) and define

Γ(α) = ∫₀^∞ x^(α−1) e^(−x) dx

The gamma function has the following two important properties:



1. Γ(1/2) = √π
2. Γ(α + 1) = αΓ(α)
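Python’s math.gamma implements Γ, so both properties are easy to check numerically (the test values of α are arbitrary):

```python
import math

# math.gamma implements the gamma function.
assert abs(math.gamma(0.5) - math.sqrt(math.pi)) < 1e-12  # property 1

# property 2: Gamma(a + 1) = a * Gamma(a), checked at a few points
for a in [0.5, 1.7, 3.0, 6.25]:
    assert abs(math.gamma(a + 1) - a * math.gamma(a)) < 1e-9 * math.gamma(a + 1)
```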

16.1.2 The Gamma Distribution

Definition: A random variable X has a gamma density with parameters α, β if its p.d.f. is given by:

fX(x) = (1/(Γ(α) β^α)) x^(α−1) e^(−x/β) for x ≥ 0

and fX(x) = 0 when x < 0.
Notes:
(1)

∫₀^∞ (1/(Γ(α) β^α)) x^(α−1) e^(−x/β) dx = 1

This is as it should be – set y = x/β, with dx = β dy, to get

(1/Γ(α)) ∫₀^∞ y^(α−1) e^(−y) dy = 1.

This just proves that the given formula is a density, as it integrates to 1.
(2) The gamma distribution is said to be “flexible”, meaning that many different shapes for the p.d.f. can be induced by changing the two parameters α and β.
For α > 1, the p.d.f. rises rapidly just after x = 0, then drops off with a tail as x grows larger. fX is said to be skewed to the right.
For α = 1, fX decays exponentially, as with the exponential distribution.
For α < 1, the p.d.f. also decays, but more steeply near the origin.
(3) The gamma density can be used to model the waiting time for the nth event if
the times between events are independent exponential random variables.
(4) There are two important special cases of the gamma distribution:


1. If we set α = 1, we get an exponential distribution with parameter β.
2. If we set α = ν/2 and β = 2, then the density becomes:

(1/(Γ(ν/2) 2^(ν/2))) x^(ν/2 − 1) e^(−x/2) for x ≥ 0 (0 otherwise).

This particular p.d.f. plays an important role in statistics, and is called a Chi-square p.d.f. “with ν degrees of freedom”. ν is just a parameter with this peculiar name.
We write X ∼ χ2ν to mean “X has a Chi-square distribution with ν degrees of
freedom”.
(5) It is not too difficult to derive E(X) and Var(X) from the definition. It will be easier, however, once we know about moment-generating functions.
In the end, E(X) = αβ. (Write x^α as x^(α+1−1), let y = x/β, and carry out the integration.)
We get, similarly, Var(X) = αβ².
In particular, if X ∼ χ²ν, then E(X) = ν (set α = ν/2 and β = 2) and Var(X) = 2ν.
(6) The c.d.f. FX is not known in closed form: FX(x) = 0 for x < 0 and, for x > 0,

FX(x) = ∫₀^x (1/(Γ(α) β^α)) y^(α−1) e^(−y/β) dy
(7) Notation:
We write X ∼ Gamma(α, β) to mean “X has a gamma distribution with parameters
α, β”.
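A Monte Carlo sanity check of E(X) = αβ and Var(X) = αβ². The values α = 2, β = 3 are arbitrary choices; random.gammavariate happens to use the same (shape α, scale β) parameterization as these notes:

```python
import random

# Simulate from Gamma(alpha, beta) and compare sample moments with
# E(X) = alpha*beta = 6 and Var(X) = alpha*beta^2 = 18.
random.seed(0)
alpha, beta = 2.0, 3.0
xs = [random.gammavariate(alpha, beta) for _ in range(100_000)]
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
assert abs(mean - alpha * beta) < 0.1
assert abs(var - alpha * beta ** 2) < 1.0
```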

16.2 The Normal (or Gaussian) Distribution

The normal or Gaussian distribution is easily the most important distribution in


probability and statistics! The distribution seems to occur naturally all over.


16.2.1 Definition

The random variable X has a normal distribution with parameters µ, σ 2 if its p.d.f.
is given by
fX(x) = (1/(√(2π) σ)) e^(−(1/2)((x−µ)/σ)²)
for −∞ < x < ∞.

16.2.2 Notes

(1) We write X ∼ N (µ, σ 2 ).


(2) It is possible to show directly that E(X) = the parameter µ and V ar(X) = the
parameter σ 2 .
Note that if X ∼ N(1.2, 7.8), we mean that µ = 1.2 and σ² = 7.8, not σ = 7.8.
We’ll derive µ and σ 2 by using the so-called moment generating function later, as
the current integration would be a bit tricky (but not impossible).
(3) The p.d.f. has the famous bell shape. The features are:
1. fX is symmetric about µ.
2. Changing µ changes the location of the p.d.f., i.e. where it is centred on the
x-axis.
3. Increasing σ 2 increases the spread of the p.d.f. and decreasing σ 2 decreases the
spread.
(4) The c.d.f. is not known in closed form, similar to the gamma distribution.
Probabilities of intervals need to be done using numerical integration.
However, unlike the gamma density, it is possible to use a single table to find any
normal probability. The idea is to reduce the general problem to what is called a
standard normal problem.


16.2.3 Standard Normal Problem

Background: if I give you any random variable X with mean µ and standard deviation σ, then

Y = (X − µ)/σ

has E(Y) = 0 and Var(Y) = 1.
We are said to have standardized X.
However, if X ∼ N(µ, σ²), we have the following:

Z = (X − µ)/σ ∼ N(0, 1)
This is called a standard normal random variable or distribution.

17 22 March

17.1 Standardizing continued

Our main result from Tuesday: if X ∼ N(µ, σ²), then

Z = (X − µ)/σ ∼ N(0, 1)

Note: For any random variable with mean µ and variance σ², (X − µ)/σ will have mean 0 and variance 1. The proof of this is simple – just compute the mean and variance of (X − µ)/σ directly using the rules for E and Var.
The point of the main result is that, after standardizing, we still get a normal random
variable. If you standardize a random variable, you don’t always get a random
variable of the same type (unless you’re standardizing a random variable with a
normal distribution).

17.1.1 Example 1

Problem: If X ∼ N (−1.2, 4), find P (−1.9 ≤ X < 2.2).


Solution: The idea is to reduce the problem to a N (0, 1) problem, and then use
N (0, 1) tables.
Step 1: (do not draw a sketch now)

P(−1.9 ≤ X < 2.2) = P((−1.9 − (−1.2))/2 ≤ (X − (−1.2))/2 < (2.2 − (−1.2))/2)
= P(−.35 ≤ Z < 1.7) by our main result from before

Step 2: draw a sketch (a standard normal density with mean 0, shading in the area A between −.35 and 1.7).
Tables will give you areas to the right of a value z. The area to the right of z = 3 is essentially 0, so tables will usually not give values of z > 3. Recall that these areas are probabilities – e.g. the area to the right of z = 1.78 equals P(Z ≥ 1.78).
We’ll call A1 the area between −.35 and 0, and A2 the area between 0 and 1.7. We’ll get these values by using the symmetry of N(0, 1) about µ = 0. Tables only give positive values of z, so to get A1, subtract P(Z ≥ .35) from .5.
From the table values, A2 = .5 − .0446 = .4554 and A1 = .5 − .3632 = .1368, so A = A1 + A2 = .5922.
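Without tables, the same probability comes straight from statistics.NormalDist (remember σ = 2, not 4):

```python
from statistics import NormalDist

# X ~ N(-1.2, 4), so sigma = 2.
X = NormalDist(mu=-1.2, sigma=2)
p = X.cdf(2.2) - X.cdf(-1.9)
assert abs(p - 0.5922) < 1e-3  # matches the table computation above

# Equivalently, standardize first and use Z ~ N(0, 1):
Z = NormalDist()
assert abs((Z.cdf(1.7) - Z.cdf(-0.35)) - p) < 1e-12
```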

17.1.2 Example 2

Problem: Use the N (0, 1) tables inversely here. Suppose that a car battery is
known to have a lifetime that is approximately normally distributed with a mean of
36 months and a standard deviation of 6 months. What should the warranty period
be set at so that only 5% of batteries will need to be replaced?
Notice: Batteries cannot have a negative lifetime, so a true normal distribution cannot be exactly right here. However, the mean of 36 is six standard deviations above 0, so essentially no probability sits below 0, and we can use the model anyway. Strictly speaking, modelling anything that cannot take negative values with a normal distribution is not correct, but in almost all such cases the normal distribution will work just fine.
Solution: We have that X ∼ N (36, 62 ). Let x0 be the required warranty period.
We want that x0 such that P (X < x0 ) = .05, i.e. such that P (X ≥ x0 ) = .95.


Reduce this distribution to a standard normal distribution. So, we seek x0 such that:

P((X − 36)/6 ≥ (x0 − 36)/6) = .95

i.e. such that P(Z ≥ (x0 − 36)/6) = .95.
Draw a sketch – we’re looking for the z0 from standard normal tables such that the area to the right of it is .95; then set (x0 − 36)/6 equal to z0 and solve.
This z0 must be to the left of the mean 0, since the area to the right of the mean is only .5. Find a z1 such that the area to the right of it is .05; according to our tables, z1 = 1.64. Take the negative: the area to the right of z0 = −1.64 is .95 (using the symmetry of the normal distribution).
From the N(0, 1) tables, then, P(Z ≥ −1.64) = .95. Finally, we must have

(x0 − 36)/6 = −1.64

We get x0 = 26.16 months.
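The inverse c.d.f. does the table lookup for us. Note that the exact 5% quantile is −1.6449 rather than the table’s −1.64, so the answer shifts slightly, to about 26.13:

```python
from statistics import NormalDist

# X ~ N(36, 6^2); find x0 with P(X < x0) = 0.05 via the inverse c.d.f.
X = NormalDist(mu=36, sigma=6)
x0 = X.inv_cdf(0.05)
assert abs(x0 - 26.13) < 0.05  # table rounding gives 26.16
```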

17.2 Moment Generating Functions

17.2.1 Definition

Let X be a random variable with p.d.f. fX (respectively, probability function PX in the discrete case). We define the moment generating function (denoted m.g.f.) to be that function of t such that

MX(t) = E(e^(tX))

17.2.2 Notes

(1) The m.g.f. is a function of the real-valued t.
(2) In the continuous case,

E(e^(tX)) = ∫_−∞^∞ e^(tx) fX(x) dx

In the discrete case,

E(e^(tX)) = Σ_x e^(tx) P(X = x)

(3) For some distributions, the m.g.f. does not exist because the integral (or sum)
does not converge. We say that the m.g.f. exists if it exists in some interval containing
0.
(4) If the m.g.f. exists, then it is possible to recover the p.d.f. or probability
function, i.e. there is a one-to-one correspondence between a p.d.f. (pf) and a m.g.f..
Recovering it, although possible, is a bit complicated, and we will not be expected
to do so.
(5) Uses of the m.g.f. – the m.g.f. can be used to find the moments of a ran-
dom variable, and is often easier than finding the moments (E(X k )) by using the
definition.
Theorem: E(X^k) = MX^(k)(0), the kth derivative of MX evaluated at t = 0.
Proof (continuous case; for the discrete case, replace the integral by a sum): We have, by definition:

MX(t) = E(e^(tX)) = ∫_−∞^∞ e^(tx) fX(x) dx

so

M′X(t) = (d/dt) ∫_−∞^∞ e^(tx) fX(x) dx
= ∫_−∞^∞ (d/dt) e^(tx) fX(x) dx
= ∫_−∞^∞ x e^(tx) fX(x) dx

Now set t = 0:

M′X(0) = ∫_−∞^∞ x fX(x) dx = E(X)

In general, we get

MX^(k)(t) = ∫_−∞^∞ x^k e^(tx) fX(x) dx

This gives MX^(k)(0) = E(X^k), as advertised.
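The theorem is easy to check with finite differences. A sketch using the exponential m.g.f. M(t) = 1/(1 − βt) (derived below as the α = 1 case of the gamma family), with β = 2 chosen arbitrarily:

```python
# Finite-difference check of M'(0) = E(X) and M''(0) = E(X^2)
# for X ~ Exp(beta), whose m.g.f. is M(t) = 1/(1 - beta*t).
beta = 2.0
M = lambda t: 1.0 / (1.0 - beta * t)

h = 1e-5
m1 = (M(h) - M(-h)) / (2 * h)              # ~ M'(0)  = E(X)   = beta
m2 = (M(h) - 2 * M(0.0) + M(-h)) / h ** 2  # ~ M''(0) = E(X^2) = 2*beta^2
assert abs(m1 - beta) < 1e-6
assert abs(m2 - 2 * beta ** 2) < 1e-3
```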


18 27 March

18.1 Recall

MX(t) = E(e^(tX))
MX^(k)(0) = E(X^k)

18.2 The m.g.f.s of some important distributions

18.2.1 Gamma distribution

X ∼ Gamma(α, β)

MX(t) = ∫₀^∞ e^(tx) (1/(Γ(α) β^α)) x^(α−1) e^(−x/β) dx
= (1/(Γ(α) β^α)) ∫₀^∞ x^(α−1) e^(−x(1/β − t)) dx

Set y = x(1/β − t), with dy = (1/β − t) dx, to get

MX(t) = 1/(1 − βt)^α for t < 1/β

From MX(t), we immediately get, by the chain rule,

M′X(t) = αβ(1 − βt)^(−α−1), so M′X(0) = αβ.

Similarly,

M″X(0) = αβ² + α²β²

and, since Var(X) = E(X²) − (E(X))², we get Var(X) = αβ².
So, in particular, for α = 1 (i.e. the exponential distribution), we have

MX(t) = 1/(1 − βt)


with E(X) = β and Var(X) = β².
For α = ν/2 and β = 2 (i.e. a chi-square distribution with ν degrees of freedom), we get

MX(t) = 1/(1 − 2t)^(ν/2)

and E(X) = ν and Var(X) = 2ν.

18.2.2 Binomial distribution

X ∼ Bin(n, p)

MX(t) = Σ_{x=0}^n e^(tx) (n choose x) p^x (1 − p)^(n−x)
= Σ_{x=0}^n (n choose x) (pe^t)^x (1 − p)^(n−x)
= (pe^t + (1 − p))^n by the Binomial theorem, with a = pe^t and b = 1 − p
= (1 − p + pe^t)^n for all t ∈ (−∞, ∞)

Note: the Binomial theorem is Σ_{x=0}^n (n choose x) a^x b^(n−x) = (a + b)^n.
We get M′X(0) = np and M″X(0) = np(1 − p) + n²p². Therefore, Var(X) = np(1 − p).
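The closed form can be checked against the defining sum term by term (n = 10, p = 0.3, t = 0.7 are arbitrary choices):

```python
import math

# Defining sum for M_X(t), X ~ Bin(n, p), versus the closed form.
n, p, t = 10, 0.3, 0.7
lhs = sum(math.exp(t * x) * math.comb(n, x) * p ** x * (1 - p) ** (n - x)
          for x in range(n + 1))
rhs = (1 - p + p * math.exp(t)) ** n
assert abs(lhs - rhs) < 1e-9 * rhs
```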


18.2.3 Poisson distribution

X ∼ Po(λ)

MX(t) = E(e^(tX))
= Σ_{x=0}^∞ e^(tx) λ^x e^(−λ)/x!
= e^(−λ) Σ_{x=0}^∞ (e^t λ)^x / x!
= e^(−λ) e^(e^t λ) by the Taylor series of e^x
= e^(λ(e^t − 1))

Therefore, E(X) = M′X(0) = λ and E(X²) = λ + λ², so therefore Var(X) = λ.
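Same kind of check here: a truncated version of the defining sum against the closed form, with λ = 4 and t = 0.5 chosen arbitrarily (the tail beyond x = 60 is negligible for these values):

```python
import math

# Truncated defining sum versus the closed form exp(lambda*(e^t - 1)).
lam, t = 4.0, 0.5
partial = sum(math.exp(t * x) * lam ** x * math.exp(-lam) / math.factorial(x)
              for x in range(60))
closed = math.exp(lam * (math.exp(t) - 1))
assert abs(partial - closed) < 1e-9
```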

18.2.4 Normal distribution

Let X ∼ N(0, 1) to start with.

fX(x) = (1/√(2π)) e^(−(1/2)x²) for all x ∈ (−∞, ∞)

Therefore,

MX(t) = ∫_−∞^∞ e^(tx) (1/√(2π)) e^(−(1/2)x²) dx
= ∫_−∞^∞ (1/√(2π)) e^(−(1/2)(x² − 2tx + t²)) e^(t²/2) dx
= e^(t²/2) ∫_−∞^∞ (1/√(2π)) e^(−(1/2)(x − t)²) dx
= e^(t²/2) since the integrand is just a N(t, 1) p.d.f.

To get the m.g.f. of a N (µ, σ 2 ) random variable for arbitrary µ and σ 2 , we use the
following property of an m.g.f.: let a and b be constants. Then,

M_(aX+b)(t) = e^(bt) MX(at)


Now, recall that if X ∼ N(µ, σ²), then Z = (X − µ)/σ ∼ N(0, 1). Therefore, we can always write a N(µ, σ²) random variable X as X = σZ + µ.
Putting these two results together, we get

MX(t) = M_(σZ+µ)(t) = e^(µt) e^(σ²t²/2)

i.e.

MX(t) = e^(µt + (1/2)σ²t²) for all t ∈ (−∞, ∞)

We find M′X(0) = µ, M″X(0) = σ² + µ², and Var(X) = σ².
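The composition step is worth checking numerically: with M_Z(t) = e^(t²/2), the rule M_(aX+b)(t) = e^(bt)M_X(at) should reproduce e^(µt + σ²t²/2) exactly (µ = −0.7, σ = 1.9 are arbitrary choices):

```python
import math

# M_Z(t) = exp(t^2/2) for Z ~ N(0,1); X = sigma*Z + mu, so
# exp(mu*t) * M_Z(sigma*t) should equal exp(mu*t + sigma^2 * t^2 / 2).
mu, sigma = -0.7, 1.9
MZ = lambda t: math.exp(0.5 * t * t)
MX = lambda t: math.exp(mu * t + 0.5 * (sigma * t) ** 2)

for t in [-1.0, -0.2, 0.0, 0.5, 1.3]:
    assert abs(math.exp(mu * t) * MZ(sigma * t) - MX(t)) < 1e-12 * max(MX(t), 1.0)
```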

18.3 Transformations of random variables

Often, we’re given the distribution of a random variable X, but we’re more interested in some function Y = g(X) of this random variable; e.g. maybe we have the distribution of the velocity V, and we are interested in the distribution of the kinetic energy Y = (1/2)mV². In general, we’re concerned with finding the distribution of g(X) knowing the distribution of X.
First, consider the following two examples to illustrate the eventual formula for the
continuous case.

18.3.1 Example 1

Let X ∼ N(µ, σ²). Find the p.d.f. of Z = (X − µ)/σ. We know the answer to this, but we haven’t yet proved it.
Step 1: write down the c.d.f. of Z.

FZ(z) = P(Z ≤ z) = P((X − µ)/σ ≤ z)

Step 2: write FZ in terms of FX.

P((X − µ)/σ ≤ z) = P(X ≤ σz + µ) = FX(σz + µ)


Step 3: differentiate with respect to z.

fZ(z) = (d/dz) FZ(z) = σ fX(σz + µ) by the chain rule

Finally, substituting x = σz + µ into fX(x) = (1/(√(2π)σ)) e^(−(1/2)((x−µ)/σ)²):

fZ(z) = σ · (1/(√(2π)σ)) e^(−(1/2)z²) = (1/√(2π)) e^(−(1/2)z²) for all z ∈ (−∞, ∞)

18.3.2 Example 2 (Careful!)

Let Z ∼ N(0, 1). Find the p.d.f. of Y = Z².

FY(y) = P(Y ≤ y)
= P(Z² ≤ y)
= P(|Z| ≤ √y)
= P(−√y ≤ Z ≤ √y)
= FZ(√y) − FZ(−√y) for y ≥ 0

Finally,

fY(y) = (d/dy) FY(y)
= (1/2) y^(−1/2) fZ(√y) + (1/2) y^(−1/2) fZ(−√y)
= y^(−1/2) fZ(√y) for all y > 0 (the N(0, 1) density is symmetric about 0)


19 29 March

19.1 Transformations continued

If Z ∼ N(0, 1), then Y = Z² ∼ χ²₁. We have

FY(z) = P(−√z ≤ Z ≤ √z) = FZ(√z) − FZ(−√z)

fY(z) = (d/dz) FY(z) = (1/2) z^(−1/2) fZ(√z) + (1/2) z^(−1/2) fZ(−√z)

But fZ is a N(0, 1) p.d.f., which is symmetric about 0, and therefore fZ(√z) = fZ(−√z).
Therefore, we have

fY(z) = z^(−1/2) fZ(√z) for 0 < z < ∞
fY(z) = 0 for z ≤ 0

Finally, we use the following facts (taking ν = 1, i.e. α = ν/2 = 1/2 and β = 2):

fZ(u) = (1/√(2π)) e^(−(1/2)u²) for −∞ < u < ∞

and, if W ∼ χ²₁, then

fW(w) = (1/(√π 2^(1/2))) w^(−1/2) e^(−w/2) for w ≥ 0

with fW(w) = 0 elsewhere.
We have

fY(z) = z^(−1/2) (1/√(2π)) e^(−(1/2)z) for z > 0
fY(z) = 0 for z ≤ 0

which is exactly fW. We’re done!


19.1.1 General formula and theorem

We can now give a general formula that allows us to go from the p.d.f. of a given random variable X to the p.d.f. of a transformed random variable Y = g(X).
Theorem: Let X be continuous with p.d.f. fX, and let y = g(x) be either strictly increasing or strictly decreasing as a function of x. Define Y = g(X). Then, the p.d.f. of Y is

fY(y) = fX(g⁻¹(y)) |dx/dy| for the appropriate range of values of Y,

where x = g⁻¹(y). Note that |dx/dy| = 1/|dy/dx|.

Proof: First,

FY(y) = P(g(X) ≤ y)
= P(X ≤ g⁻¹(y)) (if g increasing)
= P(X ≥ g⁻¹(y)) (if g decreasing)

Thus, we have

FY(y) = FX(g⁻¹(y)) (if g increasing)
= 1 − FX(g⁻¹(y)) (if g decreasing)

Finally, as g(x) = y ⇒ x = g⁻¹(y),

fY(y) = (d/dy) FY(y)
= fX(g⁻¹(y)) dx/dy (if g increasing)
= −fX(g⁻¹(y)) dx/dy (if g decreasing)

But if g is decreasing, then dx/dy = 1/(dy/dx) < 0, so the two situations (where g is increasing and g is decreasing) can be combined into a single formula:

fY(y) = fX(g⁻¹(y)) |dx/dy|

Note: do not apply this formula unless g is either strictly increasing or strictly decreasing.


19.1.2 The probability integral transformation

The following is a very important result that allows one to simulate observations
from a given probability distribution by knowing only how to simulate observations
from a U (0, 1) distribution. This famous result is called the probability integral trans-
formation.
Theorem: Let X be a continuous random variable with a strictly increasing FX .
Let Y = FX (X). Then Y ∼ U (0, 1).
Note: here, our g is FX , i.e. Y = g(X) = FX (X). Also, you must use FX and not
some other F .
Proof: by our formula,

fY(y) = fX(FX⁻¹(y)) dx/dy

(here |dx/dy| = dx/dy, since FX is increasing). Recall y = FX(x), so dy/dx = fX(x) and x = FX⁻¹(y). Therefore,

fY(y) = fX(FX⁻¹(y)) · 1/(dy/dx) = fX(FX⁻¹(y)) · 1/fX(FX⁻¹(y)) = 1 for 0 < y < 1

and fY(y) = 0 elsewhere.
We recognize the above as a U(0, 1) p.d.f.
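Run in reverse, this is inverse transform sampling: push U(0, 1) draws through FX⁻¹. For Exp(β), F(x) = 1 − e^(−x/β) inverts to F⁻¹(u) = −β ln(1 − u). A sketch with β = 2 chosen arbitrarily:

```python
import math
import random

# Inverse transform sampling for X ~ Exp(beta): if U ~ U(0,1), then
# X = F^(-1)(U) = -beta * ln(1 - U) has the exponential c.d.f.
random.seed(0)
beta = 2.0
xs = [-beta * math.log(1.0 - random.random()) for _ in range(100_000)]
mean = sum(xs) / len(xs)
assert abs(mean - beta) < 0.05  # E(X) = beta for the exponential
```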

19.2 Joint probability distributions

Very often, we’re interested in the simultaneous behaviour of several random vari-
ables, rather than one at a time, as we have considered up until now, e.g. if X =
number of kilometres traveled by a tire and Y = tread depth, we may wish to know
about the simultaneous or joint distribution of X and Y . This leads to so-called
multivariate distributions.
We shall consider bivariate (i.e. pairs of random variables) distributions, and indicate
the general extensions at the end.


19.2.1 Definition

The random variables X and Y have joint c.d.f., denoted FX,Y , if

FX,Y (x, y) = P (X ≤ x ∩ Y ≤ y)

which we denote as P(X ≤ x, Y ≤ y).

19.2.2 Notes

(1) This definition holds for both continuous and discrete random variables.
(2) It is possible to show (in an advanced probability course) that the joint c.d.f.
uniquely determines the probability distribution in two-dimensional space.

20 03 April

20.1 Properties of the joint c.d.f.

(1) FX,Y(x, y) uniquely determines the joint probability distribution in two-dimensional space, i.e. in theory, given any event B in R², once we know FX,Y(x, y) for all −∞ < x < ∞, −∞ < y < ∞, then P((X, Y) ∈ B) is uniquely determined.
(2) We define the so-called marginal c.d.f.s FX and FY of FX,Y as follows:

FX(x) = P(X ≤ x) = FX,Y(x, +∞) = lim_{y→∞} FX,Y(x, y)

and, similarly,

FY(y) = FX,Y(+∞, y)

(3) FX,Y (x, y) is non-decreasing in x and y (e.g. FX,Y (x, y) ≤ FX,Y (x′ , y) for x <
x′ ).
(4) FX,Y (−∞, −∞) = 0 (c.f. FX (−∞) = 0) and FX,Y (∞, ∞) = 1.
(5) FX,Y (x, y) is jointly continuous from the right (c.f. FX (x) is continuous from the
right).


20.2 The role of the p.d.f. and probability functions in joint


distributions

Definition: We call fX,Y(x, y) the joint p.d.f. of (X, Y) if fX,Y(x, y) ≥ 0 and

P((X, Y) ∈ A) = ∫∫_A fX,Y(x, y) dx dy


We call PX,Y(x, y) the joint probability function of (X, Y) if

P((X, Y) ∈ A) = Σ_{(x,y)∈A} PX,Y(x, y)

for all events A in R².


In particular, for the event A = (−∞, x] × (−∞, y] in the continuous case:

FX,Y(x, y) = ∫_−∞^y ∫_−∞^x fX,Y(u, v) du dv

For the discrete case:

FX,Y(x, y) = Σ_{v≤y} Σ_{u≤x} PX,Y(u, v)

It follows that

fX,Y(x, y) dx dy ≈ P(x < X ≤ x + dx, y < Y ≤ y + dy)

and that

PX,Y(x, y) = P(X = x, Y = y).

By the Fundamental Theorem of Calculus,

∂²/∂x∂y FX,Y(x, y) = fX,Y(x, y)

Given fX,Y(x, y), to find the marginal p.d.f., integrate out the variable you wish to get rid of:

fX(x) = ∫_−∞^∞ fX,Y(x, y) dy


and

fY(y) = ∫_−∞^∞ fX,Y(x, y) dx

(beware of places where the p.d.f. changes its form – see the next example for details)

20.2.1 Example

Let

fX,Y(x, y) = (2/3)(x + 2y) for 0 < x < 1, 0 < y < 1

and fX,Y(x, y) = 0 elsewhere.
(1) Find the marginal p.d.f.s fX and fY.

fX(x) = ∫_−∞^∞ fX,Y(x, y) dy
= 0 + ∫₀¹ (2/3)(x + 2y) dy + 0
= (2/3)(x + 1) for 0 < x < 1
= 0 elsewhere

and

fY(y) = 0 + ∫₀¹ (2/3)(x + 2y) dx + 0
= (1/3)(1 + 4y) for 0 < y < 1
= 0 elsewhere

(2) Find the joint c.d.f. of (X, Y ).


We need FX,Y (x, y) for all (x, y) ∈ R2 .

FX,Y (x, y) = P (X ≤ x, Y ≤ y) = 0 for either x ≤ 0 or y ≤ 0


For 0 < x < 1 and 0 < y < 1:

FX,Y(x, y) = ∫₀^y ∫₀^x (2/3)(u + 2v) du dv
= (1/3)x²y + (2/3)xy²

For 0 < x < 1 and y ≥ 1:

FX,Y(x, y) = ∫₀¹ ∫₀^x (2/3)(u + 2v) du dv + ∫₁^y ∫₀^x 0 du dv
= x²/3 + (2/3)x

For x ≥ 1 and 0 < y < 1:

FX,Y(x, y) = ∫₀^y ∫₀¹ (2/3)(u + 2v) du dv + ∫₀^y ∫₁^x 0 du dv
= y/3 + (2/3)y²

For x ≥ 1 and y ≥ 1:

FX,Y(x, y) = 1
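A midpoint Riemann sum gives a quick numerical cross-check of the 0 < x < 1, 0 < y < 1 case; at (0.5, 0.5) the formula gives (0.5)²(0.5)/3 + 2(0.5)(0.5)²/3 = 1/8:

```python
# Midpoint Riemann sum for F(0.5, 0.5): integrate f(u, v) = (2/3)(u + 2v)
# over [0, 0.5] x [0, 0.5]; the closed form above gives 1/8.
n = 400
h = 0.5 / n
total = 0.0
for i in range(n):
    for j in range(n):
        u = (i + 0.5) * h
        v = (j + 0.5) * h
        total += (2.0 / 3.0) * (u + 2.0 * v) * h * h
assert abs(total - 0.125) < 1e-6
```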

(3) Find the marginal c.d.f. FX.
The first possible way uses the marginal p.d.f. that we got in part (1) of this example:

FX(x) = 0 for x ≤ 0

and

FX(x) = ∫₀^x (2/3)(u + 1) du = x²/3 + (2/3)x for 0 < x < 1

and, for x ≥ 1, FX(x) = 1.


The second possible way:

FX(x) = FX,Y(x, +∞)
= 0 for x ≤ 0
= x²/3 + (2/3)x for 0 < x < 1
= 1 for x ≥ 1

Note: identify the part of the range of FX,Y(x, y) where 0 < x < 1 and where you can let y → +∞. In this case, that part is 0 < x < 1 and y ≥ 1; for such y, FX,Y(x, y) does not change with y.
Thus, we get the same marginal c.d.f. for X by two different but equivalent meth-
ods.

20.3 Conditional Distributions

Conditioning plays a huge part in probability and statistics. Hence, we need to


consider conditional distributions.
Definition: Given the joint probability function of (X, Y), PX,Y(x, y) = P(X = x, Y = y), we define the conditional probability function of Y given X = x to be:

P(Y = y | X = x) = PX,Y(x, y)/PX(x) for all x such that PX(x) ≠ 0

This is denoted PY|X=x(y).

21 05 April

21.1 Conditional distributions continued

21.1.1 Conditional probability function

Also straight-forward is the conditional probability function:

PY | X=x (y) = P (Y = y | X = x)


Again, by the definition of conditional probability, the right-hand side is equal to

P(X = x, Y = y)/P(X = x) = PX,Y(x, y)/PX(x)

(provided that P(X = x) ≠ 0).

21.1.2 Conditional probability density function

Something more interesting is how we deal with, say, P(Y ≤ y | X = x) when X and Y are jointly continuous. We cannot define this as the ratio of the joint probability divided by P(X = x), since the latter is 0 for all x. Because of this, we need to take a slightly different route.
First, define the conditional p.d.f. of Y given X = x, denoted by

fY|X=x(y | x) = fX,Y(x, y)/fX(x)

for all x such that fX(x) ≠ 0.


Now, since integrating a p.d.f. over a region A gives the probability of that region, we have

FY|X=x(y | x) = P(Y ≤ y | X = x) = ∫_−∞^y fY|X=x(u | x) du

It follows that

E(Y | X = x) = ∫_−∞^∞ y fY|X=x(y | x) dy

Note that this is the usual definition of expected value, except that we use the conditional p.d.f.
In the discrete case:

P(Y ≤ y | X = x) = Σ_{u: u ≤ y} PY|X=x(u | x)

It’s only in the continuous case where things get a bit more complicated.


21.2 The Law of Total Probability for Random Variables

The following theorems are very useful analogues of the Law of Total Probability for
events.
Recall:

P(A) = Σ_{i=1}^n P(A | Bi)P(Bi)

21.2.1 Discrete case

For discrete random variables:

P(Y = y) = Σ_{∀x} P(Y = y | X = x)P(X = x) = Σ_{∀x} PY|X=x(y)PX(x)

The proof of this is the same as the proof for sets.

21.2.2 Continuous case

For continuous random variables, we have the following theorem:
Theorem: Let X, Y have joint p.d.f. fX,Y with conditional p.d.f. fY|X=x(y | x). Then,

(a) fY(y) = ∫_−∞^∞ fY|X=x(y | x)fX(x) dx

(b) FY(y) = ∫_−∞^∞ FY|X=x(y | x)fX(x) dx

Proof of (a): We have

fY(y) = ∫_−∞^∞ fX,Y(x, y) dx = ∫_−∞^∞ fY|X=x(y | x)fX(x) dx

Try part (b) yourself. Or not. Doesn’t matter to me.


21.3 Example

From Tuesday’s example, we had

fX,Y(x, y) = (2/3)(x + 2y) for 0 < x < 1, 0 < y < 1

and fX,Y(x, y) = 0 elsewhere.
(4) Find fY|X=x(y | x).
We must find fX,Y(x, y)/fX(x). By (1), for 0 < y < 1 and 0 < x < 1:

fY|X=x(y | x) = ((2/3)(x + 2y))/((2/3)(x + 1)) = (x + 2y)/(x + 1)

Otherwise, for y ∉ (0, 1),

fY|X=x(y | x) = 0

(5) Find E(Y | X = x) for 0 < x < 1 and, in particular, find E(Y | X = 1/2).
By definition,

E(Y | X = x) = ∫_−∞^∞ y fY|X=x(y | x) dy
= ∫₀¹ y(x + 2y)/(x + 1) dy
= (x/2 + 2/3)/(x + 1) = (3x + 4)/(6(x + 1)) for 0 < x < 1

In particular,

E(Y | X = 1/2) = (3(1/2) + 4)/(6(1/2 + 1)) = (11/2)/9 = 11/18
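The conditional mean can be cross-checked by numerical integration of ∫₀¹ y(x + 2y)/(x + 1) dy with the midpoint rule:

```python
# Midpoint-rule approximation of E(Y | X = x) = integral over (0, 1)
# of y*(x + 2y)/(x + 1) dy, which integrates to (3x + 4)/(6(x + 1)).
def cond_mean(x, n=2000):
    h = 1.0 / n
    return sum((j + 0.5) * h * (x + 2 * (j + 0.5) * h) / (x + 1) * h
               for j in range(n))

for x in [0.1, 0.5, 0.9]:
    assert abs(cond_mean(x) - (3 * x + 4) / (6 * (x + 1))) < 1e-5

# In particular, E(Y | X = 1/2) = 11/18.
assert abs(cond_mean(0.5) - 11 / 18) < 1e-5
```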


21.4 Bivariate analogues

We seek a summary of a bivariate distribution – a sort of analogue to E(X) or


V ar(X) for a univariate distribution. Before doing this, however, we need another
definition.
Definition: Let g(x, y) be a real-valued function of (x, y). Then, we define, for the continuous case:

E(g(X, Y)) = ∫_−∞^∞ ∫_−∞^∞ g(x, y) fX,Y(x, y) dx dy

In the discrete case:

E(g(X, Y)) = Σ_y Σ_x g(x, y) PX,Y(x, y)

21.4.1 Covariance

An important special case is when

g(X, Y) = (X − µX)(Y − µY)

(where E(X) = µX and E(Y) = µY). Thus, in this case, we’re talking about

E((X − µX)(Y − µY))

This is given a special name – the covariance between X and Y, written Cov(X, Y), and denoted by σXY.

21.4.2 Notes

(1) Cov(X, Y) is a measure of how X and Y vary about their means simultaneously. If Cov(X, Y) > 0, then as X increases, Y tends to increase, and as X decreases, Y tends to decrease. Conversely, if Cov(X, Y) < 0, then as X increases, Y tends to decrease, and vice versa.
(2) “little theorem”
Cov(X, Y ) = E(XY ) − E(X)E(Y )


Proof: by definition,

Cov(X, Y ) = E((X − µX )(Y − µY ))


= E(XY − µY X − µX Y + µX µY )
= E(XY ) − µY E(X) − µX E(Y ) + µX µY
= E(XY ) − µX µY

22 10 April

22.1 Covariance continued

22.1.1 Covariance and Correlation

While the sign of Cov(X, Y) tells you whether or not X and Y tend to vary in the same direction together (positive if they do, negative if in opposite directions), the magnitude of Cov(X, Y) depends on the scale of measurement. Thus,

Cov(aX, bY) = E((aX − aµX)(bY − bµY))
= E(ab(X − µX)(Y − µY))
= ab E((X − µX)(Y − µY))
= ab Cov(X, Y)

In other words, Cov(aX, bY) ≠ Cov(X, Y) in general. Therefore, we define a new quantity that has the same sign as Cov(X, Y), but which is scale invariant. This way, it does not matter what scale we take our measurements in (Celsius vs Fahrenheit, kilometres vs miles, etc.). Thus, we define the correlation between X and Y, written Corr(X, Y) (also ρ(X, Y)), as

ρ(X, Y) = Cov(X, Y)/√(Var(X)Var(Y)) = σXY/(σX σY)

Note that the sign of ρ(X, Y) is the same as the sign of Cov(X, Y), and

|ρ(aX, bY)| = |Cov(aX, bY)|/√(Var(aX)Var(bY)) = (|ab| |Cov(X, Y)|)/(|ab| √(Var(X)Var(Y))) = |ρ(X, Y)|

(i.e. ρ is scale invariant, up to sign).


22.1.2 Important remarks on the correlation coefficient

(1) It is not difficult to show that |ρ(X, Y)| ≤ 1 (i.e. −1 ≤ ρ(X, Y) ≤ 1) using the Cauchy-Schwarz inequality (E(XY) ≤ (E(X²)E(Y²))^(1/2)), or using the fact that 0 ≤ E((X − Y)²) = E(X²) − 2E(XY) + E(Y²) ⇒ 2E(XY) ≤ E(X²) + E(Y²). The proof was given in class, but will not be on the final.
(2) |ρ| = 1 if and only if Y = aX + b for constants a, b, i.e. if and only if there is a perfect linear relationship. Further, the argument from (1), applied to standardized variables, gives the claim in (2), since E((X ∓ Y)²) = 0 ⇔ Y = ±X.
(3) It is important to note that the correlation between two random variables is
a measure of linear dependence between them, and nothing else. Avoid using the
term “correlation” to describe dependence in general.

22.1.3 Example

(Same numbers from Thursday)

fX,Y(x, y) = (2/3)(x + 2y) for 0 < x < 1, 0 < y < 1

and fX,Y(x, y) = 0 elsewhere.
(6) Find Cov(X, Y ).
We need µX and µY . We then need E(XY ).


µX = ∫_−∞^∞ x fX(x) dx = ∫₀¹ x · (2/3)(x + 1) dx = 5/9

µY = ∫_−∞^∞ y fY(y) dy = ∫₀¹ y · (1/3)(1 + 4y) dy = 11/18

E(XY) = ∫_−∞^∞ ∫_−∞^∞ xy fX,Y(x, y) dx dy = ∫₀¹ ∫₀¹ xy · (2/3)(x + 2y) dx dy = 1/3

Cov(X, Y) = 1/3 − (5/9)(11/18) = 54/162 − 55/162 = −1/162

(7) Find Corr(X, Y ) = ρ(X, Y ).


We need V ar(X) and V ar(Y ).


E(X²) = ∫₀¹ x² · (2/3)(x + 1) dx = 7/18

Var(X) = 7/18 − (5/9)² = 13/162 ≈ .0802
σX = √.0802 = .2833

E(Y²) = ∫₀¹ y² · (1/3)(1 + 4y) dy = 4/9

Var(Y) = 4/9 − (11/18)² = 23/324 ≈ .0710
σY = √.0710 = .2664

So, at last, we can find Corr(X, Y):

Corr(X, Y) = (−1/162)/(.2833 × .2664) ≈ −.0818
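All of these moments can be cross-checked with midpoint Riemann sums over the unit square:

```python
import math

# Midpoint Riemann sums for the example density f(x, y) = (2/3)(x + 2y)
# on the unit square: recover the moments, Cov(X, Y), and rho numerically.
n = 400
h = 1.0 / n
Ex = Ey = Exy = Ex2 = Ey2 = 0.0
for i in range(n):
    for j in range(n):
        x = (i + 0.5) * h
        y = (j + 0.5) * h
        w = (2.0 / 3.0) * (x + 2.0 * y) * h * h  # f(x, y) dx dy
        Ex += x * w
        Ey += y * w
        Exy += x * y * w
        Ex2 += x * x * w
        Ey2 += y * y * w
cov = Exy - Ex * Ey
rho = cov / math.sqrt((Ex2 - Ex ** 2) * (Ey2 - Ey ** 2))
assert abs(cov - (-1.0 / 162.0)) < 1e-4
assert abs(rho - (-0.0818)) < 1e-3
```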

22.1.4 Linking variance and covariance

Theorem: V ar(X ± Y ) = V ar(X) + V ar(Y ) ± 2Cov(X, Y ).


Proof:
V ar(X + Y ) = E((X + Y )2 ) − (µX + µY )2
= E(X 2 ) − µ2X + E(Y 2 ) − µ2Y + 2E(XY ) − 2µX µY
= V ar(X) + V ar(Y ) + 2Cov(X, Y )
For V ar(X − Y ), we get V ar(X) + V ar(Y ) − 2Cov(X, Y ).
Corollary: if Cov(X, Y) = 0, then Var(X + Y) = Var(X) + Var(Y). In general:

Var(Σ_{i=1}^n Xi) = Σ_{i=1}^n Var(Xi)

if the Xi, Xj are uncorrelated for all i ≠ j.


22.2 Independence between random variables

We talked about independence of events, and so now it is natural to discuss the


notion of independence between random variables.

22.2.1 Definition

Note: This definition is valid whether the random variables are continuous or dis-
crete.
The random variables X1 , X2 , . . . , Xn are said to be independent if and only if
FX1 ,X2 ,...,Xn (x1 , x2 , . . . , xn ) = FX1 (x1 )FX2 (x2 ) . . . FXn (xn )
for all −∞ < xi < ∞.
If X1 , X2 , . . . , Xn are jointly continuous, then it is easy to see that they are indepen-
dent if and only if
fX1 ,X2 ,...,Xn (x1 , x2 , . . . , xn ) = fX1 (x1 )fX2 (x2 ) . . . fXn (xn )
for all −∞ < xi < ∞.
If X1 , X2 , . . . , Xn are jointly discrete, then they are independent if and only if
PX1 ,X2 ,...,Xn (x1 , x2 , . . . , xn ) = PX1 (x1 )PX2 (x2 ) . . . PXn (xn )
for all −∞ < xi < ∞.

22.2.2 Example continued

(8) Are X and Y independent?
Compare fX,Y(x, y) with fX(x)fY(y). Try x = 1/4, y = 1/4.

fX,Y(1/4, 1/4) = (2/3)(1/4 + 2(1/4)) = 1/2
fX(1/4) = (2/3)(1/4 + 1) = 5/6
fY(1/4) = (1/3)(1 + 4(1/4)) = 2/3

So that

fX(1/4)fY(1/4) = 5/9 ≠ 1/2

Therefore, X and Y are not independent.

22.2.3 Independence and covariance

What is the relationship between independence and covariance?


Theorem: If X and Y are independent, then Cov(X, Y ) = 0. Note that the converse
is not true.
Proof (continuous case): Assume X ⊥ Y. We must show that E(XY) = E(X)E(Y).

    E(XY) = ∫_{-∞}^{∞} ∫_{-∞}^{∞} xy f_{X,Y}(x, y) dx dy
          = ∫_{-∞}^{∞} ∫_{-∞}^{∞} xy f_X(x) f_Y(y) dx dy
          = (∫_{-∞}^{∞} y f_Y(y) dy) (∫_{-∞}^{∞} x f_X(x) dx)
          = µ_X µ_Y
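The failure of the converse can be seen with a tiny discrete example (a sketch, not from the notes): take X uniform on {-1, 0, 1} and Y = X^2. Then Cov(X, Y) = 0, yet X and Y are clearly dependent.

```python
from fractions import Fraction  # exact rational arithmetic

# X uniform on {-1, 0, 1}; Y = X^2. Joint pmf of (X, Y):
third = Fraction(1, 3)
pmf = {(-1, 1): third, (0, 0): third, (1, 1): third}

ex  = sum(p * x for (x, y), p in pmf.items())      # E[X] = 0
ey  = sum(p * y for (x, y), p in pmf.items())      # E[Y] = 2/3
exy = sum(p * x * y for (x, y), p in pmf.items())  # E[XY] = E[X^3] = 0

assert exy - ex * ey == 0                            # Cov(X, Y) = 0 ...

px1 = sum(p for (x, y), p in pmf.items() if x == 1)  # P(X = 1) = 1/3
py1 = sum(p for (x, y), p in pmf.items() if y == 1)  # P(Y = 1) = 2/3
p11 = pmf[(1, 1)]                                    # P(X = 1, Y = 1) = 1/3
assert p11 != px1 * py1                              # ... yet X, Y dependent
```

So zero covariance rules out only *linear* association, not dependence in general.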

23 12 April

23.1 Sums of independent random variables

Sums of random variables are particularly important in probability and statistics, since we frequently encounter averages of random variables; apart from the divisor n, the average

    x̄ = (1/n) Σ_{i=1}^n X_i

is just a sum.
We’ll do three things:


First off, use the m.g.f. method to find the exact distribution of a sum of independent random variables, under certain circumstances.

Secondly, use the Central Limit Theorem to find the approximate distribution of a sum of a "large" number of independent random variables under general conditions.

Thirdly, we'll discuss the Weak Law of Large Numbers, which enables us to interpret probability as a limiting relative frequency.
The moment generating function method for finding the distribution of a sum of independent random variables: Recall from last class that if X ⊥ Y, then E(XY) = E(X)E(Y) (i.e. Cov(X, Y) = 0). In general, if X_1, X_2, . . . , X_n are independent, then

    E(Π_{i=1}^n X_i) = Π_{i=1}^n E(X_i).

The following extended result is true: if X ⊥ Y, then g_1(X) ⊥ g_2(Y) for all functions g_1, g_2.

23.1.1 Setup

We have independent r.v.s X_1, X_2, . . . , X_n that are assumed to come from some known distribution (e.g. Poisson, Normal, etc.). We want to find the distribution of

    S_n = Σ_{i=1}^n X_i.

This is how the m.g.f. method works:


Step 1: find the m.g.f.s M_{X_i}(t).


Step 2: find the m.g.f. of S_n, M_{S_n}(t), as follows:

    M_{S_n}(t) = M_{Σ_{i=1}^n X_i}(t)
               = E(e^{t Σ_{i=1}^n X_i})
               = E(e^{tX_1} e^{tX_2} · · · e^{tX_n})
               = E(g(X_1) g(X_2) · · · g(X_n))   where g(x) = e^{tx}
               = Π_{i=1}^n E(e^{tX_i})

The last step above is valid because functions of independent random variables are themselves independent, and E(g(X_1)g(X_2)) = E(g(X_1))E(g(X_2)). Then,

    M_{S_n}(t) = Π_{i=1}^n M_{X_i}(t).

Remark: if the X_i's all have the same distribution (termed identically distributed), then

    M_{S_n}(t) = (M_{X_1}(t))^n.

Step 3: having found Π_{i=1}^n M_{X_i}(t), we hope to recognize its form as the m.g.f. of a familiar distribution. If so, then by the Uniqueness Theorem of m.g.f.s, that distribution must be the distribution of S_n.
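The product rule from Step 2 can be checked numerically on two small discrete distributions (a sketch, not from the notes; both pmfs are made up for illustration):

```python
import math

# Two independent discrete r.v.s with made-up pmfs.
pX = {0: 0.3, 1: 0.7}   # Bernoulli(0.7)
pY = {0: 0.5, 2: 0.5}   # takes values 0 or 2

# pmf of S = X + Y by discrete convolution (valid under independence)
pS = {}
for x, px in pX.items():
    for y, py in pY.items():
        pS[x + y] = pS.get(x + y, 0.0) + px * py

def mgf(pmf, t):
    """M(t) = E[e^{tV}] for a discrete r.v. with the given pmf."""
    return sum(p * math.exp(t * v) for v, p in pmf.items())

# M_{X+Y}(t) = M_X(t) * M_Y(t) at several t values
for t in (-1.0, 0.0, 0.5, 1.0):
    assert abs(mgf(pS, t) - mgf(pX, t) * mgf(pY, t)) < 1e-12
```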

23.1.2 In practice

Theorem: Let X1 , X2 , . . . , Xn be independent random variables.


(a) Let Xi ∼ P oisson(λi ).
(b) Let Xi ∼ N (µi , σi2 ).
(c) Let Xi ∼ Binomial(ni , p).
(d) Let Xi ∼ χ2νi .
Find the distributions of S_n in (a) through (d).
Solution: (easy!)


(a) We have M_{X_i}(t) = e^{λ_i(e^t - 1)}. Therefore,

    M_{S_n}(t) = Π_{i=1}^n e^{λ_i(e^t - 1)} = e^{(Σ_{i=1}^n λ_i)(e^t - 1)}

which we recognize as the m.g.f. of a Poisson(Σ_{i=1}^n λ_i) random variable. Therefore, by the Uniqueness Theorem, S_n ∼ Poisson(Σ_{i=1}^n λ_i). In particular, if λ_1 = λ_2 = · · · = λ_n = λ, then S_n ∼ Poisson(nλ).
(b) M_{X_i}(t) = e^{µ_i t + σ_i^2 t^2 / 2}. Therefore,

    M_{S_n}(t) = Π_{i=1}^n e^{µ_i t + σ_i^2 t^2 / 2} = e^{(Σ_{i=1}^n µ_i) t + (Σ_{i=1}^n σ_i^2) t^2 / 2}

which we recognize as the m.g.f. of a N(Σ_{i=1}^n µ_i, Σ_{i=1}^n σ_i^2) random variable.
(c) (try it yourself)
(d)

    M_{X_i}(t) = 1 / (1 - 2t)^{ν_i / 2}

Therefore,

    M_{S_n}(t) = 1 / (1 - 2t)^{Σ_{i=1}^n ν_i / 2}

which is the m.g.f. of a χ²_{Σ_{i=1}^n ν_i} r.v.
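Part (a)'s conclusion can also be verified directly, without m.g.f.s, by convolving two Poisson pmfs (a sketch, not from the notes; the λ values are arbitrary):

```python
import math

def pois_pmf(k, lam):
    """Poisson pmf: P(X = k) = e^{-λ} λ^k / k!"""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam1, lam2 = 2.0, 3.5

# P(X1 + X2 = k) by discrete convolution of the two pmfs;
# it should equal the Poisson(λ1 + λ2) pmf term by term.
for k in range(15):
    conv = sum(pois_pmf(j, lam1) * pois_pmf(k - j, lam2) for j in range(k + 1))
    assert abs(conv - pois_pmf(k, lam1 + lam2)) < 1e-12
```

The equality follows from the binomial theorem applied to (λ1 + λ2)^k inside the convolution sum, which is exactly what the m.g.f. argument encodes.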

23.2 The Central Limit Theorem

Roughly, the Central Limit Theorem says: sums of a large number of independent
random variables are approximately normally distributed.
Theorem: Let X_1, X_2, . . . be independent and identically distributed (i.i.d.) random variables with mean µ and variance σ². Then,

    P((S_n - nµ) / (σ√n) ≤ x) → P(Z ≤ x)   for all x, as n → ∞


where Z is a Standard Normal distribution (i.e. Z ∼ N(0, 1)).

Remember this is talking about the sums of the r.v.s, not the r.v.s themselves!

This helps explain why, in the real world, a lot of factors seem to be normally distributed: IQ scores, heights, weights, etc. This does not prove why, say, heights are normally distributed, however; it's just an observation and a plausibility argument.
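The theorem can be illustrated by simulation (a sketch, not from the notes): standardized sums of i.i.d. Uniform(0, 1) variables should behave like a standard normal.

```python
import math
import random

random.seed(0)  # fixed seed so the check is reproducible

n, reps = 30, 20000
mu = 0.5                        # mean of Uniform(0, 1)
sigma = math.sqrt(1.0 / 12.0)   # sd of Uniform(0, 1)

# Draw many standardized sums (S_n - n*mu) / (sigma * sqrt(n))
zs = [(sum(random.random() for _ in range(n)) - n * mu) / (sigma * math.sqrt(n))
      for _ in range(reps)]

def phi(x):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Empirical P(standardized sum <= x) should be close to P(Z <= x)
for x in (-1.0, 0.0, 1.0):
    emp = sum(z <= x for z in zs) / reps
    assert abs(emp - phi(x)) < 0.02
```

The 0.02 tolerance is generous relative to the Monte Carlo error for 20,000 replications, so the check passes comfortably for this symmetric starting distribution.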

23.2.1 Notes

(1) Var(S_n) = Σ_{i=1}^n Var(X_i) = nσ², while E(S_n) = nµ. Therefore, the l.h.s. of the Central Limit Theorem is just S_n standardized to have mean 0 and standard deviation 1.

(2) The C.L.T. can be written in the form

    P((X̄ - µ) / (σ/√n) ≤ x) → P(Z ≤ x)

where Z ∼ N(0, 1). Just divide top and bottom by n.


(3) Note that the C.L.T. gives us a Normal distribution as the approximate distri-
bution as the sum of any i.i.d. random variables.

23.2.2 Application

Suppose that it is known that the survival time for patients with Alzheimer’s disease
from onset of symptoms has a mean of 8 years and a standard deviation of 4 years. If
a sample of 30 patients with the disease is taken, what is the approximate probability
that their average survival will be less than seven years?
Solution: the C.L.T. generally works well with n ≥ 30. We'll let X̄ = (1/30) Σ_{i=1}^{30} X_i. We


want P(X̄ < 7). Use the C.L.T. as follows:

    P(X̄ < 7) = P((X̄ - µ)/(σ/√n) < (7 - µ)/(σ/√n))
             = P((X̄ - 8)/(4/√30) < (7 - 8)/(4/√30))
             ≈ P(Z < -1.37)
             = .0853
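The z-value and the table probability above can be reproduced with the standard normal c.d.f. (a sketch, not from the notes):

```python
import math

mu, sigma, n = 8.0, 4.0, 30
z = (7.0 - mu) / (sigma / math.sqrt(n))   # ≈ -1.37

def phi(x):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

p = phi(z)
assert abs(z - (-1.369)) < 0.01
assert abs(p - 0.0853) < 0.002   # matches the table value used above
```

The small difference from .0853 comes only from rounding z to two decimal places when reading the normal table.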
