
LECTURE 2

Review of Basic Probability Theory

2.1 PROBABILITY SPACE AND AXIOMS

Probability theory provides a set of mathematical rules to assign probabilities to outcomes
of random experiments, e.g., coin flips, packet arrivals, stock prices, neural spikes, noise
voltages, and so on. Given a random experiment, its sample space Ω is the set of all
outcomes. An event is a subset of the sample space, and we say that an event A ⊆ Ω occurs if
the outcome ω of the random experiment is an element of A. Let F be a set of events. A
probability measure P : F → [0, 1] is a function that assigns probabilities to the events in
F . We refer to the triple (Ω, F , P) as the probability space of the random experiment.
The probability measure P must satisfy the following.

Axioms of probability.
1. P(A) ≥ 0 for every event A in F.
2. P(Ω) = 1.
3. Countable additivity. If A_1, A_2, . . . are disjoint, i.e., A_i ∩ A_j = ∅ for i ≠ j, then

P(⋃_{i=1}^∞ A_i) = ∑_{i=1}^∞ P(A_i).

For the probability measure P to be well-defined over all events of interest, the set of
events F must satisfy:
1. ∅ ∈ F.
2. If A ∈ F, then A^c ∈ F.
3. If A_1, A_2, . . . ∈ F, then ⋃_{i=1}^∞ A_i ∈ F.

Due to these defining properties, F is often referred to as a σ-algebra or σ-field.



2.2 DISCRETE PROBABILITY SPACES

A probability space (Ω, F , P) is said to be discrete if the sample space Ω is countable, i.e.,
finite or countably infinite.
Example . (Flipping a coin). Ω = {H , T}, F = {, {H}, {T}, Ω}, and
P() = 0, P({H}) = p, P({T}) = 1 − p, P(Ω) = 1,
where p ∈ [0, 1] is the bias of the coin. A fair coin has a bias of 1/2.

For discrete sample spaces, F is often the set of all subsets of Ω, namely, the power
set 2^Ω of Ω. (Recall that |2^Ω| = 2^|Ω|.) In this case, the probability measure P can be fully
specified by assigning probabilities to individual outcomes (or singletons) {ω} so that
P({ω}) ≥ 0, ω ∈ Ω,
and
∑_{ω∈Ω} P({ω}) = 1.

Then it follows by the third axiom of probability that for any event A ⊆ Ω,
P(A) = ∑_{ω∈A} P({ω}).

Example 2.2 (Rolling a fair die). Ω = {1, 2, 3, 4, 5, 6}, F = 2^Ω = {∅, {1}, {2}, . . . , Ω},
and

P({i}) = 1/6,  i = 1, 2, . . . , 6.
The probability of the event A “the outcome is even,” i.e., A = {2, 4, 6}, is

P(A) = P({2}) + P({4}) + P({6}) = 3/6 = 1/2.
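
These computations are easy to carry out mechanically. The following Python sketch (an illustration added to these notes, not part of the original text) represents the fair-die measure of Example 2.2 by its singleton probabilities and computes P(A) by summation, per the third axiom:

    from fractions import Fraction

    pmf = {i: Fraction(1, 6) for i in range(1, 7)}   # fair die: P({i}) = 1/6

    def prob(A):
        """P(A) as the sum of P({w}) over outcomes w in A."""
        return sum(pmf[w] for w in A)

    print(prob({2, 4, 6}))   # 1/2
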
Example 2.3 (Flipping a coin n times). A coin with bias p is flipped n times. Then
Ω = {H, T}^n = {sequences of heads/tails of length n},
F = 2^Ω,
P({ω}) = p^i (1 − p)^{n−i},

where i is the number of heads in ω. The probability of the event A_k “the outcome consists
of k heads and n − k tails” is

P(A_k) = ∑_{ω: ω has k heads} P({ω}) = C(n, k) p^k (1 − p)^{n−k},

where C(n, k) = n!/(k!(n − k)!) denotes the binomial coefficient.
We can verify that
P(Ω) = ∑_{k=0}^n P(A_k) = ∑_{k=0}^n C(n, k) p^k (1 − p)^{n−k} = 1.
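
As a quick numerical sanity check (an added sketch; the values n = 10 and p = 0.3 are arbitrary choices), one can sum the binomial probabilities in Python:

    from math import comb

    n, p = 10, 0.3
    total = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))
    print(total)   # ≈ 1.0, up to floating-point rounding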

Example 2.4 (Flipping a coin until the first head). Ω = {H, TH, TTH, TTTH, . . .},
F = 2^Ω, and

P({ω}) = (1 − p)^i p,

where i is the number of tails in ω. Again we can verify that



P(Ω) = ∑_{ω∈Ω} P({ω}) = ∑_{i=0}^∞ (1 − p)^i p = 1.

Example 2.5 (Counting the number of packets). Consider the number of packets arriving
at a node in a communication network in time interval (0, T] at rate λ ∈ (0, ∞).
Then, Ω = {0, 1, 2, 3, . . . }, F = 2^Ω, and

P({k}) = ((λT)^k / k!) e^{−λT},  k = 0, 1, 2, . . . ,

provided that the number of packets is Poisson distributed. Note that

P(Ω) = ∑_{k=0}^∞ ((λT)^k / k!) e^{−λT} = 1.
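
A similar numerical check (again an added sketch; λ = 2 and T = 1.5 are arbitrary choices) shows the partial sums of the Poisson probabilities approaching 1:

    from math import exp, factorial

    lam, T = 2.0, 1.5
    partial = sum((lam * T)**k / factorial(k) * exp(-lam * T) for k in range(50))
    print(partial)   # ≈ 1.0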

In all examples so far, F = 2^Ω. This is not necessarily the case.


Example .. (Rolling a colored die). Suppose that each face of a die is colored, say, 
and  are red, and  through  are blue. Further suppose that the observer of a die roll can
only note the color of the face, not the actual number. Then Ω = {1, 2, 3, 4, 5, 6} as before,
but
F = {, {1, 2}, {3, 4, 5, 6}, Ω}.

This is a valid σ-algebra (check!), but it is much smaller in size than the previous case and
the probability measure is fully specified by P({1, 2}) alone. As an extreme, if all six faces
are of the same color, then we have the trivial σ-algebra F = {∅, Ω}, which is still valid
but hardly interesting. Thus, the choice of F controls the level of granularity at which one
can assign probabilities.

2.3 CONTINUOUS PROBABILITY SPACES

A continuous probability space has an uncountable number of elements in Ω. Unlike the
discrete case, the choice of F = 2^Ω, albeit valid, is too rich to admit an interesting
probability measure under the standard axioms of probability. At the same time, specifying
probabilities to singletons is not sufficient to extrapolate probabilities for other events.
Hence, F should be chosen more carefully, which is the main reason behind the intricate
definitions of probability measure and σ-algebra.
Suppose that Ω is the real line ℝ or its subinterval, e.g., [0, 1] = {x ∈ ℝ : 0 ≤ x ≤ 1}.

Then the set of events is typically taken to contain all open subintervals of Ω, i.e., all
intervals of the form (a, b), a, b ∈ Ω. More formally, let F be the smallest σ-algebra that
contains all open subintervals in Ω. This σ-algebra is commonly referred to as the Borel
σ-algebra B and accordingly each event in B is called a Borel set.
Since B is a σ-algebra, it is closed under complement, countable unions, and countable
intersections (cf. Problem .), and contains many subsets other than open intervals. For
example, since the half-open interval (a, b] can be represented by a countable intersection
of open intervals (Borel sets) as

(a, b] = 󵠏 (a, c), (.)


c∈ℚ: c>b

it is also Borel. As a matter of fact, B contains all open subsets and thus is the smallest
σ-algebra that contains all open subsets of Ω. The probability of any Borel set can be fully
specified by assigning probabilities to open intervals (or to closed intervals, half-closed
intervals, half-intervals, etc.).
Example . (Picking a random number between  and ). Ω = [0, 1], F = B, and

P((a, b)) = b − a, 0 ≤ a < b ≤ 1.

This is the uniform distribution over Ω. By (2.1) and the axioms of probability,

P((a, b]) = lim_{c→b} (c − a) = b − a,  0 ≤ a < b ≤ 1.

It can be similarly checked that

P([a, b]) = b − a, 0 ≤ a < b ≤ 1.

In particular, P({a}) = 0, a ∈ [0, 1], and the probability of picking any specific number is
zero.
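
This can be illustrated by simulation. The following Monte Carlo sketch (an added illustration; the endpoints a and b and the sample size are arbitrary) estimates P((a, b)) for the uniform distribution:

    import random

    a, b, n = 0.25, 0.60, 10**6
    hits = sum(a < random.random() < b for _ in range(n))
    print(hits / n)   # ≈ 0.35 = b - a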

For any reasonable Ω (such as a finite set or the d-dimensional Euclidean space ℝ d ,
but sometimes even a space of time series or functions), the Borel σ-algebra can be defined
as the smallest σ-algebra that contains all open subsets. When Ω is countable, the Borel
σ-algebra is 2^Ω. Henceforth, we assume that F is the Borel σ-algebra of Ω and any event
of our interest is Borel unless specified otherwise. Note, however, that for an uncountable
Ω, there are many subsets of Ω that are not Borel (if interested in these sets, refer to any
graduate-level course on measure theory).

2.4 BASIC PROBABILITY LAWS

We can establish the following as simple corollaries of the axioms of probability.


1. P(A^c) = 1 − P(A).
2. If A ⊆ B, then P(A) ≤ P(B).
3. P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
4. P(A ∪ B) ≤ P(A) + P(B).
More generally, we have the following inequality, also known as Boole’s inequality, that
can be generalized to a countably infinite number of events.

Union of events bound. For any events A_1, A_2, . . . , A_n,

P(⋃_{i=1}^n A_i) ≤ ∑_{i=1}^n P(A_i).

The following identity is very useful in finding the probability of a complicated event.

Law of total probability. Let A_1, A_2, . . . be events that partition Ω, that is, A_1, A_2, . . .
are disjoint (A_i ∩ A_j = ∅, i ≠ j) and ⋃_i A_i = Ω. Then for any event B,

P(B) = ∑_i P(A_i ∩ B).
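
Both of these laws can be checked by brute-force enumeration on a small discrete space. The following Python sketch (an added illustration; the events and the partition are arbitrary choices) does so on the fair-die space of Example 2.2:

    from fractions import Fraction

    pmf = {i: Fraction(1, 6) for i in range(1, 7)}
    prob = lambda E: sum(pmf[w] for w in E)

    # Union of events bound.
    events = [{1, 2}, {2, 3}, {3, 4}]
    print(prob(set().union(*events)) <= sum(prob(A) for A in events))  # True

    # Law of total probability.
    partition = [{1, 2}, {3, 4}, {5, 6}]
    B = {2, 4, 6}
    print(prob(B) == sum(prob(A & B) for A in partition))              # True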

2.5 CONDITIONAL PROBABILITY AND THE BAYES RULE

So far probability measures are similar to other common measures, such as length, area,
volume, and weight, all of which are nonnegative and countably additive. The notion
of conditioning is a unique feature of probability theory that is not found in general
measure theory.
Let B be an event such that P(B) ≠ 0. The conditional probability of the event A given
B is defined to be

P(A | B) = P(A ∩ B) / P(B).

The function P(⋅ | B) : F → [0, 1] is a probability measure in itself, that is, it satisfies the
three axioms of probability:
1. P(A | B) ≥ 0 for every A in F.
2. P(Ω | B) = 1.
3. If A_1, A_2, . . . are disjoint, then

P(⋃_{i=1}^∞ A_i | B) = ∑_{i=1}^∞ P(A_i | B).

Assume that P(A) ≠ 0 and P(B) ≠ 0. Then the conditional probability of A given B—
the a posteriori probability (or posterior in short) of A—can be related to the unconditional
probability of A— the a priori probability (or prior in short) of A. Using the definition of
conditional probability twice, we have

P(A | B) = P(A ∩ B) / P(B) = (P(B | A) / P(B)) P(A).   (2.2)

By multiplying P(B) on both sides, we establish the following useful identity.

Chain rule. For any pair of events A and B,

P(A ∩ B) = P(A) P(B | A) = P(B) P(A | B).

Note that the chain rule holds even when P(A) = 0 or P(B) = 0 if we interpret the
product of zero and an undefined number to be zero. By induction, the chain rule can be
generalized to more than two events. For example,

P(A_1 ∩ A_2 ∩ A_3) = P(A_1) P(A_2 | A_1) P(A_3 | A_1 ∩ A_2).

Let A_1, A_2, . . . , A_n be nonzero probability events that partition Ω and let B be a nonzero
probability event. By (2.2),

P(A_j | B) = (P(B | A_j) / P(B)) P(A_j).   (2.3)

By the law of total probability,


P(B) = ∑_{i=1}^n P(A_i ∩ B) = ∑_{i=1}^n P(A_i) P(B | A_i).   (2.4)

Substituting (.) into (.) yields the famous relationship between the priors P(A i ), i =
1, 2, . . . , n, and the posteriors P(A j | B), j = 1, 2, . . . , n.

Bayes rule. If A_1, A_2, . . . , A_n are nonzero probability events that partition Ω, then for
any nonzero probability event B,

P(A_j | B) = P(B | A_j) P(A_j) / ∑_{i=1}^n P(A_i) P(B | A_i),   j = 1, 2, . . . , n.

The Bayes rule also applies to a countably infinite number of events.



Example . (Binary communication channel). Consider the probability transition di-
agram for a noisy binary channel in Figure .. This is a random experiment with sample
space
Ω = {(0, 0), (0, 1), (1, 0), (1, 1)} ,

where the first entry is the bit sent (the input of the channel) and the second is the bit
received (the output of the channel). Define the two events

A = {0 is sent} = {(0, 1), (0, 0)},
B = {0 is received} = {(0, 0), (1, 0)}.

The probability measure on Ω is determined by P(A), P(B | A), and P(B^c | A^c), which are
given on the probability transition diagram. To find P(A | B), we use Bayes rule:

P(A | B) = P(B | A) P(A) / (P(A) P(B | A) + P(A^c) P(B | A^c))

to obtain

P(A | B) = (0.9 / (0.2 ⋅ 0.9 + 0.8 ⋅ 0.025)) ⋅ 0.2 = (0.9 / 0.2) ⋅ 0.2 = 0.9.

Note that the posterior P(A | B) = 0.9 is much larger than the prior P(A) = 0.2; even
though the observation is noisy, it still reveals some useful information about the input.
[Figure: a two-node transition diagram with input probabilities p(0) = 0.2 and p(1) = 0.8,
and transition probabilities p(0|0) = 0.9, p(1|0) = 0.1, p(0|1) = 0.025, and p(1|1) = 0.975.]

Figure 2.1. The probability transition diagram for a binary communication channel.
Here p(⋅) denotes the probability of the input and p(⋅|⋅) denotes the conditional
probability of the output given the input.
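
The posterior computation above can be reproduced in a few lines. The following Python sketch (an added illustration using the numbers from Figure 2.1) applies the law of total probability and then the Bayes rule:

    p_A = 0.2              # prior P(A): 0 is sent
    p_B_given_A = 0.9      # P(B | A): 0 received given 0 sent
    p_B_given_Ac = 0.025   # P(B | A^c): 0 received given 1 sent

    p_B = p_A * p_B_given_A + (1 - p_A) * p_B_given_Ac   # total probability
    print(p_B_given_A * p_A / p_B)                       # Bayes rule: ≈ 0.9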

2.6 INDEPENDENCE

Two events A and B are said to be statistically independent (or independent in short) if

P(A ∩ B) = P(A) P(B).



When P(B) ≠ 0, this is equivalent to

P(A | B) = P(A).

In other words, knowing whether B occurs provides no information about whether A
occurs.
Example .. We revisit the binary channel discussed in Example .. Assume that two
independent bits are sent over the channel and we would like to find the probability that
both bits are in error. Define the two events

E_1 = {First bit is in error},
E_2 = {Second bit is in error}.

Since the bits are sent independently, the probability that both are in error is

P(E_1 ∩ E_2) = P(E_1) P(E_2).

To find P(E_1), we express E_1 in terms of the events A_1 (0 is sent in the first transmission)
and B_1 (0 is received in the first transmission) as

E_1 = (A_1 ∩ B_1^c) ∪ (A_1^c ∩ B_1).

Since E_1 has been expressed as the union of disjoint events,

P(E_1) = P(A_1 ∩ B_1^c) + P(A_1^c ∩ B_1)
       = P(A_1) P(B_1^c | A_1) + P(A_1^c) P(B_1 | A_1^c)
       = 0.2 ⋅ 0.1 + 0.8 ⋅ 0.025
       = 0.04.

The probability that the two bits are in error is

P(E_1 ∩ E_2) = P(E_1) P(E_2) = (0.04)^2 = 1.6 × 10^{−3}.
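
The following Python sketch (an added illustration with the channel parameters of Example 2.8) reproduces these numbers:

    p0 = 0.2                # P(0 is sent)
    p_err_given_0 = 0.1     # P(1 received | 0 sent)
    p_err_given_1 = 0.025   # P(0 received | 1 sent)

    p_E = p0 * p_err_given_0 + (1 - p0) * p_err_given_1
    print(p_E)        # ≈ 0.04
    print(p_E ** 2)   # ≈ 1.6e-3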

In general, the events A_1, A_2, . . . , A_n are said to be mutually independent (or independent
in short) if for every subset A_{i_1}, A_{i_2}, . . . , A_{i_k} of the events,

P(A_{i_1} ∩ A_{i_2} ∩ ⋅ ⋅ ⋅ ∩ A_{i_k}) = ∏_{j=1}^k P(A_{i_j}).

For example, A, B, and C are independent if all of the following hold:

P(A ∩ B) = P(A) P(B), (.)


P(A ∩ C) = P(A) P(C), (.)
P(B ∩ C) = P(B) P(C), (.)
P(A ∩ B ∩ C) = P(A) P(B) P(C). (.)

Note that the last identity (2.8), or more generally,

P(A_1 ∩ A_2 ∩ ⋅ ⋅ ⋅ ∩ A_n) = ∏_{i=1}^n P(A_i),

is not sufficient for mutual independence.


Example .. Roll two fair dice independently. Define the events

A = {The first die roll is , , or },


B = {The first die roll is , , or },
C = {The sum of the two rolls is } = {(3, 6), (4, 5), (5, 4), (6, 3)}.

Since the dice are fair and the experiments are done independently, the probability of any
pair of outcomes is 1/36. Therefore

P(A) = 1/2,  P(B) = 1/2,  P(C) = 1/9.
Since A ∩ B ∩ C = {(3, 6)},

P(A ∩ B ∩ C) = 1/36 = P(A) P(B) P(C).
But A, B, and C are not independent because

P(A ∩ B) = 1/3 ≠ 1/4 = P(A) P(B).
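
Claims like these are easy to verify exhaustively. The following Python sketch (an added illustration, using the event definitions above) enumerates the 36 equally likely outcomes with exact arithmetic:

    from fractions import Fraction
    from itertools import product

    outcomes = list(product(range(1, 7), repeat=2))
    prob = lambda pred: Fraction(sum(1 for w in outcomes if pred(w)), 36)

    A = lambda w: w[0] in {1, 2, 3}
    B = lambda w: w[0] in {2, 3, 6}
    C = lambda w: w[0] + w[1] == 9

    p_ABC = prob(lambda w: A(w) and B(w) and C(w))
    print(p_ABC == prob(A) * prob(B) * prob(C))               # True: both 1/36
    print(prob(lambda w: A(w) and B(w)), prob(A) * prob(B))   # 1/3 vs 1/4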

Similarly, pairwise independence, e.g., (2.5)–(2.7), does not imply independence.


Example .. Flip two fair coins independently. Define the events

A = {The first coin is a head},


B = {The second coin is a head},
C = {Both coins are the same}.

Since the flips are independent,

P(A ∩ B) = P(B ∩ C) = P(C ∩ A) = 1/4 = P(A) P(B) = P(B) P(C) = P(C) P(A),

and A, B, and C are pairwise independent. However, they are not independent since

P(A ∩ B ∩ C) = 1/4 ≠ 1/8 = P(A) P(B) P(C).

Two events A and B are said to be conditionally independent given a third event C with
P(C) > 0 if

P(A ∩ B | C) = P(A | C) P(B | C).

Example .. We continue Example .. Since A ∩ C = {(3, 6)}, B ∩ C = {(3, 6), (6, 3)},
and A ∩ B ∩ C = {(3, 6)},

P(A ∩ B | C) = 1/4 ≠ 1/8 = P(A | C) P(B | C).

Hence, A and B are not conditionally independent given C. Now define the event

D = {The sum of the two rolls is } = {(1, 3), (2, 2), (3, 1)}.

Since A ∩ D = {(1, 3), (2, 2), (3, 1)}, B ∩ D = {(2, 2), (3, 1)}, and A ∩ B ∩ D = {(2, 2), (3, 1)},

P(A ∩ B | D) = 2/3 = P(A | D) P(B | D).

Hence, A and B are conditionally independent given D.
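
The conditional-independence claims can likewise be checked by enumeration. The following Python sketch (an added illustration, reusing the events of Examples 2.10 and 2.12) computes the relevant conditional probabilities exactly:

    from fractions import Fraction
    from itertools import product

    outcomes = list(product(range(1, 7), repeat=2))
    prob = lambda pred: Fraction(sum(1 for w in outcomes if pred(w)), 36)
    cond = lambda pred, given: prob(lambda w: pred(w) and given(w)) / prob(given)

    A = lambda w: w[0] in {1, 2, 3}
    B = lambda w: w[0] in {2, 3, 6}
    C = lambda w: w[0] + w[1] == 9
    D = lambda w: w[0] + w[1] == 4
    AB = lambda w: A(w) and B(w)

    print(cond(AB, C), cond(A, C) * cond(B, C))   # 1/4 vs 1/8: not cond. indep.
    print(cond(AB, D), cond(A, D) * cond(B, D))   # 2/3 = 2/3: cond. indep.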

Conditional independence of more than two events given another event C is defined
similarly as independence with respect to the probability measure P(⋅|C). Conditional
independence neither implies nor is implied by (unconditional) independence. In Example 2.12,
A and B are conditionally independent given D, but are not independent unconditionally
as shown in Example 2.10. In Example 2.11, A and B are independent, but they
are not conditionally independent given C since

P(A ∩ B | C) = 1/2 ≠ 1/4 = P(A | C) P(B | C).

PROBLEMS

2.1. σ-algebra. Show that if A_1, A_2, . . . ∈ F, then ⋂_{i=1}^∞ A_i ∈ F.

2.2. Limits of probabilities. Show
(a) P(⋃_{i=1}^∞ A_i) = lim_{n→∞} P(⋃_{i=1}^n A_i).
(b) P(⋂_{i=1}^∞ A_i) = lim_{n→∞} P(⋂_{i=1}^n A_i).

2.3. Extension of a probability measure. Consider a discrete probability space (Ω, 2^Ω, P),
where Ω is a subset of ℝ. Show that P(⋅ ∩ Ω) is a valid probability measure for the
sample space ℝ and the set of events 2^ℝ, that is, it satisfies the axioms of probability.

2.4. Independence. Show that the events A and B are independent if P(A | B) = P(A | B^c).

2.5. Conditional independence. Let A and B be two events such that P(A ∩ B) > 0. Show
that A and B are conditionally independent given A ∩ B.

2.6. Conditional probabilities. Let P(A) = 0.8, P(B^c) = 0.6, and P(A ∪ B) = 0.8. Find
(a) P(A^c | B^c).
(b) P(B^c | A).

2.7. Let A and B be two events with P(A) ≥ 0.5 and P(B) ≥ 0.75. Show that P(A ∩ B) ≥ 0.25.
2.8. Monty Hall. Gold is placed behind one of three curtains. A contestant chooses one
of the curtains, and Monty Hall (the game host) opens one of the unselected empty
curtains. The contestant has a choice either to switch his selection to the third
curtain or not.
(a) What is the sample space for this random experiment? (Hint: An outcome
consists of the curtain with gold, the curtain chosen by the contestant, and the
curtain chosen by Monty.)
(b) Assume that the placement of the gold behind the three curtains is random, the
contestant’s choice of curtains is random and independent of the gold placement,
and that Monty Hall’s choice of an empty curtain is random among the
alternatives. Specify the probability measure for this random experiment and
use it to compute the probability of winning the gold if the contestant decides
to switch.
2.9. Negative evidence. Suppose that the evidence of an event B increases the probability
of a criminal’s guilt; that is, if A is the event that the criminal is guilty, then
P(A | B) ≥ P(A). Does the absence of the event B decrease the criminal’s probability
of being guilty? In other words, is P(A | B^c) ≤ P(A)? Prove or provide a counterexample.
2.10. Random state transition. Consider the state diagram in Figure 2.2. The sample
space is

Ω = {(α, α), (α, β), . . . , (γ, γ)},

where the first entry is the initial state and the second entry is the next state. Define
the events

A_1 = {the initial state is α},  A_2 = {the next state is α},
B_1 = {the initial state is β},  B_2 = {the next state is β},
C_1 = {the initial state is γ},  C_2 = {the next state is γ}.

Assume that P(A_1) = 0.5, P(B_1) = 0.2, and P(C_1) = 0.3.
(a) Find P(A_2), P(B_2), and P(C_2).
(b) Find P(A_1 | A_2), P(B_1 | B_2), and P(C_1 | C_2).
(c) Find two events among A_1, A_2, B_1, B_2, C_1, C_2 that are pairwise independent.

2.11. Geometric pairs. Consider a probability space consisting of the sample space

Ω = {1, 2, 3, . . .}^2 = {(i, j) : i, j ∈ ℕ},

i.e., all pairs of positive integers, the set of events 2^Ω, and the probability measure
specified by

P((i, j)) = p^2 (1 − p)^{i+j−2},  0 < p < 1.

[Figure: a three-state diagram on the states α, β, and γ with labeled transition probabilities;
the edge labels appearing in the original figure are 0.5, 0.3, 0.2, 0.1, 0.8, 0.2, 0.2, and 0.7.]

Figure 2.2. The state diagram for a three-state system. Here the label of each edge
i → j denotes the transition probability from state i to state j, that is, the conditional
probability that the next state is j given the initial state is i.

(a) Find P({(i, j) : i ≥ j}).
(b) Find P({(i, j) : i + j = k}).
(c) Find P({(i, j) : i is an odd number}).
(d) Describe an experiment whose outcomes (i, j), i, j ∈ ℕ, have the probabilities
P((i, j)).
2.12. Juror’s fallacy. Suppose that P(A | B) ≥ P(A) and P(A | C) ≥ P(A). Is it always true
that P(A | B ∩ C) ≥ P(A)? Prove or provide a counterexample.
2.13. Polya’s urn. Suppose we have an urn containing one red ball and one blue ball. We
draw a ball at random from the urn. If it is red, we put the drawn ball plus another
red ball into the urn. If it is blue, we put the drawn ball plus another blue ball into
the urn. We then repeat this process. At the n-th stage, we draw a ball at random
from the urn with n + 1 balls, note its color, and put the drawn ball plus another
ball of the same color into the urn.
(a) Find the probability that the first ball is red.
(b) Find the probability that the second ball is red.
(c) Find the probability that the first three balls are all red.
(d) Find the probability that two of the first three balls are red.
