Chapter 3
Mathematics is a tool for reasoning. It is meant to help us see things clearly and refine our reasoning to the point that it is free from logical sloppiness. Probability theory is the mathematical language developed for
reasoning about chance and randomness.
The basis of the mathematical approach to science is the use of mathematical models. In order to reason about
a natural phenomenon, we first build a mathematical model that captures the essence of that natural phe-
nomenon. Once we have a mathematical model, we can use mathematical tools (such as algebra, calculus,
combinatorics, etc.) to analyze the model and make predictions about it. Finally, we need to translate the result
of our analysis and predictions back to see what they tell us about the original natural phenomenon.
In probability theory, the natural phenomena we wish to reason about are those which involve chance and
randomness. Such phenomena can often be described as random experiments.
3.1 Examples
We start with a few simple examples of random experiments and their mathematical models. These first ex-
amples are so simple that writing down mathematical models for them may seem redundant. However, the
examples are meant to help us familiarize ourselves with the language and the logic of using it. Once we be-
come familiar with the language, we can apply it to more and more complex scenarios where its power becomes
apparent.
Example 3.1.1 (Flipping a fair coin). Consider the random experiment of flipping a fair coin. The experiment
has two possible outcomes: either the coin comes up heads or it comes up tails. Let us represent these two
outcomes with H (standing for heads) and T (standing for tails). The set
Ω := {H, T}
is called the sample space for the experiment. Since the coin is fair, the two possible outcomes are equally likely.
We express this by writing
P(H) = P(T) = 1/2 .
The function P is the measure of probabilities: it tells us the probability of each possible outcome.
Q What exactly is the interpretation of P(H)? How should we understand the statement “P(H) = 1/2”?
A good way to think about P(H) is as the “idealized frequency” of the occurrence of the outcome H when we
repeat the experiment many many times. More specifically,
• If we repeat the experiment n times (where n is very large), then
P(H) ≈ ⟨number of heads⟩ / n .
Note that for any fixed n (say n = 10000), the frequency of heads need not be exactly 1/2, but we expect it to be
close to 1/2. Furthermore, the larger we choose n, the closer we expect the frequency of heads to be to 1/2. #
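To see this "idealized frequency" interpretation in action, here is a small illustrative Python sketch (the helper name head_frequency is just a placeholder): it flips a fair coin n times and reports the fraction of heads, which settles near 1/2 as n grows.

import random

def head_frequency(n, p=0.5, seed=0):
    """Flip a coin with P(H) = p a total of n times and return the fraction of heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < p for _ in range(n))
    return heads / n

for n in [100, 10_000, 1_000_000]:
    print(n, head_frequency(n))  # the frequency approaches 1/2 as n grows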
Example 3.1.2 (Flipping a biased coin). Let us again consider the random experiment of flipping a coin, but
this time suppose that the coin is biased (i.e., not fair), meaning that the outcomes are not equally likely. The
possible outcomes are the same as before, so the sample space is again
Ω := {H, T} .
What can we say about the measure of probabilities? The fact that the coin is biased means that the two possible
outcomes H and T are not equally likely. Without further information about the amount of the bias, the best we
can say is that
P(H) + P(T) = 1 .
As before, P(H) (the probability of H) and P(T) (the probability of T) are understood as “idealized frequencies”
in many repeated experiments. The two probabilities add up to 1 because the frequencies of the two possible
outcomes in repeated experiments add up to 1.
If we do not know the value of P(H), we can still treat it as a parameter. If we denote the value of P(H) by p,
then P(T) = 1 − p. We now have a complete (parametric) model of the random experiment of flipping a coin.
Note that based on the value of the parameter p, this model encompasses both the case in which the coin is
biased and the case in which the coin is fair. Namely, p = 1/2 if the coin is fair and p ̸= 1/2 if the coin is biased.
If p > 1/2, the coin is biased towards showing H more often, and if p < 1/2, the coin is biased towards showing T
more often. #
Example 3.1.3 (Rolling a die). After flipping a coin, the next simplest example of a random experiment is
perhaps the experiment of rolling an ordinary 6-sided die. The experiment has 6 possible outcomes, depending
on which side of the die is shown. The sample space (i.e., the set of all possible outcomes) is
Ω := {1, 2, 3, 4, 5, 6} .
If the die is fair (i.e., not rigged), the possible outcomes are all equally likely, hence the appropriate measure of
probabilities is
P(1) = P(2) = ··· = P(6) = 1/6 .
Q What is the probability that the number shown on the die is even?
A1 1/2 . This is rather obvious based on symmetry: the number shown on the die is either even (three
outcomes) or odd (three outcomes), and there is no reason why one of the two possibilities should be
more likely than the other.
While the above answer is valid, it does not really use our mathematical model. Here is how we can find the
answer using the model:
A2 The event “the die shows an even number” can be represented by a subset of the possible outcomes
A := {2, 4, 6} ,
which consists of those outcomes that realize the event (see Figure 3.1). The event happens if and only if the outcome of the experiment falls within this set. The probability of this event is simply the sum of the probabilities of the individual outcomes within this event:
P(A) = P(2) + P(4) + P(6) = 1/6 + 1/6 + 1/6 = 1/2 . #
Example 3.1.4 (Flipping a coin twice). Consider now the experiment of flipping the same coin twice in a row. Each possible outcome can be represented by a pair of letters, so the sample space is
Ω := {HH, HT, TH, TT} ,
with each outcome
indicating the result of the first and the second flips. We do not assume the coin to be fair or unfair. Instead, we
use a parameter p to denote the chance of getting a head in each single flip (as in Example 3.1.2).
Q What is the appropriate measure of probabilities in this example?
A
Pp(HH) = p^2 ,   Pp(HT) = p(1 − p) ,
Pp(TH) = (1 − p)p ,   Pp(TT) = (1 − p)^2 .
Why? In order to justify Pp (HH) = p2 , imagine repeating this experiment many many times. In a fraction of
about p of these repeated experiments, the first flip shows a head. Among those times in which the first flip
shows a head, in a fraction of about p, the second flip also shows a head. Thus, overall, in a fraction of
about p × p of all the repeated experiments, both flips show heads. The other probability assignments can
be argued similarly.
As concrete examples, if p = 1/2, we have
P1/2(HH) = P1/2(HT) = P1/2(TH) = P1/2(TT) = 1/4 ,
whereas if p = 1/3, we have
P1/3(HH) = 1/9 ,   P1/3(HT) = P1/3(TH) = 2/9 ,   P1/3(TT) = 4/9 .
This completes the description of the mathematical model. Note that we have used the subscript p in Pp to
emphasize the dependence of the measure of probabilities on the parameter p.
Let us use the model to answer some questions.
Q What is the probability that in both flips, the coin shows the same side?
A The event of interest can be represented by the set A := {HH, TT}, and its probability is the sum of the probabilities of the outcomes in it:
Pp(A) = Pp(HH) + Pp(TT) = p^2 + (1 − p)^2 .
Q What is the probability that in the first flip, the coin comes up heads?
A1 (Short answer) p . Remember that our original assumption was that each single time we flip the coin, the
chance of it coming up heads is p. Whether we flip the coin a second time afterwards or not is irrelevant.
A2 (Answer using the model) The event of interest is B := {HH, HT}, and its probability is
Pp(B) = Pp(HH) + Pp(HT) = p^2 + p(1 − p) = p .
Fortunately, this is consistent with the previous short answer. Our model seems to be working well.
Figure 3.2: The sample space Ω = {HH, HT, TH, TT}, together with the events A (the coin shows the same side in both flips) and B (the coin comes up heads in the first flip).
Q What is the probability that the coin either shows the same side in both flips or comes up heads in the first flip (or both)?
A1 (Direct computation) The event of interest is A ∪ B = {HH, HT, TT}, hence
Pp(A ∪ B) = Pp(HH) + Pp(HT) + Pp(TT) = p^2 + p(1 − p) + (1 − p)^2 .
A2 (Using the complement) The only outcome outside A ∪ B is TH, hence
Pp(A ∪ B) = 1 − Pp(TH) = 1 − (1 − p)p .
A3 (Using the inclusion-exclusion principle)
Pp(A ∪ B) = Pp(A) + Pp(B) − Pp(A ∩ B) = p^2 + (1 − p)^2 + p − p^2 = p + (1 − p)^2 .
The first equality is the probabilistic form of the inclusion-exclusion principle. In order to justify it, again
imagine that we repeat the experiment n times (n very large). Let Nn (A ∪ B) denote the number of times
in which the event A ∪ B occurs. By the (standard) inclusion-exclusion principle, we have
Nn (A ∪ B) = Nn (A) + Nn (B) − Nn (A ∩ B) .
Hence,
Pp(A ∪ B) ≈ Nn(A ∪ B)/n = [Nn(A) + Nn(B) − Nn(A ∩ B)]/n
          = Nn(A)/n + Nn(B)/n − Nn(A ∩ B)/n ≈ Pp(A) + Pp(B) − Pp(A ∩ B) .
#
Exercise. To answer the last question in the last example, we used three different reasonings and obtained three
apparently different values for the probability Pp (A ∪ B). Verify that the three answers are indeed the same.
The next example is somewhat more interesting. It contains a question whose answer cannot be easily
guessed without the help of the mathematical model.
Example 3.1.5 (Flipping until a head comes up). Consider the same coin we used in the previous example. As
before, let p denote the chance of getting a head in one flip of the coin. We perform the following experiment:
we repeat flipping the coin until the coin comes up heads for the first time.
This experiment has infinitely many possible outcomes, which can be represented by H, TH, TTH, and so on.
The sample space is
Ω := {H, TH, TTH, TTTH, . . .} .
Q What is the appropriate measure of probabilities?
A
P(H) = p ,
P(TH) = (1 − p)p ,
P(TTH) = (1 − p)^2 p ,
...
P(T^n H) = (1 − p)^n p   (for every n ≥ 0).
These can be justified as in Example 3.1.4.
We now have a complete model of the random experiment: the sample space Ω and the measure of probabili-
ties P.
Q What is the probability that we get an even number of T’s before the first H?
A1 The event of interest can be described by the set
E := {H, T^2 H, T^4 H, . . .} .
Thus,
P(E) = P(H) + P(T^2 H) + P(T^4 H) + ··· = p + (1 − p)^2 p + (1 − p)^4 p + ··· .
This is a geometric series2 with starting term p and common ratio (1 − p)^2. It converges to p/[1 − (1 − p)^2]. Hence,
P(E) = p / [1 − (1 − p)^2] .
Note that in case the coin is fair (i.e., p = 1/2), the above answer gives P(E) = 2/3. Is it surprising that
the answer in this case is not 1/2? The Venn diagram in Figure 3.3 may help you understand this. Another way to find the answer is the following:
Figure 3.3: The sample space Ω = {H, TH, T^2 H, T^3 H, . . .}, with the event E = {H, T^2 H, T^4 H, . . .} consisting of the outcomes in which an even number of T's appear before the first H.
A2 Consider the opposite event that “we get an odd number of T’s before the first H.” This event is described
by the complement of E, that is, the set
E^c = Ω \ E = {TH, T^3 H, T^5 H, . . .} .
Observe that in order for E c to happen, the first flip must show a T, and after the first flip, there must be
an even number of T’s before the first H. We can symbolically express this observation by writing E c = TE.
The chance of having a tail in the first flip is 1 − p, and the chance of realizing the event E starting from
the second flip is P(E). Since the result of the first flip does not in any way affect what happens after the
first flip, we obtain
P(E c ) = (1 − p) P(E) .
Now, remember that the probabilities of all the outcomes must add up to 1, hence
P(E c ) + P(E) = 1 .
Solving the above two equations for P(E) and P(E c ), we obtain
P(E) = 1/(2 − p) .
This solution will become clearer once we formulate the concept of the independence of events later in
this chapter. #
Exercise. Verify that the two answers in the last example are the same.
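As a numerical companion to this exercise, one can also check the two formulas p/[1 − (1 − p)^2] and 1/(2 − p) against a simulation. Below is an illustrative Python sketch (estimate_even_tails is just a placeholder name): it repeatedly flips a coin until the first head and records how often the number of tails before it is even.

import random

def estimate_even_tails(p, trials=200_000, seed=1):
    """Estimate P(even number of tails before the first head) by simulation."""
    rng = random.Random(seed)
    even = 0
    for _ in range(trials):
        tails = 0
        while rng.random() >= p:   # keep flipping tails until the first head
            tails += 1
        even += (tails % 2 == 0)
    return even / trials

p = 0.5
print(estimate_even_tails(p))               # roughly 0.666...
print(p / (1 - (1 - p) ** 2), 1 / (2 - p))  # both formulas give 2/3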
2 A quick review of geometric series comes in Interlude 3.A.
Interlude 3.A (Review of geometric series). A geometric series is the sum of an infinite number of terms in
which the ratio between every two consecutive terms is the same. For instance, in the series
S = 3 + 3/2 + 3/4 + 3/8 + ··· ,
each term is half the previous term. The following trick can be used to find the value of the sum. Observe that
S = 3 + (1/2) · (3 + 3/2 + 3/4 + ···) = 3 + S/2 ,
since the sum inside the parentheses is again S. Solving for S gives S = 6.
In general, a geometric series has the form
a + ar + ar^2 + ar^3 + ··· , (e)
where a is the starting term and r is the common ratio of the consecutive terms. Note that a geometric series
does not always converge, for instance when a = 1 and r = −2, the geometric series
1 − 2 + 4 − 8 + 16 − ···
diverges. The series (e) converges if and only if −1 < r < 1.3 When it converges, the value of the series (e)
can be found as in the example above.
Terminology.
• The set of all possible outcomes in a random experiment is called the sample space and is often denoted
by the Greek letter Ω.
• An event regarding the experiment is described by a subset of the sample space.
• Two events are mutually exclusive (i.e., they cannot occur simultaneously) if and only if they are disjoint
as sets (i.e., do not share elements).4
• A measure of probabilities, often denoted by P, is a function that assigns numbers to events.
Interpretation. The probability of an event A ⊆ Ω is understood as the “idealized frequency” of the occurrence
of A if we repeat the experiment many many times.5 More specifically, suppose that we repeat the experiment
n times, and denote by Nn (A) the number of times in which A occurs. Then,
P(A) ≈ Nn(A)/n
when n is large, and the approximation becomes more and more accurate as n → ∞.
Axioms for consistency. In order to have a meaningful and self-consistent model, the measure of probabilities
must satisfy a number of axioms:
I. For each event A ⊆ Ω, 0 ≤ P(A) ≤ 1.
II. P(Ω) = 1.
III. P(∅) = 0.
IV. (additivity) If A and B are disjoint events (i.e., A ∩ B = ∅), then P(A ∪ B) = P(A) + P(B).
3 The rigorous proof of this can be found in most calculus books.
4 We will use the two terms mutually exclusive and disjoint interchangeably.
5 Later on you may notice that, in some scenarios, this interpretation does not make much sense. Still, the “idealized frequency”
More generally,
V. (countable additivity) If A1 , A2 , . . . is a finite or infinite sequence of pairwise disjoint events, then
P(A1 ∪ A2 ∪ · · · ) = P(A1 ) + P(A2 ) + · · ·
Q How can we make sense of these axioms?
A The first three are easy to interpret. To understand the additivity axiom, imagine repeating the experiment
n times, where n is very large. If A and B are disjoint, they cannot occur simultaneously, hence Nn (A∪B) =
Nn (A) + Nn (B). Therefore,
P(A ∪ B) ≈ Nn(A ∪ B)/n = [Nn(A) + Nn(B)]/n = Nn(A)/n + Nn(B)/n ≈ P(A) + P(B) ,
where the approximations become accurate at the limit of n → ∞. The countable additivity axiom is a
natural extension of the additivity axiom.
Usual steps in reasoning about probabilities. In order to mathematically reason about a random experiment,
we often take the following steps:
1 Identify the sample space.
2 Assign probabilities to those events for which the probabilities are self-evident.
2.5 Make sure that the assigned probabilities are consistent with one another and that they determine the probabilities of all other events.
3 Use mathematical reasoning (based on the axioms) to compute the probabilities of the events we are interested in.
4 Translate the results back into statements about the original random experiment.
The first two steps concern building a mathematical model for the experiment, step 3 involves solving purely
mathematical problems, and step 4 is about translating back the solutions of the mathematical problems to
see what they tell us about the original random experiment. Step 2.5 is handled in more advanced courses in
probability theory or measure theory. Recognizing the distinction between these steps often helps us see things
more clearly.
The following is an example in which assigning self-evident probabilities and verifying the consistency of the
model is far from trivial.
Example 3.2.1 (Drawing a number from an interval at random). Consider the experiment of picking a number
from the interval [0, 1] completely at random. For instance, the experiment could involve spinning a wheel
(similar to that in a game of roulette) as in Figure 3.4. The picked number would then be the angle between
the reference mark and the arrow divided by 2π.
Figure 3.4: A spinning wheel device for picking a number between 0 and 1 at random. The device consists of
a wheel (blue circle) which can rotate freely relative to a fixed frame (black circle). The wheel has an arrow
drawn on it, which can be positioned at any angle θ relative to a reference mark on the frame. In order to pick
a random number, we spin the wheel. The picked number is θ/(2π) once the wheel stops.
The sample space is the set of all real numbers between 0 and 1, that is,
Ω := [0, 1] .
As in Example 3.1.5, this experiment has infinitely many possible outcomes. However, unlike that example,
here the sample space is uncountable.6
6 That is, the set is so large that its elements cannot be enumerated in a sequence. If you have not seen this before, you can check out
Q What is the appropriate measure of probabilities?
A Identifying the measure of probabilities for this experiment is more complicated than in the previous exam-
ples, because there are way too many events. However, the probabilities of some events are self-evident.
Consider the event that the picked number falls within an interval [a, b], where 0 ≤ a ≤ b ≤ 1. We expect
that, if we repeat the experiment many many times, then the proportion of times in which the picked num-
ber falls within the interval [a, b] will be roughly proportional to the length of the interval. Therefore, the
probability of the event [a, b] must be
P([a, b]) = b − a .
In more advanced courses in probability theory or measure theory, it will be shown that these probabilities
(for all the closed sub-intervals of [0, 1]) are consistent with one another, and that the probability of virtually
any other event can be deduced from the probabilities of the closed sub-intervals of [0, 1].
This is the end of the modeling stage. Let us see how the probabilities of some simple events can be deduced
from the probabilities of intervals.
Q What is the probability that the picked number is either ≥ 2/3 or ≤ 1/3?
A The event of interest is [0, 1/3] ∪ [2/3, 1]. Since the events [0, 1/3] and [2/3, 1] are disjoint (i.e., mutually exclusive), we have
P(the picked number is either ≥ 2/3 or ≤ 1/3) = P([0, 1/3] ∪ [2/3, 1])
  = P([0, 1/3]) + P([2/3, 1]) = 1/3 + 1/3 = 2/3 .
Q What is the probability that the picked number falls in the open interval (a, b)? (Assume: 0 ≤ a ≤ b ≤ 1.)
A  b − a. A one-point set {x} = [x, x] is a closed interval of length 0, hence P({a}) = P({b}) = 0. Since [a, b] is the disjoint union of {a}, (a, b) and {b}, we get
P((a, b)) = P([a, b]) − P({a}) − P({b}) = b − a .
Note that, in the modeling stage, we could have declared the probability of the open intervals (a, b) as
self-evident, too. However, that would have been redundant because, as the above computation shows, the
probability of open intervals can be deduced from the probability of closed intervals.
Q What is the probability that the picked number is a rational number?
A
P(the picked number is rational) = P(Q ∩ [0, 1]) = 0 .
To see why, recall that the set of rational numbers is countable. Let q1 , q2 , q3 , . . . be an enumeration of all
the rational numbers in [0, 1]. As we saw above, for each k,
P(the picked number is qk) = P([qk, qk]) = 0 .
Hence, by the countable additivity of the measure of probabilities,
P(Q ∩ [0, 1]) = P({q1}) + P({q2}) + P({q3}) + ··· = 0 + 0 + 0 + ··· = 0 . #
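A simulation can again serve as a sanity check of the modeling choice P([a, b]) = b − a. The following illustrative Python sketch (interval_frequency is just a placeholder name) draws many uniform numbers from [0, 1] and reports the fraction that land in [a, b].

import random

def interval_frequency(a, b, n=1_000_000, seed=2):
    """Fraction of n uniform draws from [0, 1] that land in [a, b]."""
    rng = random.Random(seed)
    hits = sum(a <= rng.random() <= b for _ in range(n))
    return hits / n

print(interval_frequency(1/3, 2/3))   # close to 1/3
print(interval_frequency(0.0, 0.25))  # close to 0.25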
Some basic facts. The following facts about probability models have rather intuitive interpretations. Never-
theless, they can be deduced directly from the axioms of consistency via logical reasoning alone.
▶ (complement) For every event A ⊆ Ω, P(Ac ) = 1 − P(A) (see Figure 3.5).
Interpretation:
• Ac is the event that A does not happen.
▶ (inclusion-exclusion principle) For every two events A, B ⊆ Ω (not necessarily disjoint),
P(A ∪ B) = P(A) + P(B) − P(A ∩ B) .
▶ If A, B ⊆ Ω are two events such that A ⊆ B, then P(A) ≤ P(B) (see Figure 3.6).
Interpretation:
• A ⊆ B means that whenever A happens, B also happens.
Figure 3.5: An event A and its complement Ac. The event Ac happens if and only if A does not happen.
Figure 3.6: Two events with A ⊆ B. Whenever A happens, B also happens.
Figure 3.7: Two events A and B in the sample space Ω, shown in two panels (a) and (b).
           female   male
 CS          10       8
 other        7       8

Table 3.1: The numbers of students in the class, broken down by gender and major (Example 3.3.1)
Example 3.3.1 (Students in a class). Suppose there are 33 students in a class, out of which 17 are female and
16 male. Among the female students, 10 are computer science majors and 7 have other majors, whereas among
the male students, 8 are computer science majors and 8 have other majors. This information is summarized in
Table 3.1.
During the office hours, the instructor hears a knock on the door of his/her office and knows that it must be
a student from this class.
Q What is the chance that the student is a computer science major?
A The sample space, Ω, is the set of all 33 students in the class. Since there is no reason to believe otherwise,
let us assume that all the students are equally likely to show up during the office hours. So the measure of
probabilities, P, assigns probability 1/|Ω| = 1/33 to each student. The event of interest, E, is the set of all
18 computer science majors in the class. Thus,
P(E) = |E|/|Ω| = 18/33 .
Suppose now that the instructor opens the door and sees that the student is male.
Q After learning this additional information, what is the chance that the student is a computer science major?
A The information reduces the set of possibilities to the set of all 16 male students, which we call B. The
revised probability is
P(E | B) = |E ∩ B|/|B| = 8/16 = 1/2 .
Example 3.3.2 (Flipping a coin twice). Consider again the random experiment of flipping a coin with parameter p twice, as in Example 3.1.4. Suppose someone performs this experiment in secret, and without revealing the exact outcome of the experiment, tells us that the coin has shown the same side in both flips.
Q Knowing this partial information about the outcome, what is the probability that the coin has come up
heads in the first flip?
A Let E := {HH, HT} denote the event that the coin comes up heads in the first flip, and let C := {HH, TT} denote the event that the coin shows the same side in both flips. We are told that the event C
has happened. How should we revise the probability of E once we learn that C has happened?
Our guide is again thinking of probabilities as “idealized frequencies”. The revised probability of E know-
ing C should approximately be the frequency of the occurrence of E among those times in which C has
happened, if we repeat the experiment many many times. More specifically, imagine repeating the exper-
iment n times, where n is very large. Here is an example of a sequence of outcomes for these repeated
experiments:
E E E
H T T H T H H T H H ···
H H H H T T T T H T
C C C C C
As before, let Nn (C) denote the number of times in which C has occurred. Then, the revised probability is
P(E | C) ≈ Nn(E ∩ C)/Nn(C) = [Nn(E ∩ C)/n] / [Nn(C)/n] ≈ P(E ∩ C)/P(C) .
In our example, E ∩ C = {HH} and C = {HH, TT}, hence
P(E | C) = P(E ∩ C)/P(C) = p^2 / [p^2 + (1 − p)^2] .
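The formula can also be checked by simulation. Here is an illustrative Python sketch (conditional_estimate is a placeholder name): among many pairs of flips, it keeps only those in which both flips show the same side and computes the fraction of those in which the first flip is a head.

import random

def conditional_estimate(p, trials=500_000, seed=3):
    """Estimate P(first flip is H | both flips show the same side)."""
    rng = random.Random(seed)
    same = heads_given_same = 0
    for _ in range(trials):
        first = rng.random() < p
        second = rng.random() < p
        if first == second:            # the condition C: both flips show the same side
            same += 1
            heads_given_same += first  # the event E: the first flip is heads
    return heads_given_same / same

p = 0.3
print(conditional_estimate(p))
print(p**2 / (p**2 + (1 - p)**2))  # exact value from the model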
Terminology. In general, if E, C ⊆ Ω are two events in a probability model, then the conditional probability of
E given C is the quantity
P(E | C) := P(E ∩ C)/P(C) .
This is how we should revise the probability of E once we learn that C has happened. Note that the conditional
probability P(E | C) makes sense only if P(C) > 0.
Some basic facts about conditional probabilities. Conditional probabilities satisfy properties analogous to
unconditional probabilities. The following facts all have intuitive interpretations, and can be logically deduced
from the definition of conditional probabilities and the basic axioms and facts about the measure of probabilities.
Let C ⊆ Ω be an event with P(C) > 0.
▶ For each event A ⊆ Ω, 0 ≤ P(A | C) ≤ 1.
▶ P(Ω | C) = 1. (In fact, P(C | C) = 1.)
▶ P(∅ | C) = 0. (In fact, P(A | C) = 0 whenever A ∩ C = ∅.)
▶ (additivity) If A and B are disjoint events, then P(A ∪ B | C) = P(A | C) + P(B | C).
▶ (countable additivity) If A1 , A2 , . . . is a finite or infinite sequence of pairwise disjoint events, then
P(A1 ∪ A2 ∪ ··· | C) = P(A1 | C) + P(A2 | C) + ···
Example 3.3.3 (Students in another class). Consider now a different class, in which the numbers of students, broken down by gender and major, are as in Table 3.2.

           female   male
 CS           3       4
 other        6       8

Table 3.2: The numbers of students in the second class, broken down by gender and major (Example 3.3.3)
Q What is the chance that a random student from this class is a computer science major?
A The sample space Ω is the set of all 21 students in the class, each of whom is equally likely. The event of interest, E, is the set of all 7 computer science majors, hence P(E) = |E|/|Ω| = 7/21 = 1/3.
Q Suppose we are further told that the chosen student is male. Knowing this, what is the chance that he is a computer science major?
A The information reduces the set of possibilities to the set B of all 12 male students. The revised probability is
P(E | B) = |E ∩ B|/|B| = 4/12 = 1/3 .
Note that the two probabilities with or without the partial information turned out to be the same. This is not
surprising: the ratio of CS to other is the same among both male and female students. In other words, whether
a randomly chosen student from this class is a CS major or not is statistically independent of whether he/she is
male or female. #
Example 3.3.4 (Flipping a coin twice). Let us get back to the random experiment of flipping a coin with
parameter p twice. The sample space for the experiment is Ω := {HH, HT, TH, TT}, with the measure of probabilities Pp as in Example 3.1.4. Consider the two events
• A := {HH, HT} (in the 1st flip, the coin comes up heads),
• B := {HH, TH} (in the 2nd flip, the coin comes up heads).
Intuitively, we know that the two events A and B are independent of each other: the result of the first flip does
not in any way affect the result of the second flip. In the language of our model, this independence is reflected
in the identity P(B | A) = P(B), which can be directly verified:
P(B | A) = P(A ∩ B)/P(A) = p^2/p = p = P(B) .
Suggestion. Make sure you follow the 2nd and the 4th equalities. What are the values of P(A ∩ B), P(A)
and P(B)? #
In general, the idea that two events A, B ⊆ Ω in a probability model are independent can be conveniently
expressed by the identity
P(A | B) = P(A) . (⊥1)
Although this identity appears asymmetric, the condition of independence is symmetric. Namely, since P(A|B) =
P(A ∩ B)/P(B), the condition (⊥1) is (essentially) equivalent to the condition
P(A ∩ B) = P(A) P(B) . (⊥2)
The latter condition is however superior: in addition to treating A and B symmetrically, condition (⊥2 ) makes
sense even if P(B) = 0. Note that when P(B) = 0, the conditional probability P(A | B) is meaningless. For this
reason, we adopt (⊥2 ) as the definition of statistical independence.
This means that whether A has happened or not does not provide us with any information about the occurrence
of B, and vice versa.
Some basic facts about independence. The following facts all have intuitive interpretations, and can be
logically deduced from the definition of conditional probabilities and the basic axioms and facts about the
measure of probabilities. Let A, B ⊆ Ω be two events.
▶ If P(A | B) = P(A) then A and B are independent. The two conditions are equivalent if P(B) > 0.
▶ If A and B are independent, then so are any of the pairs (Ac , B), (A, B c ) and (Ac , B c ). (In words, saying
that A and B are independent is the same thing as saying that Ac and B are independent, and so on.)
Let us conclude this section with a more interesting example regarding conditional probabilities.
Example 3.3.5 (Flipping a coin 3 times). Consider the experiment of flipping a coin 3 times in a row. As usual,
we let p be the bias parameter of the coin, indicating the chance that, in one flip, the coin comes up heads. The
sample space for this experiment is
Ω := {H, T} × {H, T} × {H, T} .7
Suppose someone performs this experiment in secret, and tells us that, in these three flips, the coin has come
up tails twice and heads once, but does not tell us in which order.
Q Conditioned on this partial information, what is the probability that, in the first flip, the coin has come up
heads?
7 Here × denotes the Cartesian product of sets. Recall that the Cartesian product of two sets A and B, denoted by A × B, is the set of all ordered pairs (a, b) in which a ∈ A and b ∈ B.
A The event of interest (“the 1st flip is a head”) is
E := {HHH, HHT, HTH, HTT} ,
while the condition (“two tails and one head”) can be described by the event
C := {HTT, THT, TTH} .
We have
P(E | C) = P(E ∩ C)/P(C) = p(1 − p)^2 / [p(1 − p)^2 + (1 − p)p(1 − p) + (1 − p)^2 p] = p(1 − p)^2 / [3p(1 − p)^2] = 1/3 .
At first, it might come as a surprise that the answer does not depend on the parameter p. However, upon a
second inspection, this makes sense: the condition C is realized by three possible outcomes, all of which are
equally likely. Therefore, it is natural that, with the knowledge that C has happened, the chance of each of the
three possible outcomes is 1/3. #
A 3 × 5 × 7 = 105. #
In general:
Fact. If we have two collections of objects A (with m elements) and B (with n elements), then the number of
pairs (a, b) where a is an object from A and b is an object from B is m × n.10
Example 3.4.2 (Freshmen and mentors). There are 10 freshmen students just entering the university. The university wants to assign 10 volunteering seniors as mentors for these freshmen, in such a way that each freshman gets a mentor and each senior mentors only one freshman.
Q In how many ways can this assignment be done?
A Imagine assigning mentors to the freshmen one after another. There are 10 options for the first freshman. Once a mentor is assigned to the first freshman, there remain 9 options for the second freshman. Once a mentor is assigned to the second freshman, there remain 8 options for the third freshman, and so on. Thus, in total, there are 10 × 9 × 8 × ··· × 1 = 10! ways we can assign mentors to the freshmen. #
In general:
Fact. The number of ways one can match n distinguishable objects of category 1 with n distinguishable objects
of category 2 in a one-to-one fashion is n!.
Example 3.4.3 (Sitting around a circular table). In a meeting of the Arab League, the representatives of the
22 member states sit around a circular table. Suppose we are tasked to assign the 22 seats around the table to
these 22 representatives.
Q In how many possible ways can we assign the seats if we only care about the relative position of the
representatives (in particular, who sits next to whom) and not the actual seats they take?
A1 There are 22! ways to assign the 22 seats to the 22 representatives. Two assignments that differ only by a rotation of the table give the same relative positions, and each relative arrangement arises from exactly 22 such rotations. Hence, the number of possibilities is 22!/22 = 21!.
Figure 3.8: Three possible assignments of seats to the 22 representatives (see Example 3.4.3). The repre-
sentatives are numbered from 1 to 22 in a fixed way (say, Algeria is 1, Bahrain is 2, and so on). The two
assignments (a) and (b) are equivalent (because the relative positions in both are the same) and should be
counted as one. Assignment (c) is however different from the other two.
A2 Pick one of the representatives arbitrarily, say the representative for Lebanon. Note that the exact position
of the Lebanese representative is irrelevant, so we can ask him/her to sit in a seat of his/her choosing.
Once the seat of the Lebanese representative is fixed, there remain 21 seats which need to be assigned to
the remaining 21 representatives. This can be done in 21! different ways. #
A The sample is without replacement, because one student cannot be chosen for two different committee roles.
The sample is also ordered, because the committee members all have distinct roles. #
In this subsection, we are concerned with counting the number of possible ways we can draw a sample of
size k from a jar with n distinguishable balls, in each of these four cases. Let us consider the four cases one by
one. Use Figure 3.9 as a guiding example.
Q What is the number of ways we can draw an ordered sample of size k with replacement from a jar with n
distinguishable balls?
8 There is an entire branch of mathematics known as combinatorics (more specifically, combinatorial enumeration) dedicated to counting.
9 This example is taken from the book Elementary Probability for Applications by Rick Durrett.
10 In concise mathematical notation: |A × B| = |A| × |B|.
Figure 3.9: Examples of the four types of samples of size k = 4 drawn from a jar with n = 7 distinguishable balls (ordered or unordered, with or without replacement).
A n^k. Imagine drawing the k balls one after another, replacing each ball back in the jar after it is drawn. There are n choices for the first ball, n choices for the second ball and so on. Overall, there are n × n × ··· × n (k times) = n^k possibilities.
Q What is the number of ways we can draw an ordered sample of size k without replacement from a jar with
n distinguishable balls?
A n!/(n − k)!. Again, imagine drawing the k balls one after another, but this time keeping the drawn balls
outside the jar. There are n choices for the first ball, n − 1 choices for the second ball (because one ball is
already out), n−2 choices for the third ball, and so on. Hence, there are n×(n−1)×(n−2)×· · ·×(n−k+1) =
n!/(n − k)! possibilities in total.
Q What is the number of ways we can draw an unordered sample of size k without replacement from a jar
with n distinguishable balls?
A \binom{n}{k}. Since we are considering samples without replacement, the drawn balls have to be distinct. Since we do not care about the order, we can draw the k balls at the same time. Thus, the sample is simply a subset of the balls in the jar which has k elements. How many k-element subsets does an n-element set have? That is precisely what \binom{n}{k} stands for.
Q What is the number of ways we can draw an unordered sample of size k with replacement from a jar with n distinguishable balls?
A Consider one such sample. Since the sample is drawn with replacement, the same ball can be drawn more
than once. Since we do not care about the order, in order to specify the sample, we only need to tell
how many times each of the balls was chosen. For instance, in the sample in Figure 3.9 (top-right), ball
number 1 is chosen once, ball number 3 is chosen twice and ball number 5 is chosen once. The other balls
are each chosen zero times. Let us represent this with the following string of characters | and ⋆:
⋆ | | ⋆⋆ | | ⋆ | |
Q Can you guess how this string corresponds to the above sample?
A The four ⋆’s represent the drawn balls. The six | ’s distinguish between the balls in the following
manner:
• The star before the first | represents the one time in which ball number 1 was drawn.
• There is no star between the first and the second | ’s indicating that ball number 2 was not drawn.
• The two stars between the second and the third | 's represent the two times in which ball number 3 was drawn.
• ...
• There is no star after the sixth | indicating that ball number 7 was not drawn.
In the same fashion, any unordered sample of k balls with replacement from a jar with n distinguishable
balls can be represented using a string consisting of k copies of ⋆ and n − 1 copies of | :
• The number of stars before the first | indicates the number of times ball number 1 was drawn.
• The number of stars between the first and the second | indicates the number of times ball number 2
was drawn.
• ...
• The number of stars after the last | indicates the number of times ball number n was drawn.
This representation is faithful: each sample is represented by one and only one such string, and each such
string corresponds to one and only one sample. Therefore, the number of samples of size k from a jar with
n distinguishable balls is the same as the number of strings consisting of k copies of character ⋆ and n − 1
copies of character | .
Q How many such strings are there?
A \binom{n−1+k}{k}. Each such string consists of n − 1 + k characters, k of which are ⋆'s and the remaining are | 's. In order to identify one such string, it is enough to identify the positions of the k ⋆'s. There are \binom{n−1+k}{k} choices for the positions of the ⋆'s. Alternatively, we could identify each string with the positions of its n − 1 copies of | . There are \binom{n−1+k}{n−1} choices for the positions of the | 's. However, note that \binom{n−1+k}{k} = \binom{n−1+k}{n−1}.
                        ordered             unordered
 with replacement       n^k                 \binom{n−1+k}{k}
 without replacement    n!/(n − k)!         \binom{n}{k}

Table 3.3: Number of ways to draw a sample of k balls from a jar with n distinguishable balls
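For concrete values of n and k, the four counts in Table 3.3 can be computed directly, for instance with Python's math module, as in the illustrative sketch below (sample_counts is a placeholder name).

from math import comb, perm

def sample_counts(n, k):
    """The four counting formulas of Table 3.3 for a jar with n balls and samples of size k."""
    return {
        "ordered, with replacement": n ** k,
        "ordered, without replacement": perm(n, k),      # n!/(n-k)!
        "unordered, without replacement": comb(n, k),    # n choose k
        "unordered, with replacement": comb(n - 1 + k, k),
    }

print(sample_counts(7, 4))  # e.g., n = 7 balls, samples of size k = 4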
A closely related family of counting problems concerns allocating k tokens into n distinguishable boxes. Two questions determine the type of allocation: Are the tokens distinguishable from one another or not? Is there exclusion (i.e., can each box hold at most one token) or not? Depending on the answers to these two questions, we have four different variants of the problem. Figure 3.10 contains examples of these four types of allocations in the case n = 7 and k = 4.
There is a perfect analogy between the act of allocating k tokens to n distinguishable boxes and the act of drawing a sample of k balls from a jar with n distinguishable balls.
Q Can you see this analogy? (Hint: Compare Figures 3.9 and 3.10.)
A The boxes correspond to the balls in the jar. The tokens correspond to the balls drawn from the jar.
With similar reasonings as in the case of drawing a sample of balls from a jar (or simply using the above analogy), we can count the number of ways one can allocate k tokens into n distinguishable boxes in each of the four different scenarios. Table 3.4 summarizes the answers to these counting problems.
Figure 3.10: Examples of the four types of allocations of k = 4 tokens into n = 7 distinguishable boxes (tokens distinguishable or indistinguishable, with or without exclusion).
                        distinguishable tokens     indistinguishable tokens
 without exclusion      n^k                        \binom{n−1+k}{k}
 with exclusion         n!/(n − k)!                \binom{n}{k}

Table 3.4: Number of ways to allocate k tokens into n distinguishable boxes
Example 3.4.5 (Flipping a coin n times). Consider the experiment of flipping a coin with bias parameter p a total of n times in a row.
Q What is the probability that exactly k out of the n flips show heads (and the remaining n − k show tails)?
A It might be easier to start with a concrete example, say n = 5 and k = 2. In this case, the sample space is
Ω := {H, T}^5 .
Let E denote the event that exactly 2 out of 5 flips show heads and 3 show tails. This event is realized
precisely when one of the following 10 outcomes occurs:
HHTTT, HTHTT, HTTHT, HTTTH, THHTT, THTHT, THTTH, TTHHT, TTHTH, TTTHH.
11 The notation {H, T}^5 refers to the Cartesian product of {H, T} five times with itself, that is, {H, T} × {H, T} × {H, T} × {H, T} × {H, T}.
Let us now consider the general scenario (n and k fixed but arbitrary). The sample space is
Ω := {H, T}^n ,
and the appropriate measure of probabilities assigns to each outcome ω ∈ Ω the probability
P(ω) := p^{#H(ω)} (1 − p)^{#T(ω)} ,
where #H(ω) denotes the number of H's, and #T(ω) denotes the number of T's shown in ω.
Let E denote the event that exactly k out of n flips show heads and n−k show tails. Note that the probability
of each individual outcome that realizes E is p^k (1 − p)^{n−k}. Therefore,
P(E) = |E| · p^k (1 − p)^{n−k} .
Thus in order to find the probability of E, we need to count the number of individual outcomes in E.
Q What is the number of outcomes in E?
A \binom{n}{k}. Each outcome in E can be represented by a sequence consisting of k H's and n − k T's. Each such sequence is uniquely identified by the positions of the k H's in the sequence. The number of ways we can choose k positions out of n positions is \binom{n}{k}.
We conclude that
P(E) = \binom{n}{k} p^k (1 − p)^{n−k} .
#
Exercise. In the above example, the number of heads can be any integer between 0 and n. These possibilities are
mutually exclusive and together exhaust the entire sample space. Therefore, by the additivity of the measure of
probabilities, their probabilities must add up to 1. Based on the above computation, we find that
\sum_{k=0}^{n} \binom{n}{k} p^k (1 − p)^{n−k} = 1 .
Verify the latter identity directly based on the properties of the binomial coefficients. (Hint: Recall the Binomial
Theorem.)
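The probabilities in Example 3.4.5, and the identity in the exercise above, are also easy to check numerically. The following illustrative Python sketch (binom_pmf is a placeholder name) computes P(exactly k heads in n flips) and verifies that the probabilities sum to 1.

from math import comb

def binom_pmf(n, k, p):
    """P(exactly k heads in n flips of a coin with bias p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 5, 0.3
print(binom_pmf(n, 2, p))                              # P(exactly 2 heads in 5 flips)
print(sum(binom_pmf(n, k, p) for k in range(n + 1)))   # equals 1.0 (up to rounding)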
Example 3.4.6 (blue balls, red balls; with replacement). Suppose that we have a jar with N balls, K of which
are blue, and the remaining N − K are red (see Figure 3.11 for the case N = 7 and K = 4). At random, we
draw an ordered sample of n balls with replacement from the jar.
Q What is the probability of getting k blue balls and n − k red balls?12
A1 Let us start by building a model for this experiment. Since we only care about the color of the balls, we
can represent each possible outcome by the sequence of colors we see. For instance, if n = 5, one possible
outcome would be
BRBBR ,
indicating that the first drawn ball is blue, the second is red, the third and the fourth are blue, and the
fifth is red. The sample space can thus be expressed as13
Ω := {B, R}^n .
Observe that each time we draw a ball, the chance of getting a blue ball is K/N and the chance of getting
a red ball is (N −K)/N . Furthermore, since we replace each ball before drawing the next balls, the color of
the ball in each draw does not in any way affect the color of the following balls. Thus, for each outcome
ω ∈ Ω, the measure of probability should be
P(ω) := (K/N)^{#B(ω)} ((N − K)/N)^{#R(ω)} ,
12 Let us emphasize that, in general, mathematical notation is case-sensitive: N and n do not refer to the same thing.
where #B(ω) denotes the number of blue balls drawn, and #R(ω) denotes the number of red balls drawn
in ω. This completes the description of the model.
Let E denote the event that we draw exactly k blue balls and n − k red balls. Note that, according to our
model, each individual outcome in E has probability (K/N)^k ((N − K)/N)^{n−k}. Hence,
P(E) = |E| · (K/N)^k ((N − K)/N)^{n−k} .
As in Example 3.4.5, the number of outcomes in E is |E| = \binom{n}{k}. We conclude that
P(E) = \binom{n}{k} (K/N)^k ((N − K)/N)^{n−k} .
A2 Observe that there is an analogy between this example and the example of flipping a coin n times in
a row (Example 3.4.5). Drawing a ball from the jar can be likened to flipping a coin. Drawing a blue
ball corresponds to getting a head, and drawing a red ball corresponds to getting a tail. In order for the
analogy to be perfect, the parameter of the coin has to be p := K/N . Now, the event of drawing k blue
balls and n − k red balls corresponds to the event of getting k heads and n − k tails. Thus, based on
Example 3.4.5, we immediately see that
P(E) = \binom{n}{k} p^k (1 − p)^{n−k} = \binom{n}{k} (K/N)^k ((N − K)/N)^{n−k} .
#
Example 3.4.7 (blue balls, red balls; without replacement). This is a variant of the previous example. Again,
we have a jar with N balls, K of which are blue, and the remaining N − K are red (Figure 3.11). This time, we
draw a random unordered sample of n balls without replacement from the jar.
Q What is the probability of getting k blue balls and n − k red balls? (Assume that k ≤ K and n − k ≤ N − K
to make sure there are enough balls of each color.)
A Since the sample is unordered and drawn without replacement, each possible outcome is simply an n-element subset of the N balls in the jar, and all of these subsets are equally likely. Let E denote the event that the sample contains exactly k blue balls and n − k red balls.
Since all outcomes are equally likely, we have
P(E) = |E| / |Ω| .
It remains to count the number of outcomes in Ω and in E.
Q What is the total number of outcomes in Ω?
A \binom{N}{n}. That is precisely the number of n-element subsets of an N-element set.
Q What is the number of outcomes in E?
A \binom{K}{k} \binom{N−K}{n−k}. To form an outcome in E, we choose k of the K blue balls and, separately, n − k of the N − K red balls.
We conclude that
P(E) = \binom{K}{k} \binom{N−K}{n−k} / \binom{N}{n} .
#
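The formulas of Examples 3.4.6 and 3.4.7 are easy to evaluate numerically, which also makes it easy to compare sampling with and without replacement. Here is an illustrative Python sketch (both function names are placeholders).

from math import comb

def with_replacement(N, K, n, k):
    """P(k blue balls out of n drawn) when sampling with replacement (Example 3.4.6)."""
    p = K / N
    return comb(n, k) * p**k * (1 - p)**(n - k)

def without_replacement(N, K, n, k):
    """P(k blue balls out of n drawn) when sampling without replacement (Example 3.4.7)."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

N, K, n, k = 7, 4, 3, 2
print(with_replacement(N, K, n, k))
print(without_replacement(N, K, n, k))  # different in general; close when N is large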
Example 3.4.8 (Flipping a coin until k heads). Let us again perform an experiment with a coin. As usual, the
bias parameter of the coin is denoted by p. We repeat flipping the coin until we get k heads. For instance, if
k = 3, a few possible outcomes of the experiment would be
HTTHH
HHH
TTTHTHTH
In this case, each possible outcome can be described by a string of characters H and T containing exactly 3 H’s
and with the requirement that the last character is an H.
Q What is the probability that the k’th head comes up at the n’th flip?
For instance, if k = 3 and n = 7, the event we are interested in has happened if the outcome is
TTHTHTH
but has not happened if the outcome is
HTTHH.
In general, each possible outcome ω is a finite string of H's and T's containing exactly k H's and ending with an H, and the appropriate measure of probabilities assigns to ω the probability
P(ω) := p^{#H(ω)} (1 − p)^{#T(ω)} ,
where #H(ω) denotes the number of H's, and #T(ω) denotes the number of T's in ω.
We are interested in the probability of the event
◦ (event of interest) E: set of all sequences in Ω which have length n.
Note that each individual outcome in E has probability p^k (1 − p)^{n−k} (k heads and n − k tails). Therefore,
P(E) = |E| · p^k (1 − p)^{n−k} .
Q What is the number of outcomes in E?
A \binom{n−1}{k−1}. Observe that an outcome is in E if and only if
• The n’th flip shows a H,
• Among the first n − 1 flips, there are k − 1 H’s.
The number of ways we can choose k − 1 positions out of the first n − 1 positions is \binom{n−1}{k−1}.
We conclude that
P(E) = \binom{n−1}{k−1} p^k (1 − p)^{n−k} .
Curiosity Exercise. In the above example, the k’th head will come at one of the times n = k, or n = k + 1,
or n = k + 2, or . . . . These possibilities are mutually exclusive and together exhaust the entire sample space.
Therefore, by the countable additivity of the measure of probabilities, their probabilities must add up to 1.
Based on the above computation, we find that
\sum_{n=k}^{∞} \binom{n−1}{k−1} p^k (1 − p)^{n−k} = 1 .
Can you verify the latter identity directly based on the properties of the binomial coefficients? (Hint: Check out
the Negative Binomial Theorem.)
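While the identity can be verified with the Negative Binomial Theorem, a quick numerical check is also reassuring. The illustrative Python sketch below (kth_head_at_n is a placeholder name) sums the probabilities over a large range of n.

from math import comb

def kth_head_at_n(n, k, p):
    """P(the k'th head comes up exactly at the n'th flip)."""
    return comb(n - 1, k - 1) * p**k * (1 - p)**(n - k)

k, p = 3, 0.4
partial = sum(kth_head_at_n(n, k, p) for n in range(k, 200))
print(partial)  # very close to 1 (the tail beyond n = 200 is negligible)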
Recall the definition of conditional probability: for two events A, B ⊆ Ω with P(B) > 0,
P(A | B) := P(A ∩ B)/P(B) .
Multiplying both sides by P(B), we can rewrite this as
P(A ∩ B) = P(B) P(A | B) .
In words: the probability that both A and B happen is the same as the probability that B happens times the
probability that A happens given that B has happened. Recalling the interpretation of probabilities as “idealized
frequencies” in repeated experiments, the latter identity has a simple interpretation.
Q What is the interpretation of the above identity in terms of “idealized frequencies”?
A Suppose we repeat the experiment n times, where n is very large. As usual, let Nn (A ∩ B) denote the
number of times in which A ∩ B happens. Then, the above identity is simply the idealized version of the
following
P(A ∩ B) ≈ Nn(A ∩ B)/n = [Nn(B)/n] · [Nn(A ∩ B)/Nn(B)] ≈ P(B) P(A | B)
in the limit n → ∞. In words, in many many repeated experiments, “the fraction of times in which both A
and B happen” is the product of “the fraction of times in which B happens” and “the fraction of times in
which A happens among those times in which B happens.”
The above identity is known as the chain rule of conditional probabilities. It can be extended to more than two events. For instance, in the case of three events A, B, C (see Figure 3.12), we have
P(A ∩ B ∩ C) = P(C) P(B | C) P(A | B ∩ C) .
Figure 3.12: Three events A, B, C and their complements in the sample space.
Example 3.5.1 (Two boxes). Suppose we have two boxes containing blue and red balls: box #1 contains k1 blue balls and ℓ1 red balls, while box #2 contains k2 blue balls and ℓ2 red balls (see Figure 3.13). We choose one of the two boxes at random (each with chance 1/2) and then draw a ball from the chosen box at random.
Q What is the probability that the drawn ball is blue?
Figure 3.13: Two boxes with blue and red balls in them (see Example 3.5.1)
Before answering the question, let us emphasize that, depending on the values of k1 , ℓ1 , k2 , ℓ2 , the balls may not
be equally likely to be drawn. For instance, if box #1 has only one blue ball, and box #2 has one blue ball and
one red ball (i.e., k1 = 1, ℓ1 = 0, k2 = 1 and ℓ2 = 1), then it is clear that the chance that the blue ball from
box #1 is picked is 1/2, whereas the chance that the blue ball from box #2 is picked is 1/4.
A An intuitive way to find the probability is to consider the tree of possibilities as in Figure 3.14. At the
beginning, each of the two boxes has 1/2 probability to be chosen. If box #1 is chosen, then the chance of
drawing a blue ball is k1/(k1 + ℓ1). If box #2 is chosen, then the chance of drawing a blue ball is k2/(k2 + ℓ2). Therefore,
P(a blue ball is drawn) = (1/2) · k1/(k1 + ℓ1) + (1/2) · k2/(k2 + ℓ2) .
Let us take a moment to contemplate what the above computation amounts to in the language of probability models. This will reveal a general trick that can then be used in other scenarios.
As the sample space, we can choose14
Ω := {1B, 1R, 2B, 2R} .
The event that the ball is blue is E := {1B, 2B}. In order to find the probability of E, we divide the possibilities
based on whether box #1 is chosen or box #2. To be specific, let C := {1B, 1R} be the event that box #1 is
chosen, and note that C c is simply the event that box #2 is chosen. Observe that
E = (E ∩ C) ∪ (E ∩ C c )
Figure 3.14: The tree of possibilities for the experiment in Example 3.5.1
Figure 3.15: The event E split into the two disjoint pieces E ∩ C and E ∩ C c.
(see Figure 3.15). Since E ∩ C and E ∩ C c are disjoint, their probabilities add up to the probability of E, that is,
P(E) = P(E ∩ C) + P(E ∩ C c ) .
Now, using the chain rule, we can write
P(E ∩ C) = P(C) P(E | C) and P(E ∩ C c ) = P(C c ) P(E | C c ) .
From the description of the experiment, P(C) = P(C c) = 1/2, and15
P(E | C) := chance of drawing a blue ball given that box #1 is chosen = k1/(k1 + ℓ1) ,
P(E | C c) := chance of drawing a blue ball given that box #2 is chosen = k2/(k2 + ℓ2) .
Therefore,
P(E) = P(E ∩ C) + P(E ∩ C c )
= P(C) P(E | C) + P(C c ) P(E | C c )
= (1/2) · k1/(k1 + ℓ1) + (1/2) · k2/(k2 + ℓ2) ,
which is the same as what we obtained earlier. #
The idea of breaking down the possibilities based on whether an event C has happened or not can be
generalized as follows.
Principle of total probability. Let A be an event, and suppose that F1 , F2 , F3 , . . . is a finite or countably
infinite collection of events that partition the sample space (see Figure 3.16).16 Then,
P(A) = P(A ∩ F1 ) + P(A ∩ F2 ) + P(A ∩ F3 ) + · · ·
= P(F1 ) P(A | F1 ) + P(F2 ) P(A | F2 ) + P(F3 ) P(A | F3 ) + · · · .
This identity can sometimes help us calculate P(A).
14 Alternatively, we could choose a more refined sample space by keeping track of which blue or red ball from the chosen box is drawn.
15 These four equations implicitly describe the measure of probabilities for our model. They should be understood as part of the modeling
rather than mathematical reasoning.
16 Mathematically, saying that F1, F2, F3, . . . partition Ω means that F1, F2, F3, . . . are disjoint and their union is Ω. The interpretation of this is that the events F1, F2, F3, . . . are mutually exclusive (i.e., no two of them can occur simultaneously) and together exhaust all the possibilities in Ω.
Figure 3.16: The events F1 , F2 , F3 , . . . form a (finite or countably infinite) partition of the sample space.
Example 3.5.2 (Flipping a coin until two heads). We conduct the following experiment using a coin with bias
parameter p: we repeat flipping the coin until we get 2 heads.
Q What is the probability that the second head comes up right after the first head?
Suggestion. Before trying to find the probability systematically, can you make a guess about what the answer
should be?
A The idea is to break down the set of possibilities based on the time in which the first head comes up (see
Figure 3.17). Namely, let Ω denote the sample space. Consider the following events:
◦ E := the 2nd head comes up right after the 1st head ,
◦ C1 := the 1st head comes up in the 1st flip ,
◦ C2 := the 1st head comes up in the 2nd flip ,
◦ C3 := the 1st head comes up in the 3rd flip ,
◦ ...
◦ Cn := the 1st head comes up in the n'th flip ,
◦ ...
We want to find P(E). Note that C1 , C2 , C3 , . . . are mutually exclusive and together exhaust all the possi-
bilities in Ω. In other words, they partition Ω. Therefore,
P(E) = P(C1) P(E | C1) + P(C2) P(E | C2) + P(C3) P(E | C3) + ··· .
Q What is the value of P(E | Cn)?
A Note that P(E | Cn ) is simply the chance of getting a head at the (n + 1)st flip given that the first head
has come up at the n’th flip. Since the results of the first n flips in no way affect the result of the
(n + 1)st flip, we have P(E | Cn ) = p. This is the case for each n = 1, 2, 3, . . ..17
Therefore,
P(E) = p · [P(C1) + P(C2) + P(C3) + ···] = p .
Note that, although we could find the value of P(Cn ) for each n = 1, 2, . . ., we did not need to do so. That
P(C1 ) + P(C2 ) + P(C3 ) + · · · = 1 follows from the fact that C1 , C2 , C3 , . . . partition Ω, and hence their
probabilities must add up to 1.
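The answer P(E) = p can also be checked by simulation. The following illustrative Python sketch (second_head_right_after_first is a placeholder name) flips a coin until the first head and then records whether the very next flip is also a head.

import random

def second_head_right_after_first(p, trials=300_000, seed=4):
    """Estimate P(the 2nd head comes up right after the 1st head)."""
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        # flip until the first head, then look at the very next flip
        while rng.random() >= p:
            pass
        count += rng.random() < p
    return count / trials

p = 0.35
print(second_head_right_after_first(p), p)  # the two numbers should be close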
Figure 3.17: The tree of possibilities for the experiment in Example 3.5.2. The blue nodes indicate the possibil-
ities in which E happens. The red crosses indicate the branches of possibilities in which E does not happen.
Example 3.5.3 (Mammogram test18). In a certain population, about 1% of the women aged 40–50 have breast cancer. A mammogram test comes out positive for about 92% of the women who have breast cancer, but it also comes out (falsely) positive for about 10% of the women who do not have breast cancer. This information is summarized in Table 3.5.

                      positive test    negative test
 breast cancer            92%               8%
 no breast cancer         10%              90%

Table 3.5: Chances of positive and negative tests for women with or without breast cancer (Example 3.5.3)
Suppose that a woman aged 40–50 has received a positive result on her mammogram test.
Q What is the chance that she has breast cancer?
A Consider the following two events regarding a random woman aged 40–50 who undergoes the mammo-
gram test:
◦ A := the test is positive ,
◦ B := the woman has breast cancer .
In terms of these events, the given information says that
P(B) = 0.01 ,   P(A | B) = 0.92 ,   P(Ac | B) = 0.08 ,
P(A | Bc) = 0.10 ,   P(Ac | Bc) = 0.90 .
18 This example is taken from the book Elementary Probability for Applications by Rick Durrett.
The conditional probability P(B | A) can be found as follows:
P(B | A) = P(A ∩ B) / P(A)   (definition)
 = P(A ∩ B) / [P(A ∩ B) + P(A ∩ Bc)]   (principle of total probability)
 = P(B) P(A | B) / [P(B) P(A | B) + P(Bc) P(A | Bc)]   (chain rule)
 = (0.01 × 0.92) / (0.01 × 0.92 + 0.99 × 0.10)
 = 92/1082 ≈ 8.50% .
It might come as a surprise that the above chance is so small (less than 10%). The intuitive reason behind this
is that there are many more healthy women with positive tests than there are women with cancer and positive
tests. In particular, on average
• only 100 out of 10000 women have breast cancer, of which 92 receive positive test results, whereas
• 9900 out of 10000 women do not have breast cancer, of which still 990 receive positive test results.
Thus, among women who receive positive test results, only a fraction of 92/(990 + 92) ≈ 8.50% actually do have
cancer. #
The trick we used in the above example to calculate P(B | A) knowing the conditional probabilities of the
opposite type is known as Bayes' rule.19 For two events E, F ⊆ Ω (with P(E), P(F) > 0), Bayes' rule says that
P(F | E) = P(F) P(E | F) / P(E) .
This is a useful trick for finding P(F | E) when what is available to us are the opposite conditional probabilities P(E | F)
and P(E |F c ). The denominator P(E) can potentially be found using the principle of total probability, by writing
it as P(F ) P(E | F ) + P(F c ) P(E | F c ).
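Bayes' rule is straightforward to turn into a small computation. The illustrative Python sketch below (bayes is a placeholder name) combines the chain rule and the principle of total probability, and reproduces the numbers of Example 3.5.3.

def bayes(prior, like_given_F, like_given_not_F):
    """Return P(F | E) from P(F), P(E | F) and P(E | F^c)."""
    evidence = prior * like_given_F + (1 - prior) * like_given_not_F  # P(E) via total probability
    return prior * like_given_F / evidence

# P(B) = 0.01, P(A | B) = 0.92, P(A | B^c) = 0.10
print(bayes(0.01, 0.92, 0.10))  # about 0.085, i.e. roughly 8.5%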
Example 3.5.4 (Three factories20 ). The computer chips used by an engineering company are made by three
factories: 10% by factory #1, 30% by factory #2, and 60% by factory #3. The chips are occasionally defective:
8% of the chips made by factory #1, 1% of those made by factory #2, and 2% of those made by factory #3 are
defective. This information is summarized in Table 3.6.
                share of the chips    chance of being defective
 factory #1            10%                      8%
 factory #2            30%                      1%
 factory #3            60%                      2%

Table 3.6: The proportions of chips made by the three factories and their rates of defects (Example 3.5.4)

Suppose that a chip from the company turns out to be defective.
Q What is the chance that the chip was made by factory #1?
A Consider the following events regarding a random chip from the company:
◦ D := the chip is defective ,
◦ F1 := the chip is made in factory #1 ,
◦ F2 := the chip is made in factory #2 ,
◦ F3 := the chip is made in factory #3 .
We are looking for P(F1 | D), while the information summarized in Table 3.6 contains the values of P(F1 ),
P(F2 ), P(F3 ), P(D | F1 ), P(D | F2 ) and P(D | F3 ).
Note that
P(F1 | D) = P(F1 ∩ D) / P(D) = P(F1 ∩ D) / [P(F1 ∩ D) + P(F2 ∩ D) + P(F3 ∩ D)] .
The three terms involved can be calculated using the chain rule as follows:
P(F1 ∩ D) = P(F1) P(D | F1) = 0.10 × 0.08 = 8/1000 ,
P(F2 ∩ D) = P(F2) P(D | F2) = 0.30 × 0.01 = 3/1000 ,
P(F3 ∩ D) = P(F3) P(D | F3) = 0.60 × 0.02 = 12/1000 .
Therefore,
P(F1 | D) = 8/(8 + 3 + 12) = 8/23 ≈ 34.8% .
Similarly,
P(F2 | D) = 3/(8 + 3 + 12) = 3/23 ≈ 13.0% ,
P(F3 | D) = 12/(8 + 3 + 12) = 12/23 ≈ 52.2% .
#
When 0 < P(A) < 1, the independence of A and B can be equivalently expressed by the identity
P(B | A) = P(B | Ac) . (⊥′ : 2)
Exercise. Assuming 0 < P(A) < 1, verify that the two conditions (⊥ : 2) and (⊥′ : 2) are equivalent.
In this section, we want to extend the concept of independence to more than two events.
Example 3.6.1 (Flipping a coin three times). Consider the experiment of flipping a coin three times in a row.
Intuitively, we know that the results of the three flips are independent of one another.
Q How can we express this in the language of probabilities?
A Let A, B and C denote the events that the coin comes up heads in the 1st, the 2nd and the 3rd flip, respectively. The intuition can be expressed by the following two conditions:
(i) The result of the 2nd flip is not affected by whether A has happened or not, that is,
P(B | A) = P(B | Ac ) .
(ii) The result of the 3rd flip is not affected by whether either of A and B has happened or not, that is,
P(C | A ∩ B) = P(C | A ∩ Bc) = P(C | Ac ∩ B) = P(C | Ac ∩ Bc) .
The two conditions in the above example can be taken as the definition of independence of three events. The
following equivalent definition has the advantage that it makes sense even when some of the events involved
have probability 0 (in which case, conditional probabilities are meaningless).
Terminology. Three events A, B, C ⊆ Ω in a probability model are said to be (statistically) independent if the
following 2^3 = 8 identities hold:
P(A* ∩ B* ∩ C*) = P(A*) P(B*) P(C*) , (⊥ : 3)
where each of A*, B*, C* stands for either the corresponding event or its complement (for instance, one of the eight identities is P(A ∩ Bc ∩ C) = P(A) P(Bc) P(C)).
An equivalent way to express the independence of A, B and C is via the following four identities:
P(A ∩ B ∩ C) = P(A) P(B) P(C) ,   P(A ∩ B) = P(A) P(B) ,   P(A ∩ C) = P(A) P(C) ,   P(B ∩ C) = P(B) P(C) . (⊥′ : 3)
Exercise. Can you verify that the two definitions (⊥ : 3) and (⊥′ : 3) are equivalent? Under what conditions, are
these two definitions equivalent to the one given in Example 3.6.1?
The independence of more than three events can be formulated in an analogous fashion.
Example 3.6.2. Consider a probability model in which the sample space is Ω := {1, 2, . . . , 8} and all eight outcomes are equally likely. Consider the three events
◦ A := {1, 2, 3, 4} ,
◦ B := {1, 2, 3, 4} (the same event as A) ,
◦ C := {4, 5, 6, 7}
(see Figure 3.18). Clearly, the three events are not independent. Nevertheless, observe that
P(A ∩ B ∩ C) = 1/8 = 1/2 × 1/2 × 1/2 = P(A) P(B) P(C) .
This shows that pairwise independence (i.e., the second, third and fourth equalities in (⊥′ : 3)) is essential for
the independence of A, B and C. #