An Intuitive Explanation of Bayes' Theorem

Eliezer S. Yudkowsky
http://yudkowsky.net/rational/bayes

The existing online explanations of Bayes' Theorem are too abstract. Bayesian
reasoning is very counterintuitive. People do not employ Bayesian
reasoning intuitively, find it very difficult to learn Bayesian reasoning
when tutored, and rapidly forget Bayesian methods once the tutoring is
over. This holds equally true for novice students and highly trained
professionals in a field. Bayesian reasoning is apparently one of those
things which, like quantum mechanics or the Wason Selection Test, is
inherently difficult for humans to grasp with our built-in mental faculties.
Or so they claim. Here you will find an attempt to offer an intuitive
explanation of Bayesian reasoning - an excruciatingly gentle
introduction that invokes all the human ways of grasping numbers, from
natural frequencies to spatial visualization. The intent is to convey, not
abstract rules for manipulating numbers, but what the numbers mean,
and why the rules are what they are (and cannot possibly be anything
else). When you are finished reading this page, you will see Bayesian
problems in your dreams.
And let's begin. Here's a story problem about a situation that doctors often encounter:

1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?

What do you think the answer is?
Next, suppose I told you that most doctors get the same wrong answer
on this problem - usually, only around 15% of doctors get it right.
("Really? 15%? Is that a real number, or an urban legend based on an
Internet poll?" It's a real number. See Casscells, Schoenberger, and
Grayboys 1978; Eddy 1982; Gigerenzer and Hoffrage 1995; and many
other studies. It's a surprising result which is easy to replicate, so it's
been extensively replicated.)
Do you want to think about your answer again? Here's a Javascript
calculator if you need one. This calculator has the usual precedence
rules; multiplication before addition and so on. If you're not sure, I
suggest using parentheses.
On the story problem above, most doctors estimate the probability to
be between 70% and 80%, which is wildly incorrect.
Here's an alternate version of the problem on which doctors fare
somewhat better:
10 out of 1000 women at age forty who participate in routine
screening have breast cancer. 800 out of 1000 women with
breast cancer will get positive mammographies. 96 out of 1000
women without breast cancer will also get positive
mammographies. If 1000 women in this age group undergo a
routine screening, about what fraction of women with positive
mammographies will actually have breast cancer?
And finally, here's the problem on which doctors fare best of all, with
46% - nearly half - arriving at the correct answer:
100 out of 10,000 women at age forty who participate in routine
screening have breast cancer. 80 of every 100 women with
breast cancer will get a positive mammography. 950 out of
9,900 women without breast cancer will also get a positive
mammography. If 10,000 women in this age group undergo a
routine screening, about what fraction of women with positive
mammographies will actually have breast cancer?
Out of the 10,000 women in this sample, we can distinguish four groups:

Group A: 80 women with breast cancer, and a positive mammography.
Group B: 20 women with breast cancer, and a negative mammography.
Group C: 950 women without breast cancer, and a positive mammography.
Group D: 8,950 women without breast cancer, and a negative mammography.
As you can check, the sum of all four groups is still 10,000. The sum
of groups A and B, the groups with breast cancer, corresponds to
group 1; and the sum of groups C and D, the groups without breast
cancer, corresponds to group 2; so administering a mammography
does not actually change the number of women with breast cancer.
The proportion of the cancer patients (A + B) within the complete set of
patients (A + B + C + D) is the same as the 1% prior chance that a
woman has cancer: (80 + 20) / (80 + 20 + 950 + 8950) = 100 / 10000 =
1%.
The proportion of cancer patients with positive results, within the group
of all patients with positive results, is the proportion of (A) within (A +
C): 80 / (80 + 950) = 80 / 1030 = 7.8%. If you administer a
mammography to 10,000 patients, then out of the 1030 with positive
mammographies, 80 of those positive-mammography patients will have
cancer. This is the correct answer, the answer a doctor should give a
positive-mammography patient if she asks about the chance she has
breast cancer; if thirteen patients ask this question, roughly 1 out of
those 13 will have cancer.
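To check the arithmetic mechanically, here is a minimal Python sketch of the same four-group bookkeeping (the code is mine, not the article's; the article embedded Javascript calculators instead):

# The four groups from the text: 10,000 women, 1% prior, 80% hit rate,
# and 950 false positives among the 9,900 healthy women.
group_a = 80      # breast cancer and positive mammography
group_b = 20      # breast cancer and negative mammography
group_c = 950     # no breast cancer, but positive mammography
group_d = 8950    # no breast cancer and negative mammography

assert group_a + group_b + group_c + group_d == 10000

prior = (group_a + group_b) / 10000          # 0.01, the 1% prior
posterior = group_a / (group_a + group_c)    # proportion of A within A + C
print(prior, round(posterior, 3))            # 0.01 0.078, i.e. 7.8%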
To see that the final answer always depends on the chance that a
woman without breast cancer gets a positive mammography, consider
an alternate test, mammography+. Like the original test,
mammography+ returns positive for 80% of women with breast cancer.
However, mammography+ returns a positive result for only one out of a
million women without breast cancer - mammography+ has the same
rate of false negatives, but a vastly lower rate of false positives.
Suppose a patient receives a positive mammography+. What is the
chance that this patient has breast cancer? Under the new test, it is a
virtual certainty - 99.988%, i.e., a 1 in 8082 chance of being healthy.
In other words, 80 / (80 + 9,900*0.000001) = 80 / 80.0099, or 99.988%.
Now suppose we have a third test, mammography*, that returns positive for 80% of women with breast cancer but also returns positive for 80% of women without breast cancer. If mammography* is administered to 10,000 women, 80 of the 100 women with cancer test positive, and 7,920 of the 9,900 women without cancer also test positive, for 8,000 positive results in all. The result works out to 80 / 8,000, or 0.01. This is exactly the same
as the 1% prior probability that a patient has breast cancer! A
"positive" result on mammography* doesn't change the probability that
a woman has breast cancer at all. You can similarly verify that a
"negative" mammography* also counts for nothing. And in fact it must
be this way, because if mammography* has an 80% hit rate for
patients with breast cancer, and also an 80% rate of false positives for
patients without breast cancer, then mammography* is completely
uncorrelated with breast cancer. There's no reason to call one result
"positive" and one result "negative"; in fact, there's no reason to call
the test a "mammography". You can throw away your expensive
mammography* equipment and replace it with a random number
generator that outputs a red light 80% of the time and a green light
20% of the time; the results will be the same. Furthermore, there's no
reason to call the red light a "positive" result or the green light a
"negative" result. You could have a green light 80% of the time and a
red light 20% of the time, or a blue light 80% of the time and a purple
light 20% of the time, and it would all have the same bearing on
whether the patient has breast cancer: i.e., no bearing whatsoever.
We can show algebraically that this must hold for any case where the
chance of a true positive and the chance of a false positive are the
same, i.e:
Group 1: 100 patients with breast cancer.
Group 2: 9,900 patients without breast cancer.
Now consider a test where the probability of a true positive and the
probability of a false positive are the same number M (in the example
above, M=80% or M = 0.8):

p(positive&cancer)  = p(cancer)*M
p(positive&~cancer) = p(~cancer)*M

p(cancer|positive) = p(cancer)*M / [p(cancer)*M + p(~cancer)*M]
                   = p(cancer) / [p(cancer) + p(~cancer)]
                   = p(cancer)

The common factor M cancels out of the numerator and denominator, so the posterior probability equals the prior probability no matter what M is - which is common sense. Flipping a coin, for example,
wouldn't be a cancer test - what makes a coin a poor test is not that it
has a 50/50 chance of coming up heads if the patient has cancer, but
that it also has a 50/50 chance of coming up heads if the patient does
not have cancer. You can even use a test that comes up "positive" for
cancer patients 100% of the time, and still not learn anything. An
example of such a test is "Add 2 + 2 and see if the answer is 4." This
test returns positive 100% of the time for patients with breast cancer.
It also returns positive 100% of the time for patients without breast
cancer. So you learn nothing.
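Here is a quick numeric sketch of that fact - with any equal hit rate M for both groups, the posterior never moves:

# A test that returns positive with the same probability M for cancerous
# and healthy patients leaves the posterior equal to the prior, for any M.
def posterior(prior, m):
    pos_cancer = prior * m            # p(positive & cancer)
    pos_healthy = (1 - prior) * m     # p(positive & ~cancer)
    return pos_cancer / (pos_cancer + pos_healthy)

for m in (1.0, 0.8, 0.5, 0.05):
    print(m, round(posterior(0.01, m), 6))   # always 0.01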
The original proportion of patients with breast cancer is known as the
prior probability. The chance that a patient with breast cancer gets a
positive mammography, and the chance that a patient without breast
cancer gets a positive mammography, are known as the two
conditional probabilities. Collectively, this initial information is known
as the priors. The final answer - the estimated probability that a
patient has breast cancer, given that we know she has a positive result
on her mammography - is known as the revised probability or the
posterior probability. What we've just shown is that if the two
conditional probabilities are equal, the posterior probability equals the
prior probability.
Actually, priors are true or false just like the final answer - they reflect
reality and can be judged by comparing them against reality. For
example, if you think that 920 out of 10,000 women in a sample have
breast cancer, and the actual number is 100 out of 10,000, then your
priors are wrong. For our particular problem, the priors might have
been established by three studies - a study on the case histories of
women with breast cancer to see how many of them tested positive on
a mammography, a study on women without breast cancer to see how
many of them test positive on a mammography, and an
epidemiological study on the prevalence of breast cancer in some
specific demographic.
Suppose that a barrel contains many small plastic eggs. Some eggs
are painted red and some are painted blue. 40% of the eggs in the bin
contain pearls, and 60% contain nothing. 30% of eggs containing
pearls are painted blue, and 10% of eggs containing nothing are
painted blue. What is the probability that a blue egg contains a pearl?
For this example the arithmetic is simple enough that you may be able
to do it in your head, and I would suggest trying to do so.
A more compact way of specifying the problem:
p(pearl) = 40%
p(blue|pearl) = 30%
p(blue|~pearl) = 10%
p(pearl|blue) = ?
"~" is shorthand for "not", so ~pearl reads "not pearl".
blue|pearl is shorthand for "blue given pearl" or "the probability that
an egg is painted blue, given that the egg contains a pearl". One thing
that's confusing about this notation is that the order of implication is
read right-to-left, as in Hebrew or Arabic. blue|pearl means
"blue<-pearl", the degree to which pearl-ness implies blue-ness, not
the degree to which blue-ness implies pearl-ness. This is confusing,
but it's unfortunately the standard notation in probability theory.
Readers familiar with quantum mechanics will have already
encountered this peculiarity; in quantum mechanics, for example,
<d|c><c|b><b|a> reads as "the probability that a particle at A goes
to B, then to C, ending up at D". To follow the particle, you move your
eyes from right to left. Reading from left to right, "|" means "given";
reading from right to left, "|" means "implies" or "leads to". Thus,
moving your eyes from left to right, blue|pearl reads "blue given
pearl" or "the probability that an egg is painted blue, given that the egg
contains a pearl". Moving your eyes from right to left, blue|pearl
reads "pearl implies blue" or "the probability that an egg containing a
pearl is painted blue".
The item on the right side is what you already know, or the premise, and the item on the left side is the implication or conclusion. If we have p(blue|pearl) = 30%, and we already know that some egg
contains a pearl, then we can conclude there is a 30% chance that the
egg is painted blue. Thus, the final fact we're looking for - "the chance
that a blue egg contains a pearl" or "the probability that an egg
contains a pearl, if we know the egg is painted blue" - reads
p(pearl|blue).
Let's return to the problem. We have that 40% of the eggs contain
pearls, and 60% of the eggs contain nothing. 30% of the eggs
containing pearls are painted blue, so 12% of the eggs altogether
contain pearls and are painted blue. 10% of the eggs containing
nothing are painted blue, so altogether 6% of the eggs contain nothing
and are painted blue. A total of 18% of the eggs are painted blue, and
a total of 12% of the eggs are painted blue and contain pearls, so the
chance a blue egg contains a pearl is 12/18 or 2/3 or around 67%.
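The same arithmetic, as a Python sketch:

p_pearl = 0.40
p_blue_given_pearl = 0.30
p_blue_given_empty = 0.10

pearl_and_blue = p_pearl * p_blue_given_pearl        # 0.12
empty_and_blue = (1 - p_pearl) * p_blue_given_empty  # 0.06
print(pearl_and_blue / (pearl_and_blue + empty_and_blue))  # 0.666..., about 67%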
The applet below, courtesy of Christian Rovner, shows a graphic
representation of this problem:
Looking at this applet, it's easier to see why the final answer depends
on all three probabilities; it's the differential pressure between the two
conditional probabilities, p(blue|pearl) and p(blue|~pearl),
that slides the prior probability p(pearl) to the posterior probability
p(pearl|blue).
As before, we can see the necessity of all three pieces of information
by considering extreme cases (feel free to type them into the applet).
In a (large) barrel in which only one egg out of a thousand contains a
pearl, knowing that an egg is painted blue slides the probability from
0.1% to 0.3% (instead of sliding the probability from 40% to 67%).
Similarly, if 999 out of 1000 eggs contain pearls, knowing that an egg
is blue slides the probability from 99.9% to 99.966%; the probability
that the egg does not contain a pearl goes from 1/1000 to around
1/3000. Even when the prior probability changes, the differential
pressure of the two conditional probabilities always slides the
probability in the same direction. If you learn the egg is painted blue,
the probability the egg contains a pearl always goes up - but it goes up
from the prior probability, so you need to know the prior probability in
order to calculate the final answer. 0.1% goes up to 0.3%, 10% goes
up to 25%, 40% goes up to 67%, 80% goes up to 92%, and 99.9%
goes up to 99.966%. If you're interested in knowing how any other
probabilities slide, you can type your own prior probability into the Java
applet. You can also click and drag the dividing line between pearl
and ~pearl in the upper bar, and watch the posterior probability
change in the bottom bar.
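In place of the applet, here is a small sketch that reproduces the slides listed above:

# How the same pair of conditional probabilities slides different priors.
def slide(prior, p_blue_pearl=0.30, p_blue_empty=0.10):
    blue = prior * p_blue_pearl + (1 - prior) * p_blue_empty
    return prior * p_blue_pearl / blue

for prior in (0.001, 0.10, 0.40, 0.80, 0.999):
    print(f"{prior:.3f} -> {slide(prior):.5f}")
# prints 0.001 -> 0.00299, 0.100 -> 0.25000, 0.400 -> 0.66667,
#        0.800 -> 0.92308, 0.999 -> 0.99967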
Studies of clinical reasoning show that most doctors carry out the
mental operation of replacing the original 1% probability with the 80%
probability that a woman with cancer would get a positive
mammography. Similarly, on the pearl-egg problem, most
respondents unfamiliar with Bayesian reasoning would probably
respond that the probability a blue egg contains a pearl is 30%, or
perhaps 20% (the 30% chance of a true positive minus the 10%
chance of a false positive). Even if this mental operation seems like a
good idea at the time, it makes no sense in terms of the question
asked. It's like the experiment in which you ask a second-grader: "If
eighteen people get on a bus, and then seven more people get on the
bus, how old is the bus driver?" Many second-graders will respond:
"Twenty-five." They understand when they're being prompted to carry
out a particular mental procedure, but they haven't quite connected the
procedure to reality. Similarly, to find the probability that a woman with
a positive mammography has breast cancer, it makes no sense
whatsoever to replace the original probability that the woman has
cancer with the probability that a woman with breast cancer gets a
positive mammography. Neither can you subtract the probability of a
false positive from the probability of the true positive. These operations
are as wildly irrelevant as adding the number of people on the bus to
find the age of the bus driver.
A study by Gigerenzer and Hoffrage in 1995 showed that some ways of
phrasing story problems are much more evocative of correct Bayesian
reasoning. The least evocative phrasing used probabilities. A slightly
more evocative phrasing used frequencies instead of probabilities; the
problem remained the same, but instead of saying that 1% of women
had breast cancer, one would say that 1 out of 100 women had breast
cancer, that 80 out of 100 women with breast cancer would get a
positive mammography, and so on. Why did a higher proportion of
subjects display Bayesian reasoning on this problem? Probably
because saying "1 out of 100 women" encourages you to concretely
visualize X women with cancer, leading you to visualize X women with
cancer and a positive mammography, etc.
The most effective presentation found so far is what's known as natural
frequencies - saying that 40 out of 100 eggs contain pearls, 12 out of
40 eggs containing pearls are painted blue, and 6 out of 60 eggs
containing nothing are painted blue. A natural frequencies
presentation is one in which the information about the prior probability
is included in presenting the conditional probabilities. If you were just
learning about the eggs' conditional probabilities through natural
experimentation, you would - in the course of cracking open a hundred
eggs - crack open around 40 eggs containing pearls, of which 12 eggs
would be painted blue, while cracking open 60 eggs containing nothing,
of which about 6 would be painted blue. In the course of learning the
conditional probabilities, you'd see examples of blue eggs containing
pearls about twice as often as you saw examples of blue eggs
containing nothing.
It may seem like presenting the problem in this way is "cheating", and
indeed if it were a story problem in a math book, it probably would be
cheating. However, if you're talking about real doctors, you want to
cheat; you want the doctors to draw the right conclusions as easily as
possible. The obvious next move would be to present all medical
statistics in terms of natural frequencies. Unfortunately, while natural
frequencies are a step in the right direction, it probably won't be
enough. When problems are presented in natural frequencies, the
proportion of people using Bayesian reasoning rises to around half. A
big improvement, but not big enough when you're talking about real
doctors and real patients.
A presentation of the problem in natural frequencies might be visualized as a bar graph, with the pearl eggs and empty eggs in a top bar projecting down into a bottom bar of blue eggs.
If you diminish two shapes by the same factor, their relative proportion
will be the same as before. If you diminish the left section of the top
bar by the same factor as the right section, then the bottom bar will
have the same proportions as the top bar - it'll just be smaller. If the
two conditional probabilities are equal, learning that the egg is blue
doesn't change the probability that the egg contains a pearl - for the
same reason that similar triangles have identical angles; geometric
figures don't change shape when you shrink them by a constant factor.
In this case, you might as well just say that 30% of eggs are painted
blue, since the probability of an egg being painted blue is independent
of whether the egg contains a pearl. Applying a "test" that is
statistically independent of its condition just shrinks the sample size.
In this case, requiring that the egg be painted blue doesn't shrink the
group of eggs with pearls any more or less than it shrinks the group of
eggs without pearls. It just shrinks the total number of eggs in the
sample.
Here's what the original medical problem looks like when graphed. 1%
of women have breast cancer, 80% of those women test positive on a
mammography, and 9.6% of women without breast cancer also receive
positive mammographies.
When the top bar is projected into the bottom bar, the number of women without breast cancer diminishes by a factor of more than ten,
from 9,900 to 950, while the number of women with breast cancer is
diminished only from 100 to 80. Thus, the proportion of 80 within 1,030
is much larger than the proportion of 100 within 10,000. In the graph,
the left sector (representing women with breast cancer) is small, but
the mammography test projects almost all of this sector into the
bottom bar. The right sector (representing women without breast
cancer) is large, but the mammography test projects a much smaller
fraction of this sector into the bottom bar. There are, indeed, fewer
women with breast cancer and positive mammographies than there are
women with breast cancer - obeying the law of probabilities which
requires that p(A) >= p(A&B). But even though the left sector in
the bottom bar is actually slightly smaller, the proportion of the left
sector within the bottom bar is greater - though still not very great. If
the bottom bar were renormalized to the same length as the top bar, it
would look like the left sector had expanded. This is why the
proportion of "women with breast cancer" in the group "women with
positive mammographies" is higher than the proportion of "women with
breast cancer" in the general population - although the proportion is
still not very high. The evidence of the positive mammography slides
the prior probability of 1% to the posterior probability of 7.8%.
What is it that this test actually does? If a patient comes to you with a positive result on her mammography, what do you say?
health is definitely established by this test."
This intuition is correct! The sum of the groups with negative results
and positive results must always equal the group of all women. If the
positive-testing group has "more than its fair share" of women without
breast cancer, there must be an at least slightly higher proportion of
women with cancer in the negative-testing group. A positive result is
rare but very strong evidence in one direction, while a negative result is
common but very weak evidence in the opposite direction. You might
call this the Law of Conservation of Probability - not a standard term,
but the conservation rule is exact. If you take the revised probability of
breast cancer after a positive result, times the probability of a positive
result, and add that to the revised probability of breast cancer after a
negative result, times the probability of a negative result, then you
must always arrive at the prior probability. If you don't yet know what
the test result is, the expected revised probability after the test result
arrives - taking both possible results into account - should always
equal the prior probability.
On ordinary mammography, the test is expected to return "positive"
10.3% of the time - 80 positive women with cancer plus 950 positive
women without cancer equals 1030 women with positive results.
Conversely, the mammography should return negative 89.7% of the
time: 100% - 10.3% = 89.7%. A positive result slides the revised
probability from 1% to 7.8%, while a negative result slides the revised
probability from 1% to 0.22%. So
p(cancer|positive)*p(positive) + p(cancer|~positive)*p(~positive) = 7.8%*10.3% + 0.22%*89.7% = 1% = p(cancer), as expected.
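As a one-line check in Python:

p_pos, p_neg = 0.103, 0.897
print(round(0.078 * p_pos + 0.0022 * p_neg, 4))   # 0.01 - the prior recovered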
The entire set of probabilities for the mammography problem:

p(cancer):              0.01
p(~cancer):             0.99
p(positive|cancer):     80.0%
p(~positive|cancer):    20.0%
p(positive|~cancer):    9.6%
p(~positive|~cancer):   90.4%
p(cancer&positive):     0.008
p(cancer&~positive):    0.002
p(~cancer&positive):    0.095
p(~cancer&~positive):   0.895
p(positive):            0.103
p(~positive):           0.897
p(cancer|positive):     7.80%
p(~cancer|positive):    92.20%
p(cancer|~positive):    0.22%
p(~cancer|~positive):   99.78%
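Every entry in this table can be derived from the three pieces of prior information; here is a sketch that does so:

# Derive the full table from the three priors of the mammography problem.
p_cancer = 0.01
p_pos_given_cancer = 0.80
p_pos_given_healthy = 0.096

p_healthy = 1 - p_cancer
cancer_pos = p_cancer * p_pos_given_cancer           # 0.008
cancer_neg = p_cancer * (1 - p_pos_given_cancer)     # 0.002
healthy_pos = p_healthy * p_pos_given_healthy        # ~0.095
healthy_neg = p_healthy * (1 - p_pos_given_healthy)  # ~0.895

p_pos = cancer_pos + healthy_pos                     # ~0.103
print(round(cancer_pos / p_pos, 4))                  # 0.0776 -> p(cancer|positive)
print(round(cancer_neg / (1 - p_pos), 4))            # 0.0022 -> p(cancer|~positive)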
Since p(cancer) and p(~cancer) must sum to 1, they have only one degree of freedom between them - knowing one gives you the other. On the other hand, p(positive|cancer) and p(positive|~cancer) have
two degrees of freedom. You can have a mammography test that
returns positive for 80% of cancerous patients and 9.6% of healthy
patients, or that returns positive for 70% of cancerous patients and 2%
of healthy patients, or even a health test that returns "positive" for 30%
of cancerous patients and 92% of healthy patients. The two quantities,
the output of the mammography test for cancerous patients and the
output of the mammography test for healthy patients, are in
mathematical terms independent; one cannot be obtained from the
other in any way, and so they have two degrees of freedom between
them.
What about p(positive&cancer), p(positive|cancer), and
p(cancer)? Here we have three quantities; how many degrees of
freedom are there? In this case the equation that must hold is
p(positive&cancer) = p(positive|cancer) *
p(cancer). This equality reduces the degrees of freedom by one. If
we know the fraction of patients with cancer, and chance that a
cancerous patient has a positive mammography, we can deduce the
fraction of patients who have breast cancer and a positive
mammography by multiplying. You should recognize this operation
from the graph; it's the projection of the top bar into the bottom bar.
p(cancer) is the left sector of the top bar, and
p(positive|cancer) determines how much of that sector projects
into the bottom bar, and the left sector of the bottom bar is
p(positive&cancer).
Conversely, if you know p(positive&cancer) and p(cancer), you can deduce p(positive|cancer) by dividing. (Be careful to divide in the right order, though, or you will start doing strange things, such as insisting that 125% of women
with breast cancer and positive mammographies have breast cancer.
This is a common mistake in carrying out Bayesian arithmetic, in my
experience.) And finally, if you know p(positive&cancer) and
p(positive|cancer), you can deduce how many cancer patients
there must have been originally. There are two degrees of freedom
shared out among the three quantities; if we know any two, we can
deduce the third.
How about p(positive), p(positive&cancer), and
p(positive&~cancer)? Again there are only two degrees of
freedom among these three variables. The equation occupying the
extra degree of freedom is p(positive) = p(positive&cancer)
+ p(positive&~cancer). This is how p(positive) is
computed to begin with; we figure out the number of women with breast
cancer who have positive mammographies, and the number of women
without breast cancer who have positive mammographies, then add
them together to get the total number of women with positive
mammographies. It would be very strange to go out and conduct a
study to determine the number of women with positive mammographies
- just that one number and nothing else - but in theory you could do
so. And if you then conducted another study and found the number of
those women who had positive mammographies and breast cancer,
you would also know the number of women with positive
mammographies and no breast cancer - either a woman with a positive
mammography has breast cancer or she doesn't. In general, p(A&B)
+ p(A&~B) = p(A). Symmetrically, p(A&B) + p(~A&B) =
p(B).
What about p(positive&cancer), p(positive&~cancer),
p(~positive&cancer), and p(~positive&~cancer)? You
might at first be tempted to think that there are only two degrees of
freedom for these four quantities - that you can, for example, get
p(positive&~cancer) by multiplying p(positive) *
p(~cancer), and thus that all four quantities can be found given only
the two quantities p(positive) and p(cancer). This is not the
case! p(positive&~cancer) = p(positive) * p(~cancer)
only if the two probabilities are statistically independent - if the chance
that a woman has breast cancer has no bearing on whether she has a
positive mammography. As you'll recall, this amounts to requiring that
the two conditional probabilities be equal to each other - a requirement
which would eliminate one degree of freedom. If you remember that
these four quantities are the groups A, B, C, and D, you can look over
those four groups and realize that, in theory, you can put any number
of people into the four groups. If you start with a group of 80 women
with breast cancer and positive mammographies, there's no reason
why you can't add another group of 500 women with breast cancer and
negative mammographies, followed by a group of 3 women without
breast cancer and negative mammographies, and so on. So now it
seems like the four quantities have four degrees of freedom. And they
would, except that in expressing them as probabilities, we need to
normalize them to fractions of the complete group, which adds the
constraint that p(positive&cancer) + p(positive&~cancer)
+ p(~positive&cancer) + p(~positive&~cancer) = 1.
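A tiny sketch of that last constraint:

# Normalization: any three of the four joint probabilities fix the fourth.
a, b, c = 0.008, 0.002, 0.095   # cancer&positive, cancer&~positive, ~cancer&positive
d = 1 - (a + b + c)             # ~cancer&~positive is forced
print(round(d, 3))              # 0.895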
Good luck!
p(cancer|positive)*p(positive) + p(cancer|~positive)*p(~positive) = p(cancer)

In terms of the four groups:

p(cancer|positive)  = A / (A + C)
p(positive)         = A + C
p(cancer&positive)  = A
p(cancer|~positive) = B / (B + D)
p(~positive)        = B + D
p(cancer&~positive) = B
p(cancer)           = A + B
Let's return to the original barrel of eggs - 40% of the eggs containing
pearls, 30% of the pearl eggs painted blue, 10% of the empty eggs
painted blue. Now suppose that both conditional probabilities are doubled: 60% of the pearl eggs are painted blue, and 20% of the empty eggs are painted blue. What is the revised probability that a blue egg contains a pearl?
If you guessed that the revised probability remains the same, because
the bottom bar grows by a factor of 2 but retains the same proportions,
congratulations! Take a moment to think about how far you've come.
Looking at a problem like
1% of women have breast cancer. 80% of women with breast
cancer get positive mammographies. 9.6% of women without
breast cancer get positive mammographies. If a woman has a
positive mammography, what is the probability she has breast
cancer?
the vast majority of respondents intuit that around 70-80% of women
with positive mammographies have breast cancer. Now, looking at a
problem like
Suppose there are two barrels containing many small plastic
eggs. In both barrels, some eggs are painted blue and the rest
are painted red. In both barrels, 40% of the eggs contain pearls
and the rest are empty. In the first barrel, 30% of the pearl eggs
are painted blue, and 10% of the empty eggs are painted blue.
In the second barrel, 60% of the pearl eggs are painted blue,
and 20% of the empty eggs are painted blue. Would you rather
have a blue egg from the first or second barrel?
you can see it's intuitively obvious that the probability of a blue egg
containing a pearl is the same for either barrel. Imagine how hard it
would be to see that using the old way of thinking!
It's intuitively obvious, but how to prove it? Suppose that we call P the
prior probability that an egg contains a pearl, that we call M the first
conditional probability (that a pearl egg is painted blue), and N the
second conditional probability (that an empty egg is painted blue).
Suppose that M and N are both increased or diminished by an arbitrary
factor X - for example, in the problem above, they are both increased
by a factor of 2. Does the revised probability that an egg contains a
pearl, given that we know the egg is blue, stay the same?
p(pearl) = P
p(blue|pearl) = M*X
p(blue|~pearl) = N*X
p(pearl|blue) = ?
From these quantities, we get the four groups:
Group A: p(pearl&blue)   = P*M*X
Group B: p(pearl&~blue)  = P*(1 - (M*X))
Group C: p(~pearl&blue)  = (1 - P)*N*X
Group D: p(~pearl&~blue) = (1 - P)*(1 - (N*X))

The revised probability is the proportion of group A within groups A and C:

p(pearl|blue) = P*M*X / [P*M*X + (1 - P)*N*X] = P*M / [P*M + (1 - P)*N]

The factor X appears in every term of both the numerator and the denominator, so it cancels out: increasing or diminishing both conditional probabilities by the same factor X leaves the revised probability unchanged. But what about the conservation rule,
p(pearl|blue)*p(blue) + p(pearl|~blue)*p(~blue) =
p(pearl)? Doesn't this equation take up one degree of freedom?
No, because p(blue) isn't fixed between the two problems. In the
second barrel, the proportion of blue eggs containing pearls is the
same as in the first barrel, but a much larger fraction of eggs are
painted blue! This alters the set of red eggs in such a way that the
proportions do change. Here's a graph for the red eggs in the second
barrel:
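The graph isn't reproduced here, but a numeric sketch shows both halves of the claim - the posterior for blue eggs is identical in the two barrels, while p(blue) is not:

# Both barrels: same posterior for blue eggs, different p(blue).
def barrel(P, M, N):
    p_blue = P * M + (1 - P) * N
    return round(p_blue, 2), round(P * M / p_blue, 3)

print(barrel(0.4, 0.3, 0.1))   # (0.18, 0.667) - first barrel
print(barrel(0.4, 0.6, 0.2))   # (0.36, 0.667) - second barrel, same posterior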
Suppose that you apply two tests for breast cancer in succession - say, a standard mammography and also some other test which is
independent of mammography. Since I don't know of any such test
which is independent of mammography, I'll invent one for the purpose of
this problem, and call it the Tams-Braylor Division Test, which checks
to see if any cells are dividing more rapidly than other cells. We'll
suppose that the Tams-Braylor gives a true positive for 90% of patients
with breast cancer, and gives a false positive for 5% of patients without
cancer. Let's say the prior prevalence of breast cancer is 1%. If a patient gets a positive result on her mammography and her Tams-Braylor, what is the revised probability she has breast cancer?
One way to solve this problem would be to take the revised probability
for a positive mammography, which we already calculated as 7.8%,
and plug that into the Tams-Braylor test as the new prior probability. If
we do this, we find that the result comes out to 60%.
But this assumes that first we see the positive mammography result,
and then the positive result on the Tams-Braylor. What if first the
woman gets a positive result on the Tams-Braylor, followed by a
positive result on her mammography? Intuitively, it seems like it
shouldn't matter. Does the math check out?
First we'll administer the Tams-Braylor to a woman with a 1% prior probability of breast cancer; the positive result slides her probability up to around 15.4%. Then we'll administer a mammography, using 15.4% as the new prior probability.
Lo and behold, the answer is again 60%. (If it's not exactly the same,
it's due to rounding error - you can get a more precise calculator, or
work out the fractions by hand, and the numbers will be exactly equal.)
An algebraic proof that both strategies are equivalent is left to the
reader. To visualize, imagine that the lower bar of the frequency applet
for mammography projects an even lower bar using the probabilities of
the Tams-Braylor Test, and that the final lowest bar is the same
regardless of the order in which the conditional probabilities are
projected.
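A sketch confirming that the order of the two tests makes no difference (the Tams-Braylor numbers are the invented ones from the text):

# Chaining two independent tests in either order gives the same posterior.
def update(prior, p_pos_sick, p_pos_healthy):
    pos = prior * p_pos_sick + (1 - prior) * p_pos_healthy
    return prior * p_pos_sick / pos

mammography = (0.80, 0.096)
tams_braylor = (0.90, 0.05)

print(round(update(update(0.01, *mammography), *tams_braylor), 4))  # 0.6024
print(round(update(update(0.01, *tams_braylor), *mammography), 4))  # 0.6024, identical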
We might also reason that since the two tests are independent, the
probability a woman with breast cancer gets a positive mammography
and a positive Tams-Braylor is 90% * 80% = 72%. And the probability
that a woman without breast cancer gets false positives on
mammography and Tams-Braylor is 5% * 9.6% = 0.48%. So if we
wrap it all up as a single test with a likelihood ratio of 72%/0.48%, and
apply it to a woman with a 1% prior probability of breast cancer:
The result is once again 60%.

Now suppose that a patient has gotten a positive result on each of three independent tests. What is the probability the patient has breast cancer?
Here's a fun trick for simplifying the bookkeeping. If the prior
prevalence of breast cancer in a demographic is 1%, then 1 out of 100
women have breast cancer, and 99 out of 100 women do not have
breast cancer. So if we rewrite the probability of 1% as an odds ratio,
the odds are:
1:99
And the likelihood ratios of the three tests A, B, and C are:
8.33:1 = 25:3
18.0:1 = 18:1
3.5:1 =
7:2
The odds for women with breast cancer who score positive on all three
tests, versus women without breast cancer who score positive on all
three tests, will equal:
1*25*18*7:99*3*1*2 =
3,150:594
To recover the probability from the odds, we just write:
3,150 / (3,150 + 594) = 84%
This always works regardless of how the odds ratios are written; i.e.,
8.33:1 is just the same as 25:3 or 75:9. It doesn't matter in what order
the tests are administered, or in what order the results are computed.
The proof is left as an exercise for the reader.
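The whole trick, as a sketch:

# Odds-ratio bookkeeping for the three tests.
sick, healthy = 1, 99                      # prior odds 1:99
for s, h in ((25, 3), (18, 1), (7, 2)):    # likelihood ratios of tests A, B, C
    sick *= s
    healthy *= h

print(sick, healthy)                       # 3150 594
print(round(sick / (sick + healthy), 2))   # 0.84 -> 84%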
If you wanted to combine the prior odds and the three likelihood ratios, you could multiply those numbers... or you could just add their logarithms, measured in decibels - 10 times the base-10 logarithm of the ratio:

10*log10(1/99) = -20
10*log10(25/3) = 9
10*log10(18/1) = 13
10*log10(7/2)  = 5
It starts out as fairly unlikely that a woman has breast cancer - our
credibility level is at -20 decibels. Then three test results come in,
corresponding to 9, 13, and 5 decibels of evidence. This raises the
credibility level by a total of 27 decibels, meaning that the prior
credibility of -20 decibels goes to a posterior credibility of 7 decibels.
So the odds go from 1:99 to 5:1, and the probability goes from 1% to
around 83%.
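And the decibel arithmetic as a sketch, using exact logarithms rather than the rounded whole numbers above, which is why it lands on 84% instead of 83%:

import math

def decibels(odds):
    return 10 * math.log10(odds)

total_db = decibels(1/99) + decibels(25/3) + decibels(18) + decibels(7/2)
odds = 10 ** (total_db / 10)          # about 5.3:1
print(round(odds / (1 + odds), 3))    # 0.841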
Consider the classic bookbag-and-poker-chips experiment: one bookbag contains 70% red and 30% blue poker chips, another contains 30% red and 70% blue, and you pick one of the two bookbags at random. You then sample twelve chips with replacement and get eight red chips and four blue chips; what is the probability you are holding the mostly-red bookbag? Each red chip is 10*log10(7/3), or roughly 3.7 decibels of evidence favoring the mostly-red bag, and each blue chip is roughly 3.7 decibels of evidence against it, so eight red chips and four blue chips net out to four chips' worth of evidence. The prior credibility starts at 0 decibels and there's a total of around 14
decibels of evidence, and indeed this corresponds to odds of around
25:1 or around 96%. Again, there's some rounding error, but if you
performed the operations using exact arithmetic, the results would be
identical.
We can now see intuitively that the bookbag problem would have
exactly the same answer, obtained in just the same way, if sixteen
chips were sampled and we found ten red chips and six blue chips.
Every one of these calculations comes down to finding the same quantity. In the mammography problem, it is
p(positive&cancer) / [p(positive&cancer) +
p(positive&~cancer)]
which is
p(positive&cancer) / p(positive)
which is
p(cancer|positive)
The fully general form of this calculation is known as Bayes' Theorem
or Bayes' Rule:
p(A|X) = p(X|A)*p(A) / [p(X|A)*p(A) + p(X|~A)*p(~A)]
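As a general-purpose sketch:

# Bayes' Theorem as a function.
def bayes(p_a, p_x_given_a, p_x_given_not_a):
    numerator = p_x_given_a * p_a
    return numerator / (numerator + p_x_given_not_a * (1 - p_a))

print(round(bayes(0.01, 0.80, 0.096), 3))   # 0.078 - the mammography answer
print(round(bayes(0.40, 0.30, 0.10), 3))    # 0.667 - the pearl-egg answer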
For example, the reply "Whether or not you happen to be an optimist
has nothing to do with whether biological warfare wipes out the human
species" can be translated into the statement:
p(you are currently an optimist | biological war occurs within ten years
and wipes out humanity) =
p(you are currently an optimist | biological war occurs within ten years
and does not wipe out humanity)
Since the two probabilities for p(X|A) and p(X|~A) are equal,
Bayes' Theorem says that p(A|X) = p(A); as we have earlier seen,
when the two conditional probabilities are equal, the revised probability
equals the prior probability. If X and A are unconnected - statistically
independent - then finding that X is true cannot be evidence that A is
true; observing X does not update our probability for A; saying "X" is
not an argument for A.
But suppose you are arguing with someone who is verbally clever and
who says something like, "Ah, but since I'm an optimist, I'll have
renewed hope for tomorrow, work a little harder at my dead-end job,
pump up the global economy a little, eventually, through the trickle-down effect, sending a few dollars into the pocket of the researcher
who ultimately finds a way to stop biological warfare - so you see, the
two events are related after all, and I can use one as valid evidence
about the other." In one sense, this is correct - any correlation, no
matter how weak, is fair prey for Bayes' Theorem; but Bayes' Theorem
distinguishes between weak and strong evidence. That is, Bayes'
Theorem not only tells us what is and isn't evidence, it also describes
the strength of evidence. Bayes' Theorem not only tells us when to
revise our probabilities, but how much to revise our probabilities. A
correlation between hope and biological warfare may exist, but it's a lot
weaker than the speaker wants it to be; he is revising his probabilities
much too far.
Let's say you're a woman who's just undergone a mammography.
Previously, you figured that you had a very small chance of having
breast cancer; we'll suppose that you read the statistics somewhere
and so you know the chance is 1%. When the positive mammography
comes in, your estimated chance should now shift to 7.8%. There is
no room to say something like, "Oh, well, a positive mammography
isn't definite evidence, some healthy women get positive
mammographies too. I don't want to despair too early, and I'm not
going to revise my probability until more evidence comes in. Why?
Because I'm an optimist." And there is similarly no room for saying,
"Well, a positive mammography may not be definite evidence, but I'm
going to assume the worst until I find otherwise. Why? Because I'm a
pessimist." Your revised probability should go to 7.8%, no more, no
less.
Bayes' Theorem describes what makes something "evidence" and how
much evidence it is. Statistical models are judged by comparison to
the Bayesian method because, in statistics, the Bayesian method is
as good as it gets - the Bayesian method defines the maximum
amount of mileage you can get out of a given piece of evidence, in the
same way that thermodynamics defines the maximum amount of work
you can get out of a temperature differential. This is why you hear
cognitive scientists talking about Bayesian reasoners. In cognitive
science, Bayesian reasoner is the technically precise codeword that
we use to mean rational mind.
There are also a number of general heuristics about human reasoning
that you can learn from looking at Bayes' Theorem.
For example, in many discussions of Bayes' Theorem, you may hear
cognitive psychologists saying that people do not take prior frequencies sufficiently into account, meaning that when people
approach a problem where there's some evidence X indicating that
condition A might hold true, they tend to judge A's likelihood solely by
how well the evidence X seems to match A, without taking into account
the prior frequency of A. If you think, for example, that under the
mammography example, the woman's chance of having breast cancer
is in the range of 70%-80%, then this kind of reasoning is insensitive to
the prior frequency given in the problem; it doesn't notice whether 1%
of women or 10% of women start out having breast cancer. "Pay more
attention to the prior frequency!" is one of the many things that humans
need to bear in mind to partially compensate for our built-in
inadequacies.
A related error is to pay too much attention to p(X|A) and not enough to
p(X|~A) when determining how much evidence X is for A. The degree
to which a result X is evidence for A depends, not only on the strength
of the statement we'd expect to see result X if A were true, but also on
the strength of the statement we wouldn't expect to see result X if A
weren't true. For example, if it is raining, this very strongly implies the
grass is wet - p(wetgrass|rain) ~ 1 - but seeing that the grass
is wet doesn't necessarily mean that it has just rained; perhaps the
sprinkler was turned on, or you're looking at the early morning dew.
Since p(wetgrass|~rain) is substantially greater than zero,
p(rain|wetgrass) is substantially less than one. On the other
hand, if the grass was never wet when it wasn't raining, then knowing
that the grass was wet would always show that it was raining,
p(rain|wetgrass) ~ 1, even if p(wetgrass|rain) = 50%;
that is, even if the grass only got wet 50% of the times it rained.
Evidence is always the result of the differential between the two
conditional probabilities. Strong evidence is not the product of a very
high probability that A leads to X, but the product of a very low
probability that not-A could have led to X.
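A sketch of the wet-grass example; the prior p(rain) = 0.3 and p(wetgrass|~rain) = 0.4 below are made-up numbers for illustration, since the text only makes the qualitative point:

def bayes(p_a, p_x_given_a, p_x_given_not_a):
    numerator = p_x_given_a * p_a
    return numerator / (numerator + p_x_given_not_a * (1 - p_a))

# Rain virtually guarantees wet grass, but sprinklers and dew exist:
print(bayes(0.3, 1.0, 0.4))   # ~0.52 - wet grass is only weak evidence of rain
# Grass is never wet without rain, even if rain wets it only half the time:
print(bayes(0.3, 0.5, 0.0))   # 1.0 - wet grass now proves rain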
The Bayesian revolution in the sciences is fueled, not only by more
and more cognitive scientists suddenly noticing that mental
phenomena have Bayesian structure in them; not only by scientists in
every field learning to judge their statistical methods by comparison
with the Bayesian method; but also by the idea that science itself is a
special case of Bayes' Theorem; experimental evidence is Bayesian
evidence. The Bayesian revolutionaries hold that when you perform an
experiment and get evidence that "confirms" or "disconfirms" your
theory, this confirmation and disconfirmation is governed by the
Bayesian rules. For example, you have to take into account, not only
whether your theory predicts the phenomenon, but whether other
possible explanations also predict the phenomenon. Previously, the
most popular philosophy of science was probably Karl Popper's
falsificationism - this is the old philosophy that the Bayesian revolution
is currently dethroning. Karl Popper's idea that theories can be
definitely falsified, but never definitely confirmed, is yet another special
case of the Bayesian rules; if p(X|A) ~ 1 - if the theory makes a
definite prediction - then observing ~X very strongly falsifies A. On the
other hand, if p(X|A) ~ 1, and we observe X, this doesn't definitely
confirm the theory; there might be some other condition B such that
p(X|B) ~ 1, in which case observing X doesn't favor A over B. For
observing X to definitely confirm A, we would have to know, not that
p(X|A) ~ 1, but that p(X|~A) ~ 0, which is something that we
can't know because we can't range over all possible alternative
explanations. For example, when Einstein's theory of General
Relativity toppled Newton's incredibly well-confirmed theory of gravity, it
turned out that all of Newton's predictions were just a special case of
Einstein's predictions.
You can even formalize Popper's philosophy mathematically. The
likelihood ratio for X, p(X|A)/p(X|~A), determines how much
observing X slides the probability for A; the likelihood ratio is what says
how strong X is as evidence. Well, in your theory A, you can predict X
with probability 1, if you like; but you can't control the denominator of
the likelihood ratio, p(X|~A) - there will always be some alternative
theories that also predict X, and while we go with the simplest theory
that fits the current evidence, you may someday encounter some
evidence that an alternative theory predicts but your theory does not.
That's the hidden gotcha that toppled Newton's theory of gravity. So
there's a limit on how much mileage you can get from successful
predictions; there's a limit on how high the likelihood ratio goes for
confirmatory evidence.
On the other hand, if you encounter some piece of evidence Y that is
definitely not predicted by your theory, this is enormously strong
evidence against your theory. If p(Y|A) is infinitesimal, then the
likelihood ratio will also be infinitesimal. For example, if p(Y|A) is
0.0001%, and p(Y|~A) is 1%, then the likelihood ratio
p(Y|A)/p(Y|~A) will be 1:10000. -40 decibels of evidence! Or
flipping the likelihood ratio, if p(Y|A) is very small, then
p(Y|~A)/p(Y|A) will be very large, meaning that observing Y greatly
favors ~A over A. Falsification is much stronger than confirmation.
This is a consequence of the earlier point that very strong evidence is
not the product of a very high probability that A leads to X, but the
product of a very low probability that not-A could have led to X. This is
the precise Bayesian rule that underlies the heuristic value of Popper's
falsificationism.
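The numbers from that example, in a sketch:

import math

p_y_given_a, p_y_given_not_a = 0.000001, 0.01   # 0.0001% vs. 1%
ratio = p_y_given_a / p_y_given_not_a           # 1:10,000
print(10 * math.log10(ratio))                   # about -40 decibels of evidence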
Similarly, Popper's dictum that an idea must be falsifiable can be
interpreted as a manifestation of the Bayesian conservation-of-probability rule; if a result X is positive evidence for the theory, then the
result ~X would have disconfirmed the theory to some extent. If you try
to interpret both X and ~X as "confirming" the theory, the Bayesian
rules say this is impossible! To increase the probability of a theory
you must expose it to tests that can potentially decrease its
probability; this is not just a rule for detecting would-be cheaters in the
social process of science, but a consequence of Bayesian probability
theory. On the other hand, Popper's idea that there is only falsification
and no such thing as confirmation turns out to be incorrect. Bayes'
Theorem shows that falsification is very strong evidence compared to
confirmation, but falsification is still probabilistic in nature; it is not
governed by fundamentally different rules from confirmation, as Popper
argued.
So we find that many phenomena in the cognitive sciences, plus the
statistical methods used by scientists, plus the scientific method
itself, are all turning out to be special cases of Bayes' Theorem.
Hence the Bayesian revolution.
p(A|X) = p(X|A)*p(A) / [p(X|A)*p(A) + p(X|~A)*p(~A)]
Even if you are one of the fortunate people who find it easy to grasp and apply abstract theorems, the mental-juggling problem is still
something to bear in mind if you ever need to explain Bayesian
reasoning to someone else.
If you do find yourself losing track, my advice is to forget Bayes'
Theorem as an equation and think about the graph. p(A) and p(~A) are
at the top. p(X|A) and p(X|~A) are the projection factors. p(X&A) and
p(X&~A) are at the bottom. And p(A|X) equals the proportion of p(X&A)
within p(X&A)+p(X&~A). The graph isn't shown here - but can you see
it in your mind?
And if thinking about the graph doesn't work, I suggest forgetting about
Bayes' Theorem entirely - just try to work out the specific problem in
gizmos, hoses, and sparks, or whatever it is.
p(A|X) = p(X|A)*p(A) / [p(X|A)*p(A) + p(X|~A)*p(~A)]
We'll start with p(A|X). If you ever find yourself getting confused about
what's A and what's X in Bayes' Theorem, start with p(A|X) on the left
side of the equation; that's the simplest part to interpret. A is the thing
we want to know about. X is how we're observing it; X is the evidence
we're using to make inferences about A. Remember that for every
expression p(Q|P), we want to know about the probability for Q given
P, the degree to which P implies Q - a more sensible notation, which it
is now too late to adopt, would be p(Q<-P).
p(Q|P) is closely related to p(Q&P), but they are not identical.
Expressed as a probability or a fraction, p(Q&P) is the proportion of
things that have property Q and property P within all things; i.e., the
proportion of "women with breast cancer and a positive mammography"
within the group of all women. If the total number of women is 10,000,
and 80 women have breast cancer and a positive mammography, then
p(Q&P) is 80/10,000 = 0.8%. You might say that the absolute
quantity, 80, is being normalized to a probability relative to the group of
all women. Or to make it clearer, suppose that there's a group of 641
women with breast cancer and a positive mammography within a total
sample group of 89,031 women. 641 is the absolute quantity. If you
pick out a random woman from the entire sample, then the probability
you'll pick a woman with breast cancer and a positive mammography is
p(Q&P), or 0.72% (in this example).
On the other hand, p(Q|P) is the proportion of things that have property
Q and property P within all things that have P; i.e., the proportion of
women with breast cancer and a positive mammography within the
group of all women with positive mammographies. If there are 641
women with breast cancer and positive mammographies, 7915 women
with positive mammographies, and 89,031 women, then p(Q&P) is the
probability of getting one of those 641 women if you're picking at
random from the entire group of 89,031, while p(Q|P) is the probability
of getting one of those 641 women if you're picking at random from the
smaller group of 7915.
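In the sample-group numbers, as a sketch:

both = 641         # breast cancer and positive mammography
positives = 7915   # all positive mammographies
total = 89031      # the entire sample

print(round(both / total, 4))       # 0.0072 -> p(Q&P), picking from everyone
print(round(both / positives, 4))   # 0.081  -> p(Q|P), picking from positives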
In a sense, p(Q|P) really means p(Q&P|P), but specifying the extra P all the time would be redundant. You already know it has property
P, so the property you're investigating is Q - even though you're looking
at the size of group Q&P within group P, not the size of group Q within
group P (which would be nonsense). This is what it means to take the
property on the right-hand side as given; it means you know you're
working only within the group of things that have property P. When
you constrict your focus of attention to see only this smaller group,
many other probabilities change. If you're taking P as given, then
p(Q&P) equals just p(Q) - at least, relative to the group P. The old
p(Q), the frequency of "things that have property Q within the entire
sample", is revised to the new frequency of "things that have property
Q within the subsample of things that have property P". If P is given, if
P is our entire world, then looking for Q&P is the same as looking for
just Q.
If you constrict your focus of attention to only the population of eggs
that are painted blue, then suddenly "the probability that an egg
contains a pearl" becomes a different number; this proportion is
different for the population of blue eggs than the population of all eggs.
The given, the property that constricts our focus of attention, is always
on the right side of p(Q|P); the P becomes our world, the entire thing
we see, and on the other side of the "given" P always has probability 1
- that is what it means to take P as given. So p(Q|P) means "If P has
probability 1, what is the probability of Q?" or "If we constrict our
attention to only things or events where P is true, what is the
probability of Q?" Q, on the other side of the given, is not certain - its
probability may be 10% or 90% or any other number. So when you
use Bayes' Theorem, and you write the part on the left side as p(A|X) - how to update the probability of A after seeing X, the new probability of
A given that we know X, the degree to which X implies A - you can tell
that X is always the observation or the evidence, and A is the property
being investigated, the thing you want to know about.
The right side of Bayes' Theorem is derived from the left side through
these steps:
p(A|X) = p(A|X)

p(A|X) = p(X&A) / p(X)

p(A|X) = p(X&A) / [p(X&A) + p(X&~A)]

p(A|X) = p(X|A)*p(A) / [p(X|A)*p(A) + p(X|~A)*p(~A)]
The step from p(A|X) to p(X&A) / p(X) says that p(A|X) is a single number, the normalized probability or frequency of A within the
subgroup X. p(X&A)/p(X) are usually the percentage frequencies of
X&A and X within the entire sample, but the calculation also works if
X&A and X are absolute numbers of people, events, or things.
p(cancer|positive) is a single percentage/frequency/probability,
always between 0 and 1. (positive&cancer)/(positive) can
be measured either in probabilities, such as 0.008/0.103, or it might be
expressed in groups of women, for example 194/2494. As long as
both the numerator and denominator are measured in the same units,
it should make no difference.
Going from p(X) in the denominator to p(X&A)+p(X&~A) is a very
straightforward step whose main purpose is as a stepping stone to the
last equation. However, one common arithmetical mistake in Bayesian
calculations is to divide p(X&A) by p(X&~A), instead of dividing
p(X&A) by [p(X&A) + p(X&~A)]. For example, someone doing
the breast cancer calculation tries to get the posterior probability by
performing the math operation 80 / 950, instead of 80 / (80 + 950). I
like to think of this as a rose-flowers error. Sometimes if you show
young children a picture with eight roses and two tulips, they'll say that
the picture contains more roses than flowers. (Technically, this would
be called a class inclusion error.) You have to add the roses and the
tulips to get the number of flowers, which you need to find the
proportion of roses within the flowers. You can't find the proportion of
roses in the tulips, or the proportion of tulips in the roses. When you
look at the graph, the bottom bar consists of all the patients with
positive results. That's what the doctor sees - a patient with a positive
result. The question then becomes whether this is a healthy patient
with a positive result, or a cancerous patient with a positive result. To
figure the odds of that, you have to look at the proportion of cancerous
patients with positive results within all patients who have positive
results, because again, "a patient with a positive result" is what you
actually see. You can't divide 80 by 950 because that would mean you
were trying to find the proportion of cancerous patients with positive
results within the group of healthy patients with positive results; it's like
asking how many of the tulips are roses, instead of asking how many
of the flowers are roses. Imagine using the same method to find the
proportion of healthy patients. You would divide 950 by 80 and find that
1,187% of the patients were healthy. Or to be exact, you would find
that 1,187% of cancerous patients with positive results were healthy
patients with positive results.
The last step in deriving Bayes' Theorem is going from p(X&A) to
p(X|A)*p(A), in both the numerator and the denominator, and from
p(X&~A) to p(X|~A)*p(~A), in the denominator.
Why? Well, one answer is because p(X|A), p(X|~A), and p(A)
correspond to the initial information given in all the story problems. But
why were the story problems written that way?
Because in many cases, p(X|A), p(X|~A), and p(A) are what we actually know; and this in turn happens because p(X|A) and p(X|~A) are
often the quantities that directly describe causal relations, with the
other quantities derived from them and p(A) as statistical relations. For
example, p(X|A), the implication from A to X, where A is what we want
to know and X is our way of observing it, corresponds to the implication
from a woman having breast cancer to a positive mammography. This
is not just a statistical implication but a direct causal relation; a
woman gets a positive mammography because she has breast
cancer. The mammography is designed to detect breast cancer, and it
is a fact about the physical process of the mammography exam that it
has an 80% probability of detecting breast cancer. As long as the
design of the mammography machine stays constant, p(X|A) will stay
at 80%, even if p(A) changes - for example, if we screen a group of women with other risk factors, so that the prior frequency of women
with breast cancer is 10% instead of 1%. In this case, p(X&A) will
change along with p(A), and so will p(X), p(A|X), and so on; but p(X|A)
stays at 80%, because that's a fact about the mammography exam
itself. (Though you do need to test this statement before relying on it;
it's possible that the mammography exam might work better on some
forms of breast cancer than others.) p(X|A) is one of the simple facts
from which complex facts like p(X&A) are constructed; p(X|A) is an
elementary causal relation within a complex system, and it has a
direct physical interpretation. This is why Bayes' Theorem has the
form it does; it's not for solving math brainteasers, but for reasoning
about the physical universe.
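A sketch of that invariance - the causal fact p(X|A) = 80% stays put while the prior, and with it the posterior, moves:

def posterior(p_a, p_x_given_a=0.80, p_x_given_not_a=0.096):
    joint = p_x_given_a * p_a
    return joint / (joint + p_x_given_not_a * (1 - p_a))

for p_a in (0.01, 0.10):                  # ordinary vs. higher-risk screening
    print(p_a, round(posterior(p_a), 3))  # 0.01 -> 0.078, 0.10 -> 0.481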
Once the derivation is finished, all the implications on the right side of
the equation are of the form p(X|A) or p(X|~A), while the implication
on the left side is p(A|X). As long as you remember this and you get
the rest of the equation right, it shouldn't matter whether you happened
to start out with p(A|X) or p(X|A) on the left side of the equation, as long
as the rules are applied consistently - if you started out with the
direction of implication p(X|A) on the left side of the equation, you
would need to end up with the direction p(A|X) on the right side of the
equation. This, of course, is just changing the variable labels; the point
is to remember the symmetry, in order to remember the structure of
Bayes' Theorem.
The symmetry arises because the elementary causal relations are
generally implications from facts to observations, i.e., from breast
cancer to positive mammography. The elementary steps in reasoning
are generally implications from observations to facts, i.e., from a
positive mammography to breast cancer. The left side of Bayes'
Theorem is an elementary inferential step from the observation of
positive mammography to the conclusion of an increased probability of
breast cancer. Implication is written right-to-left, so we write
p(cancer|positive) on the left side of the equation. The right
side of Bayes' Theorem describes the elementary causal steps - for
example, from breast cancer to a positive mammography - and so the
implications on the right side of Bayes' Theorem take the form
p(positive|cancer) or p(positive|~cancer).
And that's Bayes' Theorem. Rational inference on the left end,
physical causality on the right end; an equation with mind on one side
and reality on the other. Remember how the scientific method turned
out to be a special case of Bayes' Theorem? If you wanted to put it
poetically, you could say that Bayes' Theorem binds reasoning into the
physical universe.
Okay, we're done.
Further Reading:
If you liked An Intuitive Explanation of Bayesian Reasoning, you may also
wish to read A Technical Explanation of Technical Explanation by the
same author, which goes into greater detail on the application of Bayescraft
to human rationality and the philosophy of science. You may also enjoy the
Twelve Virtues of Rationality and The Simple Truth.
Other authors:
E. T. Jaynes: Probability Theory With Applications in Science and
Engineering (full text online). Theory and applications for Bayes' Theorem
and Bayesian reasoning. See also Jaynes's magnum opus, Probability
Theory: The Logic of Science.
D. Kahneman, P. Slovic and A. Tversky, eds, Judgment under uncertainty:
Heuristics and biases. If it seems to you like human thinking often isn't
Bayesian... you're not wrong. This terrifying volume catalogues some of the
blatant searing hideous gaping errors that pop up in human cognition. See
also this forthcoming book chapter for a summary of some better-known
biases.
Bellhouse, D.R.: The Reverend Thomas Bayes FRS: a Biography to
Celebrate the Tercentenary of his Birth. A more "traditional" account of
Bayes's life.
Google Directory for Bayesian analysis (courtesy of the Open Directory
Project).