Unit 5
ARTIFICIAL INTELLIGENCE
Chapter-I: Uncertainty: Acting under Uncertainty, Basic Probability Notation, Inference Using
Full Joint Distributions, Independence, Bayes’ Rule and Its Use
Chapter-II: Probabilistic Reasoning: Representing Knowledge in an Uncertain Domain, The
Semantics of Bayesian Networks, Efficient Representation of Conditional Distributions,
Approximate Inference in Bayesian Networks, Relational and First-Order Probability, Other
Approaches to Uncertain Reasoning; Dempster-Shafer theory.
Causes of uncertainty:
The following are some leading causes of uncertainty in the real world:
1. Information from unreliable sources
2. Experimental errors
3. Equipment faults
4. Temperature variation
5. Climate change
Probabilistic reasoning:
Probabilistic reasoning is a way of knowledge representation in which we apply the concept of probability to indicate the uncertainty in knowledge. In probabilistic reasoning, we combine probability theory with logic to handle uncertainty.
We use probability in probabilistic reasoning because it provides a way to handle the uncertainty that results from laziness and ignorance.
In the real world, there are many scenarios where the certainty of something is not confirmed, such as "It will rain today," "the behaviour of someone in some situation," or "a match between two teams or two players." These are probable sentences for which we can assume an outcome but cannot be sure about it, so here we use probabilistic reasoning.
Need of probabilistic reasoning in AI:
o When there are unpredictable outcomes.
o When the specifications or possibilities of predicates become too large to handle.
o When an unknown error occurs during an experiment.
In probabilistic reasoning, there are two ways to solve problems with uncertain knowledge:
o Bayes' rule
o Bayesian Statistics
As probabilistic reasoning uses probability and related terms, so before understanding probabilistic
reasoning, let's understand some common terms:
Probability: Probability can be defined as the chance that an uncertain event will occur. It is the numerical measure of the likelihood that an event will occur. The value of a probability always lies between 0 and 1, where 0 and 1 represent the ideal certainties (the impossible and the certain event).
For any event A, the probabilities of A and its complement sum to one:
P(¬A) + P(A) = 1
Conditional probability: If event B is known to have occurred and we need the probability of A given B, it is computed as:
P(A|B) = P(A⋀B) / P(B)
This can be explained using a Venn diagram: when B is the occurred event, the sample space is reduced to set B, and we can calculate event A given that event B has already occurred by dividing the probability of P(A⋀B) by P(B).
Example:
In a class, 70% of the students like English and 40% of the students like both English and mathematics. What percentage of the students who like English also like mathematics?
Solution:
Let A be the event that a student likes Mathematics and B be the event that a student likes English.
P(A|B) = P(A⋀B) / P(B) = 0.4 / 0.7 ≈ 0.57
Hence, about 57% of the students who like English also like Mathematics.
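The arithmetic above can be checked with a few lines of Python, a minimal sketch of the conditional-probability formula:

```python
# P(Math | English) = P(Math AND English) / P(English)
p_english = 0.70     # P(B): student likes English
p_both = 0.40        # P(A AND B): student likes both subjects

p_math_given_english = p_both / p_english
print(f"{p_math_given_english:.2%}")   # prints "57.14%"
```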
Bayes' theorem:
Example: If cancer corresponds to one's age, then by using Bayes' theorem, we can determine the probability of cancer more accurately with the help of age.
Bayes' theorem can be derived using the product rule and the conditional probability of event A with known event B. From the product rule we can write:
P(A⋀B) = P(A|B) P(B)
Similarly, with known event A:
P(A⋀B) = P(B|A) P(A)
Equating the two and dividing by P(B), we get:
P(A|B) = P(B|A) P(A) / P(B) ..........(a)
The above equation (a) is called Bayes' rule or Bayes' theorem. This equation is the basis of most modern AI systems for probabilistic inference.
It shows the simple relationship between joint and conditional probabilities. Here, P(A|B) is known as the posterior, which we need to calculate; it is read as the probability of hypothesis A given that we have observed evidence B.
P(B|A) is called the likelihood: assuming the hypothesis is true, we calculate the probability of the evidence.
P(A) is called the prior probability: the probability of the hypothesis before considering the evidence.
P(B) is called the marginal probability: the pure probability of the evidence.
In equation (a), in general, we can write P(B) = Σi P(Ai) * P(B|Ai), hence Bayes' rule can be written as:
P(Ai|B) = P(B|Ai) P(Ai) / Σk P(B|Ak) P(Ak)
where A1, A2, A3, ........, An is a set of mutually exclusive and exhaustive events.
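As a sketch of this form of the rule, the snippet below computes the posterior over a partition of three events; the priors and likelihoods are illustrative values, not from the text:

```python
# Bayes' rule over mutually exclusive, exhaustive events A1..An:
# P(Ai | B) = P(B | Ai) P(Ai) / sum_k P(B | Ak) P(Ak)
priors = [0.5, 0.3, 0.2]          # P(A1), P(A2), P(A3); must sum to 1
likelihoods = [0.9, 0.5, 0.1]     # P(B | Ai) for each event

p_b = sum(p * l for p, l in zip(priors, likelihoods))   # marginal P(B)
posteriors = [p * l / p_b for p, l in zip(priors, likelihoods)]

print(round(p_b, 4))                       # 0.62
print([round(p, 4) for p in posteriors])   # [0.7258, 0.2419, 0.0323]
```

Note that the posteriors always sum to 1, since the marginal P(B) is exactly the normalising constant.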
Example-1:
Question: what is the probability that a patient has diseases meningitis with a stiff neck?
Given Data:
A doctor is aware that the disease meningitis causes a patient to have a stiff neck, and this occurs 80% of the time. He is also aware of some more facts, which are given as follows:
o The known probability that a patient has meningitis is 1/30,000.
o The known probability that a patient has a stiff neck is 2%.
Mr. Mohammed Afzal, Asst. Professor in AIML
Mob: +91-8179700193, Email: [email protected]
Let a be the proposition that the patient has a stiff neck and b be the proposition that the patient has meningitis. We can then calculate the following:
P(a|b) = 0.8
P(b) = 1/30000
P(a) = 0.02
Applying Bayes' rule:
P(b|a) = P(a|b) P(b) / P(a) = (0.8 × 1/30000) / 0.02 = 1/750 ≈ 0.00133
Hence, we can assume that 1 patient out of 750 patients with a stiff neck has meningitis.
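The same calculation in Python, directly plugging the given quantities into Bayes' rule:

```python
# P(meningitis | stiff neck) = P(stiff neck | meningitis) * P(meningitis) / P(stiff neck)
p_a_given_b = 0.8        # P(a|b): stiff neck given meningitis
p_b = 1 / 30000          # P(b): prior probability of meningitis
p_a = 0.02               # P(a): probability of a stiff neck

p_b_given_a = p_a_given_b * p_b / p_a
print(p_b_given_a)             # ≈ 0.001333, i.e. 1/750
print(round(1 / p_b_given_a))  # 750
```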
Example-2:
Question: From a standard deck of playing cards, a single card is drawn. The probability that the
card is king is 4/52, then calculate posterior probability P(King|Face), which means the drawn face
card is a king card.
Solution:
Every king is a face card, so P(Face|King) = 1. There are 3 face cards per suit (jack, queen, king), so P(Face) = 12/52. Applying Bayes' rule:
P(King|Face) = P(Face|King) P(King) / P(Face) = (1 × 4/52) / (12/52) = 1/3
Bayesian Belief Network:
"A Bayesian network is a probabilistic graphical model which represents a set of variables and their conditional dependencies using a directed acyclic graph."
It is also called a Bayes network, belief network, decision network, or Bayesian model.
Bayesian networks are probabilistic, because these networks are built from a probability
distribution, and also use probability theory for prediction and anomaly detection.
Real world applications are probabilistic in nature, and to represent the relationship between multiple
events, we need a Bayesian network. It can also be used in various tasks including prediction,
anomaly detection, diagnostics, automated insight, reasoning, time series prediction,
and decision making under uncertainty.
A Bayesian network can be used for building models from data and experts' opinions, and it consists of two parts:
o Directed Acyclic Graph
o Table of conditional probabilities.
The generalized form of a Bayesian network that represents and solves decision problems under uncertain knowledge is known as an influence diagram.
A Bayesian network graph is made up of nodes and arcs (directed links), where:
o Each node corresponds to a random variable, and a variable can be continuous or discrete.
o Arcs or directed arrows represent the causal relationships or conditional probabilities between random variables. These directed links connect pairs of nodes in the graph. A link represents that one node directly influences the other node; if there is no directed link, the nodes are independent of each other.
In the above diagram, A, B, C, and D are random variables represented by the nodes of the network graph.
If we consider node B, which is connected with node A by a directed arrow, then node A is called the parent of node B.
Node C is independent of node A.
o The example below uses the standard burglary-alarm network: a burglary (B) or a minor earthquake (E) can set off an alarm (A), and two neighbours (D and S) may call when they hear the alarm.
o The network represents that our assumptions do not directly perceive the burglary and also do not notice the minor earthquake, and the neighbours do not confer before calling.
o The conditional distributions for each node are given as a conditional probabilities table, or CPT.
o Each row in the CPT must sum to 1 because all the entries in the table represent an exhaustive set of cases for the variable.
o In a CPT, a boolean variable with k boolean parents requires 2^k rows of probabilities. Hence, if there are two parents, the CPT will contain 4 probability values for the true case.
We can write the events of the problem statement in the form of probability: P[D, S, A, B, E]. We can rewrite this using the joint probability distribution (applying the product rule repeatedly, and then the conditional independences encoded by the network):
P[D, S, A, B, E] = P[D | S, A, B, E]. P[S, A, B, E]
= P[D | S, A, B, E]. P[S | A, B, E]. P[A, B, E]
= P[D | A]. P[S | A, B, E]. P[A, B, E]
= P[D | A]. P[S | A]. P[A | B, E]. P[B, E]
= P[D | A]. P[S | A]. P[A | B, E]. P[B | E]. P[E]
Let's take the observed probability for the Burglary and earthquake component:
P(B= True) = 0.002, which is the probability of burglary.
P(B= False)= 0.998, which is the probability of no burglary.
P(E= True)= 0.001, which is the probability of a minor earthquake
P(E= False)= 0.999, Which is the probability that an earthquake not occurred.
We can provide the conditional probabilities as per the below tables:
Conditional probability table for Alarm A:
The Conditional probability of Alarm A depends on Burglar and earthquake:
B E P(A= True) P(A= False)
True True 0.94 0.06
True False 0.95 0.05
False True 0.31 0.69
False False 0.001 0.999
From the formula of joint distribution, we can write the problem statement in the form of probability
distribution:
P(S, D, A, ¬B, ¬E) = P(S|A) * P(D|A) * P(A|¬B ⋀ ¬E) * P(¬B) * P(¬E)
= 0.75 * 0.91 * 0.001 * 0.998 * 0.999
= 0.00068045
(P(S|A) = 0.75 and P(D|A) = 0.91 come from the CPTs for S and D, which are not reproduced here.)
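The calculation can be reproduced directly from the network's tables. P(S|A) = 0.75 and P(D|A) = 0.91 are taken from the worked formula above, since the CPTs for S and D themselves are not shown in the text:

```python
# Joint probability via the chain-rule factorisation of the burglary network:
# P(S, D, A, ¬B, ¬E) = P(S|A) P(D|A) P(A|¬B,¬E) P(¬B) P(¬E)
p_burglary = 0.002
p_earthquake = 0.001
p_alarm = {                     # P(Alarm=True | Burglary, Earthquake)
    (True, True): 0.94,
    (True, False): 0.95,
    (False, True): 0.31,
    (False, False): 0.001,
}
p_s_given_a = 0.75              # P(S=True | Alarm=True), from the worked formula
p_d_given_a = 0.91              # P(D=True | Alarm=True), from the worked formula

joint = (p_s_given_a * p_d_given_a * p_alarm[(False, False)]
         * (1 - p_burglary) * (1 - p_earthquake))
print(round(joint, 8))          # 0.00068045
```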
Hence, a Bayesian network can answer any query about the domain by using Joint distributions.
1. Identify a set of random variables that describe the given problem domain
2. Choose an ordering for them: X1, ..., Xn
3. for i=1 to n do
a) Add a new node for Xi to the net
b) Set Parents(Xi) to be the minimal set of already added nodes such that we have conditional
independence of Xi and all other members of {X1, ..., Xi-1} given Parents(Xi)
c) Add a directed arc from each node in Parents(Xi) to Xi
d) If Xi has at least one parent, then define a conditional probability table at Xi: P(Xi=x | possible
assignments to Parents(Xi)). Otherwise, define a prior probability at Xi: P(Xi)
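The construction loop above can be sketched as a small data structure. The conditional-independence judgment in step (b) is made by the modeller; the code only enforces the topological ordering and stores each node's parents and CPT. The structure is that of the "home domain" example later in the chapter, and any probability not used in that example's calculation is a placeholder:

```python
# A minimal sketch of the network-construction loop.
network = {}   # node name -> {"parents": [...], "cpt": {...}}

def add_node(name, parents, cpt):
    """Steps (a)-(d): add a node with its parent set and CPT (a prior if no parents)."""
    # Parents must already be in the net, which enforces a topological order.
    assert all(p in network for p in parents), "add parents before children"
    network[name] = {"parents": list(parents), "cpt": cpt}

# Home-domain structure; values marked "placeholder" are not given in the text.
add_node("O", [], {(): 0.6})                          # prior P(O)
add_node("B", [], {(): 0.3})                          # prior P(B)
add_node("L", ["O"], {(True,): 0.95,                  # placeholder
                      (False,): 0.6})                 # P(L | ~O) from the text
add_node("D", ["O", "B"], {(True, True): 0.97,        # placeholder
                           (True, False): 0.90,       # placeholder
                           (False, True): 0.10,       # P(D | ~O, B) from the text
                           (False, False): 0.30})     # placeholder
add_node("H", ["D"], {(True,): 0.3,                   # P(H | D) from the text
                      (False,): 0.01})                # placeholder
print(list(network))   # topological order: ['O', 'B', 'L', 'D', 'H']
```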
There is not, in general, a unique Bayesian net for a given set of random variables. But all of them represent the same information, in the sense that every entry in the joint probability distribution can be computed from any net constructed.
The "best" net is constructed if in Step 2 the variables are topologically sorted first. That is, each variable
comes before all of its children. So, the first nodes should be the roots, then the nodes they directly
influence, and so on.
The algorithm will not construct a net that is illegal in the sense of violating the rules of probability.
Example:
Consider the problem domain in which when I go home I want to know if someone in my family is home
before I go in. Let's say I know the following information:
1) When my wife leaves the house, she often (but not always) turns on the outside light. (She also sometimes turns the light on when she's expecting a guest.)
2) When nobody is home, the dog is often left outside.
Given this information, define the following five Boolean random variables:
O: Everyone is Out of the house
L: The Light is on
D: The Dog is outside
B: The dog has Bowel troubles
H: I can Hear the dog barking
From this information, the following direct causal influences seem appropriate:
1. H is only directly influenced by D. Hence H is conditionally independent of L, O and B given D.
2. D is only directly influenced by O and B. Hence D is conditionally independent of L given O and B.
3. L is only directly influenced by O. Hence L is conditionally independent of D, H and B given O.
4. O and B are independent.
Based on the above, the following is a Bayesian Net that represents these direct causal relationships (though
it is important to note that these causal connections are not absolute, i.e., they are not implications):
Next, the following quantitative information is added to the net; this information is usually given by an
expert or determined empirically from training data.
o For each root node (i.e., node without any parents), the prior probability of the random variable
associated with the node is determined and stored there
o For each non-root node, the conditional probabilities of the node's variable given all possible
combinations of its immediate parent nodes are determined. This results in a conditional probability
table (CPT) at each non-root node.
Doing this for the above example, we get the following Bayesian Net:
Notice that in this example, a total of 10 probabilities are computed and stored in the net, whereas the full joint probability distribution would require a table containing 2^5 = 32 probabilities. The reduction is due to the conditional independence of many variables.
Two variables that are not directly connected by an arc can still affect each other. For example, B
and H are not (unconditionally) independent, but H does not directly depend on B.
Given a Bayesian Net, we can easily read off the conditional independence relations that are
represented. Specifically, each node, V, is conditionally independent of all nodes that are not
descendants of V, given V's parents. For example, in the above example H is conditionally
independent of B, O, and L given D. So, P(H | B, D, O, L) = P(H | D).
To illustrate how a Bayesian Net can be used to compute an arbitrary value in the joint probability
distribution, consider the Bayesian Net shown above for the "home domain."
Goal: Compute P(B, ~O, D, ~L, H)
P(B, ~O, D, ~L, H) = P(H, ~L, D, ~O, B)
= P(H | ~L, D, ~O, B) * P(~L, D, ~O, B) (by the Product Rule)
= P(H|D) P(~L|~O) P(D, ~O, B) (by conditional independence: H depends only on D, and L depends only on O)
= P(H|D) P(~L|~O) P(D|~O, B) P(~O) P(B) (by the Product Rule and the independence of O and B)
= (.3)(1 - .6)(.1)(1 - .6)(.3)
= 0.00144
where all of the numeric values are available directly in the Bayesian Net (since P(~A|B) = 1 - P(A|B)).
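The same numbers can be plugged in directly; only the entries used in the calculation are needed:

```python
# P(B, ~O, D, ~L, H) = P(H|D) P(~L|~O) P(D|~O,B) P(~O) P(B)
p_o = 0.6                 # P(O), so P(~O) = 0.4
p_b = 0.3                 # P(B)
p_l_given_not_o = 0.6     # P(L | ~O), so P(~L | ~O) = 0.4
p_d_given_not_o_b = 0.1   # P(D | ~O, B)
p_h_given_d = 0.3         # P(H | D)

joint = (p_h_given_d * (1 - p_l_given_not_o)
         * p_d_given_not_o_b * (1 - p_o) * p_b)
print(round(joint, 6))    # 0.00144
```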
Likelihood weighting
Likelihood weighting avoids the inefficiency of rejection sampling by generating only events that are
consistent with the evidence e. It is a particular instance of the general statistical technique of importance
sampling, tailored for inference in Bayesian networks.
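A minimal sketch of the idea on a tiny two-node network (Burglary → Alarm, using the burglary and alarm numbers from earlier in the chapter, with the earthquake omitted for brevity). We estimate P(Burglary | Alarm = true); the exact answer under these numbers is about 0.653:

```python
import random

P_B = 0.002                                 # prior P(Burglary)
P_A_GIVEN_B = {True: 0.94, False: 0.001}    # P(Alarm=True | Burglary)

def weighted_sample():
    """Sample non-evidence variables; weight by the likelihood of the evidence."""
    b = random.random() < P_B     # sample Burglary from its prior
    w = P_A_GIVEN_B[b]            # evidence Alarm=True is fixed, contributes weight
    return b, w

def estimate_p_b_given_alarm(n=200_000):
    num = den = 0.0
    for _ in range(n):
        b, w = weighted_sample()
        den += w
        if b:
            num += w
    return num / den              # weighted fraction of samples with Burglary=True

random.seed(0)
print(round(estimate_p_b_given_alarm(), 3))   # close to 0.653
```

Unlike rejection sampling, no sample is ever discarded: events inconsistent with the evidence simply receive a small weight.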
The Dempster–Shafer theory DEMPSTER–SHAFER is designed to deal with the distinction between
uncertainty and ignorance. Rather than computing the probability of a proposition, it computes the
probability that the evidence supports the proposition. This measure of belief is called a belief function,
written Bel(X).
The mathematical formulation of Dempster–Shafer theory is similar to that of probability theory; the main difference is that, instead of assigning probabilities to possible worlds, the theory assigns masses to sets of possible worlds, that is, to events.
The masses still must add to 1 over all possible events. Bel(A) is defined to be the sum of masses for all events that are subsets of (i.e., that entail) A, including A itself. With this definition, Bel(A) and Bel(¬A) sum to at most 1, and the gap (the interval between Bel(A) and 1 − Bel(¬A)) is often interpreted as bounding the probability of A.
As with default reasoning, there is a problem in connecting beliefs to actions. Whenever there is a gap
in the beliefs, then a decision problem can be defined such that a Dempster–Shafer system is unable to
make a decision.
Bel(A) should be interpreted not as a degree of belief in A but as the probability assigned to all the
possible worlds (now interpreted as logical theories) in which A is provable.
Example:
Let us consider a room where four persons, A, B, C, and D, are present. Suddenly the lights go out, and when they come back on, B has died from a knife stabbed into his back. No one entered the room, no one left the room, and B did not commit suicide. We have to find out who the murderer is. Considering the remaining three persons, the possibilities are:
o Either {A}, {C}, or {D} killed him.
o Two of them together killed him: {A, C}, {C, D}, or {A, D}.
o Or all three of them killed him, i.e., {A, C, D}.
o None of them killed him: {∅} (let us say).
These are the possible hypotheses, among which we can find the murderer by measuring belief and plausibility.
Using the above example, we can define:
Set of possible conclusions (P): {p1, p2, ...., pn}, where P is the set of possible conclusions and must be exhaustive, meaning at least one pi must be true, and all pi must be mutually exclusive. The power set of P contains 2^n elements, where n is the number of elements in P.
For example:
If P = {a, b, c}, then the power set is {∅, {a}, {b}, {c}, {a, b}, {b, c}, {a, c}, {a, b, c}}, i.e., 2^3 = 8 elements.
Mass function m(K): a mass such as m({K or B}) means there is evidence for {K or B} that cannot be divided among the more specific beliefs K and B alone.
Belief in K: The belief in element K of the power set is the sum of the masses of the elements which are subsets of K. For example, let K = {a, b, c}. Then:
Bel(K) = m(a) + m(b) + m(c) + m(a, b) + m(a, c) + m(b, c) + m(a, b, c)
Plausibility of K: It is the sum of the masses of the sets that intersect with K:
Pl(K) = m(a) + m(b) + m(c) + m(a, b) + m(b, c) + m(a, c) + m(a, b, c)
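These two sums can be computed mechanically from a mass assignment. In the sketch below the masses are illustrative, not from the text: a little evidence on each single suspect, with the rest of the mass left on the whole frame to represent ignorance:

```python
# Belief = sum of masses of subsets of K; Plausibility = sum of masses of sets
# that intersect K. Masses are keyed by frozensets and must sum to 1.
masses = {
    frozenset({"A"}): 0.2,
    frozenset({"C"}): 0.1,
    frozenset({"D"}): 0.1,
    frozenset({"A", "C", "D"}): 0.6,   # unassigned mass = ignorance
}

def bel(K):
    return sum(m for s, m in masses.items() if s <= K)   # s is a subset of K

def pl(K):
    return sum(m for s, m in masses.items() if s & K)    # s intersects K

A = frozenset({"A"})
print(round(bel(A), 3))                      # 0.2
print(round(pl(A), 3))                       # 0.8
print(round(bel(frozenset({"C", "D"})), 3))  # 0.2, so Bel(A) + Bel(not A) = 0.4 <= 1
```

The gap between Bel(A) = 0.2 and Pl(A) = 0.8 is exactly the interval discussed above: the evidence bounds the probability that A is the murderer without pinning it down.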