Making complex decisions

Chapter 17
• Sequential decision problems (Markov
Decision Process)
– Value iteration
– Policy iteration
Sequential Decisions
• Agent’s utility depends on a sequence of decisions
Markov Decision Process (MDP)
• Defined as a tuple: <S, A, M, R>
– S: State
– A: Action
– M: Transition function
• Table $M^a_{ij} = P(s_j \mid s_i, a)$: probability of reaching $s_j$ given action $a$ in state $s_i$
– R: Reward
• R(si, a) = cost or reward of taking action a in state si
• In our case R = R(si)
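The tuple above can be held in a small container. This is a minimal sketch, not the course's implementation: the class name `MDP`, its field names, and the two-state example are all hypothetical, and the transition table is stored as a dictionary keyed by (state, action).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = str
Action = str


@dataclass
class MDP:
    states: List[State]      # S
    actions: List[Action]    # A
    # M: transitions[(s_i, a)] maps each successor s_j to P(s_j | s_i, a)
    transitions: Dict[Tuple[State, Action], Dict[State, float]]
    rewards: Dict[State, float]   # R(s_i), reward depends on the state only

    def is_terminal(self, s: State) -> bool:
        # A state with no outgoing transitions has no successor.
        return all((s, a) not in self.transitions for a in self.actions)


# Hypothetical two-state example; names and numbers are illustrative only.
toy = MDP(
    states=["s0", "s1"],
    actions=["go", "stay"],
    transitions={
        ("s0", "go"): {"s1": 0.8, "s0": 0.2},
        ("s0", "stay"): {"s0": 1.0},
    },
    rewards={"s0": -0.04, "s1": 1.0},
)
```

Each row of the transition table should sum to 1, which is easy to check on the example.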

• Choose a sequence of actions (not just one decision or one action)

– Utility based on a sequence of decisions
• Inputs:
– Initial state $s_0$
– Action model
– Reward $R(s_i)$ collected in each state $s_i$
• A state is terminal if it has no successor
• Starting at $s_0$, the agent keeps executing actions until it reaches a terminal state
• Its goal is to maximize the expected sum of rewards collected (additive rewards)
• Additive rewards: $U(s_0, s_1, s_2, \ldots) = R(s_0) + R(s_1) + R(s_2) + \ldots$
• Discounted rewards: $U(s_0, s_1, s_2, \ldots) = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \ldots$ $(0 \le \gamma \le 1)$
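The two utility definitions above differ only in the discount factor, so one function can cover both. This is a small sketch; the function name `discounted_utility` and the reward sequence are illustrative assumptions.

```python
from typing import Sequence


def discounted_utility(rewards: Sequence[float], gamma: float = 1.0) -> float:
    """Return R(s0) + gamma*R(s1) + gamma^2*R(s2) + ... for a finite sequence."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# Hypothetical reward sequence: three steps costing 0.04 each, then +1.
rewards = [-0.04, -0.04, -0.04, 1.0]
additive = discounted_utility(rewards, gamma=1.0)    # gamma = 1 gives the plain sum
discounted = discounted_utility(rewards, gamma=0.9)  # later rewards count for less
```

With gamma = 1 the result is the additive utility; any smaller gamma shrinks the weight of later rewards.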
Dealing with infinite sequences
of actions
• Use discounted rewards
• The environment contains terminal states
that the agent is guaranteed to reach
• Infinite sequences are compared by their
average rewards
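A short worked bound suggests why discounting keeps the utility of an infinite sequence finite. It assumes rewards are bounded in absolute value by some $R_{\max}$ (a symbol not used in the slides) and that $\gamma < 1$:

$$
\left| U(s_0, s_1, s_2, \ldots) \right| = \left| \sum_{t=0}^{\infty} \gamma^t R(s_t) \right| \le \sum_{t=0}^{\infty} \gamma^t R_{\max} = \frac{R_{\max}}{1 - \gamma}
$$

For example, with $R_{\max} = 1$ and $\gamma = 0.9$, no sequence can have utility larger than 10 in absolute value.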
• Fully observable environment
• Non deterministic actions
– intended effect: 0.8
– not the intended effect: 0.2

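[Figure: 4×3 grid world with columns 1 to 4 and rows 1 to 3. The start cell is in row 1; the terminal cells are marked +1 (row 3) and -1 (row 2). An action moves in the intended direction with probability 0.8 and slips to either side with probability 0.1 each.]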
MDP of example
• S: State of the agent on the grid (4 × 3)
– Note that a cell is denoted by (x, y)
• A: Actions of the agent, i.e., N, E, S, W

• M: Transition function
– E.g., M((4,2) | (3,2), N) = 0.1
– E.g., M((3,3) | (3,2), N) = 0.8
– (Robot movement, uncertainty of another agent’s actions,…)

• R: Reward (more comments on the reward function later)

– R(1,1) = -1/25, R(1,2) = -1/25, …
– R(4,3) = +1, R(4,2) = -1

• γ = 1
• Policy is a mapping from states to actions.
• Given a policy, one may calculate the expected utility of the
series of actions produced by that policy.

[Figure: Example of policy on the 4×3 grid, with terminal cells +1 and -1; the arrows are not recoverable]
• The goal: Find an optimal policy $\pi$, one that would produce
maximal expected utility.
Utility of a State
Given a state s, we can measure its expected utility under
any policy $\pi$.
We assume the agent starts in state s and define $S_t$ (a random variable)
as the state reached at step t.
Obviously $S_0 = s$.

The expected utility of a state s given a policy $\pi$ is

$U^\pi(s) = E\left[\sum_t \gamma^t R(S_t)\right]$

$\pi^*_s$ will be a policy that maximizes the expected utility of state s.

• Dynamic programming
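Separating the first term of the sum gives a recursive form of the same quantity, which is what makes a dynamic-programming solution possible. The sketch below assumes the transition probabilities $P(s_j \mid s, a)$ of the MDP and writes $\pi(s)$ for the action the policy picks in state s:

$$
U^\pi(s) = R(s) + \gamma\, E\left[\sum_{t \ge 0} \gamma^t R(S_{t+1})\right] = R(s) + \gamma \sum_j P(s_j \mid s, \pi(s))\, U^\pi(s_j)
$$

The second equality relies on the process being stationary and Markov, so the expected utility from the successor state onward is again $U^\pi$.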
Optimal Policy
• A policy is a function that maps each state s into the
action to execute if s is reached

• The optimal policy $\pi^*$ is the policy that always leads
to maximizing the expected sum of rewards
collected in future states
(Maximum Expected Utility principle)

$\pi^*(S_i) = \arg\max_a \sum_j P(S_j \mid S_i, a)\, U(S_j)$
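The argmax rule above can be sketched as a policy-extraction routine. The function names and the three-state example (states `a`, `b`, `goal`) are hypothetical, and the utilities in `U` are illustrative numbers rather than values from the slides.

```python
def expected_utility(s, a, P, U):
    """Sum over successors s_j of P(s_j | s, a) * U(s_j)."""
    return sum(p * U[s_next] for s_next, p in P[(s, a)].items())


def greedy_policy(states, actions, P, U):
    """Pick, in every non-terminal state, the action with the highest expected utility."""
    policy = {}
    for s in states:
        available = [a for a in actions if (s, a) in P]
        if available:  # terminal states have no action to choose
            policy[s] = max(available, key=lambda a: expected_utility(s, a, P, U))
    return policy


# Hypothetical chain a - b - goal; each move succeeds with probability 0.8.
states = ["a", "b", "goal"]
actions = ["left", "right"]
P = {
    ("a", "left"): {"a": 1.0},
    ("a", "right"): {"b": 0.8, "a": 0.2},
    ("b", "left"): {"a": 0.8, "b": 0.2},
    ("b", "right"): {"goal": 0.8, "b": 0.2},
}
U = {"a": 0.5, "b": 0.7, "goal": 1.0}
policy = greedy_policy(states, actions, P, U)
```

In state `a`, moving right has expected utility 0.8 · 0.7 + 0.2 · 0.5 = 0.66 against 0.5 for moving left, so the extracted policy moves right.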

Optimal policy and state utilities
for MDP

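[Figure, left panel: optimal policy on the 4×3 grid; the arrows are not recoverable. Right panel: state utilities, reproduced below.]

      x=1    x=2    x=3    x=4
y=3   0.812  0.868  0.912  +1
y=2   0.762  -      0.660  -1
y=1   0.705  0.655  0.611  0.388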

• What will happen if the cost of a step is very low?

Finding the optimal policy
• Solution must satisfy both equations
– $\pi^*(S_i) = \arg\max_a \sum_j P(S_j \mid S_i, a)\, U(S_j)$ (1)
– $U(S_i) = R(S_i) + \gamma \max_a \sum_j P(S_j \mid S_i, a)\, U(S_j)$ (2)
• Value iteration:
– start with a guess of a utility function and use (2) to
get a better estimate
• Policy iteration:
– start with a fixed policy, and solve for the exact
utilities of states; use equation (1) to find an updated policy.
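As a sanity check of equation (2), the utilities listed earlier for the 4×3 grid can be plugged in at state (1,1), using U(1,1) = 0.705, U(1,2) = 0.762 and U(2,1) = 0.655. The sketch takes γ = 1, R(1,1) = -1/25 = -0.04, and a slip of 0.1 to each side of the intended direction. It also assumes that bumping into the grid boundary leaves the agent where it is, which the slides do not state. For action N:

$$
\sum_j P(S_j \mid (1,1), N)\, U(S_j) = 0.8 \cdot 0.762 + 0.1 \cdot 0.705 + 0.1 \cdot 0.655 = 0.7456
$$

Under the same assumptions the other three actions give 0.7107 (W), 0.7000 (S) and 0.6707 (E), so N attains the maximum and

$$
U(1,1) = -0.04 + 1 \cdot 0.7456 = 0.7056 \approx 0.705
$$

which agrees with the listed value up to rounding.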
Value iteration
function Value-Iteration(MDP) returns a utility function
  inputs: P(Sj | Si, a), a transition model
          R, a reward function on states
          γ, discount parameter
  local variables: U, utility function, initially identical to R
                   U', utility function, initially identical to R
  repeat
    U ← U'
    for each state Si do
      U'[Si] ← R[Si] + γ max_a Σ_j P(Sj | Si, a) U(Sj)
  until Close-Enough(U, U')
  return U
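The pseudocode above translates fairly directly into Python. This is a sketch under stated assumptions, not a reference implementation: the transition model is a dictionary keyed by (state, action), states without outgoing transitions are treated as terminal and keep U(s) = R(s), and Close-Enough is taken to be a maximum change below `eps`. The three-state example is hypothetical.

```python
def value_iteration(states, actions, P, R, gamma, eps=1e-9, max_iters=10_000):
    """Repeat the update U'(s) = R(s) + gamma * max_a sum_j P(s_j | s, a) U(s_j)."""
    U_new = dict(R)                      # U' is initially identical to R
    for _ in range(max_iters):
        U = dict(U_new)                  # U <- U'
        for s in states:
            q = [sum(p * U[s2] for s2, p in P[(s, a)].items())
                 for a in actions if (s, a) in P]
            # Terminal states (no outgoing transitions) keep U(s) = R(s).
            U_new[s] = R[s] + gamma * max(q) if q else R[s]
        # Close-Enough: the largest change in this sweep is below eps.
        if max(abs(U_new[s] - U[s]) for s in states) < eps:
            break
    return U_new


# Hypothetical chain a - b - goal; each move succeeds with probability 0.8.
states = ["a", "b", "goal"]
actions = ["left", "right"]
P = {
    ("a", "left"): {"a": 1.0},
    ("a", "right"): {"b": 0.8, "a": 0.2},
    ("b", "left"): {"a": 0.8, "b": 0.2},
    ("b", "right"): {"goal": 0.8, "b": 0.2},
}
R = {"a": -0.04, "b": -0.04, "goal": 1.0}
U = value_iteration(states, actions, P, R, gamma=1.0)
```

For this example the fixed point can be solved by hand: U(b) = -0.04 + 0.8 + 0.2 U(b) gives U(b) = 0.95, and then U(a) = -0.04 + 0.8 · 0.95 + 0.2 U(a) gives U(a) = 0.9.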
Value iteration - convergence of
utility values
Policy iteration
function Policy-Iteration(MDP) returns a policy
  inputs: P(Sj | Si, a), a transition model
          R, a reward function on states
          γ, discount parameter
  local variables: U, utility function, initially identical to R
                   π, a policy, initially optimal with respect to U
  repeat
    U ← Value-Determination(π, U, MDP)
    unchanged? ← true
    for each state Si do
      if max_a Σ_j P(Sj | Si, a) U(Sj) > Σ_j P(Sj | Si, π(Si)) U(Sj) then
        π(Si) ← arg max_a Σ_j P(Sj | Si, a) U(Sj)
        unchanged? ← false
  until unchanged?
  return π
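A Python sketch of the same loop follows. It departs from the pseudocode in two ways, both of them assumptions made for brevity: Value-Determination is approximated by repeated sweeps of the fixed-policy update instead of solving the linear system exactly, and the initial policy is arbitrary (the first available action) rather than optimal with respect to U. The function names and the three-state example are hypothetical, and γ = 0.9 is used so that every policy has finite utilities.

```python
def evaluate_policy(policy, states, P, R, gamma, sweeps=200):
    """Approximate Value-Determination by iterating U(s) = R(s) + gamma * E[U(next)]."""
    U = dict(R)
    for _ in range(sweeps):
        U = {s: (R[s] + gamma * sum(p * U[s2] for s2, p in P[(s, policy[s])].items())
                 if s in policy else R[s])   # terminal states keep U(s) = R(s)
             for s in states}
    return U


def policy_iteration(states, actions, P, R, gamma):
    def q(s, a, U):
        return sum(p * U[s2] for s2, p in P[(s, a)].items())

    # Arbitrary starting policy: first available action in each non-terminal state.
    policy = {s: next(a for a in actions if (s, a) in P)
              for s in states if any((s, a) in P for a in actions)}
    while True:
        U = evaluate_policy(policy, states, P, R, gamma)
        unchanged = True
        for s in policy:
            best = max((a for a in actions if (s, a) in P), key=lambda a: q(s, a, U))
            # Switch only on a strict improvement (small tolerance for float noise).
            if q(s, best, U) > q(s, policy[s], U) + 1e-12:
                policy[s] = best
                unchanged = False
        if unchanged:
            return policy, U


# Hypothetical chain a - b - goal; each move succeeds with probability 0.8.
states = ["a", "b", "goal"]
actions = ["left", "right"]
P = {
    ("a", "left"): {"a": 1.0},
    ("a", "right"): {"b": 0.8, "a": 0.2},
    ("b", "left"): {"a": 0.8, "b": 0.2},
    ("b", "right"): {"goal": 0.8, "b": 0.2},
}
R = {"a": -0.04, "b": -0.04, "goal": 1.0}
policy, U = policy_iteration(states, actions, P, R, gamma=0.9)
```

Starting from "always move left", the loop first switches state `b` to `right`, then state `a`, and stops once a full pass changes nothing.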
Decision Theoretic Agent
• Agent that must act in uncertain
environments. Actions may be non-deterministic.
Stationarity and Markov
• A stationary process is one whose
dynamics, i.e., $P(X_t \mid X_{t-1}, \ldots, X_0)$ for $t > 0$, are
assumed not to change with time.
• The Markov assumption is that the
current state $X_t$ is dependent only on a
finite history of previous states. In this
class, we will only consider first-order
Markov processes, for which
$P(X_t \mid X_{t-1}, \ldots, X_0) = P(X_t \mid X_{t-1})$
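One consequence worth writing out: under the first-order Markov assumption, the joint distribution over a whole trajectory factors into one-step terms. This follows from the chain rule of probability and uses only the notation of this slide:

$$
P(X_0, X_1, \ldots, X_t) = P(X_0) \prod_{i=1}^{t} P(X_i \mid X_{i-1})
$$

If the process is also stationary, every factor in the product uses the same conditional distribution.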
Transition model and sensor model
• For first-order Markov processes, the laws
describing how the process state evolves with
time are contained entirely within the
conditional distribution $P(X_t \mid X_{t-1})$, which is
called the transition model for the process.
• We will also assume that the observable state
variables (evidence) $E_t$ are dependent only on
the state variables $X_t$. $P(E_t \mid X_t)$ is called the
sensor or observational model.
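One way to see how the two models work together is a single belief-update step: predict with the transition model, then weight by the sensor model and normalize. The sketch below is illustrative only; the function name `filter_step`, the two-state rain/umbrella example, and all of its numbers are assumptions, not taken from the slides.

```python
from typing import Dict

Dist = Dict[str, float]


def filter_step(belief: Dist, transition: Dict[str, Dist],
                sensor: Dict[str, Dist], evidence: str) -> Dist:
    """One predict-update step over a discrete state variable."""
    # Predict: push the belief through the transition model P(X_t | X_{t-1}).
    predicted = {x: sum(transition[x_prev][x] * belief[x_prev] for x_prev in belief)
                 for x in belief}
    # Update: weight by the sensor model P(E_t | X_t), then normalize.
    unnorm = {x: sensor[x][evidence] * predicted[x] for x in predicted}
    z = sum(unnorm.values())
    return {x: v / z for x, v in unnorm.items()}


# Illustrative numbers only. transition[x_prev][x] = P(X_t = x | X_{t-1} = x_prev).
transition = {"rain": {"rain": 0.7, "dry": 0.3}, "dry": {"rain": 0.3, "dry": 0.7}}
# sensor[x][e] = P(E_t = e | X_t = x).
sensor = {"rain": {"umbrella": 0.9, "none": 0.1}, "dry": {"umbrella": 0.2, "none": 0.8}}
prior = {"rain": 0.5, "dry": 0.5}
posterior = filter_step(prior, transition, sensor, "umbrella")
```

With a uniform prior the prediction stays at 0.5 for each state, and observing the umbrella moves the belief in rain to 0.45 / (0.45 + 0.10) = 9/11.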
Decision Theoretic Agent -
sensor model

[Figure panels: generic sensor model; example: part of steam train control]

sensor model II
Model for lane position sensor
for automated vehicle
dynamic belief network – generic
• Describe the action model, namely the
state evolution model.
State evolution model - example
Dynamic Decision Network
• Can handle uncertainty
• Deal with continuous streams of
• Can handle unexpected events (they
have no fixed plan)
• Can handle sensor noise and sensor
• Can act to obtain information
• Can handle large state spaces

