Chapter 17: Complex Decisions
Outline
• Sequential decision problems (Markov Decision Processes)
– Value iteration
– Policy iteration
Sequential Decisions
• The agent’s utility depends on a sequence of decisions
Markov Decision Process (MDP)
• Defined as a tuple: <S, A, M, R>
– S: State
– A: Action
– M: Transition function
• Table M^a_ij = P(s_j | s_i, a), the probability of reaching s_j given action a in state s_i
– R: Reward
• R(s_i, a) = cost or reward of taking action a in state s_i
• In our case, R = R(s_i)
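To make the tuple concrete, here is a minimal sketch in Python of an <S, A, M, R> representation for a tiny two-state MDP; the state names, action names, and probabilities are illustrative inventions, not from the slides.

```python
# Minimal sketch of the <S, A, M, R> tuple for a finite MDP.
# The states, actions, and probabilities below are illustrative only.

states = ["s0", "s1"]                    # S: finite set of states
actions = ["a0", "a1"]                   # A: finite set of actions

# M: transition function; transition[(s, a)] maps successor -> probability,
# i.e., M^a_ij = P(s_j | s_i, a)
transition = {
    ("s0", "a0"): {"s0": 0.2, "s1": 0.8},
    ("s0", "a1"): {"s0": 1.0},
    ("s1", "a0"): {"s1": 1.0},
    ("s1", "a1"): {"s0": 0.5, "s1": 0.5},
}

# R: reward function; as on this slide, R depends only on the state
reward = {"s0": -0.04, "s1": 1.0}

# each row of M must be a probability distribution over successors
for (s, a), dist in transition.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```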
[Figure: the 4×3 grid world; terminal states +1 at (4,3) and -1 at (4,2), start in row 1; an intended move succeeds with probability 0.8 and slips to each perpendicular direction with probability 0.1]
MDP of the example
• S: the state of the agent on the 4×3 grid
– Note that each cell is denoted by (x, y)
• A: Actions of the agent, i.e., N, E, S, W
• M: Transition function
– E.g., M((4,2) | (3,2), N) = 0.1
– E.g., M((3,3) | (3,2), N) = 0.8
– (Robot movement, uncertainty of another agent’s actions,…)
– Note: Σ_j M(s_j | s_i, a) = 1 for every state s_i and action a
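The transition function of this grid world is small enough to write out. The sketch below assumes the conventions standard for this example (and consistent with the two probabilities above): the intended move succeeds with probability 0.8, the agent slips to each perpendicular direction with probability 0.1, cell (2,2) is blocked, and bumping into a wall or the blocked cell leaves the agent in place. The helper names step and M are my own.

```python
# Transition model M for the 4x3 grid world. Assumed conventions (standard
# for this example, not all stated on the slide): intended move with prob.
# 0.8, each perpendicular slip with prob. 0.1, cell (2,2) is blocked, and
# a move into a wall or the blocked cell leaves the agent where it is.

MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
PERP = {"N": "EW", "S": "EW", "E": "NS", "W": "NS"}   # perpendicular slips

def step(cell, d):
    """Deterministic move; bounce back if the target is off-grid or blocked."""
    x, y = cell[0] + MOVES[d][0], cell[1] + MOVES[d][1]
    if 1 <= x <= 4 and 1 <= y <= 3 and (x, y) != (2, 2):
        return (x, y)
    return cell

def M(cell, action):
    """P(successor | cell, action) as a dict whose values sum to 1."""
    dist = {}
    for d, p in [(action, 0.8), (PERP[action][0], 0.1), (PERP[action][1], 0.1)]:
        n = step(cell, d)
        dist[n] = dist.get(n, 0.0) + p
    return dist

print(M((3, 2), "N"))   # {(3, 3): 0.8, (4, 2): 0.1, (3, 2): 0.1}
```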
Policy
• A policy is a mapping from states to actions.
• Given a policy, one may calculate the expected utility of the series of actions produced by the policy (a sketch of this calculation follows).
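A hedged sketch of that calculation, iterative policy evaluation: repeatedly apply U(s) = R(s) + Σ_j P(s_j | s, π(s)) U(s_j) until the utilities stop changing. The grid model is repeated from the sketch above so the block runs on its own; the step cost R(s) = -0.04 for non-terminal states is an assumption consistent with the utilities shown later.

```python
# Expected utility of a fixed policy on the 4x3 grid world, by iterating
# U(s) = R(s) + sum_j P(s_j | s, pi(s)) U(s_j) to convergence.
# Grid model repeated from the earlier sketch so this block runs standalone.

MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
PERP = {"N": "EW", "S": "EW", "E": "NS", "W": "NS"}
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}
CELLS = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != (2, 2)]

def step(c, d):
    x, y = c[0] + MOVES[d][0], c[1] + MOVES[d][1]
    return (x, y) if 1 <= x <= 4 and 1 <= y <= 3 and (x, y) != (2, 2) else c

def M(c, a):
    dist = {}
    for d, p in [(a, 0.8), (PERP[a][0], 0.1), (PERP[a][1], 0.1)]:
        n = step(c, d)
        dist[n] = dist.get(n, 0.0) + p
    return dist

def evaluate(pi, step_cost=-0.04, tol=1e-9):
    """Utilities of all states under policy pi (pi must reach a terminal)."""
    U = {s: 0.0 for s in CELLS}
    delta = 1.0
    while delta > tol:
        delta = 0.0
        for s in CELLS:
            u = (TERMINALS[s] if s in TERMINALS else
                 step_cost + sum(p * U[t] for t, p in M(s, pi[s]).items()))
            delta = max(delta, abs(u - U[s]))
            U[s] = u
    return U

# example: the (suboptimal) policy "always go North"
U = evaluate({s: "N" for s in CELLS})
print(round(U[(1, 1)], 3))   # utility of the start state under this policy
```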
[Figure: an example policy, shown as arrows on the 4×3 grid world]
• The goal: find an optimal policy π*, one that would produce maximal expected utility.
Utility of a State
The utility of a state s measures its desirability: given s, we can measure the expected utility obtained by applying any policy π.
We assume the agent is in state s and define S_t (a random variable) as the state reached at step t. Obviously S_0 = s.
If s_i is terminal:
U(s_i) = R(s_i)
If s_i is non-terminal:
U(s_i) = R(s_i) + max_a Σ_j P(s_j | s_i, a) U(s_j)
[Bellman equation]
[the reward of s augmented by the expected sum of discounted rewards collected in future states]
The Bellman equations for all states form a system that can be solved by dynamic programming; value iteration (sketched below) repeatedly applies the Bellman update until the utilities converge.
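A minimal value-iteration sketch, under the same assumptions as the earlier grid-world code (0.8/0.1/0.1 transition noise, blocked cell (2,2), assumed step cost -0.04 for non-terminal states):

```python
# Value iteration: apply U(s) <- R(s) + max_a sum_j P(s_j | s, a) U(s_j)
# until convergence. Grid model repeated so this block runs standalone;
# the -0.04 step cost is an assumption consistent with the figure below.

MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
PERP = {"N": "EW", "S": "EW", "E": "NS", "W": "NS"}
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}
CELLS = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != (2, 2)]

def step(c, d):
    x, y = c[0] + MOVES[d][0], c[1] + MOVES[d][1]
    return (x, y) if 1 <= x <= 4 and 1 <= y <= 3 and (x, y) != (2, 2) else c

def M(c, a):
    dist = {}
    for d, p in [(a, 0.8), (PERP[a][0], 0.1), (PERP[a][1], 0.1)]:
        n = step(c, d)
        dist[n] = dist.get(n, 0.0) + p
    return dist

def value_iteration(step_cost=-0.04, tol=1e-9):
    U = {s: 0.0 for s in CELLS}
    delta = 1.0
    while delta > tol:
        delta = 0.0
        for s in CELLS:
            if s in TERMINALS:
                u = TERMINALS[s]
            else:
                u = step_cost + max(sum(p * U[t] for t, p in M(s, a).items())
                                    for a in MOVES)
            delta = max(delta, abs(u - U[s]))
            U[s] = u
    return U

U = value_iteration()
print(round(U[(1, 2)], 3), round(U[(3, 2)], 3))   # ~0.762 0.66, cf. the figure
```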
Optimal Policy
A policy is a function that maps each state s into the action to execute if s is reached.
[Figure: optimal policy and state utilities for the 4×3 grid world; row 2 shows U(1,2) = 0.762, U(3,2) = 0.660, and the -1 terminal at (4,2)]
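Given converged utilities, the optimal policy falls out by one-step lookahead: in each state, choose the action that maximizes the expected utility of the successor. A sketch under the same assumptions as the value-iteration code above:

```python
# One-step lookahead: the optimal action in s maximizes the expected
# utility of the successor state. Definitions repeated from the
# value-iteration sketch above so this block runs standalone.

MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
PERP = {"N": "EW", "S": "EW", "E": "NS", "W": "NS"}
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}
CELLS = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != (2, 2)]

def step(c, d):
    x, y = c[0] + MOVES[d][0], c[1] + MOVES[d][1]
    return (x, y) if 1 <= x <= 4 and 1 <= y <= 3 and (x, y) != (2, 2) else c

def M(c, a):
    dist = {}
    for d, p in [(a, 0.8), (PERP[a][0], 0.1), (PERP[a][1], 0.1)]:
        n = step(c, d)
        dist[n] = dist.get(n, 0.0) + p
    return dist

def value_iteration(step_cost=-0.04, tol=1e-9):
    U = {s: 0.0 for s in CELLS}
    delta = 1.0
    while delta > tol:
        delta = 0.0
        for s in CELLS:
            u = (TERMINALS[s] if s in TERMINALS else step_cost + max(
                 sum(p * U[t] for t, p in M(s, a).items()) for a in MOVES))
            delta = max(delta, abs(u - U[s]))
            U[s] = u
    return U

def extract_policy(U):
    """Greedy policy with respect to the utilities U."""
    return {s: max(MOVES, key=lambda a: sum(p * U[t] for t, p in M(s, a).items()))
            for s in CELLS if s not in TERMINALS}

policy = extract_policy(value_iteration())
print(policy[(1, 1)])   # "N": head up the left column toward the +1 state
```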