Decision Processes: General Description
• Decide what action to take next, given:
– The probabilities of moving to different states
– A way to evaluate the reward of being in each state
• Example applications:
– Robot path planning
– Travel route planning
– Elevator scheduling
– Aircraft navigation
– Manufacturing processes
– Network switching & routing
Example
• Assume that time is discretized into steps t = 1, 2, … (for example, fiscal years)
• Suppose that the business can be in one of a finite
number of states s (this is a major simplification, but let’s
assume….)
• Suppose that for every state s, we can anticipate a
reward that the business receives for being in that state:
R(s) (in this example, R(s) would be the profit, possibly
negative, generated by the business)
• Assume also that R(s) is bounded (R(s) < M for all s),
meaning that the business cannot generate more than a
certain profit threshold
• Question: What is the total value of the reward for a
particular configuration of states {s1,s2,…} over time?
Example
• Question: What is the total value of the reward
for a particular configuration of states {s1,s2,…}
over time?
• It is simply the sum of the rewards (possibly
negative) that we will receive in the future:
U(s1, s2, …, sn, …) = R(s1) + R(s2) + … + R(sn) + …
Horizon Problem
• The problem is that we did not put any limit on the
“future”, so this sum can be infinite.
• For example: Consider the simple case of computing
the total future reward if the business remains forever
in the same state:
U(s, s, …, s, …) = R(s) + R(s) + … + R(s) + …
is clearly infinite in general!
• This definition is useless unless we consider a finite
time horizon.
• But, in general, we don’t have a good way to define
such a time horizon.
Discounting
• U(s0, …) = R(s0) + γ R(s1) + … + γ^N R(sN) + …
• Always converges if γ < 1 and R(.) is bounded
• γ close to 0: instant gratification, don’t pay attention to future reward
• γ close to 1: extremely conservative, consider profits/losses no matter how far in the future
• The resulting model is the discounted reward
• Prefers expedient solutions (models impatience)
• Compensates for uncertainty in available time
(models mortality)
• Economic example:
– Being promised $10,000 next year is worth only 90%
as much as receiving $10,000 right now.
– A payment n years in the future is worth only (0.9)^n of the same payment now
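A minimal Python sketch of this discounted sum, just to make the arithmetic concrete (the function name and reward values are illustrative, not from the slides):

```python
# Discounted sum of a reward sequence: U = sum_t gamma^t * R_t.
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# $10,000 received now vs. promised n years in the future (gamma = 0.9):
print(discounted_return([10_000]))        # 10000.0  (paid now)
print(0.9 ** 1 * 10_000)                  # 9000.0   (paid next year: 90% as much)
print(0.9 ** 5 * 10_000)                  # ~5904.9  (paid in 5 years: (0.9)^5 as much)

# A constant, bounded reward stream converges to r / (1 - gamma) when gamma < 1:
print(discounted_return([1.0] * 1000))    # ~10.0
```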
Actions
• Assume that we also have a finite set of
actions a
• An action a causes a transition from a
state s to a state s’
• In the “business” example, an action may
be placing advertising, or selling stock,
etc.
The Basic Decision Problem
• Given:
– Set of states S = {s}
– Set of actions A = {a}, a: S → S
– Reward function R(.)
– Discount factor γ
– Starting state s1
• Find a sequence of actions such that the
resulting sequence of states maximizes
the total discounted reward:
U(s0, …) = R(s0) + γ R(s1) + … + γ^N R(sN) + …
• In the “business” example: find the sequence of business decisions (actions) that maximizes the discounted future profits
Maze Example: Utility
[Figure: 4×3 maze grid with +1 at (4,3), -1 at (4,2), and START at (1,1)]
• If no uncertainty: find the sequence of actions that maximizes the sum of the rewards of the traversed states
• Define the reward of being in a state:
– R(s) = -0.04 if s is an empty state
– R(4,3) = +1 (maximum reward when goal is reached)
– R(4,2) = -1 (avoid (4,2) as much as possible)
• Define the utility of a sequence of states:
– U(s0,…, sN) = R(s0) + R(s1) +….+R(sN)
What we are looking for: Policy
• Policy = mapping from states to actions, π(s) = a:
which action should I take in each state?
• In the maze example, π(s) associates a motion
to a particular location on the grid
• For any state s, we define the utility U(s) of s as
the sum of discounted rewards of the sequence
of states starting at state s generated by using
the policy π
U(s) = R(s) + γ R(s1) + γ^2 R(s2) + …
• Where we move from s to s1 by action π(s)
• We move from s1 to s2 by action π(s1)
• …etc.
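A minimal sketch of this computation for the no-uncertainty case, where following π from s yields a single state sequence; the three-state chain, the next_state function, and the reward values below are hypothetical illustrations, not part of the slides:

```python
# U(s) = R(s) + gamma*R(s1) + gamma^2*R(s2) + ...  where s1, s2, ... are the
# states reached by repeatedly applying the policy pi (deterministic case).
def utility_of_policy(s, pi, next_state, R, gamma=0.9, horizon=100):
    # Truncating at `horizon` steps is a good approximation because the
    # neglected tail is bounded by gamma^horizon * M / (1 - gamma).
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        total += discount * R[s]
        s = next_state(s, pi[s])   # deterministic transition chosen by pi
        discount *= gamma
    return total

# Hypothetical 3-state chain: "go" moves 0 -> 1 -> 2, and 2 is absorbing.
R = {0: -0.04, 1: -0.04, 2: 1.0}
pi = {0: "go", 1: "go", 2: "go"}
next_state = lambda s, a: min(s + 1, 2)
print(utility_of_policy(0, pi, next_state, R))   # ~8.02
```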
Maze Example: No Uncertainty
[Figure: the 4×3 maze grid (+1 at (4,3), -1 at (4,2), START at (1,1)) with the optimal action marked at selected cells]
• π*((1,1)) = UP
• π*((1,3)) = RIGHT
• π*((4,1)) = LEFT
[Figure: executed action vs. commanded action — the intended direction is executed with Prob = 0.8, each perpendicular direction with Prob = 0.1, and the opposite direction with Prob = 0.0]
Uncertainty
• No uncertainty:
– An action a deterministically causes a
transition from a state s to another state s’
• With uncertainty:
– An action a causes a transition from a state s
to another state s’ with some probability
T(s,a,s’)
– T(s,a,s’) is called the transition probability
from state s to state s’ through action a
– In general, we need |S|^2 × |A| numbers to store all the transition probabilities
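A minimal sketch of storing these |S|^2 × |A| numbers as a dense 3-D array (the sizes below are illustrative placeholders):

```python
import numpy as np

# T[s, a, s2] = probability of reaching state s2 from state s via action a.
num_states, num_actions = 12, 4   # e.g. a small grid with 4 motions (illustrative)
T = np.zeros((num_states, num_actions, num_states))

# Once filled in, each (s, a) slice must be a probability distribution:
# T[s, a, :].sum() == 1 for every state s and action a.
```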
Maze Example: Utility Revisited
[Figure: 4×3 maze grid; the intended action a is executed with probability 0.8, and each perpendicular direction with probability 0.1 — T(s,a,s')]
• For the intended action Up at state (1,1), without discounting:
U(1,1) = R(1,1) + 0.8 U(1,2) + 0.1 U(2,1) + 0.1 U(1,1)
Same with Discount
U(1,1) = R(1,1) + γ [0.8 U(1,2) + 0.1 U(2,1) + 0.1 U(1,1)]
More General Expression
• If we choose action a at state s:
U(s) = R(s) + γ Σs' T(s,a,s') U(s')
where:
– R(s) = reward at the current state s
– T(s,a,s') = probability of moving from state s to state s' with action a
– U(s') = expected sum of future discounted rewards starting at s'
– U(s) = expected sum of future discounted rewards starting at s
Formal Definitions
• Finite set of states: S
• Finite set of allowed actions: A
• Reward function R(s)
• Policy: π : S → A
• Optimal policy: π*(s) = action that maximizes the
expected sum of rewards from state s
Markov Example
[Figure: 4×3 maze grid showing three successive states s0, s1, s2 along a path toward the +1 goal]
• The probability of the next state depends only on the current state and action, not on how we got there (the Markov property)
Graphical Notations
[Figure: two states s and s' connected by arrows, each labeled with an action and its probability, e.g. T(s,a1,s') = 0.8, T(s,a2,s) = 0.2, T(s',a2,s) = 0.6]
Example (Partial)
[Figure: 4×3 maze grid; the intended action a is executed with probability 0.8, each perpendicular direction with probability 0.1]
• Partial transition model for action Up at state (1,1):
– T((1,1), Up, (1,2)) = 0.8
– T((1,1), Up, (1,1)) = 0.1
– T((1,1), Up, (2,1)) = 0.1
Example
• I run a company
• I can choose to either save money or spend
money on advertising
• If I advertise, I may become famous (50% prob.)
but will spend money so I may become poor
• If I save money, I may become rich (50% prob.),
but I may also become unknown because I don’t
advertise
• What should I do?
[State diagram: four states — Poor & Unknown (reward 0), Poor & Famous (reward 0), Rich & Unknown (reward 10), Rich & Famous (reward 10) — with actions S (save) and A (advertise); each arrow is labeled with its transition probability (½ or 1)]
Example Policies
[Figure: example policies for the model above, assigning S or A to each state]
Example: Finance and Business
• States: Status of the
company (cash reserves,
inventory, etc.)
• Actions: Business decisions
(advertise, acquire other
companies, roll out product,
etc.)
• Uncertainty due to all the external uncontrollable factors (economy, shortages, consumer confidence, …)
• Optimal policy: the policy for making business decisions that maximizes the expected future profits
Note: OK, this is an overly simplified view of business models. Similar models could be used for investment decisions, etc.
Example: Robotics
• States are 2-D positions
• Actions are commanded
motions (turn by x degrees,
move y meters)
• Uncertainty comes from the
fact that the mechanism is not
perfect (slippage, etc.) and
does not execute the
commands exactly
• Reward when avoiding
forbidden regions (for
example)
• Optimal policy: The policy that
minimizes the cost to the goal
Example: Games
• States: Number of white and
black checkers at each
location
• Note: the number of states is huge, on the order of 10^20 states!
• Branching factor prevents
direct search
• Actions: Set of legal moves
from any state
• Uncertainty comes from the
roll of the dice
• Reward computed from the number of checkers in the goal quadrant
• Optimal policy: the one that maximizes the probability of winning the game
Interesting example because it is impossible to store explicitly the transition probability tables (or the states, or the values U(s))
Key Result
• For every MDP, there exists an optimal policy
• There is no better option (in terms of expected
sum of rewards) than to follow this policy
Bellman’s Equation
• If we choose an action a:
U(s) = R(s) + γ Σs' T(s,a,s') U(s')
• If we always choose the action that maximizes the expected sum of future rewards, we get Bellman’s equation:
U(s) = R(s) + γ maxa (Σs' T(s,a,s') U(s'))
Why it cannot be solved directly
• U(s) appears on both sides of the equation, and the max over actions makes the resulting system of |S| coupled equations nonlinear, so it cannot be solved directly (e.g., as a linear system)
First Solution: Value Iteration
• Define U1(s) = best value after one step
U1(s) = R(s)
• Define U2(s) = best value after two steps:
U2(s) = R(s) + γ maxa (Σs' T(s,a,s') U1(s'))
• In general, Uk(s) = maximum possible expected sum of discounted rewards that I can get if I start at state s and I survive for k time steps:
Uk(s) = R(s) + γ maxa (Σs' T(s,a,s') Uk-1(s'))
Value Iteration
PU PF RU RF
U1 0 0 10 10
U2 0 4.5 14.5 19
U3 2.03 8.55 16.53 25.08
U4 4.76 12.2 18.35 28.72
U5 7.63 15.07 20.40 31.18
U6 10.21 17.46 22.61 33.21
U7 12.45 19.54 24.77 35.12
Uk(s) = R(s) + γ maxa (Σs’ T(s,a,s’) Uk-1(s’))
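Below is a sketch of this value iteration update on the four-state save/advertise example. The transition matrices and γ = 0.9 are inferred assumptions chosen to be consistent with the state diagram and to reproduce the U2–U7 values in the table above; they are not a definitive reading of the original figure.

```python
import numpy as np

# States: Poor&Unknown, Poor&Famous, Rich&Unknown, Rich&Famous
states = ["PU", "PF", "RU", "RF"]
R = np.array([0.0, 0.0, 10.0, 10.0])
gamma = 0.9   # assumed; this value reproduces the table of U_k values

# T[a][s][s'] for actions S (save) and A (advertise) -- assumed probabilities.
T = {
    "S": np.array([[1.0, 0.0, 0.0, 0.0],     # PU -S-> PU
                   [0.5, 0.0, 0.0, 0.5],     # PF -S-> PU or RF
                   [0.5, 0.0, 0.5, 0.0],     # RU -S-> PU or RU
                   [0.0, 0.0, 0.5, 0.5]]),   # RF -S-> RU or RF
    "A": np.array([[0.5, 0.5, 0.0, 0.0],     # PU -A-> PU or PF
                   [0.0, 1.0, 0.0, 0.0],     # PF -A-> PF
                   [0.5, 0.5, 0.0, 0.0],     # RU -A-> PU or PF
                   [0.0, 1.0, 0.0, 0.0]]),   # RF -A-> PF
}

U = R.copy()                                  # U1(s) = R(s)
for k in range(2, 8):                         # prints U2 ... U7 as in the table
    U = R + gamma * np.max([T[a] @ U for a in T], axis=0)
    print(f"U{k}:", np.round(U, 2))

# Upon convergence, the greedy policy is pi*(s) = argmax_a sum_s' T(s,a,s') U*(s')
best = np.argmax([T[a] @ U for a in ("S", "A")], axis=0)
print({s: ("S", "A")[i] for s, i in zip(states, best)})   # A for PU, S otherwise
```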
[Plot: Uk(s) versus iteration k for the four states — Uk(RF), Uk(RU), Uk(PF), Uk(PU)]
[Plot: the same Uk(s) curves versus iteration k, showing convergence]
• Upon convergence:
π*(s) = argmaxa (Σs' T(s,a,s') U*(s'))
• π*(PU) = A, π*(PF) = S, π*(RU) = S, π*(RF) = S
• Better to always save, except if poor and unknown
Maze Example
[Figure: the 4×3 maze grid with the +1 and -1 terminal states]
Key Convergence Results
[Plot: convergence as a function of the number of iterations]
So far….
• Definition of discounted sum of rewards to measure
utility
• Definition of Markov Decision Processes (MDP)
• Assumes observable states and uncertain action
outcomes
• Optimal policy = choice of action that results in the
maximum expected rewards in the future
• Bellman equation for general formulation of optimal
policy in MDP
• Value iteration (dynamic programming) technique for
computing the optimal policy
• Next: Other approaches for optimal policy computation +
examples and demos.
Another Solution: Policy Iteration
• Start with a randomly chosen policy π0
• Iterate until convergence (πk ≈ πk+1):
1. Compute Uk(s) for every state s using πk
2. Update the policy by choosing the best action given the utility computed at step k:
πk+1(s) = argmaxa (Σs' T(s,a,s') Uk(s'))
Evaluating a Policy
1. Compute Uk(s) for every state s using πk, i.e., the utility of following πk from s:
Uk(s) = R(s) + γ Σs' T(s, πk(s), s') Uk(s')
• In practice, this can be updated iteratively by plugging in the previous estimate on the right-hand side:
Uk(s) ≈ R(s) + γ Σs' T(s, πk(s), s') Uk-1(s')
• This is only an approximation, because we should use Uk here (the exact Uk solves the linear system above)
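A minimal sketch of policy iteration over the same kind of model (a dict of per-action transition matrices, as in the value iteration sketch above). Here the evaluation step solves the linear system exactly; the iterative update discussed on this slide is an approximation of that solve. The default action set ("S", "A") is an assumption tied to the save/advertise example.

```python
import numpy as np

def evaluate_policy(pi, T, R, gamma):
    # Exact policy evaluation: solve U = R + gamma * T_pi U as a linear system.
    n = len(R)
    T_pi = np.array([T[pi[s]][s] for s in range(n)])   # row s: successor dist. under pi(s)
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R)

def policy_iteration(T, R, gamma, actions=("S", "A")):
    n = len(R)
    pi = [actions[0]] * n                       # arbitrary initial policy pi_0
    while True:
        U = evaluate_policy(pi, T, R, gamma)                  # 1. evaluate pi_k
        q = np.array([T[a] @ U for a in actions])             # expected value of each action
        new_pi = [actions[i] for i in np.argmax(q, axis=0)]   # 2. greedy improvement
        if new_pi == pi:                        # converged: pi_k ~ pi_k+1
            return pi, U
        pi = new_pi

# Usage with the assumed T, R, gamma from the value iteration sketch above:
#   pi_star, U_star = policy_iteration(T, R, gamma)
#   -> pi_star chooses A for Poor&Unknown and S for the other three states
```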
Comparison
• Value iteration: preferable with a (relatively) small number of actions
• Policy iteration: preferable with a large number of actions, or when we have an initial guess at a good policy
• Combined policy/value iteration is possible
• Note: No need to traverse all the states in a fixed order
at every iteration
– Random order ok
– Predict “most useful” states to visit
– Prioritized sweeping: choose the state with the largest value update first
– States can be visited in any order, applying either value or policy iteration updates (“asynchronous iteration”)
Limitations
• We need to represent the values (and policy) for every
state in principle
• In real problems, the number of states may be very large
• Leads to intractably large tables (a checkers-like problem with N cells and M pieces has N(N-1)(N-2)…(N-M+1) states!)
• Need to find a compact way of representing the states
• Solutions:
– Interpolation
– Memory-based representations
– Hierarchical representations
[Figure: a table listing the value U for every state s1, s2, …, s100000, s100001, …]
Function Approximation
[Plot: the value U represented as a function over the states (axes: States vs. Value)]
Memory-Based Techniques
[Figure: a set of states stored in memory; the value U(s) of a query state s is estimated from the stored states, as in the sketch below]
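A minimal sketch of a memory-based estimate of U(s): utilities are stored only for a few representative states, and a query state is answered by a nearest-neighbor average. The stored points, their values, and the Euclidean distance are illustrative assumptions.

```python
import numpy as np

# A few stored states (here 2-D positions) with previously computed utilities.
stored_states = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
stored_values = np.array([0.2, 0.5, 0.4, 1.0])     # illustrative utilities

def U_estimate(s, k=2):
    # Estimate U(s) as the average utility of the k nearest stored states.
    d = np.linalg.norm(stored_states - s, axis=1)
    nearest = np.argsort(d)[:k]
    return stored_values[nearest].mean()

print(U_estimate(np.array([0.9, 0.8])))            # 0.75, from the two closest points
```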
Hierarchical Representations
• Split a state into smaller states when necessary
• Hierarchy of states with high-level managers directing
lower-level servants
Multiresolution: Examples from Foster & Dayan 2000
POMDP
• As before:
– States, s
– Actions, a
– Transitions, T(s,a,s’) = P(s’|a,s)
• New:
– The state is not directly observable, instead:
– Observations, o
– Observation model, O(s,o) = P(o|s): the probability of making observation o when at state s
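A minimal sketch of the POMDP ingredients as arrays. Only the shapes follow the definitions above (T indexed by s, a, s' and O indexed by s, o); the sizes and probability values are placeholders.

```python
import numpy as np

num_states, num_actions, num_obs = 3, 2, 2     # illustrative sizes

# Transitions, as in the MDP: T[s, a, s2] = P(s2 | s, a)
T = np.full((num_states, num_actions, num_states), 1.0 / num_states)

# New in a POMDP: the state is hidden; we only receive observations o, with
# O[s, o] = P(o | s), the probability of making observation o at state s.
O = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.1, 0.9]])

assert np.allclose(T.sum(axis=2), 1.0) and np.allclose(O.sum(axis=1), 1.0)
```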
POMDP: What is a “policy”?
• We don’t know for sure which state we’re in, so it
does not make sense to talk about the “optimal”
choice of action for a state
Summary
• Definition of discounted sum of rewards to measure
utility
• Definition of Markov Decision Processes (MDP)
• Assumes observable states and uncertain action
outcomes
• Optimal policy = choice of action that results in the
maximum expected rewards in the future
• Bellman equation for general formulation of optimal
policy in MDP
• Value iteration technique for computing the optimal
policy
• Policy iteration technique for computing the optimal
policy
• MDP = generalization of the deterministic search
techniques studied earlier in class
• POMDP = MDP + Uncertainty on the observed states