
Markov Decision Processes:
Making Decisions in the Presence of Uncertainty

(some of) R&N 16.1-16.6, R&N 17.1-17.4

Decision Processes: General Description
• Suppose that you own a business. At any time, you
know exactly the current state of the business (finances,
stock, etc.).
• At any time, you have a choice of several possible
actions (advertise, introduce new products,…).
• You cannot predict with certainty the consequence of
these actions given the current state of the business, but
you have a guess as to the likelihood of the possible
outcomes.
• How can you define a policy that will guarantee that you
always choose the action that maximizes expected
future profits?
Note: Russell & Norvig, Chapter 17.

Decision Processes: General Description
• Decide what action to take next, given:
– A probability to move to different states
– A way to evaluate the reward of being in different
states
Robot path planning
Travel route planning
Elevator scheduling
Aircraft navigation
Manufacturing processes
Network switching & routing

Example
• Assume that time is discretized into time steps
t = 1, 2, … (for example, fiscal years)
• Suppose that the business can be in one of a finite
number of states s (this is a major simplification, but let’s
assume….)
• Suppose that for every state s, we can anticipate a
reward that the business receives for being in that state:
R(s) (in this example, R(s) would be the profit, possibly
negative, generated by the business)
• Assume also that R(s) is bounded (R(s) < M for all s),
meaning that the business cannot generate more than a
certain profit threshold
• Question: What is the total value of the reward for a
particular configuration of states {s1,s2,…} over time?

Example
• Question: What is the total value of the reward
for a particular configuration of states {s1,s2,…}
over time?
• It is simply the sum of the rewards (possibly
negative) that we will receive in the future:
U(s1, s2, …, sn, …) = R(s1) + R(s2) + … + R(sn) + …

What is wrong with this formula???

Horizon Problem

U(s0,…, sN) = R(s0)+R(s1)+…+R(sN)

Need to know N, the length of the sequence (finite horizon).
The sum may be arbitrarily large depending on N.

Horizon Problem
• The problem is that we did not put any limit on the
“future”, so this sum can be infinite.
• For example: Consider the simple case of computing
the total future reward if the business remains forever
in the same state:
U(s, s, …, s, …) = R(s) + R(s) + … + R(s) + …
is clearly infinite in general!!
• This definition is useless unless we consider a finite
time horizon.
• But, in general, we don’t have a good way to define
such a time horizon.

Discounting
U(s0, s1, …) = R(s0) + γR(s1) + … + γ^N R(sN) + …

Discount factor 0 < γ < 1

The length of the sequence is arbitrary (infinite horizon).

Discounting
• U(s0, s1, …) = R(s0) + γR(s1) + … + γ^N R(sN) + …
• Always converges if γ < 1 and R(.) is bounded
• γ close to 0 → instant gratification, don’t pay
attention to future reward
• γ close to 1 → extremely conservative, consider
profits/losses no matter how far in the future
• The resulting model is the discounted reward model
• Prefers expedient solutions (models impatience)
• Compensates for uncertainty in available time
(models mortality)
• Economic example (see the short sketch below):
– Being promised $10,000 next year is worth only 90%
as much as receiving $10,000 right now.
– A payment n years in the future is worth only
(0.9)^n of a payment now.
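A minimal sketch (not from the slides) of computing a discounted sum of rewards; the reward values and γ = 0.9 below are illustrative choices.

```python
# Minimal sketch: discounted utility U = sum_t gamma^t * R(s_t).
# The reward values and gamma are illustrative, not taken from the slides.

def discounted_utility(rewards, gamma=0.9):
    """Return R(s0) + gamma*R(s1) + gamma^2*R(s2) + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

if __name__ == "__main__":
    # A payment of $10,000 received n years from now, discounted at 0.9 per year:
    for n in range(4):
        print(n, round(10000 * 0.9 ** n, 2))   # 10000.0, 9000.0, 8100.0, 7290.0

    # Discounted utility of a constant reward sequence of length 4:
    print(round(discounted_utility([1.0, 1.0, 1.0, 1.0]), 3))  # 1 + 0.9 + 0.81 + 0.729 = 3.439
```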

Actions
• Assume that we also have a finite set of
actions a
• An action a causes a transition from a
state s to a state s’
• In the “business” example, an action may
be placing advertising, or selling stock,
etc.

The Basic Decision Problem
• Given:
– Set of states S = {s}
– Set of actions A = {a}, a: S → S
– Reward function R(.)
– Discount factor γ
– Starting state s1
• Find a sequence of actions such that the
resulting sequence of states maximizes
the total discounted reward:
U(s0, s1, …) = R(s0) + γR(s1) + … + γ^N R(sN) + …
• In the “business” example: find the sequence of
business decisions that maximizes the expected
future profits
Maze Example: Utility


[Maze grid: columns 1–4, rows 1–3; +1 at (4,3), -1 at (4,2)]

• Define the reward of being in a state:


– R(s) = -0.04 if s is empty state
– R(4,3) = +1 (maximum reward when goal is reached)
– R(4,2) = -1 (avoid (4,2) as much as possible)
• Define the utility of a sequence of states:
– U(s0,…, sN) = R(s0) + R(s1) +….+R(sN)

Maze Example: Utility
[Maze grid: columns 1–4, rows 1–3; +1 at (4,3), -1 at (4,2), START at (1,1)]

If no uncertainty: Find the sequence of actions that
maximizes the sum of the rewards of the traversed states.
• Define the reward of being in a state:
– R(s) = -0.04 if s is empty state
– R(4,3) = +1 (maximum reward when goal is reached)
– R(4,2) = -1 (avoid (4,2) as much as possible)
• Define the utility of a sequence of states:
– U(s0,…, sN) = R(s0) + R(s1) +….+R(sN)

Maze Example: No Uncertainty


[Maze grid: columns 1–4, rows 1–3; +1 at (4,3), -1 at (4,2), START at (1,1)]

• States: locations in maze grid


• Actions: moves up/down, left/right
• If no uncertainty: Find sequence of actions from
current state to goal (+1) that maximizes utility
→ We know how to do this using earlier search
techniques

What we are looking for: Policy
• Policy = Mapping from states to action π(s) = a
→ Which action should I take in each state
• In the maze example, π(s) associates a motion
to a particular location on the grid
• For any state s, we define the utility U(s) of s as
the sum of discounted rewards of the sequence
of states starting at state s generated by using
the policy π
U(s) = R(s) + γ R(s1) + γ^2 R(s2) + …
• Where we move from s to s1 by action π(s)
• We move from s1 to s2 by action π(s1)
• …etc.

Optimal Decision Policy


• Policy = Mapping from states to action π(s) = a
→ Which action should I take in each state
• Intuition: π encodes the best action that we can
take from any state to maximize future rewards
• In the maze example, π(s) associates a motion
to a particular location on the grid
• Optimal Policy = The policy π* that maximizes
the expected utility U(s) of the sequence of
states generated by π*, starting at s
• In the maze example, π*(s) tells us which motion
to choose at every cell of the grid to bring us
closer to the goal

Maze Example: No Uncertainty
[Maze grid: columns 1–4, rows 1–3; +1 at (4,3), -1 at (4,2), START at (1,1)]

• π*((1,1)) = UP
• π*((1,3)) = RIGHT
• π*((4,1)) = LEFT

Maze Example: With Uncertainty


[Diagram — intended action vs. executed action: the intended action is
executed with prob. 0.8, the opposite direction with prob. 0.0, and each
perpendicular direction with prob. 0.1]

• The robot may not execute exactly the action that is
commanded → The outcome of an action is no longer
deterministic
• Uncertainty:
– We know in which state we are (fully observable)
– But we are not sure that the commanded action will be executed
exactly

Uncertainty
• No uncertainty:
– An action a deterministically causes a
transition from a state s to another state s’
• With uncertainty:
– An action a causes a transition from a state s
to another state s’ with some probability
T(s,a,s’)
– T(s,a,s’) is called the transition probability
from state s to state s’ through action a
– In general, we need |S|^2 × |A| numbers to store
all the transition probabilities
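A minimal sketch (not from the slides) of how T(s,a,s') might be stored; the tiny two-state, two-action MDP below is made up for illustration.

```python
# Minimal sketch: storing T(s,a,s') as a nested dict T[s][a][s'] -> probability.
# The two-state, two-action MDP below is illustrative, not from the slides.

T = {
    "s1": {
        "a1": {"s1": 0.2, "s2": 0.8},
        "a2": {"s1": 0.9, "s2": 0.1},
    },
    "s2": {
        "a1": {"s1": 0.6, "s2": 0.4},
        "a2": {"s1": 0.0, "s2": 1.0},
    },
}

# Sanity check: for every (s, a), the outgoing probabilities must sum to 1.
for s, actions in T.items():
    for a, dist in actions.items():
        assert abs(sum(dist.values()) - 1.0) < 1e-9, (s, a)

# A dense table would need |S|^2 * |A| entries, as noted above:
num_states, num_actions = len(T), 2   # two actions per state in this toy example
print(num_states ** 2 * num_actions)  # 8
```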

Maze Example: With Uncertainty


[Maze grid: columns 1–4, rows 1–3; +1 at (4,3), -1 at (4,2)]

• We can no longer find a unique sequence of
actions, but
• Can we find a policy that tells us which action to
take from each state, such that the policy maximizes
the expected utility?

Maze Example: Utility Revisited
[Maze grid as before; intended action a: executed with prob. 0.8,
prob. 0.1 to each side — T(s,a,s’)]

U(s) = Expected reward of future states starting at s

How to compute U after one step?

Maze Example: Utility Revisited


[Maze grid as before, with s = (1,1); intended action a: executed with
prob. 0.8, prob. 0.1 to each side — T(s,a,s’)]

Suppose s = (1,1) and we choose action Up.

U(1,1) = R(1,1) +

Maze Example: Utility Revisited
[Maze grid as before, with s = (1,1); intended action a: executed with
prob. 0.8, prob. 0.1 to each side — T(s,a,s’)]

Suppose s = (1,1) and we choose action Up.

U(1,1) = R(1,1) + 0.8 x U(1,2) +

Maze Example: Utility Revisited


[Maze grid as before, with s = (1,1); intended action a: executed with
prob. 0.8, prob. 0.1 to each side — T(s,a,s’)]

Suppose s = (1,1) and we choose action Up.

U(1,1) = R(1,1) + 0.8 x U(1,2) + 0.1 x U(2,1) +

Maze Example: Utility Revisited
[Maze grid as before, with s = (1,1); intended action a: executed with
prob. 0.8, prob. 0.1 to each side — T(s,a,s’)]

Suppose s = (1,1) and we choose action Up.

U(1,1) = R(1,1) + 0.8 x U(1,2) + 0.1 x U(2,1) + 0.1 x U(1,1)

Maze Example: Utility Revisited


[Maze grid as before, with s = (1,1); intended action a: executed with
prob. 0.8, prob. 0.1 to each side — T(s,a,s’)]

Suppose s = (1,1) and we choose action Up.

U(1,1) = R(1,1) + 0.8 x U(1,2) + 0.1 x U(2,1) + 0.1 x U(1,1)

Move up with prob. 0.8; move right with prob. 0.1 (to (2,1));
move left with prob. 0.1, bouncing off the wall and staying in (1,1).

Same with Discount
[Maze grid as before, with s = (1,1); intended action a: executed with
prob. 0.8, prob. 0.1 to each side — T(s,a,s’)]

Suppose s = (1,1) and we choose action Up.

U(1,1) = R(1,1) + γ (0.8 x U(1,2) + 0.1 x U(2,1) + 0.1 x U(1,1))
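A minimal sketch (not from the slides) of this one-step backup. The utility values and γ below are made-up placeholders; only the 0.8/0.1/0.1 action model, the -0.04 reward of an empty cell, and the bounce-off-the-wall rule follow the maze example.

```python
# Minimal sketch of the one-step backup above. The utility values U and the
# discount gamma are made-up placeholders; the 0.8/0.1/0.1 slip model and the
# "bounce off the wall" rule follow the maze example in the slides.

GAMMA = 0.9                    # assumed value, not given on the slide
R_EMPTY = -0.04                # reward of an empty cell, as defined earlier

# Placeholder utilities for the three cells involved in the backup:
U = {(1, 1): 0.0, (1, 2): 0.5, (2, 1): 0.2}

def backup_up_from_1_1(U, gamma=GAMMA, r=R_EMPTY):
    """U(1,1) = R(1,1) + gamma*(0.8*U(1,2) + 0.1*U(2,1) + 0.1*U(1,1))."""
    return r + gamma * (0.8 * U[(1, 2)] + 0.1 * U[(2, 1)] + 0.1 * U[(1, 1)])

print(round(backup_up_from_1_1(U), 4))  # -0.04 + 0.9*(0.4 + 0.02 + 0.0) = 0.338
```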

More General Expression


• If we choose action a at state s, expected future
rewards are:

U(s) = R(s) + γ Σs’ T(s,a,s’) U(s’)

More General Expression
• If we choose action a at state s:

U(s) = R(s) + γΣs’ T(s,a,s’) U(s’)

where:
– U(s) on the left is the expected sum of future discounted rewards starting at s
– R(s) is the reward at the current state s
– T(s,a,s’) is the probability of moving from state s to state s’ with action a
– U(s’) is the expected sum of future discounted rewards starting at s’

More General Expression


• If we are using policy π, we choose action
a=π(s) at state s, expected future rewards are:

Uπ(s) = R(s) + γ Σs’T(s,π(s),s’) Uπ(s’)

Formal Definitions
• Finite set of states: S
• Finite set of allowed actions: A
• Reward function R(s)

• Transition probabilities: T(s,a,s’) = P(s’|a,s)

• Utility = sum of discounted rewards:
– U(s0, s1, …) = R(s0) + γR(s1) + … + γ^N R(sN) + …

• Policy: π: S → A
• Optimal policy: π*(s) = action that maximizes the
expected sum of rewards from state s

Markov Decision Process (MDP)

• Key property (Markov):


P(st+1 | a, s0, …, st) = P(st+1 | a, st)
• In words: The state reached after applying an
action depends only on the current state, not on
the history of states visited before
→ Markov Process

Markov Example
[Maze grid with a path: s0 = (1,1), s1 = (1,2), s2 = (1,3); +1 at (4,3), -1 at (4,2)]

• When applying the action “Right” from state


s2 = (1,3), the new state depends only on
the previous state s2, not the entire history
{s1, s0}

Graphical Notations
[Diagram: two states s and s’ connected by labeled arcs, e.g.
T(s,a1,s’) = 0.8, T(s’,a2,s) = 0.6, T(s,a2,s) = 0.2]

• Nodes are states


• Each arc corresponds to a possible transition
between two states given an action
• Arcs are labeled by the transition probabilities

Example (Partial)
[Maze grid as before; intended action a: executed with prob. 0.8, prob. 0.1 to each side]

Partial transition graph from state (1,1) with action Up:
– to (1,2) with prob. 0.8
– to (2,1) with prob. 0.1
– stay in (1,1) with prob. 0.1

Warning: The transitions are NOT all shown in this example!

Example
• I run a company
• I can choose to either save money or spend
money on advertising
• If I advertise, I may become famous (50% prob.)
but will spend money so I may become poor
• If I save money, I may become rich (50% prob.),
but I may also become unknown because I don’t
advertise
• What should I do?

[State diagram — four states with rewards: Poor & Unknown (0), Poor & Famous (0),
Rich & Unknown (10), Rich & Famous (10); arcs for actions S (save) and
A (advertise), each labeled with probability ½ or 1]

Example Policies

• How many policies?


• Which one is the best policy?
• How to compute the optimal policy?

Example: Finance and Business
• States: Status of the
company (cash reserves,
inventory, etc.)
• Actions: Business decisions
(advertise, acquire other
companies, roll out product,
etc.)
• Uncertainty due to all the external uncontrollable
factors (economy, shortages, consumer confidence…)
• Optimal policy: The policy for making business
decisions that maximizes the expected future profits

Note: Ok, this is an overly simplified view of business
models. Similar models could be used for investment
decisions, etc.

Example: Robotics
• States are 2-D positions
• Actions are commanded
motions (turn by x degrees,
move y meters)
• Uncertainty comes from the
fact that the mechanism is not
perfect (slippage, etc.) and
does not execute the
commands exactly
• Reward when avoiding
forbidden regions (for
example)
• Optimal policy: The policy that
minimizes the cost to the goal

Example: Games
• States: Number of white and
black checkers at each
location
• Note: Number of states is
huge, on the order of 10^20
states!
• Branching factor prevents
direct search
• Actions: Set of legal moves
from any state
• Uncertainty comes from the
roll of the dice
• Reward computed from the number of checkers in
the goal quadrant
• Optimal policy: The one that maximizes the
probability of winning the game

Interesting example because it is impossible to store
explicitly the transition probability tables (or the
states, or the values U(s)).

Example: Robotics

• Learning how to fly helicopters!


• States: possible values for the roll, pitch, yaw, and
elevation of the helicopter
• Actions: Commands to the actuators. The uncertainty
comes from the fact that the actuators are imperfect and
that there are unknown external effects like wind gusts
• Reward: High reward if it remains in stable flight
(low reward if it goes unstable and crashes!)
• Policy: A control law that associates a command to
the observed state
• Optimal policy: The policy that maximizes flight
stability for a particular maneuver (e.g., hovering)

Note 1: The states are continuous in this case. Although
we will cover only MDPs with discrete states, the
concepts can be extended to continuous spaces.
Note 2: It is obviously impossible to “try” different
policies on the system itself, for obvious reasons (it
will crash to the ground on most policies!).

Key Result
• For every MDP, there exists an optimal policy
• There is no better option (in terms of expected
sum of rewards) than to follow this policy

• How to compute the optimal policy? → We
cannot evaluate all possible policies (in real
problems, the number of states is very large)

Bellman’s Equation
If we choose an action a:

U(s) = R(s) + γΣs’ T(s,a,s’) U(s’)

Bellman’s Equation
If we choose an action a:

U(s) = R(s) + γΣs’ T(s,a,s’) U(s’)

In particular, if we always choose the action a


that maximizes future rewards (optimal policy),
U(s) is the maximum U*(s) we can get over all
possible choices of actions:

U*(s) = R(s) + γ maxa (Σs’ T(s,a,s’) U*(s’))

Bellman’s Equation

U*(s) = R(s) + γ maxa (Σs’ T(s,a,s’) U*(s’))

• The optimal policy (choice of a that maximizes


U) is:

π*(s) = argmaxa (Σs’ T(s,a,s’) U*(s’))
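A minimal sketch (not from the slides) of this argmax step: extracting the greedy policy from a value function U*, reusing the nested-dict convention T[s][a][s'] of the earlier sketch; the U* values below are made up.

```python
# Minimal sketch: pi*(s) = argmax_a sum_{s'} T(s,a,s') * U*(s').
# T uses the nested-dict convention T[s][a][s'] from the earlier sketch;
# U_star holds made-up values for illustration.

def greedy_policy(T, U_star):
    """Return, for each state, the action maximizing expected next-state utility."""
    policy = {}
    for s, actions in T.items():
        policy[s] = max(
            actions,
            key=lambda a: sum(p * U_star[s2] for s2, p in actions[a].items()),
        )
    return policy

T = {
    "s1": {"a1": {"s1": 0.2, "s2": 0.8}, "a2": {"s1": 0.9, "s2": 0.1}},
    "s2": {"a1": {"s1": 0.6, "s2": 0.4}, "a2": {"s2": 1.0}},
}
U_star = {"s1": 1.0, "s2": 5.0}
print(greedy_policy(T, U_star))  # {'s1': 'a1', 's2': 'a2'}
```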

Why it cannot be solved directly

U*(s) = R(s) + γ maxa (Σs’ T(s,a,s’) U*(s’))

Set of |S| equations. Non-linear because of the “max”:
cannot be solved directly!

• The optimal policy (choice of a that maximizes U) is:

π*(s) = argmaxa (Σs’ T(s,a,s’) U*(s’))

U*(s’) = expected sum of rewards using policy π*

→ The right-hand side depends on the unknown U*.
Cannot solve directly.

First Solution: Value Iteration


• Define U1(s) = best value after one step
U1(s) = R(s)
• Define U2(s) = best possible value after two
steps
U2(s) = R(s) + γ maxa (Σs’ T(s,a,s’) U1(s’))

……………………………….

• Define Uk(s) = best possible value after k steps


Uk(s) = R(s) + γ maxa (Σs’ T(s,a,s’) Uk-1(s’))

First Solution: Value Iteration
• Define U1(s) = best value after one step
U1(s) = R(s)
• Define U2(s) = best value after two steps

U2(s) = R(s) + γ maxa (Σs’ T(s,a,s’) U1(s’))

Uk(s) = maximum possible expected sum of discounted
rewards that I can get if I start at state s and I
survive for k time steps.
……………………………….

• Define Uk(s) = best value after k steps


Uk(s) = R(s) + γ maxa (Σs’ T(s,a,s’) Uk-1(s’))
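A minimal sketch (not from the slides) of this update, reusing the nested-dict convention T[s][a][s'] from the earlier sketches; the rewards, transitions, discount, and stopping threshold in the demo are illustrative choices.

```python
# Minimal sketch of value iteration:
#   U_k(s) = R(s) + gamma * max_a sum_{s'} T(s,a,s') * U_{k-1}(s')
# T uses the nested-dict convention T[s][a][s'] from the earlier sketches.

def value_iteration(states, R, T, gamma=0.9, tol=1e-6):
    U = {s: R[s] for s in states}                    # U_1(s) = R(s)
    while True:
        U_new = {
            s: R[s] + gamma * max(
                sum(p * U[s2] for s2, p in T[s][a].items()) for a in T[s]
            )
            for s in states
        }
        if max(abs(U_new[s] - U[s]) for s in states) < tol:
            return U_new                             # converged (approximately) to U*
        U = U_new

if __name__ == "__main__":
    # Tiny made-up MDP for illustration:
    states = ["s1", "s2"]
    R = {"s1": 0.0, "s2": 1.0}
    T = {
        "s1": {"a1": {"s1": 0.5, "s2": 0.5}, "a2": {"s1": 1.0}},
        "s2": {"a1": {"s2": 1.0}},
    }
    print(value_iteration(states, R, T))  # U*(s2) = 10, U*(s1) ≈ 8.18
```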

Example
• I run a company
• I can choose to either save money or spend
money on advertising
• If I advertise, I may become famous (50% prob.)
but will spend money so I may become poor
• If I save money, I may become rich (50% prob.),
but I may also become unknown because I don’t
advertise
• What should I do?

[State diagram — four states with rewards: Poor & Unknown (0), Poor & Famous (0),
Rich & Unknown (10), Rich & Famous (10); arcs for actions S (save) and
A (advertise), each labeled with probability ½ or 1]

Value Iteration
PU PF RU RF
U1 0 0 10 10
U2 0 4.5 14.5 19
U3 2.03 8.55 16.53 25.08
U4 4.76 12.2 18.35 28.72
U5 7.63 15.07 20.40 31.18
U6 10.21 17.46 22.61 33.21
U7 12.45 19.54 24.77 35.12
Uk(s) = R(s) + γ maxa (Σs’ T(s,a,s’) Uk-1(s’))
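The following sketch (not from the slides) reproduces this table. The transition model for S (save) and A (advertise) is my reading of the state diagram, and γ = 0.9 is an assumption; with these choices the printed values match the table above up to rounding.

```python
# Sketch reproducing the value-iteration table for the PU/PF/RU/RF example.
# The transition model below is reconstructed from the state diagram and the
# discount gamma = 0.9 is assumed; both are labeled assumptions.

GAMMA = 0.9
R = {"PU": 0, "PF": 0, "RU": 10, "RF": 10}
T = {   # T[s][action][s'] -> probability ("S" = save, "A" = advertise)
    "PU": {"S": {"PU": 1.0},            "A": {"PU": 0.5, "PF": 0.5}},
    "PF": {"S": {"PU": 0.5, "RF": 0.5}, "A": {"PF": 1.0}},
    "RU": {"S": {"PU": 0.5, "RU": 0.5}, "A": {"PU": 0.5, "PF": 0.5}},
    "RF": {"S": {"RU": 0.5, "RF": 0.5}, "A": {"PF": 1.0}},
}

U = dict(R)                                 # U_1(s) = R(s)
print("U1", U)
for k in range(2, 8):
    U = {
        s: R[s] + GAMMA * max(
            sum(p * U[s2] for s2, p in T[s][a].items()) for a in T[s]
        )
        for s in T
    }
    print(f"U{k}", {s: round(v, 2) for s, v in U.items()})
# U2 -> {'PU': 0.0, 'PF': 4.5, 'RU': 14.5, 'RF': 19.0}, matching the table.
```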

[Plot: Uk(s) vs. iteration k for the four states PU, PF, RU, RF]

Value Iteration: Facts


• As k increases, Uk(s) converges to a value U*(s)

• The optimal policy is then given by:

π*(s) = argmaxa (Σs’ T(s,a,s’) U*(s’))


• And U* is the utility under the optimal policy π*
• See convergence proof in R&N

[Plot: Uk(s) vs. iteration k for the four states PU, PF, RU, RF]

Upon convergence:
π*(s) = argmaxa (Σs’ T(s,a,s’) U*(s’))

π*(PU) = A   π*(PF) = S
π*(RU) = S   π*(RF) = S

Better to always save, except if poor and unknown.

Maze Example
[Maze grid: columns 1–4, rows 1–3; +1 at (4,3), -1 at (4,2)]
Key Convergence Results

• The error on U is reduced by γ at each iteration


• Exponentially fast convergence
• Slower convergence as γ increases
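A standard way to state this (the contraction bound behind the R&N convergence proof; the inequality below is not spelled out on the slide):

```latex
\|U_{k+1}-U^*\|_\infty \;\le\; \gamma \,\|U_k-U^*\|_\infty
\quad\Longrightarrow\quad
\|U_k-U^*\|_\infty \;\le\; \gamma^{k}\,\|U_0-U^*\|_\infty
```

So the error shrinks by a factor γ per iteration: fast when γ is small, slow when γ is close to 1.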

So far….
• Definition of discounted sum of rewards to measure
utility
• Definition of Markov Decision Processes (MDP)
• Assumes observable states and uncertain action
outcomes
• Optimal policy = choice of action that results in the
maximum expected rewards in the future
• Bellman equation for general formulation of optimal
policy in MDP
• Value iteration (dynamic programming) technique for
computing the optimal policy
• Next: Other approaches for optimal policy computation +
examples and demos.

Another Solution: Policy Iteration
• Start with a randomly chosen policy π0
• Iterate until convergence (πk ~ πk+1):
1. Compute Uk(s) for every state s using πk
2. Update the policy by choosing the best action given
the utility computed at step k:

πk+1(s) = argmaxa (Σs’ T(s,a,s’) Uk(s’))


The sequence of policies π0, π1, …, πk, …
converges to π*
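A minimal sketch (not from the slides) of this loop, reusing the dict conventions of the earlier sketches; policy evaluation here uses the iterative (simplified) update discussed on the next slide, and all names are illustrative.

```python
# Minimal sketch of policy iteration. Policy evaluation is done iteratively
# (the "simplified update" on the next slide) rather than by solving the
# linear system exactly. Dict conventions follow the earlier sketches.

def evaluate_policy(states, R, T, policy, gamma=0.9, tol=1e-8):
    U = {s: 0.0 for s in states}
    while True:
        U_new = {
            s: R[s] + gamma * sum(p * U[s2]
                                  for s2, p in T[s][policy[s]].items())
            for s in states
        }
        if max(abs(U_new[s] - U[s]) for s in states) < tol:
            return U_new
        U = U_new

def policy_iteration(states, R, T, gamma=0.9):
    policy = {s: next(iter(T[s])) for s in states}   # arbitrary initial policy
    while True:
        U = evaluate_policy(states, R, T, policy, gamma)
        new_policy = {
            s: max(T[s], key=lambda a: sum(p * U[s2]
                                           for s2, p in T[s][a].items()))
            for s in states
        }
        if new_policy == policy:                     # pi_k == pi_{k+1}: converged
            return policy, U
        policy = new_policy
```

Running it on the reward and transition dictionaries from the earlier sketches would return the same greedy policy that value iteration followed by the argmax step produces.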

Evaluating a Policy
1. Compute Uk(s) for every state s using πk

Uk(s) = R(s) + γΣs’ T(s, πk(s),s’) Uk(s’)


Linear set of equations, can be solved in O(|S|^3)
May be too expensive for |S| large; use instead the simplified update:

Uk(s) ← R(s) + γΣs’ T(s, πk(s),s’) Uk-1(s’)

(modified policy iteration)

Evaluating a Policy
1. Compute Uk(s) for every state s using πk

Uk(s) = R(s) + γΣs’ T(s, πk(s),s’) Uk(s’)

Linear set of equations, can be solved in O(|S|^3)
May be too expensive for |S| large; use instead the simplified update:

Uk(s) ← R(s) + γΣs’ T(s, πk(s),s’) Uk-1(s’)

Note: this is only an approximation, because we should use Uk (not Uk-1) here.

(modified policy iteration)
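A minimal sketch (not from the slides) of the exact evaluation step as a linear solve, U = (I − γ T_π)⁻¹ R; the two-state transition matrix and rewards below are made-up values.

```python
# Minimal sketch: exact policy evaluation by solving (I - gamma * T_pi) U = R,
# i.e. the linear system U(s) = R(s) + gamma * sum_s' T(s,pi(s),s') U(s').
# The 2-state transition matrix and rewards below are made-up values.
import numpy as np

gamma = 0.9
# T_pi[i, j] = probability of moving from state i to state j under pi(s_i):
T_pi = np.array([[0.5, 0.5],
                 [0.0, 1.0]])
R = np.array([0.0, 1.0])

U = np.linalg.solve(np.eye(2) - gamma * T_pi, R)   # O(|S|^3) in general
print(U)   # U[1] = 1/(1-0.9) = 10, and U[0] = 0.9*(0.5*U[0] + 0.5*10) ≈ 8.18
```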

Comparison
• Value iteration:
– (Relatively) small number of actions
• Policy iteration:
– Large number of actions
– Initial guess at a good policy
• Combined policy/value iteration is possible
• Note: No need to traverse all the states in a fixed order
at every iteration
– Random order ok
– Predict “most useful” states to visit
– Prioritized sweeping → choose the state with the largest
value update to process next
– States can be visited in any order, applying either value or policy
iteration → asynchronous iteration

Limitations
• We need to represent the values (and policy) for every
state in principle
• In real problems, the number of states may be very large
• Leads to intractably large tables (checker-like problem
with N cells and M pieces → N(N-1)(N-2)…(N-M) states!)
• Need to find a compact way of representing the states
• Solutions:
– Interpolation
– Memory-based representations
– Hierarchical representations

[Illustration: a lookup table “State s → Value U” with rows s1, s2, …, s100000, s100001, …]

Function Approximation

[Illustration: plot of value U vs. states, approximated by a smooth curve]

Polynomials/Splines approximation: Represent U(s) by a polynomial
function that can be represented by a small number of parameters.
Applications: economic models, control, Operations Research,
channel routing, radio therapy.

Neural Nets: Represent U(s) implicitly by a neural net (function
interpolation). Applications: elevator scheduling, cell phones,
Backgammon, etc.

Memory-Based Techniques

[Illustration: a query state s among states stored in memory, U(s) = ?]

Replace U(s) by U(closest stored neighbor to s), or by a weighted
average of U over the K closest stored neighbors to s.
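A minimal sketch (not from the slides) of the weighted K-nearest-neighbor idea for continuous states; the stored points, values, K, and inverse-distance weighting are illustrative choices.

```python
# Minimal sketch: approximate U(s) from stored (state, value) pairs by an
# inverse-distance-weighted average of the K nearest stored states.
# The stored points, values, K, and weighting scheme are illustrative choices.
import numpy as np

stored_states = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
stored_values = np.array([0.0, 1.0, 1.0, 2.0])

def knn_value(s, K=2, eps=1e-9):
    d = np.linalg.norm(stored_states - np.asarray(s), axis=1)
    idx = np.argsort(d)[:K]                    # indices of the K closest states
    w = 1.0 / (d[idx] + eps)                   # inverse-distance weights
    return float(np.sum(w * stored_values[idx]) / np.sum(w))

print(knn_value([0.9, 0.2]))   # ≈ 1.22, dominated by the nearby states (1,0) and (1,1)
```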

Hierarchical Representations
• Split a state into smaller states when necessary
• Hierarchy of states with high-level managers directing
lower-level servants

Example from Dayan, “Feudal learning”.

Multiresolution: Examples from Foster & Dayan 2000

More Difficult Case


Uncertainty on transition from one
state to the next as before because
of imperfect actuators.

But, now we have also:

Uncertainty on our knowledge of the state we’re in,
because of imperfect sensors.

The state is only partially observable: Partially
Observable Markov Decision Process (POMDP)

POMDP
• As before:
– States, s
– Actions, a
– Transitions, T(s,a,s’) = P(s’|a,s)
• New:
– The state is not directly observable, instead:
– Observations, o
– Observation model, O(s,o) = P(o|s)

POMDP
• As before:
– States, s
– Actions, a
– Transitions, T(s,a,s’) = P(s’|a,s)
• New:
– The state is not directly observable, instead:
– Observations, o
– Observation model, O(s,o) = P(o|s)
= probability of making observation o when at state s

POMDP: What is a “policy”?
• We don’t know for sure which state we’re in, so it
does not make sense to talk about the “optimal”
choice of action for a state

• All we can define is the probability that we are in


any given state:
b(s) = [P(s1),…,P(sN)]

• Policy: Choice of action for a given belief state


π(b): belief state b to action a
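The belief state must be updated after each action and observation. The slides do not show the update rule; the sketch below is the standard Bayes-filter update (an assumption, not taken from the deck), using the T and O conventions above with made-up numbers.

```python
# Sketch of the standard belief update b'(s') ∝ O(s',o) * sum_s T(s,a,s') b(s).
# This rule is not shown on the slides; it is the usual Bayes filter for POMDPs.
# T[s][a][s'] and O[s][o] follow the conventions of the earlier sketches.

def update_belief(b, a, o, T, O):
    b_new = {}
    for s2 in b:
        pred = sum(T[s][a].get(s2, 0.0) * b[s] for s in b)   # prediction step
        b_new[s2] = O[s2][o] * pred                          # observation step
    norm = sum(b_new.values())
    return {s: p / norm for s, p in b_new.items()}

# Tiny illustrative example (all numbers made up):
T = {"s1": {"a": {"s1": 0.7, "s2": 0.3}}, "s2": {"a": {"s1": 0.2, "s2": 0.8}}}
O = {"s1": {"o1": 0.9, "o2": 0.1}, "s2": {"o1": 0.2, "o2": 0.8}}
b = {"s1": 0.5, "s2": 0.5}
print(update_belief(b, "a", "o1", T, O))  # most of the mass shifts to s1
```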

POMDP: What is a “policy”?
• We don’t know for sure which state we’re in, so it
does not make sense to talk about the “optimal”
choice of action for a state

• All we can define is the probability that we are in
any given state (the “belief state”):
b(s) = [P(s1),…,P(sN)]
where b(sk) is the probability that the agent is in state sk

• Policy: Choice of action for a given belief state
π(b): belief state b to action a

→ This is an MDP with belief states instead of states.
Unfortunately:
– Requires a continuous representation
– Intractable in general
– Approximations or special cases

Summary
• Definition of discounted sum of rewards to measure
utility
• Definition of Markov Decision Processes (MDP)
• Assumes observable states and uncertain action
outcomes
• Optimal policy = choice of action that results in the
maximum expected rewards in the future
• Bellman equation for general formulation of optimal
policy in MDP
• Value iteration technique for computing the optimal
policy
• Policy iteration technique for computing the optimal
policy
• MDP = generalization of the deterministic search
techniques studied earlier in class
• POMDP = MDP + Uncertainty on the observed states

