RL Frra


Bellman Equation

According to the Bellman equation, the long-term reward for a given action is equal to the reward from the current action combined with the expected reward from the future actions taken at the following time steps.
Example: we have a maze, and the goal of our agent is to reach the trophy state (R = 1) and collect a good reward, while avoiding the fire state, since ending there is a failure (R = -1) and yields a bad reward.

What happens without the Bellman equation?


We give the agent some time to explore the environment. As soon as it finds its goal, it traces its steps back to its starting position and marks the value of every state that leads towards the goal as V = 1.
The agent will face no problem until we change its starting position: it will then not be able to find a path towards the trophy state, since the value of all the marked states is equal to 1. The Bellman equation solves this problem:
V(s) = max_a ( R(s, a) + γ V(s′) )
State (s): the current state.
Next state (s′): after taking action (a) in state (s), the agent reaches s′.
Value (V): a numeric representation of a state that helps the agent find its path; V(s) here means the value of state s.
Reward (R): the treat the agent gets after performing an action (a).
• R(s): reward for being in the state s
• R(s, a): reward for being in the state s and performing an action a
• R(s, a, s′): reward for being in a state s, taking an action a and ending up in s′
e.g. a good reward can be +1, a bad reward can be -1, no reward can be 0.
Action (a): the set of possible actions that can be taken by the agent in the state (s), e.g. (LEFT, RIGHT, UP, DOWN).
Discount factor (γ): determines how much the agent cares about rewards in the distant future relative to those in the immediate future. It has a value between 0 and 1. A lower value encourages short-term rewards, while a higher value promotes long-term rewards.

The max denotes the optimal action among all the actions that the agent can take in a particular state, i.e., the action that leads to the reward when this process is repeated at every consecutive step.
For example:
The state to the left of the fire state (V = 0.9) can go UP, DOWN, or RIGHT, but NOT LEFT, because that is a wall (not accessible). Among all the available actions, the one giving the maximum value for that state is UP.
From its current starting state our agent can choose either action, UP or RIGHT, since both lead towards the reward in the same number of steps.
By using the Bellman equation, our agent calculates the value of every state except for the trophy and the fire state (V = 0); they cannot have values since they are the ends of the maze.
So, after building such a plan, our agent can easily accomplish its goal by simply following the increasing values.
Markov Reward Process
Markov Process or Markov Chains
A Markov process is a memoryless random process, i.e., a sequence of random states S[1], S[2], …, S[n] with the Markov property. So it is basically a sequence of states with the Markov property. It can be defined using a set of states (S) and a transition probability matrix (P); the dynamics of the environment can be fully defined using these two.
But what does "random process" mean?
To answer this question, let's look at an example:

The edges of the tree denote the transition probabilities. Let's take some samples from this chain. Suppose that we were sleeping; according to the probability distribution, there is a 0.6 chance that we will run, a 0.2 chance that we sleep more, and again a 0.2 chance that we will eat ice-cream. Similarly, we can think of other sequences that we can sample from this chain.
Some samples from the chain:
Sleep — Run — Ice-cream — Sleep
Sleep — Ice-cream — Ice-cream — Run
In the above two sequences, we see that we get a random set of states (S) every time we run the chain. That's why a Markov process is called a random set of sequences.
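A short sketch of this sampling process is below. Only the Sleep row (0.2 / 0.6 / 0.2) comes from the text above; the transition probabilities out of Run and Ice-cream are made-up placeholders.

```python
import random

# Transition probability matrix P for the Sleep / Run / Ice-cream chain.
# Only the "Sleep" row matches the text; the other rows are assumptions.
P = {
    "Sleep":     {"Sleep": 0.2, "Run": 0.6, "Ice-cream": 0.2},
    "Run":       {"Sleep": 0.1, "Run": 0.6, "Ice-cream": 0.3},
    "Ice-cream": {"Sleep": 0.7, "Run": 0.1, "Ice-cream": 0.2},
}

def sample_chain(start, length):
    """Draw a random sequence of states using the transition matrix P."""
    state, episode = start, [start]
    for _ in range(length):
        nxt = random.choices(list(P[state]), weights=list(P[state].values()))[0]
        episode.append(nxt)
        state = nxt
    return episode

print(" — ".join(sample_chain("Sleep", 3)))  # e.g. Sleep — Run — Ice-cream — Sleep
```

Running the sampler several times produces different sequences, which is exactly the "random set of states every time we run the chain" described above.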
Rewards and Returns: rewards are the numerical values that the agent receives on performing some action at some state (s) in the environment. The numerical value can be positive or negative based on the actions of the agent.
In reinforcement learning, we care about maximizing the cumulative reward (all the rewards the agent receives from the environment) instead of the reward the agent receives from the current state (also called the immediate reward).
We can define the return as the sum of the rewards collected from time step t onwards:
G[t] = r[t+1] + r[t+2] + … + r[T]
Here r[t+1] is the reward received by the agent at time step t[0] while performing an action (a) to move from one state to another. Similarly, r[t+2] is the reward received by the agent at time step t[1] by performing an action to move to another state. And r[T] is the reward received by the agent at the final time step by performing an action to move to another state.
Discount Factor (ɤ): it determines how much importance is to be given to the immediate reward versus future rewards. This basically helps us to avoid infinite returns in continuous tasks. It has a value between 0 and 1. A value close to 0 means that more importance is given to the immediate reward, and a value close to 1 means that more importance is given to future rewards. In practice, a discount factor of 0 will never learn, as it only considers the immediate reward, and a discount factor of 1 will keep accumulating future rewards, which may lead to infinity. Therefore, the optimal value for the discount factor lies between 0.2 and 0.8.
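As a small worked example, the discounted return weights each later reward by one more factor of ɤ: G[t] = r[t+1] + ɤ·r[t+2] + ɤ²·r[t+3] + … The sketch below computes it for an illustrative reward sequence and ɤ = 0.5; both numbers are assumptions chosen only to show the arithmetic.

```python
# Minimal sketch of the discounted return
#   G[t] = r[t+1] + gamma * r[t+2] + gamma^2 * r[t+3] + ...
# The reward list and gamma = 0.5 below are illustrative values only.

def discounted_return(rewards, gamma):
    """Sum rewards, weighting each later reward by one more factor of gamma."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

rewards = [0, 0, 1]                            # r[t+1], r[t+2], r[T]
print(discounted_return(rewards, gamma=0.5))   # 0 + 0.5*0 + 0.25*1 = 0.25
```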
Reinforcement learning is all about the goal of maximizing the reward. So, let's add rewards to our Markov chain. This gives us the Markov Reward Process (MRP): MRPs are Markov chains with value judgements. Basically, we get a value from every state our agent is in.

The reward function of an MRP, R[s] = E[ r[t+1] | S[t] = s ], tells us how much reward (R[s]) we get from a particular state S[t], i.e., the immediate reward from the particular state our agent is in. In the next story we will see how we maximize these rewards from each state our agent is in; in simple terms, maximizing the cumulative reward we get from each state.
We define an MRP as (S, P, R, ɤ), where: S is a set of states, P is the transition probability matrix, R is the reward function we saw earlier, and ɤ is the discount factor.
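Once an MRP (S, P, R, ɤ) is fixed, the state values satisfy V = R + ɤ·P·V and can be solved for directly. The sketch below reuses the Sleep / Run / Ice-cream states; the reward vector, ɤ = 0.9, and the probabilities outside the Sleep row are illustrative assumptions.

```python
import numpy as np

# Evaluate an MRP (S, P, R, gamma):
#   V = R + gamma * P * V   =>   V = (I - gamma * P)^-1 * R
# Rewards, gamma, and the non-Sleep rows of P are assumed for illustration.
states = ["Sleep", "Run", "Ice-cream"]
P = np.array([[0.2, 0.6, 0.2],
              [0.1, 0.6, 0.3],
              [0.7, 0.1, 0.2]])
R = np.array([-1.0, 2.0, 1.0])   # immediate reward R[s] for each state
gamma = 0.9

V = np.linalg.solve(np.eye(3) - gamma * P, R)
for s, v in zip(states, V):
    print(f"{s}: {v:.2f}")
```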
Function Approximation
• Function approximation in reinforcement learning involves approximating the
value function or policy function using a parametric model, such as a neural
network.
• Value function approximation estimates the expected return from a given state
or state-action pair using a parameterized function instead of a lookup table.
• Policy function approximation approximates the agent's behavior by mapping
states to actions using a parameterized model.

• Neural networks are commonly used as function approximators in reinforcement learning due to their ability to model complex relationships.
• The neural network takes a state or state-action pair as input and outputs the
estimated value or action probabilities.
• Training the neural network involves adjusting its parameters using
techniques like gradient descent to minimize the difference between predicted
and actual values or actions (a minimal sketch follows this list).
• Function approximation allows the agent to generalize its knowledge and
make informed decisions in similar states.
• Challenges in function approximation include balancing the trade-off between
underfitting and overfitting, where the model is either too simple or too
specific to the training data.
• Regularization techniques and careful selection of model architecture can help
mitigate underfitting and overfitting issues.
• Function approximation is used in various reinforcement learning
algorithms, such as Deep Q-Networks (DQN), Proximal Policy Optimization
(PPO), and Advantage Actor-Critic (A2C), to learn effective policies in
complex environments.
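Below is a minimal sketch of value-function approximation with a neural network, using PyTorch. The state size (4 features), the layer widths, and the randomly generated states and target returns are all illustrative assumptions; the point is only to show a parameterized value estimate being fit by gradient descent.

```python
import torch
import torch.nn as nn

# A small network mapping a state vector to an estimated value V(s).
# State size, layer widths, and the random training data are assumptions.
value_net = nn.Sequential(
    nn.Linear(4, 32),   # state described by 4 features
    nn.ReLU(),
    nn.Linear(32, 1),   # single output: estimated V(s)
)
optimizer = torch.optim.SGD(value_net.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

states = torch.randn(64, 4)    # a batch of (made-up) observed states
targets = torch.randn(64, 1)   # their (made-up) sampled returns

for step in range(100):
    predictions = value_net(states)        # estimated values for each state
    loss = loss_fn(predictions, targets)   # gap between predicted and actual
    optimizer.zero_grad()
    loss.backward()                        # gradient descent on the parameters
    optimizer.step()

print("final training loss:", loss.item())
```

Unlike a lookup table, the trained network can produce a value estimate for states it has never seen, which is the generalization benefit described above.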
Markov Game
A Markov game, also known as a stochastic game or multi-agent Markov decision
process (MDP), is an extension of the Markov decision process framework to
include multiple interacting agents. In reinforcement learning, a Markov game is
used to model environments where multiple agents make decisions simultaneously
and their actions affect each other's rewards and transitions.

• Multiple Agents: In a Markov game, there are two or more agents that interact
with each other and with the environment. Each agent selects actions based on its
own policy and receives rewards based on the joint actions taken by all agents.
• State and Action Spaces: Similar to a Markov decision process, a Markov
game consists of a state space and an action space. The state space represents
the possible configurations of the environment, and the action space
represents the available actions for each agent.
• Transitions and Rewards: The environment in a Markov game transitions from
one state to another based on the joint actions selected by the agents. The
transition probabilities and rewards depend on the joint action taken and the
current state.
• Joint Policy: In a Markov game, each agent has its own policy that maps its
observations to actions. However, the agents need to coordinate their actions
to achieve optimal outcomes. This coordination can be achieved through a
joint policy, which specifies how each agent's policy is combined to determine
the joint action.

• Nash Equilibrium: In Markov games, the notion of Nash equilibrium from game
theory is often used to analyze the optimal joint policy. A Nash equilibrium is a
set of joint policies where no agent can unilaterally improve its own reward by
changing its policy while all other agents keep their policies fixed (a toy
example follows this list).
• Learning in Markov Games: Reinforcement learning algorithms can be
extended to learn in Markov games. This involves agents updating their
policies based on their own observations and rewards, as well as the actions
and rewards of other agents. Techniques like multi-agent Q-learning, policy
gradient methods, and actor-critic algorithms can be applied in the context of
Markov games.
• Complexity: Markov games can introduce additional challenges due to the
increased complexity of interactions among agents. The state and action
spaces grow exponentially with the number of agents, making it more difficult
to learn optimal policies.
• Applications: Markov games find applications in various domains where
multiple agents interact, such as multi-robot systems, autonomous driving,
and strategic games.
• By modeling environments as Markov games, reinforcement learning
algorithms can handle scenarios with multiple interacting agents and learn
effective policies that take into account the interdependencies among agents'
actions and rewards.
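To make the Nash-equilibrium idea concrete, here is a toy sketch for the simplest possible Markov game: a single state with two agents and two actions each. The payoff numbers form an assumed coordination game and are not taken from the text; the check itself is the definition given above (no agent can improve by deviating unilaterally).

```python
import itertools

# rewards[(a1, a2)] = (reward to agent 1, reward to agent 2)
# These payoffs are an illustrative coordination game (an assumption).
rewards = {
    ("A", "A"): (2, 2),
    ("A", "B"): (0, 0),
    ("B", "A"): (0, 0),
    ("B", "B"): (1, 1),
}
actions = ["A", "B"]

def is_nash(a1, a2):
    """No agent can improve its own reward by deviating unilaterally."""
    r1, r2 = rewards[(a1, a2)]
    best_for_1 = all(rewards[(alt, a2)][0] <= r1 for alt in actions)
    best_for_2 = all(rewards[(a1, alt)][1] <= r2 for alt in actions)
    return best_for_1 and best_for_2

equilibria = [joint for joint in itertools.product(actions, actions) if is_nash(*joint)]
print("pure-strategy Nash equilibria:", equilibria)   # [('A', 'A'), ('B', 'B')]
```

A full Markov game extends this picture with multiple states and joint-action-dependent transitions, which is where multi-agent Q-learning and actor-critic methods come in.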
