The max denotes the optimal action among all the actions the agent can take in a particular state, i.e. the action that leads toward the reward when this process is repeated at every consecutive step.
For example:
The state to the left of the fire state (V = 0.9) can go UP, DOWN, or RIGHT, but NOT LEFT, because there is a wall there (not accessible). Among all the available actions, the maximum value for that state comes from the UP action.
From its starting state, our agent can choose either UP or RIGHT at random, since both lead toward the reward in the same number of steps.
By using the Bellman equation, our agent will calculate the value of every state except the trophy and the fire state (V = 0); these cannot have values since they are terminal states, the ends of the maze.
So, after making such a plan our agent can easily accomplish its goal by just
following the increasing values.
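The value propagation described above can be sketched with a tiny grid maze. The 3×4 layout, the placement of the trophy and fire states, and ɤ = 0.9 are illustrative assumptions, not the exact maze from the figure:

```python
# Minimal value-iteration sketch for a small deterministic grid maze.
# Layout, terminal-state placement, and gamma are illustrative assumptions.
GAMMA = 0.9
ROWS, COLS = 3, 4
TROPHY = (0, 3)   # terminal state, entering it gives reward +1
FIRE = (1, 3)     # terminal state, entering it gives reward -1
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # UP, DOWN, LEFT, RIGHT

def step(state, action):
    """Move deterministically; bumping into a wall leaves the state unchanged."""
    r, c = state[0] + action[0], state[1] + action[1]
    if 0 <= r < ROWS and 0 <= c < COLS:
        return (r, c)
    return state

def value_iteration(iters=50):
    V = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS)}
    for _ in range(iters):
        for s in V:
            if s in (TROPHY, FIRE):
                continue  # terminal states keep V = 0
            best = float('-inf')
            for a in ACTIONS:
                s2 = step(s, a)
                reward = 1.0 if s2 == TROPHY else (-1.0 if s2 == FIRE else 0.0)
                best = max(best, reward + GAMMA * V[s2])  # Bellman backup with max
            V[s] = best
    return V
```

After convergence the state to the left of the fire ends up with V = 0.9, exactly as in the example above, and the agent can reach the trophy by always stepping toward the higher value.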
Markov Reward Process
Markov Process or Markov Chains
A Markov process is a memoryless random process, i.e. a sequence of random states S[1], S[2], …, S[n] with the Markov property. So, it is basically a sequence of states with the Markov property. It can be defined using a set of states (S) and a transition probability matrix (P); together, these fully define the dynamics of the environment.
But what does "random process" mean?
To answer this question, let's look at an example:
The edges of the tree denote transition probabilities. Let's take some samples from this chain. Suppose we are sleeping; according to the probability distribution, there is a 0.6 chance that we will run, a 0.2 chance that we sleep more, and a 0.2 chance that we will eat ice-cream. Similarly, we can think of other sequences that we can sample from this chain.
Some samples from the chain :
Sleep — Run — Ice-cream — Sleep
Sleep — Ice-cream — Ice-cream — Run
In the above two sequences, we see that we get a random sequence of states (S) every time we run the chain. That is why a Markov process is called a random process.
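Sampling from the chain can be sketched in a few lines. The Sleep row matches the probabilities quoted above (0.2 sleep, 0.6 run, 0.2 ice-cream); the other rows are assumed values for illustration:

```python
import random

# Transition probability matrix for the three-state chain.
# Only the 'Sleep' row comes from the text; the others are assumptions.
P = {
    'Sleep':     {'Sleep': 0.2, 'Run': 0.6, 'Ice-cream': 0.2},
    'Run':       {'Sleep': 0.2, 'Run': 0.5, 'Ice-cream': 0.3},
    'Ice-cream': {'Sleep': 0.6, 'Run': 0.3, 'Ice-cream': 0.1},
}

def sample_chain(start, steps, rng=None):
    """Draw one random sequence of states from the chain."""
    rng = rng or random.Random(0)
    state, path = start, [start]
    for _ in range(steps):
        # The next state depends only on the current one: the Markov property.
        state = rng.choices(list(P[state]), weights=list(P[state].values()))[0]
        path.append(state)
    return path
```

Each call produces one sample sequence such as Sleep — Run — Ice-cream — Sleep, exactly the kind of random trajectory listed above.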
Rewards and Returns: Rewards are the numerical values that the agent receives on performing some action at some state(s) in the environment. The numerical value can be positive or negative, based on the actions of the agent.
In reinforcement learning, we care about maximizing the cumulative reward (all the rewards the agent receives from the environment) instead of only the reward the agent receives from the current state (also called the immediate reward).
We can define the return as the sum of rewards from time step t onward:
G[t] = r[t+1] + r[t+2] + … + r[T]
r[t+1] is the reward received by the agent at time step t while performing an action (a) to move from one state to another. Similarly, r[t+2] is the reward received by the agent at the next time step for performing an action to move to another state. And r[T] is the reward received by the agent at the final time step for performing an action to move to another state.
Discount Factor (ɤ): It determines how much importance is given to the immediate reward versus future rewards, which helps us avoid an infinite return in continuing tasks. It has a value between 0 and 1, giving the discounted return:
G[t] = r[t+1] + ɤ·r[t+2] + ɤ²·r[t+3] + …
A value of 0 means that all importance is given to the immediate reward, and a value of 1 means that future rewards count as much as immediate ones. In practice, an agent with a discount factor of 0 will never learn beyond the immediate reward, while a discount factor of 1 in a continuing task lets the return grow without bound. Therefore, a useful value for the discount factor typically lies between 0.2 and 0.8.
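The effect of ɤ on the return can be shown with a short helper; the reward sequence here is an arbitrary example:

```python
def discounted_return(rewards, gamma):
    """Compute G[t] = r[t+1] + gamma*r[t+2] + gamma^2*r[t+3] + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# An example sequence of four unit rewards:
# gamma = 0   -> only the immediate reward counts (G = 1.0)
# gamma = 1   -> every reward counts fully (G = 4.0)
# gamma = 0.5 -> future rewards shrink geometrically (G = 1.875)
rewards = [1.0, 1.0, 1.0, 1.0]
```

With an infinite stream of unit rewards and ɤ = 1, the sum diverges; any ɤ < 1 keeps it finite, which is exactly why the discount factor matters in continuing tasks.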
Reinforcement learning is all about the goal of maximizing the reward. So, let's add rewards to our Markov chain. This gives us a Markov Reward Process: a Markov chain with value judgements. Basically, we get a value from every state our agent is in.
R[s] = E[ r[t+1] | S[t] = s ]
What this equation means is how much reward (R[s]) we expect from a particular state S[t] = s. This tells us the immediate reward from the particular state our agent is in. As we will see in the next story, we maximize these rewards from each state our agent is in; in simple terms, we maximize the cumulative reward we get from each state.
We define an MRP as (S, P, R, ɤ), where: S is a set of states, P is the transition probability matrix, R is the reward function we saw earlier, and ɤ is the discount factor.
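Given the tuple (S, P, R, ɤ), the value of each state can be found by iterating the Bellman expectation backup V(s) = R(s) + ɤ·Σ P(s'|s)·V(s'). The chain below reuses the Sleep/Run/Ice-cream states; the rewards and most transition probabilities are assumed values for illustration:

```python
GAMMA = 0.5
STATES = ['Sleep', 'Run', 'Ice-cream']
# Transition matrix and per-state rewards (illustrative assumptions).
P = {
    'Sleep':     {'Sleep': 0.2, 'Run': 0.6, 'Ice-cream': 0.2},
    'Run':       {'Sleep': 0.2, 'Run': 0.5, 'Ice-cream': 0.3},
    'Ice-cream': {'Sleep': 0.6, 'Run': 0.3, 'Ice-cream': 0.1},
}
R = {'Sleep': -1.0, 'Run': 2.0, 'Ice-cream': 1.0}

def mrp_values(iters=100):
    """Iterate V(s) = R(s) + gamma * sum over s' of P(s'|s) * V(s')."""
    V = {s: 0.0 for s in STATES}
    for _ in range(iters):
        V = {s: R[s] + GAMMA * sum(P[s][s2] * V[s2] for s2 in STATES)
             for s in STATES}
    return V
```

Because ɤ < 1, the backup is a contraction and the iteration converges to the unique fixed point of the Bellman equation for this MRP.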
Function Approximation
• Function approximation in reinforcement learning involves approximating the
value function or policy function using a parametric model, such as a neural
network.
• Value function approximation estimates the expected return from a given state
or state-action pair using a parameterized function instead of a lookup table.
• Policy function approximation approximates the agent's behavior by mapping
states to actions using a parameterized model.
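A minimal sketch of value function approximation is TD(0) with a linear model, V(s) ≈ w·phi(s). The random-walk environment, one-hot features, and learning rate below are all illustrative assumptions; a neural network would simply replace the feature mapping:

```python
import random

# Linear value-function approximation on a 5-state random walk.
# States 1..5 are non-terminal; 0 and 6 end the episode.
N = 5
ALPHA, GAMMA = 0.1, 1.0

def phi(s):
    """One-hot feature vector; a parameterized model replaces the lookup table."""
    return [1.0 if i == s else 0.0 for i in range(1, N + 1)]

def v_hat(w, s):
    return sum(wi * xi for wi, xi in zip(w, phi(s)))

def td0(episodes=2000, rng=None):
    rng = rng or random.Random(0)
    w = [0.0] * N
    for _ in range(episodes):
        s = 3                                   # start in the middle
        while 1 <= s <= N:
            s2 = s + rng.choice([-1, 1])        # random-walk policy
            r = 1.0 if s2 == N + 1 else 0.0     # reward only at the right end
            terminal = not (1 <= s2 <= N)
            target = r if terminal else r + GAMMA * v_hat(w, s2)
            delta = target - v_hat(w, s)        # TD error
            w = [wi + ALPHA * delta * xi for wi, xi in zip(w, phi(s))]
            s = s2
    return w
```

The learned weights approach the true state values 1/6, 2/6, …, 5/6, showing that the parameterized function recovers what a lookup table would store.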
• Multiple Agents: In a Markov game, there are two or more agents that interact with each other and with the environment. Each agent selects actions based on its own policy and receives rewards based on the joint actions taken by all agents.
• State and Action Spaces: Similar to a Markov decision process, a Markov
game consists of a state space and an action space. The state space represents
the possible configurations of the environment, and the action space
represents the available actions for each agent.
• Transitions and Rewards: The environment in a Markov game transitions from
one state to another based on the joint actions selected by the agents. The
transition probabilities and rewards depend on the joint action taken and the
current state.
• Joint Policy: In a Markov game, each agent has its own policy that maps its
observations to actions. However, the agents need to coordinate their actions
to achieve optimal outcomes. This coordination can be achieved through a
joint policy, which specifies how each agent's policy is combined to determine
the joint action.
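The points above can be sketched as a tiny two-player Markov game. The zero-sum matching-pennies payoff and the toy transition rule are assumptions chosen only to show joint actions, joint rewards, and state transitions in one place:

```python
import random

# Minimal two-player Markov game: a state, joint actions, and rewards
# that depend on the pair of actions (zero-sum matching pennies).
ACTIONS = ['heads', 'tails']

def step(state, a1, a2):
    """Return (next_state, reward_1, reward_2) for the joint action (a1, a2)."""
    r1 = 1.0 if a1 == a2 else -1.0   # player 1 wins on a match
    next_state = (state + 1) % 2     # toy deterministic transition
    return next_state, r1, -r1       # zero-sum: player 2 gets the negative

def play(episodes=10, rng=None):
    rng = rng or random.Random(0)
    state, total1, total2 = 0, 0.0, 0.0
    for _ in range(episodes):
        # Each agent follows its own (here: uniformly random) policy.
        a1, a2 = rng.choice(ACTIONS), rng.choice(ACTIONS)
        state, r1, r2 = step(state, a1, a2)
        total1 += r1
        total2 += r2
    return total1, total2
```

Because the rewards depend on the joint action, neither agent can evaluate its policy in isolation; that coupling is exactly what distinguishes a Markov game from a single-agent MDP.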