RL & DL Notes

RL & DL NOTES
Unit 1
Reinforcement learning
Reinforcement Learning: An Overview
Reinforcement Learning (RL) is a branch of machine learning focused on making decisions to
maximize cumulative rewards in a given situation. Unlike supervised learning, which relies on a
training dataset with predefined answers, RL involves learning through experience. In RL, an
agent learns to achieve a goal in an uncertain, potentially complex environment by performing
actions and receiving feedback through rewards or penalties.
Key Concepts of Reinforcement Learning
● Agent: The learner or decision-maker.
● Environment: Everything the agent interacts with.
● State: A specific situation in which the agent finds itself.
● Action: All possible moves the agent can make.
● Reward: Feedback from the environment based on the action taken.
How Reinforcement Learning Works
RL operates on the principle of learning optimal behavior through trial and error. The agent takes
actions within the environment, receives rewards or penalties, and adjusts its behavior to
maximize the cumulative reward. This learning process is characterized by the following
elements:
● Policy: A strategy used by the agent to determine the next action based on the current
state.
● Reward Function: A function that provides a scalar feedback signal based on the
state and action.
● Value Function: A function that estimates the expected cumulative reward from a
given state.
● Model of the Environment: A representation of the environment that helps in

planning by predicting future states and rewards.
Example: Navigating a Maze
The problem is as follows: We have an agent and a reward, with many hurdles in between. The
agent is supposed to find the best possible path to reach the reward. The following problem
explains the problem more easily.
The above image shows the robot, diamond, and fire. The goal of the robot is to get the reward
that is the diamond and avoid the hurdles that are fired. The robot learns by trying all the
possible paths and then choosing the path which gives him the reward with the least hurdles.
Each right step will give the robot a reward and each wrong step will subtract the reward of the
robot. The total reward will be calculated when it reaches the final reward that is the diamond.
Main points in Reinforcement learning –

● Input: The input should be an initial state from which the model will start
● Output: There are many possible outputs as there are a variety of solutions to a
particular problem
● Training: The training is based upon the input, The model will return a state and the
user will decide to reward or punish the model based on its output.
● The model keeps continues to learn.
● The best solution is decided based on the maximum reward.

Difference between Reinforcement learning and Supervised learning:
Reinforcement learning Supervised learning
Reinforcement learning is all about making

In Supervised learning, the
decisions sequentially. In simple words, we can
decision is made on the initial
say that the output depends on the state of the
input or the input given at the
current input and the next input depends on the
start
output of the previous input
In supervised learning the

In Reinforcement learning decision is
decisions are independent of
dependent, So we give labels to sequences of
each other so labels are given to
dependent decisions
each decision.
Example: Object
Example: Chess game,text summarization
recognition,spam detetction
Types of Reinforcement:
1. Positive: Positive Reinforcement is defined as when an event, occurs due to a

particular behavior, increases the strength and the frequency of the behavior. In other
words, it has a positive effect on behavior.
Advantages of reinforcement learning are:
● Maximizes Performance
● Sustain Change for a long period of time

● Too much Reinforcement can lead to an overload of states which can
diminish the results
2. Negative: Negative Reinforcement is defined as strengthening of behavior because a

negative condition is stopped or avoided.
Advantages of reinforcement learning:
● Increases Behavior
● Provide defiance to a minimum standard of performance
● It Only provides enough to meet up the minimum behavior
Elements of Reinforcement Learning
i) Policy: Defines the agent’s behavior at a given time.

ii) Reward Function: Defines the goal of the RL problem by providing feedback.
iii) Value Function: Estimates long-term rewards from a state.
iv) Model of the Environment: Helps in predicting future states and rewards for
planApplication of Reinforcement Learnings
i) Robotics: Automating tasks in structured environments like manufacturing.
ii) Game Playing: Developing strategies in complex games like chess.
iii) Industrial Control: Real-time adjustments in operations like refinery controls.
iv) Personalized Training Systems: Customizing instruction based on individual needs.
Advantages and Disadvantages of Reinforcement Learning
Advantages:
1. Reinforcement learning can be used to solve very complex problems that cannot be solved by
conventional techniques.
2. The model can correct the errors that occurred during the training process.
3. In RL, training data is obtained via the direct interaction of the agent with the environment
4. Reinforcement learning can handle environments that are non-deterministic, meaning that the
outcomes of actions are not always predictable. This is useful in real-world applications where
the environment may change over time or is uncertain.
5. Reinforcement learning can be used to solve a wide range of problems, including those that
involve decision making, control, and optimization.
6. Reinforcement learning is a flexible approach that can be combined with other machine
learning techniques, such as deep learning, to improve performance.
Disadvantages:
1. Reinforcement learning is not preferable to use for solving simple problems.
2. Reinforcement learning needs a lot of data and a lot of computation
3. Reinforcement learning is highly dependent on the quality of the reward function. If the
reward function is poorly designed, the agent may not learn the desired behavior.
4. Reinforcement learning can be difficult to debug and interpret. It is not always clear why the
agent is behaving in a certain way, which can make it difficult to diagnose and fix problems.
Reinforcement Learning : Markov-Decision Process
In a typical Reinforcement Learning (RL) problem, there is a learner and a decision maker called
agent and the surrounding with which it interacts is called environment. The environment, in
return, provides rewards and a new state based on the actions of the agent. So, in reinforcement
learning, we do not teach an agent how it should do something but presents it with rewards whether
positive or negative based on its actions. So our root question for this blog is how we formulate
any problem in RL mathematically. This is where the Markov Decision Process(MDP) comes
in.
Typical Reinforcement Learning cycle
Before we answer our root question i.e. How we formulate RL problems mathematically (using
MDP), we need to develop our intuition about :
● The Agent-Environment relationship
● Markov Property
● Markov Process and Markov chains
● Markov Reward Process (MRP)
● Bellman Equation
● Markov Reward Process
The Agent-Environment Relationship
First let’s look at some formal definitions :
Agent : Software programs that make intelligent decisions and they are the learners in RL. These
agents interact with the environment by actions and receive rewards based on there actions.
Environment :It is the demonstration of the problem to be solved.Now, we can have a real-world
environment or a simulated environment with which our agent will interact.
Demonstrating an environment with which agents are interacting.
State : This is the position of the agents at a specific time-step in the environment.So,whenever an
agent performs a action the environment gives the agent reward and a new state where the agent
reached by performing the action.
Anything that the agent cannot change arbitrarily is considered to be part of the environment.
In simple terms, actions can be any decision we want the agent to learn and state can be
anything which can be useful in choosing actions. We do not assume that everything in the
environment is unknown to the agent, for example, reward calculation is considered to be the
part of the environment even though the agent knows a bit on how it’s reward is calculated as a
function of its actions and states in which they are taken. This is because rewards cannot be
arbitrarily changed by the agent. Sometimes, the agent might be fully aware of its environment
but still finds it difficult to maximize the reward as like we might know how to play Rubik’s
cube but still cannot solve it. So, we can safely say that the agent-environment relationship
represents the limit of the agent control and not it’s knowledge.
The Markov Property
Transition : Moving from one state to another is called Transition.
Transition Probability: The probability that the agent will move from one state to another is called
transition probability.
The Markov Property state that :
“Future is Independent of the past given the present”
Mathematically we can express this statement as :
Markov Property
S[t] denotes the current state of the agent and s[t+1] denotes the next state. What this equation
means is that the transition from state S[t] to S[t+1] is entirely independent of the past. So, the
RHS of the Equation means the same as LHS if the system has a Markov Property. Intuitively
meaning that our current state already captures the information of the past states.
State Transition Probability :
As we now know about transition probability we can define state Transition Probability as
follows :
For Markov State from S[t] to S[t+1] i.e. any other successor state , the state transition
probability is given by
State Transition Probability
We can formulate the State Transition probability into a State Transition probability matrix by :
State Transition Probability Matrix
Each row in the matrix represents the probability from moving from our original or starting state
to any successor state.Sum of each row is equal to 1.
Markov Process or Markov Chains
Markov Process is the memory less random process i.e. a sequence of a random state S[1],S[2],
….S[n] with a Markov Property.So, it’s basically a sequence of states with the Markov
Property.It can be defined using a set of states(S) and transition probability matrix (P).The
dynamics of the environment can be fully defined using the States(S) and Transition Probability
matrix(P).
But what random process means ?
To answer this question let’s look at a example:

Markov chain
The edges of the tree denote transition probability. From this chain let’s take some sample.
Now, suppose that we were sleeping and the according to the probability distribution there is a
0.6 chance that we will Run and 0.2 chance we sleep more and again 0.2 that we will eat ice-
cream. Similarly, we can think of other sequences that we can sample from this chain.
Some samples from the chain :
● Sleep — Run — Ice-cream — Sleep
● Sleep — Ice-cream — Ice-cream — Run

In the above two sequences what we see is we get random set of States(S) (i.e. Sleep,Ice-
cream,Sleep ) every time we run the chain.Hope, it’s now clear why Markov process is called
random set of sequences.
Rewards are the numerical values that the agent receives on performing some action at some
state(s) in the environment. The numerical value can be positive or negative based on the actions of
the agent.
In Reinforcement learning, we care about maximizing the cumulative reward (all the rewards
agent receives from the environment) instead of, the reward agent receives from the current
state(also called immediate reward). This total sum of reward the agent receives from the
environment is called returns.
We can define Returns as :
Returns (Total rewards from the environment)
r[t+1] is the reward received by the agent at time step t[0] while performing an action(a) to move
from one state to another. Similarly, r[t+2] is the reward received by the agent at time step t[1]
by performing an action to move to another state. And, r[T] is the reward received by the agent
by at the final time step by performing an action to move to another state.
Episodic and Continuous Tasks
Episodic Tasks: These are the tasks that have a terminal state (end state).We can say they have
finite states. For example, in racing games, we start the game (start the race) and play it until the
game is over (race ends!). This is called an episode. Once we restart the game it will start from an
initial state and hence, every episode is independent.
Continuous Tasks : These are the tasks that have no ends i.e. they don’t have any terminal
state.These types of tasks will never end.For example, Learning how to code!
Now, it’s easy to calculate the returns from the episodic tasks as they will eventually end but
what about continuous tasks, as it will go on and on forever. The returns from sum up to infinity!
So, how we define returns for continuous tasks?
This is where we need Discount factor(ɤ).
Discount Factor (ɤ): It determines how much importance is to be given to the immediate reward
and future rewards. This basically helps us to avoid infinity as a reward in continuous tasks. It has
a value between 0 and 1. A value of 0 means that more importance is given to the immediate
reward and a value of 1 means that more importance is given to future rewards. In practice, a
discount factor of 0 will never learn as it only considers immediate reward and a discount factor of 1
will go on for future rewards which may lead to infinity. Therefore, the optimal value for the
discount factor lies between 0.2 to 0.8.
So, we can define returns using discount factor as follows :(Let’s say this is equation 1 ,as we
are going to use this equation in later for deriving Bellman Equation)
Returns using discount factor
Let’s understand it with an example,suppose you live at a place where you face water scarcity so
if someone comes to you and say that he will give you 100 liters of water!(assume please!) for
the next 15 hours as a function of some parameter (ɤ).Let’s look at two possibilities : (Let’s say
this is equation 1 ,as we are going to use this equation in later for deriving Bellman Equation)
One with discount factor (ɤ) 0.8 :

Discount Factor (0.8)
This means that we should wait till 15th hour because the decrease is not very significant , so it’s
still worth to go till the end.This means that we are also interested in future rewards.So, if the
discount factor is close to 1 then we will make a effort to go to end as the reward are of
significant importance.
Second, with discount factor (ɤ) 0.2 :
Discount Factor (0.2)
This means that we are more interested in early rewards as the rewards are getting significantly
low at hour.So, we might not want to wait till the end (till 15th hour) as it will be worthless.So, if
the discount factor is close to zero then immediate rewards are more important that the future.
So which value of discount factor to use ?

It depends on the task that we want to train an agent for. Suppose, in a chess game, the goal is to
defeat the opponent’s king. If we give importance to the immediate rewards like a reward on
pawn defeat any opponent player then the agent will learn to perform these sub-goals no matter if
his players are also defeated. So, in this task future rewards are more important. In some, we
might prefer to use immediate rewards like the water example we saw earlier.
Markov Reward Process
Till now we have seen how Markov chain defined the dynamics of a environment using set of
states(S) and Transition Probability Matrix(P).But, we know that Reinforcement Learning is all
about goal to maximize the reward.So, let’s add reward to our Markov Chain.This gives us
Markov Reward Process.
Markov Reward Process : As the name suggests, MDPs are the Markov chains with values
judgement.Basically, we get a value from every state our agent is in.
Mathematically, we define Markov Reward Process as :
Markov Reward Process

What this equation means is how much reward (Rs) we get from a particular state S[t]. This tells
us the immediate reward from that particular state our agent is in. As we will see in the next story
how we maximize these rewards from each state our agent is in. In simple terms, maximizing the
cumulative reward we get from each state.
We define MRP as (S,P, R,ɤ) , where :
● S is a set of states,
● P is the Transition Probability Matrix,
● R is the Reward function, we saw earlier,
● ɤ is the discount factor
Markov Decision Process
Now, let’s develop our intuition for Bellman Equation and Markov Decision Process.
Policy Function and Value Function
Value Function determines how good it is for the agent to be in a particular state. Of
course, to determine how good it will be to be in a particular state it must depend on some
actions that it will take. This is where policy comes in. A policy defines what actions to perform
in a particular state s.
A policy is a simple function, that defines a probability distribution over Actions (a∈ A)
for each state (s ∈ S). If an agent at time t follows a policy π then π(a|s) is the
probability that the agent with taking action (a ) at a particular time step (t).In Reinforcement
Learning the experience of the agent determines the change in policy. Mathematically, a policy is
defined as follows :
Policy Function
Now, how do we find a value of a state. The value of state s, when the agent is following a
policy π which is denoted by vπ(s) is the expected return starting from s and following a policy π
for the next states until we reach the terminal state. We can formulate this as :(This function is
also called State-value Function)

Value Function
This equation gives us the expected returns starting from the state(s) and going to successor
states thereafter, with the policy π. One thing to note is the returns we get is stochastic whereas
the value of a state is not stochastic. It is the expectation of returns from start state s and
thereafter, to any other state. And also note that the value of the terminal state (if there is any) is
zero. Let’s look at an example :

Example
Suppose our start state is Class 2, and we move to Class 3 then Pass then Sleep.In short, Class 2
> Class 3 > Pass > Sleep.
Our expected return is with a discount factor of 0.5:

Calculating the Value of Class 2
Note:It’s -2 + (-2 * 0.5) + 10 * 0.25 + 0 instead of -2 * -2 * 0.5 + 10 * 0.25 + 0.Then the value
of Class 2 is -0.5 .
Bellman Equation for Value Function
Bellman Equation helps us to find optimal policies and value functions. We know that our
policy changes with experience so we will have different value functions according to different
policies. The optimal value function is one that gives maximum value compared to all other
value functions.
Bellman Equation states that value function can be decomposed into two parts:
● Immediate Reward, R[t+1]
● Discounted value of successor states,
Mathematically, we can define Bellman Equation as :

Bellman Equation for Value Function
Let’s understand what this equation says with a help of an example :
Suppose, there is a robot in some state (s) and then he moves from this state to some other state
(s’). Now, the question is how good it was for the robot to be in the state(s). Using the
Bellman equation, we can that it is the expectation of reward it got on leaving the state(s) plus
the value of the state (s’) he moved to.
Let’s look at another example :
Backup Diagram
We want to know the value of state s. The value of state(s) is the reward we got upon leaving
that state, plus the discounted value of the state we landed upon multiplied by the transition
probability that we will move into it.
Value Calculation
The above equation can be expressed in matrix form as follows :
Bellman Linear Equation
Where v is the value of state we were in, which is equal to the immediate reward plus the
discounted value of the next state multiplied by the probability of moving into that state.
The running time complexity for this computation is O(n³). Therefore, this is clearly not a
practical solution for solving larger MRPs (same for MDPs).In later Blogs, we will look at
more efficient methods like Dynamic Programming (Value iteration and Policy iteration),
Monte-Claro methods, and TD-Learning.
We are going to talk about the Bellman Equation in much more detail in the next story.
What is Markov Decision Process ?
Markov Decision Process : It is Markov Reward Process with a decisions.Everything is same like
MRP but now we have actual agency that makes decisions or take actions.
It is a tuple of (S, A, P, R, 𝛾) where:
● S is a set of states,
● A is the set of actions agent can choose to take,
● P is the transition Probability Matrix,
● R is the Reward accumulated by the actions of the agent,
● 𝛾 is the discount factor.
P and R will have slight change w.r.t actions as follows :

Transition Probability Matrix
Transition Probability Matrix w.r.t action
Reward Function
Reward Function w.r.t action
Now, our reward function is dependent on the action.
Till now we have talked about getting a reward (r) when our agent goes through a set of states (s)
following a policy π. Actually, in Markov Decision Process(MDP) the policy is the mechanism
to take decisions. So now we have a mechanism that will choose to take an action.
Policies in an MDP depend on the current state. They do not depend on history. That’s the
Markov Property. So, the current state we are in characterizes history.
We have already seen how good it is for the agent to be in a particular state(State-value
function). Now, let’s see how good it is to take a particular action following a policy π from state
s (Action-Value Function).
State-action value function or Q-Function
This function specifies how good it is for the agent to take action (a) in a state (s) with a policy
π.
Mathematically, we can define the State-action value function as :
State-action value function
Basically, it tells us the value of performing a certain action(a) in a state(s) with a policy π.
Let’s look at an example of the Markov Decision Process :

Example of MDP
Now, we can see that there are no more probabilities. In fact, now our agent has choices to make
like after waking up, we can choose to watch Netflix or code and debug. Of course, the actions
of the agent are defined w.r.t some policy π and will get the reward accordingly.
Exploration and Exploitation are methods for building effective learning algorithms that can
adapt and perform optimally in different environments. This article focuses on exploitation and
exploration in machine learning, and it elucidates various techniques involved.

Understanding Exploitation
Exploitation is a strategy of using the accumulated knowledge to make decisions that maximize
the expected reward based on the present information. The focus of exploitation is on utilizing
what is already known about the environment and achieving the best outcome using that
information. The key aspects of exploitation include:
1. Reward Maximization: Maximizing the immediate or short-term reward based on

the current understanding of the environment is the main objective of exploitation.
This is choosing courses of action based on learned values or rewards that the model
predicts will yield the highest expected payoff.
2. Decision Efficiency: Exploitation can often make more efficient decisions by

concentrating on known high-reward actions, which lowers the computational and
temporal costs associated with exploration.
3. Risk Aversion: Exploitation inherently involves a lower level of risk as it relies on

tried and tested actions, avoiding the uncertainty associated with less familiar options.
Nowadays, Deep Reinforcement Learning (RL) is one of the hottest topics in the Data Science
community. The fast development of RL has resulted in the growing demand for easy to
understand and convenient to use RL tools.
In recent years, plenty of RL libraries have been developed. These libraries were designed to
have all the necessary tools to both implement and test Reinforcement Learning models.
Still, they differ quite a lot. That’s why it is important to pick a library that will be quick, reliable,
and relevant for your RL task.
In this article we will cover:
● Criteria for choosing Deep Reinforcement Learning library,
● RL libraries: Pyqlearning, KerasRL, Tensorforce, RL_Coach, TFAgents, MAME RL,
MushroomRL.
Python libraries for Reinforcement Learning

There are a lot of RL libraries, so choosing the right one for your case might be a complicated
task. We need to form criteria to evaluate each library.
Criteria
Each RL library in this article will be analyzed based on the following criteria:
1. Number of state-of-the-art (SOTA) RL algorithms implemented – the most important one
in my opinion
2. Official documentation, availability of simple tutorials and examples
3. Readable code that is easy to customize
4. Number of supported environments – a crucial decision factor for Reinforcement
Learning library
5. Logging and tracking tools support – for example, Neptune or TensorBoard
6. Vectorized environment (VE) feature – method to do multiprocess training. Using parallel
environments, your agent will experience way more situations than with one environment
7. Regular updates – RL develops quite rapidly and you want to use up-to-date
technologies
We will talk about the following libraries:
KerasRL
KerasRL is a Deep Reinforcement Learning Python library. It implements some state-of-the-art
RL algorithms, and seamlessly integrates with Deep Learning library Keras.
Moreover, KerasRL works with OpenAI Gym out of the box. This means you can evaluate and
play around with different algorithms quite easily.
To install KerasRL simply use a pip command:
pip install keras-rl
Let’s see if KerasRL fits the criteria:
1. Number of SOTA RL algorithms implemented
As of today KerasRL has the following algorithms implemented:
● Deep Q-Learning (DQN) and its improvements (Double and Dueling)
● Deep Deterministic Policy Gradient (DDPG)
● Continuous DQN (CDQN or NAF)
● Cross-Entropy Method (CEM)
● Deep SARSA
As you may have noticed, KerasRL misses two important agents: Actor-Critic Methods and
Proximal Policy Optimization (PPO).
2. Official documentation, availability of tutorials and examples
The code is easy to read and it’s full of comments, which is quite useful. Still, the documentation
seems incomplete as it misses the explanation of parameters and tutorials. Also, practical
examples leave much to be desired.
Very easy. All you need to do is to create a new agent following the example and then add it to
rl.agents.
4. Number of supported environments
KerasRL was made to work only with OpenAI Gym. Therefore you need to modify the agent if
you want to use any other environment.
5. Logging and tracking tools support
Logging and tracking tools support is not implemented. Nevertheless, you can use Neptune to
track your experiments.
6. Vectorized environment feature
Includes a vectorized environment feature.
7. Regular updates
The library seems not to be maintained anymore as the last updates were more than a year
ago.
To sum up, KerasRL has a good set of implementations. Unfortunately, it misses valuable
points such as visualization tools, new architectures and updates. You should probably use
another library.
Pyqlearning
Pyqlearning is a Python library to implement RL. It focuses on Q-Learning and multi-agent Deep
Q-Network.
Pyqlearning provides components for designers, not for end user state-of-the-art black boxes.
Thus, this library is a tough one to use. You can use it to design the information search
algorithm, for example, GameAI or web crawlers.
To install Pyqlearning simply use a pip command:
pip install pyqlearning
Let’s see if Pyqlearning fits the criteria:
As of today Pyqlearning has the following algorithms implemented:
● Deep Q-Learning (DQN) and its improvements (Epsilon Greedy and Boltzmann)
As you may have noticed, Pyqlearning has only one important agent. The library leaves much to
be desired.
Pyqlearning has a couple of examples for various tasks and two tutorials featuring Maze Solving
and the pursuit-evasion game by Deep Q-Network. You may find them in the official
documentation. The documentation seems incomplete as it focuses on the math, and not the
library’s description and usage.
Pyqlearning is an open-source library. Source code can be found on Github. The code lacks
comments. It may be a complicated task to customize it. Still, the tutorials might help.
Since the library is agnostic, it’s relatively easy to add to any environment.
The author uses a simple logging package in the tutorials. Pyqlearning does not support other
logging and tracking tools, for example, TensorBoard.
Pyqlearning does not support Vectorized environment feature.
7. Regular updates
The library is maintained. The last update was made two months ago. Still, the development
process seems to be a slow-going one.
To sum up, Pyqlearning leaves much to be desired. It is not a library that you will use
commonly. Thus, you should probably use something else.
Tensorforce
Tensorforce is an open-source Deep RL library built on Google’s Tensorflow framework. It’s

straightforward in its usage and has a potential to be one of the best Reinforcement Learning
libraries.
Tensorforce has key design choices that differentiate it from other RL libraries:
● Modular component-based design: Feature implementations, above all, tend to be as
generally applicable and configurable as possible.
● Separation of RL algorithm and application: Algorithms are agnostic to the type and
structure of inputs (states/observations) and outputs (actions/decisions), as well as the
interaction with the application environment.
To install Tensorforce simply use a pip command:
pip install tensorforce
Let’s see if Tensorforce fits the criteria:
As of today, Tensorforce has the following set of algorithms implemented:
● Deep Q-Learning (DQN) and its improvements (Double and Dueling)
● Vanilla Policy Gradient (PG)
● Continuous DQN (CDQN or NAF)
● Actor Critic (A2C and A3C)
● Trust Region Policy Optimization (TRPO)
● Proximal Policy Optimization (PPO)
As you may have noticed, Tensorforce misses the Soft Actor Critic (SAC) implementation.
Besides that it is perfect.
It is quite easy to start using Tensorforce thanks to the variety of simple examples and tutorials.
The official documentation seems complete and convenient to navigate through.
Tensorforce benefits from its modular design. Each part of the architecture, for example,
networks, models, runners is distinct. Thus, you can easily modify them. However, the code
lacks comments and that could be a problem.
Tensorforce works with multiple environments, for example, OpenAI Gym, OpenAI Retro and
DeepMind Lab. It also has documentation to help you plug into other environments.
The library supports TensorBoard and other logging/tracking tools.
Tensorforce supports Vectorized environment feature.
7. Regular updates
Tensorforce is regularly updated. The last update was just a few weeks ago.
To sum up, Tensorforce is a powerful RL tool. It is up-to-date and has all necessary
documentation for you to start working with it.
RL_Coach
Reinforcement Learning Coach (Coach) by Intel AI Lab is a Python RL framework containing

many state-of-the-art algorithms.
It exposes a set of easy-to-use APIs for experimenting with new RL algorithms. The
components of the library, for example, algorithms, environments, neural network architectures
are modular. Thus, extending and reusing existent components is fairly painless.
To install Coach simply use a pip command.
pip install rl_coach
Still, you should check the official installation tutorial as a few prerequisites are required.
Let’s see if Coach fits the criteria:
As of today, RL_Coach has the following set of algorithms implemented:
● Actor-Critic
● ACER
● Behavioral Cloning
● Bootstrapped DQN
● Categorical DQN
● Conditional Imitation Learning
● Clipped Proximal Policy Optimization
● Deep Deterministic Policy Gradient
● Direct Future Prediction
● Double DQN
● Deep Q Networks
● Dueling DQN
● Mixed Monte Carlo
● N-Step Q Learning
● Normalized Advantage Functions
● Neural Episodic Control
● Persistent Advantage Learning
● Policy Gradient
● Proximal Policy Optimization
● Rainbow
● Quantile Regression DQN
● Soft Actor-Critic
● Twin Delayed Deep Deterministic Policy Gradient
● Wolpertinger
As you may have noticed, RL_Coach has a variety of algorithms. It’s the most complete library
of all covered in this article.
The documentation is complete. Also, RL_Coach has a set of valuable tutorials. It will be easy
for newcomers to start working with it.
RL_Coach is the open-source library. It benefits from the modular design, but the code lacks
comments. It may be a complicated task to customize it.
Coach supports the following environments:
● OpenAI Gym
● ViZDoom
● Roboschool
● GymExtensions
● PyBullet
● CARLA
● And other
For more information including installation and usage instructions please refer to official
documentation.
Coach supports various logging and tracking tools. It even has its own visualization dashboard.
RL_Coach supports Vectorized environment feature. For usage instructions please refer to the
documentation.
7. Regular updates
The library seems to be maintained. However, the last major update was almost a year ago.
To sum up, RL_Coach has a perfect up-to-date set of algorithms implemented. And it’s
newcomer friendly. I would strongly recommend Coach.
TFAgents
TFAgents is a Python library designed to make implementing, deploying, and testing RL
algorithms easier. It has a modular structure and provides well-tested components that can be
easily modified and extended.
TFAgents is currently under active development, but even the current set of components makes
it the most promising RL library.
To install TFAgents simply use a pip command:
pip install tf-agents
Let’s see if TFAgents fits the criteria:
As of today, TFAgents has the following set of algorithms implemented:
● Deep Q-Learning (DQN) and its improvements (Double)
● TD3
● REINFORCE
● Proximal Policy Optimization (PPO)
● Soft Actor Critic (SAC)
Overall, TFAgents has a great set of algorithms implemented.
TFAgents has a series of tutorials on each major component. Still, the official documentation
seems incomplete, I would even say there is none. However, the tutorials and simple examples
do their job, but the lack of well-written documentation is a major disadvantage.
The code is full of comments and the implementations are very clean. TFAgents seems to have
the best library code.
The library is agnostic. That is why it’s easy to plug it into any environment.
Logging and tracking tools are supported.
Vectorized environment is supported.
7. Regular updates
As mentioned above, TFAgents is currently under active development. The last update was
made just a couple of days ago.
To sum up, TFAgents is a very promising library. It already has all necessary tools to start
working with it. I wonder what it will look like when the development is over.
Stable Baselines
Stable Baselines is a set of improved implementations of Reinforcement Learning (RL)

algorithms based on OpenAI Baselines. The OpenAI Baselines library was not good. That’s why
Stable Baselines was created.
Stable Baselines features unified structure for all algorithms, a visualization tool and excellent
documentation.
To install Stable Baselines simply use a pip command.
pip install story-baselines
Still, you should check the official installation tutorial as a few prerequisites are required.
Let’s see if Stable Baselines fits the criteria:
As of today, Stable Baselines has the following set of algorithms implemented:
● A2C
● ACER
● ACKTR
● DDPG
● DQN
● HER
● GAIL
● PPO1 and PPO2
● SAC
● TD3
● TRPO
Overall, Stable Baselines has a great set of algorithms implemented.
The documentation is complete and excellent. The set of tutorials and examples is also really
helpful.
On the other hand, modifying the code can be tricky. But because Stable Baselines provides a
lot of useful comments in the code and awesome documentation, the modification process will
be less complex.
Stable Baselines provides good documentation about how to plug into your custom
environment, however, you need to do it using OpenAI Gym.
Stable Baselines has the TensorBoard support implemented.
Vectorized environment feature is supported by a majority of the algorithms. Please check the
documentation in case you want to learn more.
7. Regular updates
The last major updates were made almost two years ago, but the library is maintained as the
documentation is regularly updated.
To sum up, Stable Baselines is a library with a great set of algorithms and awesome
documentation. You should consider using it as your RL tool.
MushroomRL
MushroomRL is a Python Reinforcement Learning library whose modularity allows you to use
well-known Python libraries for tensor computation and RL benchmarks.
It enables RL experiments providing classical RL algorithms and deep RL algorithms. The idea
behind MushroomRL consists of offering the majority of RL algorithms, providing a common
interface in order to run them without doing too much work.
To install MushroomRL simply use a pip command.
pip install mushroom_rl
Let’s see if MushroomRL fits the criteria:
As of today, MushroomRL has the following set of algorithms implemented:
● Q-Learning
● SARSA
● FQI
● DQN
● DDPG
● SAC
● TD3
● TRPO
● PPO
Overall, MushroomRL has everything you need to work on RL tasks.
The official documentation seems incomplete. It misses valuable tutorials, and simple examples
leave much to be desired.
The code lacks comments and parameter description. It’s really hard to customize it. Although
MushroomRL never positioned itself as a library that is easy to customize.
MushroomRL supports the following environments:
● OpenAI Gym
● DeepMind Control Suite
● MuJoCo
For more information including installation and usage instructions please refer to official
documentation.
MushroomRL supports various logging and tracking tools. I would recommend using
TensorBoard as the most popular one.
Vectorized environment feature is supported.
7. Regular updates
The library is maintained. The last updates were made just a few weeks ago.
To sum up, MushroomRL has a good set of algorithms implemented. Still, it misses tutorials and
examples which are crucial when you start to work with a new library.
RLlib
“RLlib is an open-source library for reinforcement learning that offers both high scalability and a
unified API for a variety of applications. RLlib natively supports TensorFlow, TensorFlow Eager,
and PyTorch, but most of its internals are framework agnostic.” ~ Website
1. Number of state-of-the-art (SOTA) RL algorithms implemented
RLlib implements them ALL! PPO? It’s there. A2C and A3C? Yep. DDPG, TD3, SAC? Of
course! DQN, Rainbow, APEX??? Yes, in many shapes and flavours! Evolution
Strategies, IMPALA, Dreamer, R2D2, APPO, AlphaZero, SlateQ, LinUCB, LinTS,
MADDPG, QMIX, … Stop it! I’m not sure if you make up these acronyms. Nonetheless,
yes, RLlib has them ALL. See the full list here.
RLlib has comprehensive documentation with many examples. Its code is also well
commented.
It’s easiest to customize RLlib with callbacks. Although RLlib is open-sourced and you
can edit the code, it’s not a straightforward thing to do. RLlib codebase is quite
complicated because of its size and many layers of abstractions. Here is a guide that
should help you with that if you want to e.g. add a new algorithm.
RLlib works with several different types of environments, including OpenAI Gym, user-
defined, multi-agent, and also batched environments. Here you’ll find more.
RLlib has extensive logging features. RLlib will print logs to the standard output
(command line). You can also access the logs (and manage jobs) in Ray Dashboard. In
this post, I described how to extend RLlib logging to send metrics to Neptune. It also
describes different logging techniques. I highly recommend reading it!
6. Vectorized environment (VE) feature
Yes, see here. Moreover, it’s possible to distribute the training among multiple compute
nodes e.g. on the cluster.
7. Regular updates
RLlib is maintained and actively developed.
From my experience, RLlib is a very powerful framework that covers many applications and at
the same time remains quite easy to use. That being said, because of the many layers of
abstractions, it’s really hard to extend with your code as it’s hard to find where you should even
put your code! That’s why I would recommend it for developers that look for training the models
for production and not for researchers that have to rapidly change algorithms and implement
new features.
Dopamine
“Dopamine is a research framework for fast prototyping of reinforcement learning algorithms. It
aims to fill the need for a small, easily grokked codebase in which users can freely experiment
with wild ideas (speculative research).” ~ GitHub
It focuses on supporting the state-of-the-art, single-GPU DQN, Rainbow, C51, and IQN
agents. Their Rainbow agent implements the three components identified as most
important by Hessel et al.:
● n-step Bellman updates (see e.g. Mnih et al., 2016)
● Prioritized experience replay (Schaul et al., 2015)
● Distributional reinforcement learning (C51; Bellemare et al., 2017)
Concise documentation is available in the GitHub repo here. It’s not a very popular
framework, so it may lack tutorials. However, the authors provide colabs with many
examples of training and visualization.
The authors’ design principles are:
● Easy experimentation: Make it easy for new users to run benchmark
experiments.
● Flexible development: Make it easy for new users to try out research ideas.
● Compact and reliable: Provide implementations for a few, battle-tested
algorithms.
● Reproducible: Facilitate reproducibility in results. In particular, their setup follows
the recommendations given by Machado et al. (2018).
It’s mainly thought for the Atari 2600 game-playing. It supports OpenAI Gym.
It supports TensorBoard logging and provides some other visualization tools, presented
in colabs, like recording video of an agent play and seaborn plotting.
No vectorized environments support.
7. Regular updates
Dopamine is maintained.
If you look for a customizable framework with well-tested DQN based algorithms, then this may
be your pick. Under the hood, it runs using TensorFlow or JAX.
SpinningUp
“While fantastic repos like garage, Baselines, and rllib make it easier for researchers who are
already in the field to make progress, they build algorithms into frameworks in ways that involve
many non-obvious choices and trade-offs, which makes them hard to learn from. […] The
algorithm implementations in the Spinning Up repo are designed to be:
● as simple as possible while still being reasonably good,
● and highly consistent with each other to expose fundamental similarities between
algorithms.
They are almost completely self-contained, with virtually no common code shared between
them (except for logging, saving, loading, and MPI utilities), so that an interested person can
study each algorithm separately without having to dig through an endless chain of
dependencies to see how something is done. The implementations are patterned so that they
come as close to pseudocode as possible, to minimize the gap between theory and code.” ~
Website
VPG, PPO, TRPO, DDPG, TD3, SAC
Great documentation and education materials with multiple examples.
This code is highly readable. From my experience, it’s the most readable framework you
can find there. Every algorithm is contained in its own two, well-commented files.
Because of it, it’s also as easy as it can be to modify it. On the other hand, it’s harder to
maintain for the same reason. If you add something to one algorithm you have to
manually add it to others too.
It supports the OpenAI Gym environments out of the box and relies on its API. So you
can extend it to use other environments that conform to this API.
It has a light logger that prints metrics to the standard output (cmd) and saves them to a
file. I’ve written the post on how to add the Neptune support to SpinUp.
7. Regular updates
SpinningUp is maintained.
Although it was created as an educational resource, the code simplicity and state-of-the-art
results make it a perfect framework for fast prototyping your research ideas. I use it in my own
research and even implement new algorithms in it using the same code structure. Here you can
find a port of SpinningUp code to the TensorFlow v2 from me and my colleagues from
AwareLab.
garage
“garage is a toolkit for developing and evaluating reinforcement learning algorithms, and an
accompanying library of state-of-the-art implementations built using that toolkit. […] The most
important feature of garage is its comprehensive automated unit test and benchmarking suite,
which helps ensure that the algorithms and modules in garage maintain state-of-the-art
performance as the software changes.” ~ GitHub
All major RL algorithms (VPG, PPO, TRPO, DQN, DDPG, TD3, SAC, …), with their
multi-task versions (MT-PPO, MT-TRPO, MT-SAC), meta-RL algorithms (Task
embedding, MAML, PEARL, RL2, …), evolutional strategy algorithms (CEM, CMA-ES),
and behavioural cloning.
Comprehensive documentation included with many examples and some tutorials of e.g.
how to add a new environment or implement a new algorithm.
It’s created as a flexible and structured tool for developing, experimenting and evaluating
algorithms. It provides a scaffold for adding new methods.
Garage supports a variety of external environment libraries for different RL training
purposes including OpenAI Gym, DeepMind DM Control, MetaWorld, and PyBullet. You
should be able to easily add your own environments.
The garage logger supports many outputs including std. output (cmd), plain text files,
CSV files, and TensorBoard.
It supports vectorized environments and even allows one to distribute the training on the
cluster.
7. Regular updates
garage is maintained.
garage is similar to RLlib. It’s a big framework with distributed execution, supporting many
additional features like Docker, which is beyond simple training and monitoring. If such a tool is
something you need, i.e. in a production environment, then I would recommend comparing it
with RLlib and picking the one you like more.
Acme
“Acme is a library of reinforcement learning (RL) agents and agent building blocks. Acme strives
to expose simple, efficient, and readable agents, that serve both as reference implementations
of popular algorithms and as strong baselines, while still providing enough flexibility to do novel
research. The design of Acme also attempts to provide multiple points of entry to the RL
problem at differing levels of complexity.” ~ GitHub
It includes algorithms for continual control (DDPG, D4PG, MPO, Distributional MPO,
Multi-Objective MPO), discrete control (DQN, IMPALA, R2D2), learning from
demonstrations (DQfD, R2D3), planning and learning (AlphaZero) and behavioural
cloning.
Documentation is rather sparse, but there are many examples and jupyter notebook
tutorials available in the repo.
Code is easy to read but requires one to learn its structure first. It is easy to customize
and add your own agents.
The Acme environment loop assumes an environment instance that implements the
DeepMind Environment API. So any environment from DeepMind will work flawlessly
(e.g. DM Control). It also provides a wrapper on the OpenAI Gym environments and the
OpenSpiel RL environment loop. If your environment implements OpenAI or DeepMind
API, then you shouldn’t have problems with pugging it in.
It includes a basic logger and supports printing to the standard output (cmd) and saving
to CSV files. I’ve written the post on how to add the Neptune support to Acme.
7. Regular updates
Acme is maintained and actively developed.
Acme is simple like SpinningUp but a tier higher if it comes to the use of abstraction. It makes it
easier to maintain – code is more reusable – but on the other hand, harder to find the exact spot
in the implementation you should change when tinkering with the algorithm. It supports both
TensorFlow v2 and JAX, with the second being an interesting option as JAX gains traction
recently.
coax
“Coax is a modular Reinforcement Learning (RL) python package for solving OpenAI Gym
environments with JAX-based function approximators. […] The primary thing that sets coax
apart from other packages is that is designed to align with the core RL concepts, not with the
high-level concept of an agent. This makes coax more modular and user-friendly for RL
researchers and practitioners.” ~ Website
It implements classical RL algorithms (SARSA, Q-Learning), value-based deep RL
algorithms (Soft Q-Learning, DQN, Prioritized Experience Replay DQN, Ape-X DQN),
and policy gradient methods (VPG, PPO, A2C, DDPG, TD3).
Clear, if sometimes confusing, documentation with many code examples and algorithms
explanation. It also includes tutorials for running training on Pong, Cartpole, ForzenLake,
and Pendulum environments.
Other RL frameworks often hide structure that you (the RL practitioner) are interested in.
Coax makes the network architecture take the center stage, so you can define your own
forward-pass function. Moreover, the design of coax is agnostic of the details of your
training loop. You decide how and when you update your function approximators.
Coax mostly focuses on OpenAI Gym environments. However, you should be able to
extend it to other environments that implement this API.
It utilizes the Python logging module.
7. Regular updates
coax is maintained.
I would recommend coax for education purposes. If you want to plug-n-play with nitty-gritty
details of RL algorithms, this is a good tool to do this. It’s also built around JAX, which may be a
plus in itself (because of hype around it).
SURREAL
“Our goal is to make Deep Reinforcement Learning accessible to everyone. We introduce
Surreal, an open-source, reproducible, and scalable distributed reinforcement learning
framework. Surreal provides a high-level abstraction for building distributed reinforcement
learning algorithms.” ~ Website
It focuses on the distributed deep RL algorithms. As for now, the authors implemented
their distributed variants of PPO and DDPG.
It provides basic documentation in the repo of installing, running, and customizing the
algorithms. However, it lacks code examples and tutorials.
Code structure can frighten one away, it’s not something for newcomers. That being
said, the code includes docstrings and is readable.
It supports OpenAI Gym and DM Control environments, as well as Robotic Suite.
Robosuite is a standardized and accessible robot manipulation benchmark with the
MuJoCo physical engine.
It includes specialized logging tools for the distributed environment that also allow you to
record videos of agents playing.
No vectorized environments support. However, it allows one to distribute the training on
the cluster.
7. Regular updates
It doesn’t seem to be maintained anymore.
Planning by Dynamic Programming: Reinforcement Learning
Planning by Dynamic Programming
Dynamic programming can be used to solve reinforcement learning problems when someone tells us the
structure of the MDP (i.e when we know the transition structure, reward structure etc.). Therefore
dynamic programming is used for the planning in a MDP either to solve:
● Prediction problem (Policy Evaluation):
Given a MDP <S, A, P, R, γ> and a policy π. Find the value function v_π (which tells you how much
reward you are going to get in each state). i.e the goal is to find out how good a policy π is.
● Control problem (Find the best thing to do in the MDP):
Given a MDP <S, A, P, R, γ>. Find the optimal value function v_π and the optimal policy π*. i.e the goal
is to find the policy which gives you the most reward you can receive with the best action to choose.
Policy Evaluation
Problem: Evaluate a given policy π and MDP. (Find out how good a policy π is)
Solution: Iterative application of Bellman Expectation backup.
Approach:
Start with initial value function v₁ (a value of all states in the MDP). E.g. start with a value of 0.
Therefore there is no reward. Then use the Bellman Expectation Equation to compute v₂ and repeat many
times, which will eventually converge to v_π.
One method to achieve this convergence is to use synchronous backups where we consider all states at
every step.
1. At each iteration k+1, for all states s ∈ S
2. Update vₖ₊₁(s) from vₖ(s’), where s’ is a successor state of s
vₖ(s’) is updated using the Bellman Expectation Equation discussed in Part 2.
Bellman Expectation Equation

Example:
Suppose we have a grid world with 14 states, where each block represents a state. We also have 2
terminal states, one in the bottom right and the other in the top left of the grid world. We can take the
actions of moving up, down, left and right with each transition we take gives us a immediate reward of -
1 until a terminal state is reached. With a discount factor of γ=1 and our given policy π for our agent
follows a uniform random policy:
We need to compute the value of being in each state to determine how good our defined policy π is.
1. By following our approach described above we can start with an initial value function v₀ with all
values being 0.
v₀
2. Next we compute the new value function v₁ using the Bellman Expectation Equation by applying a one
step look ahead.

v₁: Blue block is calculated using the Bellman Expectation Equation such that 0.25(-1+1(0))+0.25(-
1+1(0))+0.25(-1+1(0))+0.25(-1+1(0)) = -1
3. Repeating the processing using v₁ to compute v₂

v₂: Blue block is calculated using the Bellman Expectation Equation such that 0.25(-1+1(-1))+0.25(-
1+1(-1))+0.25(-1+1(-1))+0.25(-1+1(0))=-1.75
4. Repeating this process with k=∞ will eventually converge our value function to v_π.
vπ
We have therefore computed the value associated to being in each state in our MDP for our policy π.
How to Improve a Policy?
In the above approach we have evaluated a given policy but not found the best policy (actions to take) in
our environment. In order to improve a given policy we can evaluate a given policy π and improve the
policy by acting greedily with respect to v_π. This can be done using Policy Iteration.
Policy Iteration
Problem: Find the best policy π* for a given MDP.
Solution: Bellman Expectation Equation Policy Iteration and Acting Greedy.

Approach:
Start with a given policy π
1. Evaluate the policy π with Policy Evaluation (described above)
2. Improve the policy π by acting greedily with respect to v_π with Policy Improvement to get
a new policy π’.
3. Repeat until the new policy π’ converges to the optimal policy π*.
In order to act greedily we use a one step look ahead to determine the action which gives us the maximum
action value function (described in Part 2):
Recall the action value function has the following equations:
action-value function
Value Iteration
An alternative approach to control problems is with value iteration using the Bellman Optimality
Equation. First we need to define how we can divide an optimal policy into its components using the
Principle of Optimality.
Principle of Optimality
Any optimal policy can be subdivided into two components which make the overall behaviour optimal:
- An optimal first action A∗
- Followed by an optimal policy from successor state S
Theorem (Principle of Optimality)
A policy π(a|s) achieves the optimal value from state s, v_π(s)=v∗(s), if and only if:
- For any state s’ reachable from s
- π achieves the optimal value from state s’, v_π(s’ )=v∗(s’ )
Value Iteration (Applied)
Problem: Find the optimal policy π* for a given MDP.
Solution: Iterative application of Bellman Optimality backup

Approach:
Using synchronous backups update the value function until the optimal value function is computed
without computing the action value function.
1. At each iteration k+1, for all states s ∈ S
2. Update vₖ₊₁(s) from vₖ(s’), where s’ is a successor state of s
vₖ(s’) is updated using the Bellman Optimality Equation discussed in the Part 2:
Bellman Optimality Equation
By repeating the above process, it will eventually converge to v*. Note this process differs to policy
iteration in that intermediate value functions may not correspond to any policy.
Introduction to Reinforcement Learning: Temporal Difference, SARSA, Q-Learning
Reinforcement Learning is one of the most intricate fields of Machine Learning, due to its mathematical
complexity, as well as the ambitious tasks it tries to achieve.
Reinforcement Learning: the building blocks
In RL we leverage a simple model to describe how an agent interacts with the space, which is the
Markov Decision Process.
MDPs represent a mathematical framework for modeling situations where dynamics are partly controlled
by an agent and partly caused by the characteristics of an environment. In an MDP, the agent
continuously interacts with the environment, which responds with rewards affecting subsequent agent’s
actions. Let’s have a look at the following picture:
A simple MDP, image by the author

At each time step t, the agent receives a representation of the environment’s state Sₜ, which conditions the
choice of an action Aₜ. At the next time step t+1, the agent receives a reward Rₜ₊₁ that informs it about
how good its past action was, as well as a new representation of the environment.
In a finite MDP, states Sₜ, actions Aₜ, and rewards Rₜ have a finite number of elements, with S ₜ and R ₜ
having well-defined discrete probability distributions depending only on the preceding states and actions.
These characteristics make the problem usually more tractable. However not all the problems can be
modeled with finite states, actions or rewards, yet many of the assumptions and conclusions we will be
making are easily generalizable to continuous settings.
The continuous interaction between an agent and the environment generates a trajectory:
S₀, A₀, R₀, S₁, A₁, R₁, S₂, A₂, R₂…
The probability of transitioning from state Sₜ to Sₜ₊₁ is thereby influenced by the choice of the action Aₜ,
but, given Sₜ and Aₜ, it is conditionally independent of all previous states and actions. Therefore we say
that the state transitions of an MDP satisfy the Markov Property.
The final goal of the agent is to maximize the total return over the long run, which usually keeps into
account a discounting factor γ, calculated as:

Discounted return, equation from [1]
The discounting factor γ determines how much the agent craves immediate rewards (γ = 0) or has a more
long-term view and aims for high returns in the future (γ = 1).
The agent should behave in order to transition to states that yield high rewards. The behavior of an agent
is determined by a policy π(a|s), a function that outputs the probability of performing an action a, given
that the agent is in state s. Under a given policy, we can define a state-value function (and a state-action
value function), which measures the expected return the agent gets from starting in s (and performing a)
and following π thereafter.
State value function and State-action value function, equation from [1]
Now, the goals of RL algorithms are typically two:
1. Prediction: estimating the value functions of a given policy π

Estimates for state value functions and state-action value functions for policy π
2. Control: finding the optimal policy π*, namely the policy that maximizes the return in the long run.
Estimates for state value functions and state-action value functions for optimal policy π
Optimal greedy policy, equation from [1], obtained by choosing the action that yields the highest Q-value
Note that these two tasks are inherently correlated: if an agent is following a policy π, my first goal is to
evaluate how good that policy is (Prediction). Once I estimated the value functions of the policy, I can
decide how to change it in order to obtain an even better value function (Control).
Knowing the model of the environment, namely knowing the probability distributions of actions and
rewards, gives a significant advantage in solving these two tasks. Nonetheless, the model is oftentimes
hidden or too complex to handle, hence algorithms better learn from experience (or episodes), which are
sequences of states, actions, and rewards, ending with terminal states. For instance, playing a full game of
Tic Tac Toe represents an episode, with the terminal state reached when one of the players wins (or both
draw).
A simple Tic Tac Toe episode, image by the author
The list of algorithms available to solve Prediction and Control tasks is long and each of them is capable
of obtaining optimal results under certain assumptions. Here I decided to cover the three pillar algorithms
which proved to work fairly well overall, but also represent the foundations of many more sophisticated
ones:
● Temporal Difference
● SARSA
● Q-Learning
Temporal Difference
Temporal Difference is said to be the central idea of Reinforcement Learning since it learns from raw
experience without a model of the environment. It solves the Prediction problem, namely it estimates the
value function of the states of an MDP.
The problem with learning from experience is that one has to wait for the end of the episode (and hence
for the final return) to update the estimates of the value functions. TD overcomes this obstacle, by waiting
for one single step to do it.

The update rule for the simplest version of TD, α is the update rate. Equation from [1]
At each time step t, the value function of state Sₜ is immediately updated based on the reward Rₜ₊₁
immediately received, and on the estimates of the new state Sₜ₊₁. This is the simplest version of TD,
called TD(0), where the zero represents the fact that one has to wait for a single step to perform the
update, yet the idea is extendable to more complex algorithms like n-step TD and TD(λ). As you notice,
the update rule of a state makes use of the estimates of another one: this method is called bootstrapping.
Procedure of TD(0), from [1]
We mentioned what are the pros of using TD, here it’s a summary:
1. It doesn’t need a model of the environment
2. Its estimates converge (fast) to the real state value function

3. It performs online incremental updates, namely it does not need to wait for the end of the
episodes to update the estimates of the value function.
SARSA
I know what you’re thinking: such a weird name for an algorithm. Well, indeed the name of SARSA
comes from the transition from state S to S’:
Trajectory from state S to S’ that gave name to SARSA algorithm
SARSA is an on-policy TD method to solve the Control problem, which means it tries to find an optimal
policy for a given task. Why “on policy”?
In on-policy learning, the optimal value function is learned from actions taken using the current policy
𝜋(𝑎|𝑠).
The update rule for SARSA, equation from [1]. Note that we know estimate Q(s, a), instead of V(s)
Indeed the reason why SARSA is defined as “on-policy” is the fact that we leverage the current policy to
update Q(s, a). If that’s not clear, wait to see what Q-Learning does and check the difference.
Now, how does SARSA learn the optimal policy? We mentioned it’s a TD method, which means it makes
use of the TD update rule to estimate the value function. But how does it improve the policy to obtain a
greater value function?
SARSA accomplishes this by changing the policy π in an ε-greedy fashion: at each step we perform:
● the action with the greatest Q(a,s), with probability 1 - ε
● another random action, each one having probability ε / N(a) of being picked up, where N(a)
is the number of possible actions
The parameter ε balances the exploitation vs. exploration tradeoff: at the same time you want to exploit
what you already know, namely performing the best action, but you also want to explore other
possibilities, namely discovering possible better actions.

SARSA procedure, from [1]
Once the procedure ends, if we extensively visited every state-action pair, we will have an optimal ε-
greedy Q*(a, s), which easily leads to an optimal ε-greedy policy:
Q-Learning
Dulcis in fundo, Q-Learning is instead an off-policy TD method to solve the Control problem.
In off-policy learning, the optimal value function is learned from actions taken independently from the
current policy 𝜋(𝑎|𝑠).
Q-Learning is actually very similar to SARSA. The only (but meaningful) difference lies in the update
rule of Q(a, s), which is:

The update rule for Q-Learning, equation from [1]
In this case, the Q-value is not updated according to the ε-greedy policy we are following, yet it is based
on the best action we could take from the next state Sₜ₊₁ (and this is very different from SARSA!).
Q-Learning procedure, from [1]
Likewise, as long as each state-action value pair is visited an infinite amount of time, Q-Learning
converges to the optimal Q*(a, s) and to the optimal policy.
So… SARSA or Q-Learning?
We saw two algorithms for solving the Control problems, however, there’s a difference in the way they
learn. In the learning process, SARSA is guided by the chosen ε-greedy action, which might be (and it’s
often the case) sub-optimal. On the other hand, Q-Learning always follows the greedy action in the
update of Q(s, a).

This (big) difference makes it so that the policy learned by SARSA is only near-optimal.
So shouldn’t you always choose Q-Learning? Well, technically you could. But there are cases where
choosing the greedy action might imply some risk and costly errors, which you may avoid or limitate by
choosing a more conservative algorithm like SARSA.
Deep Q‐networks (DQN, DDQN, Dueling DQN, Prioritised Experience Replay)
Introduction
For explaining each of the methods we first have to introduce the algorithm which started them all,
the holly “Q-learning”. In general, reinforcement learning has different parts. An agent, agent’s
action, an environment within which an agent takes actions and the state or observation that an
agent gets as a consequence to an action it takes. The mind of the agent in Q-learning is a table with
the rows as the State or Observation of the agent from the environment and the columns as the
actions to take. Each of the cells of the table will be filled with a value called Q-value which is the
value that an action brings considering the state it is in. let’s call this table Q-table. The Q-table is
actually the brain of the agent.

Q-table
Reinforcement Learning
There are multiple steps to reinforcement learning:
1) The agent starts taking an action in the environment and starts a Q-table initialized with zeros in
all the cells.

2) The agent gets to a new state or observation (state is the information of the environment that an
agent is in and observation is an actual image that the agent sees. Depending on an environment an
agent gets state or observation as an input) by doing an action from the table. The state is a row in
the table containing Q-values for each action and the highest value in the row will be the action that
the agent takes (the column with the highest value in that specific row). It is like seeing a new
landscape by turning around. There is an observation from the current state which is the landscape
that you are seeing now, and an observation from the next state by taking the action of turning
around.
3) The agent gets a reward by doing the action. The reward has a meaning. the higher the value the
better, but sometimes the lower value could mean that the agent has taken the better action. The
reward comes from the environment and the environment defines which of the lower or higher
reward is better. The environment gives the agent the reward for taking an action in a specific
state.
4) The agent keeps doing steps 1 to 3 and gathers information in its “memory”. The memory
contains tuples of state, next state, action, reward and a Boolean value for indicating the
termination of the agent. this steps keep on going and the agent memorizes the info until the
termination happens.
5) The agent sometimes has to renounce the Q-table and explore for learning new things. This is
called the exploration of the agent. the basic method for this is to have a probability of exploration
and during the training of the agent, this probability lowers. So at first the agent learns new things
regardless of what it has learned (the Q-table). But as the time goes on, the agent relies more on
what it has learned. Actually, Noisy DQN does the job of exploration in a different way which will
be explained later.
6) After the termination of the agent which could mean completing the task or failing, the agent
starts replaying the experiences it gathered in its memory. A batch of a particular size will be
chosen from the memory and the task of training will be performed on it. Basically this means that
the Q-table starts filling up. This is called Experience Replay.
What is Q-value then? Basically, Q-value is the reward obtained from the current state plus the
maximum Q-value from the next state. So, that means the agent has to get the reward and the next
state from its experience in the memory and add the reward to the highest Q-value derived from
the row of the next state in the Q-table and the result will go into the row of the current state and
the column of the action, both obtained from the experience in the memory.
Here s is the state and a is the action and Q(s,a) is a value of the Q-table cell and R is the reward
and gamma (between zero and one. Normally is 0.9) is the discount factor which basically tells the
agent to not rely too much on the next state.

2 Deep Q-Learning (DQN)
The only difference between Q-learning and DQN is the agent’s brain. The agent’s brain in Q-
learning is the Q-table, but in DQN the agent’s brain is a deep neural network.
Deep Reinforcement Learning
The input of the neural network will be the state or the observation and the number of output
neurons would be the number of the actions that an agent can take. For training the neural
network the targets would be the Q-values of each of the actions and the input would be the state
that the agent is in.
3 Double DQN
Double DQN uses two identical neural network models. One learns during the experience replay,
just like DQN does, and the other one is a copy of the last episode of the first model. The Q-value is
actually computed with this second model. Now, why is that? In DQN, Q-value is calculated with
the reward added to the next state maximum Q-value. Obviously, if every time the Q-value
calculates a high number for a certain state, the value that is obtained from the output of the neural
network for that specific state, will become higher every time. Each output neuron value will get
higher and higher until the difference between each output value is high. Now if let’s say for state s
action a is a higher value than action b, then action a will get chosen every time for state s. now
consider if for some memory experience action b becomes the better action for state s. then since
the neural network is trained in a way to give a much higher value for action a when given state s, it
is difficult to train the network to learn that action b is the better action in some conditions. So
what do we do to bring down the difference between the output values (actions)? Use a secondary
model that is the copy of the main model from the last episode and obviously, since the difference
between values of the second model are lower than the main model, we use this second model to
attain the Q-value:
That is the way Q-value gets calculated in Double DQN. We find the index of the highest Q-value
from the main model and use that index to obtain the action from the second model. And the rest is
history.
4 Dueling DQN
The difference in Dueling DQN is in the structure of the model. The model is created in a way to
output the formula below:
Here, the V(s) stands for the Value of state s and A is the Advantage of doing action a while in state
s. The Value of a state is independent of action. It means how good is it to be in a particular state.
But what is an Advantage? How is it different from Q-value? Let’s explain it with an example. The
agent might be in a state where each of the actions would give the same Q-value. So there is no good
action in this state. What would happen if we divide the Q-value to Value of a state and the
Advantage that each action has. If every action has the same result, then the Advantage of each
action will have the same value. Now, if we subtract the mean of all the Advantages from each
advantage, we get zero (or close to zero) and Q-value would actually be the Value that the state has.
So overtime the Q-value would not overshoot. The states that are independent of action would not
have a high Q-value to train on.

Dueling DQN
Mind you, the output of the model would be the state Value plus the Advantage of actions. But for
training the model we use the same Q-value for targets as before:
We can even use Double DQN method.
5 Noisy Net (Noisy DQN)
As you might recall from the introduction, the agent sometimes explores the environment
regardless of what the Q-table or neural network (its brain) told it to do so. As we mentioned, the
exploration was a probability that reduced over time. This would happen with a probability of 1
reducing to, for instance, 0.01. but Noisy Net does this job in a different way. Noisy Net gives noise
to the output of the neural network, so in that way the agent explores the environment whenever
there is a noise in the neural network output and different action gets a higher value when the real
action to be taken is another one.
The way to do this is by defining different weights for the neural network. As you recall from
neural networks, the output of a neuron is the sum of the inputs multiplied by the weight of their
connection to the corresponding neuron. So weights (W for short) were the parameters to learn in
neural networks. But in Noisy Net (W = Mu + Sigma * epsilon) where Mu is a Variable with a
random initialization, Sigma is a Variable with a constant initialization and epsilon is actually the
noise with a random value between zero and one. So the fully connected layers would have weights
like this to learn and over time if exploration is not needed anymore the value of sigma would be
close to zero to neutralize the epsilon.
6 DQN with Prioritized Experience Replay
As mentioned in the introduction the agent will start taking actions in an environment and
memorized the experience as a tuple of state, next state, action, reward and a Boolean value for
indicating the termination of the agent . Furthermore, in Experience Replay step, a batch of certain
shape would get chosen from the memory and training the neural network would perform on that
particular batch. But what if the batch chosen from the memory is not an eligible batch. What if the
experience memories chosen are not so important to learn. In that case we have to choose the
important experience memories for putting in the batch.

So how do we recognize importance? If the Q-value from the next state is a lot different than the Q-
value from the current state, that means that the importance is high whether the Q-value in the
next state increases or decreases. This deference is called Temporal Difference error (TD error)
So for each memory we have TD error as below:
Then the probability of an experience memory being chosen would be:
Now, here epsilon is a small value to prevent division by zero and alpha is a random value between
zero and one which if it’s one it chooses the most important memories and if it is zero the batch
would get filled randomly.
So p will be the probability that an experience is important and the batch would get filled
considering experience probabilities. But there is one more thing. As you might know training the
network happens stochastically. That means each experience is used for training individually. So, if
an experience has a high probability, then this experience would get chosen each time and the
neural network would Overfit on this particular experience (for more information about
Overfitting refer to https://medium.com/greyatom/what-is-underfitting-and-overfitting-in-machine-
learning-and-how-to-deal-with-it-6803a989c76) so in order to overcome this problem we multiply
the value below to the training loss:
Where b is a value starting from zero and gradually reaching to 1. So in this formula the
importance is calculated from the distribution that the experience came from. So if a probability is
high it won’t get chosen all the time. Therefore, the training loss will be calculated as:
UNIT 2 Notes

RL & DL Notes

Uploaded by

Copyright:

Available Formats

RL & DL Notes

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

RL & DL Notes

Uploaded by

Copyright:

Available Formats

RL & DL NOTES

Key Concepts of Reinforcement Learning

● Agent: The learner or decision-maker.

● Environment: Everything the agent interacts with.

● State: A specific situation in which the agent finds itself.

● Action: All possible moves the agent can make.

● Reward: Feedback from the environment based on the action taken.

How Reinforcement Learning Works

● Model of the Environment: A representation of the environment that helps in

Main points in Reinforcement learning –

● The model keeps continues to learn.

● The best solution is decided based on the maximum reward.

Reinforcement learning Supervised learning

Reinforcement learning is all about making

In supervised learning the

1. Positive: Positive Reinforcement is defined as when an event, occurs due to a

● Sustain Change for a long period of time

2. Negative: Negative Reinforcement is defined as strengthening of behavior because a

● Provide defiance to a minimum standard of performance

● It Only provides enough to meet up the minimum behavior

Elements of Reinforcement Learning

i) Policy: Defines the agent’s behavior at a given time.

MDP), we need to develop our intuition about :

● The Agent-Environment relationship

● Markov Process and Markov chains

● Markov Reward Process (MRP)

● Markov Reward Process

The Agent-Environment Relationship

First let’s look at some formal definitions :

environment or a simulated environment with which our agent will interact.

Demonstrating an environment with which agents are interacting.

reached by performing the action.

The Markov Property

Transition : Moving from one state to another is called Transition.

The Markov Property state that :

“Future is Independent of the past given the present”

Mathematically we can express this statement as :

State Transition Probability :

State Transition Probability

to any successor state.Sum of each row is equal to 1.

Markov Process or Markov Chains

But what random process means ?

To answer this question let’s look at a example:

Some samples from the chain :

● Sleep — Run — Ice-cream — Sleep

● Sleep — Ice-cream — Ice-cream — Run

random set of sequences.

environment is called returns.

We can define Returns as :

Returns (Total rewards from the environment)

by at the final time step by performing an action to move to another state.

Episodic and Continuous Tasks

initial state and hence, every episode is independent.

So, how we define returns for continuous tasks?

This is where we need Discount factor(ɤ).

discount factor lies between 0.2 to 0.8.

Returns using discount factor

One with discount factor (ɤ) 0.8 :

Second, with discount factor (ɤ) 0.2 :

Discount Factor (0.2)