Stable Baselines
Release 2.10.2
Stable Baselines is a set of improved implementations of Reinforcement Learning (RL) algorithms based on OpenAI
Baselines.
Warning: This package is in maintenance mode, please use Stable-Baselines3 (SB3) for an up-to-date version.
You can find a migration guide in SB3 documentation.
User Guide
This toolset is a fork of OpenAI Baselines, with major structural refactoring and code cleanups:
• Unified structure for all algorithms
• PEP8 compliant (unified code style)
• Documented functions and classes
• More tests & more code coverage
• Additional algorithms: SAC and TD3 (+ HER support for DQN, DDPG, SAC and TD3)
1.1 Installation
1.1.1 Prerequisites
Stable-Baselines requires Python 3 (>=3.5) with the development headers. You'll also need the system packages CMake, OpenMPI and zlib. They can be installed as follows:
Note: Stable-Baselines supports Tensorflow versions from 1.8.0 to 1.15.0 and does not work on Tensorflow versions 2.0.0 and above. PyTorch support is provided by Stable-Baselines3.
Ubuntu
sudo apt-get update && sudo apt-get install cmake libopenmpi-dev python3-dev zlib1g-dev
Mac OS X
Installation of system packages on Mac requires Homebrew. With Homebrew installed, run the following:
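A minimal sketch of that step (assuming Homebrew's cmake and openmpi formulae):

brew install cmake openmpi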
Windows 10
We recommend that Windows users use Anaconda for easier installation of Python packages and required libraries. You need an environment with Python version 3.5 or above.
For a quick start you can move straight to installing Stable-Baselines in the next step (without MPI). This supports most but not all algorithms.
To support all algorithms, install MPI for Windows (you need to download and install msmpisetup.exe) and follow the instructions on how to install Stable-Baselines with MPI support in the following section.
Note: Trying to create Atari environments may result in vague errors related to missing DLL files and modules. This is an issue with the atari-py package. See this discussion for more information.
Stable Release
To install with support for all algorithms, including those depending on OpenMPI, execute:
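That is (a sketch, installing from PyPI with the MPI extra):

pip install stable-baselines[mpi]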
GAIL, DDPG, TRPO, and PPO1 parallelize training using OpenMPI. OpenMPI has had weird interactions with Tensorflow in the past (see Issue #430), so if you do not intend to use these algorithms we recommend installing without OpenMPI. To do this, execute:
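That is (the default PyPI install, without MPI):

pip install stable-baselines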
If you have already installed with MPI support, you can disable MPI by uninstalling mpi4py with pip uninstall
mpi4py.
Note: Unless you are using the bleeding-edge version, you need to install the correct Tensorflow version manually.
See Issue #849
To contribute to Stable-Baselines, with support for running tests and building the documentation, install the development dependencies:
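A sketch of that setup (assuming the hill-a/stable-baselines repository and its docs/tests extras):

git clone https://github.com/hill-a/stable-baselines && cd stable-baselines
pip install -e .[docs,tests,mpi]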
If you are looking for docker images with stable-baselines already installed, we recommend using images from RL Baselines Zoo.
Otherwise, the following images contain all the dependencies for stable-baselines but not the stable-baselines package itself. They are made for development.
GPU image (requires nvidia-docker):
make docker-gpu
CPU only:
make docker-cpu
Note: if you are using a proxy, you need to pass extra params during the build and do some tweaks, e.g.:
--build-arg https_proxy=https://your.proxy.fr:8080/
docker run -it --runtime=nvidia --rm --network host --ipc=host --name test --mount src="$(pwd)",target=/root/code/stable-baselines,type=bind stablebaselines/stable-baselines
docker run -it --rm --network host --ipc=host --name test --mount src="$(pwd)",target=/root/code/stable-baselines,type=bind stablebaselines/stable-baselines-cpu
Most of the library tries to follow a sklearn-like syntax for the Reinforcement Learning algorithms.
Here is a quick example of how to train and run PPO2 on a CartPole environment:
import gym

from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import PPO2

env = gym.make('CartPole-v1')
# Optional: PPO2 requires a vectorized environment to run
# the env is now wrapped automatically when passing it to the constructor
# env = DummyVecEnv([lambda: env])

model = PPO2(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=10000)

obs = env.reset()
for i in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
Or just train a model with a one liner if the environment is registered in Gym and if the policy is registered:
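For example (a sketch; the same pattern works for any registered algorithm and policy):

from stable_baselines import PPO2

model = PPO2('MlpPolicy', 'CartPole-v1').learn(10000)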
The aim of this section is to help you run reinforcement learning experiments. It covers general advice about RL (where to start, which algorithm to choose, how to evaluate an algorithm, . . . ), as well as tips and tricks when using a custom environment or implementing an RL algorithm.
TL;DR
Current Limitations of RL
As general advice, to obtain better performance you should increase the budget of the agent (number of training timesteps).
In order to achieve the desired behavior, expert knowledge is often required to design an adequate reward function. This reward engineering (or RewArt, as coined by Freek Stulp) requires several iterations. As a good example of reward shaping, you can take a look at the Deep Mimic paper, which combines imitation learning and reinforcement learning to do acrobatic moves.
One last limitation of RL is the instability of training. That is to say, you can observe a huge drop in performance during training. This behavior is particularly present in DDPG, which is why its extension TD3 tries to tackle that issue. Other methods, like TRPO or PPO, make use of a trust region to minimize that problem by avoiding too large an update.
Because most algorithms use exploration noise during training, you need a separate test environment to evaluate the performance of your agent at a given time. It is recommended to periodically evaluate your agent for n test episodes (n is usually between 5 and 20) and average the reward per episode to have a good estimate.
As some policies are stochastic by default (e.g. A2C or PPO), you should also try to set deterministic=True when calling the .predict() method; this frequently leads to better performance. Looking at the training curve (episode reward as a function of the timesteps) is a good proxy but underestimates the agent's true performance.
Note: We provide an EvalCallback for doing such evaluation. You can read more about it in the Callbacks
section.
We suggest you read Deep Reinforcement Learning that Matters for a good discussion about RL evaluation.
You can also take a look at this blog post and this issue by Cédric Colas.
There is no silver bullet in RL; depending on your needs and problem, you may choose one or the other. The first distinction comes from your action space, i.e., do you have discrete actions (e.g. LEFT, RIGHT, . . . ) or continuous actions (e.g. go to a certain speed)?
Some algorithms are only tailored for one or the other domain: DQN only supports discrete actions, whereas SAC is restricted to continuous actions.
The second difference that will help you choose is whether you can parallelize your training or not, and how you can
do it (with or without MPI?). If what matters is the wall clock training time, then you should lean towards A2C and its
derivatives (PPO, ACER, ACKTR, . . . ). Take a look at the Vectorized Environments to learn more about training with
multiple workers.
To sum it up:
Discrete Actions
DQN with extensions (double DQN, prioritized replay, . . . ) and ACER are the recommended algorithms. DQN is
usually slower to train (regarding wall clock time) but is the most sample efficient (because of its replay buffer).
You should give PPO2, A2C and their successors (ACKTR, ACER) a try.
If you can multiprocess the training using MPI, then you should check out PPO1 and TRPO.
Continuous Actions
Current State Of The Art (SOTA) algorithms are SAC and TD3. Please use the hyperparameters in the RL zoo for best
results.
Take a look at PPO2, TRPO or A2C. Again, don't forget to take the hyperparameters from the RL zoo for continuous action problems (cf. Bullet envs).
If you can use MPI, then you can choose between PPO1, TRPO and DDPG.
Goal Environment
If your environment follows the GoalEnv interface (cf HER), then you should use HER + (SAC/TD3/DDPG/DQN)
depending on the action space.
Note: The number of workers is an important hyperparameter for experiments with HER. Currently, only HER+DDPG supports multiprocessing using MPI.
If you want to learn about how to create a custom environment, we recommend you read this page. We also provide a
colab notebook for a concrete example of creating a custom gym environment.
Some basic advice:
• always normalize your observation space when you can, i.e., when you know the boundaries
• normalize your action space and make it symmetric when continuous (cf potential issue below) A good practice
is to rescale your actions to lie in [-1, 1]. This does not limit you as you can easily rescale the action inside the
environment
• start with shaped reward (i.e. informative reward) and simplified version of your problem
• debug with random actions to check that your environment works and follows the gym interface:
We provide a helper to check that your environment runs without error:
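A sketch of that check (YourEnv stands in for your environment class):

from stable_baselines.common.env_checker import check_env

env = YourEnv()
# It will check your custom environment and output additional warnings if needed
check_env(env)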
If you want to quickly try a random agent on your environment, you can also do:
env = YourEnv()
obs = env.reset()
n_steps = 10
for _ in range(n_steps):
    # Random action
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
Another consequence of using a Gaussian is that the action range is not bounded. That’s why clipping is usually used
as a bandage to stay in a valid interval. A better solution would be to use a squashing function (cf SAC) or a Beta
distribution (cf issue #112).
Note: This statement is not true for DDPG or TD3 because they don’t rely on any probability distribution.
When you try to reproduce a RL paper by implementing the algorithm, the nuts and bolts of RL research by John
Schulman are quite useful (video).
We recommend following those steps to have a working RL algorithm:
1. Read the original paper several times
2. Read existing implementations (if available)
3. Try to have some “sign of life” on toy problems
4. Validate the implementation by making it run on harder and harder envs (you can compare results against the RL zoo)
You usually need to run hyperparameter optimization for that step.
You need to be particularly careful about the shape of the different objects you are manipulating (a broadcast mistake will fail silently, cf. issue #75) and about when to stop the gradient propagation.
A personal pick (by @araffin) for environments with gradual difficulty in RL with continuous actions:
1. Pendulum (easy to solve)
2. HalfCheetahBullet (medium difficulty with local minima and shaped reward)
3. BipedalWalkerHardcore (if it works on that one, then you can have a cookie)
in RL with discrete actions:
1. CartPole-v1 (easy to be better than random agent, harder to achieve maximal performance)
2. LunarLander
3. Pong (one of the easiest Atari games)
4. other Atari games (e.g. Breakout)
Stable-Baselines assumes that you already understand the basic concepts of Reinforcement Learning (RL).
However, if you want to learn about RL, there are several good resources to get started:
• OpenAI Spinning Up
• David Silver’s course
• Lilian Weng’s blog
• Berkeley’s Deep RL Bootcamp
• Berkeley’s Deep Reinforcement Learning course
• More resources
1.5 RL Algorithms
This table displays the RL algorithms that are implemented in the Stable Baselines project, along with some useful characteristics: support for recurrent policies, discrete/continuous actions, and multiprocessing.
Note: Non-array spaces such as Dict or Tuple are not currently supported by any algorithm, except HER for Dict spaces when working with gym.GoalEnv.
Actions gym.spaces:
• Box: An N-dimensional box that contains every point in the action space.
• Discrete: A list of possible actions, where at each timestep only one of the actions can be used.
• MultiDiscrete: A list of possible actions, where at each timestep only one action of each discrete set can be used.
• MultiBinary: A list of possible actions, where at each timestep any of the actions can be used in any combination.
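For illustration, these spaces are typically constructed like this with gym (the sizes are arbitrary):

from gym import spaces
import numpy as np

box = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)  # 2-dimensional continuous actions
discrete = spaces.Discrete(3)                  # one of 3 actions per timestep
multi_discrete = spaces.MultiDiscrete([3, 2])  # one action from each of two discrete sets
multi_binary = spaces.MultiBinary(4)           # any combination of 4 binary actions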
Note: Some logging values (like ep_rewmean, eplenmean) are only available when using a Monitor wrapper. See Issue #339 for more info.
1.5.1 Reproducibility
Completely reproducible results are not guaranteed across Tensorflow releases or different platforms. Furthermore,
results need not be reproducible between CPU and GPU executions, even when using identical seeds.
In order to make computations deterministic on CPU, on your specific problem on one specific platform, you need to
pass a seed argument at the creation of a model and set n_cpu_tf_sess=1 (number of cpu for Tensorflow session). If
you pass an environment to the model using set_env(), then you also need to seed the environment first.
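A minimal sketch of those settings (PPO2 here is illustrative; any algorithm exposing seed and n_cpu_tf_sess works the same way):

from stable_baselines import PPO2

# Deterministic CPU run on this platform: fix the seed and use a single-threaded TF session
model = PPO2('MlpPolicy', 'CartPole-v1', seed=0, n_cpu_tf_sess=1)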
1 Whether or not the algorithm has been refactored to fit the BaseRLModel class.
2 Only implemented for TRPO.
3 Multi Processing with MPI.
4 TODO, in project scope.
Note: Because of the current limits of Tensorflow 1.x, we cannot ensure reproducible results on the GPU yet. This
issue is solved in Stable-Baselines3 “PyTorch edition”
Note: TD3 sometimes fails to produce reproducible results for obscure reasons, even when following the previous steps (cf PR #492). If you find the reason, please open an issue ;)
1.6 Examples
All the following examples can be executed online using Google colab notebooks:
• Full Tutorial
• All Notebooks
• Getting Started
• Training, Saving, Loading
• Multiprocessing
• Monitor Training and Plotting
• Atari Games
• Breakout (trained agent included)
• Hindsight Experience Replay
• RL Baselines zoo
In the following example, we will train, save and load a DQN model on the Lunar Lander environment.
Note: LunarLander requires the python package box2d. You can install it using apt install swig and then
pip install box2d box2d-kengz
Note: The load function re-creates the model from scratch on each call, which can be slow. If you need to e.g. evaluate the same model with multiple different sets of parameters, consider using load_parameters instead.
import gym

from stable_baselines import DQN

# Create environment
env = gym.make('LunarLander-v2')
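The rest of the example roughly follows this pattern (a sketch; the hyperparameters and file name are illustrative):

# Instantiate the agent
model = DQN('MlpPolicy', env, learning_rate=1e-3, prioritized_replay=True, verbose=1)
# Train the agent
model.learn(total_timesteps=int(2e5))
# Save the agent
model.save("dqn_lunar")
del model  # delete trained model to demonstrate loading

# Load the trained agent
model = DQN.load("dqn_lunar")

# Enjoy the trained agent
obs = env.reset()
for i in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()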
import gym
import numpy as np

from stable_baselines.common import set_global_seeds
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines import PPO2  # any algorithm supporting vectorized envs works here

def make_env(env_id, rank, seed=0):
    """Utility function for a multiprocessed env."""
    def _init():
        env = gym.make(env_id)
        env.seed(seed + rank)
        return env
    set_global_seeds(seed)
    return _init

if __name__ == '__main__':
    env_id = "CartPole-v1"
    num_cpu = 4  # Number of processes to use
    # Create the vectorized environment
    env = SubprocVecEnv([make_env(env_id, i) for i in range(num_cpu)])

    model = PPO2('MlpPolicy', env, verbose=1)
    model.learn(total_timesteps=25000)

    obs = env.reset()
    for _ in range(1000):
        action, _states = model.predict(obs)
        obs, rewards, dones, info = env.step(action)
        env.render()
You can define a custom callback function that will be called inside the agent. This could be useful when you want
to monitor training, for instance display live learning curves in Tensorboard (or in Visdom) or save the best agent. If
your callback returns False, training is aborted early.
import os

import gym
import numpy as np
import matplotlib.pyplot as plt

from stable_baselines.common.callbacks import BaseCallback
class SaveOnBestTrainingRewardCallback(BaseCallback):
    """
    Callback for saving a model (the check is done every ``check_freq`` steps)
    based on the training reward (in practice, we recommend using ``EvalCallback``).
    """

    def _on_step(self) -> bool:
        # ... check the mean training reward here and save the model if it improved (elided) ...
        return True
plt.show()
Training an RL agent on Atari games is straightforward thanks to the make_atari_env helper function. It will do all the preprocessing and multiprocessing for you.
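The setup part of that example might look like this (a sketch assuming the Pong Atari environment and ACER; any algorithm with a CNN policy works similarly):

from stable_baselines.common.cmd_util import make_atari_env
from stable_baselines.common.vec_env import VecFrameStack
from stable_baselines import ACER

# make_atari_env builds and wraps Atari environments correctly
# (here with 4 processes running in parallel)
env = make_atari_env('PongNoFrameskip-v4', num_env=4, seed=0)
# Frame-stacking with 4 frames
env = VecFrameStack(env, n_stack=4)

model = ACER('CnnPolicy', env, verbose=1)
model.learn(total_timesteps=25000)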
obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
Normalizing input features may be essential to successful training of an RL agent (by default, images are scaled but not other types of input), for instance when training on PyBullet environments. For that, a wrapper exists that will compute a running average and standard deviation of the input features (it can do the same for rewards).
import os
import gym
import pybullet_envs

from stable_baselines.common.vec_env import DummyVecEnv, VecNormalize
from stable_baselines import PPO2

env = DummyVecEnv([lambda: gym.make("HalfCheetahBulletEnv-v0")])
# Automatically normalize the input features (and optionally the reward)
env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.)

model = PPO2('MlpPolicy', env)
model.learn(total_timesteps=2000)

# Don't forget to save the VecNormalize statistics when saving the agent
log_dir = "/tmp/"
model.save(log_dir + "ppo_halfcheetah")
stats_path = os.path.join(log_dir, "vec_normalize.pkl")
env.save(stats_path)

# To demonstrate loading
del model, env
Stable Baselines provides default policy networks for images (CNNPolicies) and other types of inputs (MlpPolicies). However, you can also easily define a custom architecture for the policy network (see the custom policy section):
import gym
You can access a model's parameters via the load_parameters and get_parameters functions, which use dictionaries that map variable names to NumPy arrays.
These functions are useful when you need to e.g. evaluate a large set of models with the same network structure, visualize different layers of the network, or modify parameters manually.
You can access the original Tensorflow Variables with the function get_parameter_list.
The following example demonstrates reading parameters, modifying some of them, and loading them back into the model, by implementing a simple evolution strategy for solving the CartPole-v1 environment. The initial guess for the parameters is obtained by running A2C policy gradient updates on the model.
import gym
import numpy as np

from stable_baselines import A2C
def mutate(params):
    """Mutate parameters by adding normal noise to them"""
    return dict((name, param + np.random.normal(size=param.shape))
                for name, param in params.items())
# Create env
env = gym.make('CartPole-v1')
# Create policy with a small network
model = A2C('MlpPolicy', env, ent_coef=0.0, learning_rate=0.1,
            policy_kwargs={'net_arch': [8, ]})
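The elided remainder of the example roughly follows this pattern (a compressed sketch; the population size and the evaluation logic are abbreviated):

# Use traditional actor-critic policy gradient updates to
# find a good initial set of parameters
model.learn(total_timesteps=5000)

# Get the parameters as the starting point for the evolution strategy
mean_params = model.get_parameters()

for iteration in range(10):
    # Create a mutated candidate and load it into the model
    candidate = mutate(mean_params)
    model.load_parameters(candidate)
    # ... evaluate the candidate by running a few episodes,
    # and keep it as the new mean_params if it scores better ...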
This example demonstrates how to train a recurrent policy and how to test it properly.
Warning: One current limitation of recurrent policies is that you must test them with the same number of
environments they have been trained on.
from stable_baselines import PPO2

# For recurrent policies, with PPO2, the number of environments run in parallel
# should be a multiple of nminibatches.
model = PPO2('MlpLstmPolicy', 'CartPole-v1', nminibatches=1, verbose=1)
model.learn(50000)

# Retrieve the environment wrapped by the model
env = model.get_env()
obs = env.reset()
# Passing state=None to the predict function means
# it is the initial state
state = None
# When using VecEnv, done is a vector
done = [False for _ in range(env.num_envs)]
for _ in range(1000):
    # Pass the previous state and episode-start mask for recurrent policies
    action, state = model.predict(obs, state=state, mask=done)
    obs, reward, done, _ = env.step(action)
    env.render()
The parking env is a goal-conditioned continuous control task, in which the vehicle must park in a given space with
the appropriate heading.
Note: the hyperparameters in the following example were optimized for that environment.
import gym
import highway_env
import numpy as np

from stable_baselines import HER, SAC

env = gym.make("parking-v0")
n_sampled_goal = 4  # create 4 artificial transitions per real transition

# SAC hyperparams:
model = HER('MlpPolicy', env, SAC, n_sampled_goal=n_sampled_goal,
            goal_selection_strategy='future',
            verbose=1, buffer_size=int(1e6),
            learning_rate=1e-3,
            gamma=0.95, batch_size=256,
            policy_kwargs=dict(layers=[256, 256, 256]))
# DDPG Hyperparams:
# NOTE: it works even without action noise
# n_actions = env.action_space.shape[0]
# noise_std = 0.2
# action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=noise_std * np.ones(n_actions))
model.learn(int(2e5))
model.save('her_sac_highway')
obs = env.reset()
You can also move from learning on one environment to another for continual learning (PPO2 on DemonAttack-v0, then transferred to SpaceInvaders-v0):
from stable_baselines.common.cmd_util import make_atari_env
from stable_baselines import PPO2

# make_atari_env creates and wraps the Atari environments correctly
env = make_atari_env('DemonAttackNoFrameskip-v4', num_env=8, seed=0)

model = PPO2('CnnPolicy', env, verbose=1)
model.learn(total_timesteps=10000)

obs = env.reset()
for i in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
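The transfer step elided here roughly looks like this (a sketch; the NoFrameskip environment ids follow the naming expected by make_atari_env):

# The number of environments must be identical when switching environments
env = make_atari_env('SpaceInvadersNoFrameskip-v4', num_env=8, seed=0)

# Change the environment and keep training
model.set_env(env)
model.learn(total_timesteps=10000)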
obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
env.close()
import gym
from stable_baselines.common.vec_env import VecVideoRecorder, DummyVecEnv

env_id = 'CartPole-v1'
video_folder = 'logs/videos/'
video_length = 100

env = DummyVecEnv([lambda: gym.make(env_id)])

obs = env.reset()

# Record the video starting at the first step
env = VecVideoRecorder(env, video_folder,
                       record_video_trigger=lambda x: x == 0, video_length=video_length,
                       name_prefix="random-agent-{}".format(env_id))

env.reset()
for _ in range(video_length + 1):
    action = [env.action_space.sample()]
    obs, _, _, _ = env.step(action)
# Save the video
env.close()
Note: For Atari games, you need to use a screen recorder such as Kazam, and then convert the video using ffmpeg.
import imageio
import numpy as np

from stable_baselines import A2C

model = A2C("MlpPolicy", "LunarLander-v2").learn(100000)

images = []
obs = model.env.reset()
img = model.env.render(mode='rgb_array')
for i in range(350):
    images.append(img)
    action, _ = model.predict(obs)
    obs, _, _, _ = model.env.step(action)
    img = model.env.render(mode='rgb_array')

imageio.mimsave('lander_a2c.gif', [np.array(img) for i, img in enumerate(images) if i % 2 == 0], fps=29)
1.7 Vectorized Environments
Vectorized Environments are a method for stacking multiple independent environments into a single environment.
Instead of training an RL agent on 1 environment per step, it allows us to train it on n environments per step. Because
of this, actions passed to the environment are now a vector (of dimension n). It is the same for observations,
rewards and end of episode signals (dones). In the case of non-array observation spaces such as Dict or Tuple,
where different sub-spaces may have different shapes, the sub-observations are vectors (of dimension n).
Note: Vectorized environments are required when using wrappers for frame-stacking or normalization.
Note: When using vectorized environments, the environments are automatically reset at the end of each episode.
Thus, the observation returned for the i-th environment when done[i] is true will in fact be the first observation
of the next episode, not the last observation of the episode that has just terminated. You can access the “real” final
observation of the terminated episode—that is, the one that accompanied the done event provided by the underlying
environment—using the terminal_observation keys in the info dicts returned by the vecenv.
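For example (a sketch; env here is any VecEnv and actions is a valid batch of actions):

obs, rewards, dones, infos = env.step(actions)
for env_idx, done in enumerate(dones):
    if done:
        # obs[env_idx] is already the first observation of the next episode;
        # the real last observation of the finished episode is stored here:
        last_obs = infos[env_idx]['terminal_observation']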
Warning: When using SubprocVecEnv, users must wrap the code in an if __name__ == "__main__": block if using the forkserver or spawn start method (default on Windows). On Linux, the default start method is fork, which is not thread safe and can create deadlocks.
For more information, see Python's multiprocessing guidelines.
1.7.1 VecEnv
class stable_baselines.common.vec_env.VecEnv(num_envs, observation_space, action_space)
An abstract asynchronous, vectorized environment.
Parameters
• num_envs – (int) the number of environments
• observation_space – (Gym Space) the observation space
• action_space – (Gym Space) the action space
close()
Clean up the environment’s resources.
env_method(method_name, *method_args, indices=None, **method_kwargs)
Call instance methods of vectorized environments.
Parameters
• method_name – (str) The name of the environment method to invoke.
• indices – (list,int) Indices of envs whose method to call
• method_args – (tuple) Any positional arguments to provide in the call
• method_kwargs – (dict) Any keyword arguments to provide in the call
Returns (list) List of items returned by the environment’s method call
get_attr(attr_name, indices=None)
Return attribute from vectorized environment.
Parameters
• attr_name – (str) The name of the attribute whose value to return
• indices – (list,int) Indices of envs to get attribute from
Returns (list) List of values of ‘attr_name’ in all environments
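For example (a sketch; env is any VecEnv with several sub-environments):

# Call env.seed(5) on every sub-environment and collect the results
seeds = env.env_method('seed', 5)
# Read the action_space attribute of the first two sub-environments
spaces = env.get_attr('action_space', indices=[0, 1])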
get_images() → Sequence[numpy.ndarray]
Return RGB images from each environment
getattr_depth_check(name, already_found)
Check if an attribute reference is being hidden in a recursive call to __getattr__
Parameters
• name – (str) name of attribute to check for
• already_found – (bool) whether this attribute has already been found in a wrapper
Returns (str or None) name of module whose attribute is being shadowed, if any.
render(mode: str = 'human')
Gym environment rendering.
Parameters mode – the rendering type
reset()
Reset all the environments and return an array of observations, or a tuple of observation arrays.
If step_async is still doing work, that work will be cancelled and step_wait() should not be called until
step_async() is invoked again.
Returns ([int] or [float]) observation
seed(seed: Optional[int] = None) → List[Union[None, int]]
Sets the random seeds for all environments, based on a given seed. Each individual environment will still
get its own seed, by incrementing the given seed.
Parameters seed – (Optional[int]) The random seed. May be None for completely random
seeding.
Returns (List[Union[None, int]]) Returns a list containing the seeds for each individual env.
Note that all list elements may be None, if the env does not return anything when being
seeded.
set_attr(attr_name, value, indices=None)
Set attribute inside vectorized environments.
Parameters
• attr_name – (str) The name of attribute to assign new value
• value – (obj) Value to assign to attr_name
• indices – (list,int) Indices of envs to assign value
Returns (NoneType)
step(actions)
Step the environments with the given action
Parameters actions – ([int] or [float]) the action
Returns ([int] or [float], [float], [bool], dict) observation, reward, done, information
step_async(actions)
Tell all the environments to start taking a step with the given actions. Call step_wait() to get the results of
the step.
You should not call this if a step_async run is already pending.
step_wait()
Wait for the step taken with step_async().
Returns ([int] or [float], [float], [bool], dict) observation, reward, done, information
1.7.2 DummyVecEnv
class stable_baselines.common.vec_env.DummyVecEnv(env_fns)
Creates a simple vectorized wrapper for multiple environments, calling each environment in sequence on the current Python process. This is useful for computationally simple environments such as CartPole-v1, as the overhead of multiprocessing or multithreading outweighs the environment computation time. This can also be used for RL methods that require a vectorized environment, but where you want a single environment to train with.
Parameters env_fns – ([callable]) A list of functions that will create the environments (each
callable returns a Gym.Env instance when called).
close()
Clean up the environment’s resources.
env_method(method_name, *method_args, indices=None, **method_kwargs)
Call instance methods of vectorized environments.
get_attr(attr_name, indices=None)
Return attribute from vectorized environment (see base class).
get_images() → Sequence[numpy.ndarray]
Return RGB images from each environment
1.7.3 SubprocVecEnv
Warning: Only the 'forkserver' and 'spawn' start methods are thread-safe, which is important when TensorFlow sessions or other non thread-safe libraries are used in the parent (see issue #217). However, compared to 'fork' they incur a small start-up cost and have restrictions on global variables. With those methods, users must wrap the code in an if __name__ == "__main__": block. For more information, see the multiprocessing documentation.
class stable_baselines.common.vec_env.SubprocVecEnv(env_fns, start_method=None)
Parameters
• env_fns – ([callable]) A list of functions that will create the environments (each callable
returns a Gym.Env instance when called).
• start_method – (str) method used to start the subprocesses. Must be one of the methods
returned by multiprocessing.get_all_start_methods(). Defaults to ‘forkserver’ on available
platforms, and ‘spawn’ otherwise.
close()
Clean up the environment’s resources.
env_method(method_name, *method_args, indices=None, **method_kwargs)
Call instance methods of vectorized environments.
get_attr(attr_name, indices=None)
Return attribute from vectorized environment (see base class).
get_images() → Sequence[numpy.ndarray]
Return RGB images from each environment
reset()
Reset all the environments and return an array of observations, or a tuple of observation arrays.
If step_async is still doing work, that work will be cancelled and step_wait() should not be called until
step_async() is invoked again.
Returns ([int] or [float]) observation
seed(seed=None)
Sets the random seeds for all environments, based on a given seed. Each individual environment will still
get its own seed, by incrementing the given seed.
Parameters seed – (Optional[int]) The random seed. May be None for completely random
seeding.
Returns (List[Union[None, int]]) Returns a list containing the seeds for each individual env.
Note that all list elements may be None, if the env does not return anything when being
seeded.
set_attr(attr_name, value, indices=None)
Set attribute inside vectorized environments (see base class).
step_async(actions)
Tell all the environments to start taking a step with the given actions. Call step_wait() to get the results of
the step.
You should not call this if a step_async run is already pending.
step_wait()
Wait for the step taken with step_async().
Returns ([int] or [float], [float], [bool], dict) observation, reward, done, information
1.7.4 Wrappers
VecFrameStack
VecNormalize
Deprecated since version 2.9.0: This function will be removed in a future version
normalize_obs(obs: numpy.ndarray) → numpy.ndarray
Normalize observations using this VecNormalize’s observations statistics. Calling this method does not
update statistics.
normalize_reward(reward: numpy.ndarray) → numpy.ndarray
Normalize rewards using this VecNormalize’s rewards statistics. Calling this method does not update
statistics.
reset()
Reset all environments
save_running_average(path)
Parameters path – (str) path to log dir
Deprecated since version 2.9.0: This function will be removed in a future version
set_venv(venv)
Sets the vector environment to wrap to venv.
Also sets attributes derived from this such as num_env.
Parameters venv – (VecEnv)
step_wait()
Apply a sequence of actions to the sequence of environments: actions -> (observations, rewards, news), where 'news' is a boolean vector indicating whether each element is new.
VecVideoRecorder
step_wait()
Wait for the step taken with step_async().
Returns ([int] or [float], [float], [bool], dict) observation, reward, done, information
VecCheckNan
To use the RL baselines with custom environments, they just need to follow the gym interface. That is to say, your environment must implement the following methods (and inherit from the OpenAI Gym class):
Note: If you are using images as input, the input values must be in [0, 255] as the observation is normalized (dividing
by 255 to have values in [0, 1]) when using CNN policies.
import gym
from gym import spaces
class CustomEnv(gym.Env):
"""Custom Environment that follows gym interface"""
metadata = {'render.modes': ['human']}
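# (The rest of the skeleton is a sketch; the space definitions and constructor arguments are placeholders,
#  and `import numpy as np` is assumed at the top.)
    def __init__(self, arg1, arg2):
        super(CustomEnv, self).__init__()
        # Define action and observation space; they must be gym.spaces objects
        self.action_space = spaces.Discrete(N_DISCRETE_ACTIONS)
        self.observation_space = spaces.Box(low=0, high=255,
                                            shape=(HEIGHT, WIDTH, N_CHANNELS), dtype=np.uint8)

    def step(self, action):
        ...
        return observation, reward, done, info

    def reset(self):
        ...
        return observation  # only the observation is returned by reset()

    def render(self, mode='human'):
        ...

    def close(self):
        ...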
To check that your environment follows the gym interface, please use:
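A sketch of that check (CustomEnv is the class from the skeleton above; arg1/arg2 are its placeholder constructor arguments):

from stable_baselines.common.env_checker import check_env

env = CustomEnv(arg1, arg2)
check_env(env)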
We have created a colab notebook for a concrete example of creating a custom environment.
You can also find a complete guide online on creating a custom Gym environment.
Optionally, you can also register the environment with gym, which will allow you to create the RL agent in one line (and use gym.make() to instantiate the env).
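A sketch of such a registration (the id and entry point are placeholders; this uses gym's standard register API):

from gym.envs.registration import register

register(
    id='CustomEnv-v0',
    entry_point='my_module:CustomEnv',
)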
In the project, for testing purposes, we use a custom environment named IdentityEnv defined in this file. An
example of how to use it can be found here.
Stable Baselines provides default policy networks (see Policies) for images (CNNPolicies) and other types of input features (MlpPolicies).
One way of customising the policy network architecture is to pass arguments when creating the model, using the policy_kwargs parameter:
import gym
import tensorflow as tf

from stable_baselines import PPO2
# Custom MLP policy of two layers of size 32 each with tanh activation function
policy_kwargs = dict(act_fun=tf.nn.tanh, net_arch=[32, 32])
# Create the agent
model = PPO2("MlpPolicy", "CartPole-v1", policy_kwargs=policy_kwargs, verbose=1)
# Retrieve the environment
env = model.get_env()
# Train the agent
model.learn(total_timesteps=100000)
# Save the agent
model.save("ppo2-cartpole")
del model
# the policy_kwargs are automatically loaded
model = PPO2.load("ppo2-cartpole")
You can also easily define a custom architecture for the policy (or value) network:
Note: Defining a custom policy class is equivalent to passing policy_kwargs. However, it lets you name the policy and so usually makes the code clearer. policy_kwargs should rather be used when doing hyperparameter search.
import gym
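# (The elided part of this example is sketched below; the layer sizes are illustrative.)
from stable_baselines.common.policies import FeedForwardPolicy
from stable_baselines import A2C

# Custom MLP policy of three layers of size 128 each
class CustomPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(*args, **kwargs,
                                           net_arch=[dict(pi=[128, 128, 128],
                                                          vf=[128, 128, 128])],
                                           feature_extraction="mlp")

# Create and train the agent using the custom policy
model = A2C(CustomPolicy, 'LunarLander-v2', verbose=1)
model.learn(total_timesteps=100000)
model.save("a2c-lunar")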
del model
# When loading a model with a custom policy
# you MUST pass explicitly the policy when loading the saved model
model = A2C.load("a2c-lunar", policy=CustomPolicy)
Warning: When loading a model with a custom policy, you must pass the custom policy explicitly when loading the model (cf. previous example).
You can also register your policy, to help with code simplicity: you can refer to your custom policy using a string.
import gym

from stable_baselines.common.policies import register_policy

# Register the policy, it will check that the name is not already taken
register_policy('CustomPolicy', CustomPolicy)
Deprecated since version 2.3.0: Use net_arch instead of the layers parameter to define the network architecture. It allows greater control.
The net_arch parameter of FeedForwardPolicy allows you to specify the number and size of the hidden layers and how many of them are shared between the policy network and the value network. It is assumed to be a list with the following structure:
1. An arbitrary number (zero allowed) of integers, each specifying the number of units in a shared layer. If the number of integers is zero, there will be no shared layers.
2. An optional dict, to specify the following non-shared layers for the value network and the policy network. It is formatted like dict(vf=[<value layer sizes>], pi=[<policy layer sizes>]). If either of the keys (pi or vf) is missing, no non-shared layers are assumed for that network (empty list).
In short: [<shared layers>, dict(vf=[<non-shared value network layers>],
pi=[<non-shared policy network layers>])].
1.9.1 Examples
Value network deeper than policy network, first layer shared: net_arch=[128, dict(vf=[256, 256])]

          obs
           |
         <128>
         /   \
    action   <256>
               |
             <256>
               |
             value
For LSTM policies, the net_arch parameter takes an additional (mandatory) 'lstm' entry within the shared network section. The LSTM is shared between the value network and the policy network.
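For example (an illustrative sketch of such an architecture specification):

net_arch=[8, 'lstm', dict(vf=[5, 10], pi=[10])]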
If your task requires even more granular control over the policy architecture, you can redefine the policy directly:
import gym
import tensorflow as tf
# Custom MLP policy of three layers of size 128 each for the actor and 2 layers of 32 for the critic
vf_h = extracted_features
for i, layer_size in enumerate([32, 32]):
vf_h = activ(tf.layers.dense(vf_h, layer_size, name='vf_fc' + str(i)))
value_fn = tf.layers.dense(vf_h, 1, name='vf')
vf_latent = vf_h
self._value_fn = value_fn
self._setup_init()
    def step(self, obs, state=None, mask=None, deterministic=False):
        if deterministic:
            action, value, neglogp = self.sess.run([self.deterministic_action, self.value_flat, self.neglogp],
                                                   {self.obs_ph: obs})
        else:
            action, value, neglogp = self.sess.run([self.action, self.value_flat, self.neglogp],
                                                   {self.obs_ph: obs})
        return action, value, self.initial_state, neglogp
1.10 Callbacks
A callback is a set of functions that will be called at given stages of the training procedure. You can use callbacks to
access internal state of the RL model during training. It allows one to do monitoring, auto saving, model manipulation,
progress bars, . . .
To build a custom callback, you need to create a class that derives from BaseCallback. This will give you access
to events (_on_training_start, _on_step) and useful variables (like self.model for the RL model).
You can find two examples of custom callbacks in the documentation: one for saving the best model according to the
training reward (see Examples), and one for logging additional values with Tensorboard (see Tensorboard section).
class CustomCallback(BaseCallback):
    """
    A custom callback that derives from ``BaseCallback``.
    """
Note: self.num_timesteps corresponds to the total number of steps taken in the environment, i.e., it is the number of environments multiplied by the number of times env.step() was called.
You should know that PPO1 and TRPO update self.num_timesteps after each rollout (and not each step) because they
rely on MPI.
For the other algorithms, self.num_timesteps is incremented by n_envs (number of environments) after each call to
env.step()
Note: For off-policy algorithms like SAC, DDPG, TD3 or DQN, the notion of rollout corresponds to the steps
taken in the environment between two updates.
Compared to Keras, Stable Baselines provides a second type of BaseCallback, named EventCallback that is
meant to trigger events. When an event is triggered, then a child callback is called.
As an example, EvalCallback is an EventCallback that will trigger its child callback when there is a new best model. A child callback could be, for instance, StopTrainingOnRewardThreshold, which stops the training if the mean reward achieved by the RL model is above a threshold.
Note: We recommend taking a look at the source code of EvalCallback and StopTrainingOnRewardThreshold to get a better overview of what can be achieved with this kind of callback.
class EventCallback(BaseCallback):
    """
    Base class for triggering callback on event.
    """
CheckpointCallback
Callback for saving a model every save_freq steps. You must specify a log folder (save_path) and optionally a prefix for the checkpoints (rl_model by default).
from stable_baselines import SAC
from stable_baselines.common.callbacks import CheckpointCallback

# Save a checkpoint every 1000 steps
checkpoint_callback = CheckpointCallback(save_freq=1000, save_path='./logs/',
                                         name_prefix='rl_model')

model = SAC('MlpPolicy', 'Pendulum-v0')
model.learn(2000, callback=checkpoint_callback)
EvalCallback
Periodically evaluate the performance of an agent, using a separate test environment. It will save the best model if the best_model_save_path folder is specified, and save the evaluation results in a numpy archive (evaluations.npz) if the log_path folder is specified.
Note: You can pass a child callback via the callback_on_new_best argument. It will be triggered each time
there is a new best model.
import gym
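# (The rest of this example is a sketch: SAC on Pendulum with illustrative hyperparameters and paths.)
from stable_baselines import SAC
from stable_baselines.common.callbacks import EvalCallback

# Separate evaluation env
eval_env = gym.make('Pendulum-v0')
# Use deterministic actions for evaluation
eval_callback = EvalCallback(eval_env, best_model_save_path='./logs/',
                             log_path='./logs/', eval_freq=500,
                             deterministic=True, render=False)

model = SAC('MlpPolicy', 'Pendulum-v0')
model.learn(5000, callback=eval_callback)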
CallbackList
Class for chaining callbacks; they will be called sequentially. Alternatively, you can pass a list of callbacks directly to the learn() method; it will be converted automatically to a CallbackList.
import gym
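# (Sketch: combining CheckpointCallback and EvalCallback via a CallbackList; paths and frequencies are illustrative.)
from stable_baselines import SAC
from stable_baselines.common.callbacks import CallbackList, CheckpointCallback, EvalCallback

checkpoint_callback = CheckpointCallback(save_freq=1000, save_path='./logs/')
eval_env = gym.make('Pendulum-v0')
eval_callback = EvalCallback(eval_env, best_model_save_path='./logs/best_model',
                             log_path='./logs/results', eval_freq=500)
# Create the callback list
callback = CallbackList([checkpoint_callback, eval_callback])

model = SAC('MlpPolicy', 'Pendulum-v0')
# Equivalent to: model.learn(5000, callback=[checkpoint_callback, eval_callback])
model.learn(5000, callback=callback)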
StopTrainingOnRewardThreshold
Stop the training once a threshold in episodic reward (mean episode reward over the evaluations) has been reached (i.e., when the model is good enough). It must be used with the EvalCallback and uses the event triggered by a new best model.
import gym
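# (Sketch: stop training once the mean reward reaches a threshold; SAC on Pendulum and the threshold are illustrative.)
from stable_baselines import SAC
from stable_baselines.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold

# Stop training when the model reaches the reward threshold
callback_on_best = StopTrainingOnRewardThreshold(reward_threshold=-200, verbose=1)
eval_env = gym.make('Pendulum-v0')
eval_callback = EvalCallback(eval_env, callback_on_new_best=callback_on_best, verbose=1)

model = SAC('MlpPolicy', 'Pendulum-v0', verbose=1)
# Almost infinite number of timesteps, but the training will stop
# early as soon as the reward threshold is reached
model.learn(int(1e10), callback=eval_callback)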
EveryNTimesteps
An Event Callback that will trigger its child callback every n_steps timesteps.
Note: Because of the way PPO1 and TRPO work (they rely on MPI), n_steps is a lower bound on the number of timesteps between two events.
import gym
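# (Sketch: trigger a CheckpointCallback every 500 steps via EveryNTimesteps; PPO2 on Pendulum is illustrative.)
from stable_baselines import PPO2
from stable_baselines.common.callbacks import CheckpointCallback, EveryNTimesteps

# this is equivalent to defining CheckpointCallback(save_freq=500):
# checkpoint_on_event will be triggered every 500 steps
checkpoint_on_event = CheckpointCallback(save_freq=1, save_path='./logs/')
event_callback = EveryNTimesteps(n_steps=500, callback=checkpoint_on_event)

model = PPO2('MlpPolicy', 'Pendulum-v0', verbose=1)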
model.learn(int(2e4), callback=event_callback)
Warning: This way of doing callbacks is deprecated in favor of the object oriented approach.
A callback function takes the locals() variables and the globals() variables from the model, then returns a boolean value for whether or not the training should continue.
Thanks to the access to the model's variables, in particular _locals["self"], we are even able to change the parameters of the model without halting the training, or changing the model's code.
This callback will save the model and stop the training after the first call.
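A sketch of such a callback (the save path is illustrative; the callback receives the locals and globals of the learn call):

def save_and_stop_callback(_locals, _globals):
    # The model is accessible via the locals of the learn() call
    _locals['self'].save('model_checkpoint')
    # Returning False stops the training
    return False

model.learn(total_timesteps=10000, callback=save_and_stop_callback)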
class stable_baselines.common.callbacks.EvalCallback(eval_env: Union[gym.core.Env, stable_baselines.common.vec_env.base_vec_env.VecEnv], callback_on_new_best: Optional[stable_baselines.common.callbacks.BaseCallback] = None, n_eval_episodes: int = 5, eval_freq: int = 10000, log_path: str = None, best_model_save_path: str = None, deterministic: bool = True, render: bool = False, verbose: int = 1)
Callback for evaluating an agent.
Parameters
• eval_env – (Union[gym.Env, VecEnv]) The environment used for initialization
• callback_on_new_best – (Optional[BaseCallback]) Callback to trigger when there is
a new best model according to the mean_reward
• n_eval_episodes – (int) The number of episodes to test the agent
• eval_freq – (int) Evaluate the agent every eval_freq call of the callback.
• log_path – (str) Path to a folder where the evaluations (evaluations.npz) will be saved. It
will be updated at each evaluation.
• best_model_save_path – (str) Path to a folder where the best model according to
performance on the eval env will be saved.
• deterministic – (bool) Whether the evaluation should use deterministic or stochastic actions.
• render – (bool) Whether to render or not the environment during evaluation
• verbose – (int)
class stable_baselines.common.callbacks.EventCallback(callback: Optional[stable_baselines.common.callbacks.BaseCallback] = None, verbose: int = 0)
Base class for triggering callback on event.
Parameters
• callback – (Optional[BaseCallback]) Callback that will be called when an event is trig-
gered.
• verbose – (int)
init_callback(model: BaseRLModel) → None
Initialize the callback by saving references to the RL model and the training environment for convenience.
class stable_baselines.common.callbacks.EveryNTimesteps(n_steps: int, callback: stable_baselines.common.callbacks.BaseCallback)
Trigger a callback every n_steps timesteps
Parameters
• n_steps – (int) Number of timesteps between two triggers.
• callback – (BaseCallback) Callback that will be called when the event is triggered.
class stable_baselines.common.callbacks.StopTrainingOnRewardThreshold(reward_threshold: float, verbose: int = 0)
Stop the training once a threshold in episodic reward has been reached (i.e. when the model is good enough).
It must be used with the EvalCallback.
Parameters
• reward_threshold – (float) Minimum expected reward per episode to stop training.
• verbose – (int)
To use Tensorboard with the rl baselines, you simply need to define a log location for the RL agent:
import gym
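# (Sketch: pass tensorboard_log when creating the model; A2C and the log path are illustrative.)
from stable_baselines import A2C

model = A2C('MlpPolicy', 'CartPole-v1', verbose=1,
            tensorboard_log="./a2c_cartpole_tensorboard/")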
model.learn(total_timesteps=10000)
Or after loading an existing model (by default the log path is not saved):
import gym

from stable_baselines import A2C
from stable_baselines.common.vec_env import DummyVecEnv

env = gym.make('CartPole-v1')
env = DummyVecEnv([lambda: env])  # The algorithms require a vectorized environment to run
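# (Sketch: reload a saved agent, attach the env, and set the log path explicitly; the file name is illustrative.)
model = A2C.load("a2c_cartpole.zip", env=env, tensorboard_log="./a2c_cartpole_tensorboard/")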
model.learn(total_timesteps=10000)
You can also define a custom logging name when training (by default it is the algorithm name):
import gym
model.learn(total_timesteps=10000, tb_log_name="first_run")
# Pass reset_num_timesteps=False to continue the training curve in tensorboard
# By default, it will create a new curve
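# (Sketch of the elided continuation: reuse the same tensorboard log while naming each run.)
model.learn(total_timesteps=10000, tb_log_name="second_run", reset_num_timesteps=False)
model.learn(total_timesteps=10000, tb_log_name="third_run", reset_num_timesteps=False)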
Once the learn function is called, you can monitor the RL agent during or after the training, with the following bash
command:
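For example (assuming the log directory used above):

tensorboard --logdir ./a2c_cartpole_tensorboard/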
It will display information such as the model graph, the episode reward, the model losses, the observations and other parameters unique to some models.
Using a callback, you can easily log more values with TensorBoard. Here is a simple example of how to log both an additional tensor and an arbitrary scalar value:
import tensorflow as tf
import numpy as np

from stable_baselines import SAC
from stable_baselines.common.callbacks import BaseCallback

model = SAC("MlpPolicy", "Pendulum-v0", tensorboard_log="/tmp/sac/", verbose=1)

class TensorboardCallback(BaseCallback):
    """
    Custom callback for plotting additional values in tensorboard.
    """
    def __init__(self, verbose=0):
        self.is_tb_set = False
        super(TensorboardCallback, self).__init__(verbose)

    def _on_step(self) -> bool:
        # Log an additional tensor (value_target is specific to the SAC model used here)
        if not self.is_tb_set:
            with self.model.graph.as_default():
                tf.summary.scalar('value_target', tf.reduce_mean(self.model.value_target))
                self.model.summary = tf.summary.merge_all()
            self.is_tb_set = True
        # Log scalar value (here a random variable)
        value = np.random.random()
        summary = tf.Summary(value=[tf.Summary.Value(tag='random_value', simple_value=value)])
        self.locals['writer'].add_summary(summary, self.num_timesteps)
        return True
model.learn(50000, callback=TensorboardCallback())
All the information displayed in the terminal (default logging) can also be logged in tensorboard. For that, you need to define several environment variables:
# formats are comma-separated, but for tensorboard you only need the last one
# stdout -> terminal
export OPENAI_LOG_FORMAT='stdout,log,csv,tensorboard'
export OPENAI_LOGDIR=path/to/tensorboard/data
Then configure the logger before training and launch tensorboard:
from stable_baselines.logger import configure

configure()

tensorboard --logdir=$OPENAI_LOGDIR
1.12 RL Baselines Zoo
RL Baselines Zoo is a collection of pre-trained Reinforcement Learning agents using Stable-Baselines. It also provides basic scripts for training and evaluating agents, tuning hyperparameters, and recording videos.
Goals of this repository:
1. Provide a simple interface to train and enjoy RL agents
2. Benchmark the different Reinforcement Learning algorithms
3. Provide tuned hyperparameters for each environment and RL algorithm
4. Have fun with the trained agents!
1.12.1 Installation
1. Install dependencies
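A sketch of the elided steps (assuming the araffin/rl-baselines-zoo repository; check its README for the exact dependency list):

apt-get install swig cmake libopenmpi-dev zlib1g-dev ffmpeg
pip install stable-baselines box2d-py pyyaml pybullet optuna pytablewriter

2. Clone the repository:
git clone https://github.com/araffin/rl-baselines-zoo
cd rl-baselines-zoo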
Train for multiple environments (with one call) and with tensorboard logging:
Continue training (here, load pretrained agent for Breakout and continue training for 5000 steps):
If the trained agent exists, then you can see it in action using:
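The corresponding commands roughly look like this (sketches following the zoo's train.py / enjoy.py interface; the algorithm, environments and paths are illustrative):

# Train A2C on several environments with one call, logging to tensorboard
python train.py --algo a2c --env MountainCar-v0 CartPole-v1 --tensorboard-log /tmp/stable-baselines/
# Continue training a pretrained Breakout agent for 5000 steps
python train.py --algo a2c --env BreakoutNoFrameskip-v4 -i trained_agents/a2c/BreakoutNoFrameskip-v4.pkl -n 5000
# Enjoy (visualize) a trained agent
python enjoy.py --algo a2c --env BreakoutNoFrameskip-v4 --folder trained_agents/ -n 5000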
python train.py --algo ppo2 --env MountainCar-v0 -n 50000 -optimize --n-trials 1000 --n-jobs 2 \
Note: You can find more information about the RL Baselines Zoo in the repo README, for instance how to record a video of a trained agent.
With the .pretrain() method, you can pre-train RL policies using trajectories from an expert, and therefore
accelerate training.
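A sketch of the basic usage (PPO2 on CartPole is illustrative, and expert_cartpole.npz is a placeholder expert dataset file):

from stable_baselines import PPO2
from stable_baselines.gail import ExpertDataset

# Using only one expert trajectory
# (you can specify traj_limitation=-1 for using the whole dataset)
dataset = ExpertDataset(expert_path='expert_cartpole.npz',
                        traj_limitation=1, batch_size=128)

model = PPO2('MlpPolicy', 'CartPole-v1', verbose=1)
# Pretrain the PPO2 model with behavior cloning
model.pretrain(dataset, n_epochs=1000)

# Optionally, continue training with RL afterwards
model.learn(int(1e5))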
Behavior Cloning (BC) treats the problem of imitation learning, i.e., using expert demonstrations, as a supervised learning problem. That is to say, given expert trajectories (observation-action pairs), the policy network is trained to reproduce the expert behavior: for a given observation, the action taken by the policy must be the one taken by the expert.
Expert trajectories can be human demonstrations, trajectories from another controller (e.g. a PID controller), or trajectories from a trained RL agent.
Note: Only Box and Discrete spaces are supported for now for pre-training a model.
Note: Image datasets are treated a bit differently than other datasets to avoid memory issues. The images from the expert demonstrations must be located in a folder, not in the expert numpy archive.
Here, we are going to train an RL model and then generate expert trajectories using this agent.
Note that in practice, generating expert trajectories usually does not require training an RL agent.
The following example is only meant to demonstrate the pretrain() feature.
However, we recommend users take a look at the code of the generate_expert_traj() function (located in the gail/dataset/ folder) to learn about the data structure of the expert dataset (see below for an overview) and how to record trajectories.
Here is an additional example where the expert controller is a callable that is passed to the function instead of an RL model. The idea is that this callable can be a PID controller, a human player, . . .
import gym

from stable_baselines.gail import generate_expert_traj

env = gym.make("CartPole-v1")
# Here the expert is a random agent
# but it can be any python function, e.g. a PID controller
def dummy_expert(_obs):
    """
    Random agent. It samples actions randomly
    from the action space of the environment.
    """
    return env.action_space.sample()

# Data will be saved in a numpy archive named `dummy_expert_cartpole.npz`
generate_expert_traj(dummy_expert, 'dummy_expert_cartpole', env, n_episodes=10)
# Test the pre-trained model (from the pre-training example above)
env = model.get_env()
obs = env.reset()

reward_sum = 0.0
for _ in range(1000):
    action, _ = model.predict(obs)
    obs, reward, done, _ = env.step(action)
    reward_sum += reward
    env.render()
    if done:
        print(reward_sum)
        reward_sum = 0.0
        obs = env.reset()

env.close()
The expert dataset is a .npz archive. The data is saved in python dictionary format with keys: actions,
episode_returns, rewards, obs, episode_starts.
In case of images, obs contains the relative path to the images.
obs, actions: shape (N * L, ) + S
where N = # episodes, L = episode length and S is the environment observation/action space.
S = (1, ) for discrete space
class stable_baselines.gail.ExpertDataset(expert_path=None, traj_data=None, train_fraction=0.7, batch_size=64, traj_limitation=-1, randomize=True, verbose=1, sequential_preprocessing=False)
Dataset for using behavior cloning or GAIL.
The structure of the expert dataset is a dict, saved as an ".npz" archive. The dictionary contains the keys 'actions', 'episode_returns', 'rewards', 'obs' and 'episode_starts'. The corresponding values have data concatenated across episodes: the first axis is the timestep, the remaining axes index into the data. In the case of images, 'obs' contains the relative path to the images, to enable space saving from image compression.
Parameters
• expert_path – (str) The path to trajectory data (.npz file). Mutually exclusive with
traj_data.
• traj_data – (dict) Trajectory data, in format described above. Mutually exclusive with
expert_path.
• train_fraction – (float) the train/validation split (0 to 1) for pre-training using behavior cloning (BC)
• batch_size – (int) the minibatch size for behavior cloning
• traj_limitation – (int) the number of trajectories to use (if -1, load all)
• randomize – (bool) if the dataset should be shuffled
• verbose – (int) Verbosity
• sequential_preprocessing – (bool) Do not use subprocess to preprocess the data (slower but uses less memory for the CI)
get_next_batch(split=None)
Get the batch from the dataset.
Parameters split – (str) the type of data split (can be None, ‘train’, ‘val’)
Returns (np.ndarray, np.ndarray) inputs and labels
init_dataloader(batch_size)
Initialize the dataloader used by GAIL.
Parameters batch_size – (int)
log_info()
Log the information of the dataset.
plot()
Show histogram plotting of the episode returns
class stable_baselines.gail.DataLoader(indices, observations, actions, batch_size, n_workers=1, infinite_loop=True, max_queue_len=1, shuffle=False, start_process=True, backend='threading', sequential=False, partial_minibatch=True)
A custom dataloader to preprocess observations (including images) and feed them to the network.
Original code for the dataloader from https://github.com/araffin/robotics-rl-srl (MIT licence) Authors: Antonin
Raffin, René Traoré, Ashley Hill
Parameters
• indices – ([int]) list of observations indices
• observations – (np.ndarray) observations or images path
• actions – (np.ndarray) actions
• batch_size – (int) Number of samples per minibatch
• n_workers – (int) number of preprocessing workers (for loading the images)
• infinite_loop – (bool) whether to have an iterator that can be reset
• max_queue_len – (int) Max number of minibatches that can be preprocessed at the same time
Note: only Box and Discrete spaces are supported for now.
The parameters below refer to the generate_expert_traj() function mentioned above:
Parameters
• model – (RL model or callable) The expert model, if it needs to be trained, then you need
to pass n_timesteps > 0.
• save_path – (str) Path without the extension where the expert dataset will be saved (ex:
‘expert_cartpole’ -> creates ‘expert_cartpole.npz’). If not specified, it will not save, and just
return the generated expert trajectories. This parameter must be specified for image-based
environments.
• env – (gym.Env) The environment, if not defined then it tries to use the model environment.
• n_timesteps – (int) Number of training timesteps
• n_episodes – (int) Number of trajectories (episodes) to record
• image_folder – (str) When using images, folder that will be used to record images.
Returns (dict) the generated expert trajectories.
During the training of a model on a given environment, it is possible that the RL model becomes completely corrupted when a NaN or an inf is given to or returned by the RL model.
The issue arises when NaNs or infs do not crash the program, but simply get propagated through the training, until all the floating point numbers converge to NaN or inf. This is in line with the IEEE Standard for Floating-Point Arithmetic (IEEE 754), which says:
Note:
Five possible exceptions can occur:
• Invalid operation (sqrt(-1), inf × 1, NaN mod 1, . . . ) returns NaN
• Division by zero:
– if the operand is not zero (1/0, −2/0, . . . ) returns ± inf
– if the operand is zero (0/0) returns signaling NaN
• Overflow (exponent too high to represent) returns ± inf
• Underflow (exponent too low to represent) returns 0
• Inexact (not representable exactly in base 2, e.g. 1/5) returns the rounded value (ex: assert (1/5) * 3 == 0.6000000000000001)
And of these, only Division by zero will signal an exception, the rest will propagate invalid values quietly.
In Python, dividing by zero will indeed raise the exception ZeroDivisionError: float division by zero, but it ignores the rest.
The default in numpy will warn (RuntimeWarning: invalid value encountered) but will not halt the code.
And worst of all, Tensorflow will not signal anything:
import tensorflow as tf
import numpy as np
print("tensorflow test:")
a = tf.constant(1.0)
b = tf.constant(0.0)
c = a / b
sess = tf.Session()
val = sess.run(c) # this will be quiet
print(val)
sess.close()
print("\r\nnumpy test:")
a = np.float64(1.0)
b = np.float64(0.0)
val = a / b # this will warn
print(val)
a = 1.0
b = 0.0
val = a / b # this will raise an exception and halt.
print(val)
Unfortunately, most of the floating point operations are handled by Tensorflow and numpy, meaning you might get little to no warning when an invalid value occurs.
Numpy has a convenient way of dealing with invalid values: numpy.seterr, which defines, for the Python process, how it should handle floating point errors.
import numpy as np
np.seterr(all='raise')  # define before your code

print("numpy test:")
a = np.float64(1.0)
b = np.float64(0.0)
val = a / b # this will now raise an exception instead of a warning.
print(val)
but this will also avoid overflow issues on floating point numbers:
import numpy as np
a = np.float64(10)
b = np.float64(1000)
val = a ** b # this will now raise an exception
print(val)
a = np.float64('NaN')
b = np.float64(1.0)
val = a + b # this will neither warn nor raise anything
print(val)
Tensorflow can add checks for detecting and dealing with invalid values: tf.add_check_numerics_ops and tf.check_numerics. However, they will add operations to the Tensorflow graph and increase the computation time.
import tensorflow as tf

print("tensorflow test:")

a = tf.constant(1.0)
b = tf.constant(0.0)
c = a / b

check_nan = tf.add_check_numerics_ops()  # add after your graph definition

sess = tf.Session()
val, _ = sess.run([c, check_nan])  # this will now raise an exception
print(val)
sess.close()
but this will also avoid overflow issues on floating point numbers:
import tensorflow as tf

a = tf.constant(10.0)
b = tf.constant(1000.0)
c = a ** b  # overflows to inf

check_nan = tf.add_check_numerics_ops()  # add after your graph definition

sess = tf.Session()
val, _ = sess.run([c, check_nan])  # this will now raise an exception
print(val)
sess.close()
import tensorflow as tf

a = tf.constant(float('NaN'))
b = tf.constant(1.0)
c = a + b

check_nan = tf.add_check_numerics_ops()  # add after your graph definition

sess = tf.Session()
val, _ = sess.run([c, check_nan])  # this will now raise an exception
print(val)
sess.close()
In order to find when and from where the invalid value originated, stable-baselines comes with a VecCheckNan wrapper.
It will monitor the actions, observations, and rewards, indicating which action or observation caused the invalid value and where it came from.
import gym
from gym import spaces
import numpy as np

from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv, VecCheckNan

class NanAndInfEnv(gym.Env):
    """Custom Environment that raises NaNs and Infs"""
    metadata = {'render.modes': ['human']}

    def __init__(self):
        super(NanAndInfEnv, self).__init__()
        self.action_space = spaces.Box(low=-np.inf, high=np.inf, shape=(1,), dtype=np.float64)
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(1,), dtype=np.float64)

    def step(self, _action):
        # (the NaN/inf injection below is a sketch of the elided original step method)
        randf = np.random.rand()
        obs = float('NaN') if randf > 0.99 else (float('inf') if randf > 0.98 else randf)
        return [obs], 0.0, False, {}

    def reset(self):
        return [0.0]
# Create environment
env = DummyVecEnv([lambda: NanAndInfEnv()])
env = VecCheckNan(env, raise_exception=True)

model = PPO2('MlpPolicy', env)
model.learn(total_timesteps=int(2e5))  # will raise an exception pointing at the source of the invalid value
Depending on your hyperparameters, NaN can occur much more often. A great example of this: https://github.com/hill-a/stable-baselines/issues/340
Be aware that the hyperparameters given by default seem to work in most cases; however, your environment might not play nicely with them. If this is the case, try to read up on the effect each hyperparameter has on the model, so that you can try to tune them to get a stable model. Alternatively, you can try automatic hyperparameter tuning (included in the RL zoo).
If your environment is generated from an external dataset, do not forget to make sure your dataset does not contain NaNs, as some datasets will sometimes fill missing values with NaNs as a surrogate value.
Here is some reading material about finding NaNs: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html
And about filling the missing values with something else (imputation): https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4
Stable Baselines stores both neural network parameters and algorithm-related parameters such as the exploration schedule, number of environments and observation/action space. This allows continual learning and easy use of trained agents without training, but it is not without its issues. The following describes the two formats used to save agents in Stable Baselines, along with their pros and shortcomings.
Terminology used in this page:
• parameters refer to neural network parameters (also called "weights"). This is a dictionary mapping Tensorflow variable names to NumPy arrays.
• data refers to RL algorithm parameters, e.g. learning rate, exploration schedule, action/observation space. These depend on the algorithm used. This is a dictionary mapping class variable names to their values.
Original stable baselines save format. Data and parameters are bundled up into a tuple (data, parameters) and
then serialized with cloudpickle library (essentially the same as pickle).
This save format is still available via the cloudpickle argument of the model's save function in stable-baselines
versions above v2.7.0, for backwards compatibility reasons, but its usage is discouraged.
Pros:
• Easy to implement and use.
• Works with almost any type of Python object, including functions.
Cons:
• Pickle/Cloudpickle is not designed for long-term storage or for sharing between Python versions.
• If one object in the file is not readable (e.g. wrong library version), then reading the rest of the file is difficult.
• Python-specific format, hard to read stored files from other languages.
If part of a saved model becomes unreadable for any reason (e.g. different Tensorflow versions), it may be tricky
to restore any part of the model. For this reason, another save format was designed.
A zip-archived JSON dump and NumPy zip archive of the arrays. The data dictionary (class parameters) is stored as a
JSON file, model parameters are serialized with numpy.savez function and these two files are stored under a single
.zip archive.
Any objects that are not JSON serializable are serialized with cloudpickle and stored as base64-encoded strings in the
JSON file, along with some metadata about the serialized object. This allows inspecting stored objects
without deserializing the objects themselves.
This format allows skipping elements in the file, i.e. we can skip deserializing objects that are broken/non-serializable.
This can be done via the custom_objects argument of the load function.
This is the default save format in stable baselines versions after v2.7.0.
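For example, if the callable stored for the learning rate cannot be deserialized with your current setup, you can
override it at load time; a minimal sketch (assuming the saved agent is a PPO2 model):

from stable_baselines import PPO2

# skip deserializing the stored learning_rate entry and use a constant instead
model = PPO2.load("saved_model.zip", custom_objects={'learning_rate': 2.5e-4})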
File structure:
saved_model.zip/
    data             JSON file of class-parameters (dictionary)
    parameter_list   JSON file of model parameters and their ordering (list)
    parameters       Bytes from numpy.savez (a zip file of the numpy arrays).
                     Being a zip-archive itself, this object can also be opened
                     as a zip-archive and browsed.
Pros:
• More robust to unserializable objects (one bad object does not break everything).
• Saved file can be inspected/extracted with zip-archive explorers and by other languages.
Cons:
• More complex implementation.
• Still relies partly on cloudpickle for complex objects (e.g. custom functions).
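Because the format is an ordinary zip archive, a saved model can be inspected without Stable Baselines at all, using
only the standard library; a small sketch:

import json
import zipfile

with zipfile.ZipFile("saved_model.zip") as archive:
    print(archive.namelist())  # typically ['data', 'parameter_list', 'parameters']
    data = json.loads(archive.read("data").decode())
    print(list(data.keys()))   # the class parameters, e.g. gamma, n_steps, ...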
After training an agent, you may want to deploy/use it in another language or framework, like PyTorch or tensorflowjs.
Stable Baselines does not include tools to export models to other frameworks, but this document aims to cover parts
that are required for exporting along with more detailed stories from users of Stable Baselines.
1.16.1 Background
In Stable Baselines, the controller is stored inside policies which convert observations into actions. Each learning
algorithm (e.g. DQN, A2C, SAC) contains one or more policies, some of which are only used for training. An easy
way to find the policy is to check the code for the predict function of the agent: This function should only call one
policy with simple arguments.
Policies hold the necessary Tensorflow placeholders and tensors to do the inference (i.e. predict actions), so it is
enough to export these policies to do inference in another framework.
Note: Learning algorithms may also contain other Tensorflow placeholders that are used only for training and are
not required for inference.
Warning: When using CNN policies, the observation is normalized internally (dividing by 255 to have values in
[0, 1])
A known working solution is to use get_parameters function to obtain model parameters, construct the network
manually in PyTorch and assign parameters correctly.
Warning: PyTorch and Tensorflow have internal differences with e.g. 2D convolutions (see discussion linked
below).
Tensorflow, which is the backbone of Stable Baselines, is fundamentally a C/C++ library despite being most commonly
accessed through the Python frontend layer. This design choice means that the models created at Python level should
generally be fully compliant with the respective C++ version of Tensorflow.
Warning: It is advisable not to mix-and-match different versions of the Tensorflow libraries, particularly in terms
of the state. Moving computational graphs is generally more forgiving. As a matter of fact, the PPO_CPP project
mentioned below uses graphs generated with Python Tensorflow 1.x in the C++ Tensorflow 2 version.
Stable Baselines comes in very handy when you want to migrate a computational graph and/or a state (weights), as the
existing algorithms define most of the necessary computations for you, so you don't need to recreate the core of the
algorithms again. This is exactly the idea that has been used in the PPO_CPP project, which executes the training
at the C++ level for the sake of computational efficiency. The graphs are exported from Stable Baselines' PPO2
implementation through the tf.train.export_meta_graph function. Alternatively, and perhaps more commonly,
you could use the C++ layer only for inference. That could be useful as a deployment step for server backends or as an
optimization for more limited devices.
Warning: As a word of caution, C++-level APIs are more imperative than their Python counterparts, or more
plainly speaking: cruder. This is particularly apparent in Tensorflow 2.0, where the declarativeness of Autograph
exists only at the Python level. The C++ counterpart still operates on Session objects, as known from
earlier versions of Tensorflow. In our use case, the availability of the graphs utilized by a Session depends on the use of
tf.function decorators. However, as of November 2019, Stable Baselines still uses Tensorflow 1.x in the main
version, which is slightly easier to use in the context of C++ portability.
You can also manually export required parameters (weights) and construct the network in your desired framework, as
done with the PyTorch example above.
You can access parameters of the model via agents’ get_parameters function. If you use default policies, you
can find the architecture of the networks in source for policies. Otherwise, for DQN/SAC/DDPG or TD3 you need to
check the policies.py file located in their respective folders.
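As a starting point for such a manual export, you can list the variable names and array shapes returned by
get_parameters and then copy them into the target framework; a sketch (the saved file name is a placeholder):

from stable_baselines import PPO2

model = PPO2.load("ppo2_cartpole.zip")
params = model.get_parameters()  # OrderedDict: variable name -> np.ndarray

for name, array in params.items():
    print(name, array.shape)

# these NumPy arrays can then be assigned to a manually built network
# in PyTorch, tensorflowjs, etc.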
• mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
• actions – (np.ndarray) (OPTIONAL) For calculating the likelihood that the given ac-
tions are chosen by the model for each of the given parameters. Must have the same
number of actions and observations. (set to None to return the complete action probability
distribution)
• logp – (bool) (OPTIONAL) When specified with actions, returns probability in log-
space. This has no effect if actions is None.
Returns (np.ndarray) the model’s (log) action probability
get_env()
returns the current environment (can be None if not defined)
Returns (Gym Environment) The current environment
get_parameter_list()
Get tensorflow Variables of model’s parameters
This includes all variables necessary for continuing training (saving / loading).
Returns (list) List of tensorflow Variables
get_parameters()
Get current model parameters as dictionary of variable name -> ndarray.
Returns (OrderedDict) Dictionary of variable name -> ndarray of model’s parameters.
get_vec_normalize_env() → Optional[stable_baselines.common.vec_env.vec_normalize.VecNormalize]
Return the VecNormalize wrapper of the training env if it exists.
Returns Optional[VecNormalize] The VecNormalize env.
learn(total_timesteps, callback=None, log_interval=100, tb_log_name='run', reset_num_timesteps=True)
Return a trained model.
Parameters
• total_timesteps – (int) The total number of samples to train on
• callback – (Union[callable, [callable], BaseCallback]) function called at every steps
with state of the algorithm. It takes the local and global variables. If it returns False,
training is aborted. When the callback inherits from BaseCallback, you will have access
to additional stages of the training (training start/end), please read the documentation for
more details.
• log_interval – (int) The number of timesteps before logging.
• tb_log_name – (str) the name of the run for tensorboard log
• reset_num_timesteps – (bool) whether or not to reset the current timestep number
(used in logging)
Returns (BaseRLModel) the trained model
classmethod load(load_path, env=None, custom_objects=None, **kwargs)
Load the model from file
Parameters
• load_path – (str or file-like) the saved parameter location
• env – (Gym Environment) the new environment to run the loaded model on (can be None
if you only need prediction from a trained model)
Warning: This function does not update trainer/optimizer variables (e.g. momentum). As such
training after using this function may lead to less-than-optimal results.
Parameters
• load_path_or_dict – (str or file-like or dict) Save parameter location or dict of pa-
rameters as variable.name -> ndarrays to be loaded.
• exact_match – (bool) If True, expects load dictionary to contain keys for all variables
in the model. If False, loads parameters only for variables mentioned in the dictionary.
Defaults to True.
save(save_path, cloudpickle=False)
Save the current parameters to file
Parameters
• save_path – (str or file-like) The save location
• cloudpickle – (bool) Use older cloudpickle format instead of zip-archives.
set_env(env)
Checks the validity of the environment, and if it is coherent, set it as the current environment.
Parameters env – (Gym Environment) The environment for learning a policy
set_random_seed(seed: Optional[int]) → None
Parameters seed – (Optional[int]) Seed for the pseudo-random generators. If None, do not
change the seeds.
setup_model()
Create all the functions and tensorflow graphs necessary to train the model
Stable-baselines provides a set of default policies that can be used with most action spaces. To customize the default
policies, you can specify the policy_kwargs parameter to the model class you use. Those kwargs are then passed
to the policy on instantiation (see Custom Policy Network for an example). If you need more control over the policy
architecture, you can also create a custom policy (see Custom Policy Network).
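For instance, a minimal sketch that only changes the size of the hidden layers of the default MlpPolicy (the sizes
are arbitrary):

from stable_baselines import PPO2

# two shared hidden layers of 128 units instead of the default [64, 64]
model = PPO2('MlpPolicy', 'CartPole-v1',
             policy_kwargs=dict(net_arch=[128, 128]), verbose=1)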
Note: CnnPolicies are for images only. MlpPolicies are made for other types of features (e.g. robot joints)
Warning: For all algorithms (except DDPG, TD3 and SAC), continuous actions are clipped during training and
testing (to avoid out of bound error).
Available Policies
Parameters
• obs – ([float] or [int]) The current observation of the environment
• state – ([float]) The last states (used in recurrent policies)
• mask – ([float]) The last masks (used in recurrent policies)
Returns ([float]) the action probability
step(obs, state=None, mask=None, deterministic=False)
Returns the policy for a single step
Parameters
• obs – ([float] or [int]) The current observation of the environment
• state – ([float]) The last states (used in recurrent policies)
• mask – ([float]) The last masks (used in recurrent policies)
• deterministic – (bool) Whether or not to return deterministic actions.
Returns ([float], [float], [float], [float]) actions, values, states, neglogp
value(obs, state=None, mask=None)
Returns the value for a single step
Parameters
• obs – ([float] or [int]) The current observation of the environment
• state – ([float]) The last states (used in recurrent policies)
• mask – ([float]) The last masks (used in recurrent policies)
Returns ([float]) The associated value of the action
class stable_baselines.common.policies.LstmPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, n_lstm=256, reuse=False, layers=None, net_arch=None, act_fun=tf.tanh, cnn_extractor=<function nature_cnn>, layer_norm=False, feature_extraction='cnn', **kwargs)
Policy object that implements actor critic, using LSTMs.
Parameters
• sess – (TensorFlow session) The current TensorFlow session
• ob_space – (Gym Space) The observation space of the environment
• ac_space – (Gym Space) The action space of the environment
• n_env – (int) The number of environments to run
• n_steps – (int) The number of steps to run for each environment
• n_batch – (int) The number of batch to run (n_envs * n_steps)
• n_lstm – (int) The number of LSTM cells (for recurrent policies)
• reuse – (bool) If the policy is reusable or not
• layers – ([int]) The size of the Neural network before the LSTM layer (if None, default
to [64, 64])
• net_arch – (list) Specification of the actor-critic policy network architecture. Notation
similar to the format described in mlp_extractor but with additional support for a ‘lstm’
entry in the shared network part.
• act_fun – (tf.func) the activation function to use in the neural network.
• cnn_extractor – (function (TensorFlow Tensor, **kwargs): (TensorFlow Tensor))
the CNN feature extraction
• layer_norm – (bool) Whether or not to use layer normalizing LSTMs
• feature_extraction – (str) The feature extraction type (“cnn” or “mlp”)
• kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
proba_step(obs, state=None, mask=None)
Returns the action probability for a single step
Parameters
• obs – ([float] or [int]) The current observation of the environment
• state – ([float]) The last states (used in recurrent policies)
• mask – ([float]) The last masks (used in recurrent policies)
Returns ([float]) the action probability
step(obs, state=None, mask=None, deterministic=False)
Returns the policy for a single step
Parameters
• obs – ([float] or [int]) The current observation of the environment
• state – ([float]) The last states (used in recurrent policies)
• mask – ([float]) The last masks (used in recurrent policies)
• deterministic – (bool) Whether or not to return deterministic actions.
Returns ([float], [float], [float], [float]) actions, values, states, neglogp
value(obs, state=None, mask=None)
Cf base class doc.
Parameters
• sess – (TensorFlow session) The current TensorFlow session
• ob_space – (Gym Space) The observation space of the environment
• ac_space – (Gym Space) The action space of the environment
• n_env – (int) The number of environments to run
• n_steps – (int) The number of steps to run for each environment
• n_batch – (int) The number of batch to run (n_envs * n_steps)
• reuse – (bool) If the policy is reusable or not
• _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
class stable_baselines.common.policies.CnnLstmPolicy(sess, ob_space, ac_space,
n_env, n_steps, n_batch,
n_lstm=256, reuse=False,
**_kwargs)
Policy object that implements actor critic, using LSTMs with a CNN feature extraction
Parameters
• sess – (TensorFlow session) The current TensorFlow session
• ob_space – (Gym Space) The observation space of the environment
• ac_space – (Gym Space) The action space of the environment
• n_env – (int) The number of environments to run
• n_steps – (int) The number of steps to run for each environment
• n_batch – (int) The number of batch to run (n_envs * n_steps)
• n_lstm – (int) The number of LSTM cells (for recurrent policies)
• reuse – (bool) If the policy is reusable or not
• kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
class stable_baselines.common.policies.CnnLnLstmPolicy(sess, ob_space, ac_space,
n_env, n_steps, n_batch,
n_lstm=256, reuse=False,
**_kwargs)
Policy object that implements actor critic, using a layer normalized LSTMs with a CNN feature extraction
Parameters
• sess – (TensorFlow session) The current TensorFlow session
• ob_space – (Gym Space) The observation space of the environment
• ac_space – (Gym Space) The action space of the environment
• n_env – (int) The number of environments to run
• n_steps – (int) The number of steps to run for each environment
• n_batch – (int) The number of batch to run (n_envs * n_steps)
• n_lstm – (int) The number of LSTM cells (for recurrent policies)
• reuse – (bool) If the policy is reusable or not
• kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
1.19 A2C
A synchronous, deterministic variant of Asynchronous Advantage Actor Critic (A3C). It uses multiple workers to
avoid the use of a replay buffer.
1.19.1 Notes
• Recurrent policies: X
• Multi processing: X
• Gym spaces:
1.19.3 Example
import gym
from stable_baselines.common import make_vec_env
from stable_baselines import A2C

# Parallel environments
env = make_vec_env('CartPole-v1', n_envs=4)

model = A2C('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=25000)
model.save("a2c_cartpole")

model = A2C.load("a2c_cartpole")
obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
1.19.4 Parameters
• n_cpu_tf_sess – (int) The number of threads for TensorFlow operations If None, the
number of cpu of the current machine will be used.
action_probability(observation, state=None, mask=None, actions=None, logp=False)
If actions is None, then get the model’s action probability distribution from a given observation.
Depending on the action space the output is:
• Discrete: probability for each possible action
• Box: mean and standard deviation of the action output
However, if actions is not None, this function will return the probability that the given actions are taken
with the given parameters (observation, state, ...) on this model. For discrete action spaces, it returns
the probability mass; for continuous action spaces, the probability density. This is because the probability
mass will always be zero in continuous spaces; see http://blog.christianperone.com/2019/01/ for a good
explanation.
Parameters
• observation – (np.ndarray) the input observation
• state – (np.ndarray) The last states (can be None, used in recurrent policies)
• mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
• actions – (np.ndarray) (OPTIONAL) For calculating the likelihood that the given ac-
tions are chosen by the model for each of the given parameters. Must have the same
number of actions and observations. (set to None to return the complete action probability
distribution)
• logp – (bool) (OPTIONAL) When specified with actions, returns probability in log-
space. This has no effect if actions is None.
Returns (np.ndarray) the model’s (log) action probability
get_env()
returns the current environment (can be None if not defined)
Returns (Gym Environment) The current environment
get_parameter_list()
Get tensorflow Variables of model’s parameters
This includes all variables necessary for continuing training (saving / loading).
Returns (list) List of tensorflow Variables
get_parameters()
Get current model parameters as dictionary of variable name -> ndarray.
Returns (OrderedDict) Dictionary of variable name -> ndarray of model’s parameters.
get_vec_normalize_env() → Optional[stable_baselines.common.vec_env.vec_normalize.VecNormalize]
Return the VecNormalize wrapper of the training env if it exists.
Returns Optional[VecNormalize] The VecNormalize env.
learn(total_timesteps, callback=None, log_interval=100, tb_log_name='A2C', reset_num_timesteps=True)
Return a trained model.
Parameters
• total_timesteps – (int) The total number of samples to train on
Warning: This function does not update trainer/optimizer variables (e.g. momentum). As such
training after using this function may lead to less-than-optimal results.
Parameters
• load_path_or_dict – (str or file-like or dict) Save parameter location or dict of pa-
rameters as variable.name -> ndarrays to be loaded.
• exact_match – (bool) If True, expects load dictionary to contain keys for all variables
in the model. If False, loads parameters only for variables mentioned in the dictionary.
Defaults to True.
• mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
• deterministic – (bool) Whether or not to return deterministic actions.
Returns (np.ndarray, np.ndarray) the model’s action and the next state (used in recurrent poli-
cies)
pretrain(dataset, n_epochs=10, learning_rate=0.0001, adam_epsilon=1e-08, val_interval=None)
Pretrain a model using behavior cloning: supervised learning given an expert dataset.
NOTE: only Box and Discrete spaces are supported for now.
Parameters
• dataset – (ExpertDataset) Dataset manager
• n_epochs – (int) Number of iterations on the training set
• learning_rate – (float) Learning rate
• adam_epsilon – (float) the epsilon value for the adam optimizer
• val_interval – (int) Report training and validation losses every n epochs. By default,
every 10th of the maximum number of epochs.
Returns (BaseRLModel) the pretrained model
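For instance, pretraining from a previously recorded expert dataset could look like the following sketch (the expert
file name and the numbers of epochs/timesteps are placeholders):

from stable_baselines import A2C
from stable_baselines.gail import ExpertDataset

# expert data recorded beforehand, e.g. with generate_expert_traj
dataset = ExpertDataset(expert_path='expert_cartpole.npz', traj_limitation=1, batch_size=128)

model = A2C('MlpPolicy', 'CartPole-v1', verbose=1)
model.pretrain(dataset, n_epochs=1000)

# continue with regular RL training after the behavior cloning step
model.learn(total_timesteps=10000)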
save(save_path, cloudpickle=False)
Save the current parameters to file
Parameters
• save_path – (str or file-like) The save location
• cloudpickle – (bool) Use older cloudpickle format instead of zip-archives.
set_env(env)
Checks the validity of the environment, and if it is coherent, set it as the current environment.
Parameters env – (Gym Environment) The environment for learning a policy
set_random_seed(seed: Optional[int]) → None
Parameters seed – (Optional[int]) Seed for the pseudo-random generators. If None, do not
change the seeds.
setup_model()
Create all the functions and tensorflow graphs necessary to train the model
Depending on initialization parameters and timestep, different variables are accessible. Variables accessible “From
timestep X” are variables that can be accessed when self.timestep==X in the on_step function.
Variable Availability
From timestep 1
• self
• total_timesteps
• callback
• log_interval
• tb_log_name
• reset_num_timesteps
• new_tb_log
• writer
• t_start
• mb_obs
• mb_rewards
• mb_actions
• mb_values
• mb_dones
• mb_states
• ep_infos
• actions
• values
• states
• clipped_actions
• obs
• rewards
• dones
• infos
From timestep 2
• info
• maybe_ep_info
1.20 ACER
Sample Efficient Actor-Critic with Experience Replay (ACER) combines several ideas of previous al-
gorithms: it uses multiple workers (as A2C), implements a replay buffer (as in DQN), uses Retrace for
Q-value estimation, importance sampling and a trust region.
1.20.1 Notes
• Recurrent policies: X
• Multi processing: X
• Gym spaces:
1.20.3 Example
import gym
from stable_baselines.common import make_vec_env
from stable_baselines import ACER

# multiprocess environment
env = make_vec_env('CartPole-v1', n_envs=4)

model = ACER('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=25000)
model.save("acer_cartpole")

model = ACER.load("acer_cartpole")
obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
1.20.4 Parameters
• _init_setup_model – (bool) Whether or not to build the network at the creation of the
instance
• policy_kwargs – (dict) additional arguments to be passed to the policy on creation
• full_tensorboard_log – (bool) enable additional logging when using tensorboard
WARNING: this logging can take a lot of space quickly
• seed – (int) Seed for the pseudo-random generators (python, numpy, tensorflow). If None
(default), use random seed. Note that if you want completely deterministic results, you must
set n_cpu_tf_sess to 1.
• n_cpu_tf_sess – (int) The number of threads for TensorFlow operations If None, the
number of cpu of the current machine will be used.
action_probability(observation, state=None, mask=None, actions=None, logp=False)
If actions is None, then get the model’s action probability distribution from a given observation.
Depending on the action space the output is:
• Discrete: probability for each possible action
• Box: mean and standard deviation of the action output
However, if actions is not None, this function will return the probability that the given actions are taken
with the given parameters (observation, state, ...) on this model. For discrete action spaces, it returns
the probability mass; for continuous action spaces, the probability density. This is because the probability
mass will always be zero in continuous spaces; see http://blog.christianperone.com/2019/01/ for a good
explanation.
Parameters
• observation – (np.ndarray) the input observation
• state – (np.ndarray) The last states (can be None, used in recurrent policies)
• mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
• actions – (np.ndarray) (OPTIONAL) For calculating the likelihood that the given ac-
tions are chosen by the model for each of the given parameters. Must have the same
number of actions and observations. (set to None to return the complete action probability
distribution)
• logp – (bool) (OPTIONAL) When specified with actions, returns probability in log-
space. This has no effect if actions is None.
Returns (np.ndarray) the model’s (log) action probability
get_env()
returns the current environment (can be None if not defined)
Returns (Gym Environment) The current environment
get_parameter_list()
Get tensorflow Variables of model’s parameters
This includes all variables necessary for continuing training (saving / loading).
Returns (list) List of tensorflow Variables
get_parameters()
Get current model parameters as dictionary of variable name -> ndarray.
Returns (OrderedDict) Dictionary of variable name -> ndarray of model’s parameters.
get_vec_normalize_env() → Optional[stable_baselines.common.vec_env.vec_normalize.VecNormalize]
Return the VecNormalize wrapper of the training env if it exists.
Returns Optional[VecNormalize] The VecNormalize env.
learn(total_timesteps, callback=None, log_interval=100, tb_log_name='ACER', reset_num_timesteps=True)
Return a trained model.
Parameters
• total_timesteps – (int) The total number of samples to train on
• callback – (Union[callable, [callable], BaseCallback]) function called at every steps
with state of the algorithm. It takes the local and global variables. If it returns False,
training is aborted. When the callback inherits from BaseCallback, you will have access
to additional stages of the training (training start/end), please read the documentation for
more details.
• log_interval – (int) The number of timesteps before logging.
• tb_log_name – (str) the name of the run for tensorboard log
• reset_num_timesteps – (bool) whether or not to reset the current timestep number
(used in logging)
Returns (BaseRLModel) the trained model
classmethod load(load_path, env=None, custom_objects=None, **kwargs)
Load the model from file
Parameters
• load_path – (str or file-like) the saved parameter location
• env – (Gym Environment) the new environment to run the loaded model on (can be None
if you only need prediction from a trained model)
• custom_objects – (dict) Dictionary of objects to replace upon loading. If a variable
is present in this dictionary as a key, it will not be deserialized and the corresponding item
will be used instead. Similar to custom_objects in keras.models.load_model. Useful when
you have an object in file that can not be deserialized.
• kwargs – extra arguments to change the model when loading
load_parameters(load_path_or_dict, exact_match=True)
Load model parameters from a file or a dictionary
Dictionary keys should be tensorflow variable names, which can be obtained with the get_parameters
function. If exact_match is True, the dictionary should contain keys for all the model's parameters, otherwise
a RuntimeError is raised. If False, only variables included in the dictionary will be updated.
This does not load agent’s hyper-parameters.
Warning: This function does not update trainer/optimizer variables (e.g. momentum). As such
training after using this function may lead to less-than-optimal results.
Parameters
• load_path_or_dict – (str or file-like or dict) Save parameter location or dict of pa-
rameters as variable.name -> ndarrays to be loaded.
• exact_match – (bool) If True, expects load dictionary to contain keys for all variables
in the model. If False, loads parameters only for variables mentioned in the dictionary.
Defaults to True.
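A round trip through get_parameters and load_parameters can, for instance, be used to modify weights outside
of training; a sketch:

from stable_baselines import ACER

model = ACER('MlpPolicy', 'CartPole-v1')

params = model.get_parameters()  # OrderedDict: variable name -> np.ndarray
rescaled = {name: 0.9 * array for name, array in params.items()}

# load back only the variables present in the dictionary
model.load_parameters(rescaled, exact_match=False)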
Depending on initialization parameters and timestep, different variables are accessible. Variables accessible from
“timestep X” are variables that can be accessed when self.timestep==X from the on_step function.
Variable Availability
From timestep 1
• self
• total_timesteps
• callback
• log_interval
• tb_log_name
• reset_num_timesteps
• new_tb_log
• writer
• episode_stats
• buffer
• t_start
• enc_obs
• mb_obs
• mb_actions
• mb_mus
• mb_dones
• mb_rewards
• actions
• states
• mus
• clipped_actions
• obs
• rewards
• dones
1.21 ACKTR
Actor Critic using Kronecker-Factored Trust Region (ACKTR) uses Kronecker-factored approximate curvature (K-
FAC) for trust region optimization.
1.21.1 Notes
• Recurrent policies: X
• Multi processing: X
• Gym spaces:
1.21.3 Example
import gym
from stable_baselines.common import make_vec_env
from stable_baselines import ACKTR

# multiprocess environment
env = make_vec_env('CartPole-v1', n_envs=4)

model = ACKTR('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=25000)
model.save("acktr_cartpole")

model = ACKTR.load("acktr_cartpole")
obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
1.21.4 Parameters
• n_cpu_tf_sess – (int) The number of threads for TensorFlow operations If None, the
number of cpu of the current machine will be used.
action_probability(observation, state=None, mask=None, actions=None, logp=False)
If actions is None, then get the model’s action probability distribution from a given observation.
Depending on the action space the output is:
• Discrete: probability for each possible action
• Box: mean and standard deviation of the action output
However, if actions is not None, this function will return the probability that the given actions are taken
with the given parameters (observation, state, ...) on this model. For discrete action spaces, it returns
the probability mass; for continuous action spaces, the probability density. This is because the probability
mass will always be zero in continuous spaces; see http://blog.christianperone.com/2019/01/ for a good
explanation.
Parameters
• observation – (np.ndarray) the input observation
• state – (np.ndarray) The last states (can be None, used in recurrent policies)
• mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
• actions – (np.ndarray) (OPTIONAL) For calculating the likelihood that the given ac-
tions are chosen by the model for each of the given parameters. Must have the same
number of actions and observations. (set to None to return the complete action probability
distribution)
• logp – (bool) (OPTIONAL) When specified with actions, returns probability in log-
space. This has no effect if actions is None.
Returns (np.ndarray) the model’s (log) action probability
get_env()
returns the current environment (can be None if not defined)
Returns (Gym Environment) The current environment
get_parameter_list()
Get tensorflow Variables of model’s parameters
This includes all variables necessary for continuing training (saving / loading).
Returns (list) List of tensorflow Variables
get_parameters()
Get current model parameters as dictionary of variable name -> ndarray.
Returns (OrderedDict) Dictionary of variable name -> ndarray of model’s parameters.
get_vec_normalize_env() → Optional[stable_baselines.common.vec_env.vec_normalize.VecNormalize]
Return the VecNormalize wrapper of the training env if it exists.
Returns Optional[VecNormalize] The VecNormalize env.
learn(total_timesteps, callback=None, log_interval=100, tb_log_name='ACKTR', reset_num_timesteps=True)
Return a trained model.
Parameters
• total_timesteps – (int) The total number of samples to train on
Warning: This function does not update trainer/optimizer variables (e.g. momentum). As such
training after using this function may lead to less-than-optimal results.
Parameters
• load_path_or_dict – (str or file-like or dict) Save parameter location or dict of pa-
rameters as variable.name -> ndarrays to be loaded.
• exact_match – (bool) If True, expects load dictionary to contain keys for all variables
in the model. If False, loads parameters only for variables mentioned in the dictionary.
Defaults to True.
• mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
• deterministic – (bool) Whether or not to return deterministic actions.
Returns (np.ndarray, np.ndarray) the model’s action and the next state (used in recurrent poli-
cies)
pretrain(dataset, n_epochs=10, learning_rate=0.0001, adam_epsilon=1e-08, val_interval=None)
Pretrain a model using behavior cloning: supervised learning given an expert dataset.
NOTE: only Box and Discrete spaces are supported for now.
Parameters
• dataset – (ExpertDataset) Dataset manager
• n_epochs – (int) Number of iterations on the training set
• learning_rate – (float) Learning rate
• adam_epsilon – (float) the epsilon value for the adam optimizer
• val_interval – (int) Report training and validation losses every n epochs. By default,
every 10th of the maximum number of epochs.
Returns (BaseRLModel) the pretrained model
save(save_path, cloudpickle=False)
Save the current parameters to file
Parameters
• save_path – (str or file-like) The save location
• cloudpickle – (bool) Use older cloudpickle format instead of zip-archives.
set_env(env)
Checks the validity of the environment, and if it is coherent, set it as the current environment.
Parameters env – (Gym Environment) The environment for learning a policy
set_random_seed(seed: Optional[int]) → None
Parameters seed – (Optional[int]) Seed for the pseudo-random generators. If None, do not
change the seeds.
setup_model()
Create all the functions and tensorflow graphs necessary to train the model
Depending on initialization parameters and timestep, different variables are accessible. Variables accessible from
“timestep X” are variables that can be accessed when self.timestep==X from the on_step function.
Variable Availability
From timestep 1
• self
• total_timesteps
• callback
• log_interval
• tb_log_name
• reset_num_timesteps
• new_tb_log
• writer
• tf_vars
• is_uninitialized
• new_uninitialized_vars
• t_start
• coord
• enqueue_threads
• old_uninitialized_vars
• mb_obs
• mb_rewards
• mb_actions
• mb_values
• mb_dones
• mb_states
• ep_infos
• _
• actions
• values
• states
• clipped_actions
• obs
• rewards
• dones
• infos
From timestep 2
• info
• maybe_ep_info
1.22 DDPG
Note: DDPG requires OpenMPI. If OpenMPI isn’t enabled, then DDPG isn’t imported into the
stable_baselines module.
Warning: The DDPG model does not support stable_baselines.common.policies because it uses
q-value instead of value estimation, as a result it must use its own policy models (see DDPG Policies).
Available Policies
1.22.1 Notes
• Recurrent policies:
• Multi processing: X (using MPI)
• Gym spaces:
1.22.3 Example
import gym
import numpy as np

from stable_baselines import DDPG
from stable_baselines.ddpg import OrnsteinUhlenbeckActionNoise

env = gym.make('MountainCarContinuous-v0')

# the noise object for exploration
n_actions = env.action_space.shape[-1]
action_noise = OrnsteinUhlenbeckActionNoise(mean=np.zeros(n_actions), sigma=0.5 * np.ones(n_actions))

model = DDPG('MlpPolicy', env, verbose=1, action_noise=action_noise)
model.learn(total_timesteps=400000)
model.save("ddpg_mountain")

model = DDPG.load("ddpg_mountain")
obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
1.22.4 Parameters
• env – (Gym environment or str) The environment to learn from (if registered in Gym, can
be str)
• gamma – (float) the discount factor
• memory_policy – (ReplayBuffer) the replay buffer (if None, default to baselines.deepq.replay_buffer.ReplayBuffer)
Deprecated since version 2.6.0: This parameter will be removed in a future version
• eval_env – (Gym Environment) the evaluation environment (can be None)
• nb_train_steps – (int) the number of training steps
• nb_rollout_steps – (int) the number of rollout steps
• nb_eval_steps – (int) the number of evaluation steps
• param_noise – (AdaptiveParamNoiseSpec) the parameter noise type (can be None)
• action_noise – (ActionNoise) the action noise type (can be None)
• param_noise_adaption_interval – (int) apply param noise every N steps
• tau – (float) the soft update coefficient (keep old values, between 0 and 1)
• normalize_returns – (bool) should the critic output be normalized
• enable_popart – (bool) enable pop-art normalization of the critic output (https://arxiv.org/pdf/1602.07714.pdf), normalize_returns must be set to True.
• normalize_observations – (bool) should the observation be normalized
• batch_size – (int) the size of the batch for learning the policy
• observation_range – (tuple) the bounding values for the observation
• return_range – (tuple) the bounding values for the critic output
• critic_l2_reg – (float) l2 regularizer coefficient
• actor_lr – (float) the actor learning rate
• critic_lr – (float) the critic learning rate
• clip_norm – (float) clip the gradients (disabled if None)
• reward_scale – (float) the value the reward should be scaled by
• render – (bool) enable rendering of the environment
• render_eval – (bool) enable rendering of the evaluation environment
• memory_limit – (int) the max number of transitions to store, size of the replay buffer
Deprecated since version 2.6.0: Use buffer_size instead.
• buffer_size – (int) the max number of transitions to store, size of the replay buffer
• random_exploration – (float) Probability of taking a random action (as in an epsilon-
greedy strategy) This is not needed for DDPG normally but can help exploring when using
HER + DDPG. This hack was present in the original OpenAI Baselines repo (DDPG + HER)
• verbose – (int) the verbosity level: 0 none, 1 training information, 2 tensorflow debug
• tensorboard_log – (str) the log location for tensorboard (if None, no logging)
• _init_setup_model – (bool) Whether or not to build the network at the creation of the
instance
is_using_her() → bool
Check if is using HER
Returns (bool) Whether is using HER or not
learn(total_timesteps, callback=None, log_interval=100, tb_log_name='DDPG', reset_num_timesteps=True, replay_wrapper=None)
Return a trained model.
Parameters
• total_timesteps – (int) The total number of samples to train on
• callback – (Union[callable, [callable], BaseCallback]) function called at every steps
with state of the algorithm. It takes the local and global variables. If it returns False,
training is aborted. When the callback inherits from BaseCallback, you will have access
to additional stages of the training (training start/end), please read the documentation for
more details.
• log_interval – (int) The number of timesteps before logging.
• tb_log_name – (str) the name of the run for tensorboard log
• reset_num_timesteps – (bool) whether or not to reset the current timestep number
(used in logging)
Returns (BaseRLModel) the trained model
classmethod load(load_path, env=None, custom_objects=None, **kwargs)
Load the model from file
Parameters
• load_path – (str or file-like) the saved parameter location
• env – (Gym Environment) the new environment to run the loaded model on (can be None
if you only need prediction from a trained model)
• custom_objects – (dict) Dictionary of objects to replace upon loading. If a variable
is present in this dictionary as a key, it will not be deserialized and the corresponding item
will be used instead. Similar to custom_objects in keras.models.load_model. Useful when
you have an object in file that can not be deserialized.
• kwargs – extra arguments to change the model when loading
load_parameters(load_path_or_dict, exact_match=True)
Load model parameters from a file or a dictionary
Dictionary keys should be tensorflow variable names, which can be obtained with the get_parameters
function. If exact_match is True, the dictionary should contain keys for all the model's parameters, otherwise
a RuntimeError is raised. If False, only variables included in the dictionary will be updated.
This does not load agent’s hyper-parameters.
Warning: This function does not update trainer/optimizer variables (e.g. momentum). As such
training after using this function may lead to less-than-optimal results.
Parameters
• load_path_or_dict – (str or file-like or dict) Save parameter location or dict of pa-
rameters as variable.name -> ndarrays to be loaded.
• exact_match – (bool) If True, expects load dictionary to contain keys for all variables
in the model. If False, loads parameters only for variables mentioned in the dictionary.
Defaults to True.
• action – (TensorFlow Tensor) The action placeholder (can be None for default place-
holder)
• reuse – (bool) whether or not to reuse parameters
• scope – (str) the scope name of the critic
Returns (TensorFlow Tensor) the output tensor
obs_ph
tf.Tensor: placeholder for observations, shape (self.n_batch, ) + self.ob_space.shape.
proba_step(obs, state=None, mask=None)
Returns the action probability for a single step
Parameters
• obs – ([float] or [int]) The current observation of the environment
• state – ([float]) The last states (used in recurrent policies)
• mask – ([float]) The last masks (used in recurrent policies)
Returns ([float]) the action probability
processed_obs
tf.Tensor: processed observations, shape (self.n_batch, ) + self.ob_space.shape.
The form of processing depends on the type of the observation space and on whether scale is passed to the
constructor; see observation_input for more information.
step(obs, state=None, mask=None)
Returns the policy for a single step
Parameters
• obs – ([float] or [int]) The current observation of the environment
• state – ([float]) The last states (used in recurrent policies)
• mask – ([float]) The last masks (used in recurrent policies)
Returns ([float]) actions
value(obs, action, state=None, mask=None)
Returns the value for a single step
Parameters
• obs – ([float] or [int]) The current observation of the environment
• action – ([float] or [int]) The taken action
• state – ([float]) The last states (used in recurrent policies)
• mask – ([float]) The last masks (used in recurrent policies)
Returns ([float]) The associated value of the action
class stable_baselines.ddpg.LnMlpPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch,
reuse=False, **_kwargs)
Policy object that implements actor critic, using a MLP (2 layers of 64), with layer normalisation
Parameters
• sess – (TensorFlow session) The current TensorFlow session
• ob_space – (Gym Space) The observation space of the environment
processed_obs
tf.Tensor: processed observations, shape (self.n_batch, ) + self.ob_space.shape.
The form of processing depends on the type of the observation space and on whether scale is passed to the
constructor; see observation_input for more information.
step(obs, state=None, mask=None)
Returns the policy for a single step
Parameters
• obs – ([float] or [int]) The current observation of the environment
• state – ([float]) The last states (used in recurrent policies)
• mask – ([float]) The last masks (used in recurrent policies)
Returns ([float]) actions
value(obs, action, state=None, mask=None)
Returns the value for a single step
Parameters
• obs – ([float] or [int]) The current observation of the environment
• action – ([float] or [int]) The taken action
• state – ([float]) The last states (used in recurrent policies)
• mask – ([float]) The last masks (used in recurrent policies)
Returns ([float]) The associated value of the action
class stable_baselines.ddpg.CnnPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch,
reuse=False, **_kwargs)
Policy object that implements actor critic, using a CNN (the nature CNN)
Parameters
• sess – (TensorFlow session) The current TensorFlow session
• ob_space – (Gym Space) The observation space of the environment
• ac_space – (Gym Space) The action space of the environment
• n_env – (int) The number of environments to run
• n_steps – (int) The number of steps to run for each environment
• n_batch – (int) The number of batch to run (n_envs * n_steps)
• reuse – (bool) If the policy is reusable or not
• _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
action_ph
tf.Tensor: placeholder for actions, shape (self.n_batch, ) + self.ac_space.shape.
initial_state
The initial state of the policy. For feedforward policies, None. For a recurrent policy, a NumPy array of
shape (self.n_env, ) + state_shape.
is_discrete
bool: is action space discrete.
make_actor(obs=None, reuse=False, scope=’pi’)
creates an actor object
Parameters
• obs – (TensorFlow Tensor) The observation placeholder (can be None for default place-
holder)
• reuse – (bool) whether or not to reuse parameters
• scope – (str) the scope name of the actor
Returns (TensorFlow Tensor) the output tensor
make_critic(obs=None, action=None, reuse=False, scope=’qf’)
creates a critic object
Parameters
• obs – (TensorFlow Tensor) The observation placeholder (can be None for default place-
holder)
• action – (TensorFlow Tensor) The action placeholder (can be None for default place-
holder)
• reuse – (bool) whether or not to reuse parameters
• scope – (str) the scope name of the critic
Returns (TensorFlow Tensor) the output tensor
obs_ph
tf.Tensor: placeholder for observations, shape (self.n_batch, ) + self.ob_space.shape.
proba_step(obs, state=None, mask=None)
Returns the action probability for a single step
Parameters
• obs – ([float] or [int]) The current observation of the environment
• state – ([float]) The last states (used in recurrent policies)
• mask – ([float]) The last masks (used in recurrent policies)
Returns ([float]) the action probability
processed_obs
tf.Tensor: processed observations, shape (self.n_batch, ) + self.ob_space.shape.
The form of processing depends on the type of the observation space and on whether scale is passed to the
constructor; see observation_input for more information.
step(obs, state=None, mask=None)
Returns the policy for a single step
Parameters
• obs – ([float] or [int]) The current observation of the environment
• state – ([float]) The last states (used in recurrent policies)
• mask – ([float]) The last masks (used in recurrent policies)
Returns ([float]) actions
value(obs, action, state=None, mask=None)
Returns the value for a single step
Parameters
• desired_action_stddev – (float) the desired value for the standard deviation of the
noise
• adoption_coefficient – (float) the update coefficient for the standard deviation of
the noise
adapt(distance)
update the standard deviation for the parameter noise
Parameters distance – (float) the noise distance applied to the parameters
get_stats()
return the standard deviation for the parameter noise
Returns (dict) the stats of the noise
class stable_baselines.ddpg.NormalActionNoise(mean, sigma)
A Gaussian action noise
Parameters
• mean – (float) the mean value of the noise
• sigma – (float) the scale of the noise (std here)
reset() → None
call end of episode reset for the noise
class stable_baselines.ddpg.OrnsteinUhlenbeckActionNoise(mean, sigma,
theta=0.15, dt=0.01,
initial_noise=None)
An Ornstein Uhlenbeck action noise, designed to approximate Brownian motion with friction.
Based on http://math.stackexchange.com/questions/1287634/implementing-ornstein-uhlenbeck-in-matlab
Parameters
• mean – (float) the mean of the noise
• sigma – (float) the scale of the noise
• theta – (float) the rate of mean reversion
• dt – (float) the timestep for the noise
• initial_noise – ([float]) the initial value for the noise output, (if None: 0)
reset() → None
reset the Ornstein Uhlenbeck noise, to the initial position
Similarly to the example given in the examples page, you can easily define a custom architecture for the policy
network:
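A minimal sketch (the layer sizes below are illustrative, not the library defaults):

from stable_baselines.ddpg.policies import FeedForwardPolicy
from stable_baselines import DDPG

# Custom MLP policy with two hidden layers of 16 units each
class CustomDDPGPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomDDPGPolicy, self).__init__(*args, **kwargs,
                                               layers=[16, 16],
                                               layer_norm=False,
                                               feature_extraction="mlp")

model = DDPG(CustomDDPGPolicy, 'Pendulum-v0', verbose=1)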
Depending on initialization parameters and timestep, different variables are accessible. Variables accessible from
“timestep X” are variables that can be accessed when self.timestep==X from the on_step function.
Variable Availability
From timestep 1
• self
• total_timesteps
• callback
• log_interval
• tb_log_name
• reset_num_timesteps
• replay_wrapper
• new_tb_log
• writer
• rank
• eval_episode_rewards_history
• episode_rewards_history
• episode_successes
• obs
• eval_obs
• episode_reward
• episode_step
• episodes
• step
• total_steps
• start_time
• epoch_episode_rewards
• epoch_episode_steps
• epoch_actor_losses
• epoch_critic_losses
• epoch_adaptive_distances
• eval_episode_rewards
• eval_qs
• epoch_actions
• epoch_qs
• epoch_episodes
• epoch
• action
• q_value
• unscaled_action
• new_obs
• reward
• done
• info
From timestep 2
• obs_
• new_obs_
• reward_
After nb_rollout_steps+1
• t_train
After nb_rollout_steps*ceil(nb_rollout_steps/batch_size)
• distance
• critic_loss
• actor_loss
1.23 DQN
Deep Q Network (DQN) and its extensions (Double-DQN, Dueling-DQN, Prioritized Experience Replay).
Warning: The DQN model does not support stable_baselines.common.policies, as a result it must
use its own policy models (see DQN Policies).
Available Policies
1.23.1 Notes
Note: By default, the DQN class has double q learning and dueling extensions enabled. See Issue #406 for disabling
dueling. To disable double-q learning, you can change the default value in the constructor.
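For instance, a minimal sketch disabling both extensions (assuming the double_q constructor argument and the
dueling policy keyword of this version):

from stable_baselines import DQN

model = DQN('MlpPolicy', 'CartPole-v1',
            double_q=False,                      # disable double Q-learning
            policy_kwargs=dict(dueling=False),   # disable the dueling architecture
            verbose=1)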
• Recurrent policies:
• Multi processing:
• Gym spaces:
1.23.3 Example
import gym
from stable_baselines import DQN

env = gym.make('CartPole-v1')

model = DQN('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=25000)
model.save("deepq_cartpole")

model = DQN.load("deepq_cartpole")
obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
With Atari:
from stable_baselines.common.atari_wrappers import make_atari
from stable_baselines import DQN

env = make_atari('BreakoutNoFrameskip-v4')

model = DQN('CnnPolicy', env, verbose=1)
model.learn(total_timesteps=25000)
model.save("deepq_breakout")

model = DQN.load("deepq_breakout")
obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
1.23.4 Parameters
• param_noise – (bool) Whether or not to apply noise to the parameters of the policy.
• verbose – (int) the verbosity level: 0 none, 1 training information, 2 tensorflow debug
• tensorboard_log – (str) the log location for tensorboard (if None, no logging)
• _init_setup_model – (bool) Whether or not to build the network at the creation of the
instance
• full_tensorboard_log – (bool) enable additional logging when using tensorboard
WARNING: this logging can take a lot of space quickly
• seed – (int) Seed for the pseudo-random generators (python, numpy, tensorflow). If None
(default), use random seed. Note that if you want completely deterministic results, you must
set n_cpu_tf_sess to 1.
• n_cpu_tf_sess – (int) The number of threads for TensorFlow operations If None, the
number of cpu of the current machine will be used.
action_probability(observation, state=None, mask=None, actions=None, logp=False)
If actions is None, then get the model’s action probability distribution from a given observation.
Depending on the action space the output is:
• Discrete: probability for each possible action
• Box: mean and standard deviation of the action output
However, if actions is not None, this function will return the probability that the given actions are taken
with the given parameters (observation, state, ...) on this model. For discrete action spaces, it returns
the probability mass; for continuous action spaces, the probability density. This is because the probability
mass will always be zero in continuous spaces; see http://blog.christianperone.com/2019/01/ for a good
explanation.
Parameters
• observation – (np.ndarray) the input observation
• state – (np.ndarray) The last states (can be None, used in recurrent policies)
• mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
• actions – (np.ndarray) (OPTIONAL) For calculating the likelihood that the given ac-
tions are chosen by the model for each of the given parameters. Must have the same
number of actions and observations. (set to None to return the complete action probability
distribution)
• logp – (bool) (OPTIONAL) When specified with actions, returns probability in log-
space. This has no effect if actions is None.
Returns (np.ndarray) the model’s (log) action probability
get_env()
returns the current environment (can be None if not defined)
Returns (Gym Environment) The current environment
get_parameter_list()
Get tensorflow Variables of model’s parameters
This includes all variables necessary for continuing training (saving / loading).
Returns (list) List of tensorflow Variables
get_parameters()
Get current model parameters as dictionary of variable name -> ndarray.
Warning: This function does not update trainer/optimizer variables (e.g. momentum). As such
training after using this function may lead to less-than-optimal results.
Parameters
• load_path_or_dict – (str or file-like or dict) Save parameter location or dict of pa-
rameters as variable.name -> ndarrays to be loaded.
• exact_match – (bool) If True, expects load dictionary to contain keys for all variables
in the model. If False, loads parameters only for variables mentioned in the dictionary.
Defaults to True.
is_discrete
bool: is action space discrete.
obs_ph
tf.Tensor: placeholder for observations, shape (self.n_batch, ) + self.ob_space.shape.
proba_step(obs, state=None, mask=None)
Returns the action probability for a single step
Parameters
• obs – (np.ndarray float or int) The current observation of the environment
• state – (np.ndarray float) The last states (used in recurrent policies)
• mask – (np.ndarray float) The last masks (used in recurrent policies)
Returns (np.ndarray float) the action probability
processed_obs
tf.Tensor: processed observations, shape (self.n_batch, ) + self.ob_space.shape.
The form of processing depends on the type of the observation space and on whether scale is passed to the
constructor; see observation_input for more information.
step(obs, state=None, mask=None, deterministic=True)
Returns the q_values for a single step
Parameters
• obs – (np.ndarray float or int) The current observation of the environment
• state – (np.ndarray float) The last states (used in recurrent policies)
• mask – (np.ndarray float) The last masks (used in recurrent policies)
• deterministic – (bool) Whether or not to return deterministic actions.
Returns (np.ndarray int, np.ndarray float, np.ndarray float) actions, q_values, states
class stable_baselines.deepq.LnCnnPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, obs_phs=None, dueling=True, **_kwargs)
Policy object that implements DQN policy, using a CNN (the nature CNN), with layer normalisation
Parameters
• sess – (TensorFlow session) The current TensorFlow session
• ob_space – (Gym Space) The observation space of the environment
• ac_space – (Gym Space) The action space of the environment
• n_env – (int) The number of environments to run
• n_steps – (int) The number of steps to run for each environment
• n_batch – (int) The number of batch to run (n_envs * n_steps)
• reuse – (bool) If the policy is reusable or not
• obs_phs – (TensorFlow Tensor, TensorFlow Tensor) a tuple containing an override for
observation placeholder and the processed observation placeholder respectively
• dueling – (bool) if true double the output MLP to compute a baseline for action scores
• _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
action_ph
tf.Tensor: placeholder for actions, shape (self.n_batch, ) + self.ac_space.shape.
initial_state
The initial state of the policy. For feedforward policies, None. For a recurrent policy, a NumPy array of
shape (self.n_env, ) + state_shape.
is_discrete
bool: is action space discrete.
obs_ph
tf.Tensor: placeholder for observations, shape (self.n_batch, ) + self.ob_space.shape.
proba_step(obs, state=None, mask=None)
Returns the action probability for a single step
Parameters
• obs – (np.ndarray float or int) The current observation of the environment
• state – (np.ndarray float) The last states (used in recurrent policies)
• mask – (np.ndarray float) The last masks (used in recurrent policies)
Returns (np.ndarray float) the action probability
processed_obs
tf.Tensor: processed observations, shape (self.n_batch, ) + self.ob_space.shape.
The form of processing depends on the type of the observation space and on whether scale is passed to the
constructor; see observation_input for more information.
step(obs, state=None, mask=None, deterministic=True)
Returns the q_values for a single step
Parameters
• obs – (np.ndarray float or int) The current observation of the environment
• state – (np.ndarray float) The last states (used in recurrent policies)
• mask – (np.ndarray float) The last masks (used in recurrent policies)
• deterministic – (bool) Whether or not to return deterministic actions.
Returns (np.ndarray int, np.ndarray float, np.ndarray float) actions, q_values, states
Similarly to the example given in the examples page, you can easily define a custom architecture for the policy
network:
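A minimal sketch (the layer sizes below are illustrative, not the library defaults):

from stable_baselines.deepq.policies import FeedForwardPolicy
from stable_baselines import DQN

# Custom MLP policy with two hidden layers of 32 units each
class CustomDQNPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomDQNPolicy, self).__init__(*args, **kwargs,
                                              layers=[32, 32],
                                              layer_norm=False,
                                              feature_extraction="mlp")

model = DQN(CustomDQNPolicy, 'CartPole-v1', verbose=1)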
Depending on initialization parameters and timestep, different variables are accessible. Variables accessible from
“timestep X” are variables that can be accessed when self.timestep==X from the on_step function.
Variable Availability
From timestep 1
• self
• total_timesteps
• callback
• log_interval
• tb_log_name
• reset_num_timesteps
• replay_wrapper
• new_tb_log
• writer
• episode_rewards
• episode_successes
• reset
• obs
• _
• kwargs
• update_eps
• update_param_noise_threshold
• action
• env_action
• new_obs
• rew
• done
• info
From timestep 2
• obs_
• new_obs_
• reward_
• can_sample
• mean_100ep_reward
• num_episodes
1.24 GAIL
Generative Adversarial Imitation Learning (GAIL) uses expert trajectories to recover a cost function and then
learn a policy.
Learning a cost function from expert demonstrations is called Inverse Reinforcement Learning (IRL). The connection
between GAIL and Generative Adversarial Networks (GANs) is that it uses a discriminator that tries to separate expert
trajectories from trajectories of the learned policy, which has the role of the generator here.
Note: GAIL requires OpenMPI. If OpenMPI isn’t enabled, then GAIL isn’t imported into the stable_baselines
module.
1.24.1 Notes
Warning: Images are not yet handled properly by the current implementation
The expert trajectories can come from an RL algorithm trained in a classic setting, from another controller (e.g. a
PID controller), or from human demonstrations.
We recommend that you take a look at the pre-training section, or look directly at the stable_baselines/gail/
dataset/ folder, to learn more about the expected format for the dataset.
Here is an example of training a Soft Actor-Critic model to generate expert trajectories for GAIL:
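A rough sketch of that workflow (generate_expert_traj and ExpertDataset are the pre-training helpers from stable_baselines.gail; the file names and step counts are illustrative):
from stable_baselines import GAIL, SAC
from stable_baselines.gail import ExpertDataset, generate_expert_traj

# Train a SAC expert and record its trajectories (illustrative file name)
expert = SAC('MlpPolicy', 'Pendulum-v0', verbose=1)
generate_expert_traj(expert, 'expert_pendulum', n_timesteps=10000, n_episodes=10)

# Learn a policy from the recorded expert trajectories with GAIL
dataset = ExpertDataset(expert_path='expert_pendulum.npz', traj_limitation=10, verbose=1)
model = GAIL('MlpPolicy', 'Pendulum-v0', dataset, verbose=1)
model.learn(total_timesteps=1000)
model.save("gail_pendulum")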
• Recurrent policies:
• Multi processing: X (using MPI)
• Gym spaces:
1.24.4 Example
import gym

from stable_baselines import GAIL

model = GAIL.load("gail_pendulum")

env = gym.make('Pendulum-v0')
obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
1.24.5 Parameters
Warning: Images are not yet handled properly by the current implementation
Parameters
• policy – (ActorCriticPolicy or str) The policy model to use (MlpPolicy, CnnPolicy, CnnLstmPolicy, . . . )
• env – (Gym environment or str) The environment to learn from (if registered in Gym, can
be str)
• expert_dataset – (ExpertDataset) the dataset manager
• gamma – (float) the discount value
• timesteps_per_batch – (int) the number of timesteps to run per batch (horizon)
• max_kl – (float) the Kullback-Leibler loss threshold
• cg_iters – (int) the number of iterations for the conjugate gradient calculation
• lam – (float) GAE factor
• entcoeff – (float) the weight for the entropy loss
• cg_damping – (float) the compute gradient dampening factor
• vf_stepsize – (float) the value function stepsize
• vf_iters – (int) the number of iterations for learning the value function
• hidden_size – ([int]) the hidden dimension for the MLP
• g_step – (int) number of steps to train policy in each epoch
• d_step – (int) number of steps to train discriminator in each epoch
• d_stepsize – (float) the reward giver stepsize
• verbose – (int) the verbosity level: 0 none, 1 training information, 2 tensorflow debug
• _init_setup_model – (bool) Whether or not to build the network at the creation of the
instance
• full_tensorboard_log – (bool) enable additional logging when using tensorboard
WARNING: this logging can take a lot of space quickly
• state – (np.ndarray) The last states (can be None, used in recurrent policies)
• mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
• actions – (np.ndarray) (OPTIONAL) For calculating the likelihood that the given ac-
tions are chosen by the model for each of the given parameters. Must have the same
number of actions and observations. (set to None to return the complete action probability
distribution)
• logp – (bool) (OPTIONAL) When specified with actions, returns probability in log-
space. This has no effect if actions is None.
Returns (np.ndarray) the model’s (log) action probability
get_env()
returns the current environment (can be None if not defined)
Returns (Gym Environment) The current environment
get_parameter_list()
Get tensorflow Variables of model’s parameters
This includes all variables necessary for continuing training (saving / loading).
Returns (list) List of tensorflow Variables
get_parameters()
Get current model parameters as dictionary of variable name -> ndarray.
Returns (OrderedDict) Dictionary of variable name -> ndarray of model’s parameters.
get_vec_normalize_env() → Optional[stable_baselines.common.vec_env.vec_normalize.VecNormalize]
Return the VecNormalize wrapper of the training env if it exists.
Returns Optional[VecNormalize] The VecNormalize env.
learn(total_timesteps, callback=None, log_interval=100, tb_log_name=’GAIL’, reset_num_timesteps=True)
Return a trained model.
Parameters
• total_timesteps – (int) The total number of samples to train on
• callback – (Union[callable, [callable], BaseCallback]) function called at every step
with the state of the algorithm. It takes the local and global variables. If it returns False,
training is aborted. When the callback inherits from BaseCallback, you will have access
to additional stages of the training (training start/end), please read the documentation for
more details.
• log_interval – (int) The number of timesteps before logging.
• tb_log_name – (str) the name of the run for tensorboard log
• reset_num_timesteps – (bool) whether or not to reset the current timestep number
(used in logging)
Returns (BaseRLModel) the trained model
classmethod load(load_path, env=None, custom_objects=None, **kwargs)
Load the model from file
Parameters
• load_path – (str or file-like) the saved parameter location
• env – (Gym Environment) the new environment to run the loaded model on (can be None
if you only need prediction from a trained model)
• custom_objects – (dict) Dictionary of objects to replace upon loading. If a variable
is present in this dictionary as a key, it will not be deserialized and the corresponding item
will be used instead. Similar to custom_objects in keras.models.load_model. Useful when
you have an object in file that can not be deserialized.
• kwargs – extra arguments to change the model when loading
load_parameters(load_path_or_dict, exact_match=True)
Load model parameters from a file or a dictionary
Dictionary keys should be tensorflow variable names, which can be obtained with the get_parameters
function. If exact_match is True, the dictionary should contain keys for all of the model’s parameters, otherwise
a RuntimeError is raised. If False, only variables included in the dictionary will be updated.
This does not load agent’s hyper-parameters.
Warning: This function does not update trainer/optimizer variables (e.g. momentum). As such
training after using this function may lead to less-than-optimal results.
Parameters
• load_path_or_dict – (str or file-like or dict) Save parameter location or dict of pa-
rameters as variable.name -> ndarrays to be loaded.
• exact_match – (bool) If True, expects load dictionary to contain keys for all variables
in the model. If False, loads parameters only for variables mentioned in the dictionary.
Defaults to True.
• val_interval – (int) Report training and validation losses every n epochs. By default,
every 10th of the maximum number of epochs.
Returns (BaseRLModel) the pretrained model
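A small sketch of the get_parameters / load_parameters round trip (it assumes an already instantiated model; the variable manipulation is only illustrative):
params = model.get_parameters()          # OrderedDict: variable name -> ndarray
name = next(iter(params))                # pick an arbitrary variable name
params[name] = params[name] * 0.0        # modify that parameter
# only the variables present in the dict are updated when exact_match=False
model.load_parameters({name: params[name]}, exact_match=False)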
save(save_path, cloudpickle=False)
Save the current parameters to file
Parameters
• save_path – (str or file-like) The save location
• cloudpickle – (bool) Use older cloudpickle format instead of zip-archives.
set_env(env)
Checks the validity of the environment, and if it is coherent, set it as the current environment.
Parameters env – (Gym Environment) The environment for learning a policy
set_random_seed(seed: Optional[int]) → None
Parameters seed – (Optional[int]) Seed for the pseudo-random generators. If None, do not
change the seeds.
setup_model()
Create all the functions and tensorflow graphs necessary to train the model
1.25 HER
Note: HER was re-implemented from scratch in Stable-Baselines compared to the original OpenAI baselines. If you
want to reproduce results from the paper, please use the rl baselines zoo in order to have the correct hyperparameters
and at least 8 MPI workers with DDPG.
Warning: you must pass an environment or wrap it with HERGoalEnvWrapper in order to use the predict
method
1.25.1 Notes
Please refer to the wrapped model (DQN, SAC, TD3 or DDPG) for that section.
1.25.3 Example
# model_class is the wrapped algorithm (DQN, SAC, TD3 or DDPG),
# env is a GoalEnv and goal_selection_strategy is e.g. 'future'
model = HER('MlpPolicy', env, model_class, n_sampled_goal=4,
            goal_selection_strategy=goal_selection_strategy, verbose=1)
# Train the model
model.learn(1000)

model.save("./her_bit_env")

obs = env.reset()
for _ in range(100):
    action, _ = model.predict(obs)
    obs, reward, done, _ = env.step(action)
    if done:
        obs = env.reset()
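As the warning above notes, the environment must be provided when loading a HER model for prediction; a short sketch:
from stable_baselines import HER

# env must be a GoalEnv (or an env already wrapped with HERGoalEnvWrapper)
model = HER.load('./her_bit_env', env=env)
action, _ = model.predict(env.reset())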
1.25.4 Parameters
class stable_baselines.her.GoalSelectionStrategy
The strategies for selecting new goals when creating artificial transitions.
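A short sketch of picking a strategy (the member names follow the HER paper):
from stable_baselines.her import GoalSelectionStrategy

# strategies described in the HER paper: FUTURE, FINAL, EPISODE, RANDOM
goal_selection_strategy = GoalSelectionStrategy.FUTURE
# the string form 'future' can also be passed directly to the HER model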
class stable_baselines.her.HERGoalEnvWrapper(env)
A wrapper that allows the use of a dict observation space (coming from GoalEnv) with the RL algorithms. It
assumes that all the spaces of the dict space are of the same type.
Parameters env – (gym.GoalEnv)
convert_dict_to_obs(obs_dict)
Parameters obs_dict – (dict<np.ndarray>)
Returns (np.ndarray)
convert_obs_to_dict(observations)
Inverse operation of convert_dict_to_obs
Parameters observations – (np.ndarray)
Returns (OrderedDict<np.ndarray>)
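A small sketch of the conversion methods (BitFlippingEnv is used here only as an example of a GoalEnv; its constructor arguments are an assumption of this sketch):
from stable_baselines.common.bit_flipping_env import BitFlippingEnv
from stable_baselines.her import HERGoalEnvWrapper

env = HERGoalEnvWrapper(BitFlippingEnv(n_bits=4))
obs = env.reset()                        # flat np.ndarray instead of a dict
obs_dict = env.convert_obs_to_dict(obs)  # back to {'observation', 'achieved_goal', 'desired_goal'}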
class stable_baselines.her.HindsightExperienceReplayWrapper(replay_buffer,
n_sampled_goal,
goal_selection_strategy,
wrapped_env)
Wrapper around a replay buffer in order to use HER. This implementation is inspired by the one found in
https://github.com/NervanaSystems/coach/.
Parameters
• replay_buffer – (ReplayBuffer)
• n_sampled_goal – (int) The number of artificial transitions to generate for each actual
transition
• goal_selection_strategy – (GoalSelectionStrategy) The method that will be used
to generate the goals for the artificial transitions.
• wrapped_env – (HERGoalEnvWrapper) the GoalEnv wrapped using HERGoalEn-
vWrapper, that enables to convert observation to dict, and vice versa
add(obs_t, action, reward, obs_tp1, done, info)
add a new transition to the buffer
Parameters
• obs_t – (np.ndarray) the last observation
• action – ([float]) the action
• reward – (float) the reward of the transition
• obs_tp1 – (np.ndarray) the new observation
• done – (bool) is the episode done
• info – (dict) extra values used to compute reward
can_sample(n_samples)
Check if n_samples samples can be sampled from the buffer.
Parameters n_samples – (int)
Returns (bool)
1.26 PPO1
The Proximal Policy Optimization algorithm combines ideas from A2C (having multiple workers) and TRPO (it uses
a trust region to improve the actor).
The main idea is that after an update, the new policy should not be too far from the old policy. For that, PPO uses
clipping to avoid too large an update.
Note: PPO1 requires OpenMPI. If OpenMPI isn’t enabled, then PPO1 isn’t imported into the stable_baselines
module.
Note: PPO1 uses MPI for multiprocessing unlike PPO2, which uses vectorized environments. PPO2 is the
implementation OpenAI made for GPU.
1.26.1 Notes
• Recurrent policies:
• Multi processing: X (using MPI)
• Gym spaces:
1.26.3 Example
import gym

from stable_baselines import PPO1

env = gym.make('CartPole-v1')

model = PPO1.load("ppo1_cartpole")

obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
1.26.4 Parameters
• schedule – (str) The type of scheduler for the learning rate update (‘linear’, ‘constant’,
‘double_linear_con’, ‘middle_drop’ or ‘double_middle_drop’)
• verbose – (int) the verbosity level: 0 none, 1 training information, 2 tensorflow debug
• tensorboard_log – (str) the log location for tensorboard (if None, no logging)
• _init_setup_model – (bool) Whether or not to build the network at the creation of the
instance
• policy_kwargs – (dict) additional arguments to be passed to the policy on creation
• full_tensorboard_log – (bool) enable additional logging when using tensorboard
WARNING: this logging can take a lot of space quickly
• seed – (int) Seed for the pseudo-random generators (python, numpy, tensorflow). If None
(default), use random seed. Note that if you want completely deterministic results, you must
set n_cpu_tf_sess to 1.
• n_cpu_tf_sess – (int) The number of threads for TensorFlow operations. If None, the
number of CPUs of the current machine will be used.
action_probability(observation, state=None, mask=None, actions=None, logp=False)
If actions is None, then get the model’s action probability distribution from a given observation.
Depending on the action space the output is:
• Discrete: probability for each possible action
• Box: mean and standard deviation of the action output
However if actions is not None, this function will return the probability that the given actions are taken
with the given parameters (observation, state, . . . ) on this model. For discrete action spaces, it returns
the probability mass; for continuous action spaces, the probability density. This is because the probability
mass will always be zero in continuous spaces; see http://blog.christianperone.com/2019/01/ for a good
explanation.
Parameters
• observation – (np.ndarray) the input observation
• state – (np.ndarray) The last states (can be None, used in recurrent policies)
• mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
• actions – (np.ndarray) (OPTIONAL) For calculating the likelihood that the given ac-
tions are chosen by the model for each of the given parameters. Must have the same
number of actions and observations. (set to None to return the complete action probability
distribution)
• logp – (bool) (OPTIONAL) When specified with actions, returns probability in log-
space. This has no effect if actions is None.
Returns (np.ndarray) the model’s (log) action probability
get_env()
returns the current environment (can be None if not defined)
Returns (Gym Environment) The current environment
get_parameter_list()
Get tensorflow Variables of model’s parameters
This includes all variables necessary for continuing training (saving / loading).
Returns (list) List of tensorflow Variables
get_parameters()
Get current model parameters as dictionary of variable name -> ndarray.
Returns (OrderedDict) Dictionary of variable name -> ndarray of model’s parameters.
get_vec_normalize_env() → Optional[stable_baselines.common.vec_env.vec_normalize.VecNormalize]
Return the VecNormalize wrapper of the training env if it exists.
Returns Optional[VecNormalize] The VecNormalize env.
learn(total_timesteps, callback=None, log_interval=100, tb_log_name=’PPO1’, reset_num_timesteps=True)
Return a trained model.
Parameters
• total_timesteps – (int) The total number of samples to train on
• callback – (Union[callable, [callable], BaseCallback]) function called at every step
with the state of the algorithm. It takes the local and global variables. If it returns False,
training is aborted. When the callback inherits from BaseCallback, you will have access
to additional stages of the training (training start/end), please read the documentation for
more details.
• log_interval – (int) The number of timesteps before logging.
• tb_log_name – (str) the name of the run for tensorboard log
• reset_num_timesteps – (bool) whether or not to reset the current timestep number
(used in logging)
Returns (BaseRLModel) the trained model
classmethod load(load_path, env=None, custom_objects=None, **kwargs)
Load the model from file
Parameters
• load_path – (str or file-like) the saved parameter location
• env – (Gym Environment) the new environment to run the loaded model on (can be None
if you only need prediction from a trained model)
• custom_objects – (dict) Dictionary of objects to replace upon loading. If a variable
is present in this dictionary as a key, it will not be deserialized and the corresponding item
will be used instead. Similar to custom_objects in keras.models.load_model. Useful when
you have an object in file that can not be deserialized.
• kwargs – extra arguments to change the model when loading
load_parameters(load_path_or_dict, exact_match=True)
Load model parameters from a file or a dictionary
Dictionary keys should be tensorflow variable names, which can be obtained with the get_parameters
function. If exact_match is True, the dictionary should contain keys for all of the model’s parameters, otherwise
a RuntimeError is raised. If False, only variables included in the dictionary will be updated.
This does not load agent’s hyper-parameters.
Warning: This function does not update trainer/optimizer variables (e.g. momentum). As such
training after using this function may lead to less-than-optimal results.
Parameters
Depending on initialization parameters and timestep, different variables are accessible. Variables accessible “From
timestep X” are variables that can be accessed when self.timestep==X in the on_step function.
Variable Availability
From timestep 0
• self
• total_timesteps
• callback
• log_interval
• tb_log_name
• reset_num_timesteps
• new_tb_log
• writer
• policy
• env
• horizon
• reward_giver
• gail
• step
• cur_ep_ret
• current_it_len
• current_ep_len
• cur_ep_true_ret
• ep_true_rets
• ep_rets
• ep_lens
• observations
• true_rewards
• rewards
• vpreds
• episode_starts
• dones
• actions
• states
• episode_start
• done
• vpred
• _
• i
• clipped_action
• reward
• true_reward
• info
• action
• observation
1.27 PPO2
The Proximal Policy Optimization algorithm combines ideas from A2C (having multiple workers) and TRPO (it uses
a trust region to improve the actor).
The main idea is that after an update, the new policy should not be too far from the old policy. For that, PPO uses
clipping to avoid too large an update.
Note: PPO2 is the implementation of OpenAI made for GPU. For multiprocessing, it uses vectorized environments
compared to PPO1 which uses MPI.
Note: PPO2 contains several modifications from the original algorithm not documented by OpenAI: value function
is also clipped and advantages are normalized.
1.27.1 Notes
• Recurrent policies: X
• Multi processing: X
• Gym spaces:
1.27.3 Example
import gym

from stable_baselines import PPO2
from stable_baselines.common import make_vec_env

# multiprocess environment
env = make_vec_env('CartPole-v1', n_envs=4)

model = PPO2.load("ppo2_cartpole")
1.27.4 Parameters
• cliprange_vf – (float or callable) Clipping parameter for the value function, it can be
a function. This is a parameter specific to the OpenAI implementation. If None is passed
(default), then cliprange (that is used for the policy) will be used. IMPORTANT: this
clipping depends on the reward scaling. To deactivate value function clipping (and recover
the original PPO implementation), you have to pass a negative value (e.g. -1); see the short
example after this parameter list.
• verbose – (int) the verbosity level: 0 none, 1 training information, 2 tensorflow debug
• tensorboard_log – (str) the log location for tensorboard (if None, no logging)
• _init_setup_model – (bool) Whether or not to build the network at the creation of the
instance
• policy_kwargs – (dict) additional arguments to be passed to the policy on creation
• full_tensorboard_log – (bool) enable additional logging when using tensorboard
WARNING: this logging can take a lot of space quickly
• seed – (int) Seed for the pseudo-random generators (python, numpy, tensorflow). If None
(default), use random seed. Note that if you want completely deterministic results, you must
set n_cpu_tf_sess to 1.
• n_cpu_tf_sess – (int) The number of threads for TensorFlow operations. If None, the
number of CPUs of the current machine will be used.
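As mentioned for cliprange_vf above, a short sketch of deactivating value function clipping (and thus recovering the original PPO behaviour):
from stable_baselines import PPO2

# pass a negative value to disable value function clipping
model = PPO2('MlpPolicy', 'CartPole-v1', cliprange_vf=-1, verbose=1)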
action_probability(observation, state=None, mask=None, actions=None, logp=False)
If actions is None, then get the model’s action probability distribution from a given observation.
Depending on the action space the output is:
• Discrete: probability for each possible action
• Box: mean and standard deviation of the action output
However if actions is not None, this function will return the probability that the given actions are taken
with the given parameters (observation, state, . . . ) on this model. For discrete action spaces, it returns
the probability mass; for continuous action spaces, the probability density. This is because the probability
mass will always be zero in continuous spaces; see http://blog.christianperone.com/2019/01/ for a good
explanation.
Parameters
• observation – (np.ndarray) the input observation
• state – (np.ndarray) The last states (can be None, used in recurrent policies)
• mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
• actions – (np.ndarray) (OPTIONAL) For calculating the likelihood that the given ac-
tions are chosen by the model for each of the given parameters. Must have the same
number of actions and observations. (set to None to return the complete action probability
distribution)
• logp – (bool) (OPTIONAL) When specified with actions, returns probability in log-
space. This has no effect if actions is None.
Returns (np.ndarray) the model’s (log) action probability
get_env()
returns the current environment (can be None if not defined)
Returns (Gym Environment) The current environment
get_parameter_list()
Get tensorflow Variables of model’s parameters
This includes all variables necessary for continuing training (saving / loading).
Returns (list) List of tensorflow Variables
get_parameters()
Get current model parameters as dictionary of variable name -> ndarray.
Returns (OrderedDict) Dictionary of variable name -> ndarray of model’s parameters.
get_vec_normalize_env() → Optional[stable_baselines.common.vec_env.vec_normalize.VecNormalize]
Return the VecNormalize wrapper of the training env if it exists.
Returns Optional[VecNormalize] The VecNormalize env.
learn(total_timesteps, callback=None, log_interval=1, tb_log_name=’PPO2’, reset_num_timesteps=True)
Return a trained model.
Parameters
• total_timesteps – (int) The total number of samples to train on
• callback – (Union[callable, [callable], BaseCallback]) function called at every step
with the state of the algorithm. It takes the local and global variables. If it returns False,
training is aborted. When the callback inherits from BaseCallback, you will have access
to additional stages of the training (training start/end), please read the documentation for
more details.
• log_interval – (int) The number of timesteps before logging.
• tb_log_name – (str) the name of the run for tensorboard log
• reset_num_timesteps – (bool) whether or not to reset the current timestep number
(used in logging)
Returns (BaseRLModel) the trained model
classmethod load(load_path, env=None, custom_objects=None, **kwargs)
Load the model from file
Parameters
• load_path – (str or file-like) the saved parameter location
• env – (Gym Environment) the new environment to run the loaded model on (can be None
if you only need prediction from a trained model)
• custom_objects – (dict) Dictionary of objects to replace upon loading. If a variable
is present in this dictionary as a key, it will not be deserialized and the corresponding item
will be used instead. Similar to custom_objects in keras.models.load_model. Useful when
you have an object in file that can not be deserialized.
• kwargs – extra arguments to change the model when loading
load_parameters(load_path_or_dict, exact_match=True)
Load model parameters from a file or a dictionary
Dictionary keys should be tensorflow variable names, which can be obtained with the get_parameters
function. If exact_match is True, the dictionary should contain keys for all of the model’s parameters, otherwise
a RuntimeError is raised. If False, only variables included in the dictionary will be updated.
This does not load agent’s hyper-parameters.
Warning: This function does not update trainer/optimizer variables (e.g. momentum). As such
training after using this function may lead to less-than-optimal results.
Parameters
• load_path_or_dict – (str or file-like or dict) Save parameter location or dict of pa-
rameters as variable.name -> ndarrays to be loaded.
• exact_match – (bool) If True, expects load dictionary to contain keys for all variables
in the model. If False, loads parameters only for variables mentioned in the dictionary.
Defaults to True.
Parameters seed – (Optional[int]) Seed for the pseudo-random generators. If None, do not
change the seeds.
setup_model()
Create all the functions and tensorflow graphs necessary to train the model
Depending on initialization parameters and timestep, different variables are accessible. Variables accessible “From
timestep X” are variables that can be accessed when self.timestep==X in the on_step function.
Variable Availability
From timestep 1
• self
• total_timesteps
• callback
• log_interval
• tb_log_name
• reset_num_timesteps
• cliprange_vf
• new_tb_log
• writer
• t_first_start
• n_updates
• mb_obs
• mb_rewards
• mb_actions
• mb_values
• mb_dones
• mb_neglogpacs
• mb_states
• ep_infos
• actions
• values
• neglogpacs
• clipped_actions
• rewards
• infos
From timestep 1
• info
• maybe_ep_info
1.28 SAC
Soft Actor Critic (SAC) Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor.
SAC is the successor of Soft Q-Learning (SQL) and incorporates the double Q-learning trick from TD3. A key feature
of SAC, and a major difference from common RL algorithms, is that it is trained to maximize a trade-off between
expected return and entropy, a measure of randomness in the policy.
Warning: The SAC model does not support stable_baselines.common.policies because it uses
double q-values and value estimation, as a result it must use its own policy models (see SAC Policies).
Available Policies
1.28.1 Notes
Note: In our implementation, we use an entropy coefficient (as in OpenAI Spinning Up or Facebook Horizon), which
is equivalent to the inverse of the reward scale in the original SAC paper. The main reason is that it avoids having
too high errors when updating the Q functions.
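A short sketch of setting the entropy coefficient described in the note above (the values are illustrative):
from stable_baselines import SAC

# learn the entropy coefficient automatically (the default)
model = SAC('MlpPolicy', 'Pendulum-v0', ent_coef='auto', verbose=1)
# or learn it automatically, starting from an initial value of 0.1
model = SAC('MlpPolicy', 'Pendulum-v0', ent_coef='auto_0.1', verbose=1)
# or keep it fixed
model = SAC('MlpPolicy', 'Pendulum-v0', ent_coef=0.01, verbose=1)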
Note: The default policies for SAC differ a bit from other algorithms’ MlpPolicy: they use ReLU instead of tanh
activation, to match the original paper.
• Recurrent policies:
• Multi processing:
• Gym spaces:
1.28.3 Example
import gym
import numpy as np

from stable_baselines import SAC

env = gym.make('Pendulum-v0')

model = SAC.load("sac_pendulum")

obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
1.28.4 Parameters
• tau – (float) the soft update coefficient (“polyak update”, between 0 and 1)
• ent_coef – (str or float) Entropy regularization coefficient. (Equivalent to inverse of
reward scale in the original SAC paper.) Controlling exploration/exploitation trade-off. Set
it to ‘auto’ to learn it automatically (and ‘auto_0.1’ for using 0.1 as initial value)
• train_freq – (int) Update the model every train_freq steps.
• learning_starts – (int) how many steps of the model to collect transitions for before
learning starts
• target_update_interval – (int) update the target network every
target_update_interval steps.
• gradient_steps – (int) How many gradient updates to perform after each step
• target_entropy – (str or float) target entropy when learning ent_coef (ent_coef =
‘auto’)
• action_noise – (ActionNoise) the action noise type (None by default), this can help for
hard exploration problem. Cf DDPG for the different action noise type.
• random_exploration – (float) Probability of taking a random action (as in an epsilon-
greedy strategy) This is not needed for SAC normally but can help exploring when using
HER + SAC. This hack was present in the original OpenAI Baselines repo (DDPG + HER)
• verbose – (int) the verbosity level: 0 none, 1 training information, 2 tensorflow debug
• tensorboard_log – (str) the log location for tensorboard (if None, no logging)
• _init_setup_model – (bool) Whether or not to build the network at the creation of the
instance
• policy_kwargs – (dict) additional arguments to be passed to the policy on creation
• full_tensorboard_log – (bool) enable additional logging when using tensorboard
Note: this has no effect on SAC logging for now
• seed – (int) Seed for the pseudo-random generators (python, numpy, tensorflow). If None
(default), use random seed. Note that if you want completely deterministic results, you must
set n_cpu_tf_sess to 1.
• n_cpu_tf_sess – (int) The number of threads for TensorFlow operations. If None, the
number of CPUs of the current machine will be used.
action_probability(observation, state=None, mask=None, actions=None, logp=False)
If actions is None, then get the model’s action probability distribution from a given observation.
Depending on the action space the output is:
• Discrete: probability for each possible action
• Box: mean and standard deviation of the action output
However if actions is not None, this function will return the probability that the given actions are taken
with the given parameters (observation, state, . . . ) on this model. For discrete action spaces, it returns
the probability mass; for continuous action spaces, the probability density. This is because the probability
mass will always be zero in continuous spaces; see http://blog.christianperone.com/2019/01/ for a good
explanation.
Parameters
• observation – (np.ndarray) the input observation
• state – (np.ndarray) The last states (can be None, used in recurrent policies)
• mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
• actions – (np.ndarray) (OPTIONAL) For calculating the likelihood that the given ac-
tions are chosen by the model for each of the given parameters. Must have the same
number of actions and observations. (set to None to return the complete action probability
distribution)
• logp – (bool) (OPTIONAL) When specified with actions, returns probability in log-
space. This has no effect if actions is None.
Returns (np.ndarray) the model’s (log) action probability
get_env()
returns the current environment (can be None if not defined)
Returns (Gym Environment) The current environment
get_parameter_list()
Get tensorflow Variables of model’s parameters
This includes all variables necessary for continuing training (saving / loading).
Returns (list) List of tensorflow Variables
get_parameters()
Get current model parameters as dictionary of variable name -> ndarray.
Returns (OrderedDict) Dictionary of variable name -> ndarray of model’s parameters.
get_vec_normalize_env() → Optional[stable_baselines.common.vec_env.vec_normalize.VecNormalize]
Return the VecNormalize wrapper of the training env if it exists.
Returns Optional[VecNormalize] The VecNormalize env.
is_using_her() → bool
Check if is using HER
Returns (bool) Whether is using HER or not
learn(total_timesteps, callback=None, log_interval=4, tb_log_name=’SAC’, reset_num_timesteps=True, replay_wrapper=None)
Return a trained model.
Parameters
• total_timesteps – (int) The total number of samples to train on
• callback – (Union[callable, [callable], BaseCallback]) function called at every step
with the state of the algorithm. It takes the local and global variables. If it returns False,
training is aborted. When the callback inherits from BaseCallback, you will have access
to additional stages of the training (training start/end), please read the documentation for
more details.
• log_interval – (int) The number of timesteps before logging.
• tb_log_name – (str) the name of the run for tensorboard log
• reset_num_timesteps – (bool) whether or not to reset the current timestep number
(used in logging)
Returns (BaseRLModel) the trained model
classmethod load(load_path, env=None, custom_objects=None, **kwargs)
Load the model from file
Parameters
Warning: This function does not update trainer/optimizer variables (e.g. momentum). As such
training after using this function may lead to less-than-optimal results.
Parameters
• load_path_or_dict – (str or file-like or dict) Save parameter location or dict of pa-
rameters as variable.name -> ndarrays to be loaded.
• exact_match – (bool) If True, expects load dictionary to contain keys for all variables
in the model. If False, loads parameters only for variables mentioned in the dictionary.
Defaults to True.
• val_interval – (int) Report training and validation losses every n epochs. By default,
every 10th of the maximum number of epochs.
Returns (BaseRLModel) the pretrained model
replay_buffer_add(obs_t, action, reward, obs_tp1, done, info)
Add a new transition to the replay buffer
Parameters
• obs_t – (np.ndarray) the last observation
• action – ([float]) the action
• reward – (float) the reward of the transition
• obs_tp1 – (np.ndarray) the new observation
• done – (bool) is the episode done
• info – (dict) extra values used to compute the reward when using HER
save(save_path, cloudpickle=False)
Save the current parameters to file
Parameters
• save_path – (str or file-like) The save location
• cloudpickle – (bool) Use older cloudpickle format instead of zip-archives.
set_env(env)
Checks the validity of the environment, and if it is coherent, set it as the current environment.
Parameters env – (Gym Environment) The environment for learning a policy
set_random_seed(seed: Optional[int]) → None
Parameters seed – (Optional[int]) Seed for the pseudo-random generators. If None, do not
change the seeds.
setup_model()
Create all the functions and tensorflow graphs necessary to train the model
action_ph
tf.Tensor: placeholder for actions, shape (self.n_batch, ) + self.ac_space.shape.
initial_state
The initial state of the policy. For feedforward policies, None. For a recurrent policy, a NumPy array of
shape (self.n_env, ) + state_shape.
is_discrete
bool: is action space discrete.
make_actor(obs=None, reuse=False, scope=’pi’)
Creates an actor object
Parameters
• obs – (TensorFlow Tensor) The observation placeholder (can be None for default place-
holder)
• reuse – (bool) whether or not to reuse parameters
• scope – (str) the scope name of the actor
Returns (TensorFlow Tensor) the output tensor
make_critics(obs=None, action=None, reuse=False, scope=’values_fn’, create_vf=True, create_qf=True)
Creates the two Q-value approximators along with the value function
Parameters
• obs – (TensorFlow Tensor) The observation placeholder (can be None for default place-
holder)
• action – (TensorFlow Tensor) The action placeholder
• reuse – (bool) whether or not to reuse parameters
• scope – (str) the scope name
• create_vf – (bool) Whether to create Value fn or not
• create_qf – (bool) Whether to create Q-Values fn or not
Returns ([tf.Tensor]) the output tensors of the two Q-value networks and of the value network
obs_ph
tf.Tensor: placeholder for observations, shape (self.n_batch, ) + self.ob_space.shape.
proba_step(obs, state=None, mask=None)
Returns the action probability params (mean, std) for a single step
Parameters
• obs – ([float] or [int]) The current observation of the environment
• state – ([float]) The last states (used in recurrent policies)
• mask – ([float]) The last masks (used in recurrent policies)
Returns ([float], [float])
processed_obs
tf.Tensor: processed observations, shape (self.n_batch, ) + self.ob_space.shape.
The form of processing depends on the type of the observation space and on whether scale is
passed to the constructor; see observation_input for more information.
• obs – (TensorFlow Tensor) The observation placeholder (can be None for default place-
holder)
• action – (TensorFlow Tensor) The action placeholder
• reuse – (bool) whether or not to reuse parameters
• scope – (str) the scope name
• create_vf – (bool) Whether to create Value fn or not
• create_qf – (bool) Whether to create Q-Values fn or not
Returns ([tf.Tensor]) the output tensors of the two Q-value networks and of the value network
obs_ph
tf.Tensor: placeholder for observations, shape (self.n_batch, ) + self.ob_space.shape.
proba_step(obs, state=None, mask=None)
Returns the action probability params (mean, std) for a single step
Parameters
• obs – ([float] or [int]) The current observation of the environment
• state – ([float]) The last states (used in recurrent policies)
• mask – ([float]) The last masks (used in recurrent policies)
Returns ([float], [float])
processed_obs
tf.Tensor: processed observations, shape (self.n_batch, ) + self.ob_space.shape.
The form of processing depends on the type of the observation space and on whether scale is
passed to the constructor; see observation_input for more information.
step(obs, state=None, mask=None, deterministic=False)
Returns the policy for a single step
Parameters
• obs – ([float] or [int]) The current observation of the environment
• state – ([float]) The last states (used in recurrent policies)
• mask – ([float]) The last masks (used in recurrent policies)
• deterministic – (bool) Whether or not to return deterministic actions.
Returns ([float]) actions
class stable_baselines.sac.CnnPolicy(sess, ob_space, ac_space, n_env=1, n_steps=1,
n_batch=None, reuse=False, **_kwargs)
Policy object that implements actor critic, using a CNN (the nature CNN)
Parameters
• sess – (TensorFlow session) The current TensorFlow session
• ob_space – (Gym Space) The observation space of the environment
• ac_space – (Gym Space) The action space of the environment
• n_env – (int) The number of environments to run
• n_steps – (int) The number of steps to run for each environment
• n_batch – (int) The number of batch to run (n_envs * n_steps)
processed_obs
tf.Tensor: processed observations, shape (self.n_batch, ) + self.ob_space.shape.
The form of processing depends on the type of the observation space and on whether scale is
passed to the constructor; see observation_input for more information.
step(obs, state=None, mask=None, deterministic=False)
Returns the policy for a single step
Parameters
• obs – ([float] or [int]) The current observation of the environment
• state – ([float]) The last states (used in recurrent policies)
• mask – ([float]) The last masks (used in recurrent policies)
• deterministic – (bool) Whether or not to return deterministic actions.
Returns ([float]) actions
class stable_baselines.sac.LnCnnPolicy(sess, ob_space, ac_space, n_env=1, n_steps=1,
n_batch=None, reuse=False, **_kwargs)
Policy object that implements actor critic, using a CNN (the nature CNN), with layer normalisation
Parameters
• sess – (TensorFlow session) The current TensorFlow session
• ob_space – (Gym Space) The observation space of the environment
• ac_space – (Gym Space) The action space of the environment
• n_env – (int) The number of environments to run
• n_steps – (int) The number of steps to run for each environment
• n_batch – (int) The number of batch to run (n_envs * n_steps)
• reuse – (bool) If the policy is reusable or not
• _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
action_ph
tf.Tensor: placeholder for actions, shape (self.n_batch, ) + self.ac_space.shape.
initial_state
The initial state of the policy. For feedforward policies, None. For a recurrent policy, a NumPy array of
shape (self.n_env, ) + state_shape.
is_discrete
bool: is action space discrete.
make_actor(obs=None, reuse=False, scope=’pi’)
Creates an actor object
Parameters
• obs – (TensorFlow Tensor) The observation placeholder (can be None for default place-
holder)
• reuse – (bool) whether or not to reuse parameters
• scope – (str) the scope name of the actor
Returns (TensorFlow Tensor) the output tensor
Similarly to the example given in the examples page, you can easily define a custom architecture for the policy
network:
import gym
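Continuing from the import above, a rough sketch of a custom SAC policy (the FeedForwardPolicy base class from stable_baselines.sac.policies and its layers keyword argument are assumptions of this sketch):
from stable_baselines import SAC
from stable_baselines.sac.policies import FeedForwardPolicy

# Custom MLP policy with three layers of size 128 each (sketch)
class CustomSACPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomSACPolicy, self).__init__(*args, **kwargs,
                                              layers=[128, 128, 128],
                                              feature_extraction="mlp")

model = SAC(CustomSACPolicy, 'Pendulum-v0', verbose=1)
model.learn(total_timesteps=10000)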
Depending on initialization parameters and timestep, different variables are accessible. Variables accessible “From
timestep X” are variables that can be accessed when self.timestep==X in the on_step function.
Variable Availability
From timestep 1
• self
• total_timesteps
• callback
• log_interval
• tb_log_name
• reset_num_timesteps
• replay_wrapper
• new_tb_log
• writer
• current_lr
• start_time
• episode_rewards
• episode_successes
• obs
• n_updates
• infos_values
• step
• unscaled_action
• action
• new_obs
• reward
• done
• info
From timestep 2
• obs_
• new_obs_
• reward_
• maybe_ep_info
• mean_reward
• num_episodes
1.29 TD3
Twin Delayed DDPG (TD3) Addressing Function Approximation Error in Actor-Critic Methods.
TD3 is a direct successor of DDPG and improves it using three major tricks: clipped double Q-learning, delayed
policy updates and target policy smoothing. We recommend reading the OpenAI Spinning Up guide on TD3 to learn
more about those.
Warning: The TD3 model does not support stable_baselines.common.policies because it uses
double q-values estimation, as a result it must use its own policy models (see TD3 Policies).
Available Policies
1.29.1 Notes
Note: The default policies for TD3 differ a bit from other algorithms’ MlpPolicy: they use ReLU instead of tanh
activation, to match the original paper.
• Recurrent policies:
• Multi processing:
• Gym spaces:
1.29.3 Example
import gym
import numpy as np

from stable_baselines import TD3

env = gym.make('Pendulum-v0')

model = TD3.load("td3_pendulum")

obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
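For completeness, a rough sketch of training and saving the model that the load call above assumes (the NormalActionNoise import path and the noise values are assumptions of this sketch):
import gym
import numpy as np

from stable_baselines import TD3
from stable_baselines.ddpg.noise import NormalActionNoise

env = gym.make('Pendulum-v0')
n_actions = env.action_space.shape[-1]
# Gaussian exploration noise on the actions (values are illustrative)
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

model = TD3('MlpPolicy', env, action_noise=action_noise, verbose=1)
model.learn(total_timesteps=50000)
model.save("td3_pendulum")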
1.29.4 Parameters
• policy_delay – (int) Policy and target networks will only be updated once every
policy_delay training steps. The Q values will be updated policy_delay times more often
(i.e. at every training step).
• action_noise – (ActionNoise) the action noise type. Cf DDPG for the different action
noise type.
• target_policy_noise – (float) Standard deviation of Gaussian noise added to target
policy (smoothing noise)
• target_noise_clip – (float) Limit for absolute value of target policy smoothing noise.
• train_freq – (int) Update the model every train_freq steps.
• learning_starts – (int) how many steps of the model to collect transitions for before
learning starts
• gradient_steps – (int) How many gradient updates to perform after each step
• random_exploration – (float) Probability of taking a random action (as in an epsilon-
greedy strategy) This is not needed for TD3 normally but can help exploring when using
HER + TD3. This hack was present in the original OpenAI Baselines repo (DDPG + HER)
• verbose – (int) the verbosity level: 0 none, 1 training information, 2 tensorflow debug
• tensorboard_log – (str) the log location for tensorboard (if None, no logging)
• _init_setup_model – (bool) Whether or not to build the network at the creation of the
instance
• policy_kwargs – (dict) additional arguments to be passed to the policy on creation
• full_tensorboard_log – (bool) enable additional logging when using tensorboard
Note: this has no effect on TD3 logging for now
• seed – (int) Seed for the pseudo-random generators (python, numpy, tensorflow). If None
(default), use random seed. Note that if you want completely deterministic results, you must
set n_cpu_tf_sess to 1.
• n_cpu_tf_sess – (int) The number of threads for TensorFlow operations. If None, the
number of CPUs of the current machine will be used.
action_probability(observation, state=None, mask=None, actions=None, logp=False)
If actions is None, then get the model’s action probability distribution from a given observation.
Depending on the action space the output is:
• Discrete: probability for each possible action
• Box: mean and standard deviation of the action output
However if actions is not None, this function will return the probability that the given actions are taken
with the given parameters (observation, state, . . . ) on this model. For discrete action spaces, it returns
the probability mass; for continuous action spaces, the probability density. This is because the probability
mass will always be zero in continuous spaces; see http://blog.christianperone.com/2019/01/ for a good
explanation.
Parameters
• observation – (np.ndarray) the input observation
• state – (np.ndarray) The last states (can be None, used in recurrent policies)
• mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
• actions – (np.ndarray) (OPTIONAL) For calculating the likelihood that the given ac-
tions are chosen by the model for each of the given parameters. Must have the same
number of actions and observations. (set to None to return the complete action probability
distribution)
• logp – (bool) (OPTIONAL) When specified with actions, returns probability in log-
space. This has no effect if actions is None.
Returns (np.ndarray) the model’s (log) action probability
get_env()
returns the current environment (can be None if not defined)
Returns (Gym Environment) The current environment
get_parameter_list()
Get tensorflow Variables of model’s parameters
This includes all variables necessary for continuing training (saving / loading).
Returns (list) List of tensorflow Variables
get_parameters()
Get current model parameters as dictionary of variable name -> ndarray.
Returns (OrderedDict) Dictionary of variable name -> ndarray of model’s parameters.
get_vec_normalize_env() → Optional[stable_baselines.common.vec_env.vec_normalize.VecNormalize]
Return the VecNormalize wrapper of the training env if it exists.
Returns Optional[VecNormalize] The VecNormalize env.
is_using_her() → bool
Check if is using HER
Returns (bool) Whether is using HER or not
learn(total_timesteps, callback=None, log_interval=4, tb_log_name=’TD3’, reset_num_timesteps=True, replay_wrapper=None)
Return a trained model.
Parameters
• total_timesteps – (int) The total number of samples to train on
• callback – (Union[callable, [callable], BaseCallback]) function called at every step
with the state of the algorithm. It takes the local and global variables. If it returns False,
training is aborted. When the callback inherits from BaseCallback, you will have access
to additional stages of the training (training start/end), please read the documentation for
more details.
• log_interval – (int) The number of timesteps before logging.
• tb_log_name – (str) the name of the run for tensorboard log
• reset_num_timesteps – (bool) whether or not to reset the current timestep number
(used in logging)
Returns (BaseRLModel) the trained model
classmethod load(load_path, env=None, custom_objects=None, **kwargs)
Load the model from file
Parameters
• load_path – (str or file-like) the saved parameter location
• env – (Gym Environment) the new environment to run the loaded model on (can be None
if you only need prediction from a trained model)
• custom_objects – (dict) Dictionary of objects to replace upon loading. If a variable
is present in this dictionary as a key, it will not be deserialized and the corresponding item
will be used instead. Similar to custom_objects in keras.models.load_model. Useful when
you have an object in file that can not be deserialized.
• kwargs – extra arguments to change the model when loading
load_parameters(load_path_or_dict, exact_match=True)
Load model parameters from a file or a dictionary
Dictionary keys should be tensorflow variable names, which can be obtained with the get_parameters
function. If exact_match is True, the dictionary should contain keys for all of the model’s parameters, otherwise
a RuntimeError is raised. If False, only variables included in the dictionary will be updated.
This does not load agent’s hyper-parameters.
Warning: This function does not update trainer/optimizer variables (e.g. momentum). As such
training after using this function may lead to less-than-optimal results.
Parameters
• load_path_or_dict – (str or file-like or dict) Save parameter location or dict of pa-
rameters as variable.name -> ndarrays to be loaded.
• exact_match – (bool) If True, expects load dictionary to contain keys for all variables
in the model. If False, loads parameters only for variables mentioned in the dictionary.
Defaults to True.
• val_interval – (int) Report training and validation losses every n epochs. By default,
every 10th of the maximum number of epochs.
Returns (BaseRLModel) the pretrained model
replay_buffer_add(obs_t, action, reward, obs_tp1, done, info)
Add a new transition to the replay buffer
Parameters
• obs_t – (np.ndarray) the last observation
• action – ([float]) the action
• reward – (float) the reward of the transition
• obs_tp1 – (np.ndarray) the new observation
• done – (bool) is the episode done
• info – (dict) extra values used to compute the reward when using HER
save(save_path, cloudpickle=False)
Save the current parameters to file
Parameters
• save_path – (str or file-like) The save location
• cloudpickle – (bool) Use older cloudpickle format instead of zip-archives.
set_env(env)
Checks the validity of the environment, and if it is coherent, set it as the current environment.
Parameters env – (Gym Environment) The environment for learning a policy
set_random_seed(seed: Optional[int]) → None
Parameters seed – (Optional[int]) Seed for the pseudo-random generators. If None, do not
change the seeds.
setup_model()
Create all the functions and tensorflow graphs necessary to train the model
action_ph
tf.Tensor: placeholder for actions, shape (self.n_batch, ) + self.ac_space.shape.
initial_state
The initial state of the policy. For feedforward policies, None. For a recurrent policy, a NumPy array of
shape (self.n_env, ) + state_shape.
is_discrete
bool: is action space discrete.
make_actor(obs=None, reuse=False, scope=’pi’)
Creates an actor object
Parameters
• obs – (TensorFlow Tensor) The observation placeholder (can be None for default place-
holder)
• reuse – (bool) whether or not to reuse parameters
• scope – (str) the scope name of the actor
Returns (TensorFlow Tensor) the output tensor
make_critics(obs=None, action=None, reuse=False, scope=’values_fn’)
Creates the two Q-value approximators
Parameters
• obs – (TensorFlow Tensor) The observation placeholder (can be None for default place-
holder)
• action – (TensorFlow Tensor) The action placeholder
• reuse – (bool) whether or not to reuse parameters
• scope – (str) the scope name
Returns ([tf.Tensor]) the output tensors of the two Q-value networks
obs_ph
tf.Tensor: placeholder for observations, shape (self.n_batch, ) + self.ob_space.shape.
proba_step(obs, state=None, mask=None)
Returns the policy for a single step
Parameters
• obs – ([float] or [int]) The current observation of the environment
• state – ([float]) The last states (used in recurrent policies)
• mask – ([float]) The last masks (used in recurrent policies)
Returns ([float]) actions
processed_obs
tf.Tensor: processed observations, shape (self.n_batch, ) + self.ob_space.shape.
The form of processing depends on the type of the observation space and on whether scale is
passed to the constructor; see observation_input for more information.
step(obs, state=None, mask=None)
Returns the policy for a single step
Parameters
obs_ph
tf.Tensor: placeholder for observations, shape (self.n_batch, ) + self.ob_space.shape.
proba_step(obs, state=None, mask=None)
Returns the policy for a single step
Parameters
• obs – ([float] or [int]) The current observation of the environment
• state – ([float]) The last states (used in recurrent policies)
• mask – ([float]) The last masks (used in recurrent policies)
Returns ([float]) actions
processed_obs
tf.Tensor: processed observations, shape (self.n_batch, ) + self.ob_space.shape.
The form of processing depends on the type of the observation space and on whether scale is
passed to the constructor; see observation_input for more information.
step(obs, state=None, mask=None)
Returns the policy for a single step
Parameters
• obs – ([float] or [int]) The current observation of the environment
• state – ([float]) The last states (used in recurrent policies)
• mask – ([float]) The last masks (used in recurrent policies)
Returns ([float]) actions
class stable_baselines.td3.CnnPolicy(sess, ob_space, ac_space, n_env=1, n_steps=1,
n_batch=None, reuse=False, **_kwargs)
Policy object that implements actor critic, using a CNN (the nature CNN)
Parameters
• sess – (TensorFlow session) The current TensorFlow session
• ob_space – (Gym Space) The observation space of the environment
• ac_space – (Gym Space) The action space of the environment
• n_env – (int) The number of environments to run
• n_steps – (int) The number of steps to run for each environment
• n_batch – (int) The number of batch to run (n_envs * n_steps)
• reuse – (bool) If the policy is reusable or not
• _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
action_ph
tf.Tensor: placeholder for actions, shape (self.n_batch, ) + self.ac_space.shape.
initial_state
The initial state of the policy. For feedforward policies, None. For a recurrent policy, a NumPy array of
shape (self.n_env, ) + state_shape.
is_discrete
bool: is action space discrete.
Parameters
• sess – (TensorFlow session) The current TensorFlow session
• ob_space – (Gym Space) The observation space of the environment
• ac_space – (Gym Space) The action space of the environment
• n_env – (int) The number of environments to run
• n_steps – (int) The number of steps to run for each environment
• n_batch – (int) The number of batch to run (n_envs * n_steps)
• reuse – (bool) If the policy is reusable or not
• _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
action_ph
tf.Tensor: placeholder for actions, shape (self.n_batch, ) + self.ac_space.shape.
initial_state
The initial state of the policy. For feedforward policies, None. For a recurrent policy, a NumPy array of
shape (self.n_env, ) + state_shape.
is_discrete
bool: is action space discrete.
make_actor(obs=None, reuse=False, scope=’pi’)
Creates an actor object
Parameters
• obs – (TensorFlow Tensor) The observation placeholder (can be None for default place-
holder)
• reuse – (bool) whether or not to reuse parameters
• scope – (str) the scope name of the actor
Returns (TensorFlow Tensor) the output tensor
make_critics(obs=None, action=None, reuse=False, scope=’values_fn’)
Creates the two Q-value approximators
Parameters
• obs – (TensorFlow Tensor) The observation placeholder (can be None for default place-
holder)
• action – (TensorFlow Tensor) The action placeholder
• reuse – (bool) whether or not to reuse parameters
• scope – (str) the scope name
Returns ([tf.Tensor]) the output tensors of the two Q-value networks
obs_ph
tf.Tensor: placeholder for observations, shape (self.n_batch, ) + self.ob_space.shape.
proba_step(obs, state=None, mask=None)
Returns the policy for a single step
Parameters
• obs – ([float] or [int]) The current observation of the environment
Similarly to the example given in the examples page, you can easily define a custom architecture for the policy
network:
import gym
import numpy as np
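Continuing from the imports above, a rough sketch of a custom TD3 policy (the FeedForwardPolicy base class from stable_baselines.td3.policies and its layers keyword argument are assumptions of this sketch):
from stable_baselines import TD3
from stable_baselines.td3.policies import FeedForwardPolicy

# Custom MLP policy with two layers of size 400 and 300 (sketch)
class CustomTD3Policy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomTD3Policy, self).__init__(*args, **kwargs,
                                              layers=[400, 300],
                                              feature_extraction="mlp")

model = TD3(CustomTD3Policy, 'Pendulum-v0', verbose=1)
model.learn(total_timesteps=10000)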
Depending on initialization parameters and timestep, different variables are accessible. Variables accessible “From
timestep X” are variables that can be accessed when self.timestep==X in the on_step function.
Variable Availability
From timestep 1
• self
• total_timesteps
• callback
• log_interval
• tb_log_name
• reset_num_timesteps
• replay_wrapper
• new_tb_log
• writer
• current_lr
• start_time
• episode_rewards
• episode_successes
• obs
• n_updates
• infos_values
• step
• unscaled_action
• action
• new_obs
• reward
• done
• info
From timestep 2
• obs_
• new_obs_
• reward_
• maybe_ep_info
• mean_reward
• num_episodes
1.30 TRPO
Trust Region Policy Optimization (TRPO) is an iterative approach for optimizing policies with guaranteed monotonic
improvement.
Note: TRPO requires OpenMPI. If OpenMPI isn’t enabled, then TRPO isn’t imported into the
stable_baselines module.
1.30.1 Notes
• Recurrent policies:
• Multi processing: X (using MPI)
• Gym spaces:
1.30.3 Example
import gym

from stable_baselines import TRPO

env = gym.make('CartPole-v1')

model = TRPO.load("trpo_cartpole")
1.30.4 Parameters
to additional stages of the training (training start/end), please read the documentation for
more details.
• log_interval – (int) The number of timesteps before logging.
• tb_log_name – (str) the name of the run for tensorboard log
• reset_num_timesteps – (bool) whether or not to reset the current timestep number
(used in logging)
Returns (BaseRLModel) the trained model
classmethod load(load_path, env=None, custom_objects=None, **kwargs)
Load the model from file
Parameters
• load_path – (str or file-like) the saved parameter location
• env – (Gym Environment) the new environment to run the loaded model on (can be None
if you only need prediction from a trained model)
• custom_objects – (dict) Dictionary of objects to replace upon loading. If a variable
is present in this dictionary as a key, it will not be deserialized and the corresponding item
will be used instead. Similar to custom_objects in keras.models.load_model. Useful when
you have an object in file that can not be deserialized.
• kwargs – extra arguments to change the model when loading
load_parameters(load_path_or_dict, exact_match=True)
Load model parameters from a file or a dictionary
Dictionary keys should be tensorflow variable names, which can be obtained with the get_parameters
function. If exact_match is True, the dictionary should contain keys for all of the model’s parameters, otherwise
a RuntimeError is raised. If False, only variables included in the dictionary will be updated.
This does not load agent’s hyper-parameters.
Warning: This function does not update trainer/optimizer variables (e.g. momentum). As such
training after using this function may lead to less-than-optimal results.
Parameters
• load_path_or_dict – (str or file-like or dict) Save parameter location or dict of pa-
rameters as variable.name -> ndarrays to be loaded.
• exact_match – (bool) If True, expects load dictionary to contain keys for all variables
in the model. If False, loads parameters only for variables mentioned in the dictionary.
Defaults to True.
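A minimal sketch of copying parameters between two models of the same architecture (the file names are hypothetical):
from stable_baselines import TRPO

# Both models must share the same architecture for exact_match=True to succeed
source_model = TRPO.load("trpo_cartpole_a")
target_model = TRPO.load("trpo_cartpole_b")

params = source_model.get_parameters()  # dict: tensorflow variable name -> ndarray
target_model.load_parameters(params, exact_match=True)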
predict(observation, state=None, mask=None, deterministic=False)
Get the model’s action from an observation
Parameters
• observation – (np.ndarray) the input observation
• state – (np.ndarray) The last states (can be None, used in recurrent policies)
• mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
• deterministic – (bool) Whether or not to return deterministic actions
Returns (np.ndarray, np.ndarray) the model’s action and the next state (used in recurrent poli-
cies)
pretrain(dataset, n_epochs=10, learning_rate=0.0001, adam_epsilon=1e-08, val_interval=None)
Pretrain a model using behavior cloning: supervised learning given an expert dataset.
NOTE: only Box and Discrete spaces are supported for now.
Parameters
• dataset – (ExpertDataset) Dataset manager
• n_epochs – (int) Number of iterations on the training set
• learning_rate – (float) Learning rate
• adam_epsilon – (float) the epsilon value for the adam optimizer
• val_interval – (int) Report training and validation losses every n epochs. By default,
every 10th of the maximum number of epochs.
Returns (BaseRLModel) the pretrained model
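For example, a behavior-cloning sketch with an expert dataset (the expert file name is hypothetical; see the pre-training documentation for how to record expert trajectories):
from stable_baselines import TRPO
from stable_baselines.gail import ExpertDataset

# Hypothetical expert trajectories, e.g. recorded with generate_expert_traj
dataset = ExpertDataset(expert_path="expert_cartpole.npz",
                        traj_limitation=10, batch_size=128)

model = TRPO("MlpPolicy", "CartPole-v1", verbose=1)
model.pretrain(dataset, n_epochs=100)  # supervised learning on the expert data
model.learn(total_timesteps=10000)     # optional RL fine-tuning afterwards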
save(save_path, cloudpickle=False)
Save the current parameters to file
Parameters
• save_path – (str or file-like) The save location
• cloudpickle – (bool) Use older cloudpickle format instead of zip-archives.
set_env(env)
Checks the validity of the environment, and if it is coherent, set it as the current environment.
Parameters env – (Gym Environment) The environment for learning a policy
set_random_seed(seed: Optional[int]) → None
Parameters seed – (Optional[int]) Seed for the pseudo-random generators. If None, do not
change the seeds.
setup_model()
Create all the functions and tensorflow graphs necessary to train the model
Depending on initialization parameters and timestep, different variables are accessible. Variables accessible “From
timestep X” are variables that can be accessed when self.timestep==X in the on_step function.
Variable Availability
From timestep 0
• total_timesteps
• callback
• log_interval
• tb_log_name
• reset_num_timesteps
• new_tb_log
• writer
• self
• policy
• env
• horizon
• reward_giver
• gail
• step
• cur_ep_ret
• current_it_len
• current_ep_len
• cur_ep_true_ret
• ep_true_rets
• ep_rets
• ep_lens
• observations
• true_rewards
• rewards
• vpreds
• episode_starts
• dones
• actions
• states
• episode_start
• done
• vpred
• clipped_action
• reward
• true_reward
• info
• action
• observation
• maybe_ep_info
The policy networks output parameters for the distributions (named flat in the methods). Actions are then sampled
from those distributions.
For instance, in the case of discrete actions, the policy network outputs the probability of taking each action. The
CategoricalProbabilityDistribution allows sampling from it, computing the entropy and the negative log
probability (neglogp), and backpropagating the gradient.
In the case of continuous actions, a Gaussian distribution is used. The policy network outputs the mean and (log) std of
the distribution (assumed to be a DiagGaussianProbabilityDistribution).
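As a small sketch of the interface (the parameter values are arbitrary): a diagonal Gaussian distribution over a 2-dimensional action space can be built directly from its flat parameters, i.e. the concatenation of the means and the log standard deviations:
import tensorflow as tf

from stable_baselines.common.distributions import DiagGaussianProbabilityDistribution

# flat = [mean_1, mean_2, log_std_1, log_std_2] for a batch of one 2-D action
flat = tf.constant([[0.0, 0.0, -1.0, -1.0]], dtype=tf.float32)
pd = DiagGaussianProbabilityDistribution(flat)

action = pd.sample()          # stochastic action tensor, shape (1, 2)
det_action = pd.mode()        # deterministic action (the mean)
neglogp = pd.neglogp(action)  # negative log-likelihood of the sampled action
entropy = pd.entropy()        # entropy of the Gaussian

with tf.Session() as sess:
    print(sess.run([action, det_action, neglogp, entropy]))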
class stable_baselines.common.distributions.BernoulliProbabilityDistribution(logits)
entropy()
Returns Shannon’s entropy of the probability
Returns (float) the entropy
flatparam()
Return the direct probabilities
Returns ([float]) the probabilities
classmethod fromflat(flat)
Create an instance of this from new Bernoulli input
Parameters flat – ([float]) the Bernoulli input data
Returns (ProbabilityDistribution) the instance from the given Bernoulli input data
kl(other)
Calculates the Kullback-Leibler divergence from the given probability distribution
Parameters other – ([float]) the distribution to compare with
Returns (float) the KL divergence of the two distributions
mode()
Returns the probability
Returns (Tensorflow Tensor) the deterministic action
neglogp(x)
Returns the negative log likelihood
Parameters x – (str) the labels of each index
Returns ([float]) The negative log likelihood of the distribution
sample()
returns a sample from the probability distribution
Returns (Tensorflow Tensor) the stochastic action
class stable_baselines.common.distributions.BernoulliProbabilityDistributionType(size)
param_shape()
returns the shape of the input parameters
Returns ([int]) the shape
proba_distribution_from_latent(pi_latent_vector, vf_latent_vector, init_scale=1.0,
init_bias=0.0)
returns the probability distribution from latent values
Parameters
• pi_latent_vector – ([float]) the latent pi values
• vf_latent_vector – ([float]) the latent vf values
• init_scale – (float) the initial scale of the distribution
• init_bias – (float) the initial bias of the distribution
Returns (ProbabilityDistribution) the instance of the ProbabilityDistribution associated
class stable_baselines.common.distributions.CategoricalProbabilityDistribution(logits)
entropy()
Returns Shannon’s entropy of the probability
Returns (float) the entropy
flatparam()
Return the direct probabilities
Returns ([float]) the probabilities
classmethod fromflat(flat)
Create an instance of this from new logits values
Parameters flat – ([float]) the categorical logits input
Returns (ProbabilityDistribution) the instance from the given categorical input
kl(other)
Calculates the Kullback-Leibler divergence from the given probability distribution
Parameters other – ([float]) the distribution to compare with
Returns (float) the KL divergence of the two distributions
mode()
Returns the probability
Returns (Tensorflow Tensor) the deterministic action
neglogp(x)
Returns the negative log likelihood
Parameters x – (str) the labels of each index
Returns ([float]) The negative log likelihood of the distribution
sample()
returns a sample from the probability distribution
Returns (Tensorflow Tensor) the stochastic action
class stable_baselines.common.distributions.CategoricalProbabilityDistributionType(n_cat)
param_shape()
returns the shape of the input parameters
Returns ([int]) the shape
proba_distribution_from_latent(pi_latent_vector, vf_latent_vector, init_scale=1.0,
init_bias=0.0)
returns the probability distribution from latent values
Parameters
• pi_latent_vector – ([float]) the latent pi values
• vf_latent_vector – ([float]) the latent vf values
• init_scale – (float) the initial scale of the distribution
• init_bias – (float) the initial bias of the distribution
Returns (ProbabilityDistribution) the instance of the ProbabilityDistribution associated
probability_distribution_class()
returns the ProbabilityDistribution class of this type
Returns (Type ProbabilityDistribution) the probability distribution class associated
sample_dtype()
returns the type of the sampling
Returns (type) the type
sample_shape()
returns the shape of the sampling
Returns ([int]) the shape
class stable_baselines.common.distributions.DiagGaussianProbabilityDistribution(flat)
entropy()
Returns Shannon’s entropy of the probability
Returns (float) the entropy
flatparam()
Return the direct probabilities
Returns ([float]) the probabilities
classmethod fromflat(flat)
Create an instance of this from new multivariate Gaussian input
Parameters flat – ([float]) the multivariate Gaussian input data
Returns (ProbabilityDistribution) the instance from the given multivariate Gaussian input data
kl(other)
Calculates the Kullback-Leibler divergence from the given probability distribution
Parameters other – ([float]) the distribution to compare with
Returns (float) the KL divergence of the two distributions
mode()
Returns the probability
Returns (Tensorflow Tensor) the deterministic action
class stable_baselines.common.distributions.DiagGaussianProbabilityDistributionType(size)
param_shape()
returns the shape of the input parameters
Returns ([int]) the shape
proba_distribution_from_flat(flat)
returns the probability distribution from flat probabilities
Parameters flat – ([float]) the flat probabilities
Returns (ProbabilityDistribution) the instance of the ProbabilityDistribution associated
proba_distribution_from_latent(pi_latent_vector, vf_latent_vector, init_scale=1.0,
init_bias=0.0)
returns the probability distribution from latent values
Parameters
• pi_latent_vector – ([float]) the latent pi values
• vf_latent_vector – ([float]) the latent vf values
• init_scale – (float) the initial scale of the distribution
• init_bias – (float) the initial bias of the distribution
Returns (ProbabilityDistribution) the instance of the ProbabilityDistribution associated
probability_distribution_class()
returns the ProbabilityDistribution class of this type
Returns (Type ProbabilityDistribution) the probability distribution class associated
sample_dtype()
returns the type of the sampling
Returns (type) the type
sample_shape()
returns the shape of the sampling
Returns ([int]) the shape
class stable_baselines.common.distributions.MultiCategoricalProbabilityDistribution(nvec,
flat)
entropy()
Returns Shannon’s entropy of the probability
Returns (float) the entropy
flatparam()
Return the direct probabilities
Returns ([float]) the probabilities
classmethod fromflat(flat)
Create an instance of this from new logits values
Parameters flat – ([float]) the multi categorical logits input
Returns (ProbabilityDistribution) the instance from the given multi categorical input
kl(other)
Calculates the Kullback-Leibler divergence from the given probability distribution
Parameters other – ([float]) the distribution to compare with
Returns (float) the KL divergence of the two distributions
mode()
Returns the probability
Returns (Tensorflow Tensor) the deterministic action
neglogp(x)
Returns the negative log likelihood
Parameters x – (str) the labels of each index
Returns ([float]) The negative log likelihood of the distribution
sample()
returns a sample from the probability distribution
Returns (Tensorflow Tensor) the stochastic action
class stable_baselines.common.distributions.MultiCategoricalProbabilityDistributionType(n_vec)
param_shape()
returns the shape of the input parameters
Returns ([int]) the shape
proba_distribution_from_flat(flat)
Returns the probability distribution from flat probabilities flat: flattened vector of parameters of probability
distribution
Parameters flat – ([float]) the flat probabilities
Returns (ProbabilityDistribution) the instance of the ProbabilityDistribution associated
proba_distribution_from_latent(pi_latent_vector, vf_latent_vector, init_scale=1.0,
init_bias=0.0)
returns the probability distribution from latent values
Parameters
• pi_latent_vector – ([float]) the latent pi values
• vf_latent_vector – ([float]) the latent vf values
• init_scale – (float) the initial scale of the distribution
• init_bias – (float) the initial bias of the distribution
Returns (ProbabilityDistribution) the instance of the ProbabilityDistribution associated
probability_distribution_class()
returns the ProbabilityDistribution class of this type
Returns (Type ProbabilityDistribution) the probability distribution class associated
sample_dtype()
returns the type of the sampling
Returns (type) the type
sample_shape()
returns the shape of the sampling
Returns ([int]) the shape
class stable_baselines.common.distributions.ProbabilityDistribution
Base class for describing a probability distribution.
entropy()
Returns Shannon’s entropy of the probability
Returns (float) the entropy
flatparam()
Return the direct probabilities
Returns ([float]) the probabilities
kl(other)
Calculates the Kullback-Leibler divergence from the given probability distribution
Parameters other – ([float]) the distribution to compare with
Returns (float) the KL divergence of the two distributions
logp(x)
Returns the log likelihood
Parameters x – (str) the labels of each index
Returns ([float]) The log likelihood of the distribution
mode()
Returns the probability
Returns (Tensorflow Tensor) the deterministic action
neglogp(x)
Returns the negative log likelihood
Parameters x – (str) the labels of each index
Returns ([float]) The negative log likelihood of the distribution
sample()
returns a sample from the probability distribution
Returns (Tensorflow Tensor) the stochastic action
class stable_baselines.common.distributions.ProbabilityDistributionType
Parametrized family of probability distributions
param_placeholder(prepend_shape, name=None)
returns the TensorFlow placeholder for the input parameters
Parameters
• prepend_shape – ([int]) the prepend shape
• name – (str) the placeholder name
Returns (TensorFlow Tensor) the placeholder
stable_baselines.common.distributions.shape_el(tensor, index)
get the shape of a TensorFlow Tensor element
Parameters
• tensor – (TensorFlow Tensor) the input tensor
• index – (int) the element
Returns ([int]) the shape
stable_baselines.common.tf_util.function(inputs, outputs, updates=None, givens=None)
Wraps TensorFlow placeholders and expressions into a callable f(inputs) -> outputs.
Parameters
• inputs – (TensorFlow Tensor or Object with make_feed_dict) list of input arguments
• outputs – (TensorFlow Tensor) list of outputs or a single output to be returned from
function. Returned value will also have the same shape.
• updates – ([tf.Operation] or tf.Operation) list of update functions or single update func-
tion that will be run whenever the function is called. The return is ignored.
• givens – (dict) the values known for the output
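A small sketch of how such a function can be built and called (the placeholder names and values are arbitrary):
import tensorflow as tf

from stable_baselines.common.tf_util import function, initialize, single_threaded_session

x = tf.placeholder(tf.int32, (), name="x")
y = tf.placeholder(tf.int32, (), name="y")
z = 3 * x + 2 * y

# givens supplies a default value for y when it is not fed
lin = function(inputs=[x, y], outputs=z, givens={y: 0})

with single_threaded_session():
    initialize()
    print(lin(2))     # 6, y falls back to the given default of 0
    print(lin(2, 2))  # 10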
stable_baselines.common.tf_util.get_globals_vars(name)
Returns the global variables in the given scope
Parameters name – (str) the scope
Returns ([TensorFlow Variable])
stable_baselines.common.tf_util.get_trainable_vars(name)
returns the trainable variables
Parameters name – (str) the scope
Returns ([TensorFlow Variable])
stable_baselines.common.tf_util.gradient_add(grad_1, grad_2, param, verbose=0)
Sum two gradients
Parameters
• grad_1 – (TensorFlow Tensor) The first gradient
• grad_2 – (TensorFlow Tensor) The second gradient
• param – (TensorFlow parameters) The trainable parameters
• verbose – (int) verbosity level
Returns (TensorFlow Tensor) the sum of the gradients
stable_baselines.common.tf_util.huber_loss(tensor, delta=1.0)
Reference: https://en.wikipedia.org/wiki/Huber_loss
Parameters
• tensor – (TensorFlow Tensor) the input value
• delta – (float) Huber loss delta value
Returns (TensorFlow Tensor) Huber loss output
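A minimal sketch of applying it to a batch of TD errors (the values are arbitrary):
import tensorflow as tf

from stable_baselines.common.tf_util import huber_loss

td_error = tf.placeholder(tf.float32, shape=(None,))
# Quadratic for |error| <= delta, linear beyond it
loss = tf.reduce_mean(huber_loss(td_error, delta=1.0))

with tf.Session() as sess:
    print(sess.run(loss, feed_dict={td_error: [0.5, 2.0, -3.0]}))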
stable_baselines.common.tf_util.in_session(func)
Wraps a function so that it is in a TensorFlow Session
Parameters func – (function) the function to wrap
Returns (function)
stable_baselines.common.tf_util.initialize(sess=None)
Initialize all the uninitialized variables in the global scope.
Parameters sess – (TensorFlow Session)
stable_baselines.common.tf_util.intprod(tensor)
calculates the product of all the elements in a list
Parameters tensor – ([Number]) the list of elements
Returns (int) the product of the elements, truncated to an integer
stable_baselines.common.tf_util.is_image(tensor)
Check if a tensor has the shape of a valid image for tensorboard logging. Valid image: RGB, RGBD, GrayScale
Parameters tensor – (np.ndarray or tf.placeholder)
Returns (bool)
stable_baselines.common.tf_util.make_session(num_cpu=None, make_default=False,
graph=None)
Returns a session that will use <num_cpu> CPUs only
Parameters
• num_cpu – (int) number of CPUs to use for TensorFlow
• make_default – (bool) if this should return an InteractiveSession or a normal Session
• graph – (TensorFlow Graph) the graph of the session
Returns (TensorFlow session)
stable_baselines.common.tf_util.mse(pred, target)
Returns the Mean squared error between prediction and target
Parameters
• pred – (TensorFlow Tensor) The predicted value
• target – (TensorFlow Tensor) The target value
Returns (TensorFlow Tensor) The Mean squared error between prediction and target
stable_baselines.common.tf_util.numel(tensor)
get TensorFlow Tensor’s number of elements
Parameters tensor – (TensorFlow Tensor) the input tensor
Returns (int) the number of elements
stable_baselines.common.tf_util.outer_scope_getter(scope, new_scope='')
remove a scope layer for the getter
Parameters
• scope – (str) the layer to remove
• new_scope – (str) optional replacement name
Returns (function (function, str, *args, **kwargs): Tensorflow Tensor)
stable_baselines.common.tf_util.q_explained_variance(q_pred, q_true)
Calculates the explained variance of the Q value
Parameters
• q_pred – (TensorFlow Tensor) The predicted Q value
• q_true – (TensorFlow Tensor) The expected Q value
Returns (TensorFlow Tensor) the explained variance of the Q value
stable_baselines.common.tf_util.sample(logits)
Creates a sampling Tensor for non deterministic policies when using categorical distribution. It uses the Gumbel-
max trick: http://amid.fish/humble-gumbel
Parameters logits – (TensorFlow Tensor) The input probability for each action
Returns (TensorFlow Tensor) The sampled action
stable_baselines.common.tf_util.seq_to_batch(tensor_sequence, flat=False)
Transform a sequence of Tensors, into a batch of Tensors for recurrent policies
Parameters
• tensor_sequence – (TensorFlow Tensor) The input tensor to batch
• flat – (bool) If the input Tensor is flat
Returns (TensorFlow Tensor) batch of Tensors for recurrent policies
stable_baselines.common.tf_util.single_threaded_session(make_default=False,
graph=None)
Returns a session which will only use a single CPU
Parameters
• make_default – (bool) if this should return an InteractiveSession or a normal Session
• graph – (TensorFlow Graph) the graph of the session
Returns (TensorFlow session)
stable_baselines.common.tf_util.total_episode_reward_logger(rew_acc, rewards,
masks, writer, steps)
Calculates the cumulative episode reward and writes the output to the tensorflow log
Parameters
• rew_acc – (np.array float) the total running reward
• rewards – (np.array float) the rewards
• masks – (np.array bool) the end of episodes
• writer – (TensorFlow Session.writer) the writer to log to
• steps – (int) the current timestep
Returns (np.array float) the updated total running reward
stable_baselines.common.tf_util.var_shape(tensor)
get TensorFlow Tensor shape
Parameters tensor – (TensorFlow Tensor) the input tensor
Returns ([int]) the shape
1.34 Schedules
Schedules are used as hyperparameters for most of the algorithms, in order to change the value of a parameter over time
(usually the learning rate).
This file is used for specifying various schedules that evolve over time throughout the execution of the algorithm, such
as:
• learning rate for the optimizer
• exploration epsilon for the epsilon-greedy exploration strategy
• beta parameter for prioritized replay
Each schedule has a function value(t) which returns the current value of the parameter given the timestep t of the
optimization procedure.
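For instance, a minimal sketch of an epsilon-greedy exploration schedule (the values are arbitrary):
from stable_baselines.common.schedules import LinearSchedule

# Anneal epsilon from 1.0 down to 0.05 over the first 10000 timesteps,
# then keep the final value
exploration = LinearSchedule(schedule_timesteps=10000, final_p=0.05, initial_p=1.0)

print(exploration.value(0))      # 1.0
print(exploration.value(5000))   # 0.525
print(exploration.value(20000))  # 0.05 (final_p is returned after the schedule ends)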
class stable_baselines.common.schedules.ConstantSchedule(value)
Value remains constant over time.
Parameters value – (float) Constant value of the schedule
value(step)
Value of the schedule for a given timestep
Parameters step – (int) the timestep
Returns (float) the output value for the given timestep
class stable_baselines.common.schedules.LinearSchedule(schedule_timesteps, final_p,
initial_p=1.0)
Linear interpolation between initial_p and final_p over schedule_timesteps. After this many timesteps pass
final_p is returned.
Parameters
• schedule_timesteps – (int) Number of timesteps for which to linearly anneal initial_p
to final_p
• initial_p – (float) initial output value
• final_p – (float) final output value
value(step)
Value of the schedule for a given timestep
Parameters step – (int) the timestep
Returns (float) the output value for the given timestep
class stable_baselines.common.schedules.PiecewiseSchedule(endpoints, interpo-
lation=<function
linear_interpolation>,
outside_value=None)
Piecewise schedule.
Parameters
• endpoints – ([(int, int)]) list of pairs (time, value) meaning that schedule should output
value when t==time. All the values for time must be sorted in an increasing order. When t
is between two times, e.g. (time_a, value_a) and (time_b, value_b), such that time_a <= t <
time_b then value outputs interpolation(value_a, value_b, alpha) where alpha is a fraction
of time passed between time_a and time_b for time t.
• interpolation – (lambda (float, float, float): float) a function that takes value to the left
and to the right of t according to the endpoints. Alpha is the fraction of distance from left
endpoint to right endpoint that t has covered. See linear_interpolation for example.
• outside_value – (float) if the value is requested outside of all the intervals specified in
endpoints this value is returned. If None then AssertionError is raised when outside value is
requested.
value(step)
Value of the schedule for a given timestep
Parameters step – (int) the timestep
Returns (float) the output value for the given timestep
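A small sketch of a piecewise learning-rate schedule (the breakpoints are arbitrary):
from stable_baselines.common.schedules import PiecewiseSchedule

# Keep 1e-3 for the first 1000 steps, then anneal linearly to 1e-4 at step 5000
lr_schedule = PiecewiseSchedule(endpoints=[(0, 1e-3), (1000, 1e-3), (5000, 1e-4)],
                                outside_value=1e-4)

print(lr_schedule.value(500))    # 1e-3
print(lr_schedule.value(3000))   # halfway between 1e-3 and 1e-4
print(lr_schedule.value(10000))  # 1e-4 (outside_value)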
stable_baselines.common.schedules.constant(_)
Returns a constant value for the Scheduler
Parameters _ – ignored
Returns (float) 1
stable_baselines.common.schedules.constfn(val)
Create a function that returns a constant. It is useful for learning rate schedules (to avoid code duplication)
Parameters val – (float)
Returns (function)
stable_baselines.common.schedules.double_linear_con(progress)
Returns a linear value (x2) with a flattened tail for the Scheduler
Parameters progress – (float) Current progress status (in [0, 1])
Returns (float) 1 - progress*2 if (1 - progress*2) >= 0.125 else 0.125
stable_baselines.common.schedules.double_middle_drop(progress)
Returns a linear value with two drops near the middle to a constant value for the Scheduler
Parameters progress – (float) Current progress status (in [0, 1])
Returns (float) if 0.75 <= 1 - p: 1 - p, if 0.25 <= 1 - p < 0.75: 0.75, if 1 - p < 0.25: 0.125
stable_baselines.common.schedules.get_schedule_fn(value_schedule)
Transform (if needed) learning rate and clip range to callable.
Parameters value_schedule – (callable or float)
Returns (function)
stable_baselines.common.schedules.linear_interpolation(left, right, alpha)
Linear interpolation between left and right.
Parameters
• left – (float) left boundary
• right – (float) right boundary
• alpha – (float) coeff in [0, 1]
Returns (float)
stable_baselines.common.schedules.linear_schedule(progress)
Returns a linear value for the Scheduler
Parameters progress – (float) Current progress status (in [0, 1])
Returns (float) 1 - progress
1.38 Changelog
Warning: This package is in maintenance mode, please use Stable-Baselines3 (SB3) for an up-to-date version.
You can find a migration guide in SB3 documentation.
Breaking Changes:
New Features:
Bug Fixes:
• Fixed calculation of the log probability of Diagonal Gaussian distribution when using
action_probability() method (@SVJayanthi, @sunshineclt)
• Fixed docker image build (@anj1)
Deprecations:
Others:
Documentation:
Breaking Changes:
New Features:
Bug Fixes:
• Fixed DDPG sampling empty replay buffer when combined with HER (@tirafesi)
• Fixed a bug in HindsightExperienceReplayWrapper, where the openai-gym signature for
compute_reward was not matched correctly (@johannes-dornheim)
• Fixed SAC/TD3 checking time to update on learn steps instead of total steps (@PartiallyTyped)
• Added **kwarg pass through for reset method in atari_wrappers.FrameStack (@PartiallyTyped)
• Fix consistency in setup_model() for SAC, target_entropy now uses self.action_space in-
stead of self.env.action_space (@PartiallyTyped)
• Fix reward threshold in test_identity.py
• Partially fix tensorboard indexing for PPO2 (@enderdead)
• Fixed potential bug in DummyVecEnv where copy() was used instead of deepcopy()
• Fixed a bug in GAIL where the dataloader was not available after saving, causing an error when using
CheckpointCallback
• Fixed a bug in SAC where any convolutional layers were not included in the target network parameters.
• Fixed render() method for VecEnvs
• Fixed seed() method for SubprocVecEnv
• Fixed a bug where callback.locals did not have the correct values (@PartiallyTyped)
• Fixed a bug in the close() method of SubprocVecEnv, causing wrappers further down in the wrapper
stack to not be closed. (@NeoExtended)
Deprecations:
Others:
Documentation:
Breaking Changes:
• evaluate_policy now returns the standard deviation of the reward per episode as second return value
(instead of n_steps)
• evaluate_policy now returns as second return value a list of the episode lengths when
return_episode_rewards is set to True (instead of n_steps)
• Callbacks are now called after each env.step() for consistency (they were previously called every n_steps in
algorithms like A2C or PPO2)
• Removed unused code in common/a2c/utils.py (calc_entropy_softmax, make_path)
• Refactoring, including removed files and moving functions.
– Algorithms no longer import from each other, and common does not import from algorithms.
– a2c/utils.py removed and split into other files:
New Features:
• Parallelized updating and sampling from the replay buffer in DQN. (@flodorner)
• Docker build script, scripts/build_docker.sh, can push images automatically.
• Added callback collection
• Added unwrap_vec_normalize and sync_envs_normalization in the vec_env module to synchronize
two VecNormalize environments
• Added a seeding method for vectorized environments. (@NeoExtended)
• Added extend method to store batches of experience in ReplayBuffer. (@PartiallyTyped)
Bug Fixes:
• Fixed Docker images via scripts/build_docker.sh and Dockerfile: GPU image now contains
tensorflow-gpu, and both images have stable_baselines installed in developer mode at correct di-
rectory for mounting.
• Fixed Docker GPU run script, scripts/run_docker_gpu.sh, to work with new NVidia Container
Toolkit.
• Repeated calls to RLModel.learn() now preserve internal counters for some episode logging statistics that
used to be zeroed at the start of every call.
• Fix DummyVecEnv.render for num_envs > 1. This used to print a warning and then not render at all.
(@shwang)
• Fixed a bug in PPO2, ACER, A2C, and ACKTR where repeated calls to learn(total_timesteps) reset
the environment on every call, potentially biasing samples toward early episode timesteps. (@shwang)
• Fixed by adding lazy property ActorCriticRLModel.runner. Subclasses now use lazily-generated
self.runner instead of reinitializing a new Runner every time learn() is called.
• Fixed a bug in check_env where it would fail on high dimensional action spaces
• Fixed Monitor.close() that was not calling the parent method
• Fixed a bug in BaseRLModel when seeding vectorized environments. (@NeoExtended)
• Fixed num_timesteps computation to be consistent between algorithms (updated after env.step()) Only
TRPO and PPO1 update it differently (after synchronization) because they rely on MPI
• Fixed bug in TRPO with NaN standardized advantages (@richardwu)
• Fixed partial minibatch computation in ExpertDataset (@richardwu)
• Fixed normalization (with VecNormalize) for off-policy algorithms
• Fixed sync_envs_normalization to sync the reward normalization too
• Bump minimum Gym version (>=0.11)
Deprecations:
Others:
Documentation:
Reproducible results, automatic VecEnv wrapping, env checker and more usability improvements
Breaking Changes:
• The seed argument has been moved from learn() method to model constructor in order to have reproducible
results
• allow_early_resets of the Monitor wrapper now defaults to True
New Features:
• Add n_cpu_tf_sess to model constructor to choose the number of threads used by Tensorflow
• Environments are automatically wrapped in a DummyVecEnv if needed when passing them to the model con-
structor
• Added stable_baselines.common.make_vec_env helper to simplify VecEnv creation
• Added stable_baselines.common.evaluation.evaluate_policy helper to simplify model
evaluation
• VecNormalize changes:
– Now supports being pickled and unpickled (@AdamGleave).
– New methods .normalize_obs(obs) and .normalize_reward(rews) apply normalization to arbitrary
observations or rewards without updating statistics (@shwang)
– .get_original_reward() returns the unnormalized rewards from the most recent timestep
– .reset() now collects observation statistics (used to only apply normalization)
• Add parameter exploration_initial_eps to DQN. (@jdossgollin)
• Add type checking and PEP 561 compliance. Note: most functions are still not annotated, this will be a gradual
process.
• DDPG, TD3 and SAC accept non-symmetric action spaces. (@Antymon)
• Add check_env util to check if a custom environment follows the gym interface (@araffin and @justinkterry)
Bug Fixes:
Deprecations:
• nprocs (ACKTR) and num_procs (ACER) are deprecated in favor of n_cpu_tf_sess which is now
common to all algorithms
• VecNormalize: load_running_average and save_running_average are deprecated in favour of
using pickle.
Others:
Documentation:
MPI dependency optional, new save format, ACKTR with continuous actions
Breaking Changes:
• OpenMPI-dependent algorithms (PPO1, TRPO, GAIL, DDPG) are disabled in the default installation of sta-
ble_baselines. mpi4py is now installed as an extra. When mpi4py is not available, stable-baselines skips
imports of OpenMPI-dependent algorithms. See installation notes and Issue #430.
• SubprocVecEnv now defaults to a thread-safe start method, forkserver when available and otherwise
spawn. This may require application code to be wrapped in if __name__ == '__main__'. You can re-
store the previous behavior by explicitly setting start_method = 'fork'. See PR #428.
• Updated dependencies: tensorflow v1.8.0 is now required
• Removed checkpoint_path and checkpoint_freq argument from DQN that were not used
• Removed bench/benchmark.py that was not used
• Removed several functions from common/tf_util.py that were not used
• Removed ppo1/run_humanoid.py
New Features:
• important change Switch to using zip-archived JSON and Numpy savez for storing models for better support
across library/Python versions. (@Miffyli)
• ACKTR now supports continuous actions
• Add double_q argument to DQN constructor
Bug Fixes:
• Skip automatic imports of OpenMPI-dependent algorithms to avoid an issue where OpenMPI would cause
stable-baselines to hang on Ubuntu installs. See installation notes and Issue #430.
• Fix a bug when calling logger.configure() with MPI enabled (@keshaviyengar)
• set allow_pickle=True for numpy>=1.17.0 when loading expert dataset
• Fix a bug when using VecCheckNan with numpy ndarray as state. Issue #489. (@ruifeng96150)
Deprecations:
• Models saved with cloudpickle format (stable-baselines<=2.7.0) are now deprecated in favor of zip-archive
format for better support across Python/Tensorflow versions. (@Miffyli)
Others:
Documentation:
Twin Delayed DDPG (TD3) and GAE bug fix (TRPO, PPO1, GAIL)
Breaking Changes:
New Features:
Bug Fixes:
• fixed a bug in traj_segment_generator where the episode_starts was wrongly recorded, resulting
in wrong calculation of Generalized Advantage Estimation (GAE), this affects TRPO, PPO1 and GAIL (thanks
to @miguelrass for spotting the bug)
• added missing property n_batch in BasePolicy.
Deprecations:
Others:
Documentation:
Breaking Changes:
import sys
import pkg_resources
import stable_baselines
We recommend you save the model again afterwards, so the fix won’t be needed the next time the trained agent is
loaded.
New Features:
• revamped HER implementation: clean re-implementation from scratch, now supports DQN, SAC and DDPG
• add action_noise param for SAC, it helps exploration for problems with deceptive rewards
• The parameter filter_size of the function conv in A2C utils now supports passing a list/tuple of two
integers (height and width), in order to have a non-square kernel matrix. (@yutingsz)
• add random_exploration parameter for DDPG and SAC, it may be useful when using HER + DDPG/SAC.
This hack was present in the original OpenAI Baselines DDPG + HER implementation.
• added load_parameters and get_parameters to the base RL class. With these methods, users are able to
load and get parameters to/from an existing model, without touching tensorflow. (@Miffyli)
• added specific hyperparameter for PPO2 to clip the value function (cliprange_vf)
• added VecCheckNan wrapper
Bug Fixes:
• bugfix for VecEnvWrapper.__getattr__ which enables access to class attributes inherited from parent
classes.
• fixed path splitting in TensorboardWriter._get_latest_run_id() on Windows machines
(@PatrickWalter214)
• fixed a bug where initial learning rate is logged instead of its placeholder in A2C.setup_model (@sc420)
• fixed a bug where number of timesteps is incorrectly updated and logged in A2C.learn and A2C.
_train_step (@sc420)
Deprecations:
• deprecated memory_limit and memory_policy in DDPG, please use buffer_size instead. (will be
removed in v3.x.x)
Others:
• important change switched to using dictionaries rather than lists when storing parameters, with tensorflow
Variable names being the keys. (@Miffyli)
• removed unused dependencies (tqdm, dill, progressbar2, seaborn, glob2, click)
• removed get_available_gpus function which hadn’t been used anywhere (@Pastafarianist)
Documentation:
• GAIL: gail.dataset.ExpertDataset supports loading from memory rather than file, and gail.
dataset.record_expert supports returning in-memory rather than saving to file.
• added support in VecEnvWrapper for accessing attributes of arbitrarily deeply nested instances of
VecEnvWrapper and VecEnv. This is allowed as long as the attribute belongs to exactly one of the nested
instances i.e. it must be unambiguous. (@kantneel)
• fixed bug where result plotter would crash on very short runs (@Pastafarianist)
• added option to not trim output of result plotter by number of timesteps (@Pastafarianist)
• clarified the public interface of BasePolicy and ActorCriticPolicy. Breaking change when using
custom policies: masks_ph is now called dones_ph, and most placeholders were made private: e.g. self.
value_fn is now self._value_fn
• support for custom stateful policies.
• fixed episode length recording in trpo_mpi.utils.traj_segment_generator (@GerardMaggi-
olino)
Working GAIL, pretrain RL models and hotfix for A2C with continuous actions
• fixed various bugs in GAIL
• added scripts to generate dataset for gail
• added tests for GAIL + data for Pendulum-v0
• removed unused utils file in DQN folder
• fixed a bug in A2C where actions were cast to int32 even in the continuous case
• added additional logging to A2C when a Monitor wrapper is used
• changed logging for PPO2: do not display NaN when reward info is not present
• change default value of A2C lr schedule
• removed behavior cloning script
• added pretrain method to base class, in order to use behavior cloning on all models
• fixed close() method for DummyVecEnv.
• added support for Dict spaces in DummyVecEnv and SubprocVecEnv. (@AdamGleave)
• added support for arbitrary multiprocessing start methods and added a warning about SubprocVecEnv not being
thread-safe by default. (@AdamGleave)
• added support for Discrete actions for GAIL
• fixed deprecation warning for tf: replaces tf.to_float() by tf.cast()
• fixed bug in saving and loading ddpg model when using normalization of obs or returns (@tperol)
• changed DDPG default buffer size from 100 to 50000.
• fixed a bug in ddpg.py in combined_stats for eval. Computed mean on eval_episode_rewards
and eval_qs (@keshaviyengar)
• fixed a bug in setup.py that would error on non-GPU systems without TensorFlow installed
• added support for storing model in file like object. (thanks to @ernestum)
• fixed wrong image detection when using tensorboard logging with DQN
• fixed bug in ppo2 when passing non callable lr after loading
• Hotfix for ppo2, the wrong placeholder was used for the value function
• added async_eigen_decomp parameter for ACKTR and set it to False by default (remove deprecation
warnings)
• added methods for calling env methods/setting attributes inside a VecEnv (thanks to @bjmuld)
• updated gym minimum version
Warning: This version contains breaking changes for DQN policies, please read the full details
Warning: This version contains breaking changes, please read the full details
1.38.22 Maintainers
Stable-Baselines is currently maintained by Ashley Hill (aka @hill-a), Antonin Raffin (aka @araffin), Maximilian
Ernestus (aka @ernestum), Adam Gleave (@AdamGleave) and Anssi Kanervisto (aka @Miffyli).
In random order...
Thanks to @bjmuld @iambenzo @iandanforth @r7vme @brendenpetersen @huvar @abhiskk @JohannesAck
@mily20001 @EliasHasle @mrakgr @Bleyddyn @antoine-galataud @junhyeokahn @AdamGleave @keshaviyengar
@tperol @XMaster96 @kantneel @Pastafarianist @GerardMaggiolino @PatrickWalter214 @yutingsz @sc420
@Aaahh @billtubbs @Miffyli @dwiel @miguelrass @qxcv @jaberkow @eavelardev @ruifeng96150 @pedrohbtp
@srivatsankrishnan @evilsocket @MarvineGothic @jdossgollin @SyllogismRXS @rusu24edward @jbulow
@Antymon @seheevic @justinkterry @edbeeching @flodorner @KuKuXia @NeoExtended @PartiallyTyped @mmcenta
@richardwu @tirafesi @caburu @johannes-dornheim @kvenkman @aakash94 @enderdead @hardmaru @jbarsce
@ColinLeongUDRI @shwang @YangRui2015 @sophiagu @OGordon100 @SVJayanthi @sunshineclt @roccivic
@anj1
1.39 Projects
This is a list of projects using stable-baselines. Please tell us if you want your project to appear on this page ;)
A simple environment for benchmarking single and multi-agent reinforcement learning algorithms on a clone of the
Slime Volleyball game. Only dependencies are gym and numpy. Both state and pixel observation environments are
available. The motivation of this environment is to easily enable trained agents to play against each other, and also
facilitate the training of agents directly in a multi-agent setting, thus adding an extra dimension for evaluating an
agent’s performance.
Uses stable-baselines to train RL agents for both state and pixel observation versions of the task. A tutorial is also
provided on modifying stable-baselines for self-play using PPO.
Implementation of a reinforcement learning approach to make a donkey car learn to drive. Uses DDPG on VAE features
(reproducing the paper from wayve.ai).
A series of videos on how to make a self-driving FZERO artificial intelligence using the reinforcement learning
algorithms PPO2 and A2C.
S-RL Toolbox: Reinforcement Learning (RL) and State Representation Learning (SRL) for Robotics. Stable-Baselines
was originally developed for this project.
Authors: Antonin Raffin, Ashley Hill, René Traoré, Timothée Lesort, Natalia Díaz-Rodríguez, David Filliat
Github repo: https://github.com/araffin/robotics-rl-srl
“In this notebook example, we will make HalfCheetah learn to walk using the stable-baselines [. . . ]”
Implementation of a reinforcement learning approach to make a car learn to drive smoothly in minutes. Uses SAC on
VAE features.
A project around Roboy, a tendon-driven robot, that enabled it to move its shoulder in simulation to reach a pre-defined
point in 3D space. The agent used Proximal Policy Optimization (PPO) or Soft Actor-Critic (SAC) and was tested on
the real hardware.
The RL agent serves as a local planner and is trained in a simulator, a fusion of the Flatland Simulator and the crowd
simulator Pedsim. This was tested on a real mobile robot. The Proximal Policy Optimization (PPO) algorithm is
applied.
Uses Stable Baselines to train adversarial policies that attack pre-trained victim policies in zero-sum multi-agent
environments. May be useful as an example of how to integrate Stable Baselines with Ray to perform distributed
experiments and Sacred for experiment configuration and monitoring.
Reinforcement learning is used to train agents to control pistons attached to a bridge to cancel out vibrations. The
bridge is modeled as a one dimensional oscillating system and dynamics are simulated using a finite difference solver.
Agents were trained using Proximal Policy Optimization. See the presentation for environment details.
Deep Reinforcement Learning is used to control the position or the shape of obstacles in different fluids in order to
optimize drag or lift. Fenics is used for the Fluid Mechanics part, and Stable Baselines is used for the DRL.
Aerial robotics is a cross-layer, interdisciplinary field. Air Learning is an effort to bridge seemingly disparate fields.
Designing an autonomous robot to perform a task involves interactions between various boundaries spanning from
modeling the environment down to the choice of onboard computer platform available in the robot. Our goal through
building Air Learning is to provide researchers with a cross-domain infrastructure that allows them to holistically study
and evaluate reinforcement learning algorithms for autonomous aerial machines. We use stable-baselines to train a UAV
agent with the Deep Q-Network and Proximal Policy Optimization algorithms.
Authors: Srivatsan Krishnan, Behzad Boroujerdian, William Fu, Aleksandra Faust, Vijay Janapa Reddi
Email: [email protected]
Github: https://github.com/harvard-edge/airlearning
Paper: https://arxiv.org/pdf/1906.00421.pdf
Video: https://www.youtube.com/watch?v=oakzGnh7Llw (Simulation),
https://www.youtube.com/watch?v=cvO5YOzI0mg (on a CrazyFlie Nano-Drone)
AI to play the classic snake game. The game was trained using PPO2 available from stable-baselines and then exported
to tensorflowjs to run directly in the browser.
1.39.17 Pwnagotchi
Pwnagotchi is an A2C-based “AI” powered by bettercap and running on a Raspberry Pi Zero W that learns from its
surrounding WiFi environment in order to maximize the crackable WPA key material it captures (either through passive
sniffing or by performing deauthentication and association attacks). This material is collected on disk as PCAP files
containing any form of handshake supported by hashcat, including full and half WPA handshakes as well as PMKIDs.
QuaRL is an open-source framework to study the effects of quantization on a broad spectrum of reinforcement learning
algorithms. The RL algorithms used in this study are from stable-baselines.
Authors: Srivatsan Krishnan, Sharad Chitlangia, Maximilian Lam, Zishen Wan, Aleksandra Faust, Vijay Janapa
Reddi
Email: [email protected]
Github: https://github.com/harvard-edge/quarl
Paper: https://arxiv.org/pdf/1910.01055.pdf
Executes PPO at the C++ level, yielding notable execution performance speedups. Uses Stable Baselines to create a
computational graph which is then used for training with custom environments by a machine-code-compiled binary.
Authors: Xue Bin Peng, Erwin Coumans, Tingnan Zhang, Tsang-Wei Lee, Jie Tan, Sergey Levine
Website: https://xbpeng.github.io/projects/Robotic_Imitation/index.html
Github: https://github.com/google-research/motion_imitation
Paper: https://arxiv.org/abs/2004.00784
This project aims to provide clean implementations of imitation learning algorithms. Currently we have implementa-
tions of AIRL and GAIL, and intend to add more in the future.
stable_baselines.results_plotter.main()
Example usage in jupyter-notebook
@misc{stable-baselines,
  author = {Hill, Ashley and Raffin, Antonin and Ernestus, Maximilian and Gleave, Adam and Kanervisto, Anssi and
            Traore, Rene and Dhariwal, Prafulla and Hesse, Christopher and Klimov, Oleg and Nichol, Alex and
            Plappert, Matthias and Radford, Alec and Schulman, John and Sidor, Szymon and Wu, Yuhuai},
  title = {Stable Baselines},
  year = {2018},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/hill-a/stable-baselines}},
}
Contributing
To anyone interested in making the RL baselines better, there are still some improvements that need to be done. A full
TODO list is available in the roadmap.
If you want to contribute, please read CONTRIBUTING.md first.
Python Module Index
stable_baselines.a2c, 71
stable_baselines.acer, 77
stable_baselines.acktr, 83
stable_baselines.bench.monitor, 190
stable_baselines.common.base_class, 61
stable_baselines.common.callbacks, 42
stable_baselines.common.cmd_util, 185
stable_baselines.common.distributions,
174
stable_baselines.common.env_checker, 189
stable_baselines.common.evaluation, 189
stable_baselines.common.policies, 64
stable_baselines.common.schedules, 187
stable_baselines.common.tf_util, 181
stable_baselines.common.vec_env, 25
stable_baselines.ddpg, 89
stable_baselines.deepq, 106
stable_baselines.gail, 118
stable_baselines.her, 124
stable_baselines.ppo1, 128
stable_baselines.ppo2, 134
stable_baselines.results_plotter, 213
stable_baselines.sac, 140
stable_baselines.td3, 154
stable_baselines.trpo_mpi, 167