Unit 5d - Deep Reinforcement Learning
Based on Chapter 13 of the book
Deep Learning Illustrated
Reinforcement Learning - Introduction
The reinforcement learning (RL) problem is the
problem faced by an agent that learns behavior
through trial-and-error interactions with its
environment. It consists of an agent that exists in an
environment described by a set S of possible states,
a set A of possible actions, and a reward (or
punishment) rt that the agent receives at each time
step t after it takes an action in a state. (Alternatively,
the reward might not arrive until after a sequence of
actions has been taken.)
The objective of an RL agent is to maximize its
cumulative reward received over its lifetime.
Reinforcement Learning Problem
[Figure: the agent-environment loop. At each time step t, the agent
observes state st and reward rt and takes action at; the environment
returns reward rt+1 and next state st+1, yielding a trajectory
s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, …]
Reinforcement Learning
RL is an ML paradigm involving:
An agent that takes an action on an environment.
The agent then receives two types of information:
Reward: a scalar value that provides quantitative
feedback on the action that the agent took.
State: how the environment changes in
response to the agent's action.
These two steps repeat in a loop until some
terminal state is reached.
RL problems are sequential decision-making
problems; a minimal sketch of this loop follows below.
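The sketch (illustrative only; the Gym-style env with reset()/step() and the agent object are assumptions, not code from these slides):

    state = env.reset()            # initial state from the environment
    done = False
    while not done:
        action = agent.act(state)  # agent chooses an action
        # environment returns the next state, a scalar reward, and a done flag
        state, reward, done, info = env.step(action)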
The Cart-Pole Game
The objective is to balance a pole on top of a cart.
The cart can move to the left or to the right.
Each episode of the game begins with the cart
positioned at a random point near the center and with
the pole at a random angle near vertical.
An episode ends when either
The pole is no longer balanced, or
The cart touches the boundaries.
One point of reward is provided for every time step
that the episode lasts, and the maximum number of
time steps in an episode is 200.
The Cart-Pole Game
The Cart-Pole game is a popular introductory RL
problem because it’s so simple.
It has just four pieces of state information:
1. The position of the cart along the one-
dimensional horizontal axis
2. The cart’s velocity
3. The angle of the pole
4. The pole’s angular velocity
It has just two possible actions: move left/right.
Contrast this with a self-driving car!
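As a concrete illustration (assumed; this is standard OpenAI Gym usage rather than code from these slides), we can inspect the two spaces directly:

    import gym

    env = gym.make("CartPole-v0")
    print(env.observation_space)  # Box(4,): cart position, cart velocity,
                                  #   pole angle, pole angular velocity
    print(env.action_space)       # Discrete(2): push cart left or right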
Markov Decision Processes
Reinforcement learning problems can be defined
mathematically as something called a Markov
decision process.
MDPs feature the so-called Markov property — an
assumption that the current time step contains all
of the pertinent information about the state of the
environment from previous time steps.
Our agent would elect to move right or left at a given
time step t by considering only the attributes of the cart
(e.g., its location) and the pole (e.g., its angle) at that
particular time step t.
Markov Decision Processes
The MDP has five components:
1. S is the set of all possible states
2. A is the set of all possible actions
3. R is the distribution of reward given a state-
action pair. The exact same state-action pair
(s, a) might randomly result in different
amounts of reward r on different occasions.
4. P is the distribution of next state given a state-
action pair.
5. Gamma (γ) is a hyperparameter called the
discount/decay factor. (A toy sketch of these
components as data follows below.)
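As a toy illustration (assumed, not from the slides), the five components can be written down as plain Python data for a two-state world:

    S = ["s0", "s1"]               # 1. set of all possible states
    A = ["left", "right"]          # 2. set of all possible actions
    # 3. R: reward distribution for a state-action pair; here
    #    (s0, right) yields reward 0 or 1 with equal probability
    R = {("s0", "right"): {0: 0.5, 1: 0.5}}
    # 4. P: next-state distribution for a state-action pair
    P = {("s0", "right"): {"s1": 1.0}}
    gamma = 0.9                    # 5. discount/decay factor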
Discount/decay factor
The discount factor γ (with 0 ≤ γ ≤ 1) controls how much the
agent values future rewards relative to immediate ones: a reward
received k time steps in the future is weighted by γ^k, so smaller
values of γ make the agent more short-sighted.
Agent’s Learning Task
Execute actions in the environment, observe the results,
and learn an optimal action policy π* : S → A that maximizes
E[rt + γ rt+1 + γ² rt+2 + …]
from any starting state in S,
where 0 ≤ γ < 1 is the discount factor for future rewards
(sometimes it makes sense to use γ = 1).
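For instance, this tiny helper (a hypothetical sketch, not from the slides) computes the discounted sum defined above:

    def discounted_return(rewards, gamma=0.9):
        # G = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
        return sum((gamma ** i) * r for i, r in enumerate(rewards))

    # one point of reward per time step, as in Cart-Pole:
    print(discounted_return([1, 1, 1, 1]))  # 1 + 0.9 + 0.81 + 0.729 = 3.439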
[Figure: a 4×3 grid world; the top-right cell gives reward +1, the
cell below it gives reward -1, and there is an obstacle in the
middle of the grid.]
An Example of an Optimal Policy
[Figure: the grid world with an arrow in each cell indicating the
optimal action; the +1 and -1 cells are terminal states.]
An Example of Trials
While Learning an Optimal Policy
[Figure: two copies of the grid world showing the agent's paths
during the first and second trials of learning.]
Maximum Trial Length
[Figure: the same 4×3 grid world, used to illustrate the maximum
length of a trial.]
Reward-Maximization in RL
Deep Q-Learning Networks
The goal of RL is to find an optimal policy π* that
maximizes the discounted future reward we can
obtain.
Even for a simple problem like the Cart-Pole game, it
is computationally intractable to definitively calculate
the maximum cumulative discounted future reward
because there are way too many possible future
states and outcomes to take into consideration.
As a computational shortcut, we propose the Q-
learning approach for estimating what the optimal
action a in a given situation might be.
Value Function
For each possible policy π the agent might
adopt, we can define an evaluation function
over states
Vπ(s) = rt + γ rt+1 + γ² rt+2 + … = Σ (i=0 to ∞) γ^i rt+i
where rt, rt+1, ... are generated by following
policy π starting at state s
Restated, the task is to learn the optimal policy
π* that maximizes Vπ(s) for every state s:
π* = argmaxπ Vπ(s), ∀s
What to Learn
We might try to have the agent learn the evaluation
function Vπ* (which we write as V*)
It could then do a look-ahead search to choose
the best action from any state s (see the sketch below), because
π*(s) = argmaxa [ r(s,a) + γ V*(δ(s,a)) ]
A problem:
This works only if the agent knows the transition
function δ : S × A → S and the reward function r : S × A → ℝ
When it doesn't, it can't choose actions this
way
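In code, the look-ahead rule might look like this (a sketch with assumed r, delta, and V_star objects; none of these names come from the slides):

    # One-step look-ahead: pick the action maximizing the immediate
    # reward plus the discounted value of the resulting state.
    def greedy_action(s, actions, r, delta, V_star, gamma=0.9):
        return max(actions, key=lambda a: r(s, a) + gamma * V_star[delta(s, a)])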
Q-Value Function
Define a new function very similar to V*:
Q(s,a) = r(s,a) + γ V*(δ(s,a))
If the agent learns Q, it can choose the optimal
action even without knowing δ!
π*(s) = argmaxa [ r(s,a) + γ V*(δ(s,a)) ]
π*(s) = argmaxa Q(s,a)
Q is the evaluation function the agent
will learn in deep Q-learning.
Training Rule to Learn Q
After a step that takes the agent from state s to state s′ via
action a with reward r, the estimate Q̂ is updated as
Q̂(s,a) ← r + γ maxa′ Q̂(s′,a′)
Q Learning for Deterministic Worlds
For a deterministic world, the algorithm initializes a table Q̂(s,a)
to zero and then loops: select an action, execute it, receive the
reward, observe the new state, and apply the training rule above.
A minimal sketch follows below.
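A sketch of one episode of tabular Q-learning (illustrative only; the Gym-style env with reset()/step() and the actions list are assumptions, not code from the slides):

    import random
    from collections import defaultdict

    Q = defaultdict(float)      # Q-hat table; unseen entries default to 0
    gamma, epsilon = 0.9, 0.1

    def q_learning_episode(env, actions):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy experimentation strategy
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done, _ = env.step(action)
            # training rule: Q(s,a) <- r + gamma * max_a' Q(s',a')
            Q[(state, action)] = reward + gamma * max(
                Q[(next_state, a)] for a in actions)
            state = next_state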
Experimentation strategies
EXTRA – not required for Exams!
A common strategy is ε-greedy: with probability ε the agent
explores by taking a random action; otherwise it exploits its
current estimates by taking the action with the highest estimated
Q-value (this is exactly what the DQN agent's act() method does below).
EXTRA: RL control loop
Defining a DQN Agent 4/5
def train(self, batch_size):
    # sample a random minibatch of memories for experience replay
    minibatch = random.sample(self.memory, batch_size)
    for state, action, reward, next_state, done in minibatch:
        target = reward  # if the episode ended, the target is just r_t
        if not done:
            # otherwise bootstrap: r_t + gamma * max_a' Q(s_{t+1}, a')
            target = (reward + self.gamma *
                      np.amax(self.model.predict(next_state)[0]))
        target_f = self.model.predict(state)  # current Q estimates
        target_f[0][action] = target          # overwrite the taken action's value
        self.model.fit(state, target_f, epochs=1, verbose=0)
    # gradually shift from exploration toward exploitation
    if self.epsilon > self.epsilon_min:
        self.epsilon *= self.epsilon_decay
Defining a DQN Agent 5/5
def act(self, state):
    # explore: with probability epsilon, pick a random action
    if np.random.rand() <= self.epsilon:
        return random.randrange(self.action_size)
    # exploit: otherwise pick the action with the highest predicted Q-value
    act_values = self.model.predict(state)
    return np.argmax(act_values[0])
DQN Agent code explained
Remembering Gameplay
Each memory in this deque consists of five
pieces of information about timestep t:
1. The state st that the agent encountered
2. The action at that the agent took
3. The reward rt that the environment returned to the
agent
4. The next_state st+1 that the environment also
returned to the agent
5. A Boolean flag done that is true if time step t was the
final iteration of the episode, and false otherwise.
(A sketch of the corresponding remember() method follows below.)
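The deque and the remember() method are likely along these lines (an assumed sketch following the book's Keras DQN pattern; the slides omit this part of the class):

    # in __init__: a bounded buffer that holds up to 2,000 memories
    self.memory = deque(maxlen=2000)

    def remember(self, state, action, reward, next_state, done):
        # append one five-tuple memory for this time step
        self.memory.append((state, action, reward, next_state, done))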
DQN Agent code explained
Training via Memory Replay
Randomly sample a minibatch of 32 memories
from the memory deque (which holds up to 2,000
memories).
For each of the 32 sampled memories, we carry
out a round of model training
If done is True, the highest possible reward that could
be attained from this timestep is equal to the reward rt.
If done is False, we try to estimate what the target
reward —the maximum discounted future reward—
might be.
DQN Agent code explained
Training via Memory Replay…
We run the predict() method on the current state and
store the output (possible actions) in the variable
target_f.
Whichever action the agent actually took in this
memory, we use target_f[0][action] = target to replace
that target_f output with the target reward.
We train our model by calling the fit() method.
The model input is the current state st and its output is
target_f, which incorporates our approximation of the
maximum future discounted reward.
epochs can be set to 1, because it is cheaper to generate
more training data by running more episodes of the game
than to run more epochs on data that was generated once.
DQN Agent code explained
Selecting an Action to Take
A random number is generated.
If it is ≤ ε (epsilon), a random exploratory action is
selected using the randrange function.
if np.random.rand() <= self.epsilon:
return random.randrange(self.action_size)
Otherwise, the agent selects an action that
exploits the “knowledge” the model has
learned via memory replay.
act_values = self.model.predict(state)
return np.argmax(act_values[0]) # pick an action with max reward
DQN Agent code explained
Saving and Loading Model Parameters
Finally, the save() and load() methods are one-
liners that enable us to save and load the
parameters of the model.
Agent performance can be flaky in complex
environments: For long stretches, the agent may
perform very well in a given environment, and
then later appear to lose its capabilities entirely.
It’s wise to save our model parameters at regular
intervals so that higher-performing parameters
from an earlier episode can be loaded back up.
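These one-liners are presumably thin wrappers around Keras weight serialization, along these lines (a sketch, since the slides don't show the method bodies):

    def save(self, name):
        self.model.save_weights(name)   # write model parameters to disk

    def load(self, name):
        self.model.load_weights(name)   # restore previously saved parameters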
Interacting with an OpenAI Gym Env
We can initialize an instance of the DQN
agent class.
agent = DQNAgent(state_size, action_size)
We can write code to enable our agent to
interact with an OpenAI Gym environment.
for e in range(n_episodes):
    state = env.reset()
    state = np.reshape(state, [1, state_size])  # reshape into a row vector
    done = False
    time = 0
Interacting with an OpenAI Gym Env
    while not done:
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        reward = reward if not done else -10  # penalize ending the episode
        next_state = np.reshape(next_state, [1, state_size])
        agent.remember(state, action, reward, next_state, done)
        state = next_state
        if done:
            print("episode: {}/{}, score: {}, e: {:.2}"
                  .format(e, n_episodes-1, time, agent.epsilon))
        time += 1
Interacting with an OpenAI Gym Env
    if len(agent.memory) > batch_size:
        agent.train(batch_size)  # replay memories to train the model
    if e % 50 == 0:
        agent.save(output_dir + "weights_"
                   + '{:04d}'.format(e) + ".hdf5")
Interacting with an OpenAI Gym Env
We use env.reset() to begin the episode with a
random state st.
A while loop iterates over the time steps
until the episode ends (i.e., until done = True).
We pass the state st into the agent’s act() method, and
this returns the agent’s action at.
The action at is provided to the environment’s step()
method, which returns the next_state st+1, the current
reward rt, and an update to the Boolean flag done.
If the episode is done, then we set reward to -10 so that
the agent is motivated to balance the pole longer. If
the episode is not done, reward is +1 for each additional
time step of gameplay.
Interacting with an OpenAI Gym Env
Inside the while loop (continued):
We use our agent’s remember() method to save
all the aspects of this time step (the state st,
the action taken, the reward rt, the next state
st+1, and the flag done) to memory.
We set state equal to next_state in preparation
for the next iteration of the loop, which will be
time step t + 1.
If the episode ends, then we print summary
metrics on the episode.
Interacting with an OpenAI Gym Env
If the length of the agent's memory deque is
larger than our batch size at the end of the while loop,
we use the agent’s train() method to train its
neural net parameters by replaying its memories
of gameplay.
if len(agent.memory) > batch_size: agent.train(batch_size)
We have enough data for a minibatch!
This code could be pushed inside the while loop.
Every 50 episodes, we use the agent’s save()
method to store the neural net model’s
parameters.
Hyperparameter tuning with SLM Lab
SLM Lab is a deep RL framework that provides implementations of
existing RL algorithms along with tools for running experiments
that optimize agent hyperparameters.
Agents Beyond DQN
[Figure: a taxonomy of deep RL agents: value optimization, policy
optimization, model optimization, and imitation learning.]
Agents Beyond DQN
Value optimization: solve RL problems by
optimizing value functions.
Policy optimization: solve RL problems by
directly learning the policy function.
Model optimization: solve RL problems by
learning to predict future states based on (s, a) at
a given time step.
Imitation learning: agents mimic behaviors that are
taught to them through demonstration, e.g., how
to place dinner plates on a dish rack or how to
pour water into a cup.
Chapter Summary
Covered the essential theory of reinforcement
learning, including Markov decision processes.
Built a deep Q-learning agent that solved the
Cart-Pole environment.
Introduced deep RL algorithms beyond DQN
such as REINFORCE and actor-critic.
Described SLM Lab — a deep RL framework
with existing algorithm implementations as well
as tools for optimizing agent hyperparameters.