Unit 5d - Deep Reinforcement Learning


Deep Reinforcement Learning
Chapter 13 from the Deep Learning Illustrated book

1
Reinforcement Learning - Introduction
 The reinforcement learning (RL) problem is the
problem faced by an agent that learns behavior
through trial-and-error interactions with its
environment. It consists of an agent that exists in an
environment described by a set S of possible states,
a set A of possible actions, and a reward (or
punishment) rt that the agent receives each time t
after it takes an action in a state. (Alternatively, the reward might not occur until after a sequence of actions has been taken.)
 The objective of an RL agent is to maximize its
cumulative reward received over its lifetime.

2
Reinforcement Learning Problem

[Figure: the agent-environment loop. At each time step t the agent observes state st and reward rt and takes action at; the environment returns reward rt+1 and next state st+1, yielding a trajectory s0 -a0-> s1 (r1) -a1-> s2 (r2) -a2-> s3 (r3) ...]

Goal: Learn to choose actions at that maximize future rewards

r1 + γ r2 + γ² r3 + …, where 0 < γ < 1 is a discount factor
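For example, a minimal sketch (an illustration, not from the slides) of computing this discounted sum for a finite list of rewards:

def discounted_return(rewards, gamma=0.9):
    # r1 + gamma*r2 + gamma^2*r3 + ...
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71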

3
Reinforcement Learning
 RL is an ML paradigm involving:
 An agent takes an action on an environment.
 Then receives two types of information:
 Reward: a scalar value that provides quantitative
feedback on the action that the agent took.
 State: This is how the environment changes in
response to an agent’s action.
 Repeating the above two steps in a loop until
reaching some terminal state.
 RL problems are sequential decision-making
problems.
4
The Cart-Pole Game
 The objective is to balance a pole on top of a cart.
 The cart can move to the left or to the right.
 Each episode of the game begins with the cart
positioned at a random point near the center and with
the pole at a random angle near vertical.
 An episode ends when either
 The pole is no longer balanced, or
 The cart touches the boundaries.
 One point of reward is provided for every time step
that the episode lasts, and the maximum number of
time steps in an episode is 200.

5
The Cart-Pole Game
 The Cart-Pole game is a popular introductory RL
problem because it’s so simple.
 It has just four pieces of state information:
1. The position of the cart along the one-
dimensional horizontal axis
2. The cart’s velocity
3. The angle of the pole
4. The pole’s angular velocity
 It has just two possible actions: move left/right.
 Contrast this with a self-driving car!
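A quick way to confirm these four state variables and two actions (a sketch using the gym package that the DQN code later in these slides relies on):

import gym

env = gym.make('CartPole-v0')
print(env.observation_space.shape)  # (4,): cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space.n)           # 2: push the cart to the left or to the right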
6
Markov Decision Processes
 Reinforcement learning problems can be defined
mathematically as something called a Markov
decision process.
 MDPs feature the so-called Markov property — an
assumption that the current time step contains all
of the pertinent information about the state of the
environment from previous time steps.
 Our agent would elect to move right or left at a given
time step t by considering only the attributes of the cart
(e.g., its location) and the pole (e.g., its angle) at that
particular time step t.

7
Markov Decision Processes
 The MDP has five components (a minimal code sketch follows this list):
1. S is the set of all possible states
2. A is the set of all possible actions
3. R is the distribution of reward given a state-action pair. The exact same state-action pair (s, a) might randomly result in different amounts of reward r on different occasions.
4. P is the distribution of the next state given a state-action pair.
5. Gamma (γ) is a hyperparameter called the discount/decay factor.
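Here is a minimal sketch (an illustration, not from the chapter) of a tiny two-state MDP written out as plain Python data:

S = ['s0', 's1']                  # set of all possible states
A = ['left', 'right']             # set of all possible actions
# R[(s, a)] and P[(s, a)] are lists of (value, probability) pairs
R = {('s0', 'left'):  [(0.0, 1.0)],
     ('s0', 'right'): [(1.0, 0.5), (0.0, 0.5)],   # same (s, a) can yield different r
     ('s1', 'left'):  [(0.0, 1.0)],
     ('s1', 'right'): [(2.0, 1.0)]}
P = {('s0', 'left'):  [('s0', 1.0)],
     ('s0', 'right'): [('s1', 1.0)],
     ('s1', 'left'):  [('s0', 1.0)],
     ('s1', 'right'): [('s1', 1.0)]}
gamma = 0.95                      # discount/decay factor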
8
Discount/decay factor

9
Agent’s Learning Task
Execute actions in environment, observe results, and
 learn optimal action policy π* : S → A that maximizes
   E[ rt + γ rt+1 + γ² rt+2 + … ]
   from any starting state in S
 here 0 ≤ γ < 1 is the discount factor for future rewards (sometimes makes sense with γ = 1)

Note something new:

 Target function is π* : S → A
 But, we have no training examples of form ⟨s, a⟩
 Training examples are of form ⟨⟨s, a⟩, r⟩
10
A “Policy”
A policy is a complete mapping from every state to the action to be taken in that state.

In a gridworld, we can consider a square to be a state. (A small code sketch of such a mapping follows the figure below.)

[Figure: a 4x3 gridworld (columns 1-4, rows 1-3) with an obstacle at (2,2), reward +1 at (4,3), and reward -1 at (4,2).]
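As a concrete illustration (hypothetical action names, not from the slides), such a mapping can be written as a dictionary from grid squares to actions:

# One complete policy for the 4x3 gridworld above: every non-terminal,
# non-obstacle square maps to an action. (2,2) is the obstacle; (4,3) and
# (4,2) are the terminal states.
policy = {
    (1, 1): 'up',    (1, 2): 'up',    (1, 3): 'right',
    (2, 1): 'left',                   (2, 3): 'right',
    (3, 1): 'left',  (3, 2): 'up',    (3, 3): 'right',
    (4, 1): 'left',
}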
11
An Example of an Optimal Policy
[Figure: an optimal policy for the 4x3 gridworld; the terminal states are (4,3) with reward +1 and (4,2) with reward -1, and (2,2) is an obstacle.]

Assumes a reward of -0.04 in all non-terminal states; the rewards for the terminal states (4,3) and (4,2) are shown. Assumes no discounting.

Note: There may be more than one optimal policy.


Can you think of another optimal policy here?

12
An Example of Trials
While Learning an Optimal Policy

[Figure: two copies of the 4x3 gridworld showing the paths followed in the first and second trials.]

Example trials on the way to learning an optimal policy:

(1,1)-0.04 (2,1)-0.04 (3,1)-0.04 (3,2)-0.04 (4,2)-1 First trial


(1,1)-0.04 (1,2)-0.04 (1,3)-0.04 (2,3)-0.04 (3,3)-0.04 (4,3)+1 Second trial

13
Maximum Trial Length

Typically one sets a maximum number of steps per trial.


The following policy gives an example why:

[Figure: the 4x3 gridworld with a policy under which the agent can wander indefinitely without reaching a terminal state, so a trial following it would never end without a step limit.]

14
Reward-Maximization in RL

15
Deep Q-Learning Networks
 The goal of RL is to find an optimal policy π* that
maximizes the discounted future reward we can
obtain.
 Even for a simple problem like the Cart-Pole game, it
is computationally intractable to definitively calculate
the maximum cumulative discounted future reward
because there are way too many possible future
states and outcomes to take into consideration.
 As a computational shortcut, we propose the Q-
learning approach for estimating what the optimal
action a in a given situation might be.

16
Value Function
 For each possible policy π the agent might adopt, we can define an evaluation function over states:
   V^π(s) = rt + γ rt+1 + γ² rt+2 + … = Σ_{i=0}^∞ γ^i rt+i
 where rt, rt+1, ... are generated by following policy π starting at state s
 Restated, the task is to learn the optimal policy π* that maximizes V^π(s):
   (∀s) π* = argmax_π V^π(s)

17
What to Learn
 We might try to have the agent learn the evaluation function V^{π*} (which we write as V*)
 It could then do a look-ahead search to choose the best action from any state s because
   π*(s) = argmax_a [ r(s,a) + γ V*(δ(s,a)) ]

A problem:
 This can work if the agent knows δ : S × A → S, and r : S × A → ℝ
 But when it doesn't, it can't choose actions this way
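To make this look-ahead concrete, here is a minimal sketch (an illustration, not from the slides; delta, r, and V are hypothetical stand-ins for a known transition function, reward function, and learned V* table):

def greedy_action(s, actions, delta, r, V, gamma=0.9):
    # pi*(s) = argmax_a [ r(s, a) + gamma * V*(delta(s, a)) ]
    return max(actions, key=lambda a: r(s, a) + gamma * V[delta(s, a)])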
18
Q-Value Function
 Define a new function very similar to V*:
   Q(s,a) = r(s,a) + γ V*(δ(s,a))
 If the agent learns Q, it can choose the optimal action even without knowing δ!
   π*(s) = argmax_a [ r(s,a) + γ V*(δ(s,a)) ]
   π*(s) = argmax_a Q(s,a)
 Q is the evaluation function the agent will learn in Deep Q-learning.

19
Training Rule to Learn Q

EXTRA – not required for Exams!


 Note Q and V* are closely related:
   V*(s) = max_a Q(s,a)
         = max_a ( r(s,a) + γ V*(δ(s,a)) )
 This allows us to write Q recursively as
   Q(s,a) = r(s,a) + γ V*(δ(s,a))
          = r(s,a) + γ max_a' Q(δ(s,a), a')
 Learning rule (a.k.a. the Bellman equation):
   Q(s,a) ← r + γ max_a' Q(s',a')
   where s' is the resulting state after doing a in s.

20
Q Learning for Deterministic Worlds

EXTRA – not required for Exams!


 For each pair s and a, initialize table entry Q(s,a) ← 0
 Observe current state s
 Do forever:
    Select an action a and execute it
    Receive immediate reward r
    Observe the new state s'
    Update the table entry for Q(s,a) as follows:
      Q(s,a) ← r + γ max_a' Q(s',a')
    s ← s'
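A minimal tabular sketch of this loop (an illustration, not from the slides; env_step(s, a) -> (r, s'), actions(s), and is_terminal(s) are hypothetical helpers for a deterministic environment):

from collections import defaultdict
import random

def q_learning(env_step, actions, start_state, is_terminal,
               gamma=0.9, n_episodes=500, max_steps=100, epsilon=0.1):
    Q = defaultdict(float)                                  # Q(s, a) <- 0
    for _ in range(n_episodes):
        s = start_state
        for _ in range(max_steps):
            if is_terminal(s):
                break
            # pick an action (epsilon-greedy, one possible "fair" strategy)
            if random.random() < epsilon:
                a = random.choice(actions(s))
            else:
                a = max(actions(s), key=lambda act: Q[(s, act)])
            r, s_next = env_step(s, a)                      # execute a, observe r and s'
            best_next = 0.0 if is_terminal(s_next) else max(
                Q[(s_next, a2)] for a2 in actions(s_next))
            Q[(s, a)] = r + gamma * best_next               # Q(s,a) <- r + gamma * max_a' Q(s',a')
            s = s_next                                      # s <- s'
    return Q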
21
Convergence

EXTRA – not required for Exams!


 Q̂ estimates converge to Q if
    underlying δ & r are deterministic
    r is bounded: |r(s,a)| < c (constant)
    pairs (s,a) are visited infinitely often
 Key idea
    Let Δn = max error at iteration n
    show that Δ decreases by a factor of γ at each visit
    Q̂(s,a) depends on an estimate + the actual reward
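Writing that step out (a sketch under the deterministic assumptions above, with s' = δ(s,a) and Δn the largest error in the table Q̂n), in LaTeX notation:

\begin{align*}
|\hat{Q}_{n+1}(s,a) - Q(s,a)|
  &= \left|\big(r + \gamma \max_{a'} \hat{Q}_n(s',a')\big) - \big(r + \gamma \max_{a'} Q(s',a')\big)\right| \\
  &= \gamma \left|\max_{a'} \hat{Q}_n(s',a') - \max_{a'} Q(s',a')\right| \\
  &\le \gamma \max_{a'} \left|\hat{Q}_n(s',a') - Q(s',a')\right| \\
  &\le \gamma \, \Delta_n
\end{align*}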

22
Experimentation strategies
EXTRA – not required for Exams!

 The algorithm does not say which action to choose
 Shall we choose the action with the highest Q(s,a)?
    What are its implications?
 Convergence
    applies for any fair strategy
    the strategy 'select a with max Q(s,a)' is not fair
 Probabilistic approaches
    actions with larger Q(s,a) get higher probability
    P(ai | s) = k^Q(s,ai) / Σj k^Q(s,aj)
    parameters to adjust the explore/exploit trade-off
    gradual variation of the parameters
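A minimal sketch of this probabilistic selection rule (an illustration; Q is a dict of estimates, and base > 1 plays the role of the constant k above):

import random

def probabilistic_action(Q, s, actions, base=2.0):
    # P(a_i | s) = base**Q(s, a_i) / sum_j base**Q(s, a_j)
    weights = [base ** Q[(s, a)] for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]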

23
Experimentation strategies

EXTRA – not required for Exams!


 Store observed δ(s,a) & r(s,a)
 Retrain on stored cases periodically
    subsequent training may have made other estimates (affecting the values on the path) more accurate
 Degree to which to replay?
    relative costs of true and simulated actions (learning requires many iterations)
 If δ & r are known
    we can use efficient dynamic programming
24
Deep Q-Learning
 We can leverage a deep Q-learning network
(DQN) to estimate what the optimal Q-value
might be.
 Q*(s,a) ≈ Q(s,a; θ)
 The optimal Q-value Q*(s,a) is being
approximated.
 The Q-value approximation function incorporates
neural network model parameters θ in addition to
its usual state and action inputs.

25
EXTRA: RL control loop

[Figure: the reinforcement learning control loop, from the DRL book by Wah Loon Keng and Laura Graesser]

26


Defining a DQN Agent 1/5
import random
import gym # Open AI Gym
import numpy as np
from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
import os
env = gym.make('CartPole-v0')
state_size = env.observation_space.shape[0] #4
action_size = env.action_space.n #2
batch_size = 32
n_episodes = 1000
output_dir = 'model_output/cartpole/'  # store the network's parameters at regular intervals
if not os.path.exists(output_dir): os.makedirs(output_dir)
27
Defining a DQN Agent 2/5
class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)
        self.gamma = 0.95
        self.epsilon = 1.0
        self.epsilon_decay = 0.995
        self.epsilon_min = 0.01
        self.learning_rate = 0.001
        self.model = self._build_model()
28
Defining a DQN Agent 3/5
    def _build_model(self):
        model = Sequential()
        model.add(Dense(32, activation='relu', input_dim=self.state_size))
        model.add(Dense(32, activation='relu'))
        model.add(Dense(self.action_size, activation='linear'))
        model.compile(loss='mse', optimizer=Adam(lr=self.learning_rate))
        return model

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))
29
Defining a DQN Agent 4/5
    def train(self, batch_size):
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                target = (reward + self.gamma *
                          np.amax(self.model.predict(next_state)[0]))
            target_f = self.model.predict(state)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

30
Defining a DQN Agent 5/5
    def act(self, state):
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        act_values = self.model.predict(state)
        return np.argmax(act_values[0])

    def save(self, name):
        self.model.save_weights(name)

    def load(self, name):
        self.model.load_weights(name)
31
DQN Agent code explained
Initialization Parameters
 state_size and action_size are set as 4 and 2.
 memory is for storing memories that can
subsequently be replayed in order to train our
DQN’s neural net. It uses a deque of size 2000.
 gamma is the discount factor (a.k.a. decay rate)
initialized as 0.95.
 epsilon is the exploration rate set as 1.
 epsilon_decay is the decay rate set as 0.995.
 epsilon_min is the minimum exploration rate, set as 0.01.
32
DQN Agent code explained
Building the Agent’s Model
 Input layer has 4 neurons (4 state information:
cart position, cart velocity, pole angle, and pole
angular velocity)
 Two hidden layers each with 32 ReLU neurons.
 The output layer has dimensionality (2)
corresponding to the number of possible actions.
 We use activation='linear' for the output layer because its outputs are Q-value estimates (unbounded real numbers), not probabilities.
 We compile with loss='mse' and optimizer=Adam.

33
DQN Agent code explained
Remembering Gameplay
 Each memory in this deque consists of five
pieces of information about timestep t:
1. The state st that the agent encountered
2. The action at that the agent took
3. The reward rt that the environment returned to the
agent
4. The next_state st+1 that the environment also
returned to the agent
5. A Boolean flag done that is true if time step t was the
final iteration of the episode, and false otherwise
34
DQN Agent code explained
Training via Memory Replay
 Randomly sample a minibatch of 32 memories
from the memory deque (which holds up to 2,000
memories).
 For each of the 32 sampled memories, we carry
out a round of model training
 If done is True, the highest possible reward that could
be attained from this timestep is equal to the reward rt.
 If done is False, we try to estimate what the target
reward —the maximum discounted future reward—
might be.
35
DQN Agent code explained
Training via Memory Replay…
 We run the predict() method on the current state and store the output (the predicted Q-values for the possible actions) in the variable target_f.
 Whichever action the agent actually took in this
memory, we use target_f[0][action] = target to replace
that target_f output with the target reward.
 We train our model by calling the fit() method.
 The model input is the current state st and its output is
target_f, which incorporates our approximation of the
maximum future discounted reward.
 epochs can be set to 1, as it is cheaper to run more episodes of the game (generating more training data) than to run more epochs on once-generated training data.

36
DQN Agent code explained
Selecting an Action to Take
 A random number is generated.
 If it is ≤ ε, a random exploratory action is selected using the randrange function.
if np.random.rand() <= self.epsilon:
    return random.randrange(self.action_size)
 Otherwise, the agent selects an action that
exploits the “knowledge” the model has
learned via memory replay.
act_values = self.model.predict(state)
return np.argmax(act_values[0]) # pick an action with max reward
37
DQN Agent code explained
Saving and Loading Model Parameters
 Finally, the save() and load() methods are one-
liners that enable us to save and load the
parameters of the model.
 Agent performance can be flaky in complex
environments: For long stretches, the agent may
perform very well in a given environment, and
then later appear to lose its capabilities entirely.
 It’s wise to save our model parameters at regular
intervals so that higher-performing parameters
from an earlier episode can be loaded back up.
38
Interacting with an OpenAI Gym Env
 We can initialize an instance of the DQN
agent class.
agent = DQNAgent(state_size, action_size)
 We can write code to enable our agent to
interact with an OpenAI Gym environment.
for e in range(n_episodes):
    state = env.reset()
    state = np.reshape(state, [1, state_size])  # reshape it into a row
    done = False
    time = 0
39
Interacting with an OpenAI Gym Env
    while not done:
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        reward = reward if not done else -10
        next_state = np.reshape(next_state, [1, state_size])
        agent.remember(state, action, reward, next_state, done)
        state = next_state
        if done:
            print("episode: {}/{}, score: {}, e: {:.2}"
                  .format(e, n_episodes-1, time, agent.epsilon))
        time += 1
40
Interacting with an OpenAI Gym Env
    if len(agent.memory) > batch_size:
        agent.train(batch_size)
    if e % 50 == 0:
        agent.save(output_dir + "weights_"
                   + '{:04d}'.format(e) + ".hdf5")

41
Interacting with an OpenAI Gym Env
 We use env.reset() to begin the episode with a
random state st.
 The while loop iterates over the timesteps until the episode ends (i.e., until done = True).
 We pass the state st into the agent’s act() method, and
this returns the agent’s action at.
 The action at is provided to the environment’s step()
method, which returns the next_state st+1, the current
reward rt, and an update to the Boolean flag done.
 If the episode is done, then we set reward to -10 so that the agent is motivated to balance the pole longer. If the episode is not done, reward is +1 for each additional time step of gameplay.

42
Interacting with an OpenAI Gym Env
 The while loop ….
 We use our agent’s remember() method to save
all the aspects of this time step (the state st,
the action taken, the reward rt, the next state
st+1, and the flag done) to memory.
 We set state equal to next_state in preparation
for the next iteration of the loop, which will be
time step t + 1.
 If the episode ends, then we print summary
metrics on the episode.

43
Interacting with an OpenAI Gym Env
 If the length of the agent’s memory deque is
larger than our batch size at the end of the while loop,
we use the agent’s train() method to train its
neural net parameters by replaying its memories
of gameplay.
if len(agent.memory) > batch_size: agent.train(batch_size)
 We have enough data for a minibatch!
 This code could be pushed inside the while loop.
 Every 50 episodes, we use the agent’s save()
method to store the neural net model’s
parameters.
44
Interacting with an OpenAI Gym Env

First few episodes:

episode: 0/999, score: 19, e: 1.0
episode: 1/999, score: 14, e: 1.0
episode: 2/999, score: 37, e: 0.99
episode: 3/999, score: 11, e: 0.99
episode: 4/999, score: 35, e: 0.99
episode: 5/999, score: 41, e: 0.98
episode: 6/999, score: 18, e: 0.98
episode: 7/999, score: 10, e: 0.97
episode: 8/999, score: 9, e: 0.97
episode: 9/999, score: 24, e: 0.96

Last few episodes:

episode: 981/999, score: 199, e: 0.01
episode: 982/999, score: 188, e: 0.01
episode: 983/999, score: 199, e: 0.01
episode: 984/999, score: 199, e: 0.01
episode: 985/999, score: 14, e: 0.01
episode: 986/999, score: 149, e: 0.01
episode: 987/999, score: 199, e: 0.01
...
episode: 998/999, score: 199, e: 0.01
episode: 999/999, score: 199, e: 0.01

45
Hyperparameter tuning with SLM Lab

 SLM Lab (available at


github.com/kengz/SLM-Lab) is a deep
reinforcement learning framework
developed by Wah Loon Keng and Laura
Graesser to conduct experiments playing
with different ideas and environment
libraries, such as OpenAI Gym and Unity.
 In particular, it can be used for tuning
hyperparameters.

46
Hyperparameter tuning with SLM Lab

47
Hyperparameter tuning with SLM Lab

 The following hyperparameter settings are optimal for our DQN agent playing the Cart-Pole game (a code sketch follows this list):
 A single hidden layer with 64 neurons.
 The tanh activation function for the hidden
layer neurons.
 A low learning rate of ~0.02.
 Trials with an exploration rate (ϵ) that anneals
over 10 episodes outperform trials that anneal
over 50 or 100 episodes.
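As a rough illustration (a sketch reusing the Keras API from the earlier DQN agent, not code from SLM Lab or the book), the tuned network would look something like this:

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

def build_tuned_model(state_size=4, action_size=2, learning_rate=0.02):
    model = Sequential()
    model.add(Dense(64, activation='tanh', input_dim=state_size))  # single hidden layer, 64 tanh neurons
    model.add(Dense(action_size, activation='linear'))
    model.compile(loss='mse', optimizer=Adam(lr=learning_rate))
    return model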
48
Agents Beyond DQN
 Deep Q-learning networks are relatively simple
and make efficient use of the training samples
that are available to them. That said, DQN agents do have some drawbacks.
 If the possible number of state-action pairs is
large in a given environment, it becomes
intractable to estimate the optimal Q-value.
 Even in situations where finding Q* is
computationally tractable, DQNs may not
converge on Q*.

49
Agents Beyond DQN

[Figure: a taxonomy of deep RL agents]

Value optimization: e.g., DQN and its variants
Policy optimization: e.g., REINFORCE
Model optimization: e.g., Monte Carlo
Imitation learning
Actor-critic (e.g., A2C, A3C) and combined approaches (e.g., AlphaGo)
50
Agents Beyond DQN
 Value optimization: solve RL problems by
optimizing value functions.
 Policy optimization: solve RL problems by
directly learning the policy function.
 Model optimization: solve RL problems by
learning to predict future states based on (s, a) at
a given time step.
 Imitation learning: mimic behaviors that are
taught to them through demonstration, e.g., how
to place dinner plates on a dish rack or how to
pour water into a cup.
51
Chapter Summary
 Covered the essential theory of reinforcement
learning, including Markov decision processes.
 Built a deep Q-learning agent that solved the
Cart-Pole environment.
 Introduced deep RL algorithms beyond DQN
such as REINFORCE and actor-critic.
 Described SLM Lab — a deep RL framework
with existing algorithm implementations as well
as tools for optimizing agent hyperparameters.

52
