Abstract—Existing inefficient traffic light control causes numerous problems, such as long delay and waste of energy. To improve efficiency, taking real-time traffic information as an input and dynamically adjusting the traffic light duration accordingly is essential. In terms of how to dynamically adjust traffic signals' duration, existing works either split the traffic signal into equal durations or extract only limited traffic information from the real data. In this paper, we study how to decide the traffic signals' duration based on the collected data from different sensors and vehicular networks. We propose a deep reinforcement learning model to control the traffic light. In the model, we quantify the complex traffic scenario as states by collecting data and dividing the whole intersection into small grids. The timing changes of a traffic light are the actions, which are modeled as a high-dimension Markov decision process. The reward is the cumulative waiting time difference between two cycles. To solve the model, a convolutional neural network is employed to map states to expected future rewards. The proposed model is composed of several components to improve the performance, such as the dueling network, target network, double Q-learning network, and prioritized experience replay. We evaluate our model via simulation in the Simulation of Urban MObility (SUMO) with a vehicular network, and the simulation results show the efficiency of our model in controlling traffic lights.

Index Terms—reinforcement learning, deep learning, traffic light control, vehicular network

X. Liang and G. Wang are with the Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, 07102 USA; email: {xl367, gwang}@njit.edu.
X. Du and Z. Han are with the Department of Electrical and Computer Engineering, University of Houston, Houston, TX 77004 USA; email: {xunshengdu, hanzhu22}@gmail.com.
Manuscript received January 22, 2018.

I. INTRODUCTION

Existing road intersection management is done through traffic lights. Inefficient traffic light control causes numerous problems, such as long delay of travelers, huge waste of energy, and worsening air quality. In some cases, it may also contribute to vehicular accidents [1], [2]. Existing traffic light control either deploys fixed programs without considering real-time traffic or considers the traffic only to a very limited degree [3]. The fixed programs set equal time durations for the traffic signals in every cycle, or different durations based on historical information. Some other control programs take inputs from sensors, such as underground inductive loop detectors, to detect the existence of vehicles in front of traffic lights. The inputs are processed in a very coarse way to determine the duration of green/red lights.

In some cases, existing traffic light control systems work, though at a low efficiency. However, in many other cases, such as a football event or a more common high-traffic-hour scenario, the traffic light control systems become paralyzed. Instead, we often witness policemen directly managing the intersection with hand signals. Such a human operator can see the real-time traffic condition on the intersecting roads and smartly determine the duration of the allowed passing time for each direction using his/her long-term experience and understanding of the intersection. The operation is normally very effective. This observation motivates us to propose a smart intersection traffic light management system which can take the real-time traffic condition as input and learn how to manage the intersection just like the human operator.

To implement such a system, we need ‘eyes’ to watch the real-time road condition and ‘a brain’ to process it. For the former, recent advances in sensor and networking technology enable taking real-time traffic information as input, such as the number of vehicles, the locations of vehicles, and their waiting time [4]. For the ‘brain’ part, reinforcement learning, as a type of machine learning technique, is a promising way to solve the problem. A reinforcement learning system's goal is to make an action agent learn the optimal policy by interacting with the environment to maximize the reward, e.g., minimizing the waiting time in our intersection control scenario. It usually contains three components: the states of the environment, the action space of the agent, and the reward from every action [5]. A well-known application of reinforcement learning is AlphaGo [6], including AlphaGo Zero [7]. AlphaGo, acting as the action agent in a Go game (the environment), first observes the current image of the chessboard (the state), and takes the image as the input of a reinforcement learning model to determine where to place the next playing piece, a ‘stone’ (the action). Its final reward is to win or lose the game. Thus, the reward may not be obvious during play and is delayed until the game is over. When applying reinforcement learning to the traffic light control problem, the key point is to define the three components at an intersection and quantify them to be computable.

Some researchers have proposed to dynamically control the traffic lights using reinforcement learning. Early works define the states by the number of waiting vehicles or the waiting queue length [4], [8]. But the real traffic situation cannot be accurately captured by the number of waiting vehicles or the queue length [2]. With the popularization of vehicular networks and cameras, more information about roads can be extracted and transmitted via the network, such as vehicles' speed and waiting time [9]. However, more information dramatically increases the number of states. When
the number of states increases, the complexity of a traditional reinforcement learning system grows exponentially. With the rapid development of deep learning, deep neural networks have been employed to deal with the large number of states, which constitutes a deep reinforcement learning model [10]. A few recent studies have proposed to apply deep reinforcement learning to the traffic light control problem [11], [12]. But there are two main limitations in the existing studies: (1) the traffic signals are usually split into fixed-time intervals, and the duration of green/red lights can only be a multiple of this fixed-length interval, which is not efficient in many situations; (2) the traffic signals are designed to change in a random sequence, which is neither a safe nor a comfortable way for drivers. In this paper, we study how to control the traffic light's signal duration in a cycle based on information extracted from vehicular networks, to help efficiently manage vehicles at an intersection.

In this paper, we solve the problem with the following approach and make the following contributions. Our general idea is to mimic an experienced operator who controls the signal duration in every cycle based on the information gathered from vehicular networks. To implement such an idea, the experienced operator's operation is modeled as a Markov Decision Process (MDP). The MDP is a high-dimension model, which contains the time duration of every phase. The system then learns the control strategy based on the MDP by trial and error in a deep reinforcement learning model. To fit a deep reinforcement learning model, we divide the whole intersection into grids and build a matrix from the vehicles' information in the grids, collected by vehicular networks or extracted from a camera via image processing. The matrix is defined as the state, and the reward is the cumulative waiting time difference between two cycles. In our model, a convolutional neural network is employed to map states to expected future rewards. In the traffic light control problem, every traffic light's action may affect the environment, and the traffic flow changes dynamically, which makes the environment unpredictable. Thus, it is hard for a plain convolutional network to predict the reward accurately. Inspired by recent studies in reinforcement learning, we employ a series of state-of-the-art techniques in our model to improve the performance, including the dueling network [13], target network [10], double Q-learning network [14], and prioritized experience replay [15]. In this paper, we combine these techniques as a framework to solve our problem, which can be easily applied to other problems. Our system is tested on a traffic micro-simulator, Simulation of Urban MObility (SUMO) [16], and the simulation results show the effectiveness and high efficiency of our model.

The remainder of this paper is organized as follows. The literature review is presented in Section II. The model and problem statement are introduced in Section III. The background on reinforcement learning is introduced in Section IV. Section V shows the details of modeling a reinforcement learning model for the traffic light control system in vehicular networks. Section VI extends the reinforcement learning model into a deep learning model to handle the complex states in our system. The model is evaluated in Section VII. Finally, the paper is concluded in Section VIII.

II. LITERATURE REVIEW

Previous works have been done to dynamically control adaptive traffic lights. But due to the limited computing power and simulation tools, early studies focus on solving the problem by fuzzy logic [17], linear programming [18], etc. In these works, road traffic is modeled with limited information, which cannot be applied at a large scale.

Reinforcement learning has been applied to traffic light control since the 1990s. El-Tantawy et al. [4] summarize the methods from 1997 to 2010 that use reinforcement learning to control traffic light timing. During this period, the reinforcement learning techniques are limited to tabular Q-learning, and a linear function is normally used to estimate the Q value. Due to the limitations of reinforcement learning techniques at the time, these works usually use a small state space, such as the number of waiting vehicles [8], [19], [20] and the statistics of traffic flow [21], [22]. The complexity of a road traffic system cannot be truly represented by such limited information. When much relevant information is omitted from the limited states, the agent can hardly act optimally in traffic light control [2].

With the development of deep learning and reinforcement learning, they are combined as deep reinforcement learning to estimate the Q value. We summarize the recent studies that use value-based deep reinforcement learning to control traffic lights in Table I. There are three limitations in these previous studies. Firstly, most of them test their models in a simple cross-shape intersection with through traffic only [11], [12]. Secondly, none of the previous works determines the traffic signal timing in a whole cycle. Thirdly, deep reinforcement learning is a fast developing field, where many new ideas have been proposed in the past two years, such as the dueling deep Q network [13], but they have not been applied to traffic control. In this paper, we make the following progress. Firstly, our intersection scenario contains multiple phases, which corresponds to a high-dimension action space in a cycle. Secondly, our model guarantees that the traffic signal timing changes smoothly between two neighboring actions, which is exactly defined in the MDP model. Thirdly, we employ the state-of-the-art techniques in value-based reinforcement learning algorithms to achieve good performance, which is evaluated via simulation.

III. MODEL AND PROBLEM STATEMENT

In this paper, we consider a road intersection scenario where traffic lights are used to control traffic flows. The model is shown in Fig. 1. The left side shows the structure of a traffic light. The traffic light first gathers road traffic information via a vehicular network [9], which is represented by the dashed purple lines in the figure. The traffic light processes the data to obtain the road traffic's state and reward, which has been assumed in many previous studies [2], [12], [23]. The traffic light chooses an action based on the current state and reward using a deep neural network shown on the right side. The left side is the reinforcement learning part and the deep learning
part. They make up our deep reinforcement learning model in traffic light control.

TABLE I
LIST OF PREVIOUS STUDIES THAT USE VALUE-BASED DEEP REINFORCEMENT LEARNING TO ADAPTIVELY CONTROL TRAFFIC SIGNALS

Fig. 1. The traffic light control model in our system. The left side shows the intersection scenario where the traffic light gathers vehicles' information via a vehicular network and it is controlled by the reinforcement learning model; the right side shows a deep neural network to help the traffic light choose an action.

In our model, traffic lights are used to manage the traffic flows at intersections. A traffic light at an intersection has three signals: green, yellow, and red. One traffic light may not be enough to manage all the vehicles when there are vehicles from multiple directions at an intersection. Thus, multiple traffic lights need to cooperate at a multi-direction intersection. At such an intersection, the traffic signal guides vehicles from non-conflicting directions at one time by changing the traffic lights' statuses. One status is one of the legal combinations of all traffic lights' red and green signals, omitting the yellow signals. The time duration of staying at one status is called one phase. The number of phases is decided by the number of legal statuses at an intersection. All the phases cyclically change in a fixed sequence to guide vehicles to pass the intersection. It is called a cycle when the phases repeat once. The sequence of phases in a cycle is fixed, but the duration of every phase is adaptive. If one phase needs to be skipped, its duration can be set to 0 seconds. In our problem, we dynamically adjust the duration of every phase to deal with different traffic situations at an intersection.

Our problem is how to optimize the efficiency of the intersection usage by dynamically changing every phase's duration of a traffic light via learning from historical experiences. The general idea is to extend the duration of the phase that has more vehicles in its direction. But it is time-consuming to train a person to become a master who knows well how much time should be given to a phase based on the current traffic situation. Reinforcement learning is a possible way to learn how to control the traffic light and liberate a human being from the learning process. Reinforcement learning updates its model by continuously receiving states and rewards from the environment. The model gradually becomes mature and advanced. It differs from supervised learning in that it does not require a large amount of data at one time. In this paper, we employ deep reinforcement learning to learn the timing strategy of every phase to optimize the traffic management.

IV. BACKGROUND ON REINFORCEMENT LEARNING

Reinforcement learning is one category of algorithms in machine learning, which is different from supervised learning and unsupervised learning [5]. It interacts with the environment to get rewards from actions. Its goal is to take actions that maximize the numerical reward in the long run. In reinforcement learning, an agent, the action executor, takes an action, and the environment returns a numerical reward based on the action and the current state. A four-tuple $\langle S, A, R, T \rangle$ can be used to denote the reinforcement learning model with the following meanings:
• S: the possible state space. s is a specific state (s ∈ S);
• A: the possible action space. a is an action (a ∈ A);
• R: the reward space. $r_{s,a}$ means the reward of taking action a at state s;
• T: the transition function space among all states, which gives the probability of transitioning from one state to another.

In a deterministic model, T is usually omitted.

A policy is made up of a series of consecutive actions. The goal in reinforcement learning is to learn an optimal policy to maximize the cumulative expected rewards starting from the current state. Generally speaking, the agent at one specific state s takes an action a to reach state s′ and gets a reward r, which is denoted by $\langle s, a, r, s' \rangle$. Let t denote the t-th step in the policy π. The cumulative reward in the future by taking action a at state s is defined by Q(s, a) in the following equation,

$$Q^{\pi}(s, a) = E\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \mid s_t = s, a_t = a, \pi \right] = E\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s_t = s, a_t = a, \pi \right]. \quad (1)$$
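As a concrete illustration of (1), the snippet below is a minimal numpy sketch (not the authors' code) that computes the discounted cumulative reward for a finite sequence of rewards; the reward values and discount factor are assumed for illustration only.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Discounted cumulative reward in (1) for a finite episode:
    sum_k gamma^k * r_{t+k}, with r_{t+k} taken from `rewards`."""
    rewards = np.asarray(rewards, dtype=np.float64)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# Example: three steps of hypothetical negative waiting-time rewards.
print(discounted_return([-10.0, -6.0, -2.0], gamma=0.99))
```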
(b) The corresponding position matrix on this road
Fig. 2. The process to build the state matrix.

A. States

We define the states based on two pieces of information: the position and speed of vehicles at an intersection. Through a vehicular network, vehicles' position and speed can be obtained [9]. Then the traffic light can extract a virtual snapshot image of the current intersection. The whole intersection is divided into same-size small square-shape grids. The length of the grids, c, should guarantee that no two vehicles can be held in the same grid and that one entire vehicle can be put into a grid, to reduce computation. The value of c in our system will be given in the evaluation. In every grid, the state value is a two-value vector $\langle position, speed \rangle$ of the vehicle inside. The position dimension is a binary value, which denotes whether there is a vehicle in the grid. If there is a vehicle in a grid, the value in the grid is 1; otherwise, it is 0. The speed dimension is an integer value, denoting the vehicle's current speed in m/s.

Let's take Fig. 2 as an example to show how to quantify the intersection to obtain the state values. Fig. 2(a) shows a snapshot of the traffic status at a simple one-lane four-way intersection, which is built with information from a vehicular network. The intersection is split into square-shape grids. The position matrix has the same size as the grids, which is shown in Fig. 2(b). In the matrix, one cell corresponds to one grid in Fig. 2(a). The blank cells mean no vehicle in the corresponding grid, and they are 0. The other cells, with vehicles inside, are set to 1.0. The value in the speed dimension is built in a similar way. If there is a vehicle in the grid, the corresponding value is the vehicle's speed; otherwise, it is 0.
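To make the state encoding concrete, the sketch below is an illustrative numpy example, not the authors' implementation; the grid length c, the intersection extent, and the vehicle list are assumed values.

```python
import numpy as np

def build_state(vehicles, area=60.0, c=5.0):
    """Build the <position, speed> state matrices for one intersection.

    vehicles: list of (x, y, speed) with x, y in meters inside an
              area x area square around the intersection (assumed layout).
    c:        grid length in meters (no two vehicles share a grid).
    Returns an (n, n, 2) array: channel 0 is the binary position matrix,
    channel 1 is the speed matrix in m/s.
    """
    n = int(area / c)
    state = np.zeros((n, n, 2), dtype=np.float32)
    for x, y, speed in vehicles:
        i, j = int(y // c), int(x // c)
        if 0 <= i < n and 0 <= j < n:
            state[i, j, 0] = 1.0      # position dimension
            state[i, j, 1] = speed    # speed dimension
    return state

# Example: two vehicles reported by the vehicular network (assumed data).
s = build_state([(12.3, 40.1, 8.0), (33.0, 21.5, 0.0)])
print(s.shape)  # (12, 12, 2)
```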
B. Action Space

A traffic light needs to choose an appropriate action to guide vehicles at the intersection well, based on the current traffic state. In this system, the action space is defined by selecting every phase's duration in the next cycle. But if the duration changes a lot between two cycles, the system may become unstable. Thus, the legal phase durations at the current state should change smoothly. We model the duration changes of the legal phases between two neighboring cycles as a high-dimension MDP. In the model, the traffic light only changes one phase's duration by a small step.
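As a sketch of this action space, the snippet below enumerates the duration tuples reachable from the current cycle, i.e., keeping all durations or changing exactly one phase by a small step. The four-phase cycle and the 5-second step are the ones detailed in the next paragraph; the bounds on a phase duration are assumed values.

```python
def tentative_actions(durations, step=5, t_min=0, t_max=60):
    """Enumerate legal next-cycle duration tuples from the current one.

    durations: current cycle, e.g. (t1, t2, t3, t4).
    Returns at most 2 * len(durations) + 1 = 9 tuples for four phases:
    the unchanged cycle plus one phase changed by +/- step seconds,
    clipped to an assumed [t_min, t_max] range.
    """
    actions = [tuple(durations)]                 # keep the current timing
    for i in range(len(durations)):
        for delta in (-step, +step):
            t = durations[i] + delta
            if t_min <= t <= t_max:              # drop illegal durations
                nxt = list(durations)
                nxt[i] = t
                actions.append(tuple(nxt))
    return actions

print(tentative_actions((25, 10, 25, 10)))       # 9 candidate cycles
```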
Let's take the intersection in Fig. 2(a) as an example. At the intersection, there are four phases: north-south green, east-north & west-south green, east-west green, and east-south & west-north green. The other unmentioned directions are red by default. Let's omit the yellow signals here, which will be presented later. Let a four-tuple $\langle t_1, t_2, t_3, t_4 \rangle$ denote the durations of the four phases in the current cycle. The legal actions in the next cycle are shown in Fig. 3. In the figure, one circle means the durations of the four phases in one cycle. We discretize the time change from the current cycle to the succeeding cycle to 5 seconds. The duration of one and only
Fig. 4. The architecture of the deep convolutional neural network to approximate the Q value.

layer selects the salient values from a local patch of units to replace the whole patch. The pooling process removes less important information and reduces the dimensionality. The activation function decides how a unit is activated. The most common way is to apply a non-linear function to the output. In this paper, we employ the leaky ReLU [25] as the activation function, with the following form (let x denote the output of a unit),

$$f(x) = \begin{cases} x, & \text{if } x > 0, \\ \beta x, & \text{if } x \leq 0. \end{cases} \quad (6)$$

β is a small constant to avoid a zero gradient on the negative side. The leaky ReLU can converge faster than other activation functions, like tanh and sigmoid, and prevents the generation of ‘dead’ neurons from the regular ReLU.
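A direct transcription of (6) as a tiny numpy sketch; the slope β = 0.01 is an assumed value, not the paper's setting.

```python
import numpy as np

def leaky_relu(x, beta=0.01):
    """Leaky ReLU in (6): identity for positive inputs, beta * x otherwise."""
    x = np.asarray(x, dtype=np.float32)
    return np.where(x > 0, x, beta * x)

print(leaky_relu([-2.0, 0.5]))  # [-0.02  0.5 ]
```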
In the architecture, three convolutional layers and fully-connected layers are constructed as follows. The first convolutional layer contains 32 filters. Each filter's size is 4 × 4, and it moves with a 2 × 2 stride through the full depth of the input data. The second convolutional layer has 64 filters. Each filter's size is 2 × 2, and it moves with a 2 × 2 stride. The size of the output after two convolutional layers is 15 × 15 × 64. The third convolutional layer has 128 filters with a size of 2 × 2, and the stride's size is 1 × 1. The third convolutional layer's output is a 15 × 15 × 128 tensor. A fully-connected layer transfers the tensor into a 128 × 1 vector. After the fully-connected layer, the data are split into two parts with the same size of 64 × 1. The first part is then used to calculate the value, and the second part is used for the advantage. The advantage of an action indicates how much better the agent can do by taking that action rather than the other actions. Because the number of possible actions in our system is 9, as shown in Fig. 3, the size of the advantage is 9 × 1. They are combined again to get the Q value, which is the architecture of the dueling Deep Q Network (DQN).
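The description above maps naturally onto a small dueling network. The sketch below is a minimal tf.keras version, not the authors' implementation: the 60 × 60 × 2 input size follows Fig. 4, 'same' padding is assumed so the spatial sizes match the text, the 64-unit streams are approximated here by separate dense layers rather than a literal split of the 128 vector, the function name build_dueling_dqn is ours, and the penalization of illegal tentative actions is omitted.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_dueling_dqn(n_actions=9, beta=0.01):
    """Dueling CNN sketch: 3 conv layers, a 128-unit dense layer feeding
    64-unit value and advantage streams, combined as in (7)."""
    act = lambda: layers.LeakyReLU(beta)
    inp = layers.Input(shape=(60, 60, 2))                     # position & speed grids
    x = layers.Conv2D(32, 4, strides=2, padding="same")(inp)  # -> 30x30x32
    x = act()(x)
    x = layers.Conv2D(64, 2, strides=2, padding="same")(x)    # -> 15x15x64
    x = act()(x)
    x = layers.Conv2D(128, 2, strides=1, padding="same")(x)   # -> 15x15x128
    x = act()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128)(x)
    x = act()(x)
    value = layers.Dense(1)(act()(layers.Dense(64)(x)))        # V(s)
    adv = layers.Dense(n_actions)(act()(layers.Dense(64)(x)))  # A(s, a)
    # Q = V + (A - mean(A)), the dueling aggregation of (7).
    q = layers.Lambda(
        lambda va: va[0] + va[1] - tf.reduce_mean(va[1], axis=1, keepdims=True)
    )([value, adv])
    return tf.keras.Model(inp, q)

model = build_dueling_dqn()
model.summary()
```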
With the Q value corresponding to every action, we need to heavily penalize illegal actions, which may cause accidents or violate the maximum/minimum signal duration. The output combines the Q values and the tentative actions to force the traffic light to take a legal action. Finally, we get the Q values of every action in the output, with penalized values. The parameters in the CNN are denoted by θ. Q(s, a) now becomes Q(s, a; θ), which is estimated under the CNN θ. The details of the architecture are presented in the next subsections.

B. Dueling DQN

As mentioned before, our network contains a dueling DQN [13]. In the network, the Q value is estimated by the value of the current state and each action's advantage compared to the other actions. The value of a state, V(s; θ), denotes the overall expected rewards by taking probabilistic actions in the future steps. The advantage corresponds to every action and is defined as A(s, a; θ). The Q value is the sum of the value V and the advantage function A, which is calculated by the following equation,

$$Q(s, a; \theta) = V(s; \theta) + \left( A(s, a; \theta) - \frac{1}{|A|} \sum_{a'} A(s, a'; \theta) \right). \quad (7)$$

A(s, a; θ) shows how important an action is to the value function among all actions. If the A value of an action is positive, it means the action shows a better performance in numerical rewards compared to the average performance of all possible actions; otherwise, if the value of an action is negative, it means the action's potential reward is less than the average. It has been shown that subtracting the mean of all advantage values can improve the stability of optimization compared to using the advantage value directly. The dueling architecture is shown to effectively improve the performance in reinforcement learning.

C. Target Network

To update the parameters in the neural network, a target value is defined to help guide the update process. Let $Q_{target}(s, a)$ denote the target Q value at state s when taking action a. The neural network is updated by the Mean Square Error (MSE) in the following equation,

$$J = \sum_{s} P(s) \left[ Q_{target}(s, a) - Q(s, a; \theta) \right]^2, \quad (8)$$

where P(s) denotes the probability of state s in the training mini-batch. The MSE can be considered as a loss function to guide the updating process of the primary network. To provide a stable update in each iteration, a separate target network θ−, with the same architecture as the primary neural network but different parameters, is usually employed to generate the target value. The calculation of the target Q value is presented in the double DQN part.

The parameters θ in the primary neural network are updated by back propagation with (8). θ− is updated based on θ in the following equation,

$$\theta^{-} = \alpha \theta^{-} + (1 - \alpha) \theta. \quad (9)$$

α is the update rate, which represents how much the newest parameters affect the components in the target network. A target network can help mitigate the overoptimistic value estimation problem.

D. Double DQN

The target Q value is generated by the double Q-learning algorithm [14]. In the double DQN, the target network generates the target Q value, while the action is selected by the primary network. The target Q value can be expressed in the following equation,

$$Q_{target}(s, a) = r + \gamma Q\left(s', \arg\max_{a'} Q(s', a'; \theta); \theta^{-}\right). \quad (10)$$

It is shown that the double DQN effectively mitigates the overestimations and improves the performance [14].

In addition, we also employ the ε-greedy algorithm to balance exploration and exploitation in choosing actions.
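The sketch below illustrates (9) and (10) with plain numpy (illustrative only; the batch shapes, gamma, and alpha are assumed): the primary network picks the next action, the target network evaluates it, and the target parameters are blended softly toward the primary ones.

```python
import numpy as np

def double_dqn_targets(rewards, q_next_primary, q_next_target, gamma=0.99):
    """Double-DQN target of (10) for a batch of transitions.

    rewards:         shape (B,)
    q_next_primary:  Q(s', .; theta)  from the primary network, shape (B, 9)
    q_next_target:   Q(s', .; theta-) from the target network,  shape (B, 9)
    """
    best_actions = np.argmax(q_next_primary, axis=1)   # argmax under theta
    q_eval = q_next_target[np.arange(len(rewards)), best_actions]
    return rewards + gamma * q_eval

def soft_update(target_weights, primary_weights, alpha=0.999):
    """Soft target update of (9): theta- = alpha * theta- + (1 - alpha) * theta."""
    return [alpha * t + (1.0 - alpha) * p
            for t, p in zip(target_weights, primary_weights)]
```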
[Fig. 5: the primary CNN θ takes the current state and the tentative actions and selects the action a with the max Q value; the next state s′ and reward r are observed and the tuple ⟨s, a, r, s′⟩ is saved in memory; mini-batches are selected by prioritization; the target CNN θ− provides r + γQ(s′, a′; θ−) for the MSE loss used to update θ.]
As the number of training steps increases, the value of ε decreases gradually. We set the starting and ending values of ε and the number of steps to reach the ending value. The value of ε decreases linearly to the ending value. When ε reaches the ending value, it keeps that value for the rest of the procedure.
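A minimal sketch of this ε-greedy policy with a linear schedule; the starting/ending values and step count below are assumed, not the paper's settings.

```python
import numpy as np

def epsilon_at(step, eps_start=1.0, eps_end=0.01, decay_steps=10000):
    """Linearly decayed exploration rate; constant after decay_steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def epsilon_greedy(q_values, step, rng=np.random.default_rng()):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon_at(step):
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```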
E. Prioritized Experience Replay

During the updating process, the gradients are computed through the experience replay strategy. A prioritized experience replay strategy chooses samples from the memory based on priorities, which can lead to faster learning and a better final policy [15]. The key idea is to increase the replay probability of the samples that have a high temporal difference error. There are two possible methods for estimating the replay probability of an experience: proportional and rank-based. Rank-based prioritized experience replay can provide a more stable performance since it is not affected by extremely large errors. In this system, we take the rank-based method to calculate the priority of an experience sample. The temporal difference error δ of an experience sample i is defined in the following equation,

$$\delta_i = \left| Q(s, a; \theta)_i - Q_{target}(s, a)_i \right|. \quad (11)$$

The experiences are ranked by their errors, and then the priority $p_i$ of experience i is the reciprocal of its rank. Finally, the probability of sampling experience i is calculated in the following equation,

$$P_i = \frac{p_i^{\tau}}{\sum_{k} p_k^{\tau}}. \quad (12)$$

τ represents how much prioritization is used. When τ is 0, it degenerates to uniform random sampling.
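A compact numpy sketch of the rank-based sampling in (11)-(12); the memory layout, batch size, and the value of τ are assumptions for illustration only.

```python
import numpy as np

def rank_based_probabilities(td_errors, tau=0.7):
    """Sampling probabilities of (12): priority = 1 / rank of |TD error|."""
    td_errors = np.abs(np.asarray(td_errors, dtype=np.float64))
    ranks = np.empty_like(td_errors)
    order = np.argsort(-td_errors)            # rank 1 for the largest error
    ranks[order] = np.arange(1, len(td_errors) + 1)
    priorities = (1.0 / ranks) ** tau
    return priorities / priorities.sum()

def sample_minibatch(memory, td_errors, batch_size=64, tau=0.7,
                     rng=np.random.default_rng()):
    """Draw a prioritized mini-batch of transition indices from the memory."""
    probs = rank_based_probabilities(td_errors, tau)
    return rng.choice(len(memory), size=batch_size, replace=False, p=probs)
```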
F. Optimization

In this paper, we optimize the neural networks by the ADAptive Moment estimation (Adam) [26]. Adam is evaluated and compared with other back-propagation optimization algorithms in [27], which concludes that Adam attains satisfactory overall performance with fast convergence and an adaptive learning rate. The Adam optimization method adaptively updates the learning rate considering both first-order and second-order moments using the stochastic gradient descent procedure. Specifically, let θ denote the parameters in the CNN and J(θ) denote the loss function. Adam first calculates the gradients of the parameters,

$$g = \nabla_{\theta} J(\theta). \quad (13)$$

It then respectively updates the first-order and second-order biased moments, s and r, by the exponential moving average,

$$s = \rho_s s + (1 - \rho_s) g, \qquad r = \rho_r r + (1 - \rho_r) g \odot g, \quad (14)$$

where $\rho_s$ and $\rho_r$ are the exponential decay rates for the first-order and second-order moments, respectively. The first-order and second-order biased moments are corrected using the time step t through the following equations,

$$\hat{s} = \frac{s}{1 - \rho_s^t}, \qquad \hat{r} = \frac{r}{1 - \rho_r^t}. \quad (15)$$

Finally, the parameters are updated as follows,

$$\theta = \theta + \Delta\theta = \theta - \epsilon_r \frac{\hat{s}}{\sqrt{\hat{r}} + \delta}, \quad (16)$$

where $\epsilon_r$ is the initial learning rate and δ is a small positive constant to attain numerical stability.
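A worked numpy version of one Adam step from (13)-(16); this is a sketch under assumed hyperparameter values, not tied to the paper's settings.

```python
import numpy as np

def adam_step(theta, grad, state, eps_r=1e-3, rho_s=0.9, rho_r=0.999, delta=1e-8):
    """One Adam update of (13)-(16) for a flat parameter vector theta.

    state: dict holding the biased moments s, r and the time step t.
    """
    state["t"] += 1
    state["s"] = rho_s * state["s"] + (1 - rho_s) * grad            # (14), first order
    state["r"] = rho_r * state["r"] + (1 - rho_r) * grad * grad     # (14), second order
    s_hat = state["s"] / (1 - rho_s ** state["t"])                  # (15)
    r_hat = state["r"] / (1 - rho_r ** state["t"])                  # (15)
    return theta - eps_r * s_hat / (np.sqrt(r_hat) + delta)         # (16)

theta = np.zeros(3)
state = {"s": np.zeros(3), "r": np.zeros(3), "t": 0}
theta = adam_step(theta, grad=np.array([0.1, -0.2, 0.3]), state=state)
print(theta)
```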
G. Overall Architecture

In summary, the whole process in our model is shown in Fig. 5. The current state and the tentative actions are fed to the primary convolutional neural network to choose the most rewarding action. The current state and action, along with the next state and received reward, are stored into the memory as a four-tuple $\langle s, a, r, s' \rangle$. The data in the memory are selected by the prioritized experience replay to generate mini-batches, and they are used to update the primary neural network's parameters. The target network θ− is a separate neural network to increase stability during the learning. We use the double DQN [14] and dueling DQN [13] to reduce the possible overestimation and improve performance. Through this way, the approximating function can be trained and the
TABLE II
PARAMETERS IN THE REINFORCEMENT LEARNING NETWORK
Parameter                Value
Replay memory size M     20000
Minibatch size B         64
Starting ε               1

Fig. 9. The cumulative reward during all the training episodes in different network architectures.

Fig. 10. The average waiting time in all the training episodes during the rush hours with unbalanced traffic from all lanes.
our model, and the green line is the model without the double network. The red line is the model without the dueling network, and the cyan line is the model without prioritized experience replay. We can see that our model learns fastest among the four models. It means our model reaches the best policy faster than the others. Specifically, even though there is some fluctuation in the first 400 iterations, our model still outperforms the other three after 500 iterations. Our model achieves a cumulative reward greater than -47000 while the others stay below -50000.

4) Average waiting time under rush hours: In this part, we evaluate our model by comparing the performance under rush hours. Rush hour means the traffic flows from all lanes are not the same, which is usually seen in the real world. During the rush hours, the traffic flow rate from one direction doubles, and the traffic flow rates in the other lanes keep the same as in normal hours. Specifically, in our experiments, the arrival rate of vehicles on the lanes from the west to the east becomes 2/10 each second, and the arrival rates of vehicles on the other lanes are still 1/10 each second. The experimental result is shown in Fig. 10. In this figure, the blue solid line shows the results of our model, and the green and red solid lines are the results of fixed-time traffic lights. The dotted lines are the variances corresponding to the solid lines of the same color. From the figure, we can see that the best policy becomes harder to learn than in the previous scenario. This is because the traffic scenario becomes more complex, which leads to more uncertain factors. But after trial and error, our model can still learn a good policy to reduce the average waiting time. Specifically, the average waiting time in 3DQN is about 33 seconds after 1000 episodes, while the average waiting time in the other two methods is over 45 seconds and over 50 seconds. Our model reduces the average waiting time by about 26.7%.

VIII. CONCLUSION

In this paper, we propose to solve the traffic light control problem using a deep reinforcement learning model. The traffic information is gathered from vehicular networks. The states are two-dimensional values with the vehicles' position and speed information. The actions are modeled as a Markov decision process, and the rewards are the cumulative waiting time difference between two cycles. To handle the complex traffic scenario in our problem, we propose a double dueling deep Q network (3DQN) with prioritized experience replay. The model can learn a good policy under both rush hours and normal traffic flow rates. It can reduce the average waiting time by over 20% from the start of training. The proposed model also outperforms others in learning speed, which is shown by extensive simulations in SUMO and TensorFlow.

REFERENCES

[1] S. S. Mousavi, M. Schukat, P. Corcoran, and E. Howley, "Traffic light control using deep policy-gradient and value-function based reinforcement learning," arXiv preprint arXiv:1704.08883, April 2017.
[2] W. Genders and S. Razavi, "Using a deep reinforcement learning agent for traffic signal control," arXiv preprint arXiv:1611.01142, November 2016.
[3] N. Casas, "Deep deterministic policy gradient for urban traffic light control," arXiv preprint arXiv:1703.09035, March 2017.
[4] S. El-Tantawy, B. Abdulhai, and H. Abdelgawad, "Design of reinforcement learning parameters for seamless application of adaptive traffic signal control," Journal of Intelligent Transportation Systems, vol. 18, no. 3, pp. 227–245, July 2014.
[5] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, Cambridge, March 1998, vol. 1, no. 1.
[6] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, January 2016.
[7] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., "Mastering the game of Go without human knowledge," Nature, vol. 550, no. 7676, p. 354, October 2017.
[8] M. Abdoos, N. Mozayani, and A. L. Bazzan, "Holonic multi-agent system for traffic signals control," Engineering Applications of Artificial Intelligence, vol. 26, no. 5, pp. 1575–1587, May–June 2013.
[9] H. Hartenstein and L. Laberteaux, "A tutorial survey on vehicular ad hoc networks," IEEE Communications Magazine, vol. 46, no. 6, June 2008.
[10] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, February 2015.
[11] L. Li, Y. Lv, and F.-Y. Wang, "Traffic signal timing via deep reinforcement learning," IEEE/CAA Journal of Automatica Sinica, vol. 3, no. 3, pp. 247–254, July 2016.
[12] E. van der Pol, "Deep reinforcement learning for coordination in traffic light control," Master's thesis, University of Amsterdam, August 2016.
[13] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas, "Dueling network architectures for deep reinforcement learning," arXiv preprint arXiv:1511.06581, November 2015.