Abstract—Existing inefficient traffic light control causes numerous problems, such as long delay and waste of energy. To improve efficiency, taking real-time traffic information as an input and dynamically adjusting the traffic light duration accordingly is essential. In terms of how to dynamically adjust traffic signals' duration, existing works either split the traffic signal into equal durations or extract only limited traffic information from the real data. In this paper, we study how to decide the traffic signals' duration based on the collected data from different sensors and vehicular networks. We propose a deep reinforcement learning model to control the traffic light. In the model, we quantify the complex traffic scenario as states by collecting data and dividing the whole intersection into small grids. The timing changes of a traffic light are the actions, which are modeled as a high-dimension Markov decision process. The reward is the cumulative waiting time difference between two cycles. To solve the model, a convolutional neural network is employed to map states to expected future rewards. The proposed model is composed of several components to improve the performance, such as the dueling network, target network, double Q-learning network, and prioritized experience replay. We evaluate our model via simulation in the Simulation of Urban MObility (SUMO) with a vehicular network, and the simulation results show the efficiency of our model in controlling traffic lights.

Index Terms—reinforcement learning, deep learning, traffic light control, vehicular network

X. Liang and G. Wang are with the Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, 07102 USA; email: {xl367, gwang}@njit.edu.
X. Du and Z. Han are with the Department of Electrical and Computer Engineering, University of Houston, Houston, TX 77004 USA; email: {xunshengdu, hanzhu22}@gmail.com.
Manuscript received January 22, 2018.

I. INTRODUCTION

Existing road intersection management is done through traffic lights. Inefficient traffic light control causes numerous problems, such as long delay of travelers, huge waste of energy, and worsening air quality. In some cases, it may also contribute to vehicular accidents [1], [2]. Existing traffic light control either deploys fixed programs without considering real-time traffic or considers the traffic only to a very limited degree [3]. The fixed programs set equal time durations for the traffic signals in every cycle, or different durations based on historical information. Some other control programs take inputs from sensors, such as underground inductive loop detectors, to detect the existence of vehicles in front of traffic lights. The inputs are processed in a very coarse way to determine the duration of green/red lights.

In some cases, existing traffic light control systems work, though at a low efficiency. However, in many other cases, such as a football event or a more common high-traffic-hour scenario, the traffic light control systems become paralyzed. Instead, we often witness policemen directly managing the intersection with hand signals. Such a human operator can see the real-time traffic condition on the intersecting roads and smartly determine the duration of the allowed passing time for each direction using his/her long-term experience and understanding of the intersection. The operation is normally very effective. This observation motivates us to propose a smart intersection traffic light management system which can take the real-time traffic condition as input and learn how to manage the intersection just like the human operator.

To implement such a system, we need ‘eyes’ to watch the real-time road condition and ‘a brain’ to process it. For the former, recent advances in sensor and networking technology enable taking real-time traffic information as input, such as the number of vehicles, the locations of vehicles, and their waiting time [4]. For the ‘brain’ part, reinforcement learning, as a type of machine learning technique, is a promising way to solve the problem. A reinforcement learning system's goal is to make an action agent learn the optimal policy by interacting with the environment to maximize the reward, e.g., minimizing the waiting time in our intersection control scenario. It usually contains three components: the states of the environment, the action space of the agent, and the reward from every action [5]. A well-known application of reinforcement learning is AlphaGo [6], including AlphaGo Zero [7]. AlphaGo, acting as the action agent in a Go game (the environment), first observes the current image of the chessboard (the state), and takes the image as the input of a reinforcement learning model to determine where to place the next playing piece, a ‘stone’ (the action). Its final reward is to win or lose the game. Thus, the reward may not be obvious during play and is delayed until the game is over. When applying reinforcement learning to the traffic light control problem, the key point is to define the three components at an intersection and quantify them to be computable.

Some researchers have proposed to dynamically control the traffic lights using reinforcement learning. Early works define the states by the number of waiting vehicles or the waiting queue length [4], [8]. But the real traffic situation cannot be accurately captured by the number of waiting vehicles or the queue length [2]. With the popularization of vehicular networks and cameras, more information about roads can be extracted and transmitted via the network, such as vehicles' speed and waiting time [9]. However, more information dramatically increases the number of states. When
the number of states increases, the complexity of a traditional reinforcement learning system grows exponentially. With the rapid development of deep learning, deep neural networks have been employed to deal with the large number of states, which constitutes a deep reinforcement learning model [10]. A few recent studies have proposed to apply deep reinforcement learning to the traffic light control problem [11], [12]. But there are two main limitations in the existing studies: (1) the traffic signals are usually split into fixed-time intervals, and the duration of green/red lights can only be a multiple of this fixed-length interval, which is not efficient in many situations; (2) the traffic signals are designed to change in a random sequence, which is neither a safe nor a comfortable way for drivers. In this paper, we study how to control the traffic light's signal duration in a cycle based on information extracted from vehicular networks, to help efficiently manage vehicles at an intersection.

In this paper, we solve the problem with the following approach and make the following contributions. Our general idea is to mimic an experienced operator who controls the signal duration in every cycle based on the information gathered from vehicular networks. To implement such an idea, the experienced operator's operation is modeled as a Markov Decision Process (MDP). The MDP is a high-dimension model, which contains the time duration of every phase. The system then learns the control strategy based on the MDP by trial and error in a deep reinforcement learning model. To fit a deep reinforcement learning model, we divide the whole intersection into grids and build a matrix from the vehicles' information in the grids, collected by vehicular networks or extracted from a camera via image processing. The matrix is defined as the state, and the reward is the cumulative waiting time difference between two cycles. In our model, a convolutional neural network is employed to map states to expected future rewards. In the traffic light control problem, every traffic light's action may affect the environment, and the traffic flow changes dynamically, which makes the environment unpredictable. Thus, it is hard for a plain convolutional network to predict the reward accurately. Inspired by recent studies in reinforcement learning, we employ a series of state-of-the-art techniques in our model to improve the performance, including the dueling network [13], target network [10], double Q-learning network [14], and prioritized experience replay [15]. In this paper, we combine these techniques as a framework to solve our problem, which can be easily applied to other problems. Our system is tested on a traffic micro-simulator, Simulation of Urban MObility (SUMO) [16], and the simulation results show the effectiveness and high efficiency of our model.

The remainder of this paper is organized as follows. The literature review is presented in Section II. The model and problem statement are introduced in Section III. The background on reinforcement learning is introduced in Section IV. Section V shows the details of modeling a reinforcement learning model for the traffic light control system in vehicular networks. Section VI extends the reinforcement learning model into a deep learning model to handle the complex states in our system. The model is evaluated in Section VII. Finally, the paper is concluded in Section VIII.

II. LITERATURE REVIEW

Previous works have been done to dynamically control adaptive traffic lights. But due to the limited computing power and simulation tools, early studies focus on solving the problem by fuzzy logic [17], linear programming [18], etc. In these works, road traffic is modeled with limited information, which cannot be applied at a large scale.

Reinforcement learning has been applied to traffic light control since the 1990s. El-Tantawy et al. [4] summarize the methods from 1997 to 2010 that use reinforcement learning to control traffic light timing. During this period, the reinforcement learning techniques are limited to tabular Q-learning, and a linear function is normally used to estimate the Q value. Due to the limitations of reinforcement learning techniques at the time, these works usually use a small state space, such as the number of waiting vehicles [8], [19], [20] and the statistics of traffic flow [21], [22]. The complexity of a road traffic system cannot be truly represented by such limited information. When much relevant information is omitted from the limited states, the agent can hardly act optimally in traffic light control [2].

With the development of deep learning and reinforcement learning, they are combined as deep reinforcement learning to estimate the Q value. We summarize the recent studies that use value-based deep reinforcement learning to control traffic lights in Table I. There are three limitations in these previous studies. Firstly, most of them test their models in a simple cross-shape intersection with through traffic only [11], [12]. Secondly, none of the previous works determines the traffic signal timing in a whole cycle. Thirdly, deep reinforcement learning is a fast developing field, where many new ideas have been proposed in the past two years, such as the dueling deep Q network [13], but they have not been applied to traffic control. In this paper, we make the following progress. Firstly, our intersection scenario contains multiple phases, which corresponds to a high-dimension action space in a cycle. Secondly, our model guarantees that the traffic signal timing changes smoothly between two neighboring actions, which is exactly defined in the MDP model. Thirdly, we employ the state-of-the-art techniques in value-based reinforcement learning algorithms to achieve good performance, which is evaluated via simulation.

III. MODEL AND PROBLEM STATEMENT

In this paper, we consider a road intersection scenario where traffic lights are used to control traffic flows. The model is shown in Fig. 1. The left side shows the structure of a traffic light. The traffic light first gathers road traffic information via a vehicular network [9], which is represented by the dashed purple lines in the figure. The traffic light processes the data to obtain the road traffic's state and reward, which has been assumed in many previous studies [2], [12], [23]. The traffic light chooses an action based on the current state and reward using a deep neural network shown on the right side. The left side is the reinforcement learning part and the deep learning
part. They make up our deep reinforcement learning model in traffic light control.

TABLE I
LIST OF PREVIOUS STUDIES THAT USE VALUE-BASED DEEP REINFORCEMENT LEARNING TO ADAPTIVELY CONTROL TRAFFIC SIGNALS

Fig. 1. The traffic light control model in our system. The left side shows the intersection scenario where the traffic light gathers vehicles' information via a vehicular network and it is controlled by the reinforcement learning model; the right side shows a deep neural network to help the traffic light choose an action.

In our model, traffic lights are used to manage the traffic flows at intersections. A traffic light at an intersection has three signals: green, yellow, and red. One traffic light may not be enough to manage all the vehicles when there are vehicles from multiple directions at an intersection. Thus, multiple traffic lights need to cooperate at a multi-direction intersection. At such an intersection, the traffic signal guides vehicles from non-conflicting directions at one time by changing the traffic lights' statuses. One status is one of the legal combinations of all traffic lights' red and green signals, omitting the yellow signals. The time duration of staying at one status is called one phase. The number of phases is decided by the number of legal statuses at an intersection. All the phases cyclically change in a fixed sequence to guide vehicles to pass the intersection. It is called a cycle when the phases repeat once. The sequence of phases in a cycle is fixed, but the duration of every phase is adaptive. If one phase needs to be skipped, its duration can be set to 0 seconds. In our problem, we dynamically adjust the duration of every phase to deal with different traffic situations at an intersection.

Our problem is how to optimize the efficiency of the intersection usage by dynamically changing every phase's duration of a traffic light via learning from historical experiences. The general idea is to extend the duration of the phase that has more vehicles in its direction. But it is time-consuming to train a person to become a master who knows well how much time should be given to a phase based on the current traffic situation. Reinforcement learning is a possible way to learn how to control the traffic light and liberate a human being from the learning process. Reinforcement learning updates its model by continuously receiving states and rewards from the environment. The model gradually becomes mature and advanced. It differs from supervised learning in that it does not require a large amount of data at one time. In this paper, we employ deep reinforcement learning to learn the timing strategy of every phase to optimize the traffic management.

IV. BACKGROUND ON REINFORCEMENT LEARNING

Reinforcement learning is one category of algorithms in machine learning, which is different from supervised learning and unsupervised learning [5]. It interacts with the environment to get rewards from actions. Its goal is to take actions that maximize the numerical reward in the long run. In reinforcement learning, an agent, the action executor, takes an action, and the environment returns a numerical reward based on the action and the current state. A four-tuple $\langle S, A, R, T \rangle$ can be used to denote the reinforcement learning model with the following meanings:
• S: the possible state space. s is a specific state (s ∈ S);
• A: the possible action space. a is an action (a ∈ A);
• R: the reward space. $r_{s,a}$ means the reward of taking action a at state s;
• T: the transition function space among all states, which gives the probability of transitioning from one state to another.

In a deterministic model, T is usually omitted.

A policy is made up of a series of consecutive actions. The goal in reinforcement learning is to learn an optimal policy to maximize the cumulative expected rewards starting from the current state. Generally speaking, the agent at one specific state s takes an action a to reach state s′ and gets a reward r, which is denoted by $\langle s, a, r, s' \rangle$. Let t denote the t-th step in the policy π. The cumulative reward in the future by taking action a at state s is defined by Q(s, a) in the following equation,

$$Q^{\pi}(s, a) = E\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \mid s_t = s, a_t = a, \pi \right] = E\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s_t = s, a_t = a, \pi \right]. \quad (1)$$
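As a concrete illustration of (1), the snippet below is a minimal numpy sketch (not the authors' code) that computes the discounted cumulative reward for a finite sequence of rewards; the reward values and discount factor are assumed for illustration only.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Discounted cumulative reward in (1) for a finite episode:
    sum_k gamma^k * r_{t+k}, with r_{t+k} taken from `rewards`."""
    rewards = np.asarray(rewards, dtype=np.float64)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# Example: three steps of hypothetical negative waiting-time rewards.
print(discounted_return([-10.0, -6.0, -2.0], gamma=0.99))
```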
(b) The corresponding position matrix on this road
Fig. 2. The process to build the state matrix.

A. States

We define the states based on two pieces of information: the position and speed of vehicles at an intersection. Through a vehicular network, vehicles' position and speed can be obtained [9]. Then the traffic light can extract a virtual snapshot image of the current intersection. The whole intersection is divided into same-size small square-shape grids. The length of the grids, c, should guarantee that no two vehicles can be held in the same grid and that one entire vehicle can be put into a grid, to reduce computation. The value of c in our system will be given in the evaluation. In every grid, the state value is a two-value vector $\langle position, speed \rangle$ of the vehicle inside. The position dimension is a binary value, which denotes whether there is a vehicle in the grid. If there is a vehicle in a grid, the value in the grid is 1; otherwise, it is 0. The speed dimension is an integer value, denoting the vehicle's current speed in m/s.

Let's take Fig. 2 as an example to show how to quantify the intersection to obtain the state values. Fig. 2(a) shows a snapshot of the traffic status at a simple one-lane four-way intersection, which is built with information from a vehicular network. The intersection is split into square-shape grids. The position matrix has the same size as the grids, which is shown in Fig. 2(b). In the matrix, one cell corresponds to one grid in Fig. 2(a). The blank cells mean no vehicle in the corresponding grid, and they are 0. The other cells, with vehicles inside, are set to 1.0. The value in the speed dimension is built in a similar way. If there is a vehicle in the grid, the corresponding value is the vehicle's speed; otherwise, it is 0.
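To make the state encoding concrete, the sketch below is an illustrative numpy example, not the authors' implementation; the grid length c, the intersection extent, and the vehicle list are assumed values.

```python
import numpy as np

def build_state(vehicles, area=60.0, c=5.0):
    """Build the <position, speed> state matrices for one intersection.

    vehicles: list of (x, y, speed) with x, y in meters inside an
              area x area square around the intersection (assumed layout).
    c:        grid length in meters (no two vehicles share a grid).
    Returns an (n, n, 2) array: channel 0 is the binary position matrix,
    channel 1 is the speed matrix in m/s.
    """
    n = int(area / c)
    state = np.zeros((n, n, 2), dtype=np.float32)
    for x, y, speed in vehicles:
        i, j = int(y // c), int(x // c)
        if 0 <= i < n and 0 <= j < n:
            state[i, j, 0] = 1.0      # position dimension
            state[i, j, 1] = speed    # speed dimension
    return state

# Example: two vehicles reported by the vehicular network (assumed data).
s = build_state([(12.3, 40.1, 8.0), (33.0, 21.5, 0.0)])
print(s.shape)  # (12, 12, 2)
```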
B. Action Space

A traffic light needs to choose an appropriate action to guide vehicles at the intersection well, based on the current traffic state. In this system, the action space is defined by selecting every phase's duration in the next cycle. But if the duration changes a lot between two cycles, the system may become unstable. Thus, the legal phase durations at the current state should change smoothly. We model the duration changes of the legal phases between two neighboring cycles as a high-dimension MDP. In the model, the traffic light only changes one phase's duration by a small step.
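As a sketch of this action space, the snippet below enumerates the duration tuples reachable from the current cycle, i.e., keeping all durations or changing exactly one phase by a small step. The four-phase cycle and the 5-second step are the ones detailed in the next paragraph; the bounds on a phase duration are assumed values.

```python
def tentative_actions(durations, step=5, t_min=0, t_max=60):
    """Enumerate legal next-cycle duration tuples from the current one.

    durations: current cycle, e.g. (t1, t2, t3, t4).
    Returns at most 2 * len(durations) + 1 = 9 tuples for four phases:
    the unchanged cycle plus one phase changed by +/- step seconds,
    clipped to an assumed [t_min, t_max] range.
    """
    actions = [tuple(durations)]                 # keep the current timing
    for i in range(len(durations)):
        for delta in (-step, +step):
            t = durations[i] + delta
            if t_min <= t <= t_max:              # drop illegal durations
                nxt = list(durations)
                nxt[i] = t
                actions.append(tuple(nxt))
    return actions

print(tentative_actions((25, 10, 25, 10)))       # 9 candidate cycles
```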
Let's take the intersection in Fig. 2(a) as an example. At the intersection, there are four phases: north-south green, east-north & west-south green, east-west green, and east-south & west-north green. The other unmentioned directions are red by default. Let's omit the yellow signals here, which will be presented later. Let a four-tuple $\langle t_1, t_2, t_3, t_4 \rangle$ denote the durations of the four phases in the current cycle. The legal actions in the next cycle are shown in Fig. 3. In the figure, one circle means the durations of the four phases in one cycle. We discretize the time change from the current cycle to the succeeding cycle to 5 seconds. The duration of one and only
Fig. 4. The architecture of the deep convolutional neural network to approximate the Q value.

layer selects the salient values from a local patch of units to replace the whole patch. The pooling process removes less important information and reduces the dimensionality. The activation function decides how a unit is activated. The most common way is to apply a non-linear function to the output. In this paper, we employ the leaky ReLU [25] as the activation function, with the following form (let x denote the output of a unit),

$$f(x) = \begin{cases} x, & \text{if } x > 0, \\ \beta x, & \text{if } x \leq 0. \end{cases} \quad (6)$$

β is a small constant to avoid a zero gradient on the negative side. The leaky ReLU can converge faster than other activation functions, like tanh and sigmoid, and prevents the generation of ‘dead’ neurons from the regular ReLU.
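A direct transcription of (6) as a tiny numpy sketch; the slope β = 0.01 is an assumed value, not the paper's setting.

```python
import numpy as np

def leaky_relu(x, beta=0.01):
    """Leaky ReLU in (6): identity for positive inputs, beta * x otherwise."""
    x = np.asarray(x, dtype=np.float32)
    return np.where(x > 0, x, beta * x)

print(leaky_relu([-2.0, 0.5]))  # [-0.02  0.5 ]
```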
In the architecture, three convolutional layers and fully-connected layers are constructed as follows. The first convolutional layer contains 32 filters. Each filter's size is 4 × 4, and it moves with a 2 × 2 stride through the full depth of the input data. The second convolutional layer has 64 filters. Each filter's size is 2 × 2, and it moves with a 2 × 2 stride. The size of the output after two convolutional layers is 15 × 15 × 64. The third convolutional layer has 128 filters with a size of 2 × 2, and the stride's size is 1 × 1. The third convolutional layer's output is a 15 × 15 × 128 tensor. A fully-connected layer transfers the tensor into a 128 × 1 vector. After the fully-connected layer, the data are split into two parts with the same size of 64 × 1. The first part is then used to calculate the value, and the second part is used for the advantage. The advantage of an action indicates how much better the agent can do by taking that action rather than the other actions. Because the number of possible actions in our system is 9, as shown in Fig. 3, the size of the advantage is 9 × 1. They are combined again to get the Q value, which is the architecture of the dueling Deep Q Network (DQN).
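The description above maps naturally onto a small dueling network. The sketch below is a minimal tf.keras version, not the authors' implementation: the 60 × 60 × 2 input size follows Fig. 4, 'same' padding is assumed so the spatial sizes match the text, the 64-unit streams are approximated here by separate dense layers rather than a literal split of the 128 vector, the function name build_dueling_dqn is ours, and the penalization of illegal tentative actions is omitted.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_dueling_dqn(n_actions=9, beta=0.01):
    """Dueling CNN sketch: 3 conv layers, a 128-unit dense layer feeding
    64-unit value and advantage streams, combined as in (7)."""
    act = lambda: layers.LeakyReLU(beta)
    inp = layers.Input(shape=(60, 60, 2))                     # position & speed grids
    x = layers.Conv2D(32, 4, strides=2, padding="same")(inp)  # -> 30x30x32
    x = act()(x)
    x = layers.Conv2D(64, 2, strides=2, padding="same")(x)    # -> 15x15x64
    x = act()(x)
    x = layers.Conv2D(128, 2, strides=1, padding="same")(x)   # -> 15x15x128
    x = act()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128)(x)
    x = act()(x)
    value = layers.Dense(1)(act()(layers.Dense(64)(x)))        # V(s)
    adv = layers.Dense(n_actions)(act()(layers.Dense(64)(x)))  # A(s, a)
    # Q = V + (A - mean(A)), the dueling aggregation of (7).
    q = layers.Lambda(
        lambda va: va[0] + va[1] - tf.reduce_mean(va[1], axis=1, keepdims=True)
    )([value, adv])
    return tf.keras.Model(inp, q)

model = build_dueling_dqn()
model.summary()
```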
With the Q value corresponding to every action, we need to heavily penalize illegal actions, which may cause accidents or violate the maximum/minimum signal duration. The output combines the Q values and the tentative actions to force the traffic light to take a legal action. Finally, we get the Q values of every action in the output, with penalized values. The parameters in the CNN are denoted by θ. Q(s, a) now becomes Q(s, a; θ), which is estimated under the CNN θ. The details of the architecture are presented in the next subsections.

B. Dueling DQN

As mentioned before, our network contains a dueling DQN [13]. In the network, the Q value is estimated by the value of the current state and each action's advantage compared to the other actions. The value of a state, V(s; θ), denotes the overall expected rewards by taking probabilistic actions in the future steps. The advantage corresponds to every action and is defined as A(s, a; θ). The Q value is the sum of the value V and the advantage function A, which is calculated by the following equation,

$$Q(s, a; \theta) = V(s; \theta) + \left( A(s, a; \theta) - \frac{1}{|A|} \sum_{a'} A(s, a'; \theta) \right). \quad (7)$$

A(s, a; θ) shows how important an action is to the value function among all actions. If the A value of an action is positive, it means the action shows a better performance in numerical rewards compared to the average performance of all possible actions; otherwise, if the value of an action is negative, it means the action's potential reward is less than the average. It has been shown that subtracting the mean of all advantage values can improve the stability of optimization compared to using the advantage value directly. The dueling architecture is shown to effectively improve the performance in reinforcement learning.

C. Target Network

To update the parameters in the neural network, a target value is defined to help guide the update process. Let $Q_{target}(s, a)$ denote the target Q value at state s when taking action a. The neural network is updated by the Mean Square Error (MSE) in the following equation,

$$J = \sum_{s} P(s) \left[ Q_{target}(s, a) - Q(s, a; \theta) \right]^2, \quad (8)$$

where P(s) denotes the probability of state s in the training mini-batch. The MSE can be considered as a loss function to guide the updating process of the primary network. To provide a stable update in each iteration, a separate target network θ−, with the same architecture as the primary neural network but different parameters, is usually employed to generate the target value. The calculation of the target Q value is presented in the double DQN part.

The parameters θ in the primary neural network are updated by back propagation with (8). θ− is updated based on θ in the following equation,

$$\theta^{-} = \alpha \theta^{-} + (1 - \alpha) \theta. \quad (9)$$

α is the update rate, which represents how much the newest parameters affect the components in the target network. A target network can help mitigate the overoptimistic value estimation problem.

D. Double DQN

The target Q value is generated by the double Q-learning algorithm [14]. In the double DQN, the target network generates the target Q value, while the action is selected by the primary network. The target Q value can be expressed in the following equation,

$$Q_{target}(s, a) = r + \gamma Q\left(s', \arg\max_{a'} Q(s', a'; \theta); \theta^{-}\right). \quad (10)$$

It is shown that the double DQN effectively mitigates the overestimations and improves the performance [14].

In addition, we also employ the ε-greedy algorithm to balance exploration and exploitation in choosing actions.
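The sketch below illustrates (9) and (10) with plain numpy (illustrative only; the batch shapes, gamma, and alpha are assumed): the primary network picks the next action, the target network evaluates it, and the target parameters are blended softly toward the primary ones.

```python
import numpy as np

def double_dqn_targets(rewards, q_next_primary, q_next_target, gamma=0.99):
    """Double-DQN target of (10) for a batch of transitions.

    rewards:         shape (B,)
    q_next_primary:  Q(s', .; theta)  from the primary network, shape (B, 9)
    q_next_target:   Q(s', .; theta-) from the target network,  shape (B, 9)
    """
    best_actions = np.argmax(q_next_primary, axis=1)   # argmax under theta
    q_eval = q_next_target[np.arange(len(rewards)), best_actions]
    return rewards + gamma * q_eval

def soft_update(target_weights, primary_weights, alpha=0.999):
    """Soft target update of (9): theta- = alpha * theta- + (1 - alpha) * theta."""
    return [alpha * t + (1.0 - alpha) * p
            for t, p in zip(target_weights, primary_weights)]
```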
[Fig. 5: the primary CNN θ takes the current state and the tentative actions and selects the action a with the max Q value; the next state s′ and reward r are observed and the tuple ⟨s, a, r, s′⟩ is saved in memory; mini-batches are selected by prioritization; the target CNN θ− provides r + γQ(s′, a′; θ−) for the MSE loss used to update θ.]
As the number of training steps increases, the value of ε decreases gradually. We set the starting and ending values of ε and the number of steps to reach the ending value. The value of ε decreases linearly to the ending value. When ε reaches the ending value, it keeps that value for the rest of the procedure.
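A minimal sketch of this ε-greedy policy with a linear schedule; the starting/ending values and step count below are assumed, not the paper's settings.

```python
import numpy as np

def epsilon_at(step, eps_start=1.0, eps_end=0.01, decay_steps=10000):
    """Linearly decayed exploration rate; constant after decay_steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def epsilon_greedy(q_values, step, rng=np.random.default_rng()):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon_at(step):
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```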
E. Prioritized Experience Replay

During the updating process, the gradients are computed through the experience replay strategy. A prioritized experience replay strategy chooses samples from the memory based on priorities, which can lead to faster learning and a better final policy [15]. The key idea is to increase the replay probability of the samples that have a high temporal difference error. There are two possible methods for estimating the replay probability of an experience: proportional and rank-based. Rank-based prioritized experience replay can provide a more stable performance since it is not affected by extremely large errors. In this system, we take the rank-based method to calculate the priority of an experience sample. The temporal difference error δ of an experience sample i is defined in the following equation,

$$\delta_i = \left| Q(s, a; \theta)_i - Q_{target}(s, a)_i \right|. \quad (11)$$

The experiences are ranked by their errors, and then the priority $p_i$ of experience i is the reciprocal of its rank. Finally, the probability of sampling experience i is calculated in the following equation,

$$P_i = \frac{p_i^{\tau}}{\sum_{k} p_k^{\tau}}. \quad (12)$$

τ represents how much prioritization is used. When τ is 0, it degenerates to uniform random sampling.
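A compact numpy sketch of the rank-based sampling in (11)-(12); the memory layout, batch size, and the value of τ are assumptions for illustration only.

```python
import numpy as np

def rank_based_probabilities(td_errors, tau=0.7):
    """Sampling probabilities of (12): priority = 1 / rank of |TD error|."""
    td_errors = np.abs(np.asarray(td_errors, dtype=np.float64))
    ranks = np.empty_like(td_errors)
    order = np.argsort(-td_errors)            # rank 1 for the largest error
    ranks[order] = np.arange(1, len(td_errors) + 1)
    priorities = (1.0 / ranks) ** tau
    return priorities / priorities.sum()

def sample_minibatch(memory, td_errors, batch_size=64, tau=0.7,
                     rng=np.random.default_rng()):
    """Draw a prioritized mini-batch of transition indices from the memory."""
    probs = rank_based_probabilities(td_errors, tau)
    return rng.choice(len(memory), size=batch_size, replace=False, p=probs)
```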
F. Optimization

In this paper, we optimize the neural networks by the ADAptive Moment estimation (Adam) [26]. Adam is evaluated and compared with other back-propagation optimization algorithms in [27], which concludes that Adam attains satisfactory overall performance with fast convergence and an adaptive learning rate. The Adam optimization method adaptively updates the learning rate considering both first-order and second-order moments using the stochastic gradient descent procedure. Specifically, let θ denote the parameters in the CNN and J(θ) denote the loss function. Adam first calculates the gradients of the parameters,

$$g = \nabla_{\theta} J(\theta). \quad (13)$$

It then respectively updates the first-order and second-order biased moments, s and r, by the exponential moving average,

$$s = \rho_s s + (1 - \rho_s) g, \qquad r = \rho_r r + (1 - \rho_r) g \odot g, \quad (14)$$

where $\rho_s$ and $\rho_r$ are the exponential decay rates for the first-order and second-order moments, respectively. The first-order and second-order biased moments are corrected using the time step t through the following equations,

$$\hat{s} = \frac{s}{1 - \rho_s^t}, \qquad \hat{r} = \frac{r}{1 - \rho_r^t}. \quad (15)$$

Finally, the parameters are updated as follows,

$$\theta = \theta + \Delta\theta = \theta - \epsilon_r \frac{\hat{s}}{\sqrt{\hat{r}} + \delta}, \quad (16)$$

where $\epsilon_r$ is the initial learning rate and δ is a small positive constant to attain numerical stability.
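A worked numpy version of one Adam step from (13)-(16); this is a sketch under assumed hyperparameter values, not tied to the paper's settings.

```python
import numpy as np

def adam_step(theta, grad, state, eps_r=1e-3, rho_s=0.9, rho_r=0.999, delta=1e-8):
    """One Adam update of (13)-(16) for a flat parameter vector theta.

    state: dict holding the biased moments s, r and the time step t.
    """
    state["t"] += 1
    state["s"] = rho_s * state["s"] + (1 - rho_s) * grad            # (14), first order
    state["r"] = rho_r * state["r"] + (1 - rho_r) * grad * grad     # (14), second order
    s_hat = state["s"] / (1 - rho_s ** state["t"])                  # (15)
    r_hat = state["r"] / (1 - rho_r ** state["t"])                  # (15)
    return theta - eps_r * s_hat / (np.sqrt(r_hat) + delta)         # (16)

theta = np.zeros(3)
state = {"s": np.zeros(3), "r": np.zeros(3), "t": 0}
theta = adam_step(theta, grad=np.array([0.1, -0.2, 0.3]), state=state)
print(theta)
```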
G. Overall Architecture

In summary, the whole process in our model is shown in Fig. 5. The current state and the tentative actions are fed to the primary convolutional neural network to choose the most rewarding action. The current state and action, along with the next state and received reward, are stored into the memory as a four-tuple $\langle s, a, r, s' \rangle$. The data in the memory are selected by the prioritized experience replay to generate mini-batches, and they are used to update the primary neural network's parameters. The target network θ− is a separate neural network to increase stability during the learning. We use the double DQN [14] and dueling DQN [13] to reduce the possible overestimation and improve performance. Through this way, the approximating function can be trained and the
TABLE II
PARAMETERS IN THE REINFORCEMENT LEARNING NETWORK
Parameter                Value
Replay memory size M     20000
Minibatch size B         64
Starting ε               1

Fig. 9. The cumulative reward during all the training episodes in different network architectures.

Fig. 10. The average waiting time in all the training episodes during the rush hours with unbalanced traffic from all lanes.
our model, and the green line is the model without the double network. The red line is the model without the dueling network, and the cyan line is the model without prioritized experience replay. We can see that our model learns fastest among the four models. It means our model reaches the best policy faster than the others. Specifically, even though there is some fluctuation in the first 400 iterations, our model still outperforms the other three after 500 iterations. Our model achieves a cumulative reward greater than -47000 while the others stay below -50000.

4) Average waiting time under rush hours: In this part, we evaluate our model by comparing the performance under rush hours. Rush hour means the traffic flows from all lanes are not the same, which is usually seen in the real world. During the rush hours, the traffic flow rate from one direction doubles, and the traffic flow rates in the other lanes keep the same as in normal hours. Specifically, in our experiments, the arrival rate of vehicles on the lanes from the west to the east becomes 2/10 each second, and the arrival rates of vehicles on the other lanes are still 1/10 each second. The experimental result is shown in Fig. 10. In this figure, the blue solid line shows the results of our model, and the green and red solid lines are the results of fixed-time traffic lights. The dotted lines are the variances corresponding to the solid lines of the same color. From the figure, we can see that the best policy becomes harder to learn than in the previous scenario. This is because the traffic scenario becomes more complex, which leads to more uncertain factors. But after trial and error, our model can still learn a good policy to reduce the average waiting time. Specifically, the average waiting time in 3DQN is about 33 seconds after 1000 episodes, while the average waiting time in the other two methods is over 45 seconds and over 50 seconds. Our model reduces the average waiting time by about 26.7%.

VIII. CONCLUSION

In this paper, we propose to solve the traffic light control problem using a deep reinforcement learning model. The traffic information is gathered from vehicular networks. The states are two-dimensional values with the vehicles' position and speed information. The actions are modeled as a Markov decision process, and the rewards are the cumulative waiting time difference between two cycles. To handle the complex traffic scenario in our problem, we propose a double dueling deep Q network (3DQN) with prioritized experience replay. The model can learn a good policy under both rush hours and normal traffic flow rates. It can reduce the average waiting time by over 20% from the start of training. The proposed model also outperforms others in learning speed, which is shown by extensive simulations in SUMO and TensorFlow.

REFERENCES

[1] S. S. Mousavi, M. Schukat, P. Corcoran, and E. Howley, "Traffic light control using deep policy-gradient and value-function based reinforcement learning," arXiv preprint arXiv:1704.08883, April 2017.
[2] W. Genders and S. Razavi, "Using a deep reinforcement learning agent for traffic signal control," arXiv preprint arXiv:1611.01142, November 2016.
[3] N. Casas, "Deep deterministic policy gradient for urban traffic light control," arXiv preprint arXiv:1703.09035, March 2017.
[4] S. El-Tantawy, B. Abdulhai, and H. Abdelgawad, "Design of reinforcement learning parameters for seamless application of adaptive traffic signal control," Journal of Intelligent Transportation Systems, vol. 18, no. 3, pp. 227–245, July 2014.
[5] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, Cambridge, March 1998, vol. 1, no. 1.
[6] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, January 2016.
[7] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., "Mastering the game of Go without human knowledge," Nature, vol. 550, no. 7676, p. 354, October 2017.
[8] M. Abdoos, N. Mozayani, and A. L. Bazzan, "Holonic multi-agent system for traffic signals control," Engineering Applications of Artificial Intelligence, vol. 26, no. 5, pp. 1575–1587, May–June 2013.
[9] H. Hartenstein and L. Laberteaux, "A tutorial survey on vehicular ad hoc networks," IEEE Communications Magazine, vol. 46, no. 6, June 2008.
[10] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, February 2015.
[11] L. Li, Y. Lv, and F.-Y. Wang, "Traffic signal timing via deep reinforcement learning," IEEE/CAA Journal of Automatica Sinica, vol. 3, no. 3, pp. 247–254, July 2016.
[12] E. van der Pol, "Deep reinforcement learning for coordination in traffic light control," Master's thesis, University of Amsterdam, August 2016.
[13] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas, "Dueling network architectures for deep reinforcement learning," arXiv preprint arXiv:1511.06581, November 2015.