2022 - Hao Zhang

Neurocomputing 500 (2022) 329–340
Contents lists available at ScienceDirect
Neurocomputing
journal homepage: www.elsevier.com/locate/neucom
A temporal fusion transformer for short-term freeway traffic speed

multistep prediction
Hao Zhang a, Yajie Zou b,⇑, Xiaoxue Yang c, Hang Yang d
a
Key Laboratory of Road and Traffic Engineering of Ministry of Education, Tongji University No. 4800 Cao’an Road, Shanghai 201804, China
b
c
d
Faculty of Maritime and Transportation, Ningbo University No.769 Qixing South Road, Ningbo 315832, China
a r t i c l e i n f o a b s t r a c t
Article history: Accurate short-term freeway speed prediction is a key component for intelligent transportation manage-
Received 20 September 2021 ment and can help travelers plan travel routes. However, very few existing studies focus on predicting
Revised 1 April 2022 one-hour ahead or longer freeway speed. In this study, a novel architecture called Temporal Fusion
Accepted 23 May 2022
Transformer (TFT) is adopted to predict freeway speed with the prediction horizons from 5 min to
Available online 26 May 2022
Communicated by Zidong Wang
150 min. The TFT can capture short-term and long-term temporal dependence by a multi-head attention
mechanism. Moreover, the TFT utilizes the fusion decoder to import various types of inputs which can
improve the prediction accuracy. To demonstrate the advantage of the TFT, traffic speed data collected
Keywords:
Deep learning
from an interstate freeway in Minnesota are used to train and test the prediction model. The TFT predic-
Temporal fusion transformer tion performance is compared with several classic traffic prediction methods, and the results reveal that
Traffic speed the TFT performs best in speed prediction when the prediction horizon is longer than 30 min.
Multistep prediction Ó 2022 Elsevier B.V. All rights reserved.
1. Introduction [14–20]. Because traffic speed is mainly predicted based on the

time correlation of traffic speed sequence, recurrent neural net-
Travel speed information is an essential measure of traffic con- work (RNN) has been widely used to solve this problem. The
ditions. It is important to acquire accurate travel speed prediction RNN can capture the temporal evolution of traffic speed, especially
since the travel speed information is a key input to Intelligent long short-term memory (LSTM), can learn long-range dependen-
Transportation Systems (ITS) [1,2]. Multi-step ahead travel speed cies in speed sequence. Various topologies of RNN were designed
prediction allows transportation management agencies to relieve to predict multi-step ahead traffic speed [21–23]. However, two
traffic congestion in advance [3]. Moreover, it can help travelers issues were associated with traditional RNN models: huge demand
plan travel routes and improve travel efficiency [4–7]. It is valuable for computing resources during training and inability to consider
to improve the multi-step ahead travel speed prediction accuracy. long-term dependency [24]. To overcome the disadvantages of
Nonetheless, due to the difficulty of capturing temporal and spatial RNN mentioned above, researchers began to study a novel archi-
dependence of traffic data, accurate prediction of travel speed is tecture: attention mechanism. Because the attention mechanism
particularly challenging [8,9]. can model the dependencies in sequence more efficiently, it has
Previous studies have examined the multi-step ahead traffic been widely used in time series prediction areas [16,25–30]. A clas-
speed prediction. These traffic prediction studies mainly focus on sic example is Transformer networks, which rely entirely on atten-
varieties of traditional statistical methods such as vector autore- tion mechanisms to capture the dependencies between inputs and
gressive (VAR) models and integrated auto-regressive moving outputs [31]. Some Transformer-based models have been applied
average (ARIMA) based models [10–13]. The growing available to traffic studies, such as vehicle trajectory prediction [32–34]
traffic data provided transportation researchers an opportunity to and pedestrian trajectory prediction [35]. None of these studies
explore this problem from new perspectives. With the develop- use Transformer-based models to predict short-term freeway
ment of computational power, more neural network-based models speed. Transformer-based models can outperform state of the art
were applied in the realm of multi-step ahead traffic prediction prediction models according to the results of these research.
However, there are two issues related to these research: (1)
⇑ Corresponding author. Practical multi-step ahead traffic prediction commonly considers
E-mail address: [email protected] (Y. Zou). various inputs: information about the future that can be known
https://doi.org/10.1016/j.neucom.2022.05.083
0925-2312/Ó 2022 Elsevier B.V. All rights reserved.
H. Zhang, Y. Zou, X. Yang et al. Neurocomputing 500 (2022) 329–340
at the time of prediction, such as future holiday information; static to widely used neural networks, machine learning methods also
information that does not change over time, such as the position of include support vector machine (SVM), decision tree-based meth-
the detector [36]. These studies usually fail to incorporate the various ods, and Bayesian-based methods. For example, Wu et al. [47]
inputs mentioned above in multi-step ahead prediction [37]. The demonstrated the feasibility of applying SVM in travel-time pre-
main issues that affect traffic speed prediction accuracy are not the diction. Hou et al. [48] developed random forest and regression
seasonal pattern, but some external random factors (e.g., holidays, tree to predict traffic flow for planned work zone events. Polson
special events, traffic incidents, etc.) [38]. When using data-driven and Sokolov [49] utilized Bayesian-based methods to estimate
models for prediction, including these factors as inputs can enhance the traffic density state. Moreover, various types of neural net-
the model’s performance [39]. (2) To the best of our knowledge, most works were developed to predict traffic flow. In the beginning,
recent studies predicted traffic features under a one-hour extent feed-forward neural networks and their variants were mainly used.
[5,37]. However, it is also useful to predict traffic information further Chan et al. [50] utilized a new method to train neural network. The
than one hour [40,41]. Based on a large amount of traffic data, it is method combined the hybrid exponential smoothing and the
possible for researchers to predict information at any point in the Levenberg–Marquardt (LM). And finally, it was used for multi-
future [42–46]. This study selects ARIMA and several representative step ahead traffic prediction. Then researchers found that RNN is
machine learning models to analyze prediction performance under good at capturing the temporal information in traffic time series.
the prediction horizon between 5 min and 150 min. RNN based methods have a huge impact on traffic prediction. Ma
To address drawbacks above, the Temporal Fusion Transformer et al. [21] proposed LSTM to predict traffic speed. And the data
(TFT), a transformer-based architecture is utilized to predict traffic used in this study were collected by traffic microwave detectors.
speed in this study. Unlike the aforementioned methods, the TFT is Wang et al. [23] proposed the bidirectional long short-term mem-
able to consider various input variables and perform well for the ory neural network (Bi-LSTM NN) to predict network-wise traffic
traffic speed prediction further than one hour. The TFT introduces speed information. RNN based methods were also applied to other
several novel architectures to improve the prediction performance. traffic prediction tasks, such as travel time prediction [51–53] and
(1) The TFT utilizes the gating module and the variable screening demand forecasting [54,55].
network to incorporate temporal information of the traffic speed
data at different scales. The static information encoder is used to
2.2. Attention mechanism based deep learning methods
encode the number of detectors at the data collection site. (2)
The TFT employs a sequence to sequence layer to capture the
The attention mechanism can model the dependence in
short-term temporal dependence, and a self-attention mechanism
sequence more efficiently, it has been widely used in time series
to capture the long-term time correlation in traffic speed time ser-
prediction areas. For instance, Guo et al. [56] combined an atten-
ies. The contributions of this paper are two-fold: (1) A comparative
tion mechanism and a convolutional network to analyze spatial
study between the TFT and benchmark models was implemented
and temporal correlation of PeMSD4 and PeMSD8. However, it
for the prediction horizon between 5 min and 150 min. The results
neglected the fact that the autocorrelation coefficient of time series
show that the TFT performs best when the prediction horizon is
decreases with the increase of time lag. Liao et al. [57] introduced a
longer than one hour. This study provides a guideline for develop-
hybrid sequence-to-sequence framework to predict traffic speed,
ing prediction models for freeway speed prediction further than
incorporating three exogenous information. Exogenous informa-
one hour. (2) Short-term freeway speed prediction information is
tion included geographical attributes, crowd map queries, and road
a key component to intelligent transportation management. Accu-
intersections. A classic attention mechanism-based model is Trans-
rately predicting short-term freeway speed can help travelers plan
former networks, which rely entirely on attention mechanisms to
travel routes and relieve traffic congestion in advance.
capture the dependence between inputs and outputs. Vaswani
The rest of this paper is organized as follows: The second section
et al. [31] used transformer to model traffic time series. The trans-
provided the literature review of traffic state prediction. The third
former introduced attention mechanism and a position encoding
section introduced the framework of freeway speed prediction and
strategy. Since then, various transformer based models were used
the architecture of the TFT. To further evaluate the performance of
for time series forecasting [18,19]. Xue et al. [60] developed a
the TFT, a comparative study between the TFT and several classic
transformer based traffic flow prediction framework. This frame-
methods was conducted in the fourth section. The fourth section also
work proposed a novel prediction consistency block that can inte-
described the data we used in this study in detail. Finally, the conclu-
grate the learning of long-term correlation with short-term
sion of this study and possible future works are summarized. In addi-
prediction. Yu et al. [35] proposed a spatio-temporal graph trans-
tion, this section summarized the advantages and limitations of the
former framework to solve trajectory prediction problem. Giuliari
TFT which are helpful for later researchers.
et al. [32] considered the original Transformer Network (TF) and
the larger Bidirectional Transformer (BERT) in their study. They
2. Literature review
are utilized to predict pedestrian trajectory. Liu et al. [33] proposed
a novel transformer-based architecture for multimodal motion
In the past decades, multi-step ahead traffic prediction problem
prediction, called mmTransformer. Chen et al. [34] developed a
has been widely studied. In general, the predicted traffic variables
novel framework called spatio-temporal transformer network. This
of these research include traffic speed, traffic flow, travel time, and
framework utilized transformer to model the temporal sequences.
vehicle trajectory. Traffic prediction methods basically fall into
three categories: classical statistical methods, machine learning
methods and attention mechanism based deep learning methods. 3. Methodology
With the development of computational power and data source,
classical statistical methods are used in fewer scenarios. Thus, only 3.1. Prediction framework
latter two categories are summarized in the following section.
To overcome the aforementioned issues, TFT was adopted in
2.1. Machine learning methods this study to predict traffic speed. TFT was initially introduced by
[36]. In order to utilize the TFT in this freeway speed prediction
Multi-step ahead traffic prediction based on machine learning problem, this section introduces several procedures to process
methods attracted wide attention from researchers. In addition the data and debug the program. The prediction framework is illus-
330
trated in Fig. 1. Step 1: Preprocess raw data and extract three kinds Table 1
of inputs from raw data. Details of inputs are listed in Fig. 1. Step 2: Description of the training process of the TFT with fixed hyperparameters.
Create data format file and set the prediction horizon. Step 3: Use Pseudo-code of training the TFT with fixed hyperparameters
random search method to select optimized hyperparameters. Step
Input: Observed inputs:oi;tk:t ¼ oi;tk ; :::; oi;t
4: Train with best hyperparameters selected on the last step. Step
Known inputs:ki;tk:tþs ¼ ki;tk ; :::; ki;tþs
5: Analyze the prediction results. The pseudo-code of training TFT
Static inputs:si;tk:tþs ¼ si;tk ; :::; si;tþs

with best hyperparameters is presented in Table 1. Targets: yi;tk:t ¼ yi;tk ; :::; yi;t
Output: Targets:yi;tþs
Parameters: g- use GPU or not;
3.2. Temporal fusion transformer architecture t1 - defines the total number of time steps used by TFT;
t2 - determines length of LSTM encoder (i.e. history);
The TFT can incorporate various input variables and perform m- maximum number of epochs for training;
e- early stopping param for keras;
well for the traffic speed prediction after one hour. Let i represents
d– dropout rate;
unique detectors in traffic speed time series. Each detector i has a h– the size of hidden layer;
static covariate si . xi;t denotes other time related inputs, and yi;t l– learning rate;
denotes the target variables at time t, t 2 ½0; T i . Time dependent s– batch size;
1 /* Set the values of parameters. */
inputs include two categories xi;t ¼ oi;t ; ki;t . oi;t is observed inputs
2 /* Use GPU to accelerate learning process. */
that means they are unknown beforehand. They can only be mea- 3 /* Load raw data. */
sured when it comes to time step t. ki;t is known inputs that means 4 train, valid, test = preprocess (raw data, t1 , t2 )
they can be predetermined. The value of this variable is known 5 /* Set Hyperparameters according to hyperparameters
manager. */
when it is before time step t. This study used observed inputs
6 e, d, h, l, s = HyperparameterManager (best)
and targets until the time t(yi;tk:t ¼ yi;tk ; :::; yi;t , 7 /* Training process. */

oi;tk:t ¼ oi;tk ; :::; oi;t ) and known inputs across the full range 8 Sess = Session ()
9 TFT = ModelClass (e, d, h, l, s)
(ki;tk:tþs ¼ ki;tk ; :::; ki;tþs ). s is the prediction point. Finally, the
10 TFT = CacheData (train, valid, TFT)
prediction function is defined in Eq. (1). 11 Sess.run (TFT)
12 Valid loss = CalculateLoss (TFT, valid)
b
y i ðt; sÞ ¼ f s; yi;tk:t ; oi;tk:t ; ki;tk:tþs ; si ð1Þ 13 If valid loss < best loss:
14 Manager.update(parameters, valid loss)
To achieve optimal prediction performance, the TFT introduced 15 Best loss = valid loss
multiple novel architecture. The canonical components in TFT is 16 /* Get best parameters through manager. */
17 /* Running tests. */
used to represent each input (static, known, observed inputs).
18 Sess = Session ()
The main components of TFT are listed below. (1) Gating module 19 TFT = ModelClass (e, d, h, l, s, best parameters)
can ignore unnecessary components of the architecture. This 20 TFT = CacheData (test, TFT)
enables the structure of the model more concise, making the model 21 Output = TFT.predict ()
22 Sess.run (TFT)
applicable to complex scenarios and large datasets. (2) Variable
23 /* Calculate three indicators. */
screening network allows the model to identify the most impor- 24 Mae, Mape, Rmse = CalculateIndicator (targets, output)
tant input variables for predicting the target value and ignore the
less relevant variables. (3) Static information encoder is used to
encode the static information in practical prediction problem. This
kind of information is helpful to improve prediction accuracy. (4) dependency. The multi-head attention layer can learn long-term
Temporal dependency processing module contains a sequence- correlation. The overall architecture of the TFT is demonstrated
to-sequence layer and a multi-head attention module. The in Fig. 2. Next, this study describes each component in detail.
sequence-to-sequence layer can capture short-term temporal
Fig. 1. Framework of utilizing the TFT to predict freeway traffic speed.
331
3.2.1. Gating module GRNx ðp; cÞ ¼ LayerNormðp þ GLU x ðg1 ÞÞ ð2Þ

For machine learning models, a more complex model may not
necessarily result in better prediction performance. In the TFT, g1 ¼ W 1;x g2 þ b1;x ð3Þ
Gated Residual Network (GRN) was proposed to consider the com-
plexity of the inputs. The architecture of the GRN is shown in Fig. 3. g2 ¼ ELU ðW 2;x p þ W 3;x c þ b2;x Þ ð4Þ
It is built as a block of TFT to make the model have a suitable non-

linear structure. The input of GRN includes two vectors: the first x; x > 0
one is the primary information vector represented as p and the sec- f ð xÞ ¼ ð5Þ
@ ðex 1Þ; x 0
ond one is the context information vector represented as c. The cal-
culation process of GRN is defined in Eq. (2). The Exponential
GLU x ðcÞ ¼ rðW 4;x c þ b4;x Þ ðW 5;x c þ b5;x Þ ð6Þ
Linear Unit (ELU) activation function is defined in Eq. (5). g1 , g2
mean the outputs of two intermediate layers.
Gated Linear Unit (GLU) is the key part to realize the gate con- 3.2.2. Variable screening network
trol function. Suppose the input is c, the form of the GLU is shown The precise relationship between multiple input variables and
in Eq. (6). rð:Þ represents the sigmoid activation function, W ð:;:Þ and the target is hard to know in advance. Thus, it is difficult to know
bð:;:Þ represent weights and biases. Via the GLU, GRN can control the each variable’s specific contribution to the output. A variable
structure of the model and neglect the unnecessary layers. screening network in TFT was designed to select the most relevant
input variables. And this module can also eliminate the influence of
Fig. 2. The structure of Temporal Fusion Transformer.
332
3.2.4. Multi-head attention module

Based on the multi-head attention structure, the TFT can learn
the long-term correlation of time series. The self-attention mecha-
nism is calculated as the Eq. (11). Q is the ‘‘query”, K is the ‘‘key”,
and V is the ‘‘value”. It scales values based on relationships
between key and query. As shown in Eq. (12), Að:; :Þ is a normaliza-
tion function.
AttentionðQ; K; V Þ ¼ AðQ; K ÞV ð11Þ
pffiffiffiffiffiffiffiffiffi
AðQ ; K Þ ¼ Softmax QK T = dattn ð12Þ
The TFT used a multi-head attention structure to enhance the

fitting ability of the model. Multi-head attention is defined in
Eqs. (13) and (14). To overcome the deficiency that attention
weights cannot share values in each head, this study used the addi-
tive aggregation of all heads. The improved multi-head attention is
defined in Eqs. (15) and (16).

MultiHeadðQ ; K; V Þ ¼ H1 ; :::; HmH W H ð13Þ
ðhÞ ðhÞ ðh Þ
Hh ¼ Attention QW Q ; KW K ; VW V ð14Þ
e H
TTFMultiHeadðQ ; K; V Þ ¼ HW ð15Þ
X
mH
H e ðQ; K ÞVW V ¼ 1=H
e ¼A ðhÞ ðhÞ
Attention QW Q ; KW K ; VW V
ðhÞ
ð16Þ
Fig. 3. The structure of Gated Residual Network. h¼1
variables that is useless to improve the prediction accuracy. Entity 3.2.5. Temporal fusion decoder
embeddings were used to represent categorical variables. Linear To learn temporal relevance in the dataset, multiple layers were
transformations were used to represent continuous variables. The carefully designed in the temporal fusion decoder. The last layer
TFT encodes the input variable as a d dimensional vector. The was used to generate output. s 2 f1; :::; smax g represents the predic-
dimension of the vector matches the number of nodes in the net- tion step.
work. There is a dedicated variable screening network for each
kind of input variable. As the variable screening networks of three (1) Local correlation processing layer.
input variables take the same form, this study presents the net-
work for past inputs in detail. In time series, the value of a point is closely related to the value
The calculation process of variable screening weights is shown of its surroundings. This study used a sequence-to-sequence model
ðiÞ to capture local dependence. A gated skip connection was used as
in Eq. (7). nt is the output of gated residual network.cs is the con-
the input layer of the temporal fusion decoder.
text vector, which is from static covariate encoder. Then this study
used the weights obtained above to combine the input variables. e ðt; nÞ ¼ LayerNorm e
/ n tþn þ GLUe ð/ðt; nÞÞ ð17Þ
ðiÞ /
The process is shown in Eq. (8). As shown in Eq. (9), e
n t was calcu-
ðiÞ
lated by feeding nt in GRN.

v vt ¼ softmax GRNv v ðNt ; cs Þ ð7Þ (2) Encoding static information layer.
Xmx ð iÞ
e
nt ¼ i¼1
v vðiÞ en t ð8Þ
As static information usually has a remarkable influence on
t
time series prediction accuracy, this study used a static informa-
ð iÞ
tion encode layer to encode the static input variable. The static
e
n t ¼ GRNe
ðiÞ
nt ð9Þ enrichment layer takes the form as Eq. (18). n is position index,
n ðiÞ
ce is context vector.
3.2.3. Static information encoders e ðt; nÞ; ce

hðt; nÞ ¼ GRNh / ð18Þ
Compared with previous time series prediction models, the TFT
used information from static input variables. It employed different
GRN encoders to generate four kinds of context vectors, cs , ce , cc , ch .
These different kinds of context vectors were used in different (3) Core self-attention layer.
places in the temporal fusion decoder. Static information plays
an important role in these places: (1) Temporal variable selection The next layer is the self-attention layer. t is the forecast time,
(cs ); (2) Locally process temporal features (cc , ch ); (3) Enrichment HðtÞ is the matrix of a static-enriched temporal feature. BðtÞ takes
of temporal features (ce ). Taking the temporal variable selection the form as BðtÞ ¼ ½bðt; kÞ; :::; bðt; smax Þ. To facilitate training, an
context vector as an example, the function is defined as follows. additional gating layer is following the temporal self-attention
layer.
cs ¼ GRNcs ðfÞ ð10Þ
BðtÞ ¼ Interpretable Multi Head ðHðt Þ; HðtÞ; Hðt ÞÞ ð19Þ
333
dðt; nÞ ¼ Layer Norm ðhðt; nÞ þ GLU d ðbðt; nÞÞÞ ð20Þ ber 1st to November 30th (22 weekdays) to tune hyperparameters,
and a test group from April 2nd to April 30th (21 weekdays) for
performance evaluation. The selected models were used to predict
(4) Forward propagation layer. speed in the next 5 min, 15 min, 30 min, 60 min, 90 min, 120 min,
150 min. The target group was divided into two different parts: off-
This layer is used to process the calculation results of the self- peak hours and peak hours, as traffic speed in peak hours shows a
attention architecture. Moreover, there is a gated residual network completely different pattern compared with off-peak hours. The
in the TFT directly connecting to the next layer. peak hour period is from 7:00 AM to 9:00 AM and 3:30 PM to
wðt; nÞ ¼ GRN w ðdðt; nÞÞ ð21Þ 7:30 PM.
In order to verify the performance of the TFT, this study selected
e ðt; nÞ ¼ Layer Norm /
e ðt; nÞ þ GLU ðwðt; nÞÞ ARIMA, Multilayer Perceptron (MLP), SVM, and RNN as benchmark
w e ð22Þ
w models. The result analysis was categorized into peak hour part
and off-peak hour part. Three indicators were used to measure
b e ðt; sÞ þ b
y ¼ Ww ð23Þ the experimental results of each model: mean absolute error
(MAE), mean absolute percentage error (MAPE) and root mean
square error (RMSE). The optimal parameters of ARIMA were
4. Results and discussion achieved in R, using forecast package. This study determined opti-
mal parameters of the ARIMA based on the best Akaike Information
4.1. Data description Criterion (AIC) value. Radial Basis Function (RBF) was used as the
kernel function of SVM. Python language was used to implement
The traffic data were collected from Interstate 394 freeway in the remaining three machine learning algorithms and the TFT. This
Minnesota. Five neighboring traffic detectors were deployed along study used random search method to select optimized hyperpa-
the freeway. The total length of the studied road section is about rameters of the TFT. The search ranges for hyperparameters are
1.7 miles. There are detectors at intervals of about 0.5 miles. The listed in Table 2, the optimal hyperparameters are marked in bold.
locations of five detectors are shown in Fig. 4. The eastbound direc- Each machine learning algorithm was executed 5 times to reduce
tion has three lanes. The tool designed by the Minnesota Depart- randomness.
ment of Transportation was employed to download speed
datasets. The data were collected every 5 min from Nov. 1, 2017
to Apr. 30, 2018 [34]. The data missing rate was around 0.01%, 4.2. Result analysis and comparison
and missing records were imputed using the historical average
method. Because the traffic flow is usually low at night, and the Five different runs are conducted for the TFT to reduce the ran-
traffic speed pattern shows different characteristics during week- domness of the prediction results. The prediction results of each
ends, the data from 6:00 AM to 8:00 PM during weekdays were prediction horizon of the TFT are shown in Figs. 7 and 8. The model
used in this study. Fig. 5 demonstrates the traffic speed distribu- prediction results of each run are different. Note that the predic-
tion on Thursday, Apr. 12, 2018. There are two distinct peak peri- tion results are relatively different for some runs. For example,
ods on Thursday. As shown in Fig. 6, the time autocorrelation of for 150-minute ahead prediction, the minimum MAE value is
speed time series decreases with the increase of time lag. Consid- 13.48 and the maximum MAE value is 16.49 for peak hour traffic.
ering this low dependence, it is difficult to accurately predict traffic MAPE values range from 29.39 to 51.55. During off-peak hours, the
speed after one hour. traffic speed is more stable than peak hours and the prediction is
To check the performance of the proposed TFT model, the data- simpler, so there is smaller difference between five runs. This hap-
set was categorized into three groups: the data from January 1st to pens because deep learning is a random machine learning algo-
March 30st (42 weekdays) were selected as the training group to rithm. On the one hand, when the neural network is trained, it
determine the model parameters, a validation group from Novem- initializes the weights of the neural network, which is random.
Fig. 4. Location of I-394 road, from detector A to detector E [36].
334
Fig. 5. Traffic speed on Thursday, Apr. 12, 2018.
the more difficult the prediction and the more uncertain the pre-
diction results will be.
Table 3 demonstrates the prediction results of different predic-
tion horizon over all models during peak hours. The results of the
TFT model in Tables 3 and 4 are the arithmetic mean of the results
of five repeated experiments. Bold values indicate the smallest pre-
diction errors. A general trend can be found that the prediction
error for all models increases as the prediction horizon becomes
larger. For 5-minute to 15-minute ahead prediction, the TFT has
no advantages compared to other models. When the prediction
horizon is larger than 30 min, the prediction performance of the
TFT is gradually preferred over other prediction models. Specifi-
cally, the TFT has obvious advantages when the prediction horizon
is over 60 min, indicating that the TFT model can capture short-
term and long-term temporal features from traffic data. In TFT,
the multi-head self-attention module captures long-term depen-
Fig. 6. Autocorrelation of traffic speed at detector C. dencies, while a sequence-to-sequence layer learns short-term
dependencies. Traditional and machine learning methods
employed model-based specifications to analyze the seasonal pat-
Table 2
Hyperparameter search ranges.
tern and input time steps. The proposed temporal fusion decoder of
TFT can achieve the same effect. Thus, the prediction performance
Hyperparameter Range of the TFT has less tendency to change compared with other mod-
State size 10, 20, 40, 80, 160, 240, 320 els with increasing prediction horizon. Table 4 presents prediction
Dropout rate 0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.9 results in off-peak hours. The prediction performance during the
Minibatch size 10, 20, 30, 64, 128, 256
off-peak period is better than that during the peak period, which
Learning rate 0.0001, 0.001, 0.01
Max. gradient norm 0.01, 1.0, 100.0, 200.0 is consistent with the study conducted by Yang et al. [61]. The rea-
Num. heads 1, 2, 3, 4 son is that the speed of the selected off-peak hours has less fluctu-
ation. The adopted TFT model also has a good effect on the multi-
Bold values indicate the optimal hyperparameters.
step ahead prediction of traffic speed in off-peak hours. When it
comes to the prediction during off-peak period, the TFT model
has a better prediction accuracy than other models at 30 min
On the other hand, this study sets the dropout rate in the algorithm and later, which has the same trend as in the peak hours.
to avoid over-fitting. Setting the dropout rate means that the algo- Figs. 9 and 10 provide the proportion of performance improve-
rithm will randomly discard some neurons and network nodes ment by introducing the TFT model for different prediction hori-
during the training process. For the prediction results of the TFT zons in peak hours and off-peak hours. In terms of prediction
model, it can be found that three performance indicators generally horizon between 30 min and 150 min, the TFT model performs best
become worse with the increase of the prediction horizon. How- among all prediction models in both off-peak hours and peak
ever, the prediction performance indicators occasionally improve hours. Compared with the ARIMA model’s performance in the peak
with the increase of prediction horizon. This phenomenon is hours, the MAPE value has an improvement of about 148% under
caused by the randomness of the algorithm. In addition, it can be the prediction horizon of 150 min. During the off-peak hours, the
clearly found from the heatmap that with the increase of the pre- MAE value has an increase of about 97% under the prediction hori-
diction horizon, the randomness of the experimental results also zon of 120 min. This is mainly because traditional methods such as
becomes larger. It is because the larger the prediction horizon, the ARIMA is difficult to model long-term temporal feature and
335
Fig. 7. The results of five runs of the TFT in peak hours.
Fig. 8. The results of five runs of the TFT in off-peak hours.
Table 3
Prediction results of different models for peak hours at detector c.
MAE Prediction horizon

5 min* 15 min 30 min 60 min 90 min 120 min 150 min
ARIMA 4.80 8.14 11.70 16.86 21.10 22.74 23.16
SVM 4.68 7.54 10.90 15.25 19.11 19.72 19.08
MLP 4.56 7.47 10.13 15.18 17.25 18.21 17.55
RNN 4.64 7.85 10.81 14.00 17.63 17.94 18.27
TFT 4.73 7.79 8.72 10.85 11.84 13.08 14.81
MAPE 5 min 15 min 30 min 60 min 90 min 120 min 150 min
ARIMA 16.30 29.56 44.75 66.44 88.86 97.33 102.60
SVM 15.74 26.32 41.45 64.88 87.57 90.35 90.15
MLP 15.17 25.92 36.63 63.28 73.98 79.08 75.81
RNN 14.86 26.13 41.14 54.78 80.05 81.48 85.77
TFT 10.89 18.44 21.89 28.85 30.34 32.18 41.35
RMSE 5 min 15 min 30 min 60 min 90 min 120 min 150 min
ARIMA 7.16 11.82 15.90 21.07 25.95 27.45 27.93
SVM 7.03 11.46 15.35 20.17 23.96 24.71 24.36
MLP 6.81 11.34 14.70 19.52 20.68 21.55 21.28
RNN 7.00 11.74 14.70 18.09 20.76 21.17 21.51
TFT 7.08 11.42 12.60 15.37 16.63 17.58 19.52
*min denotes minutes.

Bold values indicate the best prediction results.
complex nonlinear relationship. With the increase of the prediction decoder that capture temporal feature, the TFT is more stable in
horizon, the prediction error of the selected benchmark model the prediction after 30 min. For prediction time interval larger than
increases obviously. Especially after 60 min, the accuracy of predic- 30 min, the great advantage of the TFT model can be verified from
tion decreases very seriously. Through introducing the architec- Fig. 9. As the prediction horizon increases, the proportion of perfor-
tures that deal with different data characteristics and the mance improvement increases. For example, the proportion of per-
336
Table 4
Prediction results of different models for off-peak hours at detector c.

5 min* 15 min 30 min 60 min 90 min 120 min 150 min
ARIMA 2.34 3.35 4.32 6.15 8.85 10.73 11.96
SVM 2.18 3.26 4.20 5.73 6.60 6.77 6.69
MLP 2.40 3.48 4.70 5.96 9.73 9.55 9.60
RNN 2.20 3.43 4.72 7.13 10.94 9.72 11.15
TFT 2.14 3.14 3.31 3.92 4.09 5.44 6.94
MAPE 5 min 15 min 30 min 60 min 90 min 120 min 150 min
ARIMA 4.83 7.57 9.71 13.40 17.42 20.58 22.88
SVM 4.52 7.32 9.51 13.47 14.83 15.06 14.96
MLP 4.84 7.67 10.10 13.32 18.90 18.67 18.97
RNN 4.45 7.60 10.24 14.78 20.56 18.91 21.33
TFT 4.61 8.48 9.49 10.47 11.29 12.80 15.32
RMSE 5 min 15 min 30 min 60 min 90 min 120 min 150 min
ARIMA 4.04 6.49 7.94 9.98 13.83 16.33 17.84
SVM 3.92 6.47 8.28 10.12 10.47 10.28 11.79
MLP 4.05 5.59 8.36 9.21 11.97 11.86 12.05
RNN 4.00 6.67 8.44 11.27 12.54 11.59 12.10
TFT 3.89 6.60 7.37 8.50 8.46 10.03 11.33

Bold values indicate the best prediction result.
Fig. 9. The proportion of performance improvement of the TFT in peak hours.
Fig. 10. The proportion of performance improvement of the TFT in off-peak hours.
formance improvement of 150 min is greater than that of 1 h. In The prediction result on Apr. 13, 2018 of detector c is demon-
other words, the error of the TFT increases slightly, the prediction strated in Fig. 11. The 5-minute ahead prediction results of TFT
accuracy is more stable compared with other models. From Fig. 10, are compared with observed traffic speed data. In addition to the
we can see that there is no upward trend in the rate of promotion morning and evening rush hours, there is an occasional traffic
during off-peak hours. This is probably because the benchmark jam at noon on this date. As shown in Fig. 11, TFT can accurately
models show good prediction accuracy for off-peak periods. predict not only frequent congestion but also occasional conges-
337
ment of about 24% for 30-minute ahead prediction and 37% for
150-minute ahead prediction during peak hours. In off-peak hours,
the MAE value reduces about 42% and 60% for 30-minute ahead
and 150-minute ahead prediction respectively, when comparing
TFT with RNN. Moreover, with the increase of the prediction hori-
zon, the prediction error of the selected benchmark model
increases obviously, especially when the prediction horizon is lar-
ger than 60 min. However, the TFT is more stable when the predic-
tion horizon is 60 min or longer.
Finally, the advantages and disadvantages of the TFT are sum-
marized. There are three advantages: (1) The TFT can consider dif-
ferent types of variables in real datasets, such as static input,
known input and observation input. (2) The TFT employs the
self-attention mechanism to capture the short-term and long-
term time correlation in time series. The self-attention mechanism
can model the time correlation better than the recurrent networks
of traditional RNN. (3) The TFT has great advantages over tradi-
Fig. 11. Prediction results of the TFT on Apr. 13, 2018 for detector c (5-minute tional prediction models when the prediction horizon is longer
ahead).
than one hour. It also provides experience for developing predic-
tion models with better long-term prediction performance. Note
Table 5 that the limitations of the TFT include: (1) The TFT is a deep learn-
Prediction accuracy of different models for Apr. 13, 2018. ing model that needs a large amount of data to train the model to
achieve good prediction performance. (2) This model is only suit-
able for time series prediction because it needs time information
60 min* 90 min 120 min
to improve the prediction performance. Future work can be con-
ARIMA 10.67 10.94 11.69 ducted by predicting the roadway network traffic state through
MLP 10.23 10.74 11.06 combining the TFT and the graph neural network, as the graph
SVM 9.55 10.28 10.99
RNN 9.4 10.8 12.94
neural network can capture the spatial correlation. Another inter-
TFT 7.36 9.4 9.49 esting research direction is fusing the traffic data with multi types
MAPE 60 min 90 min 120 min of information, such as geographic information and weather infor-
ARIMA 30.49 34.26 36.67 mation into speed prediction.
MLP 28.37 32.8 35.4
SVM 29.1 33.58 36.04
RNN 27.48 32.72 38.61 CRediT authorship contribution statement
TFT 20.21 27.48 19.37
RMSE 60 min 90 min 120 min
Hao Zhang: Methodology, Software, Validation, Formal analy-
ARIMA 13.58 14.77 15.56
MLP 12.05 13.47 14.2
sis, Writing – original draft. Yajie Zou: Conceptualization,
SVM 13.51 14.69 15.98 Resources, Writing – review & editing, Funding acquisition, Super-
RNN 13.46 14.54 16.18 vision. Xiaoxue Yang: Data curation, Writing – review & editing.
TFT 10.67 13.46 14.26 Hang Yang: Writing – review & editing.
Bold values indicate the best prediction result. Declaration of Competing Interest
The authors declare that they have no known competing finan-

tion. This has practical significance for traffic management in city. cial interests or personal relationships that could have appeared
It can be seen from Table 5 that the TFT is much better than other to influence the work reported in this paper.
models for predicting occasional congestion after one hour. Inter-
estingly, the prediction effect of RNN is worse than that of ARIMA Acknowledgments
when the prediction horizon is 120 min. This phenomenon shows
that in the prediction of occasional congestion, it is very difficult This research was funded by the National Natural Science Foun-
for some traditional machine learning methods to predict accu- dation of China (Grant No. 52172331) and the Fundamental
rately when the prediction horizon is 120 min. Research Funds for the Central Universities (Grant no.
22120220013).
5. Conclusion References
This study presents a novel temporal fusion transformer frame- [1] Z. Zhang, Y. Li, H. Song, H. Dong, Multiple dynamic graph based traffic speed
work to predict travel speed and introduces several novel architec- prediction method, Neurocomputing 461 (2021) 109–117, https://doi.org/
10.1016/j.neucom.2021.07.052.
tures to achieve optimal prediction performance. The proposed TFT [2] G. Tesoriere, A. Canale, A. Severino, I. Mrak, T. Campisi, The management of
is compared with several conventional traffic speed prediction pedestrian emergency through dynamic assignment: Some consideration
models using traffic speed data collected on Interstate 394 freeway about the ‘‘refugee Hellenism” Square of Kalamaria (Greece), AIP Conf. Proc.
2186 (2019), https://doi.org/10.1063/1.5138072.
stretch in Minnesota for prediction horizon from 5 min to 150 min.
[3] I.O. Olayode, A. Severino, L.K. Tartibu, F. Arena, Z. Cakici, Performance
The prediction results demonstrate that the TFT exceeds other pre- evaluation of a hybrid PSO enhanced ANFIS model in prediction of traffic
diction models when the prediction horizon is larger than 30 min. flow of vehicles on freeways: Traffic data evidence from South Africa,
Further, when the prediction horizon is over 60 min, the TFT has Infrastructures 7 (2022), https://doi.org/10.3390/INFRASTRUCTURES7010002.
[4] E.I. Vlahogianni, J.C. Golias, M.G. Karlaftis, Short-term traffic forecasting:
obvious advantages over other prediction algorithms. Specifically, Overview of objectives and methods, Transp. Rev. 24 (2004) 533–557, https://
compared with RNN model, MAE value of the TFT has an improve- doi.org/10.1080/0144164042000195072.
338
[5] E.I. Vlahogianni, M.G. Karlaftis, J.C. Golias, Short-term traffic forecasting: [31] N.P. Vaswani, Ashish, Noam Shazeer, Attention is all you need, IEEE Ind. Appl.
Where we are and where we’re going, Transp. Res. Part C Emerg. Technol. 43 Mag. 8 (2017) 8–15, https://doi.org/10.1109/2943.974352.
(2014) 3–19, https://doi.org/10.1016/j.trc.2014.01.005. [32] F. Giuliari, I. Hasan, M. Cristani, F. Galasso, Transformer networks for trajectory
[6] G. Tesoriere, T. Campisi, A. Canale, A. Severino, F. Arena, Modelling and forecasting, Proc. - Int. Conf. Pattern Recognit. (2020) 10335–10342, https://
simulation of passenger flow distribution at terminal of Catania airport, AIP doi.org/10.1109/ICPR48806.2021.9412190.
Conf. Proc. 2040 (2018), https://doi.org/10.1063/1.5079195. [33] Y. Liu, J. Zhang, L. Fang, Q. Jiang, B. Zhou, Multimodal Motion Prediction with
[7] H. Mirzahossein, A.A. Rassafi, Z. Jamali, R. Guzik, A. Severino, F. Arena, Active Stacked Transformers, (2021) 7573–7582. 10.1109/cvpr46437.2021.00749.
transport network design based on transit-oriented development and [34] W. Chen, S2TNet : Spatio-Temporal Transformer Networks for Trajectory
complete street approach: finding the potential in Qazvin, Infrastructures 7 Prediction in Autonomous Driving, (2021).
(2022) 23, https://doi.org/10.3390/infrastructures7020023. [35] C. Yu, X. Ma, J. Ren, H. Zhao, S. Yi, Spatio-Temporal Graph Transformer
[8] X. Yin, G. Wu, J. Wei, Y. Shen, H. Qi, B. Yin, Deep learning on traffic prediction: Networks for Pedestrian Trajectory Prediction, Springer International
methods, analysis and future directions, IEEE Trans. Intell. Transp. Syst. (2021) Publishing (2020), https://doi.org/10.1007/978-3-030-58610-2_30.
1–15, https://doi.org/10.1109/TITS.2021.3054840. [36] B. Lim, S. Arık, N. Loeff, T. Pfister, Temporal Fusion Transformers for
[9] Y. Wu, H. Tan, L. Qin, B. Ran, Z. Jiang, A hybrid deep learning based traffic flow interpretable multi-horizon time series forecasting, Int. J. Forecast. (2021),
prediction method and its understanding, Transp. Res. Part C Emerg. Technol. https://doi.org/10.1016/j.ijforecast.2021.03.012.
90 (2018) 166–180, https://doi.org/10.1016/j.trc.2018.03.001. [37] I. Lana, J. Del Ser, M. Velez, E.I. Vlahogianni, Road traffic forecasting: recent
[10] M.C. Tan, S.C. Wong, J.M. Xu, Z.R. Guan, P. Zhang, An aggregation approach to advances and new challenges, IEEE Intell. Transp. Syst. Mag. (2018), https://
short-term traffic flow prediction, IEEE Trans. Intell. Transp. Syst. 10 (2009) doi.org/10.1109/MITS.2018.2806634.
60–69, https://doi.org/10.1109/TITS.2008.2011693. [38] H. van Lint, C. van Hinsbergen, Short-term traffic and travel time prediction
[11] Y. Zou, X. Hua, Y. Zhang, Y. Wang, Hybrid short-term freeway speed prediction models, Transp. Res. Circ. 22 (2012) 22–41.
methods based on periodic analysis, Can. J. Civ. Eng. 42 (2015) 570–582, [39] E. Bolshinsky, R. Freidman, Traffic flow forecast survey, Tech. Inst. Technol.
https://doi.org/10.1139/cjce-2014-0447. Report.–15. (2012) 1–15.
[12] P. Shang, X. Li, S. Kamae, Chaotic analysis of traffic time series, Chaos, Solitons [40] H. Lin, R. Zito, M. Taylor, A review of travel-time prediction in transport and
Fractals. 25 (2005) 121–128, https://doi.org/10.1016/j.chaos.2004.09.104. logistics, East. Asia Soc. Transp. 5 (2005) 1433–1448.
[13] S.R. Chandra, H. Al-Deek, Predictions of freeway traffic speeds and volumes [41] F. Jin, S. Sun, Neural network multitask learning for traffic flow forecasting,
using vector autoregressive models, J. Intell. Transp. Syst. Technol. Planning, Proc. Int. Jt. Conf. Neural Networks. (2008) 1897–1901, https://doi.org/
Oper. 13 (2009) 53–72, https://doi.org/10.1080/15472450902858368. 10.1109/IJCNN.2008.4634057.
[14] M. Rajabi, H. Khodavirdi, A. Mojahed, Acoustic steering of active spherical [42] L. Li, X. Su, Y. Zhang, Y. Lin, Z. Li, Trend modeling for traffic time series analysis:
carriers, Ultrasonics 105 (2020), https://doi.org/10.1016/j.ultras.2020.106112 An integrated study, IEEE Trans. Intell. Transp. Syst. 16 (2015) 3430–3439,
106112. https://doi.org/10.1109/TITS.2015.2457240.
[15] W. Qiao, M. Khishe, S. Ravakhah, Underwater targets classification using local [43] J. Xu, D. Deng, U. Demiryurek, C. Shahabi, M. Van Der Schaar, Mining the
wavelet acoustic pattern and Multi-Layer Perceptron neural network situation: spatiotemporal traffic prediction with big data, IEEE J. Sel. Top.
optimized by modified Whale Optimization Algorithm, Ocean Eng 219 Signal Process. 9 (2015) 702–715, https://doi.org/10.1109/
(2021), https://doi.org/10.1016/j.oceaneng.2020.108415 108415. JSTSP.2015.2389196.
[16] S. Al-Janabi, A.F. Alkaim, A nifty collaborative analysis to predicting a novel [44] I. Laña, J. Del Ser, I.I. Olabarrieta, Understanding daily mobility patterns in
tool (DRFLLS) for missing values estimation, Soft Comput. 24 (2020) 555–569, urban road networks using traffic flow analytics, Proc. NOMS 2016–2016 IEEE/
https://doi.org/10.1007/s00500-019-03972-x. IFIP Netw Oper. Manag. Symp. (2016) 1157–1162, https://doi.org/10.1109/
[17] S. Al-Janabi, A.F. Alkaim, A Comparative Analysis of DNA Protein Synthesis for NOMS.2016.7502980.
Solving Optimization Problems: A Novel Nature-Inspired Algorithm, Springer [45] R. Chrobok, O. Kaumann, J. Wahle, M. Schreckenberg, Different methods of
International Publishing (2021), https://doi.org/10.1007/978-3-030-73603- traffic forecast based on real data, Eur. J. Oper. Res. 155 (2004) 558–568,
3_1. https://doi.org/10.1016/j.ejor.2003.08.005.
[18] S. Al-Janabi, I. Al-Shourbaji, M. Shojafar, M. Abdelhag, Mobile cloud [46] Z. Hou, X. Li, Repeatability and similarity of freeway traffic flow and long-term
computing: challenges and future research directions, Proc. - Int Conf. Dev. prediction under big data, IEEE Trans. Intell. Transp. Syst. 17 (2016) 1786–
ESystems Eng. DeSE. (2018) 62–67, https://doi.org/10.1109/DeSE.2017.21. 1796, https://doi.org/10.1109/TITS.2015.2511156.
[19] A. Sharifi, M. Ahmadi, M.A. Mehni, S. Jafarzadeh Ghoushchi, Y. Pourasad, [47] C.H. Wu, J.M. Ho, D.T. Lee, Travel-time prediction with support vector
Experimental and numerical diagnosis of fatigue foot using convolutional regression, IEEE Trans. Intell. Transp. Syst. 5 (2004) 276–281, https://doi.org/
neural network, Comput. Methods Biomech. Biomed. Engin. 24 (2021) 1828– 10.1109/TITS.2004.837813.
1840, https://doi.org/10.1080/10255842.2021.1921164. [48] I. Journal, G. Page, For Riew On For Riew On, Pom (2001) 14–27.
[20] J. Artin, A. Valizadeh, M. Ahmadi, S.A.P. Kumar, A. Sharifi, Presentation of a [49] Y. Hou, P. Edara, C. Sun, Traffic flow forecasting for urban work zones, IEEE
novel method for prediction of traffic with climate condition based on Trans. Intell. Transp. Syst. 16 (2015) 1761–1770, https://doi.org/10.1109/
ensemble learning of neural architecture search (NAS) and linear regression, TITS.2014.2371993.
Complexity 2021 (2021), https://doi.org/10.1155/2021/8500572. [50] K.Y. Chan, T.S. Dillon, J. Singh, E. Chang, Neural-network-based models for
[21] X. Ma, Z. Tao, Y. Wang, H. Yu, Y. Wang, Long short-term memory neural short-term traffic flow forecasting using a hybrid exponential smoothing and
network for traffic speed prediction using remote microwave sensor data, levenberg-marquardt algorithm, IEEE Trans. Intell. Transp. Syst. 13 (2012)
Transp. Res. Part C Emerg. Technol. 54 (2015) 187–197, https://doi.org/ 644–654, https://doi.org/10.1109/TITS.2011.2174051.
10.1016/j.trc.2015.03.014. [51] Y. Duan, Y. Lv, F.Y. Wang, Travel time prediction with LSTM neural network,
[22] Z. Cui, R. Ke, Z. Pu, Y. Wang, Deep Bidirectional and Unidirectional LSTM IEEE Conf, Intell. Transp. Syst. Proceedings, ITSC. (2016) 1053–1058, https://
Recurrent Neural Network for Network-wide Traffic Speed Prediction, (2018) doi.org/10.1109/ITSC.2016.7795686.
1–11. [52] H. Zhang, H. Wu, W. Sun, B. Zheng, DEEPTRAVEL: A neural network based
[23] J. Wang, R. Chen, Z. He, Traffic speed prediction for urban transportation travel time estimation model with auxiliary supervision, IJCAI Int. Jt. Conf.
network: A path based deep learning approach, Transp. Res. Part C Emerg. Artif. Intell. 2018-July (2018) 3655–3661. 10.24963/ijcai.2018/508.
Technol. 100 (2019) 372–385, https://doi.org/10.1016/j.trc.2019.02.002. [53] Y. Hou, P. Edara, Network scale travel time prediction using deep learning,
[24] L. Cai, K. Janowicz, G. Mai, B. Yan, R. Zhu, Traffic transformer: Capturing the Transp. Res. Rec. 2672 (2018) 115–123, https://doi.org/10.1177/
continuity and periodicity of time series for traffic forecasting, Trans. GIS. 24 0361198118776139.
(2020) 736–755, https://doi.org/10.1111/tgis.12644. [54] C. Xu, J. Ji, P. Liu, The station-free sharing bike demand forecasting with a deep
[25] W. Qiao, Z. Li, W. Liu, E. Liu, Fastest-growing source prediction of US electricity learning approach and large-scale datasets, Transp. Res. Part C Emerg. Technol.
production based on a novel hybrid model using wavelet transform, Int. J. 95 (2018) 47–60, https://doi.org/10.1016/j.trc.2018.07.013.
Energy Res. (2021) 1–23, https://doi.org/10.1002/er.7293. [55] Z. Zhao, W. Chen, X. Wu, P.C.V. Chen, J. Liu, LSTM network: A deep learning
[26] S. Al-Janabi, A. Alkaim, E. Al-Janabi, A. Aljeboree, M. Mustafa, Intelligent approach for short-term traffic forecast, IET Image Process. 11 (2017) 68–75,
forecaster of concentrations (PM2.5, PM10, NO2, CO, O3, SO2) caused air https://doi.org/10.1049/iet-its.2016.0208.
pollution (IFCsAP), Neural Comput. Appl. 33 (21) (2021) 14199–14229. [56] S. Guo, Y. Lin, N. Feng, C. Song, H. Wan, Attention based spatial-temporal graph
[27] S. Al-Janabi, A.F. Alkaim, Z. Adel, An Innovative synthesis of deep learning convolutional networks for traffic flow forecasting, 33rd AAAI Conf. Artif.
techniques (DCapsNet & DCOM) for generation electrical renewable energy Intell. AAAI 2019, 31st Innov. Appl. Artif. Intell. Conf. IAAI 2019 9th AAAI Symp.
from wind energy, Soft Comput. 24 (2020) 10943–10962, https://doi.org/ Educ. Adv. Artif. Intell. EAAI 2019. (2019) 922–929. 10.1609/aaai.
10.1007/s00500-020-04905-9. v33i01.3301922.
[28] S. Al-Janabi, M. Mohammad, A. Al-Sultan, A new method for prediction of air [57] B. Liao, D. McIlwraith, J. Zhang, T. Chen, C. Wu, S. Yang, Y. Guo, F. Wu, Deep
pollution based on intelligent computation, Soft Comput. 24 (2020) 661–680, sequence learning with auxiliary information for traffic prediction, Proc. ACM
https://doi.org/10.1007/s00500-019-04495-1. SIGKDD Int. Conf. Knowl. Discov. Data Min. (2018) 537–546, https://doi.org/
[29] S. Peng, R. Chen, B. Yu, M. Xiang, X. Lin, E. Liu, Daily natural gas load forecasting 10.1145/3219819.3219895.
based on the combination of long short term memory, local mean [60] H. Xue, F.D. Salim, TRAILER: Transformer-based Time-wise Long Term Relation
decomposition, and wavelet threshold denoising algorithm, J. Nat. Gas Sci. Modeling for Citywide Traffic Flow Prediction, 1 (2020).
Eng. 95 (2021), https://doi.org/10.1016/j.jngse.2021.104175 104175. [61] X. Yang, Y. Zou, J. Tang, J. Liang, M. Ijaz, Evaluation of short-term freeway speed
[30] W. Qiao, W. Liu, E. Liu, A combination model based on wavelet transform for prediction based on periodic analysis using statistical models and machine
predicting the difference between monthly natural gas production and learning models, J. Adv. Transp. 2020 (2020), https://doi.org/10.1155/2020/
consumption of U.S., Energy. 235 (2021) 121216. 10.1016/j. 9628957.
energy.2021.121216.
339
Hao Zhang received the B.S. degree from Central South Xiaoxue Yang received the B.S. degree from Shandong
University, China, in 2020. He is currently pursuing the University, in 2017. Now, she is a PhD student in traffic
M.S. degree with college of transportation engineering, engineering, Tongji University. Her research interests
Tongji University, China. His research interests include include intelligent transportation system, autonomous
traffic flow prediction, transportation big data analysis and connected vehicles, traffic operations, traffic man-
and intelligent transportation systems. agement and control, data analysis.
Yajie Zou is associate professor at Tongji University, Hang Yang received his B.S. degree from Southwest
Shanghai, China. He holds the M.S. and Ph.D. in Trans- Jiaotong University, China, in 2014. He received Ph.D.
portation Engineering from Texas A&M University, and degree from Tongji University, China, in 2021. He will
B.S. in Engineering Mechanics from Tongji University, join Faculty of Maritime and Transportation, Ningbo
Shanghai, China. Dr. Zou’s main research interests are University, China, as an assistant professor. He special-
traffic operations, traffic safety and microscopic traffic izes in traffic operation, traffic network modelling and
simulation models. system optimization. His current research interests
include traffic state prediction in mixed network and
integration control and management in CV environ-
ment.
340

2022 - Hao Zhang

Uploaded by

Copyright:

Available Formats

2022 - Hao Zhang

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

2022 - Hao Zhang

Uploaded by

Copyright:

Available Formats

Neurocomputing 500 (2022) 329–340

Contents lists available at ScienceDirect

A temporal fusion transformer for short-term freeway traffic speed

1. Introduction [14–20]. Because traffic speed is mainly predicted based on the

Fig. 1. Framework of utilizing the TFT to predict freeway traffic speed.

3.2.1. Gating module GRNx ðp; cÞ ¼ LayerNormðp þ GLU x ðg1 ÞÞ ð2Þ

Fig. 2. The structure of Temporal Fusion Transformer.

3.2.4. Multi-head attention module

The TFT used a multi-head attention structure to enhance the

3.2.3. Static information encoders e ðt; nÞ; ce

Fig. 4. Location of I-394 road, from detector A to detector E [36].

Fig. 5. Traffic speed on Thursday, Apr. 12, 2018.

Fig. 7. The results of five runs of the TFT in peak hours.

Fig. 8. The results of five runs of the TFT in off-peak hours.

MAE Prediction horizon

*min denotes minutes.

MAE Prediction horizon

*min denotes minutes.

Fig. 9. The proportion of performance improvement of the TFT in peak hours.

The authors declare that they have no known competing finan-

You might also like