Shin y Yoon - 2023 - Performance Evaluation of Building Blocks of Spati

Received 26 October 2023, accepted 23 November 2023, date of publication 30 November 2023, date of current version 8 December 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3338223

Performance Evaluation of Building Blocks of Spatial-Temporal Deep Learning Models for Traffic Forecasting
Department of Civil and Environmental Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, South Korea
Corresponding author: Yoonjin Yoon ([email protected])
This work was supported in part by the National Research Foundation of Korea (NRF) Basic Research Laboratory under Grant
2021R1A4A1033486, and in part by the Midcareer Research Grant by the South Korean Government under Grant 2020R1A2C2010200.

ABSTRACT The traffic forecasting problem is a challenging task that requires spatial-temporal modeling
and attracts research interest from various domains. In recent years, spatial-temporal deep learning models
have improved the accuracy and scale of traffic forecasting. While hundreds of models have been suggested,
they share similar modules, or building blocks, which can be categorized into three temporal feature
extraction methods of recurrent neural networks, convolution, and self-attention and two spatial feature
extraction methods of convolutional graph neural networks (GNN) and attentional GNN. More importantly,
the models have been mostly evaluated for their entire architectures with limited efforts to characterize and
understand the performance of each category of building blocks. In this study, we conduct an extensive, multi-
faceted experiment to understand the influence of building block selection on traffic forecasting accuracy,
considering environmental characteristics and dataset distributions. Specifically, we implement six traffic
forecasting models using three temporal and two spatial building blocks. When we evaluate the models on
four datasets with diverse characteristics, the results show each building block demonstrates distinguishable
characteristics depending on study sites, prediction horizons, and traffic categories. The convolution models
demonstrate higher overall forecasting performance than other models, whereas self-attention models show
competitiveness in less frequent traffic categories, transition states, and the presence of outliers. Based on
the results, we also suggest an adaptive model evaluation framework for category-wise predictions of test
sets based on the performance of the models on validation sets. This evaluation framework improves
forecasting accuracy by up to 3.7% without further sophistication in existing model
architectures. The results enhance the utility of existing models and suggest guidelines for researchers
building traffic forecasting model architectures and for practitioners implementing these state-of-the-art
techniques in real-world applications.

INDEX TERMS Comparative study, deep learning, graph neural networks, spatial-temporal representation,
time-series prediction, traffic forecasting.

The associate editor coordinating the review of this manuscript and approving it for publication was Vlad Diaconita.

Y. Shin, Y. Yoon: Performance Evaluation of Building Blocks of Spatial-Temporal Deep Learning Models

I. INTRODUCTION
Traffic forecasting is a complex problem that requires modeling spatial-temporal features of traffic data such as speed, density, and flow, to accurately predict future traffic states. As stated in [1], traffic forecasting aims to predict traffic states from a few seconds to possibly a few hours into the future based on current and past traffic information. The accurate prediction of future traffic states is a crucial technical capability in intelligent transportation systems (ITS) [1], [2], [3], enabling applications such as network capacity evaluation [4], travel time estimation [5], signal optimization [6], and carbon emission reduction [7]. It is a long-studied problem that dates back to the 1930s, with efforts from various domains of science and engineering [8]. Owing to advancements in sensor technologies such as GPS and loop detectors, and in deep learning techniques that learn from the abundance of data, traffic forecasting problems have garnered much research interest in recent years.

In traditional approaches to the traffic forecasting problem, conventional time series models such as autoregressive integrated moving average (ARIMA) [9] and vector autoregressive (VAR) [10] have gained popularity. Other data-driven machine learning algorithms such as support vector regression (SVR) [11] and k-nearest neighbor (kNN) [12] have also been utilized. Some other studies implemented simulation [13], [14] and physical modeling [8]. Although these approaches all demonstrated promising results, their applications have had limitations in accuracy, spatial-temporal range, or computation time.

The recent surge of deep learning algorithms offered methods to fit a wide variety of functions with a larger number of parameters while avoiding overfitting problems, and researchers have been able to leverage these advanced techniques to capture the complex spatial and temporal features of transportation networks in traffic forecasting problems [3]. Recurrent neural networks (RNN) have gained popularity in capturing temporal features with their intrinsic ability to handle sequential data [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38]. The RNN methods, however, have suffered from the vanishing gradient problem, and convolution-based temporal feature extraction has been suggested to overcome this intrinsic problem of RNN [39], [40], [41], [42], [43], [44], [45], [46], [47], [48], [49], [50], [51], [52], [53], [54], [55], [56], [57], [58], [59], [60]. More recently, the self-attention mechanism [61] has demonstrated meaningful advances in traffic forecasting [22], [40], [41], [43], [46], [51], [58], [60], [62], [63], [64], [65], [66], [67], [68], [69], [70], [71], [72].

Convolutional neural networks (CNN) and graph neural networks (GNN) provide key spatial feature extraction capabilities. Although they constitute the pioneering efforts to adopt deep learning architectures to traffic forecasting [20], [23], [25], [42], [48], [49], [50], [54], CNN models have limitations in modeling the complex topology of the underlying transportation networks. In contrast, GNN takes advantage of the node-link structure to incorporate the underlying transportation network topology. By modeling traffic sensors and road segments as graph nodes, the hidden representation of a target node is learned by aggregating information from the neighboring nodes connected by edges [15], [16], [17], [21], [22], [24], [26], [27], [28], [29], [31], [32], [34], [35], [36], [37], [38], [40], [41], [43], [44], [45], [46], [47], [51], [52], [53], [55], [57], [58], [59], [60], [63], [64], [66], [67], [68], [71], [72], [73], [74], [75].

Despite the success of deep learning models in processing large datasets with high accuracy, efforts to understand each of these components, or building blocks, of the models are limited. For spatial feature extraction, Li et al. [16] proposed diffusion convolution, a convolutional GNN layer that modeled traffic flow as a diffusion process on a graph and compared its performance with ChebNet [76] on a traffic flow dataset. Cui et al. [15] suggested traffic graph convolution (TGC) and compared it with spectral CNN [77] and ChebNet [76] in terms of the number of parameters, computation time, and ability to extract localized features. In addition, they showed that the TGC model outperformed the spectral GCN-based models in overall performance. Although these studies provide comparative studies between new and existing GNN layers, they only discuss the performance in terms of overall accuracy and efficiency. In the temporal dimension, Reza et al. [70] present the overall performance comparison between SVR, LSTM, GRU, and transformer without consideration of spatial features. Therefore, an investigation beyond overall performance to characterize each building block is necessary to understand and justify traffic forecasting model architecture.

This study addresses this gap by conducting an extensive and multi-faceted experiment to characterize the building blocks of spatial-temporal deep learning models for traffic forecasting. First, we define the five categories of the building blocks through an extensive literature review. They are RNN, convolution, and self-attention for temporal feature extraction, and convolutional GNN and attentional GNN for spatial feature extraction. Subsequently, we implement six traffic forecasting models, each incorporating a distinct combination of three temporal and two spatial building blocks. To construct the models, we draw three models from previous literature, each representing a temporal building block. Through replacement of spatial building blocks in selected models with GCN [78] and GAT [79], we assemble six traffic forecasting models for the experiment. Finally, we evaluate the performance of the models on four real-world datasets with diverse characteristics. In the experiment, we assess the influence of building block selections, and analyze the performance across different traffic categories and in the presence of outliers.

As a result, we find that the convolution and self-attention-based models demonstrate advantages over the RNN-based counterparts in extracting temporal features for traffic forecasting. In the overall performance evaluation, the convolution models tend to outperform the self-attention models. However, the self-attention models show a smaller performance gap between 15-min and 60-min predictions, indicating a potential advantage in long-term forecasting. In addition, self-attention provides more accurate results in low-frequency traffic categories, and shows higher robustness against outliers than the convolution models. Furthermore, we suggest the adaptive model evaluation framework that flexibly selects models to conduct prediction based on the category-wise performance evaluation. Using this framework, traffic predictions with higher accuracy can be achieved without further sophistication in model architectures. In summary, our main contributions are fourfold:

• Categorize the building blocks for spatial-temporal deep neural networks for traffic forecasting through an extensive literature review. The categorization includes three temporal feature extraction methods - RNN, convolution, and self-attention - and two spatial feature extraction methods - convolutional GNN and attentional GNN.
• Conduct an extensive, multi-faceted experiment using six traffic forecasting models, each representing a distinct combination of three temporal and two spatial building blocks, on four different datasets. The overall performance evaluation discovers building block pairs that generally yield higher accuracy: convolutional GNN & convolution and attentional GNN & self-attention.
• Discover the characteristics of each building block. The convolution-based temporal feature extraction maximizes the performance gain in frequent traffic categories, whereas the self-attention and attentional GNN have increased robustness in infrequent conditions, such as low-frequency traffic categories, traffic transitions, and the presence of outliers.
• Propose an adaptive evaluation framework for traffic forecasting, which makes predictions using multiple models based on the performance on distinct traffic categories. The framework increases the previous state-of-the-art performance by 3.7% in a highway traffic speed prediction task, without further sophistication in previous model architectures.

The remainder of the paper is organized as follows. Section II investigates the literature on deep learning models in traffic forecasting studies. The preliminaries for this study and definitions are in Section III. The methods and data are explained in Section IV, along with the experimental setting. In Section V, we present the results and discussion of the experiment. Finally, Section VI provides the conclusion and future study.

II. LITERATURE REVIEW
Deep learning models have proven effective in various research fields such as image classification [80], object recognition [81], and machine translation [82]. With their ability to process huge data and model non-linear relationships, deep learning has also become cutting-edge in traffic forecasting studies. Following earlier works on stacked autoencoders [83] and deep belief networks [84], many studies suggested deep learning models that capture the spatial-temporal correlation of traffic data.

A. TEMPORAL FEATURE EXTRACTION
To model time-series traffic data, recurrent neural networks (RNN) and their variants, such as long short-term memory (LSTM) [85] and gated recurrent unit (GRU) [86], have gained attention in extracting temporal features for traffic forecasting models. Implementation of vanilla LSTM has shown improved performance compared to traditional models such as auto-regressive integrated moving average (ARIMA), support vector machine (SVM), and Kalman filtering [18], [30]. When traffic data were categorized into congestion levels, the LSTM model combined with the restricted Boltzmann machine (RBM) showed at most 93.8% accuracy for congestion prediction tasks [19]. The sequence-to-sequence framework has been adopted in many models for multiple prediction horizons [16], [17], [21], [27], [31], [32], [36]. In Bai et al. [33], a linear transformation layer was implemented to conduct multi-step traffic prediction. Wang et al. [34] suggested a model that utilized GRU to produce aggregated spatial-temporal representations. Several models have employed multiple layers of RNN [16], [21], [33], [35], whereas others have used the attention mechanism [27], [28], [36], [37] to capture the long-term relationship in traffic data.

Another building block to extract temporal features is convolution. In the absence of sequential computation, convolution has been able to train the models efficiently and overcome the vanishing gradient problem of RNNs. Originally suggested to process image data, earlier CNN approaches have processed traffic data into an image with each row and column representing each node of the transportation network and time step, respectively [49], [54]. Although these models have demonstrated higher forecasting power than traditional machine learning algorithms and vanilla LSTM, the CNN structure is limited as it represents only 1D spatial complexity. To model time series more appropriately, temporal convolutions such as the gated 1D causal convolution [45], [47], [48], [56], [59], [60] applied the convolution operation only along the temporal dimension. By limiting the usage of future information during the temporal feature extraction stage, causal convolutions have become applicable to traffic time-series modelling problems. The dilated causal convolution [87], which applies dilation to 1D causal convolution to increase the receptive field size with a limited number of layers, has shown improved performance [40], [41], [43], [44], [46], [51], [53], [55], [57], [58], [73].

Recently, self-attention has also been widely adopted in traffic forecasting studies. Reza et al. [70] demonstrated the advantage of the transformer architecture over RNN models. To impose sequential information of traffic data, self-attention has been implemented with various positional encoding methods. While the original Transformer [61] implemented the sinusoid to encode the position information of word sequences, Cai et al. [63] and Wen et al. [69] implemented the transformer architecture with variations in the embedding of traffic data and positional encoding. Guo et al. [64] modified the self-attention score to reflect trends in traffic data and implemented a dynamic graph convolution module to replace the position-wise feed-forward layer of the transformer. TrafficBERT [65] used the transformer encoder as in Devlin et al. [88] to retain the forecasting power when training using data from multiple sources. Wang et al. [72] proposed an approach in which the parameters for the self-attention layer are generated using the regional distribution of points of interest (PoI). Self-attention in conjunction with
other temporal feature extraction methods such as GRU [22], [34], [62] and dilated causal convolution [40], [46], [51], [58] has also been proposed. GMAN [68] and AI-GFACN [71] adopted self-attention for both spatial and temporal feature extraction. In addition, Zheng et al. [68] introduced a transform attention layer that generated spatial-temporal embedding representations for the positional embedding of future time steps. Xu et al. [67] proposed a model in which a temporal attention block followed a spatial attention block. The two attention blocks of the model shared similar structures, except that the graph convolution operation was skip-connected to the output of the spatial attention block to reflect the static structure of the transportation network.

B. SPATIAL FEATURE EXTRACTION WITH GRAPH NEURAL NETWORKS
Earlier efforts have adopted CNN to extract spatial features of traffic data. However, they operate in Euclidean space and fail to represent the complex topology of transportation networks [20], [23], [25], [26], [39], [42], [48], [49], [50], [54], [89], [90], [91].

GNNs have become a popular choice in traffic forecasting since the early adoptions by Li et al. [16] and Yu et al. [56]. The core idea of GNN is to process the data into graph structures and extract the spatial feature of each node by aggregating the information from neighboring nodes. Most GNN methods for supervised learning, such as classification and regression, can be grouped into convolutional, attentional, and message-passing GNNs based on how they aggregate neighborhood information [92].

Convolutional GNNs multiply the source node features by fixed weights and conduct aggregating operations, such as summation, pooling, and averaging, to extract target node spatial features. The most widely used methods under convolutional GNN are the group of spectral graph convolutions [76], [77], [78], which approximate the filters in the spectral domain. GraphSAGE [93] and diffusion convolution [16] are other examples of convolutional GNNs. Attentional GNNs resemble convolutional GNNs in that they multiply the source node features with scalar weights. The difference, however, is that the attentional GNNs assign the weights through a function of the source and target node features. Graph attention networks (GAT) [79] and Gated Attention Networks (GaAN) [26] are popular attentional GNN models that implement self-attention mechanisms [61]. Finally, message-passing GNNs compute output representations of a target node using a function of the target node and its neighbors. Gilmer et al. [94] is an example of message-passing GNN, which computes the message using hidden representations of source and target nodes and edges. The aggregated messages and the target node features are passed through a neural network to generate output representations. For more explanations on the taxonomy of GNNs, see Bronstein et al. [92].

As transportation networks are inherently equipped with graph structures, GNNs have become the most popular spatial feature extraction method for traffic forecasting. Convolutional GNNs have pioneered GNN-based traffic forecasting research, and have been widely used in concurrent models [15], [16], [17], [22], [28], [29], [31], [32], [33], [35], [37], [40], [43], [44], [47], [55], [56], [57], [58], [59], [60], [62], [63], [64], [67], [71], [75], [95]. Several studies [29], [37], [56], [60] adopted spectral graph convolutions [76], [78] that showed higher forecasting power over the basic deep learning models such as feed-forward neural networks and FC-LSTM. Li et al. [16] suggested diffusion convolution, which expanded the application of graph convolution to directed graphs, and has been applied in many traffic forecasting studies [44], [55], [57], [63]. Cui et al. [15] suggested traffic graph convolution (TGC), using element-wise multiplication between learnable parameters and adjacency matrices. Zhang et al. [28] implemented traffic graph convolution with an attention mechanism [96] to capture the dependencies in the time steps regardless of distances. Using a matrix factorization technique, Bai et al. [33] suggested a convolutional GNN module that can apply node-specific parameters. Attentional GNNs also have been widely used in traffic forecasting research [21], [26], [27], [36], [41], [45], [52], [66], [73]. The gated attention networks (GaAN) [26] outperformed diffusion convolution in short-term traffic forecasting when combined with GRU. GAT [79] has also been adopted in many studies [21], [27], [36], [41], [52], [73]. Park et al. [66] constructed a new attentional GNN layer that adopts the scaled dot-product attention [61] with sentinel vectors to control the information from neighbor nodes. A few studies have implemented convolutional and attentional GNNs in one model [46], [51], [72]. Message-passing GNN traffic forecasting models have also been suggested using a dual graph that predicts node and edge features [74], and using bidirectional graphs in extracting aggregated spatial-temporal features [34]. Gupta et al. [38] proposed a message-passing GNN-based model with a spatial embedding and attention mechanism based on shortest paths on graphs. Outside the existing taxonomy of GNNs, graph embedding techniques such as DeepWalk [97], LINE [98], and node2vec [99] have also been adopted to incorporate graph structures [24], [66], [68], [71], [89].

While these studies have achieved significant performance improvements, there have not been sufficient efforts to understand the performance of individual building blocks that constitute these models. Li et al. [16] introduced diffusion convolution as a convolutional GNN layer, employing it to conceptualize traffic flow as a diffusion process occurring on a graph. This approach was then compared with the more traditional ChebNet [76] in terms of performance. Similarly, Cui et al. [15] conduct a comparative analysis between the proposed traffic graph convolution (TGC) and traditional convolutional GNNs such as spectral GNN [77] and ChebNet [76] in terms of the number of parameters, computational efficiency, feature localization ability, and overall performance. For the temporal feature extraction blocks, Reza et al. [70] evaluate the performance of the transformer compared to other machine learning algorithms
architecture with the absence of spatial feature extraction. Although these studies evaluate the performance of traffic forecasting models on overall performance and computation efficiency, a comprehensive building block-wise analysis needs to be conducted considering characteristics of datasets, traffic categories, and robustness. In this study, we address this research gap through an extensive and multi-faceted experiment to reveal the inherent characteristics of the building blocks.

In Table 1, spatial-temporal traffic forecasting models are categorized by the implemented building blocks. Although several studies fall under the miscellaneous category, most studies can be categorized using the five building blocks of spatial and temporal feature extraction. Note that several studies [46], [51], [62], [71], [72] use more than two building blocks to extract the features. For more in-depth reviews of traffic forecasting studies using deep learning models, please refer to Lee et al. [3], Ye et al. [100], and Jiang et al. [101].

TABLE 1. Summary of the literature review by spatial and temporal building blocks.

III. PRELIMINARIES
This section explains the preliminaries of our study, which include the mathematical definition of the transportation network graph, graph signal, and traffic forecasting problem.

A. NOTATIONS AND DEFINITIONS
Definition 1: Transportation network graph. We represent the transportation network graph as a directed graph $G = (V, E)$, where $V$ is a set of $|V| = N$ nodes and $E$ is a set of edges representing pairwise connections between the nodes. As defined in Ye et al. [100], a node can represent a sensor, road segment, or road intersection. In this study, we used sensor and road segment graphs depending on the dataset. The hypothetical construction of each type of graph is in Fig. 1. An adjacency matrix $A = (A_{ij}) \in \mathbb{R}^{N \times N}$ is a square Boolean matrix in which $A_{ij} = 1$ if the nodes $v_i, v_j \in V$ are connected by an edge $(v_i, v_j) \in E$, and $A_{ij} = 0$ otherwise.

Definition 2: Graph signal. The signal from node $v_i$ at time $t$ is denoted as $x_t^i \in \mathbb{R}^{C}$, where $C$ is the number of features of the signal. The graph signal is a matrix containing all node signals at time $t$, denoted as $X_t = [x_t^1, x_t^2, \ldots, x_t^N] \in \mathbb{R}^{N \times C}$.

FIGURE 1. Graph construction from a transportation network. The transportation network on the left consists of 6 road segments and 5 traffic sensors. The network can be represented as (a) a sensor graph, or (b) a road segment graph considering the locations and traffic directions.

B. TRAFFIC FORECASTING PROBLEM
The traffic forecasting problem defined on the transportation network graph $G$ predicts future traffic states for $T'$ time steps based on historical traffic information such as speed, flow, and occupancy. Given historical graph signals for the past $T$ time steps on the graph $G$, the traffic forecasting problem is defined as finding a function $H$ that maps the historical data to future traffic states:

$$H : [X_{t-T+1}, \ldots, X_t; G] \rightarrow [\hat{Y}_{t+1}, \ldots, \hat{Y}_{t+T'}] \tag{1}$$

where $\hat{Y}_t \in \mathbb{R}^{N \times 1}$ is the predicted traffic state at time $t$.

IV. METHODS AND MATERIALS
This section explains the methods and materials used in this study. First, we explain the GNN-based spatial building blocks, graph convolutional networks (GCN) [78] and graph attention networks (GAT) [79], and three base models with different temporal building blocks. Then, we introduce the datasets and settings for the experiments. The study outline is shown in Fig. 2.

FIGURE 2. Overview of this study. We first define three categories for temporal and two for spatial building blocks. Combining a spatial and a temporal building block, we implement six models and conduct an extensive and multi-faceted experiment using four different real-world traffic datasets. Finally, we analyze the results on overall performance, performance in different traffic categories, performance on outliers, and adaptive model evaluation.

A. SPATIAL BUILDING BLOCKS: GRAPH NEURAL NETWORKS
To investigate the differences between convolutional and attentional GNNs in traffic forecasting research, we implemented one module from each category. Specifically, we implemented the GCN model [78] from the convolutional GNNs category. The GCN model uses the first-order Chebyshev polynomials to approximate the filter in the Fourier-transformed space and incorporates spatial relationships between nodes by aggregating information from neighboring nodes. A GCN layer with input $X_t \in \mathbb{R}^{N \times d}$ on graph $G$ and $d$-dimensional feature space at time $t$ can be expressed as follows:

$$\mathrm{GCN}(X_t, A) = \sigma(\hat{A} X_t W), \tag{2}$$

where $\sigma(\cdot)$ is an activation function, and $W \in \mathbb{R}^{d \times h}$ is the weight parameter matrix where $h$ is the output dimension. Whereas GCN originally used the normalized Laplacian matrix $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, where $\tilde{A} = I_N + A$ and $\tilde{D}_{ii} = \sum_j \tilde{a}_{ij}$, we use $\hat{A} = \tilde{D}^{-1} \tilde{A}$ to apply GCN on directed graphs. Information from further nodes can be aggregated by stacking multiple GCN layers.
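As an illustration of Eq. (2), the following is a minimal pure-Python sketch of a GCN layer with the row-normalized adjacency $\hat{A} = \tilde{D}^{-1}(I_N + A)$. It is not the implementation used in the experiments. The function names (`matmul`, `normalize_adjacency`, `gcn_layer`), the 3-node chain graph, the identity weight, the choice of ReLU as the activation, and the convention that row $i$ of $A$ lists the nodes that $v_i$ aggregates from are all assumptions made for this example.

```python
# Hedged sketch of Eq. (2): GCN(X, A) = activation(A_hat X W),
# with A_hat = D~^{-1} (I + A). All names here are illustrative assumptions.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalize_adjacency(adj):
    """Add self-loops (A~ = I + A) and divide each row by its sum (D~^{-1} A~)."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    return [[v / sum(row) for v in row] for row in a_tilde]

def relu(m):
    """Element-wise ReLU, used here as the activation (an assumption)."""
    return [[max(0.0, v) for v in row] for row in m]

def gcn_layer(x, a_hat, w, activation=relu):
    """One GCN layer: aggregate neighbor features, then project with W."""
    return activation(matmul(matmul(a_hat, x), w))

# Directed 3-node chain; A[i][j] = 1 means node i aggregates from node j.
A = [[0, 1, 0],
     [0, 0, 1],
     [0, 0, 0]]
X = [[1.0], [2.0], [4.0]]   # one feature per node, e.g. speed
W = [[1.0]]                 # identity weight keeps the arithmetic visible

A_hat = normalize_adjacency(A)
H1 = gcn_layer(X, A_hat, W)    # one layer: one-hop aggregation
H2 = gcn_layer(H1, A_hat, W)   # two stacked layers: two-hop aggregation
print(H1, H2)
```

In this toy graph, the first node only sees its direct neighbor after one layer, while after two stacked layers its representation also depends on the third node, which is the effect of stacking described above.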
From the family of attentional GNNs, we implemented GAT [79]. GAT uses self-attention mechanisms [61] to weigh the importance of each neighbor node and aggregates the information from neighbors accordingly. For each node pair $v_i$ and $v_j$ connected by edge $(v_i, v_j)$, the attention score for the $k$-th head $\alpha_{ij,t}^{(k)}$ at time $t$ is defined as follows:

$$\alpha_{ij,t}^{(k)} = \frac{\exp\left(\sigma\left(\left\langle a^{(k)}, \left[x_t^i W^{(k)}, x_t^j W^{(k)}\right]\right\rangle\right)\right)}{\sum_{v_l \in N_i} \exp\left(\sigma\left(\left\langle a^{(k)}, \left[x_t^i W^{(k)}, x_t^l W^{(k)}\right]\right\rangle\right)\right)}, \tag{3}$$

where $x_t^i \in \mathbb{R}^{d}$ is the signal of node $v_i$ at time $t$, $a^{(k)} \in \mathbb{R}^{2h'}$ is a learnable parameter vector for the $k$-th attention head with dimension $h'$, $W^{(k)} \in \mathbb{R}^{d \times h'}$ is a learnable weight parameter matrix for the $k$-th attention head, $\langle \cdot, \cdot \rangle$ is the dot product operator, $[\cdot, \cdot]$ concatenates the vectors inside the bracket, and $N_i$ is the neighbor set of node $v_i$. The GAT layer with $K$ heads applied on the node $v_i$ with graph signal $X_t$ observed from graph $G$ at time $t$ can be expressed as follows:

$$\mathrm{GAT}(v_i; X_t, G) = \mathrm{CAT}_{k=1}^{K}\left[\sigma\left(\sum_{v_l \in N_i} \alpha_{il,t}^{(k)} x_t^l W^{(k)}\right)\right], \tag{4}$$

where $\mathrm{CAT}_{k=1}^{K}[\cdot]$ concatenates the outputs of the equation in the bracket for $k = 1$ to $K$, and $W^{(k)} \in \mathbb{R}^{d \times h'}$ is a learnable parameter matrix for the $k$-th attention head. If the output dimension for GAT, $h' \times K$, is equal to that of GCN, replacing one with the other becomes possible for any traffic forecasting model.

B. TEMPORAL BUILDING BLOCKS
We studied the temporal building block characteristics using RNN-based T-GCN [29], convolution-based Graph WaveNet [55], and self-attention-based GMAN [68] and compared the results by replacing spatial building blocks of these base models with GCN and GAT. We first briefly explain the three base models used in this study.

T-GCN [29] is a spatial-temporal traffic forecasting model combining GRU [86] and 2-layer GCN for temporal and spatial feature extraction, respectively. The update gate $u_t$, reset gate $r_t$, and outputs $h_t$ of the GRU units at time $t$ on input $X_t \in \mathbb{R}^{N \times C}$ are defined as follows:

$$u_t = \sigma(W_u [f(A, X_t), h_{t-1}] + b_u), \tag{5}$$
$$r_t = \sigma(W_r [f(A, X_t), h_{t-1}] + b_r), \tag{6}$$
$$c_t = \tanh(W_c [f(A, X_t), (r_t \odot h_{t-1})] + b_c), \tag{7}$$
$$h_t = u_t \odot h_{t-1} + (1 - u_t) \odot c_t, \tag{8}$$

where $\odot$ is the element-wise Hadamard product and $\sigma(\cdot)$ is the sigmoid activation function, $f(A, X_t) = \sigma(\hat{A}\,\mathrm{ReLU}(\hat{A} X_t W_0) W_1)$ is the 2-layer GCN model with learnable parameters $W_0 \in \mathbb{R}^{C \times p}$ and $W_1 \in \mathbb{R}^{p \times d}$, $W_u$, $W_r$, and $W_c \in \mathbb{R}^{d \times d_{gru}}$ are learnable parameters, and $b_u$, $b_r$, and $b_c$ are biases. Although the original T-GCN adopted a many-to-one structure, we implemented the encoder-decoder framework for the multi-step prediction. In the following discussions, we denote the encoder-decoder T-GCN model as T-GCN, and T-GCN with GAT as T-GAT. Fig. 3 shows the architecture of T-GCN.

FIGURE 3. Architecture of T-GCN. The model extracts spatial features from input graph signal using GNN layers. Then, the extracted features are fed into GRU units to extract temporal features. The encoder-decoder framework is implemented for generating multiple time-step predictions. The GNN operation is GCN for T-GCN and GAT for T-GAT.

The Graph WaveNet [55] model combines dilated causal convolution [87] and convolutional GNN layers. Since the convolution-based temporal feature extraction requires no sequential computation, the model could overcome the vanishing gradient problem. The Graph WaveNet adopts
Y. Shin, Y. Yoon: Performance Evaluation of Building Blocks of Spatial-Temporal Deep Learning Models

FIGURE 4. Illustration of dilated causal convolution (figure is adapted

from Fig. 3 of [87].

the Gated Activation Unit (GAU) [102] for dilated causal

convolution. The convolution with input Hti−T C1V t ∈
RN ×T ×d at time t with T historical graph signals can be
defined as:

H ′t−T +1:t = GAU ((01 , 02 ) ∗conv H t−T +1:t )

= tanh(01 ∗conv H t−T +1:t ⊙ σ (02 ∗conv H t−T +1:t ), (9)

where $*_{\mathrm{conv}}$ is the dilated convolution operation with convolution kernels $\Theta_1$ and $\Theta_2 \in \mathbb{R}^{p \times d \times d}$. For a node input $h_i^{t-T+1:t} \in \mathbb{R}^{T \times d}$ for node $v_i$ at time t with T historical node signals, the dilated convolution with kernel for one output channel $\gamma \in \mathbb{R}^{p \times d}$ is defined as follows:

$$\gamma *_{\mathrm{conv}} h_i^{t-T+1:t} = \sum_{b=1}^{d} \sum_{p=1}^{p} \gamma(p, b)\, h_i^{t-T+1:t}(t - s \times p,\, b), \tag{10}$$

where the p and b inside the parenthesis in γ(p, b) are the indices of the elements of kernel γ, and s is the dilation factor. A dilated causal convolution layer is illustrated in Fig. 4. The output of the dilated causal convolution and gated activation unit is then fed to a spatial building block to generate the layer output with dimension $\mathbb{R}^{N \times (T - s \times (p-1)) \times d}$. Note that the temporal lengths of inputs for the later layers are shorter than T.

Fig. 5 shows the Graph WaveNet structure with the original spatial building block replaced by GNN. In [55], the GNN layer is implemented with a self-adaptive adjacency matrix term added to diffusion convolution [16]. The dilated causal convolution and GNN operation form a spatial-temporal layer, with residual and skip connections added to prevent information loss from stacking multiple spatial-temporal layers. For a more detailed description of Graph WaveNet, please refer to the original study [55]. This study replaces the spatial building block with GCN and GAT. Hereinafter, we denote the Graph WaveNet implemented with GCN and GAT as GWNet-GCN and GWNet-GAT, respectively.

FIGURE 5. Architecture of GWNet. The model extracts temporal features using dilated causal convolution and gated activation unit. Then, a GNN module is implemented after the convolution to extract spatial features. A spatial-temporal (ST) layer consists of a dilated causal convolution with gated activation and a GNN module, and multiple ST-layers are stacked to extract the final representation. The subscripts t in this figure indicate the graph signals from time step t − T + 1 to t.

GMAN [68] is a self-attention-based model using spatial and temporal attention modules to model traffic data. The model extracts spatial and temporal features separately and combines them using a gated fusion module. To impose positional information on the nodes and time steps, the model suggests spatial-temporal embedding, using time indicator vectors and node embedding vectors obtained by node2vec [99]. The temporal attention module of GMAN is defined as:

$$ht_{i,t}^{(l)} = \mathop{\mathrm{CAT}}_{k=1}^{K} \Big[ \sum_{\tau \in N_t} \alpha_{t,\tau}^{(k)} \cdot f_0^{(k)}\big(h_{i,\tau}^{(l-1)}, e_{i,\tau}\big) \Big] W_o + b_o, \tag{11}$$

where $ht_{i,t}^{(l)}$ is the temporal feature vector of the l-th layer for node $v_i$ at time t, $h_{i,\tau}^{(l-1)} \in \mathbb{R}^{d}$ is the output of the previous layer for node $v_i$ at time τ, $e_{i,\tau} \in \mathbb{R}^{d}$ is the spatial-temporal embedding vector, K is the number of heads for multi-head attention, $\alpha_{t,\tau}^{(k)}$ is the attention score between the time steps t and τ for head k, $N_t$ is a set of input time steps, and $f_0^{(k)}$ is a non-linear projection defined as $f_0(x) = \mathrm{ReLU}(xW + b)$ with learnable parameters $W \in \mathbb{R}^{2d \times d'}$ and $b \in \mathbb{R}^{d'}$, and $W_o \in \mathbb{R}^{Kd' \times d}$ and $b_o \in \mathbb{R}^{d}$ are learnable parameters. Here, the attention score $\alpha_{t,\tau}^{(k)}$ can be obtained as

$$\alpha_{t,\tau}^{(k)} = \frac{\exp\big(s_{t,\tau}^{(k)}\big)}{\sum_{t' \in N_t} \exp\big(s_{t,t'}^{(k)}\big)}, \tag{12}$$

where

$$s_{t,\tau}^{(k)} = \frac{\big\langle f_1^{(k)}(h_{i,t}, e_{i,t}),\; f_2^{(k)}(h_{i,\tau}, e_{i,\tau}) \big\rangle}{\sqrt{d'}}, \tag{13}$$

where $f_1^{(k)}$ and $f_2^{(k)}$ are non-linear projections, and d′ is the dimension of each head. The gated fusion is implemented to


combine the spatial features $H_S^{(l)}$ and temporal features $H_T^{(l)}$ from the attention modules:

$$H^{(l)} = z \odot H_S^{(l)} + (1 - z) \odot H_T^{(l)}, \tag{14}$$

$$z = \sigma\big(H_S^{(l)} W_{z,1} + H_T^{(l)} W_{z,2} + b_z\big), \tag{15}$$

where $W_{z,1}, W_{z,2} \in \mathbb{R}^{d \times d}$ and $b_z \in \mathbb{R}^{d}$ are learnable parameters, and σ(·) is the sigmoid activation. While the spatial attention layer to obtain spatial features $H_S^{(l)}$ is implemented in a similar manner to the temporal attention layer in the original work, we replaced the spatial attention module with GCN and GAT, denoted GMAN-GCN and GMAN-GAT (Fig. 6). The transform attention layer is implemented between the encoder and decoder to enable the multi-step prediction and reduce error propagation in the prediction task.

GMAN can be regarded as a 2-dimensional expansion of the original transformer [61]. Two parallel self-attention modules are employed to extract features from both spatial and temporal dimensions, whereas the transformer only considers a single dimension. To merge representations from the two self-attention modules, GMAN replaces the feedforward layer in the transformer with a feature fusion layer and makes one residual connection between the input and output of an encoder layer. For a more detailed description of GMAN, please refer to the original study [68].

FIGURE 6. The GMAN-GNN encoder. The model extracts temporal features using the self-attention module and spatial features using GNN modules. It can be regarded as a transformer [61] encoder expanded on spatial dimension. By excluding the GNN module and making a residual connection between the input and output of the attention layer, the GMAN encoder can be transformed into a transformer encoder. The subscripts t in this figure indicate the graph signals from time step t − T + 1 to t.

C. DATA
To analyze the performance of each model, we select four real-world datasets with diverse characteristics, namely, PeMS-Bay, METR-LA [16],¹ Urban-core, and Urban-mix [31].²

PeMS-Bay is a widely used speed dataset for traffic forecasting collected by California Transportation Agencies (CalTrans) Performance Measurement System (PeMS). The dataset contains six months of data ranging from January 1, 2017, to June 30, 2017, with a data frequency of 5 min. Spatially, 325 sensors in the Bay Area are included. We use this dataset to examine model performance for loop detector-based highway speed forecasting.

The METR-LA traffic flow dataset contains data collected from loop detectors on Los Angeles County highways and is frequently used in traffic forecasting studies. The dataset contains 5-min traffic flow data from 207 sensors, from March 1, 2012, to June 30, 2012. We use this dataset to analyze differences in model performance between traffic speed and flow datasets.

For PeMS-Bay and METR-LA, we followed the procedures in Li et al. [16] to process the datasets and generate edges between traffic sensors. We construct the graphs and build adjacency matrices based on the distances between nodes and the thresholded Gaussian kernel [103]:

$$a_{ij} = \begin{cases} \exp\left(-\dfrac{d_{ij}^{2}}{\sigma^{2}}\right), & \text{if } \exp\left(-\dfrac{d_{ij}^{2}}{\sigma^{2}}\right) \ge \epsilon \text{ and } i \ne j \\ 0, & \text{otherwise} \end{cases} \tag{16}$$

where $d_{ij}$ is the distance between sensors $v_i$ and $v_j$, σ is the standard deviation, and ϵ = 0.1 is the threshold value.

Urban-core and Urban-mix are 5-min speed data for road segments in the Seoul traffic network. Both contain information for one month ranging from April 1, 2018, to April 30, 2018. Urban-core includes 304 records of road segments in Gangnam, Seoul, one of the regions with the highest traffic and economic activities in the country. The road segments have similar structural features, such as speed limit, degree, and length.

Urban-mix is a spatial expansion of Urban-core and has road segments with more heterogeneous characteristics. It contains the inner-city highway connecting the East and West ends of the city, urban arterials, alleys, bridges, and a few intercity highway segments. The transportation network graph of Urban-mix has 1,007 road segments. The edges of transportation network graphs are set between road segments that share endpoints.

When the four datasets are compared in terms of complexity, the highway flow shows higher complexity than highway speed, and urban speed demonstrates higher complexity than highway data, as shown in Fig. 7. The average approximate entropy values [104] are 0.52, 1.20, 1.40, and 1.41 for PeMS-Bay, METR-LA, Urban-core, and Urban-mix, respectively. Table 2 summarizes the datasets.
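The thresholded Gaussian kernel used to build the adjacency matrices for PeMS-Bay and METR-LA can be illustrated with a minimal Python sketch. The function name `gaussian_kernel_adjacency` and the nested-list distance matrix are hypothetical choices for illustration, and taking σ as the population standard deviation of the off-diagonal distances is an assumption; the exact preprocessing follows Li et al. [16] and may differ.

```python
import math
from statistics import pstdev


def gaussian_kernel_adjacency(dist, epsilon=0.1, sigma=None):
    """Thresholded Gaussian-kernel adjacency from a square distance matrix.

    dist    : nested list, dist[i][j] is the distance between sensors i and j
    epsilon : weights below this threshold are set to zero
    sigma   : kernel width; if None, it is ASSUMED to be the population
              standard deviation of all off-diagonal distances
    """
    n = len(dist)
    if sigma is None:
        off_diagonal = [dist[i][j] for i in range(n) for j in range(n) if i != j]
        sigma = pstdev(off_diagonal)
    if sigma <= 0:
        raise ValueError("sigma must be positive")

    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # no self-loops: a_ii stays 0
            weight = math.exp(-(dist[i][j] ** 2) / (sigma ** 2))
            if weight >= epsilon:
                adj[i][j] = weight
    return adj
```

For example, with distances `[[0, 1, 3], [1, 0, 2], [3, 2, 0]]`, `sigma=2.0`, and `epsilon=0.2`, the weight between nodes 0 and 2 falls below the threshold and is dropped, while the other two pairs keep weights of exp(−0.25) and exp(−1).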
¹ PeMS-Bay and METR-LA datasets are available at
² Urban-core and Urban-mix datasets are available at

D. EXPERIMENTAL SETTINGS
We adopt mean absolute error (MAE), root mean squared error (RMSE), and mean absolute percentage error (MAPE)


TABLE 2. Summary of the datasets.

FIGURE 7. One day sample data from different datasets. (a) PeMS-Bay, (b) METR-LA, (c) Urban-core, and (d) Urban-mix. The traffic
flow data (METR-LA) shows higher entropy than traffic speed data (PeMS-Bay), and urban data (Urban-core and Urban-mix) show
higher entropy than highway data (PeMS-Bay).

as evaluation metrics for model performances.

$$\mathrm{MAE} = \frac{1}{T'N} \sum_{i=1}^{N} \sum_{j=1}^{T'} \left| \hat{y}_j^i - y_j^i \right|, \tag{17}$$

$$\mathrm{RMSE} = \sqrt{ \frac{ \sum_{i=1}^{N} \sum_{j=1}^{T'} \left( \hat{y}_j^i - y_j^i \right)^2 }{ T'N } }, \tag{18}$$

$$\mathrm{MAPE} = \frac{1}{T'N} \sum_{i=1}^{N} \sum_{j=1}^{T'} \left| \frac{ \hat{y}_j^i - y_j^i }{ y_j^i } \right|, \tag{19}$$

where T′ is the total number of predicted time steps, N is the number of nodes (sensors or road segments), and $\hat{y}_j^i$ and $y_j^i$ are the predicted and actual values.

We calibrated the hyperparameters of each model as closely as possible to those in the original works [29], [55], [68]. We set the number of hidden units to 64 for GMAN-GCN and GMAN-GAT and 32 for T-GCN, T-GAT, GWNet-GCN, and GWNet-GAT, batch size to 32, and learning rate to 0.001. For GAT, the number of heads and the dimension of each head are both 8. The number of layers for GMAN models was 3, except for those in Urban-mix because of memory limitation and GMAN-GAT in METR-LA because the model failed to converge with 3 layers. A 2-layer model was used in these cases. We trained the models using the Adam optimizer and the L1 loss function. The experiment was conducted on a single NVIDIA TITAN RTX with 24 GB memory (GPU) and Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20 GHz (CPU).³

³ The source codes are available at
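The three evaluation metrics in Eqs. (17)-(19) can be written out as a short Python sketch. The function names and the nested-list layout (`pred[i][j]` for node i and predicted step j) are hypothetical and chosen only for illustration. The MAPE function assumes every actual value is nonzero; how zero or missing observations are masked in the experiments is not specified here.

```python
import math


def _pairs(pred, true):
    """Yield (prediction, actual) pairs over all nodes i and predicted steps j."""
    for pred_row, true_row in zip(pred, true):
        for p, y in zip(pred_row, true_row):
            yield p, y


def mae(pred, true):
    """Mean absolute error, Eq. (17)."""
    errors = [abs(p - y) for p, y in _pairs(pred, true)]
    return sum(errors) / len(errors)


def rmse(pred, true):
    """Root mean squared error, Eq. (18)."""
    squared = [(p - y) ** 2 for p, y in _pairs(pred, true)]
    return math.sqrt(sum(squared) / len(squared))


def mape(pred, true):
    """Mean absolute percentage error, Eq. (19).

    Assumes all actual values are nonzero (no masking is applied here).
    """
    ratios = [abs((p - y) / y) for p, y in _pairs(pred, true)]
    return sum(ratios) / len(ratios)
```

With `pred = [[2, 4], [6, 8]]` and `true = [[1, 5], [5, 10]]` (two nodes, two steps), the absolute errors are 1, 1, 1, and 2, so MAE is 1.25, RMSE is the square root of 1.75, and MAPE is 0.4.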


TABLE 3. Model performance on traffic datasets.

V. RESULTS AND DISCUSSION
A. OVERALL PERFORMANCE
Table 3 shows the model performances in the four traffic datasets for 15 min (3 steps), 30 min (6 steps), and 60 min (12 steps) cases. When combined with the RNN model, GAT-based spatial feature extraction yields more accurate results than GCN, except for MAE on the 15-min forecast in METR-LA. The convolution shows improved predictions when combined with GCN, except for RMSE in Urban-mix for all prediction horizons. When using self-attention for temporal feature extraction, GMAN-GAT consistently yields better results than the GCN counterpart on at least one performance metric in all datasets, except 15-min and 30-min predictions in METR-LA. Overall, the convolution models yield the best performance among the comparative models, except in long-term (60-min) prediction in PeMS-Bay and Urban-core. Although T-GAT produces fair prediction outcomes, RNN shows no clear advantage over the other building blocks for temporal feature extraction.

The three temporal building blocks show differences in the gap between the forecasting accuracy on the 15-min and 60-min predictions. The RMSE differences between the two prediction horizons in PeMS-Bay are 70.6%, 63.7%, and 52.4% for T-GAT, GWNet-GCN, and GMAN-GAT, respectively. The differences in RMSE in all datasets are presented in Table 4. The self-attention shows robust performance against the increase in prediction horizon, yielding a smaller gap between the 15-min and 60-min prediction outcomes. This indicates possible advantages for prediction horizons longer than one hour.

B. PERFORMANCE IN DIFFERENT TRAFFIC CATEGORIES
In this subsection, we analyze the performance of each model in different traffic categories. We divided the traffic states into unequal intervals, considering the range and distribution of each dataset. In PeMS-Bay, we initially divided the speed data with equal intervals of 10 mph. However, we merged the five lower speed intervals because each interval contained few observations, and merged the two higher speed intervals for the same reason. Since the 60∼70 mph interval included nearly 80% of the data, we divided the interval into two intervals of 5 mph. Finally, we have five speed categories in PeMS-Bay: 0∼50 mph, 50∼60 mph, 60∼65 mph, 65∼70 mph, and 70∼90 mph.

Table 5 presents the results of the traffic forecasting models in PeMS-Bay, across different traffic speed categories and


TABLE 4. RMSE gap between the 15-min and 60-min prediction for all datasets. The attention-based GMAN models show smaller gaps compared to the
other models.

prediction horizons. The best performance is observed in the 65∼70 mph category, which contains the most observations. In contrast, the largest errors are observed in the 0∼50 mph category, which is furthest from the high-speed, high-frequency 65∼70 mph category. In categories above 60 mph, the category-wise errors are smaller than the overall performance. Similar to the overall performance evaluation, the convolution and self-attention models outperform the RNN model. The convolution-based GWNet-GCN achieves high performance in the high-frequency categories. For 60-min prediction, GWNet-GCN produces more accurate predictions than GMAN-GAT in the 60∼65, 65∼70, and 70∼90 mph categories. In contrast, GMAN-GAT shows more robust performance across different traffic categories than GWNet-GCN. In PeMS-Bay, the 0∼50 mph category MAE is 9.9 times larger than the 65∼70 mph category MAE for GWNet-GCN on 60-min prediction, whereas the ratio is 7.9 for GMAN-GAT. The ratios are 6.9 and 6.4 on 15-min predictions for GWNet-GCN and GMAN-GAT, respectively. Similar trends are observed in other datasets. In METR-LA, GWNet-GCN performs better in high-frequency categories (60∼65 and 65∼75 veh/h), while GMAN-GAT shows higher performance in low-frequency categories (30∼50 and 50∼60 veh/h). In the 0∼30 veh/h category, the self-attention model performance decreases, and the convolution model performance improves. For Urban-core, the distributions are right-skewed as opposed to highway datasets. Therefore, convolution models are more effective in low-speed categories, whereas self-attention models are better suited for high-speed categories. In Urban-mix, GWNet-GCN achieves the highest performance across all speed categories and prediction horizons. The category-wise performances for METR-LA, Urban-core, and Urban-mix are presented in Fig. 8.

TABLE 5. Performance in MAE by traffic speed categories in PeMS-Bay. GWNet-GCN presents high performances in high-frequency categories, while GMAN-GAT performs better in low-frequency categories.

TABLE 6. Prediction results of 60-min during traffic transitions.

We also analyze model performances in conditions where traffic states experience transitions. We denote the condition where the speed increases or decreases more than 30 mph in 90 min (18 time steps) in PeMS-Bay as speed increase and decrease transitions, respectively, and compare the 60-min prediction results. During transitions, the model performances differ from the overall performances. Whereas GWNet-GCN yielded low MAE and MAPE overall, GMAN-GAT outperformed GWNet-GCN in all performance metrics in increasing and decreasing transitions. Table 6 presents the performance of 60-min forecasting outcomes during traffic transitions for all datasets. The results on the other datasets show similar trends as for PeMS-Bay. The RNN and self-attention models show advantages over the convolution model except in Urban-mix. For the other datasets, a traffic transition is defined as a change in state of 30 veh/h, 10 km/h, and 20 km/h for METR-LA, Urban-core, and Urban-mix, respectively.
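The category-wise evaluation and the transition conditions described in this subsection can be sketched in Python as follows. The function names are hypothetical. Two details are assumptions rather than statements from the text: categories are treated as left-closed intervals with the last one also closed on the right, and a transition at step t is detected by comparing the value at t with the value `window` steps earlier.

```python
from bisect import bisect_right


def category_mae(pred, true, edges):
    """MAE per traffic category, where categories are defined on actual values.

    edges : sorted category boundaries, e.g. [0, 50, 60, 65, 70, 90] for the
            five PeMS-Bay speed categories. Intervals are ASSUMED left-closed.
    Returns {(low, high): MAE or None if the category has no observations}.
    """
    n_cat = len(edges) - 1
    sums = [0.0] * n_cat
    counts = [0] * n_cat
    for p, y in zip(pred, true):
        k = bisect_right(edges, y) - 1
        k = min(max(k, 0), n_cat - 1)  # clamp values on or beyond the outer edges
        sums[k] += abs(p - y)
        counts[k] += 1
    return {
        (edges[k], edges[k + 1]): (sums[k] / counts[k] if counts[k] else None)
        for k in range(n_cat)
    }


def transition_labels(series, threshold=30.0, window=18):
    """Label each step as 'increase', 'decrease', or None.

    ASSUMPTION: a transition at step t means the value changed by more than
    `threshold` relative to the value `window` steps earlier (18 steps = 90 min
    at 5-min resolution).
    """
    labels = []
    for t, value in enumerate(series):
        if t < window:
            labels.append(None)  # not enough history yet
            continue
        change = value - series[t - window]
        if change > threshold:
            labels.append("increase")
        elif change < -threshold:
            labels.append("decrease")
        else:
            labels.append(None)
    return labels
```

Calling `category_mae` with the PeMS-Bay boundaries gives one MAE per speed category, which is the quantity compared across models in Table 5; the transition labels select the samples over which the Table 6 metrics would be computed.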


FIGURE 8. Performance by traffic flow and speed categories on (a) METR-LA, (b) Urban-core, and (c) Urban-mix. The line
graphs are the MAE of each model in each traffic category, and the red histogram in the background is the ratio of each
category in each dataset. Among the three models, GWNet-GCN achieves the best performance in categories with high
observation percentage, and GMAN-GAT generally achieves the best performance in categories with low observation

C. ROBUSTNESS AGAINST OUTLIERS
Another characteristic observed is robustness against outliers in the labels. As in Figs. 9(a), 9(d), 10(a), and 10(d), RNN and convolution-based temporal feature extractions show delayed reactions to outliers, causing large errors within a few time steps. While GWNet-GCN shows the highest overall accuracy in most datasets and prediction horizons as presented in previous sections, the self-attention model is more robust against outliers than the other models. In addition, the attention-based GAT models also show more robustness than GCN models for spatial feature extraction, as shown in Fig. 11.

The model performance was found to change by traffic state categories. In this section, we evaluate the models adaptively by selecting the model depending on the prediction horizon and traffic category-wise performance on validation sets. For adaptive evaluation, we first make pseudo-labels $\hat{Y}^{(p)}$ for each prediction horizon l by averaging the two candidate models, $G_1$ and $G_2$, as follows:

$$\hat{Y}^{(p)l} = \frac{1}{2} \left( G_1(X)^l + G_2(X)^l \right), \tag{20}$$

where $X_{val}$ is the input data of validation sets, and $G(X_{val})^l$ is the output of the model G for prediction horizon l. The pseudo-labels are necessary to assign each test sample to the traffic category in which it should be evaluated. For each traffic category s, we compare the loss for the two models and make predictions $\hat{Y}_{test}^{l,s}$ as follows:

$$\hat{Y}_{test}^{l,s} = \begin{cases} \alpha * G_1(X_{test}^{s})^l + (1 - \alpha) * G_2(X_{test}^{s})^l, & \text{if } L\big(G_1(X_{val}^{s})^l, Y_{val}^{(p)l,s}\big) < L\big(G_2(X_{val}^{s})^l, Y_{val}^{(p)l,s}\big) \\ (1 - \alpha) * G_1(X_{test}^{s})^l + \alpha * G_2(X_{test}^{s})^l, & \text{if } L\big(G_1(X_{val}^{s})^l, Y_{val}^{(p)l,s}\big) > L\big(G_2(X_{val}^{s})^l, Y_{val}^{(p)l,s}\big) \end{cases} \tag{21}$$


FIGURE 9. 60-min prediction labels (blue) and outcomes of T-GAT (orange), GWNet-GCN (green), and GMAN-GCN (red)
in METR-LA for sample nodes. In (a) and (d), the RNN and convolution models show delayed reactions to outliers.
While the convolution model shows the highest overall accuracy for 60-min prediction in METR-LA, the attention
model shows more robustness against outliers.

FIGURE 10. 60-min prediction labels (blue) and outcomes of T-GAT (orange), GWNet-GCN (green), and GMAN-GAT
(red) in Urban-core for sample nodes. In (a) and (d), the RNN and convolution models show delayed reactions to
outliers. In Urban-core, the attention model achieves more robustness against outliers compared to the other models
along with the highest accuracy for 60-min prediction.

where α is a predefined value between 0.5 and 1, $Y^{(p)}$ is the pseudo-label for prediction horizon l included in category s, and $X^{s}$ is the input corresponding to the pseudo-label $Y^{(p)}$. For the final prediction, we calculate Eq. (21) for all categories and aggregate the category-wise results. In this experiment, α is set to 0.7. The concept of this adaptive model evaluation framework is visualized in Fig. 12. The adaptive evaluation framework achieved higher performance on at least two performance metrics in all datasets and prediction horizons, as shown in Table 7. For 60-min prediction in PeMS-Bay, the performance gain is the largest, outperforming the previous state-of-the-art GMAN and GWNet by 3.7%. When the Diebold-Mariano test is conducted for 60-min forecasts, the differences in forecasts are statistically significant (α = 0.1) for 57.5%, 44.4%, 31.3%, and 56.8% of nodes in PeMS-Bay, METR-LA, Urban-core, and Urban-mix, respectively.
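The adaptive evaluation procedure of Eqs. (20) and (21) can be outlined with a minimal Python sketch for a single prediction horizon. All names are hypothetical, and the sketch rests on three assumptions that the text does not spell out: samples are assigned to categories by their pseudo-labels, the per-category validation loss is the L1 error against the observed validation values, and the model with the lower validation loss receives the larger weight α. Ties and categories without validation samples fall back to equal weights, which is also an assumption.

```python
from bisect import bisect_right


def pseudo_label(p1, p2):
    """Average of the two candidate model outputs, as in Eq. (20)."""
    return 0.5 * (p1 + p2)


def category_index(value, edges):
    """Index of the traffic category containing `value` (left-closed bins)."""
    k = bisect_right(edges, value) - 1
    return min(max(k, 0), len(edges) - 2)


def adaptive_predict(val_pred1, val_pred2, val_true,
                     test_pred1, test_pred2, edges, alpha=0.7):
    """Blend two models per traffic category based on validation loss.

    val_pred1, val_pred2   : outputs of models G1 and G2 on the validation set
    val_true               : observed validation values (ASSUMED loss target)
    test_pred1, test_pred2 : outputs of G1 and G2 on the test set
    edges                  : sorted category boundaries
    alpha                  : weight given to the better model (0.5 < alpha < 1)
    """
    n_cat = len(edges) - 1
    loss1 = [0.0] * n_cat
    loss2 = [0.0] * n_cat
    counts = [0] * n_cat

    # Accumulate the L1 loss of each model per category; the category of a
    # validation sample is determined by its pseudo-label.
    for p1, p2, y in zip(val_pred1, val_pred2, val_true):
        s = category_index(pseudo_label(p1, p2), edges)
        loss1[s] += abs(p1 - y)
        loss2[s] += abs(p2 - y)
        counts[s] += 1

    blended = []
    for p1, p2 in zip(test_pred1, test_pred2):
        s = category_index(pseudo_label(p1, p2), edges)
        if counts[s] == 0 or loss1[s] == loss2[s]:
            w1 = 0.5  # assumption: no evidence either way, weight equally
        elif loss1[s] < loss2[s]:
            w1 = alpha
        else:
            w1 = 1.0 - alpha
        blended.append(w1 * p1 + (1.0 - w1) * p2)
    return blended
```

Because both losses in a category are accumulated over the same samples, comparing the sums is equivalent to comparing the mean losses. Repeating the call for each prediction horizon and concatenating the outputs would give the aggregated category-wise result.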


FIGURE 11. Robustness against outliers by different spatial feature extraction methods. GAT models show more robustness than GCN models for all T-GCN, GWNet, and GMAN models.

FIGURE 12. Adaptive model evaluation framework. It adaptively selects which model makes the prediction for each traffic category based on the performance on validation sets. As a result, the prediction can be made with multiple models, improving the utility of each model.

TABLE 7. Performance of adaptive model evaluation framework.

E. DISCUSSION
An extensive and multi-faceted evaluation of six traffic forecasting models was conducted to characterize and understand the deep learning model building blocks for traffic forecasting. The convolution models showed the highest forecasting power overall among the three temporal feature extraction methods. This supports the current practice in which most deep learning-based traffic forecasting models are built with convolution-based temporal feature extraction.

For temporal building blocks, the self-attention models demonstrate competitive long-term predictions, but the RNN models show no advantages in any task. The results do not imply that convolution and self-attention are superior to RNN in general, only that they have clear advantages over RNN in traffic forecasting. When pairing the spatial and temporal feature extraction methods, improved performances are noticed when convolution is combined with convolutional GNN and self-attention with the attentional GNN, except in METR-LA.


We infer that these paired methods are similar in extracting information from input data.

Further assessments reveal that the models show different performance sensitivity to traffic state changes. The convolution model performed well in high-frequency traffic categories, and the self-attention model showed robust performances even in low-frequency traffic categories and with outliers. In addition, during the traffic transitions, the self-attention and RNN models show advantages in long-term prediction. The attention-based methods in spatial and temporal dimensions demonstrated improved robustness against outliers. Overall, the convolution model achieves more performance gain for the short-term (15-min) prediction and high-frequency traffic categories. In contrast, the self-attention model has more advantages in prediction for less-informed conditions such as longer prediction horizons, low-frequency traffic categories, and outliers.

In addition, we suggest a framework that adaptively selects a model for each category to make predictions based on the validation set performance. The results reveal that the simple implementation of an adaptive evaluation framework could improve the performance of the previous state-of-the-art by 3.7% at most. This framework enhances traffic forecasting performance using the existing models rather than developing

Sophisticated state-of-the-art models could be investigated to discover whether the model characteristics would persist. Explainable artificial intelligence techniques [105], [106] could also be adopted to explore the deep learning-based traffic forecasting model characteristics. These techniques have been rarely used in traffic forecasting studies [107] and could give a new direction if implemented appropriately. Moreover, the adaptive model evaluation framework will be refined to include predictions during transition states and against time-series anomalies.

GENERATIVE AI AND AI-ASSISTED TECHNOLOGIES
During the preparation of this work, the authors used ChatGPT and Grammarly in order to check the grammar. After using these tools, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

The authors declare that there is no conflict of interest regarding the publication of this paper.

REFERENCES
YUYOL SHIN received the B.S. and Ph.D. degrees

in civil and environmental engineering from the
Korea Advanced Institute of Science and Technol-
ogy (KAIST), Daejeon, South Korea, in 2016 and
2022, respectively. He is currently a Postdoctoral
Researcher in civil and environmental engineering
with KAIST. During his postdoctoral appointment,
he has visited UC Berkeley in civil and environ-
mental engineering as a Visiting Scholar, from
October 2022 to June 2023. His research interests
include spatial-temporal data mining, graph neural networks, artificial
intelligence applications, and transportation network analysis.


