Graph WaveNet for Deep Spatial-Temporal Graph Modeling
Zonghan Wu¹, Shirui Pan²∗, Guodong Long¹, Jing Jiang¹, Chengqi Zhang¹
¹Centre for Artificial Intelligence, FEIT, University of Technology Sydney, Australia
²Faculty of Information Technology, Monash University, Australia
[email protected], [email protected],
{guodong.long, jing.jiang, chengqi.zhang}@uts.edu.au
∗Corresponding author.
Abstract
Spatial-temporal graph modeling is an important task to analyze the spatial relations and temporal trends of the components in a system. Existing approaches mostly capture the spatial dependency on a fixed graph structure, assuming that the underlying relation between entities is pre-determined. However, the explicit graph structure (relation) does not necessarily reflect the true dependency, and genuine relations may be missing due to incomplete connections in the data. Furthermore, existing methods are ineffective at capturing temporal trends, as the RNNs or CNNs employed in these methods cannot capture long-range temporal sequences. To overcome these limitations, we propose in this paper a novel graph neural network architecture, Graph WaveNet, for spatial-temporal graph modeling. By developing a novel adaptive dependency matrix and learning it through node embeddings, our model can precisely capture the hidden spatial dependency in the data. With a stacked dilated 1D convolution component whose receptive field grows exponentially as the number of layers increases, Graph WaveNet is able to handle very long sequences. These two components are integrated seamlessly in a unified framework, and the whole framework is learned in an end-to-end manner. Experimental results on two public traffic network datasets, METR-LA and PEMS-BAY, demonstrate the superior performance of our algorithm.

Figure 1: Spatial-temporal graph modeling. In a spatial-temporal graph, each node has dynamic input features. The aim is to model each node's dynamic features given the graph structure.

1 Introduction

Spatial-temporal graph modeling has received increasing attention with the advance of graph neural networks. It aims to model the dynamic node-level inputs by assuming inter-dependency between connected nodes, as demonstrated by Figure 1. Spatial-temporal graph modeling has wide applications in solving complex system problems such as traffic speed forecasting [Li et al., 2018b], taxi demand prediction [Yao et al., 2018], human action recognition [Yan et al., 2018], and driver maneuver anticipation [Jain et al., 2016]. For a concrete example, in traffic speed forecasting, speed sensors on the roads of a city form a graph whose edge weights are determined by the Euclidean distance between two nodes. As traffic congestion on one road can lower the traffic speed on its incoming roads, it is natural to take the underlying graph structure of the traffic system as prior knowledge of the inter-dependency relationships among nodes when modeling the time series of traffic speed on each road.

A basic assumption behind spatial-temporal graph modeling is that a node's future information is conditioned on its historical information as well as its neighbors' historical information. Therefore, how to capture spatial and temporal dependencies simultaneously becomes a primary challenge. Recent studies on spatial-temporal graph modeling mainly follow two directions: they either integrate graph convolution networks (GCN) into recurrent neural networks (RNN) [Seo et al., 2018; Li et al., 2018b] or into convolution neural networks (CNN) [Yu et al., 2018; Yan et al., 2018]. While these approaches have shown the effectiveness of introducing the graph structure of data into a model, they face two major shortcomings.
First, these studies assume that the graph structure of the data reflects the genuine dependency relationships among nodes. However, there are circumstances in which a connection does not entail an inter-dependency relationship between two nodes, and circumstances in which an inter-dependency relationship exists between two nodes but the connection is missing. To give each circumstance an example, consider a recommendation system. In the first case, two users are connected, but they may have distinct preferences over products. In the second case, two users may share a similar preference, but they are not linked together. Zhang et al. [2018] used attention mechanisms to address the first circumstance by adjusting the dependency weight between two connected nodes, but they failed to consider the second circumstance.

Second, current studies for spatial-temporal graph modeling are ineffective at learning temporal dependencies. RNN-based approaches suffer from time-consuming iterative propagation and gradient explosion/vanishing when capturing long-range sequences [Seo et al., 2018; Li et al., 2018b; Zhang et al., 2018]. On the contrary, CNN-based approaches enjoy the advantages of parallel computing, stable gradients, and low memory requirements [Yu et al., 2018; Yan et al., 2018]. However, these works need to stack many layers in order to capture very long sequences, because they adopt standard 1D convolution whose receptive field size grows only linearly with the number of hidden layers.

In this work, we present a CNN-based method named Graph WaveNet, which addresses the two aforementioned shortcomings. We propose a graph convolution layer in which a self-adaptive adjacency matrix can be learned from the data through end-to-end supervised training. In this way, the self-adaptive adjacency matrix preserves hidden spatial dependencies. Motivated by WaveNet [Oord et al., 2016], we adopt stacked dilated causal convolutions to capture temporal dependencies. The receptive field size of stacked dilated causal convolution networks grows exponentially with the number of hidden layers. With the support of stacked dilated causal convolutions, Graph WaveNet is able to handle spatial-temporal graph data with long-range temporal sequences efficiently and effectively. The main contributions of this work are as follows:

• We construct a self-adaptive adjacency matrix which preserves hidden spatial dependencies. Our proposed self-adaptive adjacency matrix is able to uncover unseen graph structures automatically from the data without the guidance of any prior knowledge. Experiments validate that our method improves the results when spatial dependencies are known to exist but are not provided.

• We present an effective and efficient framework to capture spatial-temporal dependencies simultaneously. The core idea is to assemble our proposed graph convolution with dilated causal convolution in such a way that each graph convolution layer tackles the spatial dependencies of nodes' information extracted by dilated causal convolution layers at different granular levels.

• We evaluate our proposed model on traffic datasets and achieve state-of-the-art results with low computation costs. The source code of Graph WaveNet is publicly available at https://github.com/nnzhan/Graph-WaveNet.

2 Related Works

2.1 Graph Convolution Networks

Graph convolution networks are building blocks for learning graph-structured data [Wu et al., 2019]. They are widely applied in domains such as node embedding [Pan et al., 2018], node classification [Kipf and Welling, 2017], graph classification [Ying et al., 2018], link prediction [Zhang and Chen, 2018], and node clustering [Wang et al., 2017]. There are two mainstreams of graph convolution networks: spectral-based approaches and spatial-based approaches. Spectral-based approaches smooth a node's input signals using graph spectral filters [Bruna et al., 2014; Defferrard et al., 2016; Kipf and Welling, 2017]. Spatial-based approaches extract a node's high-level representation by aggregating feature information from its neighborhood [Atwood and Towsley, 2016; Gilmer et al., 2017; Hamilton et al., 2017]. In these approaches, the adjacency matrix is considered as prior knowledge and is fixed throughout training. Monti et al. [2017] learned the weight of a node's neighbor through Gaussian kernels. Velickovic et al. [2017] updated the weight of a node's neighbor via attention mechanisms. Liu et al. [2019] proposed an adaptive path layer to explore the breadth and depth of a node's neighborhood. Although these methods assume that the contribution of each neighbor to the central node is different and needs to be learned, they still rely on a pre-defined graph structure. Li et al. [2018a] adopted distance metrics to adaptively learn a graph's adjacency matrix for graph classification problems. This generated adjacency matrix is conditioned on the nodes' inputs. As the inputs of a spatial-temporal graph are dynamic, their method is unstable for spatial-temporal graph modeling.

2.2 Spatial-temporal Graph Networks

The majority of spatial-temporal graph networks follow two directions, namely RNN-based and CNN-based approaches. One of the early RNN-based methods captured spatial-temporal dependencies by filtering the inputs and hidden states passed to a recurrent unit using graph convolution [Seo et al., 2018]. Later works adopted different strategies, such as diffusion convolution [Li et al., 2018b] and attention mechanisms [Zhang et al., 2018], to improve model performance. Another parallel work used node-level RNNs and edge-level RNNs to handle different aspects of temporal information [Jain et al., 2016]. The main drawbacks of RNN-based approaches are that they become inefficient for long sequences and that their gradients are more likely to explode when combined with graph convolution networks. CNN-based approaches combine a graph convolution with a standard 1D convolution [Yu et al., 2018; Yan et al., 2018]. While being computationally efficient, these approaches have to stack many layers or use global pooling to expand the receptive field of the model.

3 Methodology

In this section, we first give the mathematical definition of the problem we are addressing in this paper. Next, we describe the two building blocks of our framework, the graph convolution layer (GCN) and the temporal convolution layer (TCN), which work together to capture the spatial-temporal dependencies. Finally, we outline the architecture of our framework.
3.1 Problem Definition

A graph is represented by G = (V, E), where V is the set of nodes and E is the set of edges. The adjacency matrix derived from a graph is denoted by A ∈ R^{N×N}. If v_i, v_j ∈ V and (v_i, v_j) ∈ E, then A_ij is one; otherwise it is zero. At each time step t, the graph G has a dynamic feature matrix X^{(t)} ∈ R^{N×D}. In this paper, the feature matrix is used interchangeably with graph signals. Given a graph G and its historical S step graph signals, our problem is to learn a function f which is able to forecast its next T step graph signals. The mapping relation is represented as

$$[X^{(t-S):t}, G] \xrightarrow{f} X^{(t+1):(t+T)}, \qquad (1)$$

where X^{(t-S):t} ∈ R^{N×D×S} and X^{(t+1):(t+T)} ∈ R^{N×D×T}.
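To make the shapes concrete, here is a minimal sketch of the problem setup in PyTorch; the sizes and names are illustrative assumptions, not taken from the paper's released code:

```python
import torch

# Illustrative sizes: N nodes, D features per node, S input steps, T output steps.
N, D, S, T = 207, 2, 12, 12  # e.g., METR-LA has 207 speed sensors

history = torch.randn(N, D, S)   # X^{(t-S):t}, the past S graph signals
A = torch.rand(N, N)             # adjacency matrix derived from the graph G

# A spatial-temporal model f consumes both and emits the next T signals:
# forecast = f(history, A), where forecast.shape == (N, D, T)
```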
3.2 Graph Convolution Layer

Graph convolution is an essential operation for extracting a node's features given its structural information. Kipf and Welling [2017] proposed a first-order approximation of Chebyshev spectral filters [Defferrard et al., 2016]. From a spatial-based perspective, it smooths a node's signal by aggregating and transforming its neighborhood information. The advantages of their method are that it is a compositional layer, its filter is localized in space, and it supports multi-dimensional inputs. Let Ã ∈ R^{N×N} denote the normalized adjacency matrix with self-loops, X ∈ R^{N×D} denote the input signals, Z ∈ R^{N×M} denote the output, and W ∈ R^{D×M} denote the model parameter matrix; then the graph convolution layer of [Kipf and Welling, 2017] is defined as

$$Z = \tilde{A}XW. \qquad (2)$$
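A minimal PyTorch sketch of Equation 2, assuming Ã is pre-normalized with self-loops; the class and variable names are illustrative, not from the paper's released code:

```python
import torch
import torch.nn as nn

class VanillaGCNLayer(nn.Module):
    """Z = A_tilde X W, the graph convolution of [Kipf and Welling, 2017]."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(in_dim, out_dim) * 0.01)

    def forward(self, A_tilde: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
        # A_tilde: (N, N) normalized adjacency with self-loops
        # X:       (N, D) input signals; returns Z: (N, M)
        return A_tilde @ X @ self.W
```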
Li et al. [2018b] proposed a diffusion convolution layer which has proven effective in spatial-temporal modeling. They modeled the diffusion process of graph signals with K finite steps. We generalize their diffusion convolution layer into the form of Equation 2, which results in

$$Z = \sum_{k=0}^{K} P^k X W_k, \qquad (3)$$

where P^k represents the power series of the transition matrix. In the case of an undirected graph, P = A/rowsum(A). In the case of a directed graph, the diffusion process has two directions, forward and backward, where the forward transition matrix is P_f = A/rowsum(A) and the backward transition matrix is P_b = Aᵀ/rowsum(Aᵀ). With the forward and backward transition matrices, the diffusion graph convolution layer is written as

$$Z = \sum_{k=0}^{K} P_f^k X W_{k1} + P_b^k X W_{k2}. \qquad (4)$$
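A sketch of the bidirectional diffusion convolution of Equation 4, with the transition matrices built by row normalization as defined above; the class structure is an illustrative assumption:

```python
import torch
import torch.nn as nn

def transition_matrix(A: torch.Tensor) -> torch.Tensor:
    # P = A / rowsum(A); the clamp guards against empty rows.
    return A / A.sum(dim=1, keepdim=True).clamp(min=1e-8)

class DiffusionConv(nn.Module):
    """Z = sum_{k=0}^{K} P_f^k X W_k1 + P_b^k X W_k2 (Equation 4)."""

    def __init__(self, in_dim: int, out_dim: int, K: int):
        super().__init__()
        self.K = K
        self.W1 = nn.ParameterList(
            [nn.Parameter(torch.randn(in_dim, out_dim) * 0.01) for _ in range(K + 1)])
        self.W2 = nn.ParameterList(
            [nn.Parameter(torch.randn(in_dim, out_dim) * 0.01) for _ in range(K + 1)])

    def forward(self, A: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
        Pf = transition_matrix(A)      # forward diffusion: A / rowsum(A)
        Pb = transition_matrix(A.T)    # backward diffusion: A^T / rowsum(A^T)
        Z = 0
        xf, xb = X, X                  # k = 0 term: P^0 X = X
        for k in range(self.K + 1):
            Z = Z + xf @ self.W1[k] + xb @ self.W2[k]
            xf, xb = Pf @ xf, Pb @ xb  # advance to P^{k+1} X
        return Z
```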
Self-adaptive Adjacency Matrix: In our work, we propose a self-adaptive adjacency matrix Ã_adp. This self-adaptive adjacency matrix does not require any prior knowledge and is learned end-to-end through stochastic gradient descent. In doing so, we let the model discover hidden spatial dependencies by itself. We achieve this by randomly initializing two node embedding dictionaries with learnable parameters E₁, E₂ ∈ R^{N×c}. We propose the self-adaptive adjacency matrix as

$$\tilde{A}_{adp} = \mathrm{SoftMax}(\mathrm{ReLU}(E_1 E_2^{\top})). \qquad (5)$$

We name E₁ the source node embedding and E₂ the target node embedding. By multiplying E₁ and E₂, we derive the spatial dependency weights between the source nodes and the target nodes. We use the ReLU activation function to eliminate weak connections, and the SoftMax function to normalize the self-adaptive adjacency matrix. The normalized self-adaptive adjacency matrix can therefore be considered as the transition matrix of a hidden diffusion process. By combining pre-defined spatial dependencies and self-learned hidden graph dependencies, we propose the following graph convolution layer

$$Z = \sum_{k=0}^{K} P_f^k X W_{k1} + P_b^k X W_{k2} + \tilde{A}_{adp}^k X W_{k3}. \qquad (6)$$

When the graph structure is unavailable, we propose to use the self-adaptive adjacency matrix alone to capture hidden spatial dependencies, i.e.,

$$Z = \sum_{k=0}^{K} \tilde{A}_{adp}^k X W_k. \qquad (7)$$

It is worth noting that our graph convolution falls into the spatial-based approaches. Although we use graph signals interchangeably with the node feature matrix for consistency, our graph convolution in Equation 7 is indeed interpreted as aggregating transformed feature information from different orders of neighborhoods.
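The self-adaptive adjacency matrix of Equation 5 reduces to a few lines; a minimal sketch, where the embedding size c = 10 is an illustrative assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAdaptiveAdjacency(nn.Module):
    """A_adp = SoftMax(ReLU(E1 E2^T)) (Equation 5)."""

    def __init__(self, num_nodes: int, emb_dim: int = 10):
        super().__init__()
        self.E1 = nn.Parameter(torch.randn(num_nodes, emb_dim))  # source embeddings
        self.E2 = nn.Parameter(torch.randn(num_nodes, emb_dim))  # target embeddings

    def forward(self) -> torch.Tensor:
        # ReLU prunes weak connections; the row-wise SoftMax turns the result
        # into the transition matrix of a hidden diffusion process.
        return F.softmax(F.relu(self.E1 @ self.E2.T), dim=1)
```

Equation 6 then amounts to adding a third term Ã_adp^k X W_k3 inside the loop of the DiffusionConv sketch above, and Equation 7 keeps only that term when no graph is given.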
3.3 Temporal Convolution Layer

We adopt the dilated causal convolution [Yu and Koltun, 2016] as our temporal convolution layer (TCN) to capture a node's temporal trends. Dilated causal convolution networks allow an exponentially large receptive field by increasing the layer depth. As opposed to RNN-based approaches, dilated causal convolution networks are able to handle long-range sequences properly in a non-recursive manner, which facilitates parallel computation and alleviates the gradient explosion problem. The dilated causal convolution preserves the temporal causal order by padding zeros to the inputs, so that predictions made at the current time step involve only historical information. As a special case of standard 1D convolution, the dilated causal convolution operation slides over the inputs by skipping values with a certain step, as illustrated by Figure 2. Mathematically, given a 1D sequence input x ∈ R^T and a filter f ∈ R^K, the dilated causal convolution operation of x with f at step t is represented as

$$x \star f(t) = \sum_{s=0}^{K-1} f(s)\, x(t - d \times s), \qquad (8)$$

where d is the dilation factor, which controls the skipping distance. By stacking dilated causal convolution layers with dilation factors in increasing order, the receptive field of the model grows exponentially. This enables dilated causal convolution networks to capture longer sequences with fewer layers, which saves computation resources.

Figure 2: Dilated causal convolution. With a dilation factor k, it picks inputs every k steps and applies a standard 1D convolution to the selected inputs.
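A sketch of Equation 8 as a left-padded 1D convolution; this is a minimal per-sequence version, while the paper's layers run over all nodes and channels jointly:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalConv1d(nn.Module):
    """x * f(t) = sum_s f(s) x(t - d*s): left-padding keeps the output at
    time t dependent only on inputs at times <= t."""

    def __init__(self, channels: int, kernel_size: int = 2, dilation: int = 1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # zeros prepended for causality
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) -> output of the same shape
        return self.conv(F.pad(x, (self.pad, 0)))
```

With kernel size 2 and dilations 1, 2, 4, ..., a stack of L such layers sees 2^L past steps, which is the exponential receptive-field growth described above.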
Gated TCN: Gating mechanisms are critical in recurrent neural networks. They have also been shown to be powerful in controlling information flow through the layers of temporal convolution networks [Dauphin et al., 2017]. A simple gated TCN contains only an output gate. Given the input 𝒳 ∈ R^{N×D×S}, it takes the form

$$h = g(\Theta_1 \star \mathcal{X} + b) \odot \sigma(\Theta_2 \star \mathcal{X} + c), \qquad (9)$$

where Θ₁, Θ₂, b, and c are model parameters, ⊙ is the element-wise product, g(·) is an activation function of the outputs, and σ(·) is the sigmoid function, which determines the ratio of information passed to the next layer.
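A sketch of this gated TCN with tanh as g and a sigmoid gate, reusing the DilatedCausalConv1d sketch above; the composition is an illustrative assumption, not the paper's released code:

```python
import torch
import torch.nn as nn

class GatedTCN(nn.Module):
    """h = tanh(Theta1 * X + b) (element-wise *) sigmoid(Theta2 * X + c)."""

    def __init__(self, channels: int, kernel_size: int = 2, dilation: int = 1):
        super().__init__()
        # DilatedCausalConv1d is the sketch from Section 3.3 above.
        self.filter_conv = DilatedCausalConv1d(channels, kernel_size, dilation)
        self.gate_conv = DilatedCausalConv1d(channels, kernel_size, dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The sigmoid gate decides how much of the filtered signal flows onward.
        return torch.tanh(self.filter_conv(x)) * torch.sigmoid(self.gate_conv(x))
```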
Figure 3: The framework of Graph WaveNet. It consists of K spatial-temporal layers on the left and an output layer on the right. The inputs are first transformed by a linear layer and then passed to the gated temporal convolution module (Gated TCN), followed by the graph convolution layer (GCN). Each spatial-temporal layer has residual connections and is skip-connected to the output layer.

Model                         | 15 min             | 30 min             | 60 min
                              | MAE  RMSE  MAPE    | MAE  RMSE  MAPE    | MAE  RMSE  MAPE
FC-LSTM [Li et al., 2018b]    | 2.05  4.19  4.80%  | 2.20  4.55  5.20%  | 2.37  4.96  5.70%
WaveNet [Oord et al., 2016]   | 1.39  3.01  2.91%  | 1.83  4.21  4.16%  | 2.35  5.43  5.87%
DCRNN [Li et al., 2018b]      | 1.38  2.95  2.90%  | 1.74  3.97  3.90%  | 2.07  4.74  4.90%
GGRU [Zhang et al., 2018]     | -     -     -      | -     -     -      | -     -     -
STGCN [Yu et al., 2018]       | 1.36  2.96  2.90%  | 1.81  4.27  4.17%  | 2.49  5.69  5.79%
Graph WaveNet                 | 1.30  2.74  2.73%  | 1.63  3.70  3.67%  | 1.95  4.52  4.63%

Table 2: Performance comparison of Graph WaveNet and other baseline models. Graph WaveNet achieves the best results on both datasets.
• DCRNN. Diffusion convolution recurrent neural network [Li et al., 2018b], which combines graph convolution networks with recurrent neural networks in an encoder-decoder manner.

[Figure: prediction curves of WaveNet and Graph WaveNet plotted against the real values over time.]
Table 3: Experimental results of different adjacency matrix configurations. The forward-backward-adaptive model achieves the best results on both datasets. The adaptive-only model achieves nearly the same performance as the forward-only model.
                Computation Time
Model           Training (s/epoch)   Inference (s)
DCRNN           249.31               18.73
STGCN           19.10                11.37
Graph WaveNet   53.68                2.27