Graph WaveNet for Deep Spatial-Temporal Graph Modeling
Zonghan Wu¹, Shirui Pan²∗, Guodong Long¹, Jing Jiang¹, Chengqi Zhang¹
¹Centre for Artificial Intelligence, FEIT, University of Technology Sydney, Australia
²Faculty of Information Technology, Monash University, Australia
[email protected], [email protected],
{guodong.long, jing.jiang, chengqi.zhang}@uts.edu.au
∗Corresponding author.
Abstract
Spatial-temporal graph modeling is an important task to analyze the spatial relations and temporal trends of the components in a system. Existing approaches mostly capture the spatial dependency on a fixed graph structure, assuming that the underlying relation between entities is pre-determined. However, the explicit graph structure (relation) does not necessarily reflect the true dependency, and genuine relations may be missing due to incomplete connections in the data. Furthermore, existing methods are ineffective at capturing temporal trends, as the RNNs or CNNs employed in these methods cannot capture long-range temporal sequences. To overcome these limitations, we propose in this paper a novel graph neural network architecture, Graph WaveNet, for spatial-temporal graph modeling. By developing a novel adaptive dependency matrix and learning it through node embeddings, our model can precisely capture the hidden spatial dependency in the data. With a stacked dilated 1D convolution component whose receptive field grows exponentially as the number of layers increases, Graph WaveNet is able to handle very long sequences. These two components are integrated seamlessly in a unified framework, and the whole framework is learned in an end-to-end manner. Experimental results on two public traffic network datasets, METR-LA and PEMS-BAY, demonstrate the superior performance of our algorithm.

Figure 1: Spatial-temporal graph modeling. In a spatial-temporal graph, each node has dynamic input features. The aim is to model each node's dynamic features given the graph structure.

1 Introduction

Spatial-temporal graph modeling has received increasing attention with the advance of graph neural networks. It aims to model the dynamic node-level inputs by assuming inter-dependency between connected nodes, as demonstrated by Figure 1. Spatial-temporal graph modeling has wide applications in solving complex system problems such as traffic speed forecasting [Li et al., 2018b], taxi demand prediction [Yao et al., 2018], human action recognition [Yan et al., 2018], and driver maneuver anticipation [Jain et al., 2016]. For a concrete example, in traffic speed forecasting, speed sensors on the roads of a city form a graph whose edge weights are determined by the Euclidean distance between two nodes. As traffic congestion on one road can lower the traffic speed on its incoming roads, it is natural to take the underlying graph structure of the traffic system as prior knowledge of the inter-dependency relationships among nodes when modeling the time series of traffic speed on each road.

A basic assumption behind spatial-temporal graph modeling is that a node's future information is conditioned on its historical information as well as its neighbors' historical information. Therefore, how to capture spatial and temporal dependencies simultaneously becomes a primary challenge. Recent studies on spatial-temporal graph modeling mainly follow two directions: they either integrate graph convolution networks (GCN) into recurrent neural networks (RNN) [Seo et al., 2018; Li et al., 2018b] or into convolution neural networks (CNN) [Yu et al., 2018; Yan et al., 2018]. While these approaches have shown the effectiveness of introducing the graph structure of data into a model, they face two major shortcomings.
First, these studies assume that the graph structure of the data reflects the genuine dependency relationships among nodes. However, there are circumstances in which a connection does not entail an inter-dependency relationship between two nodes, and circumstances in which an inter-dependency relationship exists between two nodes but the connection is missing. To give each circumstance an example, consider a recommendation system. In the first case, two users are connected, but they may have distinct preferences over products. In the second case, two users may share a similar preference, but they are not linked together. Zhang et al. [2018] used attention mechanisms to address the first circumstance by adjusting the dependency weight between two connected nodes, but they failed to consider the second circumstance.

Second, current studies for spatial-temporal graph modeling are ineffective at learning temporal dependencies. RNN-based approaches suffer from time-consuming iterative propagation and gradient explosion/vanishing when capturing long-range sequences [Seo et al., 2018; Li et al., 2018b; Zhang et al., 2018]. On the contrary, CNN-based approaches enjoy the advantages of parallel computing, stable gradients, and low memory requirements [Yu et al., 2018; Yan et al., 2018]. However, these works need to stack many layers in order to capture very long sequences, because they adopt standard 1D convolution whose receptive field size grows only linearly with the number of hidden layers.

In this work, we present a CNN-based method named Graph WaveNet, which addresses the two aforementioned shortcomings. We propose a graph convolution layer in which a self-adaptive adjacency matrix can be learned from the data through end-to-end supervised training. In this way, the self-adaptive adjacency matrix preserves hidden spatial dependencies. Motivated by WaveNet [Oord et al., 2016], we adopt stacked dilated causal convolutions to capture temporal dependencies. The receptive field size of stacked dilated causal convolution networks grows exponentially with the number of hidden layers. With the support of stacked dilated causal convolutions, Graph WaveNet is able to handle spatial-temporal graph data with long-range temporal sequences efficiently and effectively. The main contributions of this work are as follows:

• We construct a self-adaptive adjacency matrix which preserves hidden spatial dependencies. Our proposed self-adaptive adjacency matrix is able to uncover unseen graph structures automatically from the data without the guidance of any prior knowledge. Experiments validate that our method improves the results when spatial dependencies are known to exist but are not provided.

• We present an effective and efficient framework to capture spatial-temporal dependencies simultaneously. The core idea is to assemble our proposed graph convolution with dilated causal convolution in such a way that each graph convolution layer tackles the spatial dependencies of nodes' information extracted by dilated causal convolution layers at different granular levels.

• We evaluate our proposed model on traffic datasets and achieve state-of-the-art results with low computation costs. The source code of Graph WaveNet is publicly available at https://github.com/nnzhan/Graph-WaveNet.

2 Related Works

2.1 Graph Convolution Networks

Graph convolution networks are building blocks for learning graph-structured data [Wu et al., 2019]. They are widely applied in domains such as node embedding [Pan et al., 2018], node classification [Kipf and Welling, 2017], graph classification [Ying et al., 2018], link prediction [Zhang and Chen, 2018], and node clustering [Wang et al., 2017]. There are two mainstreams of graph convolution networks: spectral-based approaches and spatial-based approaches. Spectral-based approaches smooth a node's input signals using graph spectral filters [Bruna et al., 2014; Defferrard et al., 2016; Kipf and Welling, 2017]. Spatial-based approaches extract a node's high-level representation by aggregating feature information from its neighborhood [Atwood and Towsley, 2016; Gilmer et al., 2017; Hamilton et al., 2017]. In these approaches, the adjacency matrix is considered as prior knowledge and is fixed throughout training. Monti et al. [2017] learned the weight of a node's neighbor through Gaussian kernels. Velickovic et al. [2017] updated the weight of a node's neighbor via attention mechanisms. Liu et al. [2019] proposed an adaptive path layer to explore the breadth and depth of a node's neighborhood. Although these methods assume that the contribution of each neighbor to the central node is different and needs to be learned, they still rely on a pre-defined graph structure. Li et al. [2018a] adopted distance metrics to adaptively learn a graph's adjacency matrix for graph classification problems. This generated adjacency matrix is conditioned on the nodes' inputs. As the inputs of a spatial-temporal graph are dynamic, their method is unstable for spatial-temporal graph modeling.

2.2 Spatial-temporal Graph Networks

The majority of spatial-temporal graph networks follow two directions, namely RNN-based and CNN-based approaches. One of the early RNN-based methods captured spatial-temporal dependencies by filtering the inputs and hidden states passed to a recurrent unit using graph convolution [Seo et al., 2018]. Later works adopted different strategies, such as diffusion convolution [Li et al., 2018b] and attention mechanisms [Zhang et al., 2018], to improve model performance. Another parallel work used node-level RNNs and edge-level RNNs to handle different aspects of temporal information [Jain et al., 2016]. The main drawbacks of RNN-based approaches are that they become inefficient for long sequences and that their gradients are more likely to explode when combined with graph convolution networks. CNN-based approaches combine a graph convolution with a standard 1D convolution [Yu et al., 2018; Yan et al., 2018]. While being computationally efficient, these approaches have to stack many layers or use global pooling to expand the receptive field of the model.

3 Methodology

In this section, we first give the mathematical definition of the problem we are addressing in this paper. Next, we describe the two building blocks of our framework, the graph convolution layer (GCN) and the temporal convolution layer (TCN), which work together to capture the spatial-temporal dependencies. Finally, we outline the architecture of our framework.
3.1 Problem Definition

A graph is represented by G = (V, E), where V is the set of nodes and E is the set of edges. The adjacency matrix derived from a graph is denoted by A ∈ R^{N×N}. If v_i, v_j ∈ V and (v_i, v_j) ∈ E, then A_ij is one; otherwise it is zero. At each time step t, the graph G has a dynamic feature matrix X^{(t)} ∈ R^{N×D}. In this paper, the feature matrix is used interchangeably with graph signals. Given a graph G and its historical S step graph signals, our problem is to learn a function f which is able to forecast its next T step graph signals. The mapping relation is represented as

$$[X^{(t-S):t}, G] \xrightarrow{f} X^{(t+1):(t+T)}, \qquad (1)$$

where X^{(t-S):t} ∈ R^{N×D×S} and X^{(t+1):(t+T)} ∈ R^{N×D×T}.
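To make the shapes concrete, here is a minimal sketch of the problem setup in PyTorch; the sizes and names are illustrative assumptions, not taken from the paper's released code:

```python
import torch

# Illustrative sizes: N nodes, D features per node, S input steps, T output steps.
N, D, S, T = 207, 2, 12, 12  # e.g., METR-LA has 207 speed sensors

history = torch.randn(N, D, S)   # X^{(t-S):t}, the past S graph signals
A = torch.rand(N, N)             # adjacency matrix derived from the graph G

# A spatial-temporal model f consumes both and emits the next T signals:
# forecast = f(history, A), where forecast.shape == (N, D, T)
```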
3.2 Graph Convolution Layer

Graph convolution is an essential operation for extracting a node's features given its structural information. Kipf and Welling [2017] proposed a first-order approximation of Chebyshev spectral filters [Defferrard et al., 2016]. From a spatial-based perspective, it smooths a node's signal by aggregating and transforming its neighborhood information. The advantages of their method are that it is a compositional layer, its filter is localized in space, and it supports multi-dimensional inputs. Let Ã ∈ R^{N×N} denote the normalized adjacency matrix with self-loops, X ∈ R^{N×D} denote the input signals, Z ∈ R^{N×M} denote the output, and W ∈ R^{D×M} denote the model parameter matrix; then the graph convolution layer of [Kipf and Welling, 2017] is defined as

$$Z = \tilde{A}XW. \qquad (2)$$
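A minimal PyTorch sketch of Equation 2, assuming Ã is pre-normalized with self-loops; the class and variable names are illustrative, not from the paper's released code:

```python
import torch
import torch.nn as nn

class VanillaGCNLayer(nn.Module):
    """Z = A_tilde X W, the graph convolution of [Kipf and Welling, 2017]."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(in_dim, out_dim) * 0.01)

    def forward(self, A_tilde: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
        # A_tilde: (N, N) normalized adjacency with self-loops
        # X:       (N, D) input signals; returns Z: (N, M)
        return A_tilde @ X @ self.W
```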
Li et al. [2018b] proposed a diffusion convolution layer which has proven effective in spatial-temporal modeling. They modeled the diffusion process of graph signals with K finite steps. We generalize their diffusion convolution layer into the form of Equation 2, which results in

$$Z = \sum_{k=0}^{K} P^k X W_k, \qquad (3)$$

where P^k represents the power series of the transition matrix. In the case of an undirected graph, P = A/rowsum(A). In the case of a directed graph, the diffusion process has two directions, forward and backward, where the forward transition matrix is P_f = A/rowsum(A) and the backward transition matrix is P_b = Aᵀ/rowsum(Aᵀ). With the forward and backward transition matrices, the diffusion graph convolution layer is written as

$$Z = \sum_{k=0}^{K} P_f^k X W_{k1} + P_b^k X W_{k2}. \qquad (4)$$
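A sketch of the bidirectional diffusion convolution of Equation 4, with the transition matrices built by row normalization as defined above; the class structure is an illustrative assumption:

```python
import torch
import torch.nn as nn

def transition_matrix(A: torch.Tensor) -> torch.Tensor:
    # P = A / rowsum(A); the clamp guards against empty rows.
    return A / A.sum(dim=1, keepdim=True).clamp(min=1e-8)

class DiffusionConv(nn.Module):
    """Z = sum_{k=0}^{K} P_f^k X W_k1 + P_b^k X W_k2 (Equation 4)."""

    def __init__(self, in_dim: int, out_dim: int, K: int):
        super().__init__()
        self.K = K
        self.W1 = nn.ParameterList(
            [nn.Parameter(torch.randn(in_dim, out_dim) * 0.01) for _ in range(K + 1)])
        self.W2 = nn.ParameterList(
            [nn.Parameter(torch.randn(in_dim, out_dim) * 0.01) for _ in range(K + 1)])

    def forward(self, A: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
        Pf = transition_matrix(A)      # forward diffusion: A / rowsum(A)
        Pb = transition_matrix(A.T)    # backward diffusion: A^T / rowsum(A^T)
        Z = 0
        xf, xb = X, X                  # k = 0 term: P^0 X = X
        for k in range(self.K + 1):
            Z = Z + xf @ self.W1[k] + xb @ self.W2[k]
            xf, xb = Pf @ xf, Pb @ xb  # advance to P^{k+1} X
        return Z
```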
Self-adaptive Adjacency Matrix: In our work, we propose a self-adaptive adjacency matrix Ã_adp. This self-adaptive adjacency matrix does not require any prior knowledge and is learned end-to-end through stochastic gradient descent. In doing so, we let the model discover hidden spatial dependencies by itself. We achieve this by randomly initializing two node embedding dictionaries with learnable parameters E₁, E₂ ∈ R^{N×c}. We propose the self-adaptive adjacency matrix as

$$\tilde{A}_{adp} = \mathrm{SoftMax}(\mathrm{ReLU}(E_1 E_2^{\top})). \qquad (5)$$

We name E₁ the source node embedding and E₂ the target node embedding. By multiplying E₁ and E₂, we derive the spatial dependency weights between the source nodes and the target nodes. We use the ReLU activation function to eliminate weak connections, and the SoftMax function to normalize the self-adaptive adjacency matrix. The normalized self-adaptive adjacency matrix can therefore be considered as the transition matrix of a hidden diffusion process. By combining pre-defined spatial dependencies and self-learned hidden graph dependencies, we propose the following graph convolution layer

$$Z = \sum_{k=0}^{K} P_f^k X W_{k1} + P_b^k X W_{k2} + \tilde{A}_{adp}^k X W_{k3}. \qquad (6)$$

When the graph structure is unavailable, we propose to use the self-adaptive adjacency matrix alone to capture hidden spatial dependencies, i.e.,

$$Z = \sum_{k=0}^{K} \tilde{A}_{adp}^k X W_k. \qquad (7)$$

It is worth noting that our graph convolution falls into the spatial-based approaches. Although we use graph signals interchangeably with the node feature matrix for consistency, our graph convolution in Equation 7 is indeed interpreted as aggregating transformed feature information from different orders of neighborhoods.
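The self-adaptive adjacency matrix of Equation 5 reduces to a few lines; a minimal sketch, where the embedding size c = 10 is an illustrative assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAdaptiveAdjacency(nn.Module):
    """A_adp = SoftMax(ReLU(E1 E2^T)) (Equation 5)."""

    def __init__(self, num_nodes: int, emb_dim: int = 10):
        super().__init__()
        self.E1 = nn.Parameter(torch.randn(num_nodes, emb_dim))  # source embeddings
        self.E2 = nn.Parameter(torch.randn(num_nodes, emb_dim))  # target embeddings

    def forward(self) -> torch.Tensor:
        # ReLU prunes weak connections; the row-wise SoftMax turns the result
        # into the transition matrix of a hidden diffusion process.
        return F.softmax(F.relu(self.E1 @ self.E2.T), dim=1)
```

Equation 6 then amounts to adding a third term Ã_adp^k X W_k3 inside the loop of the DiffusionConv sketch above, and Equation 7 keeps only that term when no graph is given.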
3.3 Temporal Convolution Layer

We adopt the dilated causal convolution [Yu and Koltun, 2016] as our temporal convolution layer (TCN) to capture a node's temporal trends. Dilated causal convolution networks allow an exponentially large receptive field by increasing the layer depth. As opposed to RNN-based approaches, dilated causal convolution networks are able to handle long-range sequences properly in a non-recursive manner, which facilitates parallel computation and alleviates the gradient explosion problem. The dilated causal convolution preserves the temporal causal order by padding zeros to the inputs, so that predictions made at the current time step involve only historical information. As a special case of standard 1D convolution, the dilated causal convolution operation slides over the inputs by skipping values with a certain step, as illustrated by Figure 2. Mathematically, given a 1D sequence input x ∈ R^T and a filter f ∈ R^K, the dilated causal convolution operation of x with f at step t is represented as

$$x \star f(t) = \sum_{s=0}^{K-1} f(s)\, x(t - d \times s), \qquad (8)$$

where d is the dilation factor, which controls the skipping distance. By stacking dilated causal convolution layers with dilation factors in increasing order, the receptive field of the model grows exponentially. This enables dilated causal convolution networks to capture longer sequences with fewer layers, which saves computation resources.

Figure 2: Dilated causal convolution. With a dilation factor k, it picks inputs every k steps and applies a standard 1D convolution to the selected inputs.
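A sketch of Equation 8 as a left-padded 1D convolution; this is a minimal per-sequence version, while the paper's layers run over all nodes and channels jointly:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalConv1d(nn.Module):
    """x * f(t) = sum_s f(s) x(t - d*s): left-padding keeps the output at
    time t dependent only on inputs at times <= t."""

    def __init__(self, channels: int, kernel_size: int = 2, dilation: int = 1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # zeros prepended for causality
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) -> output of the same shape
        return self.conv(F.pad(x, (self.pad, 0)))
```

With kernel size 2 and dilations 1, 2, 4, ..., a stack of L such layers sees 2^L past steps, which is the exponential receptive-field growth described above.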
Gated TCN: Gating mechanisms are critical in recurrent neural networks. They have also been shown to be powerful in controlling information flow through the layers of temporal convolution networks [Dauphin et al., 2017]. A simple gated TCN contains only an output gate. Given the input 𝒳 ∈ R^{N×D×S}, it takes the form

$$h = g(\Theta_1 \star \mathcal{X} + b) \odot \sigma(\Theta_2 \star \mathcal{X} + c), \qquad (9)$$

where Θ₁, Θ₂, b, and c are model parameters, ⊙ is the element-wise product, g(·) is an activation function of the outputs, and σ(·) is the sigmoid function, which determines the ratio of information passed to the next layer.
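A sketch of this gated TCN with tanh as g and a sigmoid gate, reusing the DilatedCausalConv1d sketch above; the composition is an illustrative assumption, not the paper's released code:

```python
import torch
import torch.nn as nn

class GatedTCN(nn.Module):
    """h = tanh(Theta1 * X + b) (element-wise *) sigmoid(Theta2 * X + c)."""

    def __init__(self, channels: int, kernel_size: int = 2, dilation: int = 1):
        super().__init__()
        # DilatedCausalConv1d is the sketch from Section 3.3 above.
        self.filter_conv = DilatedCausalConv1d(channels, kernel_size, dilation)
        self.gate_conv = DilatedCausalConv1d(channels, kernel_size, dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The sigmoid gate decides how much of the filtered signal flows onward.
        return torch.tanh(self.filter_conv(x)) * torch.sigmoid(self.gate_conv(x))
```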
Figure 3: The framework of Graph WaveNet. It consists of K spatial-temporal layers on the left and an output layer on the right. The inputs are first transformed by a linear layer and then passed to the gated temporal convolution module (Gated TCN), followed by the graph convolution layer (GCN). Each spatial-temporal layer has residual connections and is skip-connected to the output layer.

Model                         | 15 min             | 30 min             | 60 min
                              | MAE  RMSE  MAPE    | MAE  RMSE  MAPE    | MAE  RMSE  MAPE
FC-LSTM [Li et al., 2018b]    | 2.05  4.19  4.80%  | 2.20  4.55  5.20%  | 2.37  4.96  5.70%
WaveNet [Oord et al., 2016]   | 1.39  3.01  2.91%  | 1.83  4.21  4.16%  | 2.35  5.43  5.87%
DCRNN [Li et al., 2018b]      | 1.38  2.95  2.90%  | 1.74  3.97  3.90%  | 2.07  4.74  4.90%
GGRU [Zhang et al., 2018]     | -     -     -      | -     -     -      | -     -     -
STGCN [Yu et al., 2018]       | 1.36  2.96  2.90%  | 1.81  4.27  4.17%  | 2.49  5.69  5.79%
Graph WaveNet                 | 1.30  2.74  2.73%  | 1.63  3.70  3.67%  | 1.95  4.52  4.63%

Table 2: Performance comparison of Graph WaveNet and other baseline models. Graph WaveNet achieves the best results on both datasets.
• DCRNN. Diffusion convolution recurrent neural network [Li et al., 2018b], which combines graph convolution networks with recurrent neural networks in an encoder-decoder manner.

[Figure: prediction curves of WaveNet and Graph WaveNet plotted against the real values over time.]
Table 3: Experimental results of different adjacency matrix configurations. The forward-backward-adaptive model achieves the best results on both datasets. The adaptive-only model achieves nearly the same performance as the forward-only model.
                Computation Time
Model           Training (s/epoch)   Inference (s)
DCRNN           249.31               18.73
STGCN           19.10                11.37
Graph WaveNet   53.68                2.27